The Mars Rover is one of the most sophisticated systems ever developed in human history. Designed and built by NASA engineers, it involved break-through innovations in mechanical engineering, hardware design, and used software to achieve the audacious goal of having a rover land on a planet that is, at its closest point, about 54.6 million kilometers away from the earth, then powering itself up and exploring Mars by collecting pictures and other data. Unfortunately, this miracle of human innovation had a software glitch due to a complex programming defect know as a race condition.
As our dependence on software grows, so does the sheer size of software and its complexity. Software is used to achieve essential functionality in everything from the latest games on the smart phones to sophisticated control systems in airplanes. Consumers demand more from the devices and services and anticipate a seemingly endless expansion of capabilities from their software. For the last 45 years of computing, Moore's Law has held true, with the number of transistors available in an integrated circuit doubling approximately every two years.
This constant pace of innovation provided software developers with critical hardware resources, enabling them to expand the features and functionalities of their applications without concern for performance. However, while Moore's Law is still holding steady, single-core processing has reached its performance ceiling. In response, hardware manufacturers have developed new multi-core (or multi-processor) devices to accomplish the speed gains to which the world has grown accustomed.
One of the most fascinating things about software development is that every opportunity introduces a unique challenge that developers need to addressin order to fully exploit its capabilities. To fully leverage this new class of multicore hardware, software developers must change the way they create applications. By turning their focus to multi-threaded applications, developers will be able to take full advantage of multicore devices and deliver software that meets the demands of consumers and IT departments. However, this paradigm of multi-threaded software development adds a new wrinkle of complexity by making it possible to introduce concurrency errors such as race conditions and deadlocks that are unique to multi-threaded applications. Complex and hard-to- find, these defects can quickly derail a software project. To avoid catastrophic failures, software development organizations must understand how to identify and eliminate these deadly problems early in the application development lifecycle.
Code testing with modern static analysis provides a solution that not only allows developers to identify and fix concurrency defects in new code, it also provides a cost-effective and automated way to uncover such defects in software that is being ported to run on multi-threaded processors. Let's understand this using a few examples.
Single-threaded versus Multi-threaded: Increased complexity
Consider the following piece of code:
Example 1: Increased complexity when multi-threaded
void main() { int x = input_from_user(); If (x < 100) { printf("x is less than 100!\n"); } else { printf("x is at least 100 or greater!\n"); } }
It's a simple program that takes some input from the user and prints a message based on the value of the user input. When running in a single-threaded context, testing this for all possible inputs and expected outcomes would simply require a developer to make sure to feed the program a value that was less than 100 as well as a value that was greater than or equal to 100 (NOTE: This assumes that nothing interesting happens in the input_from_user function-admittedly a bad assumption in software development!).
Next, let's raise the stakes and imagine that the end-user wanted this program to input two values instead of one with the same logic after the input. This change would increase the complexity of the single-threaded application in such a way that the software developer would need to explore four different paths to fully test the program.
And to make it even more interesting, imagine that a requirement comes in from the field to make this application multi-threaded to take advantage of a dual- core processor. This could be very much the case when porting a legacy application to run on new multi-code hardware. The change would require that the code in Example 1 be executed simultaneously by two different threads to take in the two inputs. Assuming that each statement is made up of three machine instructions (likely an underestimate), the number of possible interleavings of the instructions in the two threads leads to a mind boggling 194,480 different execution possibilities.
It's a strong example that clearly demonstrates the issue of complexity in single-threaded versus multi-threaded software development and the required testing that goes along with it.
The Challenges With Debugging Concurrency Errors
As we noticed in the example discussed previously, a simple code is much more complex when executing in a multi-threaded environment. This makes testing the software a challenge due to various reasons:
- It is difficult to reproduce problems that occur in multi-threaded applications because they are not in control of how their application will be executed when it runs.The operating system and thread scheduler together decide when each thread will be executed as multiple threads execute.
- Tracing to the source of the defect that causes the error, usually unexpected behavior, is close to impossible through manual debugging. Check out this simple example.
Example 2: Race Conditionpublic class GuardedByViolationExample { int count; Object lock; public void increment() { synchronized(lock) { // lock_event count++; // guarded_access event } } public void decrement() { count--; // Defect: unguarded_access } }
In this short Java code example, the variable 'lock' guards 'count' in the GaurdedByViolationExample class. However, for the method decrement(), the access to count is not synchronized. This leads to race condition errors when this code runs in a multi-threaded environment.The value of count will be unexpected when more than one thread "races" to update the value of 'count'. Without knowing the source of the defect, which is the lack of synchronized access to count in decrement(), it would be very difficult to trace the source of the unexpected behavior error as the only observable error would be an unexpected value of 'count'.
The Deadly Deadlocks
Deadlocks (situations where a multi-threaded program cannot make any progress) are a particularly difficult concurrency issue to debug. They arise when the order of lock acquisition used by the program is inconsistent. A simple example is lock inversion, which happens when one thread attempts to acquire a lock (a), then holds on to the lock while attempting to acquire lock (b). If another thread tries to acquire the locks in the opposite order (acquiring b before a), neither thread can make progress, and the program is deadlocked. Let's understand this with a simple example.
Example 3: Deadlock Example
// Thread one enters this function void deadlock_partA(struct directory *a, struct file *b) { spin_lock(a->lock); // Thread one acquires this lock spin_lock(b->lock); // before it can acquire this lock... } // Thread two enters this function voiddeadlock_partB( struct directory *a, struct file *b ) { spin_lock(b->lock); // Thread two acquires this lock spin_lock(a->lock); // Deadlock }
The simple example shows the possibility of a deadlock defect. When two distinct threads call functions deadlock_partA() and deadlock_partB(), an unlucky scheduling might have the first thread acquire lock (a->lock), while the second thread acquires lock (b->lock). In this state, it is impossible for either thread to acquire the lock it needs to continue execution or release the lock it holds. More complicated deadlocks exist that involve the acquisition of multiple locks across multiple threads.
The Static Analysis Solution
Fortunately, by leveraging breakthroughs in technology, static analysis solutions can analyze code prior to run-time enabling developers to identify and eliminate onerous multi-threaded defects including race conditions, thread blocks and deadlocks.
In the example we discussed in Example 2, it is obvious that the access to the ' count' variable needs to be synchronized. With count accessed in a synchronized block, interleaved calls to increment on the same instance object are eliminated. Static analysis helps programmers avoid race conditions by inferring and enforcing 'guarded-by' relationships on shared fields in the program. A field (f) is guarded-by lock (l) if accesses to (f) must be protected by holding lock (l). The set of guarded-by relationships essentially define the locking discipline for each class in the program.When a preponderance of accesses to a field occurs with the same lock held, a guarded- by relationship is inferred between the accessed field and the held lock.
For deadlocks, one way that advanced static analysis technology detects deadlocks is to ensure that all locks in the program are acquired and released in a consistent order. More concretely, the static analysis can construct an acyclic lock graph for a program. Nodes in a lock graph represent lock names, and an edge between nodes (a) and (b) denotes that at some point, lock (a) was held while lock (b) was acquired. If a lock graph contains a cycle, the program may have a lock ordering or lock inversion deadlock. Innovative static analysis is now capable of finding these lock ordering defects inter-procedurally ( through method calls), as these are likely to be missed by conventional code review or other defect detection techniques. In the example discussed in Example 3, a static analysis solution would create the lock graph for method lock1.
The edge from a->lock to b->lock indicates that a->lock should always be acquired before b->lock. Analysis of method deadlock_partB, however, causes the analysis to add an edge from b->lock to a->lock.
This violates the requirement that the lock graph be acyclic. Therefore, a defect would be reported in lock2.
Beyond just pointing out the defect, the use of the technologies to detect concurrency defects allows static analysis to suggest a possible way to correct the discovered defect, such as enclosing the access in a synchronized scope with the suggested lock. Additionally, when the same defect exists in multiple code branches due to replication, advanced static analysis can detect such cases, and indicate to the developer such cases so that the defect can be fixed in all the places where it will manifest into a runtime concurrency error.
Static analysis is just a part of the puzzle
An obvious reason that might discourage developers from relying on automated static analysis to identify concurrency defects is that because of the very nature of runtime concurrency errors, the static analysis results might contain a higher-than-average number of false results. It is important to know that static analysis can be tuned to improve the accuracy of the results and is just a part of the broader range of solutions available to the developers for managing the complexity of writing code and debugging multi-threaded applications.
An example of tuning and the flexibility that is possible is using the technology of statistical checking. In Example 2, the example that discussed an instance of a field that is updated without locks causing potential race conditions, a flexible static analysis can allow defining the threshold that controls when a field needs to be guarded by synchronized access. For example, a defect will be reported only when the static analysis infers that a field is guarded at least 75% of the time in other places in the code.
Another tool in the concurrency defect detection solution set is Dynamic Analysis. Dynamic analysis works differently than static analysis in that it focuses on the paths that are exercised when the application runs. This allows dynamic analysis to report defects that it observers during runtime. Hence dynamic analysis reported defects, though fewer, tend to be more accurate for concurrency errors.
Example 3 is a defect reported by dynamic analysis of a multi- threaded Java application. The analysis observed two threads, 'upper_0' and ' downer_0' accessing the field 'race' while not holding any lock to prevent a race condition. Clearly this would result in an unexpected value of the field ' race'.
With accuracy being a definite advantage, dynamic analysis however has its limitations in that it only detects concurrency errors in the part of the code that is executed during the test and there is processing overhead when testing with dynamic analysis.
Conclusion
As this shift continues, many development organizations will transition to multi-threaded application development on the fly. This creates new liabilities for development teams in terms of application quality and security, because their code is now vulnerable to a host of concurrency defects-a class of defect that can easily slip past manual code review and traditional testing techniques.
Developers moving to multi-threaded applications need advanced new testing capabilities to help them control this new cause of software complexity. Because they are nearly impossible to replicate in dynamic testing environments, static analysis is uniquely suited to play an important role in eliminating concurrency defects early in the software development lifecycle. However, due to their underlying complexity, some concurrency defect types might escape detection or the analysis might report a high number of false positive results. However, by providing ways to customize the analysis and by combining static analysis and dynamic analysis techniques, the accuracy and speed of defect detection can be improved.
Today, sophisticated new technology is advancing the science of static analysis to help developers meet the challenge of creating multi-threaded applications. By arming developers with static analysis capabilities that detect complex concurrency defects early in the development lifecycle, organizations will accelerate their ability to produce and release multi-threaded applications with greater confidence.