Programming To Survive Multicore: Race Conditions

Feb. 28, 2008

Non-deterministic programs give parallel programming much of its bad reputation, often leading to the frustration of programs that don't run the same way given the same data inputs. If the non-determinism shows up from run to run on the same hardware, it most often results from a data race condition.

If you’re varying the hardware—say, the number of cores—you should still suspect a data race condition. But it’s also possible that the logic for decomposing a problem isn’t correct for all configurations. Hopefully this is less likely if you’re using an abstract programming model where you aren’t writing the decomposition yourself, which is one of the reasons why we need to program by relying on abstractions for parallelism such as OpenMP and Intel Threading Building Blocks (TBB).

Data races can happen if read and write operations to the same location by different threads aren't controlled so the order of the operations is the same from run to run. They also can result from write operations in different threads to the same location when the order matters.

There are times when the order doesn't matter, such as when the only write happening is to set a flag to true, so not all data-race conditions cause non-determinism. Such data races have been called benign, since they don't cause non-deterministic programs. They're the exception, not the rule. When larger data structures are involved and multiple writes are needed to make the data consistent or otherwise correct, a data race occurs if one thread starts reading in the middle of another thread's series of writes.
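
To make that last point concrete, here is a minimal sketch (not from the original article, and using plain C++ threads rather than TBB for brevity) in which a reader can observe a half-updated structure; the names Stats, writer, and reader are purely illustrative:

#include <cstdio>
#include <thread>

// Two fields that are only meaningful together: count and sum.
struct Stats { long count; long sum; };
Stats g_stats = { 0, 0 };   // shared with no synchronization: this is the race

void writer() {
    for (long i = 1; i <= 100000; ++i) {
        g_stats.count += 1;   // a reader may see this new count...
        g_stats.sum   += i;   // ...before the matching sum is written
    }
}

void reader() {
    for (int n = 0; n < 10; ++n)
        // May print count/sum pairs that never existed at any single instant,
        // and different pairs on different runs.
        std::printf("count=%ld sum=%ld\n", g_stats.count, g_stats.sum);
}

int main() {
    std::thread a(writer), b(reader);
    a.join();
    b.join();
    return 0;
}

Nothing orders the reader relative to the writer's pair of updates, so the output varies from run to run, which is exactly the kind of non-determinism described above.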

Non-deterministic programs caused by race conditions are nasty to debug, because you cannot count on them failing the same way from run to run, let alone during the debugging. More than once I’ve invoked a debugger on such a broken program only to find that it worked every time while the debugger was running. If a failure only occurs one run in a hundred, it’s hard to feel confident about debugging the program with a standard debugger.

Making the program deterministic fixes race conditions by making sure that updates from one thread are observed (used) by another thread in a properly controlled manner, or similarly that multiple writes are done in a controlled sequence. In other words, use proper synchronization to make sure a read of a variable happens after it is written by another thread. Don’t leave it to chance. This is generally done with a lock.
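
As a rough sketch of that advice (again not from the original article; the names total and add_to_total are made up, and plain C++ threads with std::mutex stand in for whatever locking primitive your threading package provides):

#include <mutex>
#include <thread>

long total = 0;
std::mutex total_mutex;   // protects every read-modify-write of total

void add_to_total(long x) {
    std::lock_guard<std::mutex> guard(total_mutex);  // lock held until end of scope
    total += x;   // only one thread at a time can be in here
}

int main() {
    std::thread a([] { for (int i = 0; i < 100000; ++i) add_to_total(1); });
    std::thread b([] { for (int i = 0; i < 100000; ++i) add_to_total(1); });
    a.join();
    b.join();
    // total is now deterministically 200000 on every run.
    return 0;
}

The lock forces one thread's write to complete before the other thread's read begins, which is all "proper synchronization" means here.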

No programming language that looks familiar to us is going to save us from writing race conditions. Using abstractions to program in, such as OpenMP, threaded libraries, or TBB, reduces the need for locks by encouraging implicit synchronization.

Implicit synchronization is really about coding in a style that’s free of data races in sections of code. The transitions from section to section prevent sections from creating races between themselves. I still have a hard time explaining this without examples, but this programming style becomes intuitive as we learn to “think parallel.” The art of writing code so it doesn’t set up data race conditions is part of learning to write elegant parallel code.
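
One small illustration of that style, sketched here with hypothetical arrays in and out (this example is mine, not the article's): each loop iteration writes only its own element, so no two tasks ever touch the same location and the loop body needs no locks at all.

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

float in[10000], out[10000];

class Scale {
public:
    void operator()(const blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            out[i] = 2.0f * in[i];   // disjoint writes: nothing is shared between iterations
    }
};

// ... after filling in[]:
// parallel_for(blocked_range<int>(0, 10000), Scale(), auto_partitioner());

Race-prone patterns appear only when the iterations start updating something they all share, which is exactly what the next example does.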

WHAT'S WRONG WITH THIS CODE? Consider this code written using TBB:

class Body {
public:
    Body() {}
    void operator()(const blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            sum += d[i];
            sumsq += d[i] * d[i];
        }
    }
};
long d[10000];
fill_data(&d);
long sum = 0, sumsq = 0;
parallel_for(blocked_range<int>(0, 10000), Body(), auto_partitioner());

The Intel Thread Checker would point out the data races here in the increments of both sum and sumsq. That's because each increment is really a read, an add, and then a write of the new value, and one thread's write needs to finish before another thread reads the value to increment it. If multiple threads read, add, and write concurrently, updates are lost. For example, with sum at 0, threads A and B both read 0; A adds 5 and writes 5, then B adds 7 and writes 7, so A's contribution vanishes and the total is 7 instead of 12. The final value will not be the sum of all the values the threads intended to add to the global sum.

A quick fix would be to declare the variables atomic. But simply adding:

tbb::atomic<long> sum, sumsq;

before the code would not be efficient in this case because too much of the loop is atomic operations. (The available parallelism is greatly inhibited.) Locks would have a similarly poor result—worse, actually, since they have more overhead.

Interestingly, the right way to do this is to not be so dependent on a single global sum. Instead, use the implicit synchronization of a "reduce" operation. We can do this without locks and without atomic operations, at least not explicit ones in our code; any locks or synchronization are hidden from us, implicit in TBB's reduce operation. The elegant and most efficient parallel coding uses the parallel_reduce operation:

class Body {
public:
    long m_sum, m_sumsq;
    Body() : m_sum(0), m_sumsq(0) {}
    Body(Body &b, split) : m_sum(0), m_sumsq(0) {}   // splitting constructor required by parallel_reduce
    void operator()(const blocked_range<int> &r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            m_sum += d[i];
            m_sumsq += d[i] * d[i];
        }
    }
    void join(Body &rhs) { m_sum += rhs.m_sum; m_sumsq += rhs.m_sumsq; }
};
long d[10000];
fill_data(&d);
Body body;
parallel_reduce(blocked_range<int>(0, 10000), body, auto_partitioner());

When parallel_reduce returns, body.m_sum and body.m_sumsq hold the totals, accumulated with no explicit locking in our code.

CAN WE ESCAPE DATA RACES SOMEHOW? In practice, the best advice today is to use abstractions such as TBB, OpenMP, or threaded libraries, as they tend to reduce the likelihood of subtle data race conditions. But they don't eliminate the possibility. Data races are still easy to create, depending on the constructs you use.

General programming can lead to data races, but more specialized coding such as a purely data parallel construct cannot. The use of parallel_reduce() in TBB, for instance, steers us away from the sort of coding that would cause a data race. That’s because data races come from sharing data, and anything purely data-parallel does not need to share data. That means you can reduce the amount of code that can cause data races, and maybe we can use that in the future as a programming technique.

Looking down the road, transactional memory gets a lot of interest these days. It’s an interesting research topic with the alluring goal of doing for shared data what transactional databases have done for databases in terms of ease of programming and determinism. Many issues still prevent transactional memory from being a tamed beast, but it is very interesting to study.

LEARNING MORE: DOWNLOADS AVAILABLE For writing production code now, TBB can be downloaded for free from http://threadingbuildingblocks.org. The Intel Thread Checker, which can be evaluated for free at www.intel.com/software/products/eval, doesn’t use compile time instrumentation. Therefore, it produces fewer false positives than static checkers. It also can debug programs that use libraries that come pre-compiled.

However, the Intel Thread Checker only can detect issues in code that’s exercised while running the tool. Also, the time to test is related to the program’s runtime. In practice, the dynamic methods seem easier to use and more useful than static checking, which produces too many false error messages and cannot operate with precompiled code.

Designers also can take advantage of two software transactional memory implementations of note: one in the functional language Haskell (www.haskell.org) and a C++ compiler from Intel at http://whatif.intel.com.

IN CONCLUSION Non-deterministic programs aren’t fun to debug, and they’re too easy to write. But all is not lost. With proper diligence—and the proper tools when the diligence fails to be perfect—we can survive. And, we can always work on more innovation to help us avoid the traps that cause the problems.
