Non-deterministic programs generally have the worst reputation for parallel programming, which often leads to the frustration of having programs that don’t run the same way, given the same data inputs. Assuming the non-determinism occurs when it’s run on the same hardware, this sort of program most often results from data race conditions.
If you’re varying the hardware—say, the number of cores—you should still suspect a data race condition. But it’s also possible that the logic for decomposing a problem isn’t correct for all configurations. Hopefully this is less likely if you’re using an abstract programming model where you aren’t writing the decomposition yourself, which is one of the reasons why we need to program by relying on abstractions for parallelism such as OpenMP and Intel Threading Building Blocks (TBB).
Data races can happen if read and write operations to the same location by different threads aren’t controlled so the order of the operations is the same from run to run. They also can result from write operations in different threads to the same location when the order matters.
There are times when the order doesn’t matter, such as when the only write happening is to set a flag to true, so not all data-race conditions will cause non-determinism. Such data races have been call-benign since they don’t cause non-deterministic programs. They’re the exception, not the rule. When larger data structures are involved and multiple writes are needed to make the data consistent or otherwise correct, data races can occur if one thread starts reading in the middle of a series of writes from another thread.
Non-deterministic programs caused by race conditions are nasty to debug, because you cannot count on them failing the same way from run to run, let alone during the debugging. More than once I’ve invoked a debugger on such a broken program only to find that it worked every time while the debugger was running. If a failure only occurs one run in a hundred, it’s hard to feel confident about debugging the program with a standard debugger.
Making the program deterministic fixes race conditions by making sure that updates from one thread are observed (used) by another thread in a properly controlled manner, or similarly that multiple writes are done in a controlled sequence. In other words, use proper synchronization to make sure a read of a variable happens after it is written by another thread. Don’t leave it to chance. This is generally done with a lock.
No programming language that looks familiar to us is going to save us from writing race conditions. Using abstractions to program in, such as OpenMP, threaded libraries, or TBB, reduces the need for locks by encouraging implicit synchronization.
Implicit synchronization is really about coding in a style that’s free of data races in sections of code. The transitions from section to section prevent sections from creating races between themselves. I still have a hard time explaining this without examples, but this programming style becomes intuitive as we learn to “think parallel.” The art of writing code so it doesn’t set up data race conditions is part of learning to write elegant parallel code.
WHAT’S WRONG WITH THIS CODE?
Consider this code written using TBB:
class Body {
Body() {}
operator() (const blocked_range<int> &r) {
for (int i = r.begin(); i != r.end(); ++i) {
sum += d\[i];
sumsq += d\[i] * d\[i]; } }
}
long d\[10000];
fill_data(&d);
long sum = 0, sumsq = 0;
parallel_for (blocked_range<int>(0, 10000), Body body, auto_partitioner());
The Intel Thread Checker would point out the data races here in the increments of both sum and sumsq. This is because the increment operations require a read, add and then a write of the new value. In this case, we need a write to finish before another thread reads the value to increment it. If multiple threads read, add and then write concurrently, the value will not be the sum of all the values that all the threads intended to add to the global sum.
A quick fix would be to declare the variables atomic. But simply adding:
tbb::stomic<float> sum, sumsq;
before the code would not be efficient in this case because too much of the loop is atomic operations. (The available parallelism is greatly inhibited.) Locks would have a similarly poor result—worse, actually, since they have more overhead.
Continue to page 2