[Technology Report]
The Multicore Era Seeks A Parallel Paradigm
Scalability, simpler debugging, and easier coding are essential to developing a successful parallel-programming approach.
Parallel programming is hard. But debugging it is even harder. Unfortunately, taking advantage of multicore solutions like Intel’s 80-core TeraScale prototype will require some type of parallel-programming technique (Fig. 1).
The first challenge is to find parallelism that can be exploited. The next is using a tool to exploit the parallelism. Another goal is bug-free code. Parallel programming opens the door to a range of more complex bugs, though, and timing becomes even more critical. Finally, there’s the issue of targeting the host platform with these tools.
At this point, generic solutions don’t exist because of the range of multicore hardware. Tools primarily target only one class of hardware or even one vendor’s hardware. Programmers typically push these jobs off to the operating system or runtime. Eventually, though, parallel-programming constructs will make it into mainstream programming languages. Either way, developers will need multicore solutions to take advantage of performance improvements, since single-core scaling is no longer an option in pushing the limits.
LET THE OPERATING SYSTEM DO IT
Pushing the job of managing coarse-grain parallelism onto the operating system is common and easy to do. It works well if there’s a large number of programs, or if those programs already take advantage of multiple cores. This requires no modification of the applications, but it’s of less value if there aren’t enough programs to exploit the hardware.
Server environments typically have program loads that can use the target hardware. Likewise, embedded application designers can latch onto virtual-machine (VM) products like Trango’s Hypervisor, Green Hills Software’s Integrity, VMware’s namesake, and KVM or Xen on Linux to manage multicore solutions. These tools allow for better management and debugging of programs and systems in addition to providing features like load leveling.
VM architectures potentially open up other avenues for programmers. Thin operating systems or programs running alone in a VM may be given access to features previously restricted to the operating system, such as virtual memory management and peripheral access.
Virtual memory management could enable programmers to manage memory and interprocess and intra-application communication more effectively. For multicore utilization, communication is key to good use of the system. The big question is whether programming languages or runtimes will take this approach.
LET THE RUNTIME DO IT
After VMs, runtimes are the most common method for exploiting multicore environments. Platforms like Intel’s Threading Building Blocks (TBB) require developers to explicitly use exposed function calls to utilize the runtime.
This approach forces developers to determine the type and utilization of parallelism in an application and meld it with the runtime. In turn, the runtime will also need to manage parallelism. The functional interface can help narrow the scope for finding parallelism, though it may put the onus on the programmer to use the right function.
Usually, the interface to the runtime is implemented strictly through function or class definitions, though customizing a compiler offers advantages as well. TBB employs a typical interface, much like the following definition for the parallel_do function:
template<typename InputIterator, typename Body> void parallel_do( InputIterator first, InputIterator last, Body body );
In general, parallel processing deals with data or control parallelism. The above definition takes advantage of TBB’s C++ support and C++ templates. Specifically, TBB addresses data parallelism over large data sets, such as matrices or streams of data.
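To make the data-parallel idea concrete, here is a minimal sketch that borrows the shape of the parallel_do signature but uses only standard C++ threads. The names parallel_do_sketch, worker_count, and square_all are hypothetical and exist only for illustration. This is not how TBB implements parallel_do, and unlike TBB's version it assumes random-access iterators and a fixed number of statically sized chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical stand-in with a parallel_do-like shape; NOT TBB's implementation.
// Assumes random-access iterators and splits [first, last) into contiguous
// chunks, each processed by its own std::thread.
template <typename InputIterator, typename Body>
void parallel_do_sketch(InputIterator first, InputIterator last, Body body,
                        unsigned worker_count = 4) {
    assert(worker_count > 0);
    const std::size_t total =
        static_cast<std::size_t>(std::distance(first, last));
    // Round up so every item lands in some chunk.
    const std::size_t chunk = (total + worker_count - 1) / worker_count;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        // Each worker gets its own copy of body and touches a disjoint range.
        workers.emplace_back([=]() mutable {
            for (std::size_t i = begin; i < end; ++i) body(first[i]);
        });
    }
    for (std::thread& worker : workers) worker.join();
}

// Illustrative use: square every element of a data set in place.
inline void square_all(std::vector<int>& values) {
    parallel_do_sketch(values.begin(), values.end(),
                       [](int& value) { value *= value; });
}
```

Because each chunk covers a disjoint range, the workers need no locking here. A production runtime would typically balance the load dynamically rather than rely on fixed chunks, which is part of what a framework like TBB takes off the programmer's hands.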
Microsoft’s Concurrency and Coordination Runtime (CCR) (see “Software Frameworks Tackle Load Distribution” at www.electronicdesign.com, ED Online 18813), which was released with Microsoft’s Robotics Studio (see “MS Robotics Studio,” ED Online 16631), also uses a functional interface but addresses control parallelism. In this case, CCR helps optimize asynchronous communication between threads that may be distributed among multicore platforms or even across networks.
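The control-parallel style can be illustrated with a small message port through which one thread posts work and another consumes it asynchronously. The sketch below uses standard C++ only. The Port class and the sum_via_port helper are hypothetical names invented for this example and do not reflect CCR's actual API, which is a .NET library.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical message port for illustration; NOT the CCR API.
template <typename T>
class Port {
public:
    // Enqueue a message and wake one waiting receiver.
    void post(T message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);
            queue_.push(std::move(message));
        }
        ready_.notify_one();
    }

    // Signal that no further messages will be posted.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Block until a message arrives; return false once closed and drained.
    bool receive(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Illustrative use: the caller posts readings while a consumer thread sums them.
inline int sum_via_port(const std::vector<int>& readings) {
    Port<int> port;
    int total = 0;
    std::thread consumer([&port, &total] {
        int value = 0;
        while (port.receive(value)) total += value;
    });
    for (int reading : readings) port.post(reading);
    port.close();
    consumer.join();  // join makes total safely visible to the caller
    return total;
}
```

The producer never waits for the consumer to finish an item, so the two threads coordinate only through the port. That decoupling is the general idea behind message-based coordination, although a real framework adds scheduling, arbitration among multiple ports, and error handling.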
As with any runtime, programmers must account for a mindset and an underlying architecture. They deal with such constraints all the time, since applications rarely are completely standalone or written solely by a single programmer. Consequently, there’s at least some level of black-box isolation within an application. On the other hand, complex frameworks like TBB or CCR require a good understanding of the underlying architecture.