[Technology Report]
Parallel Programming Is Here To Stay
One size does not fit all, and it never will. Parallel programming looks to level the playing field by leveraging multicore hardware.
Programming applications was easy in the days when one-chip, one-core systems were the norm. Single-chip solutions remain the target of many systems, especially for mobile applications. But these days, that single chip is likely to include more than one processing core, and programming these platforms can be a challenge.
High-end server platforms like Intel’s six-core Xeon 7460 use lots of transistors for very large, complex architectures. Systems with even more cores on a single chip are readily available as well. Chips like the 40-core Intellasys SEAforth 40C18, the 64-core Tilera TilePro64, and the 336-core Ambric AM2045 are just the beginning (see “Are You Migrating To Massive Multicore Yet?”).
Many PCs already include high-count multicore chips in the form of graphics processing units (GPUs). They’re now being made accessible for general computing and formalized with platforms like Nvidia’s 240-core Tesla C1060 (see “SIMT Architecture Delivers Double-Precision TeraFLOPS”).
Multicore solutions are on the rise because it’s becoming harder to scale single-core processors while trying to maintain the heat and power envelope necessary to make systems practical. Multicore is no longer just one way to scale, but rather a necessity for meeting growing performance demands.
Clock speed and core count don’t tell the whole story, though. Core interconnects constitute the real programming challenge. Many multicore chips don’t employ the shared-memory approach found in symmetrical-multiprocessing (SMP) platforms like the Xeon, where multithreaded applications can typically exist without regard to the number of underlying cores.
Non-uniform memory access (NUMA) architectures maintain the SMP approach. However, scaling to large numbers of cores can be difficult. For instance, the TilePro64 manages with 64 cores on-chip (Fig. 1).
Still, the difficulty of scaling shared memory is one reason why other approaches, such as mesh networks, are employed when cores start numbering into the hundreds or thousands. These approaches allow designers to throw lots of hardware at a problem, though they require a different approach to programming.
DISTRIBUTED COMPUTING FRAMEWORKS
OpenMP is a portable, scalable framework for multiplatform, shared-memory parallel programming that targets SMP systems. It supports C/C++ and Fortran and runs on popular platforms such as Linux and Windows. OpenMP is a thread-oriented approach that maps well to existing hardware architectures. Its core elements include thread management, synchronization, and parallel control structures.
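As a rough illustration of an OpenMP parallel control structure, the sketch below marks a loop for parallel execution and uses a reduction clause to handle the synchronization of the shared total. The function name sum_of_squares is hypothetical and chosen only for this example. A compiler without OpenMP enabled simply ignores the pragma and runs the loop serially, so the result should be the same either way.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example function: sums the squares of the input values.
// The pragma asks an OpenMP-enabled compiler to split the loop iterations
// across a team of threads. The reduction clause gives each thread a
// private partial sum and combines the partial sums when the loop ends.
double sum_of_squares(const std::vector<double>& values) {
    double total = 0.0;
    const long n = static_cast<long>(values.size());

    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        const double v = values[static_cast<std::size_t>(i)];
        total += v * v;
    }
    return total;
}
```

With GCC or Clang, OpenMP is typically switched on with the -fopenmp flag. Nothing in the function depends on the number of cores, which is the point of the thread-oriented model.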
The message-passing interface (MPI) standard, maintained at Argonne National Laboratory, can operate on SMP hardware and also span various networks. Several operating systems are based on message-passing communication.
Open MPI is an open-source implementation of the MPI-2 standard. It can operate over a range of communication systems such as TCP/IP and Myrinet, as well as most communication fabrics found on multicore processors. Open MPI can also be combined with OpenMP.
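The message-passing style itself can be sketched without an MPI installation. The example below is not MPI code. It uses standard C++ threads, and the Mailbox class and sum_via_messages function are hypothetical stand-ins for the send and receive calls that a real MPI program would make (MPI_Send and MPI_Recv). The two sides exchange data only through messages and never touch each other’s variables.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical one-way message channel, standing in for an MPI
// send/receive pair. This is an illustration, not the MPI API.
class Mailbox {
public:
    void send(int message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.push(message);
        }
        ready_.notify_one();
    }

    // Blocks until a message is available, then returns it.
    int receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !messages_.empty(); });
        const int message = messages_.front();
        messages_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> messages_;
};

// The "root" side sends a count followed by each value. The "worker" side
// adds up what it receives and replies with the total.
int sum_via_messages(const std::vector<int>& values) {
    Mailbox to_worker;
    Mailbox to_root;

    std::thread worker([&to_worker, &to_root] {
        const int count = to_worker.receive();  // first message: value count
        int total = 0;
        for (int i = 0; i < count; ++i) {
            total += to_worker.receive();
        }
        to_root.send(total);
    });

    to_worker.send(static_cast<int>(values.size()));
    for (int value : values) {
        to_worker.send(value);
    }
    const int result = to_root.receive();
    worker.join();
    return result;
}
```

Because the only coupling is the message stream, the same structure could in principle run over a network link or an on-chip fabric instead of shared memory, which is what lets message passing span both SMP hardware and networks.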
Intel’s Threading Building Blocks (TBB) is another SMP-oriented framework compatible with OpenMP (see “Multiple Threads Make Chunk Change”). TBB is available as an open-source project as well. As its name suggests, TBB is thread-oriented, but it tends to utilize one thread per core. Each worker thread gets its work from a job queue, and the application feeds the job queues.
TBB does not add keywords to the language. It is a C++ template library whose constructs designate blocks of code that can be performed in parallel. The same is true for the data that the parallel code will be working with, which is typically organized as arrays. The data and the processing jobs can be spread across the collection of cores via the worker threads. The queues may fill, but the idea is to keep the cores working instead of idle.
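The worker-thread and job-queue arrangement can be sketched with nothing but the C++ standard library. The example below is not the TBB API, and scale_in_parallel is a hypothetical name. It also assumes a single shared queue that is filled before the workers start, which is a simplification made for brevity. In TBB itself, a construct such as parallel_for over a blocked_range plays this role.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical example: doubles every element of `data` in place.
// The array is cut into chunks, each chunk becomes a job in a shared
// queue, and one worker thread per core pulls jobs until none remain.
void scale_in_parallel(std::vector<int>& data, std::size_t chunk_size) {
    if (chunk_size == 0) chunk_size = 1;

    std::queue<std::function<void()>> jobs;
    std::mutex jobs_mutex;

    // The application feeds the queue: one job per chunk of the array.
    for (std::size_t begin = 0; begin < data.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, data.size());
        jobs.push([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
        });
    }

    // Each worker repeatedly takes the next job, or exits if the queue
    // is empty. Jobs touch disjoint chunks, so only the queue is locked.
    auto worker = [&jobs, &jobs_mutex] {
        for (;;) {
            std::function<void()> job;
            {
                std::lock_guard<std::mutex> lock(jobs_mutex);
                if (jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();  // run the job outside the lock
        }
    };

    // One worker per core; fall back to one if the count is unknown.
    const unsigned worker_count =
        std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < worker_count; ++i) workers.emplace_back(worker);
    for (std::thread& t : workers) t.join();
}
```

Choosing more chunks than cores means a worker that finishes early can pick up another job, which is the sense in which a full queue keeps the cores working instead of idle.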
Built around TBB, Intel’s Parallel Studio includes Parallel Advisor (design), Composer (coding), Inspector (debug), and Amplifier (tuning). Parallel Advisor is a static analysis tool designed to identify sections of code in which TBB support will make a difference. It can also identify conflicts and suggest resolutions of these issues. This tool is especially useful for designers who are new to TBB.
Parallel Composer now brings TBB integration to platforms like Microsoft’s Visual Studio. It handles new lambda function support and is compatible with OpenMP 3.0. Parallel debugging support is also part of the package. Its “parallel Lint” capability helps identify coding errors.
Parallel Inspector is a proactive bug finder designed to augment the typical program debugger. It identifies the root cause of defects such as data races and deadlocks. The tool can also be used to monitor system behavior and integrity. The system is based on Intel’s Thread Checker tool.
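For readers unfamiliar with the term, a data race arises when threads update shared data without synchronization. The sketch below shows the corrected form of a common case, a shared counter, using standard C++ only. The function name count_with_threads is hypothetical, and the example is not tied to any Intel tool.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one shared counter.
// If `counter` were a plain long, the unsynchronized read-modify-write
// would be a data race and increments could be lost. std::atomic makes
// each increment indivisible, so the final count is always exact.
long count_with_threads(int thread_count, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;

    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& th : threads) th.join();
    return counter.load();
}
```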