[Technology Report]
Parallel Programming Is Here To Stay
One size does not fit all, and it never will. Parallel programming looks to level the playing field by leveraging multicore hardware.
William Wong
ED Online ID #20655
February 26, 2009
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
It was easy to program applications in the days when one chip
with one core was common. Single-chip solutions remain the target of
many systems, especially for mobile applications. But these days,
they’re likely to include more than one processing core. Programming
these platforms can be a challenge.
High-end server platforms like Intel’s six-core Xeon 7460 use
lots of transistors for very large, complex architectures. Systems
with even more cores on a single chip are readily available as well.
Chips like the 40-core Intellasys SEAforth 40C18, the 64-core
Tilera TilePro64, and the 336-core Ambric AM2045 are just the
beginning (see “Are You Migrating To Massive Multicore Yet?”).
Many PCs already include high-count multicore chips in the
form of graphic processing units (GPUs). They’re now being
made accessible for general computing and formalized with platforms
like Nvidia’s 240-core Tesla C1060 (see “SIMT Architecture
Delivers Double-Precision TeraFLOPS”).
Multicore solutions are on the rise because it’s becoming harder
to scale single-core processors while trying to maintain the heat
and power envelope necessary to make systems practical. Multicore
is no longer just a scaling option, but rather a necessity for meeting
growing performance demands.
Clock speed and core count don’t tell the whole story, though.
Core interconnects constitute the real programming challenge.
Many multicore chips don’t employ the shared-memory approach
found in symmetrical-multiprocessing (SMP) platforms like the
Xeon, where multithreaded applications can typically exist without
regard to the number of underlying cores.
Non-uniform memory access (NUMA) architectures maintain
the SMP approach. However, scaling to large numbers of cores
can be difficult. For instance, the TilePro64 manages with 64
cores on-chip (Fig. 1).
Still, this is one reason why other approaches, such as mesh
networks, are employed when cores start numbering into the
hundreds or thousands. This allows designers to throw lots of
hardware at a problem, though it requires a different approach to
programming.
DISTRIBUTED COMPUTING FRAMEWORKS
The OpenMP portable, scalable framework supports multiplatform,
shared-memory parallel programming and targets SMP
systems. It also supports C/C++ and Fortran and runs on popular
platforms such as Linux and Windows. OpenMP is a thread-oriented
approach that maps well to existing hardware architectures.
Its core elements include thread management, synchronization,
and parallel control structures.
The message-passing interface (MPI) standard, defined by the MPI
Forum (Argonne National Laboratory maintains the MPICH implementation),
can operate on SMP hardware and also span various networks. Several
operating systems are based on message-passing communication.
OpenMPI is an open-source implementation of the MPI-2 standard.
It can operate over a range of communication systems such
as TCP/IP, Myrinet, and most communication fabrics found on
multicore processors. OpenMPI also can be mixed with OpenMP.
Intel’s Threading Building Blocks (TBB) is another SMP-oriented
framework compatible with OpenMP (see “Multiple Threads
Make Chunk Change”). TBB is available as an
open-source project as well. As its name says, TBB is thread-oriented,
but it tends to utilize one thread per core. Each worker
thread gets its work from a job queue. The application feeds the
job queues.
TBB is a C++ template library rather than a language extension.
Constructs such as parallel_for designate blocks of code that can
be performed in parallel. Its templates likewise describe the data
that the parallel code will be working with, typically arrays. The data and the
processing jobs can be spread across the collection of cores via
the worker threads. The queues may fill, but the idea is to keep the
cores working instead of idle.
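The job-queue arrangement described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not TBB's API: the run_jobs function and its parameters are hypothetical names, and in standard CPython the threads take turns on CPU-bound work rather than running truly in parallel.

```python
import os
import queue
import threading


def run_jobs(jobs, num_workers=None):
    """Run callables from a shared job queue on a fixed set of worker threads."""
    jobs = list(jobs)
    # Assumption: one worker per core, mirroring the description in the text.
    num_workers = num_workers or os.cpu_count() or 1
    job_queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = job_queue.get()
            if item is None:  # sentinel: no more work for this worker
                break
            index, job = item
            value = job()
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # The application feeds the queue; any idle worker picks up the next job.
    for index, job in enumerate(jobs):
        job_queue.put((index, job))
    for _ in threads:
        job_queue.put(None)
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]
```

Calling run_jobs with a list of small callables returns their results in submission order, regardless of which worker ran each job.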
Built around TBB, Intel’s Parallel Studio includes Parallel
Advisor (design), Composer (coding), Inspector (debug), and
Amplifier (tuning). Parallel Advisor is a static analysis tool
designed to identify sections of code in which TBB support will
make a difference. It can also identify conflicts and suggest resolutions
of these issues. This tool is especially useful for designers
who are new to TBB.
Parallel Composer now brings TBB integration to platforms
like Microsoft’s Visual Studio. It handles new lambda function support and is compatible with OpenMP 3.0. Parallel debugging
support is also part of the package. Its “parallel Lint” capability
helps identify coding errors.
Parallel Inspector is a proactive bug finder designed to augment
the typical program debugger. It identifies the root cause of
defects such as data race conditions and deadlock. The tool can
also be used to monitor system behavior and integrity. The system
is based on Intel’s Thread Checker tool.
Parallel Amplifier utilizes Intel’s Thread Profiler and VTune
Performance Analyzers to provide runtime analysis that can help
identify bottlenecks. These tools are designed to simplify the use
of the profiler and VTune by regular programmers.
PUBLISH OR PERISH?
OpenMPI and OpenMP distribute data for processing by using
arrays or communication links, but that isn’t the only mechanism
employed in parallel-programming environments, especially those
that are more dynamic and apt to change over time. Likewise, fixed
buffers, links, and sockets don’t always address environments
where the content is known ahead of time, while the suppliers/publishers
and consumers/subscribers are not. A number of implementations
exist to facilitate this type of environment.
One is the Object Management Group’s (OMG) Data Distribution
Service (DDS). Several commercial versions of DDS are
available, such as Real Time Innovations’ RTI DDS, PrismTech’s
OpenSplice DDS, and Twin Oaks Computing’s CoreDX. Object
Computing’s OpenDDS is an open-source option. Object Computing
provides training and support options.
DDS uses a publish/subscribe model familiar to many programmers,
but it tends to be built on a much larger environment than
a single system (Fig. 2). It has been used for applications ranging
from air traffic control to industrial automation.
DDS provides a loosely coupled parallel computing environment.
Individual publishers and subscribers are programmed in
a conventional sequential programming fashion. Publishers identify
the material they provide to the underlying DDS framework,
which distributes the data as necessary to subscribers that request
such information. In a simplified form, this is how the DDS system
operates.
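The simplified publish/subscribe flow can be illustrated with a toy, single-process broker. This is a sketch of the pattern only; the Broker class and its methods are hypothetical and are not part of the DDS API, which adds discovery, typed topics, and quality-of-service policies on top.

```python
class Broker:
    """Hypothetical in-process stand-in for publish/subscribe middleware."""

    def __init__(self):
        self._subscribers = {}  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic; the subscriber never names a publisher."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, sample):
        """Deliver a sample to every current subscriber of the topic."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(sample)
        return len(callbacks)  # number of subscribers reached


# Usage: the publisher and subscriber only share the topic name.
broker = Broker()
readings = []
broker.subscribe("temperature", readings.append)
broker.publish("temperature", 21.5)
```

Each side is written as ordinary sequential code. Only the broker knows who is listening, which is what lets publishers and subscribers come and go independently.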
Things get a little more complex when examining the details,
though, because options such as quality of service and connection
reliability can affect application design. One thing DDS systems
do better than most parallel-programming environments is handle
transient connections, because they support best-effort delivery.
In many applications, it’s sufficient to retain the latest piece of
information. Still, DDS systems must deal with many of the scaling
and complexity issues of any parallel-programming system.
Microsoft’s Concurrency and Coordination Runtime (CCR)
and Decentralized Software Services (DSS) fit somewhere in
between tightly coupled message passing and loosely coupled
publish/subscribe. CCR provides scheduling and synchronization within a
subsystem. These tools were initially released with the Microsoft
Robotics Studio, but have quickly moved to other .NET environments
unrelated to robotics (see “Software Frameworks Tackle
Load Distribution”).
CCR provides asynchronous and concurrent task management
with an eye to coordination and failure handling. It uses its own
message passing system. Ports and port sets are the endpoints
for messages.
Like OpenMPI, CCR is designed for more tightly integrated
connections. DSS, found on top of CCR, provides a lightweight,
state-oriented service model that uses representational state transfer
(REST), which is also used across a range of Internet communications.
In fact, XML-based communication runs nicely over TCP/IP
links, though this isn’t a requirement. The DSS Protocol (DSSP)
uses the XML-based Simple Object Access Protocol (SOAP).
DSS has some publish/subscribe semantics. As a result, it can
advertise the availability of a service or piece of information. It
also can have any number of controllers utilizing input from real-time
sensors.
TURNING GRAPHICS ON ITS SIDE
These parallel programming platforms target general-purpose
processing architectures. However, the multicore GPUs found in
most high-performance 3D video adapters from companies such
as AMD (which acquired ATI) and Nvidia are also readily available.
The ATI Stream Processing and Nvidia GeForce and Tesla platforms
allow the respective GPUs to find applications beyond just
video rendering. Many of these applications are graphics-related.
However, several others simply use the hundreds of cores in these
GPUs for other computational purposes.
GPU architectures tend to be unique since they were designed
for video rendering of 3D games, but they’re general enough
to handle other chores. For example, Nvidia’s single-instruction
multiple-thread (SIMT) architecture uses thread-processing
arrays (TPAs) of eight cores. Three TPAs are grouped into each
thread-processing cluster (TPC).
Nvidia developed a framework dubbed the Compute Unified
Device Architecture (CUDA) to handle its SIMT-based GPUs
(Fig. 3). CUDA support can be found in the company’s latest
device drivers, so any PC equipped with one of its GPUs is a
potential supercomputer—well, at least a little supercomputer.
CUDA programs are written in C. Other programming languages
like Fortran and C++ are also being added to the list.
CUDA hides much of the underlying complexity of the SIMT
architecture. In fact, it’s been generalized so that it can address
almost any memory-based multicore platform. CUDA now supports
the Khronos Group’s OpenCL. The Khronos Group is a
member-funded consortium that supports open standards such as
OpenCL and OpenGL. OpenGL is a 2D and 3D graphics application
programming interface (API).
Open Computing Language (OpenCL) is a standard for parallel
programming that supports, but is not restricted to, GPUs. It even
supports DSPs and IBM’s Cell processor, which is found in Sony’s
PlayStation 3 (see “CELL Processor Gets Ready To Entertain The
Masses”).
OpenCL can handle a heterogeneous
environment. Therefore, a mix of x86 chips,
GPUs, and DSPs could merrily crunch on
loads of data. It has garnered wide support,
so this scenario is actually feasible. It can
even fit on mobile platforms.
Also, OpenCL has a platform model
with a controlling host and multiple compute
units. The compute units execute kernels,
which are small chunks of code. This
model is seen elsewhere with Nvidia’s
SIMT architecture as well as Intel’s TBB.
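The host-and-kernel split can be mimicked in plain Python to show the shape of the model. This sketch does not use the OpenCL API: host_launch, scale_kernel, and the compute_units parameter are hypothetical names, and a thread pool stands in for the compute units.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_kernel(global_id, src, dst, factor):
    """Kernel: a small function executed once per index in the problem space."""
    dst[global_id] = src[global_id] * factor


def host_launch(kernel, global_size, *args, compute_units=4):
    """Host side: enqueue one kernel instance per index, then wait for all."""
    with ThreadPoolExecutor(max_workers=compute_units) as pool:
        futures = [pool.submit(kernel, gid, *args) for gid in range(global_size)]
        for future in futures:
            future.result()  # re-raise any kernel error on the host


# Usage: the host owns the buffers and decides how many instances to run.
src = [1, 2, 3, 4]
dst = [0] * len(src)
host_launch(scale_kernel, len(src), src, dst, 10)
```

Each kernel instance touches only the element selected by its own index, so the instances need no coordination with one another.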
Further, OpenCL uses a relaxed memory
consistency model. It doesn’t guarantee
consistency of common variables across a
collection of workgroup items, unlike an
SMP system, where a variable has one location
that’s equally accessible by any core.
This is because many of the target platforms
feature distributed memory with a core
often having its own local memory.
OpenCL puts some limitations on the
programming model. For example, pointers
to functions aren’t allowed. Pointer arguments
to a kernel must name an address space, and
a pointer to a pointer may not be an argument. The restrictions
make it possible to transparently map
the application to the wide range of architectures
supported by OpenCL.
PARALLEL ARCHITECTURES
Frameworks like OpenCL are likely
to be adopted to support new hardware
architectures. But initially, vendor-provided
programming tools will often be the
first step. Likewise, some architectures
work best when the developer can exploit
features within the architecture through
the programming tools designed to work
in the architecture.
One such example is Forth programming
support for the 40-core Intellasys SEAforth
40C18. Each core has only 512 words
of RAM and ROM. Each 18-bit word contains
four instructions. Unlike some other
multicore solutions, the SEAforth cores
aren’t designed to run one large program.
Instead, they run very small, cooperative
programs. In fact, three cores can be used
to handle the dynamic RAM interface.
The XMOS XS1-G4 has hardware
scheduling of up to eight tasks per core
with four cores per chip. The hardware
scheduling makes it easy to write drivers
for soft peripherals or handle the hard interfaces
such as 32 XLink channels. These are
used for communication between cores
and chips.
Channel communication is so ingrained
in the system that the XC compiler, an
extended version of C, brings channels
into the base language. Communication is
explicit, but XMOS makes it a basic part of its
approach to parallel programming.
PARALLEL LANGUAGES
Parallel programming on SMP architectures
deals with virtual memory, pointers,
and multithreading facilities that have
been commonly used for decades using
languages like C, C++, and Java. Network
cluster programming using TCP/IP and
sockets has also been prevalent.
These programming techniques can be
used in many-core environments. However,
explicit control and communication can
make programming tasks in these environments
difficult as the number of cores
increases. One area in which many cores
make fast work is array computation.
Programming languages like the Mathworks’
Matlab offer matrix manipulation
support. Many matrix computations map
very well to a range of hardware architectures,
though some architectures handle
some operations better than others. For
example, SMP architectures in which
cores have simultaneous access to all
memory can easily handle random access
operations, versus architectures with just
local memory.
These architectures have a high latency
for accessing information that isn’t local,
making operations like matrix inversion a
challenge. This is one reason why GPUs
and clusters of cores can handle some algorithms
exceptionally well while others will
work very poorly.
Matlab’s array-processing support is
something any runtime can provide. So
while this approach is applicable to any
programming language, it only addresses
some parallel-programming chores. For
other chores, there’s the Parallel Computing
Toolbox.
The Parallel Computing Toolbox adds
features such as parallel for loops, distributed
arrays, and message-passing functions.
Message-passing functions address MPI-style
programming, but the other features
highlight the deficiencies of conventional
programming languages. Adding these types
of parallel computing services illustrates how
programming languages are changing.
In scatter-gather, a
typical parallel-programming
pattern, data
is distributed for processing.
Then the results
are gathered together,
often with additional
processing, to combine
the results. This dataflow
control can be a challenge
for conventional
control flow languages,
but it’s second nature for
National Instruments’ LabView.
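In a conventional control-flow language, the scatter-gather pattern has to be spelled out by hand. The following Python sketch shows one way to do it. The function names are illustrative, and a thread pool is assumed as the set of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, parts):
    """Split data into at most `parts` contiguous chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Per-chunk processing step."""
    return sum(x * x for x in chunk)


def scatter_gather(data, parts=4):
    """Distribute chunks to workers, then combine the partial results."""
    chunks = scatter(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)  # gather step with additional processing
```

The distribution, the per-chunk work, and the final combination are three separate pieces of code here, whereas a dataflow language expresses the same thing as the wiring between nodes.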
The LabView graphical programming
language also is a dataflow language with
which programmers specify how data moves
through the system (here, sequencing is a
secondary issue). That’s not to say that sequential
programming isn’t part of LabView. In fact,
loop and conditional constructs will be part
of any LabView program.
Many designers will be interested in
how LabView works under the hood on
a conventional processor. In the simplest
case, pending operations are placed in a
job queue. A thread reads the queue and
performs the operation, potentially posting
new jobs in the queue.
This scenario is the same used by Intel’s
TBB. As with TBB, there may be multiple
worker threads. The number of worker
threads tends to match the number of cores.
Fewer of them leave cores unused.
More tend to result in idle threads.
Asynchronous I/O doesn’t delay the
working threads. Instead, an entry is added
to the queue when a background operation
is complete.
In theory, job distribution and processing
can be handled by a large number of
cores, potentially using other hardware like
GPUs. National Instruments is researching
these areas now—dataflow semantics
allow LabView to target more than conventional
single-core and SMP platforms.
FPGA application design is naturally
parallel. It also works well with graphical
design tools, so it’s no surprise that LabView
applications target FPGAs. LabView
applications can be split across FPGAs and
computing platforms.
Graphical dataflow languages like LabView
aren’t common, though a few are
available, such as the Mathworks’ Simulink
and Microsoft’s Visual Programming
Language (VPL).
ACTING PARALLEL
The dataflow approach can be seen as a
message-passing model. Implementations
like LabView operate with a fine-grain resolution.
Move to a coarser level of control,
and the actor model emerges. Actors are
objects that receive and send messages.
They tend to be components rather than
complete applications.
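A minimal actor can be sketched in Python as an object with private state, a mailbox, and a thread that handles one message at a time. The Actor and Counter classes below are hypothetical illustrations of the model, not the API of any actor framework.

```python
import queue
import threading


class Actor:
    """Hypothetical minimal actor: private state plus a mailbox."""

    _STOP = object()  # private sentinel used to shut the actor down

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: returns as soon as the message is queued."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then terminate."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Keeps a running total and reports it after every message."""

    def __init__(self, reply_to):
        self.total = 0            # state touched only by the actor's thread
        self._reply_to = reply_to
        super().__init__()        # start the thread after state is set up

    def receive(self, message):
        self.total += message
        self._reply_to.put(self.total)
```

Because only the actor's own thread touches its state, no locks are needed inside receive. All coordination happens through messages.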
Ambric’s Am2045 Massively Parallel
Processing Array 336-core chip is programmed
using Java. Restrictions exist,
primarily on size, because of the memory
resources available within a core. Essentially,
Ambric implements a message-based
actor model. Each core executes an
active object/actor with messages being sent and received using a straightforward
channel interface.
Actors and parallel programming are
old friends. Programming languages like
Erlang have been used to implement robust
distributed applications. Erlang was originally
designed by Ericsson with an eye
toward fault tolerance.
Scala is a newer programming language
that addresses the actor model (see “If Your
Programming Language Doesn’t Work,
Give Scala A Try”). It
originally was designed to run atop a Java
virtual machine. Scala also implements the
functional programming model.
FUNCTIONALLY PARALLEL
Functional programming is a bit more
than just calling functions. It’s a programming
model that avoids state and mutable
data. This turns out to be good for parallel
execution, but is at odds with most conventional
programming languages where variables
are designed to be changed at will.
Languages that incorporate functional
programming aspects, such as Scala, are
considered impure functional languages
because variable values can change. Pure
functional languages can’t modify the contents
of a variable. At this point, functional
programming is often more of an academic
than commercial concern when it comes to
implementation.
One advantage of a pure functional programming
language is its referential transparency.
Calling any function with a set of
parameter values will always generate the
same result. This is true for many of the
functions implemented in conventional
languages where it’s possible to use a functional
programming style. However, that’s
the case only if they don’t retain any state
or access information outside of the function
that can change.
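The distinction is easy to show in a conventional language. In this Python sketch, with illustrative function names, the first function is referentially transparent while the second depends on state outside the function.

```python
def area(width, height):
    """Pure: the result depends only on the arguments."""
    return width * height


_scale = 1  # mutable state that lives outside the function


def scaled_area(width, height):
    """Impure: the result also depends on _scale, which can change."""
    return width * height * _scale


def set_scale(value):
    """Change the outside state that scaled_area reads."""
    global _scale
    _scale = value
```

Any call to area(3, 4) can be replaced by its value, cached, or run on another core. The same call to scaled_area may give a different answer after set_scale runs, so its result can't be safely reused or computed elsewhere without also shipping the state.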
Guaranteeing that a function always
returns the same value for a set of parameters
means the code can be replicated.
This type of distribution will be critical as
the number of cores rises to the thousands
and system-wide shared memory becomes
a special case rather than the norm.
Likewise, unchanging variables means
distribution of data can occur by copying
information without regard to its source.
This is akin to data that’s transmitted via a
message-passing environment.
Unfortunately, programming with a pure
functional programming language isn’t easy.
This is especially true for developers whose
background is in a non-functional programming
language. One of the more notable
pure functional programming languages is
Haskell, which is named for Haskell Curry,
a mathematician and logician.
The Haskell language appeared in the
1990s. Its features include pattern matching,
single assignment semantics, and lazy
evaluation. Lazy evaluation allows a list
to be returned as a result from a function
call but where the contents have not been
generated.
The value of the list entries is computed
when they are evaluated. This leads to the
concept of an infinite list. It’s similar to
generator functions or objects found in
conventional languages such as C++. However,
the next value isn’t returned through
an explicit function call but rather when a
value is evaluated.
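The generator comparison can be made concrete with a short Python sketch (the function names are illustrative). It describes a conceptually infinite sequence whose values are computed only when requested. Unlike Haskell, the consumer here advances the sequence through explicit iteration.

```python
from itertools import count, islice


def squares():
    """Conceptually infinite sequence; each value is computed only on demand."""
    for n in count(1):
        yield n * n


def take(n, iterable):
    """Force just the first n values of a lazy sequence."""
    return list(islice(iterable, n))
```

Calling squares() does no arithmetic at all. Work happens only when take, or any other consumer, pulls values from it.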
Monads are an interesting abstract
data-type concept that Haskell supports to
address I/O, typically an area where side
effects are common in conventional programming
implementations. Monads are
similar to lazy infinite lists, as they generate
information on demand. Monads are
object/method-oriented in implementation,
though, making them easier to use in
many instances.
While functional programming can be
challenging, it can have significant benefits
for parallel programming.
PARALLEL DEBUGGING
Debugging needs to be addressed regardless
of the parallel programming approach.
Existing debuggers are simply the starting
point, because most don’t address many of
the features inherent in parallel programming,
such as messaging, data distribution,
and loading.
Tools such as tracers, profilers, and optimizers
will need to handle lots of data as
well as provide insight into the parallel
nature of the application. Tools created in
academia are moving quickly to the production
side. Real-time monitoring tools
and declarative debuggers are just some
areas where new ideas can come into play.
Parallel programming will play an important
role in taking advantage of the multicore
hardware that’s being delivered.