[Technology Report]
The Multicore Era Seeks A Parallel Paradigm
Scalability, simpler debugging, and easier coding are essential to developing a successful parallel-programming approach.
William Wong
ED Online ID #18159
February 28, 2008
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Parallel programming is hard. But debugging it is even harder.
Unfortunately, taking advantage of multicore solutions like Intel’s
80-core TeraScale prototype will require some type of
parallel-programming technique (Fig. 1).
The first challenge is to find parallelism
that can be exploited. The next is using a
tool to exploit the parallelism. Another goal
is bug-free code. Parallel programming
opens the door to a range of more complex
bugs, though, and time becomes even more
critical. Finally, there’s the issue of targeting
the host platform with these tools.
At this point, generic solutions don’t exist
because of the range of multicore hardware. Tools primarily
target only one class of hardware or even one vendor’s hardware.
Programmers typically push these jobs off to the operating
system or runtime. Eventually, though, parallel-programming
constructs will make it into mainstream programming languages.
Either way, developers will need multicore solutions
to take advantage of performance improvements, since single-core
scaling is no longer an option in pushing the limits.
LET THE OPERATING SYSTEM DO IT
Pushing the job of managing coarse-grain parallelism onto the
operating system is a common approach and easy to do. It works
well if there’s a large number of programs, or if those programs
are taking advantage of multiple cores. This requires no modification
of the applications, but it’s of less value if there aren’t
enough programs to exploit the hardware.
Server environments typically can have program loads
that use the target hardware. Likewise, embedded application
designers can latch onto virtual-machine (VM) products
like Trango’s Hypervisor, Green Hills Software’s Integrity,
VMware’s namesake, and KVM or Xen on Linux to manage
multicore solutions. These tools allow for better management
and debugging of programs and systems in addition to providing
features like load leveling.
VM architectures potentially open up other avenues for
programmers. Thin operating systems or programs running
alone in a VM may be given access to features previously
restricted to the operating system, such as virtual memory
management and peripheral access.
Virtual memory management could enable programmers to manage
memory and interprocess and intra-application communication
more effectively. For multicore utilization, communication is key
to good use of the system. The big question is whether programming
languages or runtimes will take this approach.
LET THE RUNTIME DO IT
After VMs, runtimes are the most common
method for exploiting multicore environments.
Platforms like Intel’s Threading
Building Blocks (TBB) require developers
to explicitly use exposed function calls to utilize the runtime.
This approach forces developers to determine the type and
utilization of parallelism in an application and meld it with the
runtime. In turn, the runtime will also need to manage parallelism.
The functional interface can help narrow the scope for
finding parallelism, though it may put the onus on the programmer
to use the right function.
Usually, the interface to the runtime is implemented strictly
through function or class definitions, though customizing a compiler
offers advantages as well. TBB employs a typical interface,
much like the following definition for the parallel_do function:
template<typename InputIterator, typename Body>
void parallel_do( InputIterator first, InputIterator last,
Body body );
In general, parallel processing deals with data or control parallelism.
The above definition takes advantage of TBB’s C++ support
and C++ templates. Specifically, TBB addresses data parallelism
over large data sets, such as matrices or streams of data.
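To make the shape of such an interface concrete, here is a minimal sketch of a data-parallel "apply this body to every element" call written with only the standard C++ thread library. It is not TBB's implementation: the name sketch_parallel_for_each, the num_threads parameter, and the fixed chunking scheme are assumptions made for illustration, and unlike the parallel_do definition above, it requires random-access iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical helper, NOT part of TBB. Applies body to every element of
// [first, last) by splitting the range into contiguous chunks and running
// each chunk on its own thread. The body must be safe to call concurrently
// on different elements.
template <typename RandomIt, typename Body>
void sketch_parallel_for_each(RandomIt first, RandomIt last, Body body,
                              unsigned num_threads = 4) {
    assert(num_threads > 0);
    using Diff = typename std::iterator_traits<RandomIt>::difference_type;
    const Diff total = std::distance(first, last);
    const Diff n = static_cast<Diff>(num_threads);
    const Diff chunk = (total + n - 1) / n;  // ceiling division

    std::vector<std::thread> workers;
    for (Diff begin = 0; begin < total; begin += chunk) {
        const Diff end = std::min(begin + chunk, total);
        // Each worker gets its own copy of the body and its own slice.
        workers.emplace_back([=]() {
            std::for_each(first + begin, first + end, body);
        });
    }
    for (auto& w : workers) w.join();  // wait for every slice to finish
}
```

A call such as sketch_parallel_for_each(v.begin(), v.end(), [](int& x) { x *= 2; }) doubles every element of a std::vector<int> named v. A production runtime would typically add work stealing and dynamic load balancing, which this sketch deliberately omits.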
Microsoft’s Concurrency and Coordination Runtime
(CCR) (see “Software Frameworks Tackle Load Distribution”
at www.electronicdesign.com, ED Online 18813), which was
released with Microsoft’s Robotics Studio (see “MS Robotics
Studio,” ED Online 16631), also uses a functional interface and
addresses control parallelism. In this case, CCR helps optimize asynchronous communication between threads that may be distributed
among multicore platforms or even across networks.
As with any runtime, programmers must account for a mindset
and an underlying architecture. They work with it all the
time, since applications rarely are completely standalone or written
solely by a single programmer. Consequently, there’s at least
some level of black-box isolation within an application. On the
other hand, complex frameworks like TBB or CCR require a good
understanding of the underlying architecture.
Putting an additional level between the programmer and the
base system sometimes can help, too. This is the case with Microsoft’s
PLINQ (Parallel Language Integrated Query) technology,
which is an extension of LINQ. PLINQ and LINQ are designed
to simplify access to data sources such as SQL servers.
What sets PLINQ and LINQ apart from SQL or other
interfaces like XPath and XQuery is that they provide a
data-source-agnostic, type-safe query language that’s embedded in a number
of Microsoft’s .NET-based languages (such as C#). Since database
use is ubiquitous, parallelizing queries can significantly boost
application performance.
Again, finding parallelism is a cooperative process with programmers
needing to know what functions to utilize. The advantage
for programmers is that they only need to learn a single query
language regardless of the data source. PLINQ was designed to
maintain the programming model provided by LINQ while offering
additional parallel functionality.
Integrating LINQ/PLINQ functionality within the compiler
has advantages in the sense that syntactic changes are easier. It
wreaks havoc on portability, though, limiting the solution to
Microsoft platforms. New approaches like this also mean fighting
conventions like SQL with new syntactic ordering such as:
var q = from x in Y where p(x) orderby x.f1 select x.f2;
As with most programming syntax, one person’s sugar is another’s
salt. Still, being able to completely embed the solution with a
programming language can simplify a programmer’s job of learning
a system, and parallel constructs won’t be utilized if they’re
hard to use or remember.
Of course, playing with syntax and semantics does allow compiler
and systems designers to add features that would otherwise be
hard to incorporate by staying strictly within the bounds of a current
programming language definition. For example, PLINQ adds
the idea of lazy evaluation in the form of infinite streams.
Using a stream within a query lets the system access only those
items needed to complete the current transaction. A simple example
would be a stream query that has results being returned one at a time.
If the stream has already supplied the data when a result is requested,
then the application continues. Otherwise, it waits and the calculation
of the next stream element occurs.
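The on-demand behavior described above can be sketched in a few lines of C++. This is an illustration of lazy evaluation only, not PLINQ or LINQ code: LazyStream, take_where, and steps_run are hypothetical names, and the sketch is single-threaded, so the "wait" is simply the time spent computing the next element.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical type for illustration. Models the conceptually infinite
// sequence seed, step(seed), step(step(seed)), ... and computes an element
// only when next() asks for it.
class LazyStream {
public:
    LazyStream(long seed, std::function<long(long)> step)
        : current_(seed), step_(std::move(step)) {
        assert(static_cast<bool>(step_));
    }

    // Returns the next element, running step at most once per call.
    long next() {
        if (started_) {
            current_ = step_(current_);
            ++steps_run_;
        }
        started_ = true;
        return current_;
    }

    // Number of times step has actually been evaluated so far.
    int steps_run() const { return steps_run_; }

private:
    long current_;
    std::function<long(long)> step_;
    bool started_ = false;
    int steps_run_ = 0;
};

// A tiny "query": pull elements until n of them satisfy pred.
template <typename Pred>
std::vector<long> take_where(LazyStream& stream, std::size_t n, Pred pred) {
    std::vector<long> out;
    while (out.size() < n) {
        const long value = stream.next();
        if (pred(value)) out.push_back(value);
    }
    return out;
}
```

Asking a stream of the natural numbers (seed 1, step x + 1) for its first three even values touches only the elements 1 through 6. The rest of the infinite sequence is never computed.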
PLINQ provides a range of parallel-processing enhancements,
such as the ability to run multiple threads on a partitioned data space
as well as pipelining requests. Of course, each enhancement has its
own issues, such as whether physical or temporal locality of data is
critical to the application or the operation being performed.
Likewise, partitioning queries can have a major impact on the
resulting performance and efficiency (Fig. 2). As the number of
cores, threads, and communication methods increases, so does the
number of options. And regardless of whether you’re using TBB,
CCR, or something else, it’s difficult to get the costs right.
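As a rough illustration of why partitioning is a tunable cost rather than a free win, the following standard C++ sketch sums a data set by splitting it into a caller-chosen number of slices. The function name partitioned_sum and its partitions parameter are assumptions for this example and do not correspond to any PLINQ, TBB, or CCR call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data by cutting it into at most `partitions`
// contiguous slices, summing each slice in its own asynchronous task, and
// then combining the partial results on the calling thread.
long long partitioned_sum(const std::vector<int>& data, std::size_t partitions) {
    assert(partitions > 0);
    const std::size_t chunk = (data.size() + partitions - 1) / partitions;

    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            const auto lo = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto hi = data.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(lo, hi, 0LL);
        }));
    }

    long long total = 0;
    for (auto& part : parts) total += part.get();  // gather partial sums
    return total;
}
```

The result is the same for any partition count, but the cost is not. Too few partitions leave cores idle, while too many spend more time creating and joining tasks than adding numbers, and the best value depends on the hardware and the data.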
The number of cores in a system may be large, but runaway
computation can waste such a resource. This may not even be
apparent from a user’s perspective, since a result may be delivered
in a timely fashion. But developers will need more insight, including
more time-oriented diagnostics.
LET THE LANGUAGE DO IT
Mainstream languages like Basic, C, C++, C#, and Java include
multithreading support. However, all thread and data management
is explicit. They form the basis for the parallel runtimes, but
runtime designers often perform some interesting feats that most
programmers would rather forget or not even want to learn about.
Research projects like Unified Parallel C add to the syntax and
semantics of an existing language. Still, programmers are loath to incorporate
new changes unless they can see widespread adoption or
a particular platform they must use supports the tools.
Another issue is the existing infrastructure and semantics for
most of the mainstream languages. For example, shared memory is the norm. Yet it’s a concept that doesn’t
scale well, while pointers and references are
central to languages like C or Java.
Several different approaches, such as
using futures for lazy function evaluation,
are similar to the PLINQ infinite stream
example noted earlier. This approach is
commonly used in functional programming
languages like Miranda and Haskell,
though these examples definitely aren’t
mainstream.
Likewise, Scheme, a dialect of Lisp,
employs the functions delay and force to
implement the idea of futures. A function’s
computation can be delayed until a result is
forced, though it may only be the part of the
result that’s being examined. If the result is a
list and the value of the first item is forced,
then only that item needs to be computed.
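For readers more at home in C++ than Scheme, the same delay-then-force idea can be approximated with a deferred std::async task. The helper names delay and force below simply mirror the Scheme functions and are assumptions for this sketch. It handles a single value rather than a partially forced list.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Hypothetical helper named after Scheme's delay: wraps a computation
// without running it. std::launch::deferred guarantees the callable is not
// invoked until the result is first requested.
template <typename F>
auto delay(F f) -> std::shared_future<decltype(f())> {
    return std::async(std::launch::deferred, std::move(f)).share();
}

// Hypothetical helper named after Scheme's force: runs the delayed
// computation on the first call and returns the stored result afterward.
template <typename T>
T force(const std::shared_future<T>& delayed) {
    return delayed.get();
}
```

With int runs = 0; and auto p = delay([&runs] { ++runs; return 6 * 7; });, the lambda has not executed yet. The first force(p) runs it, and any later force(p) returns the cached value without running it again.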
This approach, as well as other parallel-programming
methods such as scatter-gather,
are used in a range of applications
already, from database servers to
disk queuing to memory caching, with
the well-accepted look-ahead methods.
These features need to be incorporated
into programming languages, but deciding
how and when is a difficult task. Various
features do find their way into the mainstream
eventually. For example, lambda
expressions are cropping up in C# and Java
(see “Lambda: Reclaiming An Old Concept,” ED Online 18099).
While researchers may be scurrying
to move parallel language enhancements
into the mainstream, some platforms are
already there. National Instruments’ LabVIEW
has been supporting parallel dataflow
semantics since its inception, as well
as time-based programming aspects that
blend well because of LabVIEW’s graphical
nature.
LabVIEW isn’t the only graphical programming
language that supports dataflow
semantics, but it’s one of the more
mature products. It brings parallel processing
semantics down to the graphical
statement level. In fact, LabVIEW tends
to push parallel processing to the other
extreme, where hundreds of expressions
may be pending evaluation.
Prioritizing computation tends to be
more difficult compared to sequential text-based
programming languages, but that’s
the tradeoff. Every language has its own
advantages and disadvantages, and none of
them—not even LabVIEW—answers all
problems equally well.
One aspect handled well by National
Instruments with its LabVIEW implementation
is splitting a model/program across
platforms. This is critical for parallel programming
because many architectures are
hybrids with multiple instances of multiple-core
platforms. Multiple-core platforms are
normally linked by shared memory while
instances are normally linked using other
techniques. Many other approaches tend to
fall down in this area because they address
only a single architecture.