[Engineering Essentials]
Software Frameworks Tackle Load Distribution
Multiple-core designs can go in several directions when it comes to distributing the load
William Wong
ED Online ID #18113
January 31, 2008
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Networking and multicore chips inevitably create greater
complexity, which requires solutions replete with sophisticated
communication and thread management. Distributing
the load is the desired result, allowing more hardware to
be incorporated into a system in a coordinated fashion. The
development task isn’t easy, but the use of one or more software
frameworks can alleviate some of the stresses.
“Software frameworks,” a wonderfully vague term, covers a
lot of ground. It’s been applied to language-based platforms
like Java and Microsoft’s .NET as well as graphical interfaces,
Web server platforms, and a host of other software areas.
Frameworks for distributed communications and process
coordination include Intel’s Threading Building Blocks (see
“Threads Make The Move To Open Source” at www.electronicdesign.com,
ED Online 16538). This article looks at three others: the Data
Distribution Service (DDS) for Real-time Systems, the Message
Passing Interface (MPI), and the Concurrency and Coordination
Runtime (CCR) that’s part of the Microsoft Robotics Studio (see
“MS Robotics Studio” at ED Online 16631).
In theory, all three could be used within a system since they
address different issues, though developers typically try to
minimize the number of frameworks within a design simply to
reduce complexity. Still, the use of frameworks can significantly
cut down the amount of new code required for a job. It also
can address features that otherwise might not be considered,
such as multilevel security and redundancy.
Likewise, these data-centric frameworks often provide scalability
that’s not available with other approaches. In fact, all
three are designed to scale to very large environments with
thousands of nodes. While these frameworks are commonly
deployed in large networks, they’re also of interest in more
compact solutions with ever-increasing numbers of cores on a
chip or board.
DISTRIBUTING DATA
The DDS for Real-time Systems Specification comes from
the Object Management Group (OMG), which brought such
favorites as UML (Unified Modeling Language) and the
Common Object Request Broker Architecture (CORBA).
CORBA is another data-distribution system, but it employs
direct connections and remote procedure calls (RPCs), whereas
DDS uses a publish/subscribe architecture (Fig. 1). Companies
such as Real-Time Innovations and PrismTech provide DDS
implementations. Other publish-subscribe models include the
Java Message Service (JMS).
Unlike direct-connect systems such as CORBA and sockets,
DDS may deliver published information nowhere. The concept of
lifespan means the validity of information may expire before delivery.
The DDS system handles delivery, which is based on the type,
called topic, of data that’s generated. The ability to specify
different information and quality of service (QoS) by subscribers
and publishers allows designers to create large and robust
systems. The system uses a platform-independent model that
makes DDS easier to deploy in heterogeneous environments.
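
The topic and lifespan ideas above can be sketched in a few lines of Python. This is a toy model, not the DDS API: the ToyBus class, its method names, and the injectable clock are all assumptions made for illustration. A sample published on a topic with no subscriber goes nowhere, and a sample whose lifespan has run out is dropped instead of delivered.

```python
import time
from collections import defaultdict


class ToyBus:
    """Toy topic-based publish/subscribe bus (illustrative; not the DDS API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                      # injectable so tests can control time
        self._subscribers = defaultdict(list)    # topic name -> list of callbacks
        self._pending = []                       # (topic, sample, expires_at)

    def subscribe(self, topic, callback):
        """Register interest in a topic; the publisher never sees this call."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, sample, lifespan):
        """Queue a sample that stays valid for `lifespan` seconds."""
        self._pending.append((topic, sample, self._clock() + lifespan))

    def deliver(self):
        """Deliver queued samples, dropping expired ones. Returns the drop count."""
        now = self._clock()
        dropped = 0
        for topic, sample, expires_at in self._pending:
            if now >= expires_at:
                dropped += 1                     # validity expired before delivery
                continue
            for callback in self._subscribers[topic]:
                callback(sample)                 # zero subscribers: goes nowhere
        self._pending.clear()
        return dropped
```

Publishers and subscribers in this sketch share only a topic name, which is the decoupling that separates publish/subscribe from a direct connection.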
At the heart of DDS lies the Data Centric Publish-Subscribe
(DCPS) layer. This interface enables applications to
describe and publish data that’s automatically distributed to
subscribers, which specify the type of data they’re looking for.
Policies can be used to control parameters such as resource limits
and QoS. Subscribers can also track liveliness that indicates
whether the publisher is still alive if it hasn’t generated any data
in some time.
Data can be filtered by content or by time, which is a critical
divergence from the normal explicit multithreading constructs.
It provides a significant performance optimization, since tests
can be performed at the source with only the requested data
being sent to a subscriber. Deadlines, latency, and other timing
aspects are well defined with DDS, not implicit within application
code.
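
To illustrate filtering at the source, the following Python sketch applies a content test and a minimum time separation on the writer side, so that only requested samples are ever "sent." The class names, the predicate argument, and the min_separation parameter are hypothetical and chosen for this example; they are not taken from a DDS implementation.

```python
class FilteredReader:
    """What one subscriber asks for (toy model; not the DDS API)."""

    def __init__(self, predicate, min_separation):
        self.predicate = predicate            # content filter: sample -> bool
        self.min_separation = min_separation  # time filter: seconds between samples
        self.last_sent = None                 # timestamp of the last sample sent
        self.received = []


class FilteringWriter:
    """Publisher side that evaluates every reader's filters before sending."""

    def __init__(self):
        self.readers = []
        self.sent = 0                         # number of samples actually transmitted

    def write(self, timestamp, sample):
        for reader in self.readers:
            if not reader.predicate(sample):
                continue                      # rejected by content, never sent
            too_soon = (
                reader.last_sent is not None
                and timestamp - reader.last_sent < reader.min_separation
            )
            if too_soon:
                continue                      # rejected by time, never sent
            reader.last_sent = timestamp
            reader.received.append(sample)
            self.sent += 1
```

The point of the design is the `sent` counter: rejected samples cost a comparison at the source rather than a message on the wire.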
Like most communication systems, DDS defines support
for dataflow routing, discovery, and data typing. DDS also
provides data filtering, transformation, and connectivity monitoring
services plus the ability to specify redundancy and replication,
and delivery effort. It can also handle resource management
and status notifications.
The Data Local Reconstruction Layer (DLRL) is an
optional portion of the DDS specification. It provides seamless
integration with native languages such as C and C++. Typical
DDS runtimes are decentralized in their implementation.
This allows for a more robust solution and one that’s at home
with transient connections. Overall, DDS takes a relatively
simple publish-subscribe model and wraps it in a rather
extensive set of functions that can service a wide
range of applications.
MANY MESSAGES
MPI is typically used in high-performance
computing (HPC), though the architecture
applies to a range of applications with
a large number of processors. It provides
a number of communication mechanisms
that work on a range of platforms, from
shared-memory systems to networked
computers (Fig. 2). Designed for use in
applications where processes cooperate
with each other, it can handle groups or
collections of processes as well as hierarchical
process groups.
MPI implementations support the standard
MPI application programming interface
(API) in addition to the runtime support
necessary for both message passing
and distributed thread control. The system
supports blocking and non-blocking messaging.
It also provides fine-grain control
of remote threads and processes. MPI generally
is built atop sockets-based standard
protocols such as TCP/IP.
Processes are grouped into objects called
communicators. Communication can occur within the communicator as well
as between communicators. Each process
within a communicator has a rank. A communicator
can have one or more contexts
that partition the communication space. It
also controls the scope of communication.
MPI addresses a range of communication
methodologies, from basic point-to-point
messaging to parallel operations
such as scatter-gather distribution as well
as broadcasting and summation-style
operations. These operations include sum,
max, and min in addition to user-defined
functions.
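
The scatter and reduction pattern can be modeled without an MPI runtime. The Python sketch below is a single-process simulation, not MPI code: the helper names scatter_chunks and parallel_reduce are assumptions, and each "rank" is simply a list slot. It shows data split across ranks, a partial result computed per rank, and the partials combined with a built-in or user-defined operation.

```python
from functools import reduce


def scatter_chunks(data, size):
    """Split `data` into `size` contiguous chunks, one per simulated rank."""
    base, extra = divmod(len(data), size)
    chunks, start = [], 0
    for rank in range(size):
        stop = start + base + (1 if rank < extra else 0)  # spread the remainder
        chunks.append(data[start:stop])
        start = stop
    return chunks


def parallel_reduce(data, size, op):
    """Scatter, reduce locally on each rank, then combine the partial results.

    `op` must be associative for the answer to match a serial reduction.
    """
    chunks = scatter_chunks(data, size)
    partials = [reduce(op, chunk) for chunk in chunks if chunk]  # per-rank work
    return reduce(op, partials)                                  # combine step


def largest_magnitude(a, b):
    """Example of a user-defined reduction operation."""
    return a if abs(a) >= abs(b) else b
```

In a real MPI program the per-rank step runs in separate processes and the combine step is a collective call; the sketch only shows how the work divides.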
The system can handle almost any logical
or physical communication topology.
Common topologies include rings and
matrices. MPI provides a way of defining
and controlling messaging within these
communication architectures. It also supports
matrix data mapping with respect to
messages. This support for matrix topologies
and data isn’t surprising, since MPI is
often employed in scientific applications
that do lots of matrix manipulation.
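
As a rough illustration of ring and matrix topologies, the Python sketch below computes which ranks are neighbors in each layout. It assumes row-major rank numbering and wraparound edges; the function names are invented for this example and are not MPI calls.

```python
def ring_neighbors(rank, size):
    """Return (previous, next) ranks on a ring of `size` processes."""
    return ((rank - 1) % size, (rank + 1) % size)


def grid_coords(rank, cols):
    """Map a rank to (row, col) in a row-major grid with `cols` columns."""
    return divmod(rank, cols)


def grid_rank(row, col, rows, cols):
    """Map (row, col) back to a rank, wrapping around at the edges."""
    return (row % rows) * cols + (col % cols)


def grid_neighbors(rank, rows, cols):
    """Return the four neighboring ranks on a wraparound rows x cols grid."""
    row, col = grid_coords(rank, cols)
    return {
        "up": grid_rank(row - 1, col, rows, cols),
        "down": grid_rank(row + 1, col, rows, cols),
        "left": grid_rank(row, col - 1, rows, cols),
        "right": grid_rank(row, col + 1, rows, cols),
    }
```

A stencil or matrix code would use these neighbor ranks as the destinations and sources of its boundary exchanges.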
The latest version of MPI, MPI-2,
includes a number of features such as
remote memory access (RMA). RMA
bypasses the send/receive protocol, permitting
one-sided initiation and completion
of operations. MPI-2 also added parallel
I/O and dynamic process and thread
management.
MPI includes a profiling interface that
helps debug and tune a system. Applications
are typically written in C/C++ and
Fortran, and there are interfaces for a number
of other languages such as Java.
MICROSOFT ROBOTICS STUDIO
Microsoft Robotics Studio is built on
two main components: the CCR and
the XML-based Decentralized Software
Services Protocol (DSSP). The CCR is a
managed code library (DLL) accessible
from any language targeting the .NET 2.0
Common Language Runtime (CLR) (Fig.
3). This includes the Compact Framework,
Windows CE, Windows XP, and Windows
Vista in addition to the Windows
server products.
The CCR addresses robotics applications
that often attempt to exploit multiprocessor
and multicore platforms that
must deal with asynchronous operations
on a regular basis. But the CCR equally applies to service-oriented applications
that require a robust communication and
scheduling system.
Like DDS, CCR is designed for loosely
coupled systems where components are
often developed and deployed independently.
These systems frequently require
robust failure and isolation support. The
theory behind the CCR is to move programmers
away from conventional, error-prone
methods of explicitly synchronizing
access to shared memory. Typical
thread primitives, such as locks and monitors
for shared memory, give way to scheduling
and protocol design of the messaging
system, which tends to scale much better
than explicit shared-memory synchronization.
Ports are used to accept incoming messages.
These messages are queued and
passed along to arbiters that determine
what to do with the data and what code
will be dispatched to handle the information.
Arbiters, which can be combined, are
designed to handle multiple requests.
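
The port-and-arbiter flow can be sketched in Python as follows. This is a loose model and not the CCR library: the Port, Receive, and Join classes are assumptions for illustration, and handlers run inline here, whereas the CCR would hand them to a dispatcher. Receive fires for every message on one port; Join fires only when both of its ports hold a message.

```python
from collections import deque


class Port:
    """Queue of incoming messages with arbiters attached (toy model)."""

    def __init__(self):
        self._queue = deque()
        self._arbiters = []

    def attach(self, arbiter):
        self._arbiters.append(arbiter)

    def post(self, message):
        self._queue.append(message)
        for arbiter in self._arbiters:
            arbiter.evaluate()        # let each arbiter decide what to run

    def pending(self):
        return len(self._queue)

    def take(self):
        return self._queue.popleft()


class Receive:
    """Arbiter that runs a handler for every message on a single port."""

    def __init__(self, port, handler):
        self.port, self.handler = port, handler
        port.attach(self)

    def evaluate(self):
        while self.port.pending():
            self.handler(self.port.take())


class Join:
    """Arbiter that waits for one message on each of two ports."""

    def __init__(self, left, right, handler):
        self.left, self.right, self.handler = left, right, handler
        left.attach(self)
        right.attach(self)

    def evaluate(self):
        while self.left.pending() and self.right.pending():
            self.handler(self.left.take(), self.right.take())
```

The application code never locks anything; coordination is expressed entirely by which arbiter is attached to which port.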
The CCR uses dispatcher objects with
thread and message queues to provide
synchronization and distribution of jobs.
Threads are reused, and communication
essentially employs a messaging model
that enables the environment to be distributed
more easily. The approach lends itself
to object-oriented programming languages
like Microsoft’s C#, where applications
may have thousands of objects requiring
thread services. This approach to pooling
threads isn’t unique to CCR. It’s common
in other frameworks, such as Java 2 Enterprise
Edition (J2EE).
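
Thread reuse of this kind is easy to demonstrate with Python's standard thread pool. The sketch below is not CCR code; run_jobs and its workers parameter are names chosen for the example. It runs many small jobs on a fixed number of threads and reports how many distinct threads actually did the work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_jobs(jobs, workers=4):
    """Run callables on a fixed pool; return (results, distinct threads used)."""
    used = set()
    lock = threading.Lock()

    def wrapped(job):
        with lock:
            used.add(threading.get_ident())   # record which pool thread ran it
        return job()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(wrapped, jobs))  # map preserves submission order
    return results, len(used)
```

With 1,000 jobs and four workers, the thread count stays at four or fewer: objects request service, but they do not each own a thread.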
CCR is designed to work closely with
C# and other .NET-based languages.
For example, code using C# iterators can
employ the yield statement that permits
implicit definition of an enumerator delegate.
The integration allows for implementation
of CCR-style iterators using
the same programming style.
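
Python generators offer a close analog to the iterator style described here, so the idea can be sketched without C#. In this illustrative example, which does not use the CCR, each task is written as ordinary sequential code and uses yield to hand control back to a simple round-robin scheduler. The run and counter names are assumptions.

```python
from collections import deque


def run(tasks):
    """Round-robin scheduler: resume each generator until all have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)               # run the task up to its next yield
        except StopIteration:
            continue                 # task finished; do not requeue it
        ready.append(task)


def counter(name, steps, log):
    """Sequential-looking task that gives up control after every step."""
    for i in range(steps):
        log.append((name, i))
        yield                        # return control to the scheduler
```

Two counters started together interleave their steps even though neither contains any explicit threading or callback code.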
None of these three systems answers all
of the problems that may be encountered
in a distributed multithreaded environment,
but each has its place. As frameworks,
one or more may be used as part of
an application or set of applications. The
key is to eliminate as much work as possible for the
developer while providing services that
would otherwise need to be developed as
part of the application. Utilization of standard
frameworks provides access to developers
with framework experience, as well
as a support system.
TARGET SPECIFIC FRAMEWORKS
General frameworks like MPI and DDS
can be utilized on a range of processing
platforms. But a system’s architecture
often may require frameworks that are
more specific. Examples include Mercury
Computer Systems’ MultiCore Framework
(MCF) for the IBM Cell processor
(Fig. 4) and NVidia’s CUDA (Compute
Unified Device Architecture) for the G8X
GPUs (graphics processing units), which
are found on NVidia’s Tesla computing
adapters (Fig. 5) as well as on a number of
NVidia video cards.
MCF takes advantage of the multiple
synergistic processor elements (SPEs) in
the Cell processor. Each SPE has its own
local memory. The Power processor element
(PPE) has access to the large main
memory. While the PPE supports
virtual memory, the SPEs do
not. The SPEs do the heavy
lifting, so it is important to
keep them doing real work.
That means the care and feeding
of memory is critical to efficient
SPE operation.
The MCF, which runs on the SPEs
and PPE, minimizes data access latency
while providing overlapped communication
and computation on the SPEs. It
enables designers to choose the computational
and storage granularity for a problem
with the idea of limiting SPE code to DMA setup and computational chores.
The system breaks memory up into blocks
called tiles.
A tile contains data as well as a descriptor
that includes details like its virtual memory
address on the PPE side. The MCF manager
runs on the PPE and feeds an input
channel with tiles that are then distributed
to worker tasks running on the SPEs.
Result tiles from the worker application
are placed in an output channel that
the PPE reads. The SPEs run a small,
12-kbyte kernel that manages the DMA
data exchange and the worker code used
on the SPE. Multiple tasks can run on a
single SPE.
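
The manager, tile, and channel arrangement can be modeled with threads and queues in Python. This is a sketch of the pattern only, not the MCF API: threads stand in for SPE worker tasks, a list offset stands in for the tile descriptor's address, and every name below is an assumption.

```python
import queue
import threading


def make_tiles(data, tile_size):
    """Cut `data` into (offset, payload) tiles; the offset is the descriptor."""
    return [(off, data[off:off + tile_size])
            for off in range(0, len(data), tile_size)]


def worker(inbox, outbox, kernel):
    """Worker task: pull tiles, compute, and push result tiles."""
    while True:
        tile = inbox.get()
        if tile is None:             # sentinel from the manager: no more tiles
            break
        offset, payload = tile
        outbox.put((offset, [kernel(x) for x in payload]))


def process(data, tile_size, workers, kernel):
    """Manager: feed the input channel, then reassemble from the output channel."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox, kernel))
               for _ in range(workers)]
    for t in threads:
        t.start()
    tiles = make_tiles(data, tile_size)
    for tile in tiles:
        inbox.put(tile)
    for _ in threads:
        inbox.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    result = [None] * len(data)
    for _ in tiles:
        offset, payload = outbox.get()
        result[offset:offset + len(payload)] = payload  # place by descriptor
    return result
```

Result tiles may arrive in any order; the descriptor carried with each tile is what lets the manager put them back in the right place.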
The system supports sparse matrices,
including the ability to transpose result
data. Likewise, it lets developers write
applications for the SPE that do not have
to deal with the memory-management
complexities of the Cell architecture.
Furthermore, Cell processor systems deal
with stream video processing, another area
where NVidia’s approach is used, though the
chip architecture is much different.
NVidia’s CUDA also is tasked with
optimizing a developer’s time in creating
applications that take advantage of a GPU.
CUDA includes a native C compiler along
with FFT and BLAS libraries, a profiler,
and a gdb debugger, as well as a host runtime
driver. The Tesla adapters typically plug
into an x16 PCI Express slot.
Sample applications address operations
such as image convolution, binomial option
pricing, and Sobel edge detection. CUDA
provides host support for Fortran and has a
MathWorks Matlab plug-in. It works with
the Tesla adapters in addition to a number
of NVidia graphics adapters.
CUDA has a memory-management issue
similar to that of the Cell processor, but the
interaction is more flexible. The complexity
comes in the type of processors, the data
flow, and their management. The chips
were designed for graphics processing, so
they are amenable to streaming algorithms
of this type. But the architecture can handle
more general algorithms as well.
Because of the limitations, though,
CUDA addresses thread and memory
management. For example,
a fast shared memory
region typically can be
used for texture lookups in
graphics applications. But it
also can be used for general communication
among threads. The architecture
uses a SIMD model so groups
of 32 threads, called warps, run
simultaneously, executing the same code
but on different data.
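
The "same code, different data" model can be sketched in plain Python. The example below is a sequential simulation, not CUDA: GROUP_SIZE, lockstep_groups, and simd_map are names invented here. It partitions thread indices into groups of 32 and applies one kernel function to every index, with each index selecting its own data element.

```python
GROUP_SIZE = 32   # threads that execute together, per the description above


def lockstep_groups(n_threads, group_size=GROUP_SIZE):
    """Partition thread indices 0..n_threads-1 into consecutive groups."""
    return [range(start, min(start + group_size, n_threads))
            for start in range(0, n_threads, group_size)]


def simd_map(kernel, data, group_size=GROUP_SIZE):
    """Apply the same kernel to every element, one simulated thread each."""
    out = [None] * len(data)
    for group in lockstep_groups(len(data), group_size):
        for tid in group:            # on a GPU these lanes run simultaneously
            out[tid] = kernel(tid, data)
    return out
```

For 70 elements the sketch forms groups of 32, 32, and 6 threads; the partly filled last group shows why data sizes that are multiples of the group size use the hardware most fully.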
The CUDA framework’s host processor
is physically separated from the computational
elements, unlike the Cell, which
incorporates the PPE on the same chip
as the SPEs. This means that additional
worker cores can be added in the CUDA
approach by adding more adapter boards.
It also means the framework must account
for this capability.
AMD also has its ATI line of graphics
cards and a FireStream GPU adapter on
par with NVidia’s Tesla, so it is not surprising
to find a similar framework to support
a similar streaming type of architecture.
Likewise, its video gaming roots are
apparent in some target applications such
as physics simulation support for games.
AMD calls its framework the Stream
Computing Software Stack. Its native
development tool is based on the Brook C/
C++ compiler. The Brook compiler incorporates
data parallel computing ideas and
targets a range of platforms. It also adds
new data types such as streams plus scatter/
gather operations. The AMD incarnation
addresses the BrookGPU subset that
accounts for the advantages and limitations
of a GPU like the FireStream.
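
For readers unfamiliar with scatter/gather on streams, the two operations reduce to indexed reads and indexed writes. The Python sketch below is not Brook syntax, and the function names and fill parameter are assumptions for illustration.

```python
def gather_indexed(source, indices):
    """Gather: read source[i] for each index, producing a new stream."""
    return [source[i] for i in indices]


def scatter_indexed(values, indices, size, fill=0):
    """Scatter: write each value to its index in an output of length `size`."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value               # a later write to the same index wins
    return out
```

Gather lets a kernel pull inputs from arbitrary positions, while scatter lets it place outputs at arbitrary positions.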
Frameworks such as those mentioned in
this article will be critical in taking advantage
of multicore platforms, whether
they’re heterogeneous or homogeneous,
because they remove much of the management
complexity of the underlying system
from the programmer’s perspective.
NEED MORE INFORMATION
AMD
Brook Language
IBM
MathWorks
Mercury Computer Systems
Message Passing Interface Forum
Microsoft
NVidia
Object Management Group
PrismTech
Real-Time Innovations