[Technology Report]
Match Multicore With Multiprogramming
Multiple cores deliver performance with lower power requirements, but processors can’t contribute much if they’re idle.
William Wong
ED Online ID #21341
June 25, 2009
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Across the embedded landscape, the design credo has become
“more cores.” However, challenges remain when it comes to the
software side. Some hardware architectures can deliver dozens of
cores, while others hit thousands of cores. Unfortunately, applications
don’t always port easily across different architectures.
For the low end of the embedded space, single-core solutions
will remain. It’s still possible to move up the power and performance
curve by moving to faster or wider processors. At the high
end, though, multiple cores are the way to go.
This is why double-precision floating point crops up often and
the reason these solutions often wind up in supercomputers. In
fact, desktop and rack-mount systems like ones from Nvidia bring
this level of processing power to the masses (see “Standard GPU
Cluster Provides High Performance In The Mid-Range”).
Another matter that crops up when discussing software and
multicore architectures is virtualization. Not all multicore platforms
support virtualization, but it opens up opportunities. And
while it does make hardware design more challenging, it typically
simplifies software and application management.
SMP SERVERS
The Xeon Nehalem-EX is the top-of-the-line octal core symmetrical
multiprocessing (SMP) platform from Intel. Multichip
solutions like an eight-chip, 64-core system utilize the high-speed
QuickPath point-to-point interconnect to link processor and
peripheral controllers together (Fig. 1).
This architecture will be familiar to those using AMD’s Opteron
processors with HyperTransport links. In both cases, the simplest
configuration is a single processor linked to a single peripheral
controller by a single high-speed link.
Both vendors implement a form of cache-coherent, non-uniform
memory access (ccNUMA) in addition to a distributed memory
subsystem. Each processor chip has its own memory controllers
plus L1, L2, and L3 caches. Any chip can access memory in
any other chip by using the high-speed links. Of course, data that’s
further from the requester will take more time to access.
These high-speed links are also used on consumer devices, but
a single link to an I/O hub is usually all that is necessary. On the
other hand, servers generate significant traffic between processor
chips for shared memory access. Chip-to-chip traffic and cache
management are critical to efficient operation.
One of the key features in AMD’s new Istanbul Opteron, HT
Assist, optimizes the memory request and response process to
minimize the number of transactions involved, freeing up bandwidth
for other traffic (Fig. 2). HT Assist actually tracks data
movement among cores and caches, allowing a request to be serviced
by the nearest core with the required data.
The worst-case scenario is when data must be accessed from
off-chip memory by the chip that owns that memory space. The
best-case scenario is finding the data in the cache of the chip running
the thread needing the data. Intermediate scenarios will have
cores getting the data from an adjacent chip’s cache.
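The three cases above can be sketched as a toy cost model. The relative costs below are placeholders invented for illustration, not measurements from any Opteron or Xeon part; only their ordering (local cache, then an adjacent chip's cache, then remote off-chip memory) comes from the text.

```python
from enum import Enum


class Source(Enum):
    LOCAL_CACHE = "cache of the chip running the thread"    # best case
    REMOTE_CACHE = "cache of an adjacent chip"              # intermediate case
    REMOTE_MEMORY = "off-chip memory owned by another chip" # worst case


# Hypothetical relative costs in arbitrary units (assumed, not measured).
RELATIVE_COST = {
    Source.LOCAL_CACHE: 1,
    Source.REMOTE_CACHE: 5,
    Source.REMOTE_MEMORY: 20,
}


def average_cost(accesses):
    """Return the mean relative cost of a sequence of Source values."""
    accesses = list(accesses)
    if not accesses:
        raise ValueError("need at least one access")
    return sum(RELATIVE_COST[s] for s in accesses) / len(accesses)
```

Even with made-up numbers, the model shows why locality matters: a handful of remote memory accesses can dominate the average.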
The system becomes more complex when virtualization and
caches are brought into play, leading to variable latency for data
that’s hard to determine. This can be an issue in deterministic
embedded applications, but is less so in most server applications
where speed is desired over fine-grain determinism.
Programmers now seek these platforms because they greatly
simplify the programming task. Likewise, applications can exploit
the growing number of cores, assuming the application can utilize
enough threads efficiently.
Efficient use of multicore systems isn’t as easy as it might look.
Cache size and locality of reference within an application’s working
data set can affect how well a particular algorithm will run.
AMP APPLICATION PROCESSORS
Symmetric multiprocessing (SMP) architectures come in handy for
many embedded applications, but asymmetrical multiprocessing
(AMP) can also be useful for other applications. AMP configurations
show up in a number of guises, from Texas Instruments’
OMAP (Open Multimedia Application Platform) to Freescale’s
P4080 QorIQ (Fig. 3).
TI’s OMAP 44xx combines an ARM Cortex-A9, a PowerVR
SGX 540 GPU, a C64x DSP, and an Image Signal Processor. Each
core has a dedicated function. Communication between processors
isn’t symmetrical.
The OMAP operates only in an AMP mode. The P4080, on the
other hand, is an SMP system at heart with the ability to partition
its cores into an AMP mode as well. The eight-core chip can run
as eight independent cores, or it can be combined in a number of
configurations (e.g., a pair of two-core SMP subsystems or four
single-core subsystems).
The main high-level architectural difference between the
OMAP and P4080 is that the OMAP functions are fixed and the
cores are optimized for their respective chores. This makes programming
significantly easier since partitioning the application to
particular cores will be based on matching functionality.
Each subsystem’s level of performance is limited by the architecture,
whereas the P4080 can adjust its partitioning, though this
is normally done when the system starts up. A system designer
can adjust the allocation of cores in the P4080, assuming sufficient
cores are available. QorIQ platforms with fewer cores are on
the market, too, allowing a more economical chip to be used.
IBM’s Cell processor fits somewhere in between (see “Cell
Processor Gets Ready To Entertain The Masses”). It incorporates a 64-bit Power core plus eight Synergistic
Processing Elements (SPEs). The SPEs are identical (each with
256 kbytes of memory), and they operate in isolation unlike the
shared-memory SMP systems already discussed. There is no virtual
memory support or caching within the SPEs.
This has advantages and disadvantages for hardware and software
design. The approach simplifies the hardware implementation
but complicates the software from a number of perspectives.
For example, memory management is under application control,
as is communication between cores. Data must be moved into the
local memory of an SPE before it can be manipulated.
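A rough sketch of that staging pattern follows. It is plain Python rather than Cell SDK code, and the copies stand in for DMA transfers. The 256-kbyte figure comes from the text; the item size, the space reserved for code, and the function name are assumptions made for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local memory, from the text


def process_in_local_store(data, kernel, item_bytes=4, reserve_bytes=64 * 1024):
    """Stage `data` through a fixed-size local buffer, one chunk at a time.

    item_bytes and reserve_bytes (space set aside for code and stack)
    are hypothetical values chosen for this sketch.
    """
    budget = LOCAL_STORE_BYTES - reserve_bytes
    # Split the remaining space evenly between an input and an output buffer.
    chunk_items = budget // (2 * item_bytes)
    if chunk_items < 1:
        raise ValueError("local store too small for even one item")

    result = []
    for start in range(0, len(data), chunk_items):
        local_in = list(data[start:start + chunk_items])  # copy in ("DMA in")
        local_out = [kernel(x) for x in local_in]         # compute on local data only
        result.extend(local_out)                          # copy out ("DMA out")
    return result
```

The application, not the hardware, decides the chunk size and when each transfer happens, which is the extra burden the text describes.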
Fully exploiting architectures like the Cell takes time because
they differ from the more conventional SMP or AMP platforms.
The improvement in software on Cell-based platforms like Sony’s
PlayStation 3 over the years highlights the changes in programming
techniques and experience.
DIVIDE AND CONQUER
Changing programming techniques is
key to success in using graphics processing
units (GPUs). GPUs from the likes of ATI
and Nvidia expose hundreds of cores in a
single chip. These GPUs can be combined
into multichip solutions, providing developers
with thousands of cores. For instance,
four Nvidia Tesla T10s packed into a 1U
chassis deliver 960 cores (Fig. 4).
Programming the Tesla or any of the other compatible Nvidia
GPU chips can be a challenge, but frameworks like Nvidia’s CUDA
or utilization of runtimes based on CUDA make the job much easier.
Part of the challenge is the single-instruction, multiple-thread
(SIMT) architecture of Nvidia’s GPU (see “SIMT Architecture
Delivers Double-Precision TeraFLOPS”). Like
many high-performance systems, it likes to work with arrays of
data. This works well for many applications but not all, which is
one reason why GPUs are often matched with a multicore CPU.
CUDA and OpenCL (Open Computing Language), another
parallel programming framework, match the GPU approach that
uses separate memory from the host processor. This means data
must be moved from one place to another before it can be manipulated.
The C programming language has some extensions, but
there are also restrictions. For example, it is recursion-free, and it
doesn’t support function pointers. Some of these restrictions come
from the SIMT approach.
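Living without recursion usually means managing a stack by hand. The sketch below shows that restructuring in plain Python, not CUDA C; the tree layout (nested tuples of value, left, right) and the function names are assumptions made for the example.

```python
def tree_sum_recursive(node):
    """Reference version: the style that a recursion-free target rules out."""
    if node is None:
        return 0
    value, left, right = node
    return value + tree_sum_recursive(left) + tree_sum_recursive(right)


def tree_sum_iterative(root):
    """Same result using an explicit stack instead of recursive calls."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        value, left, right = node
        total += value
        stack.append(left)
        stack.append(right)
    return total
```

Both functions return the same sum; only the second keeps its bookkeeping in data rather than in the call stack.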
Many applications utilize CUDA. Performance gains compared
to conventional SMP platforms vary quite a bit, though, from a
factor of two to 100. One reason for this variance is that threads
work best if they’re running in groups of 32. Branches don’t
impact performance, assuming the group of 32 threads follows the
same branch.
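A simplified model makes the grouping effect concrete. It assumes that a group whose threads disagree on a branch executes both sides one after the other, and it counts the resulting passes. This is an illustration in Python, not a description of any particular GPU's scheduler.

```python
WARP_SIZE = 32  # threads per group, from the text


def serialized_passes(branch_taken):
    """Count execution passes for a list of per-thread branch outcomes.

    Each group of WARP_SIZE threads costs one pass if every thread agrees
    and two passes if the group splits (simplifying assumption).
    """
    passes = 0
    for start in range(0, len(branch_taken), WARP_SIZE):
        group = branch_taken[start:start + WARP_SIZE]
        passes += len(set(group))
    return passes
```

For 64 threads, sorting the outcomes so each group of 32 agrees gives two passes, while alternating outcomes thread by thread doubles that to four.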
Specialized processors like GPUs are one approach to both
graphics and multicore processing. Another approach is to use
many conventional cores, such as Intel’s Larrabee (Fig. 5). Larrabee
uses x86-compatible cores that are optimized for vector
processing (see “Intel Makes Some Multicore Lemonade”).
In one sense, Larrabee is similar to IBM’s Cell processor. The
Larrabee cores only have access to the 32-kbyte L1 and 256-kbyte
L2 caches. If data isn’t in the cache, it must be requested from the
memory controller or another cache within the system. The data
is then placed into the core’s cache, and the application proceeds
on its merry way.
A ring bus is used to communicate between cores and controllers.
IBM’s Cell Element Interconnect Bus (EIB) is also a ring
bus connecting the SPEs to the memory controller and peripheral
interface. From a programming perspective, Larrabee’s cache and
the Cell’s SRAM differ significantly.
Still, to programmers, Larrabee appears as an array of cache-coherent
x86 processors. Because of its GPU orientation, programmers
can take advantage of DirectX and OpenGL support.
MULTICORE NETWORKING
Multicore chips also are common pieces in a networking infrastructure
puzzle. Handling 10-Gbit/s networks is a challenge all by
itself for multicore chips. Analyzing and massaging data from a network
connection at line speeds requires lots of processing power.
Netronome’s NFP-3200 Network Flow Processor contains 40
1.4-GHz cores that can run eight threads each, providing 320
hardware-based threads in one chip. This is on the same order as
GPUs, but the processors are slanted toward packet processing.
Like IBM’s Cell, the NFP-3200 has a master CPU-style controller.
In this case, it’s an ARM11 core. Its 40 cores, also called
microengines, are compatible with Intel’s IXP28xx architecture,
which was developed for network processing. This compatibility
is important because a good deal of code targets this architecture.
Older chips had fewer cores, so in a sense the NFP-3200 offers
more of the same.
Of course, simply tossing more cores at the problem is just
one course of action. Netronome incorporates a host of improvements,
such as enhanced microblocks with TCP offload support.
The interconnect speeds are higher as well, running at 44 Gbits/s
between cores.
Netronome’s chip has a number of specialized processors,
including its cryptography system, which handles the ubiquitous
security protocols. Its PCI Express interface supports the I/O virtualization
often used by x86 processors. An x86 processor can then be placed next to the
NFP-3200 instead of being separated by another network link.
Programming the NFP-3200 is often less of an issue compared
to other multicore chips because of the large amounts of existing
code for the IXP28xx family. In addition, Netronome provides
libraries that make the creation of network-processing applications
more a matter of tying together modules.
Cavium’s Octeon II is a more conventional SMP multicore
design with two to six 64-bit MIPS64 cores connected by a crossbar switch (see “Multicore Chip Handles Broadband Packet Processing”). Like Netronome’s chip, the Octeon
II is designed for network and storage devices.
Also, the Octeon II has a RAID 5/6 accelerator as well as Hyper
Finite Automata (HFA) support of regular expressions for packet
inspection. Programming the Octeon II is comparable to most
SMP systems. It can run operating systems such as Linux.
OTHER MULTICORE ARCHITECTURES
Moving to more radical multicore architectures adds to programming
chores. However, it opens opportunities for developers
that can take advantage of the new architectures.
The IntellaSys SeaFORTH 40C18 fits in this category (Fig. 6).
Its native programming language is VentureForth (see “Parallel
Programming Is Here To Stay”). Instructions
are actually five bits with four instructions packed into a single
18-bit word. (One instruction is only 3 bits long.) The 40C18’s
40 cores have identical processing units with 64 words of RAM
and 64 words of ROM. That’s not a lot of space, though this does
translate to 256 instructions.
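The arithmetic works out as three 5-bit slots plus one 3-bit slot (5 + 5 + 5 + 3 = 18 bits), and 64 words of four slots each gives the 256 instructions. The sketch below packs and unpacks such a word in Python. The slot order (first instruction in the most significant bits) and the plain binary encoding are assumptions for illustration, not the chip's documented format.

```python
SLOT_WIDTHS = (5, 5, 5, 3)     # three full slots plus one short slot
WORD_BITS = sum(SLOT_WIDTHS)   # 18


def pack_word(slots):
    """Pack four slot values into one 18-bit word (assumed MSB-first order)."""
    if len(slots) != len(SLOT_WIDTHS):
        raise ValueError("expected exactly four slots")
    word = 0
    for value, width in zip(slots, SLOT_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_word(word):
    """Split an 18-bit word back into its four slot values."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word does not fit in 18 bits")
    slots = []
    for width in reversed(SLOT_WIDTHS):
        slots.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(slots))
```

Because the last slot holds only three bits, only a subset of the instruction set can appear there under this layout.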
Obviously, programming the 40C18 will be dramatically different
from programming a chip with more storage, like Intel’s Larrabee or IBM’s
Cell. The 40C18 cores consume less than 9 mW, whereas the
other two chips don’t work well without massive heatsinks and a
fan or two. The 40C18 is designed for embedded and even mobile
applications.
Programming the 40C18 will be different for most developers,
and not just because Forth is the programming language. Each
core’s small memory space and the matrix interconnect change
program design methods. Cores typically run small functions that
pass data onto one or more neighbors, so cooperative programming
is the way to go.
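The neighbor-to-neighbor style can be imitated with a chain of small stages, each consuming from the one before it. The sketch below uses Python generators, runs sequentially, and is not VentureForth; the names `core` and `build_chain` are hypothetical.

```python
def core(stage, upstream):
    """One 'core': apply a small function to each item from its neighbor."""
    for item in upstream:
        yield stage(item)


def build_chain(stages, source):
    """Link the stages so each one feeds the next, starting from `source`."""
    stream = iter(source)
    for stage in stages:
        stream = core(stage, stream)
    return stream
```

For example, `list(build_chain([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))` passes each value through an add stage and then a doubling stage. On real hardware the stages would run concurrently on separate cores, which this sketch does not attempt to model.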
Even external memory accesses require three cores working
together. This makes sense when there are many cores to
work with. The 40C18 also has the unique ability to send a small
program of four instructions in a single word to be executed by
a neighboring core. That is actually enough space to perform a
block transfer.
The XMOS XS1-G4 is an interesting mix based on 32-bit integer
Xcores (see “Multicore And Soft Peripherals Target Multimedia
Applications”). Each Xcore can handle a number
of different threads with a hardware-based event system that
facilitates XMOS’s soft peripherals. Like the 40C18, the XS1-G4
can wait on an I/O port. The difference is that the XS1-G4 handles
multiple threads whereas the IntellaSys chip works with one.
Developers can use XC, an extended version of C, to get the
most out of XMOS hardware. Extensions provide shortcuts to the
hardware support, which also includes XLinks. The XLinks connect
the four cores in the chip and provide four off-chip links. As
a result, multiple chips can be connected. Internally, the chip uses
a switch for the XLink connection, but the hardware and software
provide a uniform interface for interprocessor communication.
Each core has 64 kbytes of memory. This is more than the
40C18 but less than some of the higher-performance chips covered
here. Still, it is sufficient for a significant amount of application
code, allowing a more conventional thread approach to programming.
The bulk of programming for the XMOS chip is likely
to be in conventional C or C++ rather than XC, which tends to be
used for communication and peripheral handling.
The chip won’t present a challenge to
double-precision floating-point GPUs or
other high-end systems, but its integer and
fixed-point DSP support lends itself to many
other audio- and video-processing functions.
Linked XMOS chips are already used to
drive multiple large-screen LCDs.
Multicore architectures continue to proliferate.
Programming these cores efficiently
and choosing the right one isn’t necessarily
easy, but it will become more common even
for embedded developers. Legacy applications
will tend to migrate to architectures that
match their existing hosts. More radical
departures are possible when the applications
are being redesigned or created from
scratch.