[Technology Report]
Multicore Projects Mean Multiple Choices
Multicore solutions may be finding their way into more projects, but opinions vary on the best architecture to use.
Daniel Harris
ED Online ID #17695
December 13, 2007
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
When it comes to multiprocessing, what’s good for the hardware
goose is not necessarily good for the software gander.
The ideal hardware architecture for a multicore design is a
heterogeneous (asymmetric) single instruction-set architecture
(ISA) that essentially includes both high- and low-complexity
cores to achieve lower power and higher throughput,
somewhat mitigating Amdahl’s Law.
Now imagine that Amdahl’s Law (which gives a system’s
maximum expected improvement when only part of the system is improved) were of no concern
and we had unlimited die sizes. The ideal multicore from a programming perspective
would be homogeneous (symmetric), so code wouldn’t build up a dependence on a specific ISA.
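To make the law concrete, here is a minimal Python sketch that assumes the usual textbook form of Amdahl’s Law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The function name and the sample fractions are illustrative assumptions, not figures from any vendor.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a workload runs in parallel.

    parallel_fraction: share of the work (0.0 to 1.0) that can be spread across cores.
    cores: number of cores applied to the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial share is untouched; only the parallel share shrinks with core count.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: half of the run time can be parallelized.
    for n in (1, 2, 4, 8, 64):
        print(f"{n:>2} cores: {amdahl_speedup(0.5, n):.2f}x")
```

With half the work parallelizable, four cores yield only a 1.6x speedup, and no core count can push the result past 2x.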
The Cell microprocessor from IBM, Sony, and Toshiba has a heterogeneous architecture,
though it isn’t a single ISA. Programming the device can be rather arduous, however,
leaving you with code that’s heavily architecture-dependent. According to Dave Haas, principal
architect at Raza Microelectronics, you should avoid pigeonholing yourself
into a given vendor or architecture when you can, which makes homogeneous architectures
a safer bet when you have a choice.
Regardless of the best approach, there’s a limited number of options for
today’s embedded and general-purpose system designers. If you’re in the
embedded space, several of the multicore choices are heterogeneous. If
you live in a general-purpose world, you might only be able to get
a homogeneous multicore.
DECISIONS • When it comes to multiprocessing,
several tradeoffs determine how much performance
you squeeze out of your transistors (see the
table). For example, there’s the thread-versus-core
tradeoff. According to
Kevin Kissell, MIPS principal
architect, you must start by
analyzing your system to determine
which applications can be
decomposed into a number of
constituent tasks or threads.
“Parallelization of monolithic applications is often possible,
but seldom easy, and it’s generally easier for a big scientific code
than a small embedded real-time application,” says Kissell. And
to save on area, consider utilizing a more thread-heavy architecture.
The idea is to maximize the performance per watt and
choose an architecture that will saturate the memory and power
envelope.
“To the extent that a single-threaded core cannot keep its
pipeline fully utilized because of delays from memory and slow
functional units, multithreading can extract throughput with a
relatively modest increase in area, and in many cases the payback
is superlinear,” he says.
For instance, you might achieve 30% more throughput for
15% more area in the CPU and cache subsystem. “This can be
converted into a power optimization if that recovery of lost
bandwidth allows the multithreaded core to run at a lower frequency
than an equivalent single-threaded core, and still meet
performance targets,” says Kissell.
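The arithmetic behind that example can be sketched in a few lines of Python. It uses the 30% and 15% figures quoted above and assumes, as a simplification, that throughput scales linearly with clock frequency (memory-bound code will not behave this neatly). The function and key names are made up for illustration.

```python
def multithreading_tradeoff(throughput_gain: float, area_increase: float) -> dict:
    """Compare a multithreaded core against a single-threaded baseline of 1.0.

    throughput_gain: fractional throughput improvement (0.30 means 30% more).
    area_increase: fractional area growth of the CPU and cache subsystem.
    """
    throughput_ratio = 1.0 + throughput_gain
    area_ratio = 1.0 + area_increase
    return {
        # Values above 1.0 mean the extra area more than pays for itself.
        "throughput_per_area": throughput_ratio / area_ratio,
        # ASSUMPTION: throughput is proportional to frequency, so this is the
        # relative clock needed to match the baseline core's throughput.
        "iso_throughput_frequency": 1.0 / throughput_ratio,
    }


if __name__ == "__main__":
    result = multithreading_tradeoff(throughput_gain=0.30, area_increase=0.15)
    print(f"throughput per unit area: {result['throughput_per_area']:.2f}x")
    print(f"clock needed for baseline throughput: {result['iso_throughput_frequency']:.0%}")
```

Under these assumptions, the multithreaded core delivers about 13% more throughput per unit area, or it could match the baseline's throughput at roughly 77% of the clock rate.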
Memory organization is another tradeoff. If your application doesn’t
require significant amounts of shared data or instructions, a distributed
memory scheme is probably the best candidate. “Each processing element’s memory
can be sized to its dedicated tasks,” Kissell says, “and one
can use different processor frequencies, different processor
models, and even different processor architectures for the different
processing elements to achieve the best area/power/performance
values.”
But if there’s an abundance of code and/or data sharing, a
symmetric configuration may be your best bet. According to
Kissell, this approach “adds complexity and loses a bit of peak
performance relative to a distributed memory model, because
there will be some contention for the shared memory array, and
because a cache-coherency protocol must be used among the
cores to ensure that they all see the same values at each memory
location, despite the presence of caches.”
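The difference between the two memory models shows up in how software is written. The Python sketch below is only an analogy for the programming style, not a model of real hardware: the lock stands in for the contention a shared memory array introduces, while the queue stands in for explicit messages between processing elements that each own their data. All function names are hypothetical.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory style: every worker updates one total, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for value in chunk:
            with lock:  # contention point: every update serializes here
                total += value

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Distributed-memory style: each worker owns its data and sends one message."""
    results = queue.Queue()

    def worker(chunk):
        local_total = sum(chunk)  # private state, nothing shared while computing
        results.put(local_total)  # a single message back to the coordinator

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One result per chunk is waiting in the queue once all workers have finished.
    return sum(results.get() for _ in chunks)


if __name__ == "__main__":
    data = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
    print(shared_memory_sum(data), message_passing_sum(data))
```

Both functions return the same answer. The shared version touches the lock once per value, while the message-passing version communicates once per worker, which is the kind of tradeoff Kissell describes.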
But according to Chuck Moore, senior fellow for Advanced
Micro Devices, end users may have misaligned expectations
about multicore technology.
“Multicore is very good for throughput and responsiveness,
but given that most applications are still serial, these actually
won’t speed up on multicore,” says Moore. “Over time, there
will be an increasing number of parallel applications available,
but this is going to take more time than people seem to realize.”
DIFFERENT VIEWS • When it comes to multiprocessing, all
“coaches” believe their team has the best strategy for winning
(see “Multicore My Way” at www.electronicdesign.com, ED
Online 14631). Take AMD and Intel, which have gone public
about their opposite approaches to next-generation cores. Intel
believes homogeneous cores are the way to go, while AMD
believes the future lies in heterogeneous cores.
“Multicore solutions of tomorrow will be heterogeneous,”
says AMD’s Moore. “They will initially involve the use of
architecturally compatible cores with varying capabilities, but
will grow to include more special-purpose and power-efficient
hardware that is accessed through well-defined APIs (application
programming interfaces).”
Intel and Vivace Semiconductor also have radically different
views of the embedded space. “Intel’s Embedded and Communications
Group estimates the percentage of multicore designs
that will utilize asymmetric multiprocessing (AMP) in the next
three to four years of all Embedded and Communications
Group-deployed multicore platforms to be about 10%,” says
Edwin Verplanke, platform solution architect with Intel’s
Embedded and Communications Group.
“Once the core count meets 32 and beyond, the adoption of
AMP may grow,” Verplanke adds. “Some of our customers
have proprietary, often real-time operating systems that are not
SMP-capable (symmetric multiprocessing). Those customers
may be interested in running specific functions on separate
cores. Those functions could include forwarding engines, cryptography,
pattern matching, etc.”
This is in stark contrast with what Cary Ussery, president
and CEO of Vivace Semiconductor, believes. He says that AMP
makes up about 90% of all embedded multicore designs.
Should it surprise us that two professionals from different
organizations have the exact opposite view of the market (see
“Symmetric Multiprocessing Vs. Asymmetric Processing” at
www.electronicdesign.com, Drill Deeper 17693)? Or is this just
another example of an industry segment plagued with semantic
problems (see “The Semantics Of Multiprocessing,” Drill
Deeper 17694)?
SYSTEM OPTIMIZATION • Once you’ve chosen the architecture
for your next system, assuming a multiprocessing
environment, you’ll likely need to review your code to determine
how to naturally take advantage of multiple cores
and/or threads.
Heterogeneous multiprocessing requires an up-front understanding
of how to best partition your application code to
exploit the available threads/cores. In other words, how can
your application best be broken up into smaller pieces? Homogeneous
multiprocessing generally has no such requirement,
since the operating system will handle most of the partitioning
based on some basic task definitions and up-front tweaks.
Part of parallelism today is virtualization and knowing when
to use it. According to Intel, if your legacy code has low performance
requirements, it may be a good candidate for virtualization.
But Rick Hetherington, Sun’s CTO of Microelectronics
for the Niagara program, offers a slightly different opinion.
“It doesn’t make sense to virtualize a single core,” says
Hetherington. Of course, Sun’s perspective is likely more relevant
in the general computing space. The embedded space
allows for virtualization of even a single core when the complexity
permits it.
If you’re new to a multiprocessing environment, consider
trying out incremental “what-if” scenarios to find bottlenecks
and candidates for parallelization. You may also find
the need to port your code to a standard operating system
that’s designed to take advantage of multiprocessing architectures,
such as Linux.
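One way to run such a “what-if” scenario is to start from a profile of where the time goes and estimate the payoff of parallelizing one stage at a time. The Python sketch below assumes ideal scaling of the chosen stage with no synchronization or memory overhead, so it gives an optimistic bound. The stage names and timings are invented for illustration.

```python
def what_if_speedup(stage_seconds: dict, parallel_stage: str, cores: int) -> float:
    """Estimate overall speedup if one profiled stage is spread across `cores`.

    ASSUMPTION: the chosen stage scales perfectly and every other stage is unchanged.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    total = sum(stage_seconds.values())
    stage_time = stage_seconds[parallel_stage]
    new_total = total - stage_time + stage_time / cores
    return total / new_total


if __name__ == "__main__":
    # Hypothetical profile of a three-stage application, in seconds per run.
    profile = {"parse": 2.0, "filter": 6.0, "encrypt": 2.0}
    ranked = sorted(
        profile,
        key=lambda stage: what_if_speedup(profile, stage, cores=4),
        reverse=True,
    )
    for stage in ranked:
        print(f"{stage:<8} {what_if_speedup(profile, stage, cores=4):.2f}x")
```

In this made-up profile, parallelizing the "filter" stage across four cores gives about 1.8x overall, while either of the smaller stages gives under 1.2x, which tells you where to spend the effort first.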
If porting millions of lines of code isn’t an option, a hypervisor
may be your best bet. Another approach is to offload common
tasks from cores, such as data encryption and decryption.
This will free up the core for more general-purpose tasks.
MULTICORE’S FUTURE • Anant Agarwal, professor at the
Massachusetts Institute of Technology and CTO of semiconductor
startup Tilera, said at this year’s Multicore Expo in Santa
Clara that the tools to program and debug multicore ICs are
in the “dark ages.” Apparently, quite a few unemployed cores
and threads are out there looking for work. But the problems
aren’t just related to tools.
“First-generation multicore processors have been a simple
integration of a group of cores into an SoC (system-on-a-chip),”
says Dan Bouvier, director of Solutions Architecture for
AMCC. This has translated to rather poor performance scaling
due to the overhead required to handle multiprocessing and
memory bottlenecks.
“The forthcoming generation of multicore processors will
need more attention toward interprocessor dynamics and how
they impact the software deployment and performance,” says
Bouvier. “The primary challenge in integrating upper-layer
(above layer 3) accelerators in asymmetric multiprocessor subsystems
is the lack of standard, agreed-to APIs.”
Such a standard exists for computer graphics in OpenGL,
which defines a cross-language, cross-platform API for
applications that render 2D and 3D graphics. Unfortunately,
multicore has no comparable tool standards built around open-source APIs
and driven by industry experts across multiple segments, so we have to
work with what’s available today and perhaps rethink our
design strategies.
“The programming model and software stack are the key
enablers (or inhibitors) for taking multicore to the next level,”
says AMD’s Moore. “By working closely with our software colleagues,
we will come up with solutions that offer tremendous
value to our customers.”
And what’s happening on the software front? “There is a
fundamental shift in multiprocessor design, with an associated
change in the software paradigms and models used, as multicore,
coherence, and formal interprocessor communication schemes
are adopted,” says John Goodacre, program manager
for multiprocessing at ARM.
So not only is this shift causing a general rift in the embedded
community, it also forces the systems engineer to rethink the
decision process. “There are principal changes across the hardware
and software as SoC designers consider the move from
ARM plus DSP to multicore plus DSP plus accelerators plus
RISC and the challenges of memory coherence, consistency, and
task synchronization,” says Goodacre.
Part of this fundamental shift by systems engineers must be
to rethink their design approach. If it’s a bottom-up approach
in which the processing requirements are determined based on
performance, memory, and other system-related parameters, it
could spell disaster downstream.
“If you are thinking about business application software, you
think from the top down, from the software to the hardware. In
the embedded space, we still think from the bottom up, which
creates slow development processes and missed opportunity
because the product gets out too late,” says Michel Genard,
vice president of marketing for Virtutech.
According to Genard, around 50% of embedded designs
never see the light of day because of this flawed way of thinking,
in which designs are driven by performance parameters
rather than business requirements. “Instead, we need iterative
hardware and software development that speeds overall
time-to-market,” says Genard.
To improve your chances, consider a system-level virtualizing
approach to the software rather than a component-level
approach (see the figure). When done right, Genard
notes, this approach “provides the speed, scalability, and
control necessary for successful concurrent software/hardware
development.”
HERE TO STAY • As more silicon is delivered with multicore
architectures, marketing departments for companies large and
small will continue to find uses for them. Therefore, we must
embrace multicore and continue to research the ideal architecture/
software mix to stay on the leading edge.
To that end, it would appear all paths for multicore lead to
parallel programming, along with more sophisticated architecture
solutions for intra-processor communications and the
promise of software transactional memory. According to Agarwal,
we must change how cores are connected and determine
the ideal size of resources, arguing that distributed meshes and
smaller cache sizes are the wave of the future.
With so many problems plaguing multicore, there’s a huge
potential for startups to take the lead and maybe one day find a
job for all of those unemployed threads and cores.