[Technology Report]
Software Rules The Day In Multicore SoC Design
With the number of on-chip processors set to explode, software-development issues loom for design teams. Yet D&V methodologies may evolve to avert any stumbling over parallelism.
David Maliniak
ED Online ID #18640
April 24, 2008
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Looking back over the past 10 years or so,
semiconductor process technology more or
less kept pace with the demand for functionality
in large-scale processor-based
ICs. When the next-generation set-top box
IC needed more horsepower, a move
from, say, a 180-nm process to 130 nm
would provide the necessary boost by adding
gates and the ability to run faster clocks. But that next-generation
chip would still carry a single processor.
Things have changed dramatically in the last few years. Simply
put, silicon scaling no longer meets functionality requirements.
Thus, designers turned to multiprocessor architectures,
which significantly up the ante in terms of processing power.
The number of processors per chip is taking off, already exemplified
several years ago by Cisco’s 192-processor engine for its
CRS-1 network router (Fig. 1).
With the rise in processing power and complexity comes a
host of issues that point largely toward the software side of the
system equation. Writing software for a single-processor system
is a relatively simple task, as a purely sequential approach
will do the trick. But there’s little point in multiple processing
engines if you’re not planning to have them execute instructions
in parallel. How parallelism is imposed is the crux of
the matter. Missteps can result in dire consequences, creating
debug nightmares.
Fortunately for those looking to move to multiprocessor
architectures in their system-on-a-chip (SoC) designs, tools
and methodologies are beginning to appear. Designers can take
steps to ensure that their parallelized application code won’t
cause memory-access deadlocks, race conditions, or other faults
that crash one or more processors or even their entire systems.
HOW MULTICORE LOOKS TODAY
Looking at a generic example of a multicore SoC can illustrate
both the complexity of the devices and the programming challenges
(Fig. 2). In a hypothetical transition from a 130-nm
SoC with a single processor to a multicore implementation
at 65 nm, designers would have roughly four times as many
transistors to work with.
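The "four times" figure follows from idealized geometric scaling: halving the linear feature size quarters the area each transistor occupies. A minimal sketch of that arithmetic, assuming ideal scaling (real processes rarely achieve it exactly) and using a hypothetical helper name:

```python
def density_gain(old_nm, new_nm):
    """Ideal transistor-density gain when shrinking from old_nm to new_nm."""
    # Assumption: area per transistor scales with the square of feature size.
    return (old_nm / new_nm) ** 2

print(density_gain(130, 65))  # 4.0
```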
Multicore architectures ramp up complexity in ways beyond
simply having multiple processors. The availability of more gates
brings added memory, which is required to handle the increasingly
large amounts of data, high-resolution video streams, and
other content. The increased bandwidth means more I/Os to deal
with all the data. More complex control processing is required by
a myriad of network stacks and more elaborate user interfaces.
“Designs are using more CPUs,” says Chris Rowen, CEO of
Tensilica. “But that has only limited potential because of the way
control paths are written.”
When considering multicore SoCs, an important distinction must be
made between control-plane and data-plane processing.
“In the data plane, there’s strong interest
in integrating more functionality,” says Rowen.
“Chips no longer process only audio, video, or
wireless baseband, but rather they process all of
them. Meanwhile, there’s growing complexity in
each of these various functions. This puts a lot of
pressure on a more programmable solution.”
Efforts to make the most of multiple processors
often run aground on the shoals of memory access
conflicts. “The old paradigm for multicore designs
using shared memory was if things were happening
in parallel, you’d want them to touch the
memory at different address spaces,” says Limor
Fix, general chair of the 45th Design Automation
Conference and associate director of Intel
Research Pittsburgh.
“The idea is for parallel threads not to interfere
with each other, and to minimize the number
of clocks required for the shared memory,”
says Fix. “If each of the parallel computations is touching a
different area in memory, there’s less collision and less locking
of the shared memory.”
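The partitioning idea Fix describes can be sketched in a few lines. This is only an illustration, not any vendor's method: the function name and the squaring workload are hypothetical, and Python threads stand in for processor cores. Each worker writes to its own index range of a shared buffer, so no lock is needed.

```python
import threading

def parallel_fill(size, workers):
    """Fill a shared buffer in parallel, one disjoint slice per worker."""
    shared = [0] * size
    chunk = (size + workers - 1) // workers  # ceiling division

    def work(worker_id):
        start = worker_id * chunk
        end = min(start + chunk, size)
        # Each worker touches only its own index range, so no lock is taken.
        for i in range(start, end):
            shared[i] = i * i

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

print(parallel_fill(10, 3))
```

Because the slices never overlap, the result is the same regardless of how the threads are scheduled. Once two workers must update the same location, locking or some other form of synchronization is unavoidable.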
The problem lies in the fact that visibility into the design is
extremely limited. “Typically, when working with RTL simulation
models of processors, software debug relies on the general-purpose
registers of the processor,” says Jim Kenney, product marketing manager
at Mentor Graphics. “These registers are usually exposed at the
top level for tracing in the waveform window of a logic simulator.”
Making matters worse is the fact that there may be only one
debug port for several processors. With all processors executing
instructions concurrently, it’s very difficult to control the speed of
any given processor.
Debugging is made even harder due to the absence of determinism.
“With multiple processors, you don’t control what’s running,”
says Michel Genard, vice president of marketing at Virtutech.
Rerunning code often is of no value because the results can be different
each time, making bugs hard to pin down. Then there’s the
notion of “Heisenbugs,” bugs that change or disappear because
inserting a probe to observe them alters the system’s behavior.
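To see why reruns can disagree, consider two cores that each increment a shared counter with a separate load and store. The sketch below is a hypothetical model, not real hardware: it enumerates every legal interleaving of the two cores' steps and collects the final counter values.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one schedule of (core, op) steps against a shared counter."""
    shared = 0
    regs = {}
    for core, op in schedule:
        if op == "load":
            regs[core] = shared       # read the shared value into a register
        else:
            shared = regs[core] + 1   # "store": write back register + 1
    return shared

core0 = [(0, "load"), (0, "store")]
core1 = [(1, "load"), (1, "store")]
outcomes = sorted({run(s) for s in interleavings(core0, core1)})
print(outcomes)  # [1, 2]
```

The same two-instruction program yields 1 or 2 depending only on timing. On real silicon the interleaving is not under the developer's control, which is why a failing run may not fail again.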
GOING VIRTUAL
Fortunately, there are ways around these issues, most of which
come in the form of “virtualization” or “virtual platform” technology.
Many benefits can be derived from virtual platforms (see “Multicore Design Benefits from Virtual Prototyping,” www.electronicdesign.com, ED Online 18637).
Once a virtual platform is assembled from hardware models,
many of the issues concerning software debugging are addressed.
The designer gains a great deal of control over the system, hence a
return to a more deterministic scenario. The system configuration
is easily varied in terms of the number and speed of cores as well as
the software loads on each.
Virtual hardware offers a good amount of visibility in terms of
memory, processor registers, and device states. In addition, when
you synchronize the processors, you can synchronize everything at
once. It also affords much more control over system execution.
When debugging requires a global system stop, all processors
stop simultaneously with no “skid” effect. When one processor is
stepped through instructions, others can be made to sit and wait.
Cores can be slowed or stopped entirely; communication latencies
increased; and timing disturbances from breakpoints disappear.
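The control described here can be illustrated with a toy lock-step scheduler. This is a minimal sketch under stated assumptions, not a model of Simics or any other commercial tool: the `VirtualPlatform` class, its methods, and the three-instruction program are all hypothetical.

```python
class VirtualPlatform:
    """Toy lock-step scheduler for simulated cores (illustrative only)."""

    def __init__(self, programs):
        # One generator per simulated core; each yield marks one instruction.
        self.cores = [prog(core_id) for core_id, prog in enumerate(programs)]
        self.trace = []
        self.frozen = set()

    def freeze(self, core_id):
        """Make one core sit and wait while the others keep stepping."""
        self.frozen.add(core_id)

    def step_all(self):
        """Advance every non-frozen core by exactly one instruction."""
        for core_id, core in enumerate(self.cores):
            if core_id in self.frozen:
                continue
            event = next(core, None)
            if event is not None:
                self.trace.append((core_id, event))

def program(core_id):
    for pc in range(3):
        yield f"insn{pc}"

def run_session():
    vp = VirtualPlatform([program, program])
    vp.step_all()   # both cores execute insn0
    vp.freeze(1)    # core 1 is held
    vp.step_all()   # only core 0 advances
    return vp.trace

print(run_session())
```

Because the scheduler, not the hardware, decides who runs next, every session produces the same trace, and stopping the loop stops all cores at an instruction boundary with no skid.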
Having said all of that, a sticking point for those wishing to
assemble virtual platforms can be the models themselves. Where do
they come from? What level of abstraction should they embody?
If your multicore design is starting from a good amount of
legacy RTL, as most do, one answer to model creation comes
from Carbon Design Systems, whose tools compile RTL into an
executable software image. Compilation can be done on a block-by-block
basis, on subsystems, or even on an entire system.
According to Bill Neifert, Carbon’s CTO, the models enable visibility
into what’s happening in the system. “We provide some RTL
simulator-like features,” says Neifert. “You can look at waveforms
and see conflicts between processors contending for resources.”
Virtual platforms are also used by hardware and software development
teams to determine applicable use cases for the system. Such is the
case at Freescale Semiconductor, where extensive investigation of
use cases is critical to the company’s multicore SoC design.
“We spend a lot of time with our various teams, including
marketing, verification, validation, software, hardware, and development
tools, to decide on the priorities for the use cases,” says J.T. Yen, Freescale’s verification manager.
“Then we take those use cases and drive
them back out into the teams to make
sure the hardware architecture is meeting
those use cases.”
Virtutech’s Simics 4.0 is a virtualization
environment that enables such use-case
exploration. Version 4.0, released this
month, adds APIs that support more use
cases as well as a repository of thousands
of models accrued since the initial release
of Simics.
Further, Simics 4.0 is itself a multithreaded
application that enables, in a chicken-and-egg
scenario, designers of multicore
SoCs to leverage all of the cores available
on their computing resources (laptop or
multiway server) to boost simulation speeds
and scalability. This capability, embodied in
Virtutech’s Simics Accelerator, enables one Simics session to simulate
several machines in parallel (Fig. 3).
Another option for platform creation comes from CoWare’s
ESL 2.0 toolset. With CoWare’s tools, multicore SoC designers
can debug and benchmark the platform-level performance of their
IP and subsystem RTL at a cycle-accurate level of abstraction.
JUMPING THE HURDLES
Taking the virtual-platform route has its advantages as just outlined,
but there are also barriers to success. Building a virtual platform can
be a laborious process that must be undertaken in parallel with the
design process itself. Then there are the issues with interoperability
of hardware models among various commercial flows.
Imperas is a relatively new entity that’s taken a somewhat different
approach to its entry into the virtual-platform arena. Out of the
chute, the company made a major technology donation that carries
the promise of an open-source infrastructure for virtual platforms.
“When we started the company, we were targeting how to
program multicore SoCs,” says Simon Davidmann, Imperas’ president
and CEO. “But what we found was challenges in debugging.
There was no broad simulation infrastructure to support it. The
key is a modeling technology that would enable models to work
together no matter who makes them.”
To that end, Imperas made three technology components freely
available through its Open Virtual Platforms Web site at www.ovpworld.org, as well as at SourceForge. The first is C-language
modeling application-programming interfaces
(APIs) for processor, peripheral, and
platform modeling.
The second is an open-source library of
models written to the APIs. The models
can be obtained as either pre-compiled
object code or as source-code files. At present,
the library comprises processor models
of ARM, MIPS, and OpenRISC OR1K
devices, with others to follow. Also available
is a wide range of component and peripheral
models. In addition, there are several
example embedded platforms written in C,
C++, and SystemC.
Rounding out the trio is a free OVP reference
simulator that runs processor models
at up to 500 MIPS. Called OVPsim, the
simulator comes with a GNU debugger
(GDB) interface.
OVPsim can be called from within other
simulators through a C/C++/SystemC
wrapper. It also can encapsulate existing
instruction-set simulator (ISS) processor
models (Fig. 4).
DEALING WITH COMPLEXITY
When it comes to the language used for
writing embedded code for multicore
SoCs, some designers feel that the existing
paradigm is entirely broken. In other
words, writing software in sequential
fashion using C or C++ can no longer be a
pragmatic approach. These days, new, fundamentally
parallel languages and methodologies
are required (see “Programming
Multicore Platforms: What’s Really Going
On?” ED Online 18639).
“Finding new design-entry languages
that address parallelism is a long-term goal
and is at least five to 10 years from being
realized,” says Frank Schirrmeister, director
of product marketing for system-level
solutions at Synopsys. Today’s users, says
Schirrmeister, are better served by virtual
platforms with analysis and debug capabilities
geared for multicore platforms.
Such languages and methodologies may
eventually be forthcoming. But for now, a
great deal of legacy sequential software is
being transformed into parallel code, however
laborious that process may be. Even so,
tools are available that can help determine
where opportunities for parallelism
lie in sequential code.
One such tool is Critical Blue’s Cascade,
which synthesizes reprogrammable coprocessors
that accelerate native binaries or C/assembler
source code. Recently, the company
extended Cascade into a multicore
version that does the same thing, only with
the addition of cross-core software partitioning,
task-dependency analysis, and
verification capabilities (Fig. 5).
“Multicore architectures are not new,
but in the past they were usually created
for a specific purpose,” says David Stewart,
Critical Blue’s founder and CEO. “What
we’re seeing now is multicore for the masses
in the form of hardware architectures
that can be used for multiple SoCs. That
means reprogramming, and that comes
down to software.”
In Stewart’s view, Multicore Cascade is
a pragmatic approach that can help make
today’s programming languages and techniques
viable for multicore architectures.
“When we only generated a single coprocessor,
we were extracting instruction-level
parallelism,” he says. “Now, we are extracting
task-level parallelism. But it goes beyond
that and into analysis of where dependencies are in the code and what the benefits are of
breaking those dependencies.”
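Task-dependency analysis of the general kind Stewart mentions can be sketched as follows. This is not Critical Blue's algorithm; it is a generic illustration in which the task names and data names are hypothetical. Two tasks conflict if one writes data the other reads or writes, and tasks with no conflict between them land in the same level and may run on different cores.

```python
def schedule_levels(tasks):
    """Group tasks (given in program order) into levels that may run concurrently.

    tasks: list of (name, reads, writes), where reads and writes are sets.
    """
    level = {}
    for j, (name_j, reads_j, writes_j) in enumerate(tasks):
        depth = 0
        for name_i, reads_i, writes_i in tasks[:j]:
            # Read-after-write, write-after-read, or write-after-write hazard.
            conflict = (writes_i & reads_j) or (reads_i & writes_j) or (writes_i & writes_j)
            if conflict:
                depth = max(depth, level[name_i] + 1)
        level[name_j] = depth
    groups = {}
    for name, depth in level.items():
        groups.setdefault(depth, []).append(name)
    return [groups[d] for d in sorted(groups)]

tasks = [
    ("decode_audio", {"stream"}, {"pcm"}),
    ("decode_video", {"stream"}, {"frame"}),
    ("mix", {"pcm", "frame"}, {"out"}),
]
print(schedule_levels(tasks))  # [['decode_audio', 'decode_video'], ['mix']]
```

The two decoders only read the shared input, so they can be split across cores. The mixer depends on both outputs and must wait, which is the kind of dependency a partitioning tool would report and a designer might then try to break.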
DEBUG IMPROVEMENTS
Functional verification of multicore SoCs is
largely accomplished using processor-based
tests. Verification engineers use full-functional,
signoff-accurate processor models
derived from RTL to drive bus cycles out to
the rest of the design’s IP. This method can
be used for block-level verification or as a
final simulation to ensure that the hardware
will come out of reset and execute code.
The downside of processor-based testing,
though, once again lies in the limited
visibility for software debugging. Typical
debug flows provide only a view of the processors’
general-purpose registers.
“The only interactive view with which
to determine why a C test isn’t running
properly on hardware is the waveform
view,” says Mentor Graphics’ Kenney. “It’s
hard to correlate misbehavior in the waveform
view with where it’s happening in
the source code for the test.”
Mentor Graphics’ attempt at a solution
for this problem comes in the form
of Questa Codelink, an extension to the
Questa functional verification environment.
“What we’ve done is to build a classical
source-level software debugger into
the ModelSim Questa environment and
connected them both to the RTL processor
models used in verification,” says Kenney.
With Questa Codelink, users have the
advantage of interactive, graphical debug. In
a source-code view, they can see breakpoints
for an unlimited number of processors. Registers
are displayed, and variable values can
be tracked. A cursor in the source-code view
is in lock-step with a cursor in the debugger’s
waveform window. Moving the cursor
in either window takes the other window’s
cursor to the corresponding point.
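The linked-cursor behavior amounts to a two-way lookup between simulation time and source line. The sketch below is a generic illustration, not Questa Codelink's implementation: the trace data and function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical execution trace for one processor:
# (simulation time in ns, source line that starts executing at that time)
trace = [(0, 10), (40, 11), (90, 12), (130, 11), (180, 13)]
times = [t for t, _ in trace]

def line_at(time_ns):
    """Source line active at a given waveform time, or None before the trace starts."""
    idx = bisect_right(times, time_ns) - 1
    return trace[idx][1] if idx >= 0 else None

def times_for(line):
    """Waveform times at which a given source line began executing."""
    return [t for t, ln in trace if ln == line]

print(line_at(100))   # 12
print(times_for(11))  # [40, 130]
```

Moving the waveform cursor corresponds to calling `line_at`, while clicking a source line corresponds to `times_for`. With one such trace per processor, the same lookup extends to any number of cores.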
According to Russ Klein, Mentor’s project
manager for Questa Codelink, an important
aspect of the tool for multicore SoC developers
is the non-intrusiveness of the process.
“You can see what’s going on with each of
multiple processors without introducing any
timing errors,” says Klein. “You can see it all
concurrently, with full visibility into the
states of each of them at exactly the same
time. The ability to step backward through
the code to the point where synchronization
errors occur is also very powerful.”