[Leapfrog: First Look]
Architecture Maps DSP Flow To Parallel Processing Platform
This divide-and-conquer parallel processing approach whips up a Storm.
William Wong
ED Online ID #15468
May 10, 2007
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Programming parallel processors isn't easy, especially when the number of processing elements is
large. No single technique applies to all situations.
But in its Storm-1 architecture, Stream Processors narrows the focus to make parallel-processing hardware
and software design significantly easier (Fig. 1).
One of the challenges of parallel processing is matching
the architecture to the problem. Storm-1 addresses this by
focusing on signal processing of streaming data, such as
video streams or radar data. In both cases, the kind of signal-processing work remains consistent,
and the chunks of data being processed at one time can
be brought on-chip.
While this architecture may not fit many applications, the number of applications it does fit is growing rapidly. In fact,
scalability is a key factor. Storm-1 is available in eight-lane and
16-lane versions. These lanes have no relation to the lanes used
with hardware interfaces like PCI Express or Serial RapidIO. With
Storm-1, a lane is a macro processing element (Fig. 2).
A pair of 32-bit MIPS 4KEc processors manages these lanes
and handles the housekeeping. The data parallel unit (DPU)
MIPS processor controls the DPU at a global level and drives the
DPU dispatcher. The dispatcher controls the code that's loaded
into the very-long-instruction-word (VLIW) instruction memory
that in turn is used by each lane. A scalar unit handles simple
chores that won't be accelerated if they're distributed to a lane.
Each lane operates on its own data, independently of
the other lanes. The data passing through each lane will get the
same general type of processing. Yet the data itself will affect
which algorithms are applied as well as any state provided when
the lane starts its processing.
WE DON'T NEED NO STINKING CACHE
Though the
Storm-1 is designed for streaming data, it doesn't grab new data
continuously. Instead, incoming data streams are moved into
main memory. A chunk of data is moved into a lane's local memory and then moved into operand register files as it is being
processed. The resulting data is moved back into main memory
once the lane is done processing.
This works well because typically the incoming and outgoing
streaming data is buffered and often moving through different
channels at different data rates. If compression or decompression is being performed, the input and output stream sizes will
differ significantly. This buffering approach is quite common
even with conventional architectures. The direct memory access (DMA) engine can move data directly
into the lane's local memory
if necessary.
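
The staging flow described above can be pictured with a small model. This is only a conceptual sketch in Python, not Storm-1 code: the chunk size, the function names, and the stand-in kernel are all assumptions made for illustration. It shows a buffered input stream being cut into chunks that fit a lane's local memory, processed, and written back, with the output ending up a different size than the input.

```python
# Conceptual model only: none of these names come from the Storm-1 toolchain.
LANE_MEMORY_WORDS = 8  # assumed local-memory capacity, chosen for illustration


def decimate_kernel(chunk):
    """Stand-in kernel: keep every other sample, so output is half the input."""
    return chunk[::2]


def process_stream(main_memory_in, kernel, chunk_words=LANE_MEMORY_WORDS):
    """Stage chunks through a lane's local memory and write results back."""
    main_memory_out = []
    for start in range(0, len(main_memory_in), chunk_words):
        # Models the DMA transfer from main memory into lane-local memory.
        local_memory = main_memory_in[start:start + chunk_words]
        # The lane works only on what is in its local memory.
        result = kernel(local_memory)
        # Models the transfer of results back to main memory.
        main_memory_out.extend(result)
    return main_memory_out


samples = list(range(20))
output = process_stream(samples, decimate_kernel)
```

Because the kernel never reaches outside its chunk, the time it takes depends only on the chunk, which is the property the cache-free design relies on.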
Caches are complex and
take up lots of space. Eliminating the cache can provide
key performance and power
advantages. On-chip accesses via the cache are often a
hundred times more expensive than register accesses—
and it's worse for off-chip
accesses.
In a conventional system with many DSPs, each DSP will cache
information from main memory. With Storm-1, there are no caches,
only very large local memory or register banks. This has several
advantages, especially when it comes to determinism.
Primarily, it lets compilers generate very good code that will be
executed consistently, since cache-miss stalls cannot occur. In fact, the
communication and memory subsystems complement each other and eliminate or reduce bottleneck effects. The current design supports both the eight- and 16-lane versions.
The processing system within each lane is simpler than many
DSP architectures because of this memory architecture. The five
processing units have their own register files and ALUs. They also
operate on the lane's data in parallel. The lanes also run in parallel with one another, each on its own data, which minimizes cross-lane communication.
The ALU architecture mirrors most
DSPs with multiple operand instructions
as well as specialized multiply-accumulate (MAC) hardware. The single-instruction multiple-data (SIMD) architecture is
tailored for applications such as video
manipulation. Scatter/gather operations
within a lane are also supported when
accessing local memory.
Chunks of data can be moved from one
lane to another, and the size of the chunks
is chosen to fit into the confines of each
lane. A scatter/gather DMA transfer
approach enables logical data streams to
be split among multiple lanes or even
multiple chips.
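
One way to picture splitting a logical stream among lanes is a round-robin deal of fixed-size blocks. The Python sketch below is a conceptual model under that assumption. The article does not describe the actual DMA descriptors or block layout, and the names scatter and gather here are hypothetical helpers, not Storm-1 functions.

```python
def scatter(stream, num_lanes, block):
    """Deal fixed-size blocks of a logical stream out to lanes, round-robin."""
    lanes = [[] for _ in range(num_lanes)]
    for i in range(0, len(stream), block):
        lanes[(i // block) % num_lanes].append(stream[i:i + block])
    return lanes


def gather(lanes):
    """Reassemble the logical stream by reading lane blocks in round-robin order."""
    stream = []
    rounds = max(len(blocks) for blocks in lanes)
    for r in range(rounds):
        for blocks in lanes:
            if r < len(blocks):
                stream.extend(blocks[r])
    return stream


stream = list(range(10))
lanes = scatter(stream, num_lanes=4, block=2)
restored = gather(lanes)
```

The block size plays the role of the chunk size in the text: it is picked so that each piece fits within a single lane.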
STREAMLINED PROGRAMMING
Developers program the Storm-1 lanes
using C. The original work was done using
C++, but it was dropped in favor of C,
which provided a more elegant and efficient solution because it matched the way
many stream processing applications
were designed.
One set of C functions, called kernel
functions, runs on the lanes. These functions are used as necessary and process
data in parallel in each lane regardless of
how many lanes are involved. Limits are
based on the physical number of lanes
and the data loaded into the lanes.
If only one lane is needed, only one will
operate. The others can idle, conserving
power. The eight-lane version consumes
about half the power of the 16-lane version when all lanes are operational. Running more lanes at a slower speed is more
efficient than running fewer lanes at a
higher speed.
Kernel functions operate only on local
lane data. They're used after the stream
data has been moved into the lane's memory. One kernel function is applied to
all lanes at a time. Kernel function
execution can be conditional on a per-lane
basis. Libraries of kernel functions are
available for common transformation and
processing requirements.
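
The kernel model can be illustrated with a short Python sketch. It is a conceptual stand-in, not the Storm-1 C programming interface: run_kernel, scale_by_two, and the per-lane predicate are hypothetical names chosen for this example. One kernel is applied across all lanes in a single step, and a per-lane flag decides whether a given lane actually executes it.

```python
def run_kernel(kernel, lane_data, enabled=None):
    """Apply one kernel to every lane in one step; disabled lanes pass data through."""
    if enabled is None:
        enabled = [True] * len(lane_data)
    return [kernel(data) if on else data
            for data, on in zip(lane_data, enabled)]


def scale_by_two(chunk):
    """Stand-in kernel that touches only the lane's own data."""
    return [2 * x for x in chunk]


lane_data = [[1, 2], [3, 4], [5, 6], [7, 8]]
# Hypothetical per-lane condition: only lanes whose first sample exceeds 2 run.
enabled = [chunk[0] > 2 for chunk in lane_data]
result = run_kernel(scale_by_two, lane_data, enabled)
```

The kernel itself never sees another lane's data, which mirrors the rule that kernel functions operate only on local lane data.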
Kernel functions don't depend on the
number of lanes involved, so the architecture can be scaled up and down. This may
lead to additional chips in the family or
architectures that use multiple chips. In
this case, the code to handle the lanes
will be replicated but remain the same
from chip to chip. Communication
between lanes in different chips will be significantly more expensive, but this
won't affect many applications.
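
The lane-count independence described above can be sketched as follows. This Python model is an illustration under assumed names (run_on_lanes, negate) and an assumed chunk size. It only shows that the same kernel produces the same result on eight or 16 lanes, with the wider configuration needing fewer passes over the stream.

```python
def run_on_lanes(kernel, stream, num_lanes, chunk):
    """Process a stream in passes; each pass hands one chunk to each lane."""
    out = []
    passes = 0
    step = num_lanes * chunk
    for start in range(0, len(stream), step):
        batch = stream[start:start + step]
        lane_chunks = [batch[i:i + chunk] for i in range(0, len(batch), chunk)]
        # On real hardware the lanes would handle these chunks at the same time.
        for lane_chunk in lane_chunks:
            out.extend(kernel(lane_chunk))
        passes += 1
    return out, passes


def negate(chunk):
    """Stand-in kernel with no knowledge of how many lanes exist."""
    return [-x for x in chunk]


stream = list(range(64))
out8, passes8 = run_on_lanes(negate, stream, num_lanes=8, chunk=2)
out16, passes16 = run_on_lanes(negate, stream, num_lanes=16, chunk=2)
```

Only the driver loop knows the lane count. The kernel is unchanged, which is what would let the same code carry over to a larger or smaller part.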
The RapiDev Development Environment supports Storm-1. It includes the
SPC compiler for Linux and Windows
hosts and the cycle-accurate Target Code
Simulator (TCS), which includes MIPSsim
for the control processors. The Eclipse IDE
ties everything together, including the simulator and VLIW profiler support.
Image processing, DSP, and general
math libraries are included. The MIPS
processors run Linux and can be programmed using any conventional set of
programming tools. Libraries are provided
for managing and load-balancing the
memory, streams, and lanes.
Available individually, the SP16-G160
costs $99, and the SP8-G80 costs $59. A
PCI board is available with the 16-lane version. The board has a Gigabit Ethernet
interface, analog audio in/out, 512
Mbytes of SDRAM, and 32 Mbytes of
flash. It can operate in standalone mode
or be controlled by a host processor.
The Storm-1 architecture is just one of
many. Architectures such as IBM's Cell
processor or even symmetric multiprocessing (SMP) systems will remain important in their niches, using different parallel
programming tools and techniques.
Stream Processors
www.streamprocessors.com
Storm-1
Versions: eight-lane SP8-G80 and 16-lane SP16-G160
Speed: 500 MHz
Memory: 128-bit DDR2
Stream I/O pins: 72 or 108 programmable pins, 165 MHz
Peripherals: 1-Gbit Ethernet, serial, 32-bit 66-MHz PCI
Package: 31- by 31-mm 896-pin plastic ball-grid array (PBGA)