Programming parallel processors isn't easy, especially when the number of processing elements is
large. No single technique applies to all situations.
But in its Storm-1 architecture, Stream Processors narrows the focus to make parallel-processing hardware
and software design significantly easier.
One of the challenges of parallel processing is matching
the architecture to the problem. Storm-1 addresses this by
focusing on the signal processing of streaming data, which
includes streaming video or data from radar. In both cases, the kind of signal-processing work remains consistent,
and the chunks of data being processed at one time can
be brought on-chip.
While this architecture may not fit many applications, the number of applications it does fit is growing rapidly. In fact,
scalability is a key factor. Storm-1 is available in eight-lane and
16-lane versions. These lanes have no relation to the lanes used
with hardware interfaces like PCI Express or Serial RapidIO. With
Storm-1, a lane is a macro processing element.
A pair of 32-bit MIPS 4KEc processors manages these lanes
and handles the housekeeping. The data parallel unit (DPU)
MIPS processor controls the DPU at a global level and drives the
DPU dispatcher. The dispatcher controls the code that's loaded
into the very-long-instruction-word (VLIW) instruction memory
that in turn is used by each lane. A scalar unit handles simple
chores that won't be accelerated if they're distributed to a lane.
Each lane operates on its own data and is independent of
the other lanes. The data passing through each lane will get the
same general type of processing. Yet the data itself, along with
any state provided when the lane starts its processing, will
affect which algorithms are applied.
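The lane model described above can be illustrated with a minimal Python sketch. It is not Storm-1 code: the `LaneState` fields, the `kernel` function, and the thresholding rule are all hypothetical, chosen only to show one kernel running on every lane while each lane's data and starting state select the path taken.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaneState:
    """Hypothetical per-lane state handed over when the lane starts."""
    gain: int
    threshold: int


def kernel(samples: List[int], state: LaneState) -> List[int]:
    """Same kernel on every lane; the data and state pick the branch."""
    out = []
    for s in samples:
        if abs(s) > state.threshold:
            out.append(s * state.gain)  # strong sample: amplify
        else:
            out.append(0)               # weak sample: suppress
    return out


def run_lanes(lane_data: List[List[int]],
              states: List[LaneState]) -> List[List[int]]:
    # Lanes never read each other's data, so evaluation order is irrelevant.
    return [kernel(data, st) for data, st in zip(lane_data, states)]


results = run_lanes([[1, 5], [-7, 2]],
                    [LaneState(gain=2, threshold=3),
                     LaneState(gain=3, threshold=1)])
```

Because no lane depends on another lane's result, the list comprehension could be replaced by truly parallel execution without changing the output.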
WE DON'T NEED NO STINKING CACHE
Though the
Storm-1 is designed for streaming data, it doesn't grab new data
continuously. Instead, incoming data streams are moved into
main memory. A chunk of data is moved into a lane's local memory and then moved into operand register files as it is being
processed. The resulting data is moved back into main memory
once the lane is done processing.
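As a rough illustration of that data flow, the Python sketch below stages a buffered stream through the three levels named above. The function name `process_stream`, the `chunk_size` parameter, and the per-operand `kernel` callback are assumptions made for the example, not part of any Stream Processors API.

```python
from typing import Callable, List


def process_stream(main_memory_in: List[int],
                   chunk_size: int,
                   kernel: Callable[[int], int]) -> List[int]:
    """Stage data: main memory -> local memory -> registers -> main memory."""
    main_memory_out: List[int] = []
    for start in range(0, len(main_memory_in), chunk_size):
        # Main memory -> lane local memory, one chunk at a time.
        local_memory = main_memory_in[start:start + chunk_size]
        results = []
        for operand in local_memory:
            # Local memory -> operand register, then compute.
            results.append(kernel(operand))
        # Results go back to main memory once the chunk is finished.
        main_memory_out.extend(results)
    return main_memory_out


squared = process_stream(list(range(10)), chunk_size=4, kernel=lambda x: x * x)
```

The chunk size here is arbitrary; on real hardware it would be bounded by the capacity of a lane's local memory.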
This works well because typically the incoming and outgoing
streaming data is buffered and often moving through different
channels at different data rates. If compression or decompression is being performed, the input and output stream sizes will
differ significantly. This buffering approach is quite common
even with conventional architectures. The direct memory access (DMA) engine can move data directly
into the lane's local memory
if necessary.
Caches are complex and
take up lots of space. Eliminating the cache can provide
key performance and power
advantages. On-chip accesses via the cache are often a
hundred times more expensive than register accesses,
and off-chip accesses are worse still.
In a conventional system with many DSPs, each DSP will cache
information from main memory. With Storm-1, there are no caches,
only very large local memory or register banks. This has several
advantages, especially when it comes to determinism.
Primarily, it lets compilers generate very good code that
executes consistently, since there are no cache misses to stall it. In fact, the
communication and memory subsystems complement each other and eliminate or reduce bottleneck effects. The current architecture supports both the eight- and 16-lane versions.
The processing system within each lane is simpler than many
DSP architectures because of this memory architecture. The five
processing units have their own register files and ALUs. They also
operate on the lane's data in parallel. Each lane operates independently, which minimizes cross-lane communication.
The ALU architecture mirrors that of most
DSPs, with multiple-operand instructions
as well as specialized multiply-accumulate (MAC) hardware. The single-instruction multiple-data (SIMD) architecture is
tailored for applications such as video
manipulation. Scatter/gather operations
within a lane are also supported when
accessing local memory.
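To make the two operations named above concrete, here is a small Python sketch of a gather from local memory followed by a chain of multiply-accumulate steps. The helper names `gather`, `mac`, and `dot_mac` are hypothetical and model the arithmetic only, not the Storm-1 instruction set.

```python
from typing import List


def gather(local_memory: List[int], indices: List[int]) -> List[int]:
    """Collect non-contiguous elements of local memory into one vector."""
    return [local_memory[i] for i in indices]


def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def dot_mac(xs: List[int], coeffs: List[int]) -> int:
    """Dot product built from repeated MAC steps, as in an FIR filter tap sum."""
    acc = 0
    for x, c in zip(xs, coeffs):
        acc = mac(acc, x, c)
    return acc


local = [10, 20, 30, 40, 50]
picked = gather(local, [4, 0, 2])     # elements at offsets 4, 0, 2
total = dot_mac(picked, [1, 2, 3])    # 50*1 + 10*2 + 30*3
```

A hardware MAC unit fuses the multiply and the add into one operation, which is why filter and transform kernels are usually expressed this way.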
Chunks of data can be moved from one
lane to another, and the size of the chunks
is chosen to fit into the confines of each
lane. A scatter/gather DMA transfer
approach enables logical data streams to
be split among multiple lanes or even
multiple chips.
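One way to picture splitting a logical stream among lanes is the Python sketch below. The round-robin distribution and the names `scatter` and `gather_stream` are assumptions made for illustration; the article does not say how Storm-1 actually assigns chunks to lanes.

```python
from typing import List


def scatter(stream: List[int], num_lanes: int,
            chunk_size: int) -> List[List[List[int]]]:
    """Deal fixed-size chunks of one logical stream to lanes, round-robin."""
    lanes: List[List[List[int]]] = [[] for _ in range(num_lanes)]
    for n, start in enumerate(range(0, len(stream), chunk_size)):
        lanes[n % num_lanes].append(stream[start:start + chunk_size])
    return lanes


def gather_stream(lanes: List[List[List[int]]]) -> List[int]:
    """Reassemble the logical stream by reading chunks back in the same order."""
    out: List[int] = []
    rounds = max((len(lane) for lane in lanes), default=0)
    for r in range(rounds):
        for lane in lanes:
            if r < len(lane):
                out.extend(lane[r])
    return out


stream = list(range(10))
lanes = scatter(stream, num_lanes=3, chunk_size=2)
restored = gather_stream(lanes)
```

The same pattern extends to multiple chips by treating each chip's lanes as a further slice of the lane list.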