Electronic Design

  


[Leapfrog: First Look]
Architecture Maps DSP Flow To Parallel Processing Platform
This divide-and-conquer parallel processing approach whips up a Storm.

William Wong  |   ED Online ID #15468  |   May 10, 2007


Programming parallel processors isn't easy, especially when the number of processing elements is large. No single technique applies to all situations. But in its Storm-1 architecture, Stream Processors narrows the focus to make parallel-processing hardware and software design significantly easier (Fig. 1).

One of the challenges of parallel processing is matching the architecture to the problem. Storm-1 addresses this by focusing on the signal processing of streaming data, such as streaming video or radar data. In both cases, the kind of signal-processing work remains consistent, and the chunks of data being processed at one time can be brought on-chip.

While this architecture may not fit many applications, the number of applications it does fit is growing rapidly. In fact, scalability is a key factor. Storm-1 is available in eight-lane and 16-lane versions. These lanes have no relation to the lanes used with hardware interfaces like PCI Express or Serial RapidIO. With Storm-1, a lane is a macro processing element (Fig. 2).

A pair of 32-bit MIPS 4KEc processors manages these lanes and handles the housekeeping. The data parallel unit (DPU) MIPS processor controls the DPU at a global level and drives the DPU dispatcher. The dispatcher controls the code that's loaded into the very-long-instruction-word (VLIW) instruction memory, which in turn is used by each lane. A scalar unit handles simple chores that wouldn't run any faster if they were distributed to a lane.

Each lane operates on its own data and is independent of the other lanes. The data passing through each lane gets the same general type of processing. Yet the data itself, along with any state provided when the lane starts processing, affects which algorithms are applied.
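
As a rough illustration of that lane model, here is a minimal Python sketch. It is not Storm-1 code: NUM_LANES, run_kernel, and clip_and_scale are hypothetical names chosen for this example, and a real kernel would be compiled VLIW code rather than a Python function. The point is only that every lane runs the same kernel on its own private data and start state, while the data values decide which branch is taken.

```python
# Hypothetical model of lane-level parallelism; none of these names are Storm-1 APIs.
NUM_LANES = 8  # the article mentions eight- and 16-lane versions


def run_kernel(kernel, lane_inputs, lane_states):
    """Apply the same kernel to every lane's private data and start state."""
    assert len(lane_inputs) == len(lane_states) == NUM_LANES
    # Each lane sees only its own input and state, never another lane's.
    return [kernel(data, state) for data, state in zip(lane_inputs, lane_states)]


def clip_and_scale(data, state):
    """Example kernel: negative samples become 0, others are scaled and clipped."""
    gain, limit = state
    return [min(x * gain, limit) if x >= 0 else 0 for x in data]


lane_inputs = [[lane, -lane, lane * 10] for lane in range(NUM_LANES)]
lane_states = [(2, 50)] * NUM_LANES  # same start state for every lane in this example
lane_outputs = run_kernel(clip_and_scale, lane_inputs, lane_states)
```

Here lane 3 receives [3, -3, 30] and produces [6, 0, 50]: the kernel is identical across lanes, but the values themselves select the scale, zero, or clip path.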

WE DON'T NEED NO STINKING CACHE
Though the Storm-1 is designed for streaming data, it doesn't grab new data continuously. Instead, incoming data streams are moved into main memory. A chunk of data is moved into a lane's local memory and then moved into operand register files as it is being processed. The resulting data is moved back into main memory once the lane is done processing.
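
The staging pattern described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: process_stream and chunk_size are hypothetical names, Python lists stand in for main memory and a lane's local memory, and the copy steps that a DMA engine would perform are modeled as plain slicing.

```python
def process_stream(main_memory, chunk_size, kernel):
    """Stage fixed-size chunks through a lane-local buffer and write results back.

    Hypothetical model: lists stand in for main memory and lane local memory.
    """
    output = []
    for start in range(0, len(main_memory), chunk_size):
        # Main memory -> lane local memory (one chunk at a time).
        local_memory = main_memory[start:start + chunk_size]
        # Operands are taken from local memory as each one is processed.
        result = [kernel(x) for x in local_memory]
        # Finished chunk -> back to main memory.
        output.extend(result)
    return output


doubled = process_stream(list(range(10)), chunk_size=4, kernel=lambda x: x * 2)
```

The last chunk is simply shorter when the stream length is not a multiple of chunk_size; a real implementation would have to decide how to pad or handle that tail.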

This works well because the incoming and outgoing streaming data is typically buffered and often moves through different channels at different data rates. If compression or decompression is being performed, the input and output stream sizes will differ significantly. This buffering approach is quite common even with conventional architectures. The direct memory access (DMA) engine can move data directly into the lane's local memory if necessary.

Caches are complex and take up lots of space. Eliminating the cache can provide key performance and power advantages. On-chip accesses via the cache are often a hundred times more expensive than register accesses, and off-chip accesses are worse still.

In a conventional system with many DSPs, each DSP will cache information from main memory. With Storm-1, there are no caches, only very large local memory or register banks. This has several advantages, especially when it comes to determinism.

Primarily, it lets compilers generate very good code that executes consistently, since cache-miss stalls never occur. In fact, the communication and memory subsystems complement each other and eliminate or reduce bottleneck effects. The current design supports both the eight- and 16-lane versions.

The processing system within each lane is simpler than many DSP architectures because of this memory architecture. The five processing units have their own register files and ALUs, and they operate on the lane's data in parallel. The lanes themselves also run in parallel, and keeping each lane's work self-contained minimizes cross-lane communication.

The ALU architecture mirrors most DSPs with multiple operand instructions as well as specialized multiply-accumulate (MAC) hardware. The single-instruction multiple-data (SIMD) architecture is tailored for applications such as video manipulation. Scatter/gather operations within a lane are also supported when accessing local memory.
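
For readers unfamiliar with multiply-accumulate, the operation is the inner step of most filters. The sketch below is generic DSP arithmetic in Python, not Storm-1 code; mac and fir are hypothetical names, and on real hardware each mac call would map to a single MAC instruction, possibly applied to several packed values at once.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def fir(samples, taps):
    """Direct-form FIR filter built from repeated MAC steps.

    Produces one output per position where all taps overlap the input.
    """
    out = []
    for n in range(len(taps) - 1, len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            acc = mac(acc, tap, samples[n - k])
        out.append(acc)
    return out


filtered = fir([1, 2, 3, 4], [2, 1])
```

With taps [2, 1], each output is twice the current sample plus the previous one, so the input [1, 2, 3, 4] yields [5, 8, 11].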

Chunks of data can be moved from one lane to another, and the size of the chunks is chosen to fit into the confines of each lane. A scatter/gather DMA transfer approach enables logical data streams to be split among multiple lanes or even multiple chips.
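
One way to picture splitting a logical stream among lanes is a round-robin scatter followed by a gather that restores the original order. The article does not describe the actual distribution policy, so round-robin is an assumption here, and scatter, gather, and chunk_size are hypothetical names for this Python sketch.

```python
def scatter(stream, num_lanes, chunk_size):
    """Deal fixed-size chunks of one logical stream out to lanes, round-robin."""
    lanes = [[] for _ in range(num_lanes)]
    for start in range(0, len(stream), chunk_size):
        lane = (start // chunk_size) % num_lanes
        lanes[lane].append(stream[start:start + chunk_size])
    return lanes


def gather(lanes):
    """Reassemble the logical stream from per-lane chunk lists."""
    out = []
    depth = max((len(chunks) for chunks in lanes), default=0)
    for round_index in range(depth):
        for chunks in lanes:
            if round_index < len(chunks):
                out.extend(chunks[round_index])
    return out


stream = list(range(10))
per_lane = scatter(stream, num_lanes=4, chunk_size=2)
restored = gather(per_lane)
```

In this example lane 0 holds the chunks [0, 1] and [8, 9], lanes 1 through 3 hold one chunk each, and gather returns the original ten values in order.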





