Optimizing Code, The SHARC Vs. The Minnow (Part II): The SHARC's View

Switching to a DSP offers many advantages over developing signal processing code for legacy algorithms.


Contributing Author

October 16, 2000


This article is the second of a two-part series. Part I appeared in the Sept. 18 issue.—ED

The first part of this article discussed techniques for adding DSP capability to an existing CISC system and developed 68k code for part of a frequency analysis system. Listing 1 shows a procedure that generates both the instantaneous and average power of a complex-valued array.

Part I showed that, when this DSP code was developed in the Software Development Systems' 68k environment, the compiler could be persuaded to produce code almost as efficient as hand-optimized code. The key phrase is "the compiler could be persuaded." The developer had to rewrite the code in the format shown in Listing 2. Explicit pointer operations force the compiler to generate the faster auto-incrementing addressing modes available on the 68k processor. Other speed improvements include evaluating by hand the constant expressions that the compiler optimizer did not recognize, and introducing a faster "down-counting" form of the loop.
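As a rough illustration of the two coding styles (not the article's Listing 1 or Listing 2), the same power calculation can be sketched in C with array indexing and again with pointers and a down-counting loop. The type `complex16`, the function names, and the choice of 16-bit samples with 32-bit power values are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample type: 16-bit real and imaginary parts. */
typedef struct {
    int16_t re;
    int16_t im;
} complex16;

/* Indexed style: every access goes through an explicit array subscript.
   Writes the instantaneous power of each sample and returns the average. */
int32_t power_indexed(const complex16 *in, int32_t *power, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        power[i] = (int32_t)in[i].re * in[i].re
                 + (int32_t)in[i].im * in[i].im;
        total += power[i];
    }
    return (n > 0) ? total / (int32_t)n : 0;
}

/* Pointer style: post-incremented pointers and a loop that counts down
   to zero, the pattern that maps onto auto-increment addressing. */
int32_t power_pointer(const complex16 *in, int32_t *power, size_t n)
{
    int32_t total = 0;
    for (size_t count = n; count != 0; count--) {
        int32_t re = in->re;
        int32_t im = in->im;
        int32_t p = re * re + im * im;
        in++;
        *power++ = p;
        total += p;
    }
    return (n > 0) ? total / (int32_t)n : 0;
}
```

Both functions compute identical results. Whether the pointer form is actually faster depends on the compiler and target, so the difference is worth measuring rather than assuming.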

Changing the code from explicit array indexing (Listing 1) to pointer operations (Listing 2) to gain performance isn't particularly onerous. Depending on the application, however, it may not be sufficient. In this part of the article, we discuss the advantages that could arise from using a processor customized for DSP operations. Examples are taken from code generated for the Analog Devices ADSP-21061 SHARC processor using the White Mountain VisualDSP development environment. Techniques for making greater use of parallel operations to improve performance are discussed too.

The primary consideration in adding new material to 68k legacy code isn't speed, but rather whether the code will work. Typically, the decision to switch to a DSP processor is associated with speed. But the developer should also ask whether other advantages will result from switching.

A comparison of the time that it takes to execute a program on any processor can be obtained from the formula:

Execution time
= (instructions/program)
* (average cycles/instruction)
* (time/cycle)
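The formula can be written as a small C helper. This is only a sketch: the function name and the use of clock frequency (the reciprocal of time/cycle) as a parameter are choices made here, and any numbers passed to it are hypothetical rather than measurements of the 68k or the SHARC.

```c
/* Execution time in seconds.
   time/cycle is expressed as 1 / clock_hz, so the product
   (instructions/program) * (average cycles/instruction) * (time/cycle)
   becomes instructions * cycles_per_instruction / clock_hz. */
double execution_time_s(double instructions,
                        double cycles_per_instruction,
                        double clock_hz)
{
    return instructions * cycles_per_instruction / clock_hz;
}
```

For instance, a hypothetical program of 1,000,000 instructions averaging 4 cycles each on an 8-MHz clock takes 0.5 s. Lowering any one of the three factors lowers the total, which is why the sections that follow look at each factor in turn.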

Updating from an older 16-bit CISC processor to a newer 32-bit DSP processor allows a switch to more recent technology. This decreases execution time because the time/cycle gets smaller, especially with a faster implementation of the multiplication logic.

Switching to a 32-bit DSP offers other advantages. The wider data bus means no penalties are associated with implementing 32-bit data operations. DSP algorithms involve repeated multiplications and additions, which can quickly overflow a 16-bit number representation and produce inaccurate results. Many 32-bit processors have pipelined floating-point (FP) operations. The pipelining means that there are no speed penalties when using FP operations instead of integer operations. FP algorithms are easier to design because the automatic renormalization of FP numbers removes the problems associated with number overflow. The designer, however, should realize that a 32-bit FP operation has roughly the same precision as a 24-bit integer operation. See "Are you damaging your data through a lack of bit cushions?" by M. Smith and L.E. Turner, which will be published in the December edition of Circuit Cellar.
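Both number-representation points can be checked with a few lines of C. This sketch assumes IEEE 754 single-precision `float` with the default round-to-nearest mode, and the helper names are invented for illustration.

```c
#include <stdint.h>

/* Sum of squares accumulated in 32 bits so the true result is visible. */
int32_t sum_squares_32(const int16_t *x, int n)
{
    int32_t total = 0;
    for (int i = 0; i < n; i++) {
        total += (int32_t)x[i] * x[i];
    }
    return total;
}

/* Nonzero if the value could have been held in a signed 16-bit result. */
int fits_in_16_bits(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Round-trips an integer through a 32-bit float. Values that need more
   than 24 significant bits do not survive unchanged. */
int32_t through_float(int32_t v)
{
    return (int32_t)(float)v;
}
```

Two modest samples of 200 already give a sum of squares of 80,000, which does not fit in 16 bits. In the other direction, 16,777,216 (2^24) survives the trip through `float`, but 16,777,217 comes back as 16,777,216, which is the 24-bit precision limit in action.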

The wider bus has further benefits. It allows wider instructions to be fetched, which improves performance by decreasing cycles/instruction. Plus, the wider instructions could provide enough bits to describe parallel operations, which decreases the overall number of instructions executed.

Particularly useful characteristics of the DSP are the presence of a hardware loop, hardware circular buffers, and even zero-overhead bit-reverse addressing that's useful during FFT operations. Other features available with the more recent DSPs include alternate register banks (faster interrupt handling) and large on-board data and instruction caches.
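To show what this hardware support replaces, here is a sketch of the bookkeeping a general-purpose processor has to do in software. The function names are invented for illustration; on a DSP with hardware circular buffers and bit-reverse addressing, the address generators do the equivalent work as part of the memory access.

```c
#include <stddef.h>

/* Software circular buffer step: advance the index and wrap at the end.
   The compare-and-reset is the per-access overhead that a hardware
   circular buffer removes. Assumes length > 0 and index < length. */
size_t circ_next(size_t index, size_t length)
{
    index++;
    if (index == length) {
        index = 0;
    }
    return index;
}

/* Software bit reversal of the low `bits` bits of `value`, as used to
   reorder the input or output of a radix-2 FFT. */
unsigned bit_reverse(unsigned value, unsigned bits)
{
    unsigned result = 0;
    for (unsigned i = 0; i < bits; i++) {
        result = (result << 1) | (value & 1u);
        value >>= 1;
    }
    return result;
}
```

For an 8-point FFT (3 address bits), index 1 maps to 4, index 3 maps to 6, and index 6 maps to 3. The software version costs a loop per index, which is the overhead that zero-overhead bit-reverse addressing avoids.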

The SHARC processor has a single-cycle instruction in which three memory accesses (two data and one instruction), two memory address adjustments, and an FP multiplication can be performed at the same time as parallel addition and subtraction operations. That provides for a 4000% improvement over the 68k processor without even changing the clock speed! The problem then is how to write the code so that the potential speed improvement becomes a real one.

Customization: Variables In Registers
As with the 68k processor, the first step in customizing a SHARC routine for speed is to move frequently used variables into registers rather than leaving them in external memory. This is particularly important because the SHARC has a basic LOAD/STORE architecture, which doesn't support the direct memory-to-memory operations available on a CISC processor.

The stages of establishing a stack frame by the VisualDSP 21k compiler are shown in Listing 3. Even this simple task uncovers some of the basic architectural differences between the 68k CISC and the 21k DSP. There's no 21k equivalent to the complex MOVE Multiple operation, which stores many nonvolatile registers to memory in a single instruction. Instead, each register is individually moved to the 21k stack. This difference takes up more program space on the SHARC than the equivalent code takes in 68k ROM. But there's no speed penalty, because the SHARC memory access is more efficient.

Note the two-stage operation necessary to store the SHARC index registers, I0, I1, I2, and the single-stage operation for storing the data registers, R3, R6, etc. This is a consequence of the data address generator (DAG) blocks on the SHARC. The DAGs are separate ALUs dedicated to address calculations, which can occur in parallel with the standard data COMPUTE operations.

Two DAGs are needed because the SHARC can perform three memory accesses in parallel: one along the "data" data bus, another along the "program" data bus, and a third from the "instruction-cache" data bus controlled by the program sequencer logic unit. The architecture that supports the multiple data accesses means that there isn't a direct path for the registers of a DAG to be saved into the memory block controlled by that specific DAG. The lack of a data path isn't critical, as such operations aren't frequently required.
