With the on-board SHARC memory, a memory-to-register transfer carries no penalty compared with a register-to-register transfer, as it would on the 68k processor. Which of the three approaches is taken depends on which parallel operations can be used without penalty in other parts of the loop.
We have seen that it's a fairly straightforward procedure to reduce the 14-cycle loop that's produced by the optimizing compiler to seven cycles. But what's the theoretical maximum speed of this loop, and how can this speed be achieved in practice? The resource usage for the instructions required to calculate the instantaneous and average power can be seen in Figure 1. Registers have been assigned to allow parallelization of a number of these calculations.
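As a point of reference for the cycle counts discussed here, the calculation itself can be sketched in plain C. This is only an illustrative sketch: the function name `power_calc`, the array names, and the use of separate real and imaginary arrays are assumptions, not the code from the article's figures.

```c
#include <stddef.h>

/* Hypothetical reference routine (names and data layout are assumptions).
 * Stores the instantaneous power of each complex point in power[] and
 * returns the average power over npts points. */
float power_calc(const float *re, const float *im, float *power, size_t npts)
{
    float sum = 0.0f;

    for (size_t i = 0; i < npts; i++) {
        /* Two multiplications and one addition per point */
        float p = re[i] * re[i] + im[i] * im[i];
        power[i] = p;
        /* Second addition: running total for the average */
        sum += p;
    }

    /* Guard against an empty array */
    return (npts > 0) ? sum / (float)npts : 0.0f;
}
```

Written this way, each point costs two multiplications and two additions, plus the memory accesses needed to fetch the operands and store the result.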
The maximum speed is reached when at least one resource is busy on every cycle of the loop. The additions, the multiplications, and the program memory operations each require two cycles per power calculation. In theory, if the data memory accesses could be ignored, the loop cycle count could be reduced from seven cycles per power calculation to just two. This optimum coding sequence is shown in Figure 2, with a number of power calculations occurring in parallel by unrolling the loop.
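The reasoning behind this bound can be written down as a small C sketch. It assumes a simplified model in which each resource (adder, multiplier, data memory bus, program memory bus) can start at most one operation per cycle, so the busiest resource sets the minimum loop length. The type and function names are hypothetical, and the data memory count is left as an input because it is read from Figure 1 rather than stated in the text.

```c
/* Hypothetical per-iteration operation counts for one power calculation */
struct resource_use {
    unsigned adds;        /* additions */
    unsigned mults;       /* multiplications */
    unsigned dm_accesses; /* data memory accesses */
    unsigned pm_accesses; /* program memory accesses */
};

/* Lower bound on cycles per iteration, assuming each resource can
 * start at most one operation per cycle: the busiest resource wins. */
unsigned min_cycles(struct resource_use u)
{
    unsigned bound = u.adds;

    if (u.mults > bound)       bound = u.mults;
    if (u.dm_accesses > bound) bound = u.dm_accesses;
    if (u.pm_accesses > bound) bound = u.pm_accesses;
    return bound;
}
```

With two additions, two multiplications, two program memory operations, and the data memory accesses ignored (a count of zero), the bound is two cycles. Any data memory count above two raises the bound, which is why the data memory conflict has to be dealt with separately.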
The simplest way to avoid the data memory (dm) access conflict is to move two of these accesses to a separate line. This would produce code that takes an average of three cycles to produce the instantaneous and average power levels of the complex-valued arrays. It's possible, however, to duplicate the register values by using an additional memory access combined with a COMPUTE operation. This technique is demonstrated in Figure 3, which shows the final code "rerolled" back into a loop.
Note the stages of the optimized algorithm. First, there's a series of instructions used to "prime" the computational pipeline prior to the loop. Then come the operations within the loop itself. Finally, there are instructions used to "empty" the computational pipeline. For this particular set of code, it also proved possible to account for situations when Npts was odd, something that wasn't straightforward to optimize with the original 68k code.
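The three stages can be illustrated with a C sketch of the same structure. This is not the SHARC assembly code of Figure 3: the function name, the two-point unrolling, and the choice to handle an odd point before the loop are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch of a "prime / loop / empty" structure in C.
 * Two points are processed per pass; fetches for the next pair
 * overlap the computation on the current pair. */
float power_calc_pipelined(const float *re, const float *im,
                           float *power, size_t npts)
{
    float sum = 0.0f;
    size_t i = 0;

    if (npts == 0)
        return 0.0f;

    /* Odd Npts (assumed approach): process one point up front so
     * that an even number of points remains for the pairs. */
    if (npts & 1u) {
        float p = re[0] * re[0] + im[0] * im[0];
        power[0] = p;
        sum += p;
        i = 1;
    }

    if (i < npts) {
        /* Prime: fetch the first pair before entering the loop */
        float r0 = re[i],     m0 = im[i];
        float r1 = re[i + 1], m1 = im[i + 1];

        /* Loop: compute on the current pair, fetch the next pair */
        for (; i + 2 < npts; i += 2) {
            float p0 = r0 * r0 + m0 * m0;
            float p1 = r1 * r1 + m1 * m1;

            r0 = re[i + 2]; m0 = im[i + 2];
            r1 = re[i + 3]; m1 = im[i + 3];

            power[i] = p0;
            power[i + 1] = p1;
            sum += p0;
            sum += p1;
        }

        /* Empty: finish the pair that is still in flight */
        float p0 = r0 * r0 + m0 * m0;
        float p1 = r1 * r1 + m1 * m1;
        power[i] = p0;
        power[i + 1] = p1;
        sum += p0;
        sum += p1;
    }

    return sum / (float)npts;
}
```

A C compiler is free to reorder these statements, so the sketch only shows the shape of the code. On the SHARC, the overlap comes from placing the fetches and the computations on the same instruction line.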
In this second part of a two-part series, we briefly looked at the SHARC, Analog Devices' ADSP-2106x series of DSP processors. We compared the characteristics of the SHARC with those of a simple CISC processor, the Motorola 68k, whose equivalent is frequently found in embedded systems. The comparison showed why, even at equal clock speeds, the SHARC has the capability to outperform the CISC processor by around 4000%. This was a theoretical speed improvement rather than an actual improvement, though.
It was shown that with the optimizer activated, the White Mountain Visual-DSP development environment generated 2106x assembly language code with a fair degree of speed. By hand optimization of "unrolled" code, it was fairly straightforward to activate the parallel operations available with the SHARC architecture to improve the code speed by another 200%. A detailed analysis of the code at the "resource" level revealed techniques that allowed a further 300% speed improvement. The final code has a speed that was close to the theoretical maximum processor speed.
Editor's note: The author has been in contact with Analog Devices' David Levine, who is developing an optimizing compiler that generates tight code for the TigerSHARC, which effectively has two 21k CPUs on one chip. The author was offered the chance to try out the beta release of the new compiler. He has also been "told" he needs to go back to school to learn about some alternative optimizing technologies that are not so compiler-specific. So if you found these articles useful, stay tuned for further updates on optimizing compilers.