Boost Performance By Vectorizing Your DSP Software

Whether you do it manually or use a special compiler, vectorization can speed up code by as much as 200%.

Date Posted: March 20, 2000 12:00 AM

If the performance of this calculation is to be improved, the wait states accessing the data from memory must be eliminated. Achieving this demands changing the order in which the operations are performed (Fig. 2). If the processor were to prefetch all of the data in the b vector at once, loading it into the L1 cache, the memory accesses would tend to be in the same page. With a good memory controller and processor, these accesses can then be performed with few, and perhaps even no, wait states. The data will stream into the processor at the maximum bandwidth possible from the memory.

The next step in processing this function is to access all of the c vector, loading it into the L1 cache as well. Room must then be made in the cache for the result vector. Some processors provide a mechanism that allocates these cache lines without using any memory bandwidth at all. The final step is to actually do the calculation.
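The reordering described above can be sketched in portable C. This is only an illustration under stated assumptions: the calculation is taken to be an element-wise add of float vectors, the 32-byte line size is a placeholder, and touch_vector is a hypothetical helper that stands in for whatever prefetch instruction the target processor actually provides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; a placeholder, check the target's manual. */
#define LINE_BYTES 32

/* Hypothetical helper: read one element per cache line so the whole vector
   is pulled into cache. The volatile read keeps the compiler from removing
   the loop. A real port would use the processor's prefetch instruction. */
static void touch_vector(const float *v, size_t n)
{
    const size_t step = LINE_BYTES / sizeof(float);
    for (size_t i = 0; i < n; i += step) {
        (void)*(const volatile float *)&v[i];
    }
}

/* a[i] = b[i] + c[i] (an assumed operation), with the loads grouped by vector. */
void vadd_prefetched(float *a, const float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    touch_vector(b, n);  /* step 1: stream all of b into cache */
    touch_vector(c, n);  /* step 2: stream all of c into cache */

    /* Step 3, making room for a without reading memory, is omitted here:
       portable C has no way to express it. */

    for (size_t i = 0; i < n; i++) {  /* final step: compute from cache */
        a[i] = b[i] + c[i];
    }
}
```

As the following paragraphs explain, this ordering only pays off when all three vectors fit in the L1 cache at once.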

The overall result is a function that performs faster than the original scalar code, as long as the source and destination vectors all fit into L1 cache simultaneously. If they don't, the cache will start thrashing. For example, if each individual vector is larger than the cache, loading the c vector will discard all of the b vector from cache. Making room for the result vector will then discard the c vector from cache. By the time the computation is being performed, the processor's behavior will have reverted to the scalar case. All of the work involved in attempting to prefetch the data ends up being overhead without any benefit. But an additional optimization can deal with this issue.

If the source and destination vectors don't fit into cache, the problem must be broken down into pieces. To achieve maximum performance, the pieces have to fit into cache. This technique, called "strip mining," is illustrated in Figure 3.

The process involves prefetching a portion of the two source vectors and making room for a portion of the destination vector. The computation is performed and the result is flushed to memory. The next portions of the input and output vectors are then set up in cache. Again, the computation takes place and the result gets flushed. This continues until the entire computation has been performed.
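The strip-by-strip process can be sketched as a pair of nested loops. The element-wise add and the function name are assumptions made for illustration; the comments mark where a real implementation would issue its prefetch, allocate, and flush operations, which are hardware-specific and not shown.

```c
#include <assert.h>
#include <stddef.h>

/* a[i] = b[i] + c[i] (an assumed operation), processed one strip at a time.
   strip_len is the number of elements per strip. */
void vadd_strip_mined(float *a, const float *b, const float *c,
                      size_t n, size_t strip_len)
{
    assert(strip_len > 0);

    for (size_t start = 0; start < n; start += strip_len) {
        /* The last strip may be shorter than the others. */
        size_t len = n - start;
        if (len > strip_len) {
            len = strip_len;
        }

        /* Hardware-specific: prefetch this strip of b and c,
           and make room in cache for this strip of a. */

        for (size_t i = 0; i < len; i++) {
            a[start + i] = b[start + i] + c[start + i];
        }

        /* Hardware-specific: flush this strip of a to memory. */
    }
}
```

The result is identical to the plain loop; only the order in which memory is visited changes.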

Do Some Strip Mining
The vector portions are called "strips," hence the name "strip mining." Ideally, the size of the strips is computed so that they just fit into the L1 cache. Obviously, the bookkeeping must be done carefully. If the strip size is too large, the strips will overflow the cache, inducing memory-bandwidth overhead. If they're too small, additional overhead will come from managing the smaller strips. The pointers into the vectors also must be managed correctly to produce the right result with minimum overhead.
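As a rough illustration of the strip-size arithmetic, the helper below divides the cache evenly among the vectors in use and rounds down to whole cache lines. The function name and all of the sizes in the example are assumptions, not figures for any particular processor.

```c
#include <assert.h>
#include <stddef.h>

/* Largest strip length, in elements, such that num_vectors strips fit in a
   cache of cache_bytes, with each strip rounded down to whole cache lines. */
size_t strip_length(size_t cache_bytes, size_t line_bytes,
                    size_t num_vectors, size_t elem_bytes)
{
    assert(line_bytes > 0 && num_vectors > 0 && elem_bytes > 0);

    size_t bytes_per_strip = cache_bytes / num_vectors;
    bytes_per_strip -= bytes_per_strip % line_bytes;  /* whole lines only */
    return bytes_per_strip / elem_bytes;
}
```

For a hypothetical 32-kbyte cache with 32-byte lines, three vectors, and 4-byte floats, strip_length(32768, 32, 3, 4) gives 2728 elements per strip. In practice, some headroom is usually needed, because limited cache associativity and other data sharing the cache can evict lines before the strips reach this theoretical size.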

Additional optimizations should be considered when performing a sequence or chain of vector computations. If intermediate results can stay in the cache, they don't take any memory bandwidth at all when used in the later computation. If a vector can somehow get marked as a temporary when it won't be used elsewhere in the program, it's also possible to avoid using memory bandwidth to store it back into memory by invalidating those cache lines.
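A chained computation might look like the sketch below, which evaluates d = (b + c) * e one strip at a time. The operation, the function name, and the strip length are all assumptions. The intermediate sum lives only in a strip-sized temporary buffer, so it never needs a full-length array of its own in memory.

```c
#include <assert.h>
#include <stddef.h>

#define STRIP_LEN 1024  /* hypothetical; would be tuned to the L1 cache */

/* d[i] = (b[i] + c[i]) * e[i], with the add and multiply chained per strip. */
void vadd_vmul_chained(float *d, const float *b, const float *c,
                       const float *e, size_t n)
{
    float t[STRIP_LEN];  /* temporary strip, reused for every pass */

    assert(d != NULL && b != NULL && c != NULL && e != NULL);

    for (size_t start = 0; start < n; start += STRIP_LEN) {
        size_t len = n - start;
        if (len > STRIP_LEN) {
            len = STRIP_LEN;
        }

        /* First vector operation: the result stays in the small buffer. */
        for (size_t i = 0; i < len; i++) {
            t[i] = b[start + i] + c[start + i];
        }

        /* Second vector operation: consumes the buffer while it is
           still likely to be in cache. */
        for (size_t i = 0; i < len; i++) {
            d[start + i] = t[i] * e[start + i];
        }
    }
}
```

Portable C cannot invalidate the temporary's cache lines, so this sketch only shows the reuse half of the idea; avoiding the write-back of a temporary requires processor-specific cache instructions.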

It's difficult, if not impossible, for a run-time library to know how to perform strip-mining and function-chaining optimizations in all cases. Some library vendors have provided additional arguments to their library functions to enable the application programmer to specify optimizations. For example, an argument can be provided to indicate whether a given vector should be in memory or can be assumed to be in L1 cache. If a vector is indicated to be in L1 cache, it doesn't need to be prefetched.
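A library interface with such a hint argument could look something like the following. This is a hypothetical interface invented for illustration, not any vendor's actual API: the vec_loc type, the function name, and the touch_vector stand-in for a hardware prefetch are all assumptions, as is the element-wise add.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 32  /* assumed cache-line size in bytes */

/* Hypothetical hint: where the caller says a source vector currently is. */
typedef enum { VEC_IN_MEMORY, VEC_IN_CACHE } vec_loc;

/* Hypothetical stand-in for a hardware prefetch: read one element per line. */
static void touch_vector(const float *v, size_t n)
{
    const size_t step = LINE_BYTES / sizeof(float);
    for (size_t i = 0; i < n; i += step) {
        (void)*(const volatile float *)&v[i];
    }
}

/* a[i] = b[i] + c[i]. A source vector is prefetched only when the caller
   says it is still in memory. Returns the number of prefetch passes run,
   purely so the effect of the hints is observable. */
int vadd_hinted(float *a, const float *b, vec_loc b_loc,
                const float *c, vec_loc c_loc, size_t n)
{
    int prefetch_passes = 0;

    assert(a != NULL && b != NULL && c != NULL);

    if (b_loc == VEC_IN_MEMORY) {
        touch_vector(b, n);
        prefetch_passes++;
    }
    if (c_loc == VEC_IN_MEMORY) {
        touch_vector(c, n);
        prefetch_passes++;
    }

    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
    return prefetch_passes;
}
```

A caller that has just produced b with an earlier vector call would pass VEC_IN_CACHE for it and VEC_IN_MEMORY for c, skipping one prefetch pass. The hint is only as good as the caller's knowledge: a wrong VEC_IN_CACHE still gives the correct result, but loses the prefetch benefit.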

Very efficient programs can then be written, but the user must manage the strip mining manually. The application becomes architecture-dependent. If the cache size changes, or an additional cache is introduced (L2 or L3), the optimization techniques change as well. Even upgrading to a later product within the same product family can require modifying the application to get the best performance.

A few companies have produced compilers that can automatically vectorize applications and target their proprietary hardware architectures. Cray (now SGI) was the first. Digital Equipment (now Compaq) and SKY Computers also have produced vectorizing compilers. Pacific-Sierra Research has more generic vectorizing technology. These compilers can significantly ease the task. They're able to identify program loops and convert them to appropriate calls to vector libraries.

SKY's compiler goes further in that it's able to perform a number of memory-bandwidth optimizations, including strip mining. It also can perform more global optimizations, even for programs using vector libraries. The compiler can eliminate vector loads to cache when not needed, discard vector flushes to memory for the same reason, and do automatic strip mining. The catch is that the user needs SKY's hardware to take advantage of these capabilities.

Dramatic Improvements Possible
The bottom line is that to gain the best performance for vector-oriented applications, the final executable must be optimized not only for the processor architecture, but also for the memory architecture. The optimizations required are well understood. Strip mining and careful cache manipulation can make dramatic improvements in application performance.

It is possible to write generally useful library routines in such a way that the user is able to take advantage of these optimizations. But an advanced vectorizing compiler will dramatically reduce the amount of effort required to achieve the desired performance level. It also will enable the resulting application to be easily ported to future generations of hardware. With the rapid introduction of hardware products today, minimizing and preserving the software investment is critical to program success.
