Electronic Design

  
[Design Application]
Boost Performance By Vectorizing Your DSP Software
Whether you do it manually or use a special compiler, vectorization can speed up code by as much as 200%.

Stephen Paavola  |   ED Online ID #1313  |   March 20, 2000


The high-performance computing arena still has an insatiable appetite for performance. The more processing power semiconductor vendors produce, the more developers try to squeeze out every drop of that performance. This demand is an upward spiral that keeps new products launching into the market at unprecedented rates. It's hard to get that last drop of performance before just giving in and getting the next big thing. But if you want to try, the long-standing but misunderstood tool called vectorization can increase DSP application performance by as much as 200%.

Vector libraries have been around since computers were first used for scientific computations. Initially, they were produced as a convenience: programmers didn't have to rewrite and debug simple functions every time they needed to do a vector-oriented computation. As processors became more powerful and complicated, effort went into optimizing these libraries by writing them in assembly language.

Today, processors aren't just complicated. With multiple execution units, superscalar pipelined architectures, VLIW, vector-processing units, and other tricks, it's nearly impossible for mortals to take full advantage of them. As a result, an increasing number of companies are creating optimized libraries. General-purpose processors are now so fast that the performance of many applications is limited by memory bandwidth, not processor speed. Advanced cache architectures attempt to relieve this bottleneck, but proper vectorization can be the key to a cost-effective solution.

What Is Vectorization?
Many common algorithms are optimized by the process of vectorization. It maximizes processor utilization and memory bandwidth. There are two basic ways to vectorize an application. The most commonly used approach is for an engineer to write a vector library. The functions may be written in a high-level language, such as C or Fortran, or written in assembly language for a particular processor. The application programmer then hand vectorizes the application by calling these vector-library routines.
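
As a minimal sketch of hand vectorization, the fragment below assumes a hypothetical vector-library routine named vec_mul; the name, signature, and the apply_window caller are illustrative assumptions, and real libraries use their own names and argument orders. The point is only that the application replaces an explicit loop with a single, descriptively named call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector-library routine (name and signature are assumptions
 * for illustration, not taken from any real library):
 * computes dst[i] = x[i] * y[i] for i in [0, n).
 * A production library would typically tune this body for one processor. */
void vec_mul(float *dst, const float *x, const float *y, size_t n)
{
    assert(n == 0 || (dst && x && y));
    for (size_t i = 0; i < n; i++) {
        dst[i] = x[i] * y[i];
    }
}

/* Application code, hand vectorized: weighting a block of samples by a
 * window becomes one library call instead of an open-coded loop. */
void apply_window(float *out, const float *samples, const float *window,
                  size_t n)
{
    vec_mul(out, samples, window, n);
}
```

If the library routine is later rewritten in assembly language for a particular processor, the application code in apply_window would not need to change.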

The other approach is for the compiler to perform automatic vectorization. A few compilers are able to recognize loops in the program as vector functions. They can automatically convert these loops to calls to a vector library. Or, they can directly use hardware vector functionality.
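
To illustrate what "recognizing a loop as a vector function" can mean, the sketch below contrasts a loop that has the shape vectorizing compilers typically look for with one that does not. The function names are hypothetical, and whether a given compiler actually vectorizes the first loop depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* A loop in a form that vectorizing compilers can typically recognize:
 * unit stride, a simple trip count, and no dependence between iterations.
 * The C99 restrict qualifiers promise that x and y do not overlap. */
void scale_add(float *restrict y, const float *restrict x, float gain,
               size_t n)
{
    assert(n == 0 || (x && y));
    for (size_t i = 0; i < n; i++) {
        y[i] = y[i] + gain * x[i];
    }
}

/* By contrast, each iteration here needs the result of the previous one,
 * so the loop cannot simply be turned into an element-wise vector
 * operation as written. */
void running_sum(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        y[i] = y[i] + y[i - 1];
    }
}
```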

A wide range of applications, especially DSP ones, can gain additional performance through vectorization. This range includes basically all signal- and image-processing applications. They're characterized by having their data stored in contiguous memory locations as vectors or arrays. Storing data in this form makes it possible to optimize common algorithms, maximizing processor utilization and the use of memory bandwidth.

Several reasons exist to hand vectorize an application. The possible performance gains certainly count. Accuracy and reliability gains also may be achieved. Some algorithms require careful analysis by a mathematician in order to obtain satisfactory results.

Another reason to hand vectorize is that the code becomes more self-documenting. A subroutine call, in which the name of the subroutine describes the function with the arguments clearly specified, is easier to understand than trying to decode what a loop is doing.

Memory-Performance Problem
Modern compilers do a very good job of scalar optimization. Some even do some advanced loop optimizations, like loop unrolling. Unfortunately, though, they don't have any real understanding of the memory architecture and what's involved in optimizing for memory bandwidth.

Take a simple function like a vector multiply as an example. In C, a vector multiply would look like:

for (i = 0; i < length; i++) {
    a[i] = b[i] * c[i];
}

If the source and destination vectors are in cache, a good C compiler will generate code that's pretty efficient for this function. But if the vectors are in memory, the scalar compiler will actually generate code with the worst possible utilization of memory bandwidth. Understanding this requires knowledge of DRAM behavior.

DRAMs have two types of sequences to access their contents. If the current access is in the same page as the previous one, the memory location can be selected very rapidly, and the data can potentially be accessed without wait states. But if a page change is required, the new page must be selected before the location within that page can be accessed. Changing pages takes several clocks and induces wait states.
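
The two access sequences described above can be captured in a toy cost model. This is a sketch under stated assumptions: the page size and clock counts are made-up illustrative values, not figures for any real DRAM, and the model tracks only a single open page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a DRAM with one open page. All numbers are illustrative
 * assumptions, not figures for any real device. */
enum {
    PAGE_BYTES       = 2048, /* assumed page size */
    HIT_CLOCKS       = 1,    /* assumed cost of an access in the open page */
    PAGE_MISS_CLOCKS = 6     /* assumed extra clocks to change pages */
};

typedef struct {
    int       has_open_page; /* 0 until the first access */
    uintptr_t open_page;     /* index of the currently open page */
} dram_model;

/* Return the modeled cost, in clocks, of accessing byte address addr. */
unsigned dram_access(dram_model *m, uintptr_t addr)
{
    uintptr_t page = addr / PAGE_BYTES;
    assert(m != NULL);
    if (m->has_open_page && m->open_page == page) {
        return HIT_CLOCKS;            /* same page: fast access */
    }
    m->has_open_page = 1;             /* page change: pay the penalty */
    m->open_page = page;
    return HIT_CLOCKS + PAGE_MISS_CLOCKS;
}
```

Under these assumed values, two back-to-back accesses in the same page cost 7 clocks and then 1 clock, while alternating between two pages costs 7 clocks every time.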

Looking at the vector-multiply example again, it's seldom the case that the three vectors are in the same page. As Figure 1 shows, the typical sequence of operations starts by opening the page containing b(0) and loading the cache line containing b(0) into the L1 cache. Then, the page containing c(0) is opened and its cache line is loaded into the L1 cache. A multiply is performed. But before the result can be stored, the page where a(0) will be kept must be opened, and that cache line must be loaded into the L1 cache. Finally, the location in L1 cache representing a(0) is modified, with the remainder of that cache line left unmodified. In processor terms, this has all taken a long time because a lot of wait states were required to get the data out of memory.

The processor can now proceed to compute a(1). This should happen very quickly, because b(1) and c(1) were probably fetched into L1 cache as a byproduct of the previous computation. The processor will rapidly finish with these cache lines, however, and go through the process of loading additional cache lines, again with lots of wait states. The time consumed by the wait states can be twice the time used to actually transfer data. This process continues until the entire computation is performed.
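
The access pattern just described can be counted with a small model. The sketch below is illustrative only: the cache-line size, page size, and function names are assumptions, base addresses are taken to be line-aligned, and a float is taken to be 4 bytes. It records the cache-line fills issued by the scalar loop in the order b, c, a, then counts how many of those fills force a DRAM page change.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative assumptions, not figures for any real system. */
enum {
    LINE_BYTES = 32,   /* assumed L1 cache-line size */
    PAGE_BYTES = 2048  /* assumed DRAM page size */
};

/* Count DRAM page changes caused by a sequence of cache-line fills,
 * given the byte address of each fill, with one page open at a time. */
size_t count_page_changes(const size_t *fill_addr, size_t fills)
{
    size_t changes = 0;
    size_t open_page = (size_t)-1;    /* sentinel: no page open yet */
    for (size_t k = 0; k < fills; k++) {
        size_t page = fill_addr[k] / PAGE_BYTES;
        if (page != open_page) {
            changes++;
            open_page = page;
        }
    }
    return changes;
}

/* Record the line fills issued by the scalar loop a[i] = b[i] * c[i]
 * over n floats: for each new cache line the order is b, then c, then a
 * (the destination line is loaded before it is modified).
 * Returns the number of fills written to out, which has capacity cap. */
size_t scalar_loop_fills(size_t a_base, size_t b_base, size_t c_base,
                         size_t n, size_t *out, size_t cap)
{
    size_t fills = 0;
    size_t per_line = LINE_BYTES / sizeof(float);
    for (size_t i = 0; i < n; i += per_line) {
        assert(fills + 3 <= cap);
        out[fills++] = b_base + i * sizeof(float);
        out[fills++] = c_base + i * sizeof(float);
        out[fills++] = a_base + i * sizeof(float);
    }
    return fills;
}
```

With 64-element vectors placed in three different pages, this model issues 24 line fills and every one of them changes the page. Reading the 8 lines of a single vector in sequence, by comparison, changes the page only once.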

