Electronic Design

  
Reprints     Printer-Friendly    Email this Article    RSS        Font Size     What's This?


[Design View / Design Solution]
Programming The CUDA Architecture: A Look At GPU Computing
GPUs have quickly surpassed CPUs in terms of computation speed. Now programmers can use the CUDA architecture to help simplify their implementation.

Brent Oster  |   ED Online ID #20903  |   April 9, 2009


For a Jacobi relaxation, results shouldn’t be written back to the same buffer being used as input during the parallel computation. Therefore, all input values for the 16x16 block are first loaded into shared memory (which is local to the SM), then block relaxation is performed, and finally the results are written back to the array in device memory. This also boosts performance by reducing the number of device memory fetches, as seen in the CUDA implementation (see “CUDA Kernel Code For Jacobi Relaxation).

In the CUDA implementation, the CUDA kernel function is identified by the type qualifier __global__. The built-in variables threadIdx and blockIdx provide the indices for the local thread and block. They also allow a unique index to be computed for each 2D array element. Threads within the same block have access to 16k of on-chip common shared memory (declared with the __shared__ type qualifier), which has much lower latency than device memory.

The __syncthreads() command provides a barrier such that all threads within a block synchronize before execution continues. In this example, the block of data is loaded from device memory to shared memory, then __syncthreads() is called to ensure that the data is all loaded before performing the relaxation computation and writing back directly to device memory. Using shared memory reduces the number of device memory reads by a factor of four, which is important because reads from device memory can experience several hundred cycles of latency compared to just a few cycles of latency for reads from shared memory.

Also, memory coalescing can be exploited. Here, adjacent threads cooperate to load a contiguous segment of up to 128 bytes of device memory in one read operation.

Now that a kernel is written in C for CUDA, it must be called from the host CPU. First, allocate memory on the GPU device and copy the array data from the host RAM to the GPU device memory. This is done using the cudaMalloc(…) and cudaMemcpy(…) commands (see “CUDA Setup Code For Jacobi Relaxation”). Note that cudaMalloc(…) returns a pointer to an address in device memory. It also can be used in kernel calls that will access that device memory. However, it can’t be accessed from the host code, because the address it points to is not in the host system RAM, but in device RAM on the GPU.

Via the PCI-e bus, cudaMemcpy(…) initiates a DMA transfer from host RAM to device RAM, with the last parameter controlling direction of the transfer between host and device memory.

Next comes the setup of an execution configuration. Algorithmically, this involves dividing the problem into blocks of threads so that it can be computed in parallel. In this example, we subdivide the 2D array into blocks that are 16x16 elements in size (with a two-element overlap boundary shared between blocks). To set this up in CUDA, use dimGrid (a variable of type dim3) to specify how many blocks there are in X and Y and dimBlock to specify how many threads are in a block (one thread per array element).

The kernel with the execution configuration is called by using the <<<dimGrid, dimBlock>>> qualifier between the function name and parameter list. Because kernel calls are non-blocking, the CPU could run other code while the kernel computation completes.

The computation results are then moved from device to host memory using the cudaMemcpy(…) command. From here, it can be output to a file or formatted for visualization. Once completed, the device memory is freed using the cudaFree(…) command.

In a real application, the kernel gets called multiple times to iterate on the array in device memory, converging toward the solution much faster than if we had to move the memory back and forth between device and host. The PCI Express bandwidth limits host-to-device transfers to about 5 Gbytes/s, while the device memory’s bandwidth is on the order of 100 Gbytes/s. Thus, it’s advantageous to keep computations within the 4 Gbytes of device memory on the Tesla C1060.

Although this is a simple example, it shows how to rewrite a typical serial computation to execute in parallel and demonstrates an implementation on the GPU using C for CUDA. With minimal coding effort, it went from having one CPU core serially processing each element of an array to having a GPU use 240 cores to process the array elements in parallel.

In addition to having more cores to accomplish the computation, the GPU has higher memory bandwidth than a CPU. Thus, a significant performance increase is expected.

Figure 6 shows how the performance scales on a Tesla C1060 GPU with the code in this example. For larger arrays, the single-precision version can process nearly 5 billion array elements per second and the float-precision version about 2.5 billion elements per second.

Although there’s eight times the number of single-precision processing cores, performance only increases twofold from double to single precision. This is due to the application being memory bandwidth bound and double-precision data being twice the size of single precision. This simple example of Jacobi relaxation of the Poisson equation demonstrates a significant performance increase over CPUbased implementations. It also shows how one can achieve at least an order of magnitude boost in speed over a CPU for many similar applications.

For more computationally intensive applications, such as molecular dynamics and n-body problems, even greater acceleration occurs because memory bandwidth is no longer the bottleneck and the 240 cores can run at full occupancy. In these cases, 20X jumps in speed are possible and have been demonstrated in academic and commercial codes such as Nanoscale Molecular Dynamics (NAMD) from the University of Illinois at Urbana-Champaign.

Computations such as finite difference methods can also scale to multiple GPUs and multiple nodes. It’s possible to configure a workstation with up to four Tesla cards, giving it a total of 960 cores, 16 Gbytes of device memory, and almost 4 TFLOPS of peak computing power.


<-- prev. page     1 [2]     next page -->

Reprints   Printer-Friendly  Email this Article  RSS    Font Size   What's This?


  • Network-On-Chip Tools Arrive for The Masses
  • Tackling System Design Challenges Through Early Verification
  • ESL Tools Take Center Stage As Designers Move Up
  • Parasitic Extraction Tool Targets Next-Generation Custom ICs
  • Synopsys Jumps Into ESL-Synthesis Pool
  • Verify Control Systems Before Committing To Hardware
  • You're Using How Many FPGAs?
  • Tool Up For The FPGA Blitz
    1) Build A Smart Battery Charger Using A Single-Transistor Circuit
    (185 views today)
    2) Hot Hands For Some Cool Rock: Motion Sensing Meets Audio Engineering
    (170 views today)
    3) GPS-Derived Grandmaster Clock Delivers Ultra-Precise Time And Frequency Sync
    (90 views today)
    4) What's All This Transimpedance Amplifier Stuff, Anyhow? (Part 1)
    (88 views today)
    5) Downconverting Mixers Lower Power Consumption While Improving Performance
    (71 views today)
    ALL TOP 20



    Reader Comments

    good, very good

    rolando perez -July 15, 2009

    Hi, anyone to confirm this implementation not working?

    Lx -May 03, 2009

    POST YOUR COMMENTS HERE
    Name:

    Email:
    Your Comments:

    Enter the text from the image below


    Please refresh the page if you have trouble reading this text.

    Search Electronic Design
         
      
     
    Web Seminar
    Sponsored By:
    Title: Read Pacing: A Performance Enhancing Feature of PCI Express Gen 2 Switch Devices
    Speakers: 
    Date: 07/01/08
    Register: 

    Electronic Design Europe Electronic Design China EEPN Power Electronics Auto Electronics Microwaves & RF
    Mobile Dev & Design Schematics Find Power Products Military Electronics EE Events Related Resources