Electronic Design

  
Reprints     Printer-Friendly    Email this Article    RSS        Font Size     What's This?


[Design View / Design Solution]
Programming The CUDA Architecture: A Look At GPU Computing
GPUs have quickly surpassed CPUs in terms of computation speed. Now programmers can use the CUDA architecture to help simplify their implementation.

Brent Oster  |   ED Online ID #20903  |   April 9, 2009


Graphics processing units (GPUs) were originally designed to perform the highly parallel computations required for graphics rendering. But over the last couple of years, they’ve proven to be powerful computing workhorses across more than just graphics applications. Designed with more resources devoted to data processing rather than flow control and data caching, GPUs can be leveraged to significantly accelerate portions of codes traditionally run on CPUs, ranging from science and engineering applications to financial modeling.

CUDA is Nvidia’s parallel computing architecture, which manages computation on the GPU in a way that provides a simple interface for the programmer. The CUDA architecture can be programmed in industry-standard C with a few extensions. One takes a traditional C code that runs on the host (CPU) and offloads dataparallel sections of the code, or kernels, to the device (GPU). The CUDA architecture also supports a device-level application programming interface (API) and supports the upcoming OpenCL and DirectX 11 Compute standards.

This article presents an example that introduces various concepts in programming the CUDA architecture with standard C. Only the basic ideas are discussed here. For full coverage of the CUDA architecture, consult the CUDA Programming Guide. The process of programming the CUDA architecture follows a relatively simple path for users familiar with the C programming language. The GPU is viewed as a compute device capable of executing a large number of threads in parallel, and it operates as a coprocessor to the CPU, or host. Functions executed on the GPU are referred to as compute kernels.

Code to be executed on both the host and device can be contained in a single source file with a “.cu” extension. The code is compiled using “nvcc,” the CUDA C-compiler. When the resultant binary file is run, the C Runtime for CUDA coordinates the execution of host and device components.

This article will describe source code that demonstrates:

• Allocation of memory on the GPU device and moving data to/from that memory.
• A kernel—a function callable from the host and executed on the device by many threads in parallel.
• Calling the kernel from the host using an execution configuration—a means of specifying the number and grouping of parallel threads used to execute the kernel.
• Performing parallel computations using the built-in variables threadIdx and block- Idx to subdivide the problem domain.

In our example, CPU and GPU code is used to solve the Poisson equation by employing a first-order Jacobi finite difference scheme. The code simply iterates over every element in a 2D array, computing the value of the element from the four neighboring values and the source term at that point. This is done iteratively, gradually relaxing the array toward a convergent solution of the Poisson equation:

?u = f

discritized to:

ux,y = ¼( ux-1,y + ux+1,y + ux,y-1 + ux,y+1 + h2 × fx,y)

where u is the value being computed, f is the source term (which is known), and h is the grid spacing (assumed to be equal in X and Y).

This type of finite difference computation is used in many more sophisticated simulations, including multigrid, thermal analysis, RF modeling, and transport simulation.

In the CPU example of this computation (Fig. 1), every element in the 2D array must be looped over in a serial manner, one element at a time.

Although there are more optimal methods for iterating over a 2D array with a CPU using cache-blocking and vectorization, this basic implementation facilitates porting our simple application to CUDA (see “CPU Code For Jacobi Relaxation).

On a CUDA-enabled GPU, many processors can be executed at once, which enables developers to conveniently take advantage of a data parallel-programming methodology. On the Nvidia Tesla C1060 hardware, each of 30 streaming multiprocessors (SMs) contains eight single- precision processor cores and one double-precision core (Fig. 2). In total, there are 240 singleprecision processing cores and 30 double-precision cores on-chip. Each SM also has a small amount of on-chip shared memory, as well as specialized hardware for caching data formatted as textures or cast as constants.

The CUDA architecture is exposed to software applications as a set of grids, blocks, and threads (Fig. 3). Grids are of thread blocks that get executed on the device. Thread blocks are groups of threads that execute on individual SMs. Each thread has a unique ID and effectively maps to a single processor of an SM.

On the Tesla C1060, each thread has access to its own local register set, 16k of shared memory, and 4 Gbytes of device memory. All threads can read/write to any location in device memory and read/ write to any location in their block’s shared memory. Shared memory has much faster access latency than device memory, so it’s often used as a scratchpad to store intermediate results for computations, for buffering reads and writes to achieve optimal memory-access patterns, and for providing inter-thread communication within a block.

To port the Jacobi relaxation code to CUDA, we need to re-formulate the relaxation computation so that it can be computed in parallel. In the CPU code, we had to iterate an index over every element in serial. In CUDA, we assign a thread to each element in the 2D array. During each relaxation, each thread will take the neighboring four elements and the local value of f as input and compute one element as output, writing that element back to the array (Fig. 4).

Mapping this parallelism to the CUDA software architecture requires subdividing the array into 2D blocks, which are composed of 16x16 elements (with corresponding 16x16 threads). These will automatically be queued up and executed on the SMs. Blocks are overlapped to enable communication between them, creating a ‘halo’ of boundary source elements (blue in Fig. 5) loaded along with the elements to be computed (green in Fig. 5).

Continue to page 2


<-- prev. page     [1] 2     next page -->

Reprints   Printer-Friendly  Email this Article  RSS    Font Size   What's This?


  • Network-On-Chip Tools Arrive for The Masses
  • Tackling System Design Challenges Through Early Verification
  • ESL Tools Take Center Stage As Designers Move Up
  • Parasitic Extraction Tool Targets Next-Generation Custom ICs
  • Synopsys Jumps Into ESL-Synthesis Pool
  • Verify Control Systems Before Committing To Hardware
  • You're Using How Many FPGAs?
  • Tool Up For The FPGA Blitz
    1) Build A Smart Battery Charger Using A Single-Transistor Circuit
    (188 views today)
    2) Hot Hands For Some Cool Rock: Motion Sensing Meets Audio Engineering
    (172 views today)
    3) Science Fiction Meets Science Fact In Today's Robot Research
    (88 views today)
    4) GPS-Derived Grandmaster Clock Delivers Ultra-Precise Time And Frequency Sync
    (88 views today)
    5) What's All This Transimpedance Amplifier Stuff, Anyhow? (Part 1)
    (78 views today)
    ALL TOP 20



    Reader Comments

    good, very good

    rolando perez -July 15, 2009

    Hi, anyone to confirm this implementation not working?

    Lx -May 03, 2009

    POST YOUR COMMENTS HERE
    Name:

    Email:
    Your Comments:

    Enter the text from the image below


    Please refresh the page if you have trouble reading this text.

    Search Electronic Design
         
      
     
    Web Seminar
    Sponsored By:
    Title: Read Pacing: A Performance Enhancing Feature of PCI Express Gen 2 Switch Devices
    Speakers: 
    Date: 07/01/08
    Register: 

    Electronic Design Europe Electronic Design China EEPN Power Electronics Auto Electronics Microwaves & RF
    Mobile Dev & Design Schematics Find Power Products Military Electronics EE Events Related Resources