Premium Content

New Signal Chain Resources from Texas Instruments:

Programming The CUDA Architecture: A Look At GPU Computing

GPUs have quickly surpassed CPUs in terms of computation speed. Now programmers can use the CUDA architecture to help simplify their implementation.

Date Posted: April 09, 2009 12:00 AM
Author: Brent Oster

For a Jacobi relaxation, results shouldn’t be written back to the same buffer being used as input during the parallel computation. Therefore, all input values for the 16x16 block are first loaded into shared memory (which is local to the SM), then block relaxation is performed, and finally the results are written back to the array in device memory. This also boosts performance by reducing the number of device memory fetches, as seen in the CUDA implementation (see “CUDA Kernel Code For Jacobi Relaxation).

In the CUDA implementation, the CUDA kernel function is identified by the type qualifier __global__. The built-in variables threadIdx and blockIdx provide the indices for the local thread and block. They also allow a unique index to be computed for each 2D array element. Threads within the same block have access to 16k of on-chip common shared memory (declared with the __shared__ type qualifier), which has much lower latency than device memory.

The __syncthreads() command provides a barrier such that all threads within a block synchronize before execution continues. In this example, the block of data is loaded from device memory to shared memory, then __syncthreads() is called to ensure that the data is all loaded before performing the relaxation computation and writing back directly to device memory. Using shared memory reduces the number of device memory reads by a factor of four, which is important because reads from device memory can experience several hundred cycles of latency compared to just a few cycles of latency for reads from shared memory.

Also, memory coalescing can be exploited. Here, adjacent threads cooperate to load a contiguous segment of up to 128 bytes of device memory in one read operation.

Now that a kernel is written in C for CUDA, it must be called from the host CPU. First, allocate memory on the GPU device and copy the array data from the host RAM to the GPU device memory. This is done using the cudaMalloc(…) and cudaMemcpy(…) commands (see “CUDA Setup Code For Jacobi Relaxation”). Note that cudaMalloc(…) returns a pointer to an address in device memory. It also can be used in kernel calls that will access that device memory. However, it can’t be accessed from the host code, because the address it points to is not in the host system RAM, but in device RAM on the GPU.

Via the PCI-e bus, cudaMemcpy(…) initiates a DMA transfer from host RAM to device RAM, with the last parameter controlling direction of the transfer between host and device memory.

Next comes the setup of an execution configuration. Algorithmically, this involves dividing the problem into blocks of threads so that it can be computed in parallel. In this example, we subdivide the 2D array into blocks that are 16x16 elements in size (with a two-element overlap boundary shared between blocks). To set this up in CUDA, use dimGrid (a variable of type dim3) to specify how many blocks there are in X and Y and dimBlock to specify how many threads are in a block (one thread per array element).

The kernel with the execution configuration is called by using the <<<dimGrid, dimBlock>>> qualifier between the function name and parameter list. Because kernel calls are non-blocking, the CPU could run other code while the kernel computation completes.

The computation results are then moved from device to host memory using the cudaMemcpy(…) command. From here, it can be output to a file or formatted for visualization. Once completed, the device memory is freed using the cudaFree(…) command.

In a real application, the kernel gets called multiple times to iterate on the array in device memory, converging toward the solution much faster than if we had to move the memory back and forth between device and host. The PCI Express bandwidth limits host-to-device transfers to about 5 Gbytes/s, while the device memory’s bandwidth is on the order of 100 Gbytes/s. Thus, it’s advantageous to keep computations within the 4 Gbytes of device memory on the Tesla C1060.

Although this is a simple example, it shows how to rewrite a typical serial computation to execute in parallel and demonstrates an implementation on the GPU using C for CUDA. With minimal coding effort, it went from having one CPU core serially processing each element of an array to having a GPU use 240 cores to process the array elements in parallel.

In addition to having more cores to accomplish the computation, the GPU has higher memory bandwidth than a CPU. Thus, a significant performance increase is expected.

Figure 6 shows how the performance scales on a Tesla C1060 GPU with the code in this example. For larger arrays, the single-precision version can process nearly 5 billion array elements per second and the float-precision version about 2.5 billion elements per second.

Although there’s eight times the number of single-precision processing cores, performance only increases twofold from double to single precision. This is due to the application being memory bandwidth bound and double-precision data being twice the size of single precision. This simple example of Jacobi relaxation of the Poisson equation demonstrates a significant performance increase over CPUbased implementations. It also shows how one can achieve at least an order of magnitude boost in speed over a CPU for many similar applications.

For more computationally intensive applications, such as molecular dynamics and n-body problems, even greater acceleration occurs because memory bandwidth is no longer the bottleneck and the 240 cores can run at full occupancy. In these cases, 20X jumps in speed are possible and have been demonstrated in academic and commercial codes such as Nanoscale Molecular Dynamics (NAMD) from the University of Illinois at Urbana-Champaign.

Computations such as finite difference methods can also scale to multiple GPUs and multiple nodes. It’s possible to configure a workstation with up to four Tesla cards, giving it a total of 960 cores, 16 Gbytes of device memory, and almost 4 TFLOPS of peak computing power.

Part Inventory
Go
powered by:
 

 
You must log on before posting a comment.

Are you a new visitor? Register Here
  • dw
    2 years ago
    Mar 01, 2010

    u_sh[tx][ty] = u_d[x+y*ArraySizeX]; // I don't understand here. Since x: [1, 1024] and y: [1:1024], if x=1024 and y=1024, and ArraySizeX=1024, the index at this moment exceeds the boundary of the matrix u_d upperbound 1024*1024!

  • dw
    2 years ago
    Mar 01, 2010

    u_sh[tx][ty] = u_d[x+y*ArraySizeX]; // I don't understand here. Since x: [1, 1024] and y: [1:1024], if x=1024 and y=1024, and ArraySizeX=1024, the index at this moment exceeds the boundary of the matrix u_d upperbound 1024*1024!