[Design View / Design Solution]
Programming The CUDA Architecture: A Look At GPU Computing
GPUs have quickly surpassed CPUs in terms of computation speed. Now programmers can use the CUDA architecture to help simplify their implementation.
Brent Oster
ED Online ID #20903
April 9, 2009
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Reprints
Graphics processing units (GPUs)
were originally designed to perform
the highly parallel computations
required for graphics
rendering. But over the last couple of years,
they’ve proven to be powerful computing
workhorses across more than just graphics
applications. Designed with more resources
devoted to data processing rather than flow
control and data caching, GPUs can be leveraged
to significantly accelerate portions
of codes traditionally run on CPUs, ranging
from science and engineering applications
to financial modeling.
CUDA is Nvidia’s parallel computing
architecture, which manages computation
on the GPU in a way that provides a
simple interface for the programmer. The
CUDA architecture can be programmed
in industry-standard C with a few extensions.
One takes a traditional C code that
runs on the host (CPU) and offloads dataparallel
sections of the code, or kernels, to
the device (GPU). The CUDA architecture
also supports a device-level application
programming interface (API) and supports
the upcoming OpenCL and DirectX
11 Compute standards.
This article presents an example that
introduces various concepts in programming
the CUDA architecture with standard
C. Only the basic ideas are discussed here.
For full coverage of the CUDA architecture, consult the CUDA Programming Guide.
The process of programming the CUDA
architecture follows a relatively simple
path for users familiar with the C programming
language. The GPU is viewed
as a compute device capable of executing
a large number of threads in parallel, and
it operates as a coprocessor to the CPU, or
host. Functions executed on the GPU are
referred to as compute kernels.
Code to be executed on both the host
and device can be contained in a single
source file with a “.cu” extension. The
code is compiled using “nvcc,” the CUDA
C-compiler. When the resultant binary file is run, the C Runtime for CUDA coordinates
the execution of host and device
components.
This article will describe source code
that demonstrates:
• Allocation of memory on the GPU device
and moving data to/from that memory.
• A kernel—a function callable from the
host and executed on the device by many
threads in parallel.
• Calling the kernel from the host using
an execution configuration—a means of
specifying the number and grouping of
parallel threads used to execute the kernel.
• Performing parallel computations using
the built-in variables threadIdx and block-
Idx to subdivide the problem domain.
In our example, CPU and GPU code
is used to solve the Poisson equation by
employing a first-order Jacobi finite difference
scheme. The code simply iterates
over every element in a 2D array, computing
the value of the element from the four
neighboring values and the source term at
that point. This is done iteratively, gradually
relaxing the array toward a convergent
solution of the Poisson equation:
?u = f
discritized to:
ux,y = ¼( ux-1,y + ux+1,y + ux,y-1 + ux,y+1 + h2 × fx,y)
where u is the value being computed, f is
the source term (which is known), and h
is the grid spacing (assumed to be equal
in X and Y).
This type of finite difference computation
is used in many more sophisticated simulations, including multigrid, thermal
analysis, RF modeling, and transport
simulation.
In the CPU example of this computation
(Fig. 1), every element in the 2D array
must be looped over in a serial manner,
one element at a time.
Although there are more optimal
methods for iterating over
a 2D array with a CPU using
cache-blocking and vectorization,
this basic implementation
facilitates porting our simple
application to CUDA (see “CPU
Code For Jacobi Relaxation”).
On a CUDA-enabled GPU,
many processors can be executed
at once, which enables developers
to conveniently take
advantage of a data parallel-programming
methodology. On the
Nvidia Tesla C1060 hardware,
each of 30 streaming multiprocessors
(SMs) contains eight single-
precision processor cores and
one double-precision core (Fig.
2). In total, there are 240 singleprecision
processing cores and 30
double-precision cores on-chip.
Each SM also has a small amount
of on-chip shared memory, as
well as specialized hardware for
caching data formatted as textures
or cast as constants.
The CUDA architecture is
exposed to software applications
as a set of grids, blocks,
and threads (Fig. 3). Grids are of thread blocks that get
executed on the device. Thread
blocks are groups of threads
that execute on individual
SMs. Each thread has a unique
ID and effectively maps to a
single processor of an SM.
On the Tesla C1060, each
thread has access to its own
local register set, 16k of shared
memory, and 4 Gbytes of
device memory. All threads
can read/write to any location
in device memory and read/
write to any location in their
block’s shared memory. Shared
memory has much faster access
latency than device memory, so
it’s often used as a scratchpad to store intermediate
results for computations, for buffering
reads and writes to achieve optimal
memory-access patterns, and for providing
inter-thread communication within a block.
To port the Jacobi relaxation code to
CUDA, we need to re-formulate the relaxation
computation so that it can be computed
in parallel. In the CPU code, we
had to iterate an index over every element
in serial. In CUDA, we assign a thread
to each element in the 2D array. During
each relaxation, each thread will take the
neighboring four elements and the local
value of f as input and compute one element
as output, writing that element back
to the array (Fig. 4).
Mapping this parallelism to the CUDA
software architecture requires subdividing
the array into 2D blocks, which are
composed of 16x16 elements (with corresponding
16x16 threads). These will automatically
be queued up and executed on
the SMs. Blocks are overlapped to enable
communication between them, creating a
‘halo’ of boundary source elements (blue
in Fig. 5) loaded along with the elements
to be computed (green in Fig. 5).
Continue to page 2
For a Jacobi relaxation, results shouldn’t
be written back to the same buffer being
used as input during the parallel computation.
Therefore, all input values for the
16x16 block are first loaded into shared
memory (which is local to the SM), then
block relaxation is performed, and finally
the results are written back to the array in
device memory. This also boosts performance
by reducing the number of device memory fetches, as seen in the CUDA
implementation (see “CUDA Kernel Code
For Jacobi Relaxation”).
In the CUDA implementation, the CUDA
kernel function is identified by the type
qualifier __global__. The built-in variables threadIdx and blockIdx provide the indices
for the local thread and block. They also
allow a unique index to be computed for
each 2D array element. Threads within the
same block have access to 16k of on-chip
common shared memory (declared with the __shared__ type qualifier), which has
much lower latency than device memory.
The __syncthreads() command provides
a barrier such that all threads within
a block synchronize before execution
continues. In this example, the block of
data is loaded from device memory to
shared memory, then __syncthreads() is
called to ensure that the data is all loaded
before performing the relaxation computation
and writing back directly to device
memory. Using shared memory reduces
the number of device memory reads by a
factor of four, which is important because
reads from device memory can experience
several hundred cycles of latency compared
to just a few cycles of latency for
reads from shared memory.
Also, memory coalescing can be exploited.
Here, adjacent threads cooperate to load
a contiguous segment of up to 128 bytes of
device memory in one read operation.
Now that a kernel is written in C for
CUDA, it must be called from the host
CPU. First, allocate memory on the GPU
device and copy the array data from the
host RAM to the GPU device memory.
This is done using the cudaMalloc(…) and
cudaMemcpy(…) commands (see “CUDA
Setup Code For Jacobi Relaxation”). Note
that cudaMalloc(…) returns a pointer to
an address in device memory. It also can
be used in kernel calls that will access
that device memory. However, it can’t be
accessed from the host code, because the
address it points to is not in the host system
RAM, but in device RAM on the GPU.
Via the PCI-e bus, cudaMemcpy(…) initiates
a DMA transfer from host RAM to
device RAM, with the last parameter controlling
direction of the transfer between
host and device memory.
Next comes the setup of an execution
configuration. Algorithmically, this
involves dividing the problem into blocks
of threads so that it can be computed in
parallel. In this example, we subdivide the
2D array into blocks that are 16x16 elements
in size (with a two-element overlap
boundary shared between blocks). To set
this up in CUDA, use dimGrid (a variable
of type dim3) to specify how many blocks
there are in X and Y and dimBlock to
specify how many threads are in a block
(one thread per array element).
The kernel with the execution configuration
is called by using the <<<dimGrid,
dimBlock>>> qualifier between the function name
and parameter
list. Because kernel
calls are
non-blocking, the
CPU could run
other code while
the kernel computation
completes.
The computation results are then
moved from device to host memory
using the cudaMemcpy(…) command.
From here, it can be output to a file or
formatted for visualization. Once completed,
the device memory is freed using
the cudaFree(…) command.
In a real application, the kernel gets
called multiple times to iterate on the
array in device memory, converging
toward the solution much faster than if we
had to move the memory back and forth
between device and host. The PCI Express
bandwidth limits host-to-device transfers
to about 5 Gbytes/s, while the device
memory’s bandwidth is on the order of
100 Gbytes/s. Thus, it’s advantageous to
keep computations within the 4 Gbytes of
device memory on the Tesla C1060.
Although this is a simple example, it
shows how to rewrite a typical serial computation
to execute in parallel and demonstrates
an implementation on the GPU
using C for CUDA. With minimal coding
effort, it went from having one CPU core
serially processing each element of an
array to having a GPU use 240 cores to
process the array elements in parallel.
In addition to having more cores to
accomplish the computation, the GPU
has higher memory bandwidth than a CPU. Thus, a
significant performance
increase is
expected.
Figure 6 shows
how the performance
scales on a
Tesla C1060 GPU
with the code in
this example. For
larger arrays, the
single-precision
version can process
nearly 5 billion array elements per
second and the float-precision version
about 2.5 billion elements per second.
Although there’s eight times the number
of single-precision processing cores,
performance only increases twofold from
double to single precision. This is due to
the application being memory bandwidth
bound and double-precision data being
twice the size of single precision. This
simple example of Jacobi relaxation of the Poisson equation demonstrates a significant
performance increase over CPUbased
implementations. It also shows how
one can achieve at least an order of magnitude
boost in speed over a CPU for many
similar applications.
For more computationally intensive
applications, such as molecular dynamics
and n-body problems, even greater acceleration
occurs because memory bandwidth is
no longer the bottleneck and the 240 cores
can run at full occupancy. In these cases,
20X jumps in speed are possible and have
been demonstrated in academic and commercial
codes such as Nanoscale Molecular
Dynamics (NAMD) from the University
of Illinois at Urbana-Champaign.
Computations such as finite difference
methods can also scale to multiple GPUs
and multiple nodes. It’s possible to configure
a workstation with up to four Tesla
cards, giving it a total of 960 cores, 16
Gbytes of device memory, and almost 4
TFLOPS of peak computing power.
|