Graphics processing units (GPUs)
were originally designed to perform
the highly parallel computations
required for graphics
rendering. But over the last couple of years,
they’ve proven to be powerful computing
workhorses across more than just graphics
applications. Designed with more resources
devoted to data processing rather than flow
control and data caching, GPUs can be leveraged
to significantly accelerate portions
of codes traditionally run on CPUs, ranging
from science and engineering applications
to financial modeling.
CUDA is Nvidia’s parallel computing
architecture, which manages computation
on the GPU in a way that provides a
simple interface for the programmer. The
CUDA architecture can be programmed
in industry-standard C with a few extensions.
One takes a traditional C code that
runs on the host (CPU) and offloads dataparallel
sections of the code, or kernels, to
the device (GPU). The CUDA architecture
also supports a device-level application
programming interface (API) and supports
the upcoming OpenCL and DirectX
11 Compute standards.
This article presents an example that
introduces various concepts in programming
the CUDA architecture with standard
C. Only the basic ideas are discussed here.
For full coverage of the CUDA architecture, consult the CUDA Programming Guide.
The process of programming the CUDA
architecture follows a relatively simple
path for users familiar with the C programming
language. The GPU is viewed
as a compute device capable of executing
a large number of threads in parallel, and
it operates as a coprocessor to the CPU, or
host. Functions executed on the GPU are
referred to as compute kernels.
Code to be executed on both the host
and device can be contained in a single
source file with a “.cu” extension. The
code is compiled using “nvcc,” the CUDA
C-compiler. When the resultant binary file is run, the C Runtime for CUDA coordinates
the execution of host and device
components.
This article will describe source code
that demonstrates:
• Allocation of memory on the GPU device
and moving data to/from that memory.
• A kernel—a function callable from the
host and executed on the device by many
threads in parallel.
• Calling the kernel from the host using
an execution configuration—a means of
specifying the number and grouping of
parallel threads used to execute the kernel.
• Performing parallel computations using
the built-in variables threadIdx and block-
Idx to subdivide the problem domain.
In our example, CPU and GPU code
is used to solve the Poisson equation by
employing a first-order Jacobi finite difference
scheme. The code simply iterates
over every element in a 2D array, computing
the value of the element from the four
neighboring values and the source term at
that point. This is done iteratively, gradually
relaxing the array toward a convergent
solution of the Poisson equation:
?u = f
discritized to:
ux,y = ¼( ux-1,y + ux+1,y + ux,y-1 + ux,y+1 + h2 × fx,y)
where u is the value being computed, f is
the source term (which is known), and h
is the grid spacing (assumed to be equal
in X and Y).
This type of finite difference computation
is used in many more sophisticated simulations, including multigrid, thermal
analysis, RF modeling, and transport
simulation.
In the CPU example of this computation
(Fig. 1), every element in the 2D array
must be looped over in a serial manner,
one element at a time.
Although there are more optimal
methods for iterating over
a 2D array with a CPU using
cache-blocking and vectorization,
this basic implementation
facilitates porting our simple
application to CUDA (see “CPU
Code For Jacobi Relaxation”).
On a CUDA-enabled GPU,
many processors can be executed
at once, which enables developers
to conveniently take
advantage of a data parallel-programming
methodology. On the
Nvidia Tesla C1060 hardware,
each of 30 streaming multiprocessors
(SMs) contains eight single-
precision processor cores and
one double-precision core (Fig.
2). In total, there are 240 singleprecision
processing cores and 30
double-precision cores on-chip.
Each SM also has a small amount
of on-chip shared memory, as
well as specialized hardware for
caching data formatted as textures
or cast as constants.
The CUDA architecture is
exposed to software applications
as a set of grids, blocks,
and threads (Fig. 3). Grids are of thread blocks that get
executed on the device. Thread
blocks are groups of threads
that execute on individual
SMs. Each thread has a unique
ID and effectively maps to a
single processor of an SM.
On the Tesla C1060, each
thread has access to its own
local register set, 16k of shared
memory, and 4 Gbytes of
device memory. All threads
can read/write to any location
in device memory and read/
write to any location in their
block’s shared memory. Shared
memory has much faster access
latency than device memory, so
it’s often used as a scratchpad to store intermediate
results for computations, for buffering
reads and writes to achieve optimal
memory-access patterns, and for providing
inter-thread communication within a block.
To port the Jacobi relaxation code to
CUDA, we need to re-formulate the relaxation
computation so that it can be computed
in parallel. In the CPU code, we
had to iterate an index over every element
in serial. In CUDA, we assign a thread
to each element in the 2D array. During
each relaxation, each thread will take the
neighboring four elements and the local
value of f as input and compute one element
as output, writing that element back
to the array (Fig. 4).
Mapping this parallelism to the CUDA
software architecture requires subdividing
the array into 2D blocks, which are
composed of 16x16 elements (with corresponding
16x16 threads). These will automatically
be queued up and executed on
the SMs. Blocks are overlapped to enable
communication between them, creating a
‘halo’ of boundary source elements (blue
in Fig. 5) loaded along with the elements
to be computed (green in Fig. 5).
Continue to page 2