[Leapfrog: First Look]
SIMT Architecture Delivers Double-Precision Teraflops
William Wong
ED Online ID #19280
July 10, 2008
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
NVidia’s T10 architecture brings double-precision
floating point to the company’s massively parallel
computing platform. This graphics processing
unit (GPU) architecture also is used in NVidia’s
consumer graphics boards. Both are supported by the
Compute Unified Device Architecture (CUDA). The Tesla
S1070 1U rack-mount system incorporates four of the
Tesla T10 boards, each with a single chip containing 240
cores (Fig. 1). The Tesla C1060 resembles these boards,
but it plugs into a wide PCI Express x16 slot.
The T10 brings a number of new features to the Tesla
line (Fig. 2). It doubles performance, adds double-precision
floating point, and packs 4 Gbytes onto the
board. It also uses one of the largest production
chips around, with 1.4 billion transistors in 240
cores that can churn out 1 teraflop.
The architecture is the same as the earlier
single-precision Tesla 8/G80. It is based around
a single-instruction, multiple-thread (SIMT) execution
model in which groups of up to eight threads
execute the same instruction in a thread processing
array (TPA). Cores in a TPA share fast access to 16 kbytes
of TPA memory. Three TPAs are grouped into a thread processing
cluster (TPC), and thread contexts are collected
into groups of 32 threads.
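
The core counts above imply a simple hierarchy. The sketch below works through that arithmetic using only the figures quoted in the article (240 cores, eight cores per TPA, three TPAs per TPC, 32 threads per group). It assumes the cores divide evenly and that a 32-thread group is issued in eight-wide passes; both are inferences from the text, not a published NVidia breakdown, and the function names are illustrative.

```python
import math

# Figures quoted in the article.
TOTAL_CORES = 240
CORES_PER_TPA = 8
TPAS_PER_TPC = 3
THREADS_PER_GROUP = 32


def hierarchy(total_cores=TOTAL_CORES, cores_per_tpa=CORES_PER_TPA,
              tpas_per_tpc=TPAS_PER_TPC):
    """Return (number of TPAs, number of TPCs) implied by the core counts."""
    tpas = total_cores // cores_per_tpa
    tpcs = tpas // tpas_per_tpc
    return tpas, tpcs


def passes_per_group(threads=THREADS_PER_GROUP, cores=CORES_PER_TPA):
    """Eight-wide passes needed to issue one instruction to a whole group."""
    return math.ceil(threads / cores)


if __name__ == "__main__":
    tpas, tpcs = hierarchy()
    print(f"{tpas} TPAs in {tpcs} TPCs")                     # 30 TPAs in 10 TPCs
    print(f"{passes_per_group()} passes per 32-thread group")  # 4 passes
```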
The thread dispatch unit matches active threads that will
execute the same instruction to use as many cores as
possible at one time. The maximum throughput
is attained when all cores are active each cycle. Different
groups of threads can execute their matching instruction
on the same TPA in an alternating, sequential fashion.
The hardware handles thread scheduling and dispatch. All
of the threads of the same priority are either active or waiting
for activation.
Execution is very efficient if threads remain in lock-step.
This is very common when dealing with arrays. If a group
of 32 threads running together hits a branch point and half
take the branch, the eight-core TPA can still work nicely
with the resulting sets of threads. Obviously, algorithms or
data that require many individual threads running different
code will not fare as well with this architecture.
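
The effect of branching can be illustrated with a toy cost model. The sketch assumes, as the description above suggests, that the dispatch unit can pack any threads waiting on the same instruction into eight-wide passes. That is an idealization for illustration, not a statement about how the hardware actually schedules, and `passes_needed` is a hypothetical helper.

```python
import math
from collections import Counter

CORES_PER_TPA = 8  # from the article


def passes_needed(paths, cores=CORES_PER_TPA):
    """Count eight-wide passes for one step of a thread group.

    `paths` holds one label per thread, naming the instruction that thread
    wants to execute next. Threads with the same label can share a pass.
    """
    counts = Counter(paths)
    return sum(math.ceil(n / cores) for n in counts.values())


lockstep = ["add"] * 32                          # all 32 threads agree
half_branch = ["then"] * 16 + ["else"] * 16      # half take the branch
all_different = [f"path{i}" for i in range(32)]  # every thread diverges

if __name__ == "__main__":
    print(passes_needed(lockstep))       # 4
    print(passes_needed(half_branch))    # 4
    print(passes_needed(all_different))  # 32
```

In this model an even split costs no extra passes because each half still fills the eight cores, while 32 unrelated code paths leave seven of the eight cores idle on every pass.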
Code can incorporate synchronization points where threads will
come together on the same instruction. Synchronization can be
important if early termination tests can be performed. Otherwise,
the chip tends to run a group of threads as long as possible.
The latest implementation can have a computation and a one-way data
transfer occurring at the same time. The older G80 architecture could
perform only one action at a time. The off-chip interface is PCI Express
(PCIe) Gen2 with 16 lanes, which peaks at roughly 8 Gbytes/s in each
direction. The 102-Gbyte/s figure quoted for the board matches its
on-board memory bandwidth rather than the PCIe link.
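
The benefit of overlapping a transfer with computation can be estimated with a simple pipeline model. The sketch below assumes the work is split into equal chunks, each copied to the board and then processed, and that only the inbound copy overlaps with compute. The chunk timings are made-up illustrative values, not measurements.

```python
def total_time(chunks, t_copy, t_compute, overlap):
    """Time to process `chunks` equal pieces, each copied in and then computed.

    Without overlap every copy and every compute runs back to back. With
    overlap, the copy of the next chunk runs while the current chunk is
    being computed, so only the slower of the two stages sets the pace.
    """
    if chunks <= 0:
        return 0.0
    if not overlap:
        return chunks * (t_copy + t_compute)
    # First copy and last compute cannot be hidden; the rest is pipelined.
    return t_copy + (chunks - 1) * max(t_copy, t_compute) + t_compute


if __name__ == "__main__":
    # Hypothetical timings: 1 time unit to copy a chunk, 3 to process it.
    print(total_time(4, 1.0, 3.0, overlap=False))  # 16.0
    print(total_time(4, 1.0, 3.0, overlap=True))   # 13.0
```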
Data can be moved more quickly with the faster PCIe
links, but most users are more impressed by the 4 Gbytes
of on-board storage since most data needs to be in that
memory to be used efficiently. Multiboard
solutions work well if the data can be spread
across the boards with minimal crossboard
communication being required.
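
To see why keeping data resident on the board matters, it helps to estimate how long the PCIe link needs to fill 4 Gbytes. The sketch below assumes the PCIe Gen2 signaling rate of 5 GT/s per lane with 8b/10b encoding (figures from the PCIe specification, not from this article) and ignores protocol overhead, so real transfers will be slower.

```python
def pcie_gbytes_per_s(lanes=16, gtransfers_per_s=5.0, encoding=8 / 10):
    """Peak one-direction payload rate in Gbytes/s for a PCIe link.

    Defaults assume PCIe Gen2: 5 GT/s per lane and 8b/10b encoding,
    which carries 8 data bits in every 10 bits on the wire.
    """
    return lanes * gtransfers_per_s * encoding / 8  # 8 bits per byte


def fill_seconds(gbytes, link_gbytes_per_s):
    """Best-case time to move `gbytes` of data across the link."""
    return gbytes / link_gbytes_per_s


if __name__ == "__main__":
    rate = pcie_gbytes_per_s()
    print(f"x16 Gen2 link: about {rate:.1f} Gbytes/s per direction")
    print(f"Filling 4 Gbytes: about {fill_seconds(4, rate):.2f} s")
```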
CUDA hides most of the underlying hardware
complexity from the programmer.
It depends upon a few C annotations
so the CUDA C compiler can better
address the multithreading aspects of
an application. The system does not
handle recursion, and loops and arrays
are the norm. This isn’t surprising given
the original target for GPUs.
Also, CUDA provides access to cuDPP
(Data Parallel Primitives) as well as a
number of vector libraries that support
the usual suspects, such as fast Fourier transforms (FFTs). Several third-party companies and
projects provide similar libraries. For example, Tech-X’s
GPULib provides hooks for Java, Python, Matlab, and IDL,
allowing a wide range of applications to take advantage of
NVidia’s GPU.
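
Prefix sum (scan) is a typical example of a data-parallel primitive. The sketch below shows the idea in plain Python using the well-known Hillis-Steele formulation. It is an illustration of the technique only, not cuDPP's API or implementation, and the inner loop stands in for what would be one thread per element on a GPU.

```python
def inclusive_scan(values):
    """Inclusive prefix sum using the Hillis-Steele step-doubling scheme.

    Each outer step reads only from the previous step's snapshot, so every
    element update within a step is independent of the others. That
    independence is what lets a GPU give each element its own thread.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = list(result)  # snapshot of the previous step
        for i in range(offset, len(result)):  # conceptually parallel
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
    return result


if __name__ == "__main__":
    print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
    # [3, 4, 11, 11, 15, 16, 22, 25]
```

The scan finishes in about log2(n) steps rather than n, at the cost of more total additions, which is a reasonable trade when hundreds of cores are available.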
Still, the application space is much wider than just graphics,
though 3D visualization and analysis are often high on
the list. One design in the medical industry from TechniScan
performs analysis for the Whole Breast Ultrasound scanner.
Four Tesla T10 boards can analyze a scan in 15 minutes,
compared to a much more expensive, 16-core cluster that
takes three times as long to handle the same job.
The CUDA C compiler is a free but not open-source
download, though many of the projects in NVidia’s CUDA
Zone are open-source. The interface specifications are
open. CUDA can also generate code for conventional multicore
platforms, though usually with lower performance
benefits than a GPU can provide.
Developers can build applications using CUDA and run
them on platforms such as NVidia’s GeForce 8 series. These
are single-precision-only platforms, and the new Tesla boards
bring more memory and cores to bear, but the GeForce boards can
run the same applications. The current drivers from NVidia for
the company’s graphics boards all support CUDA applications
and development.
Several universities already use CUDA in parallel programming
classes and projects. It should be interesting to see how
parallel processing grows now that many developers can tap
the power in their NVidia multicore graphics boards.
NVIDIA
www.nvidia.com