SIMT Architecture Delivers Double-Precision Teraflops

NVidia’s T10 architecture brings double-precision floating point to the company’s massively parallel computing platform. This graphics processing unit (GPU) architecture is also used in NVidia’s consumer graphics boards. Both product lines are supported by the Compute Unified Device Architecture (CUDA). The Tesla S1070 1U rack-mount system incorporates four of the Tesla T10 boards, each with a single chip containing 240 cores (Fig. 1). The Tesla C1060 resembles these boards, but it plugs into a wide PCI Express x16 slot.

The T10 brings a number of new features to the Tesla line (Fig. 2). It doubles performance, moving to double-precision floating point and packing 4 Gbytes onto the board. It also uses one of the largest production chips around, with 1.4 billion transistors in 240 cores that can churn out 1 teraflop.

The architecture is the same as the earlier single-precision Tesla 8/G80. It is based around a single-instruction, multiple-thread (SIMT) execution model where groups of up to eight threads will execute the same instruction in a thread processing array (TPA). Cores in a TPA share fast access to 16 kbytes of TPA memory. Three TPAs are grouped into each thread processing cluster (TPC), and thread contexts are collected into groups of 32 threads.
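The figures quoted so far imply a simple hierarchy, which a few lines of arithmetic can make explicit. The sketch below uses only numbers stated in the article (240 cores, eight cores per TPA, three TPAs per TPC, 16 kbytes per TPA, 32-thread groups); the function and key names are illustrative, not NVidia terminology.

```python
# Figures taken from the article; names are illustrative only.
CORES_PER_CHIP = 240
CORES_PER_TPA = 8
TPAS_PER_TPC = 3
SHARED_KBYTES_PER_TPA = 16
THREADS_PER_GROUP = 32


def chip_layout():
    """Derive the TPA and TPC counts implied by the article's figures."""
    tpas = CORES_PER_CHIP // CORES_PER_TPA
    tpcs = tpas // TPAS_PER_TPC
    return {
        "tpas": tpas,
        "tpcs": tpcs,
        # Total TPA memory across the chip, in kbytes.
        "total_shared_kbytes": tpas * SHARED_KBYTES_PER_TPA,
        # Passes needed for one 32-thread group to issue one instruction
        # on an eight-core TPA.
        "passes_per_group": THREADS_PER_GROUP // CORES_PER_TPA,
    }


if __name__ == "__main__":
    print(chip_layout())
```

Under these figures the chip works out to 30 TPAs in 10 TPCs, and a full 32-thread group needs four passes through a TPA's eight cores per instruction.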

The thread dispatch unit matches active threads that will execute the same instruction to use as many cores as possible at one time. The maximum throughput is attained when all cores are active each cycle. Different groups of threads can execute their matching instruction on the same TPA in an alternating, sequential fashion. The hardware handles thread scheduling and dispatch. All of the threads of the same priority are either active or waiting for activation.

Execution is very efficient if threads remain in lock-step. This is very common when dealing with arrays. If a group of 32 threads running together hits a branch point and half take the branch, the eight-core TPA can still work nicely with the resulting sets of threads. Obviously, algorithms or data that require many individual threads running different code will not fare as well with this architecture.
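A toy model helps show why an even split costs little while fully divergent code costs a lot. The sketch below is an assumption-laden simplification of the description above, not a model of the actual scheduler: it assumes threads on the same code path are packed onto the eight cores together and that different paths run one after another. The function name `issue_passes` is hypothetical.

```python
from collections import Counter
from math import ceil

CORES_PER_TPA = 8  # from the article


def issue_passes(paths):
    """Return (passes, utilization) for one instruction step.

    `paths` holds, per thread, a label for the code path that thread is on.
    Simplifying assumption: threads on the same path share passes through
    the eight cores, and distinct paths execute sequentially.
    """
    sizes = Counter(paths).values()
    passes = sum(ceil(n / CORES_PER_TPA) for n in sizes)
    utilization = len(paths) / (passes * CORES_PER_TPA)
    return passes, utilization


lockstep = issue_passes(["a"] * 32)                  # all 32 threads agree
half_split = issue_passes(["a"] * 16 + ["b"] * 16)   # half take the branch
scattered = issue_passes(list(range(32)))            # every thread on its own path
```

In this model the lock-step and half-split cases both take four passes with every core busy, while 32 threads on 32 different paths take 32 passes with only one core in eight doing useful work.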

Code can incorporate synchronization points where threads will come together on the same instruction. Synchronization can be important if early termination tests can be performed. Otherwise, the chip tends to run a group of threads as long as possible.
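The idea of a synchronization point can be illustrated on the host with ordinary threads. The sketch below uses Python's `threading.Barrier` as an analogy for a barrier such as CUDA's `__syncthreads()`; it is not GPU code, and the group size and function name are assumptions made for the example.

```python
import threading

N_THREADS = 8  # hypothetical group size for this sketch


def run_with_barrier():
    """Two-phase computation where every thread must finish phase 1 first."""
    partial = [0] * N_THREADS
    totals = [0] * N_THREADS
    barrier = threading.Barrier(N_THREADS)

    def worker(tid):
        partial[tid] = tid * tid    # phase 1: each thread writes its own slot
        barrier.wait()              # synchronization point: all threads meet here
        totals[tid] = sum(partial)  # phase 2: every slot is now safe to read

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Without the barrier, a thread could read `partial` before its neighbors had written their slots; with it, every thread sees the same completed array.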

The latest implementation can have a computation and a one-way data transfer occurring at the same time. The older G80 architecture could perform only one action at a time. The off-chip interface is PCI Express (PCIe) Gen2 with 16 lanes, which peaks at about 8 Gbytes/s in each direction. The 102-Gbyte/s figure applies to the board's on-board memory bandwidth, not to the PCIe link.
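The benefit of overlapping a transfer with computation can be estimated with a two-stage pipeline model. The sketch below assumes the work is split into equal chunks, that each chunk is copied to the board and then processed, and that buffering is not a constraint; the function name and the timing values are hypothetical, not measurements.

```python
def total_time(n_chunks, t_copy, t_compute, overlap):
    """Time to stream n_chunks through the board, in the caller's time units.

    With overlap, the copy of the next chunk proceeds while the current
    chunk is being computed. Without it, every copy and compute is serial.
    """
    if n_chunks == 0:
        return 0
    if not overlap:
        return n_chunks * (t_copy + t_compute)
    # The first copy and the last compute cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return t_copy + (n_chunks - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical workload: 10 chunks, 2 time units to copy, 3 to compute.
serial = total_time(10, 2, 3, overlap=False)
overlapped = total_time(10, 2, 3, overlap=True)
```

For these made-up numbers the serial schedule takes 50 time units and the overlapped one 32. The gain is largest when copy and compute times are similar, since neither stage then sits idle for long.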

Data can be moved more quickly with the faster PCIe links, but most users are more impressed by the 4 Gbytes of on-board storage since most data needs to be in that memory to be used efficiently. Multiboard solutions work well if the data can be spread across the boards with minimal cross-board communication being required.

CUDA hides most of the underlying hardware complexity from the programmer. It depends upon a few C annotations so the CUDA C compiler can better address the multithreading aspects of an application. The system does not handle recursion, and loops and arrays are the norm. This isn’t surprising given the original target for GPUs.
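The loops-and-arrays style can be sketched by emulating, on the host, how a data-parallel kernel maps one thread to one array element. The code below is plain Python, not CUDA C: in CUDA C the per-thread body would carry the `__global__` qualifier and read the built-in `blockIdx`, `blockDim`, and `threadIdx` variables, whereas here they are passed as ordinary arguments. The names `saxpy_kernel` and `launch` are hypothetical.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body that one thread would run: it handles a single array element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(x):                          # guard the partially filled last block
        out[i] = a * x[i] + y[i]


def launch(kernel, n_elements, block_dim, *args):
    """Sequentially emulate a 1-D launch: one kernel call per (block, thread)."""
    n_blocks = (n_elements + block_dim - 1) // block_dim
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 4, 2.0, x, y, out)
```

The explicit loop over the array disappears into the launch: each thread computes its own index and touches one element, which is why every thread executes the same instructions and the work suits the SIMT model.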

CUDA also provides access to CUDPP (CUDA Data Parallel Primitives) as well as a number of vector libraries that support the usual suspects, such as fast Fourier transforms (FFTs). Several third-party companies and projects provide similar libraries. For example, Tech-X’s GPULib provides hooks for Java, Python, MATLAB, and IDL, allowing a wide range of applications to take advantage of NVidia’s GPUs.

The application space is much wider than just graphics, though 3D visualization and analysis are often high on the list. One design in the medical industry from TechniScan performs analysis for the Whole Breast Ultrasound scanner. Four Tesla T10 boards can analyze a scan in 15 minutes, compared to a much more expensive, 16-core cluster that takes three times as long to handle the same job.

The CUDA C compiler is a free but not open-source download, though many of the projects in NVidia’s CUDA Zone are open-source. The interface specifications are open. CUDA can also generate code for conventional multicore platforms, though usually with lower performance benefits than a GPU can provide.

Developers can build applications using CUDA and run them on platforms such as NVidia’s GeForce 8 series. These are only single-precision platforms, and the new Tesla boards bring more memory and cores to bear, but they can run the same applications. The current drivers from NVidia for the company’s graphics boards all support CUDA applications and development.

Several universities already use CUDA in parallel programming classes and projects. It should be interesting to see how parallel processing grows now that many developers can tap the power in their NVidia multicore graphics boards.

WILLIAM WONG

NVIDIA
www.nvidia.com
