William Wong
|
ED Online ID #20640 |
February 4, 2009
GPUs: Not Just For Graphics Anymore While Nvidia has a number of products, including a mobile embedded platform, it is primarily known for its graphics boards. Lately, it has also been known as a purveyor of its multicore GPUs into the more general computing space. Its GPU chip uses a single instruction multiple thread (SIMT) architecture (read “SIMT Architecture Delivers Double-Precision Teraflops”), which is unique and it lends itself to group task-style multithreading. That is a good match for a range of applications beyond graphics rendering.
This has led to a whole line of boards from Nvidia that are dedicated to computing like the Tesla C1060 PCI Express board. The C1060 is much like the GTX 260 I tested except that it has no display output. The GTX 260 contains 192 1.2 GHz processing cores plus 896 Mbytes of RAM. It is one of the heftier PCI Express-based graphics adapters that requires the extra PCI Express power cable. The new top of the line is the GTX 295. If you have the right motherboard you can put in more than one board and connect them together in a cooperative fashion for driving one or more graphics displays. This same, over the top, cable connection is not required if the units are being used as compute platforms.
I won’t get into a detailed review of the GTX 260 since there are more gaming magazines and Web sites that have done more detailed overviews of this and its later siblings. It can handle 3D games with true 3D output via LCD shutter glasses, if you have the right display hardware. The results are great but, that’s for another article.
What I did look at was the CUDA support provided by this board. CUDA is an acronym for Compute Unified Device Architecture but no one ever uses that mouthful so CUDA it is. I took a look at CUDA alone but in the future OpenCL will likely be the support that most will talk about. OpenCL is Apple’s brainchild, but it’s being adopted by Nvidia as part of CUDA. The big question will be what will wind up where.
For this discussion I concentrated on exercising CUDA.
So what is CUDA? Well, check out Nvidia’s Web site for the details, but essentially it is a programming environment that provides controlled access to the GPU on the GTX 260 or any other late-model Nvidia graphics chip or platform like Tesla. Typically a GTX 260 would handle graphics display chores but, even running Vista’s Aero it usually has more horsepower left over for other things. It is only when playing computationally or graphically demanding 3D games that the GTX 260 really gets a workout. The rest of the time the chip is like most PCs, idle.
In the past, the graphics processor was off limits, hidden behind a device driver. Well, the device driver is still there, but now it lets a lot more in and out. In particular, it can support CUDA applications and runtime libraries. It can do this to the exclusion of the graphics display support hence the Tesla line with no graphics output connectors or it can be used on demand as in a multi card GTX 260 installation.
This had advantages in 3D gaming as well as other areas because those 192 cores can be used for other chores such as physics simulation support in games or computations that analyze 3D ultrasonic information from a breast scanning machine. The latter turned a multi-day processing chore on a multicore x86 platform into something like a fifteen minute wait, providing nearly real-time medical feedback using four Tesla boards.
So, to take advantage of CUDA I simply popped the x16 PCI Express GTX 260 video board into a desktop PC; installed the latest drivers; downloaded the CUDA development software, and I was in business. I was able to test some CUDA applications simply by having the board and device driver installed. Nothing special is required. This will be the same kind of support that is needed for incorporating more advanced physics engines in games just as an example.
Things get more interesting from a programmers standpoint when writing applications to take advantage of a CUDA-enabled system. The first step was to use the array manipulation runtime support available with the CUDA tools. This is relatively easy, as you can essentially ignore the SIMT architecture, but you will have to get used to the memory allocation functions and other runtime support provided by the library, because data must be moved into the GPU’s memory for processing. This is something that always happens with a graphics display, but it is effectively hidden from the programmer. Using CUDA requires this to be a little more explicit.
Other than that, you have access to filters and other array processing functions, just as you would with a conventional math library. This same type of support is utilized by tools such as The Mathwork’s Matlab when using CUDA as a backend processor for array operations.
The next step was to move closer into the CUDA framework and take advantage of the extensions to the C compiler. This took a bit more investigation into the documentation. CUDA is not hard to work with, but it is involved. I don’t purport to have any great insights into the extensions, given the limited amount of time I have used the system, but it will quickly become second nature to any programmer. For the most part, the extensions are easy to understand and utilize.
As with TBB, the first thing to master is memory management, including the use of CUDA arrays. While TBB works with arrays in memory, CUDA must assume that data might have to be copied to a work area such as the GPU’s onboard memory. As noted, CUDA can run on platforms that support TBB, but the idea is to make the interface transparent so the application can run on either platform. Next comes the parallelization and synchronization syntax and semantics. There are also the runtime library functions if you plan on using those.
Plan on a couple days to get a system set up and to read through the docs. This is definitely a time when reading helps. There are plenty of demo apps on the CUDA Web site and incorporation of the CUDA compiler and support is no more than a make file configuration away.
I did not get into the debugging aspects in much detail since I was primarily working with the demo code and applications. There is an emulator where debugging is significantly easier than on the hardware. Otherwise single stepping is not an option. The emulator is great for small applications and debugging a small collection of functions, but it would be interesting to get some feedback from Nvidia developers or others that have worked on larger applications and what debugging techniques were used on large applications. Issues like deadlock are just the tip of the iceberg that will require new tools and techniques to be mastered those moving into parallel programming.
CUDA is going to make a major difference in a number of areas from gaming to medical technology. The fact that it can often be found on a desktop is a definite plus and the performance increase can be substantial. Still, the SIMT architecture is amenable to only some parallel processing chores but it is definitely a platform worth investigating if you have to crunch numbers. There is now a reason to get a great gaming board on your PC other than to play games.
Little Cores. Big Ambitions I took a look at the XMos XDK development kit (see “Multicore And Soft Peripherals Target Multimedia Applications”) last year. It is still the platform of choice for those looking to take full advantage of the chip, but the new XC-1 contains the same XS1-G4 chip. It also costs significantly less at $99.
The same Eclipse-based development tools come with this new platform. The main difference will be some of the peripheral support including external connections. The peripheral ports are brought out but the headers are not installed. Luckily we have a couple soldering irons here in the lab, so adding standard headers and creating ribbon cables is easy.
The board has a USB interface to link it to the PC-based development tools. If you want to use the built-in LEDs that encircle the chip and on-board switches and concentrate on software development, then unpacking and plugging in the system will take minutes. Getting the software installed will take a little longer but you should easily have the test applications up and running within an hour or two.
Getting up and running with the XC-1 is trivial with a Windows PC. Plug in the USB connection—which also powers the board; install the drivers and IDE should be up and running in under an hour.
The challenge will comes after trying the demo applications and running through the online documentation and Quick Start Guide. Communication between tasks is done using channels. These can be mapped to the hardware channels that link cores and chips in a multicore environment. The interface ports will be a bit more familiar to embedded developers, but the XS1-G4 is designed for bit-banging with soft peripherals. It is a bit much to explain here so check out the XMos Web site and downloadable documentation. This approach takes advantage of the task hardware scheduling provided by the chip.
Programming the XC-1 tends to be easier than something like the GTX 260 because you use a regular C or C++ compiler for most work. If you are working on device drivers or you want to take advantage of the scheduling or communication hardware then the XC compiler will come into play. There are limited parallel programming and communication functions to deal with—so there is a relatively small learning curve.
XMos also has a simulator (xsim and xiss) that you can take advantage of. The chip and interface are fast enough that hands-on work is easily done with the board.
The XC-1 is the best way to check out the software and tools for the XS1-G4 chip. Pull out the soldering iron if interfacing comes into play or check out the more expensive XDK.