[Engineering Feature]
Games Flourish In A Parallel Universe
Multicore processors accelerate games if developers can take advantage of the features and live with the limitations.
WE DON'T NEED NO STINKIN' CACHE
The PlayStation 3 uses IBM's Cell processor (Fig. 2). The Cell has a Power architecture core, called the Power Processor Element (PPE), similar to the ones used in Microsoft's multicore solution (see "Cell Processor Gets Ready To Entertain The Masses"). But the Cell's core is designed to manage a set of eight synergistic processor elements (SPEs).
The PPE is a Power architecture core that includes caching. It often runs a typical operating system like Linux. Applications running on the PPE coordinate the operation of the SPEs in addition to executing parts of the application that may not benefit from the SPEs' parallel nature.
The Cell chip's architecture and layout are very different from those of the Xbox chip, primarily due to the lack of caches on the SPEs. Without caches, designers were able to give each of the eight SPEs its own 256 kbytes of RAM, which is used for code and data storage.
The Xbox chip can run six threads on three processors, while each SPE is single-threaded. One big difference between the two approaches is that the SPE operation is deterministic, a feature not possible with a cache. Determinism is key in some environments, especially gaming.
There's a tradeoff, though. Programmers or programming tools need to account for the different access times. Access to the 256 kbytes of local memory is the fastest. There's a small overhead to access another SPE's memory, but a significant amount of overhead to access main memory. In addition, DMA transfers between main memory and an SPE's memory are fast, but it takes time to set up the transfers.
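The effect of the DMA setup time can be illustrated with a simple cost model: a fixed setup cost per transfer plus a size-dependent transfer time. This is only a sketch. The function names and the numbers used with it are hypothetical placeholders, not measured Cell figures.

```python
def transfer_time_us(num_bytes, setup_us, bytes_per_us):
    """Time for one DMA transfer: fixed setup cost plus size divided by bandwidth."""
    return setup_us + num_bytes / bytes_per_us


def total_time_us(total_bytes, transfer_bytes, setup_us, bytes_per_us):
    """Time to move total_bytes using transfers of at most transfer_bytes each."""
    full, rest = divmod(total_bytes, transfer_bytes)
    time = full * transfer_time_us(transfer_bytes, setup_us, bytes_per_us)
    if rest:
        # The final, shorter transfer still pays the full setup cost.
        time += transfer_time_us(rest, setup_us, bytes_per_us)
    return time


if __name__ == "__main__":
    # Hypothetical values: 1 us of setup, 1000 bytes moved per microsecond.
    for size in (1000, 4000, 16000):
        print(size, total_time_us(16000, size, setup_us=1.0, bytes_per_us=1000.0))
```

Under this model, moving the same amount of data in fewer, larger transfers pays the setup cost fewer times, which is one reason to load most of a task's data in a single step.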
The SPE architecture affects how applications are coded to take advantage of the multiple units as well as contend with the memory limitations. Overall, programmers take what Peter Hofstee, Distinguished Engineer with IBM, calls a shopping-list technique when it comes to scheduling. SPEs are given jobs from a list and come back for another upon completion of their current job.
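The shopping-list idea maps onto a shared job queue. The sketch below uses ordinary Python threads to stand in for SPEs, so it shows only the scheduling pattern, not real SPE programming. The name `run_shopping_list` and the worker count are assumptions made for illustration.

```python
import queue
import threading

NUM_WORKERS = 8  # stands in for the eight SPEs; plain threads, not real SPE contexts


def run_shopping_list(jobs, num_workers=NUM_WORKERS):
    """Run each callable in `jobs`; idle workers pull the next job from a shared list."""
    todo = queue.Queue()
    for index, job in enumerate(jobs):
        todo.put((index, job))

    results = [None] * len(jobs)

    def worker():
        while True:
            try:
                index, job = todo.get_nowait()  # come back for another job
            except queue.Empty:
                return  # the list is empty, so this worker is done
            results[index] = job()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    jobs = [lambda n=n: n * n for n in range(20)]
    print(run_shopping_list(jobs))
```

Because workers pull jobs rather than having them assigned up front, a worker that finishes a short job early simply takes the next one, which tends to balance the load without a central schedule.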
Two different approaches can be used to deliver code and data from the list. The first essentially splits an application into chunks that will fit into an SPE. DMA is used to bring in the necessary code and working set data (Fig. 3). The SPE code may access other memory, but most of the data is loaded when the chunk starts. Data may be written back to main memory when the task is done. Often, the task runs to completion, and then the next chunk is loaded.
The chunking approach can be used to handle streams of data. For example, game programs typically process data for a display frame. This processing can be split into chunks, and the chunks are then distributed among the SPEs. A single frame may be broken up into more chunks than SPEs. Consequently, it's simply a matter of running the chunks through the SPEs at a rate fast enough to complete a frame in time to display it.
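As a rough sketch of the chunking step, the function below splits a buffer of fixed-size items into byte ranges that fit a local-memory budget. The 256-kbyte figure comes from the text above. The 64 kbytes reserved for code and stack, and the function name, are assumptions chosen only for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local memory, as described in the article
CODE_RESERVE_BYTES = 64 * 1024  # hypothetical share set aside for code and stack


def split_into_chunks(total_bytes, item_bytes,
                      budget=LOCAL_STORE_BYTES - CODE_RESERVE_BYTES):
    """Return (offset, length) byte ranges that hold whole items and fit the budget."""
    if item_bytes <= 0 or item_bytes > budget:
        raise ValueError("an item must have a positive size and fit in the budget")
    items_per_chunk = budget // item_bytes
    chunk_bytes = items_per_chunk * item_bytes
    return [(offset, min(chunk_bytes, total_bytes - offset))
            for offset in range(0, total_bytes, chunk_bytes)]


if __name__ == "__main__":
    # A hypothetical 4-Mbyte frame buffer made of 32-byte items.
    chunks = split_into_chunks(4 * 1024 * 1024, 32)
    print(len(chunks), "chunks; first:", chunks[0], "last:", chunks[-1])
```

The resulting list can easily be longer than the number of SPEs, which is the case the article describes: the chunks are simply fed through the available units until the frame is complete.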
The other approach is similar to chunking, but either the code or data stays in place. For instance, an application applies the same algorithm to a stream of data. The code is loaded once into an SPE, and then data is moved in and out as it's processed. The flip side is a chunk of data that's transformed by some code and then another and another. Double-buffering reduces the amount of data or code that can be swapped at one time, because local memory must hold two buffers, but it may improve efficiency by letting one transfer proceed while the other buffer is in use.
This swapping approach was quite common in the past when memory was at a premium. Think back to Fortran COMMON statements or Basic program switching on minicomputers.
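The case where code stays resident and data streams through can be sketched with double-buffering. Here a single background thread stands in for the DMA engine, and `fetch` and `process` are hypothetical callables representing a transfer from main memory and the resident SPE code. This illustrates the overlap pattern only and is not Cell-specific code.

```python
from concurrent.futures import ThreadPoolExecutor


def stream_double_buffered(fetch, process, num_blocks):
    """Process blocks 0..num_blocks-1, fetching block i+1 while block i is processed."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)  # fill the first buffer
        for i in range(num_blocks):
            current = pending.result()  # wait for the in-flight transfer
            if i + 1 < num_blocks:
                pending = dma.submit(fetch, i + 1)  # start filling the other buffer
            results.append(process(current))  # compute overlaps the transfer
    return results


if __name__ == "__main__":
    blocks = stream_double_buffered(
        fetch=lambda i: list(range(i * 4, i * 4 + 4)),
        process=sum,
        num_blocks=3,
    )
    print(blocks)
```

At most two blocks are live at any moment, the one being processed and the one being fetched, which is why each buffer can be only about half the size of a single-buffered one.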
Code and data can be pushed into SPEs or pulled in by code running on the SPEs. This type of software-based caching can vary significantly from one application to another, but it puts the control into the hands of programmers instead of the hardware. Huge benefits can be derived from software caching and SPE communication if the hardware is brought to bear on a problem. IBM optimized a ray tracing program that caches seven different kinds of data blocks among multiple SPEs. The end result: performance improved by almost an order of magnitude.
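A software cache of the pull variety might look like the sketch below. It keeps a fixed number of blocks and evicts the least recently used one. The LRU policy, the class name, and the `load_block` callable (standing in for a DMA fetch from main memory) are all assumptions. The article does not describe the policy used in IBM's ray tracer, and a real application would choose whatever policy suits its access pattern.

```python
from collections import OrderedDict


class SoftwareCache:
    """Tiny LRU block cache; `load_block` stands in for a DMA fetch from main memory."""

    def __init__(self, load_block, capacity_blocks):
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be at least 1")
        self._load_block = load_block
        self._capacity = capacity_blocks
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self._blocks:
            self.hits += 1
            self._blocks.move_to_end(block_id)  # mark as most recently used
            return self._blocks[block_id]
        self.misses += 1
        data = self._load_block(block_id)  # slow path: go to main memory
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data


if __name__ == "__main__":
    cache = SoftwareCache(load_block=lambda block_id: block_id * 10, capacity_blocks=2)
    for block_id in (1, 2, 1, 3, 2, 3):
        cache.get(block_id)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Because the policy lives in software, the programmer can change the capacity, the eviction rule, or the block size per data type, which is the kind of control the hardware-cache approach does not offer.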
Of course, the goal of compiler designers like those at Codeplay is to automate the partitioning, swapping, and related steps (see "C/C++ Compiler Targets Multicore Chips"). The company's Sieve system targets platforms such as IBM's Cell, but it works equally well for SMP platforms (see "Going Multicore With Sieve").
The compiler hides the underlying differences. Conventional multithreaded programming tools remain useful, but there's always room for improvement (see "Multithreading: It's Not New!"). Middleware (e.g., OpenMP) usage will likely increase to take advantage of the greater processing power, especially with multiplayer games.
Both the Xbox 360 and PlayStation 3 incorporate additional acceleration features, especially within their respective GPUs. In both cases, though, it's a matter of balance to make optimum use of these features.