[Engineering Feature]
Games Flourish In A Parallel Universe
Multicore processors accelerate games if developers can take advantage of the features and live with the limitations.
William Wong
ED Online ID #15745
June 21, 2007
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Gaming platforms like
Microsoft's Xbox 360 (see figure)
and Sony's PlayStation 3 (see figure)
push the envelope when
it comes to graphics and
computation, delivering
sophisticated and realistic games. Their latest multicore 64-bit processing architectures let programmers create sophisticated,
multithreaded applications.
The computational processors are tightly
integrated with the graphical processing units,
minimizing system response time for a better
gaming experience. Even small delays can disrupt the flow of a game or its multimedia presentation. Performance and balance on both the
hardware and software fronts will provide an
optimal gaming experience.
Gamers tend to grade a system on the basis
of the game's playing capabilities, regardless of
how well it takes advantage of the underlying
hardware. Still, looking under the hood shows
each system's potential. As with most programming platforms, applications rarely take full
advantage of the hardware the first time
around. It takes time to learn about system
idiosyncrasies and to mold application frameworks to exploit the hardware.
Game developers have an additional challenge because game vendors often target multiple platforms with the same game. Obviously,
this is desirable from a vendor's perspective,
because it widens the market. Unfortunately,
even slight differences in platforms or their capabilities can significantly impact the software.
The differences between Microsoft's and
Sony's platforms are quite substantial, so a
seemingly minor problem potentially becomes
major. The Xbox 360 uses a more conventional
symmetric multiprocessing (SMP) architecture.
Sony's PlayStation 3 is built on IBM's Cell
processor. The Cell forgoes large caches
for its eight Synergistic Processor Elements
(SPEs), forcing application programmers to use
software-based caching support.
THE SYMMETRICAL APPROACH
Microsoft developed a multicore chip, with
IBM, based on the Power architecture (Fig. 1).
Its three 3.2-GHz processing cores are identical
and have their own 32-kbyte L1 instruction
and data caches. The two-way, set-associative
caches include parity error checking on the
128-byte lines.
Each core can run two threads. The processing cores share a 1-Mbyte L2 cache, and this cache has an interesting architecture. Half of
the cache runs at the processors' clock frequency, while the rest of the L2 cache runs
at 1.6 GHz. Things become more interesting with the addition of a new instruction called
Extended Data Cache Block Touch.
The instruction is designed to prefetch data from main memory directly into the L1 cache. It's often easier to take advantage of this instruction in a gaming environment, where the size and use of data are well-defined. Because the prefetched data doesn't pass through the L2 cache, the instruction reduces L2 thrashing, and it can be used to quickly build up a thread's working set. In a conventional processor, the working set is brought in incrementally, slowing down the overall thread
operation.
The processing chip accesses main memory through the
front-side bus connected to the graphics chip. The front-side
bus runs at 5.4 GHz with a bandwidth of 21.6 Gbytes/s. The
graphics chip provides a unified memory system to the on-chip graphics processing unit (GPU) and the Power cores in
the processing chip. The GPU can read data directly from the
L2 cache for even better interaction with application code.
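To put the quoted front-side bus figure in per-frame terms, here is a small arithmetic sketch. The 21.6-Gbytes/s number comes from the text above; the frame rates, the decimal interpretation of "Gbyte," and the function name are assumptions made for illustration. The result is an upper bound that ignores protocol overhead and contention.

```python
def bytes_per_frame(bandwidth_gbytes_per_s: float, frames_per_s: float) -> float:
    """Upper bound on the bytes a bus can move during one frame."""
    return bandwidth_gbytes_per_s * 1e9 / frames_per_s


FSB_GBYTES_PER_S = 21.6  # figure quoted in the article

if __name__ == "__main__":
    for fps in (30, 60):  # assumed frame rates, not taken from the article
        mbytes = bytes_per_frame(FSB_GBYTES_PER_S, fps) / 1e6
        print(f"{fps} frames/s: about {mbytes:.0f} Mbytes per frame")
```

At an assumed 60 frames/s, that works out to roughly 360 Mbytes of bus traffic per frame at best, shared by everything that crosses the bus.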
The processors also support cacheable and cache-inhibited
store operations, which are handled by different pipelines.
The cacheable operations use eight store-gathering, nonsequential buffers per core, while the non-cacheable operations use four sequential buffers. By understanding these
instructions, developers can optimize their applications.
For example, data written to main memory for use by the
GPU will often benefit from bypassing the cache if the application threads no longer need to access this data. Running the
data through the cache would simply flush data that might be
useful later. However, the cache isn't the only concern for software developers.
Each processing core includes a VMX128 (Vector/SIMD
Multimedia eXtension) unit. The VMX128 was specifically designed to accelerate 3D graphics and game
physics. Developers can benefit from this feature
because it was built on the VMX accelerator, which
is already found in many Power architecture cores
like those in Apple's G4 and G5 Power Macs.
Enhancing SIMD support in a compiler is a relatively straightforward process and typically allows a
programmer to exploit the underlying hardware
without significantly modifying the software.
There are significant advantages to
Microsoft's more conventional gaming
hardware approach. SMP with multilevel,
transparent coherent caches is standard fare
on PCs. Thus, it's significantly easier to develop multithreaded applications that will run on
different platforms, often with minimal application architectural changes other than recompilation. The same is true
for utilization of VMX128, since this support is often hidden by the compiler.
WE DON'T NEED NO STINKIN' CACHE
The PlayStation 3 uses IBM's Cell processor (Fig. 2). The Cell has a Power architecture core, called the Power Processor Element
(PPE), similar to the ones used in Microsoft's multicore solution (see "Cell Processor Gets Ready To Entertain The Masses"). But the
Cell's core is designed to manage a set of eight synergistic
processor elements (SPEs).
The PPE is a Power architecture core that includes caching.
It often runs a typical operating system like Linux. Applications running on the PPE coordinate the operation of the SPEs
in addition to executing parts of the application that may not
benefit from the SPEs' parallel nature.
The Cell chip's architecture and layout are very different
from the Xbox chip's, primarily due to the lack of caches on the
SPEs. Instead, each of the eight SPEs has 256 kbytes of local RAM.
This RAM is used for code and
data storage.
The Xbox chip can run six threads on three processors,
while each SPE is single-threaded. One big difference between
the two approaches is that the SPE operation is deterministic,
a feature not possible with a cache. Determinism is key in
some environments, especially gaming.
There's a tradeoff, though. Programmers or programming
tools need to account for the different access times. Access to
the 256 kbytes of local memory is the fastest. There's a small
overhead to access another SPE's memory, but a significant
amount of overhead to access main memory. In addition,
DMA transfers between main memory and an SPE's memory are
fast, but it takes time to set up the transfers.
The SPE architecture affects how applications are coded to
take advantage of the multiple units as well as contend with
the memory limitations. Overall, programmers take what
Peter Hofstee, Distinguished Engineer with IBM, calls a
shopping-list technique when it comes to scheduling. SPEs are
given jobs from a list and come back for another upon completion of their current job.
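The shopping-list idea can be sketched with ordinary threads standing in for SPEs. This is a minimal illustration, not Cell code: the function name, the use of Python's queue and threading modules, and the default of eight workers (matching the SPE count in the text) are all assumptions.

```python
import queue
import threading


def run_shopping_list(jobs, worker_count=8):
    """Run every job once; each worker takes the next job when it is free."""
    todo = queue.Queue()
    for index, job in enumerate(jobs):
        todo.put((index, job))

    results = [None] * len(jobs)

    def worker():
        while True:
            try:
                index, job = todo.get_nowait()  # come back for another job
            except queue.Empty:
                return  # the list is empty, so this worker is done
            results[index] = job()  # each worker writes only its own slot

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    jobs = [lambda n=n: n * n for n in range(20)]
    print(run_shopping_list(jobs))
```

Because workers pull jobs rather than having jobs assigned up front, a worker that finishes a short job early simply picks up the next one, which keeps the units busy when job lengths vary.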
Two different approaches can be used to deliver code and
data from the list. The first essentially splits an application into chunks that will fit into an SPE. DMA is used to bring in
the necessary code and working set data (Fig. 3). The SPE
code may access other memory, but most of the data is loaded
when the chunk starts. Data may be written back to main
memory when the task is done. Often, the task runs to completion, and then the next chunk is loaded.
The chunking approach can be used to handle streams of data.
For example, game programs typically process data for a display
frame. This processing can be split into chunks, and the chunks
are then distributed among the SPEs. A single frame may be broken up into more chunks than SPEs. Consequently, it's simply a
matter of running the chunks through the SPEs at a rate fast
enough to complete a frame in time to display it.
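The frame-rate arithmetic behind chunking can be sketched as follows. The 256-kbyte local store and the eight SPEs come from the text; the workload size, the choice to use half the local store per chunk, the per-chunk timing, and all function names are hypothetical. The model also assumes every chunk takes the same time and ignores DMA setup and transfer time.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local memory, from the article
SPE_COUNT = 8                   # SPEs on the Cell, from the article


def split_into_chunks(total_bytes, chunk_bytes):
    """Return (offset, size) pairs that cover total_bytes in order."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [(offset, min(chunk_bytes, total_bytes - offset))
            for offset in range(0, total_bytes, chunk_bytes)]


def rounds_needed(chunk_count, spe_count=SPE_COUNT):
    """Number of passes if every SPE handles one chunk per pass."""
    return -(-chunk_count // spe_count)  # ceiling division


def frame_fits(chunk_count, seconds_per_chunk, frames_per_s, spe_count=SPE_COUNT):
    """True if all chunks finish within one frame period."""
    busy = rounds_needed(chunk_count, spe_count) * seconds_per_chunk
    return busy <= 1.0 / frames_per_s


if __name__ == "__main__":
    # Assumed workload: 3 Mbytes of frame data, half the local store per chunk.
    chunks = split_into_chunks(3 * 1024 * 1024, LOCAL_STORE_BYTES // 2)
    print(len(chunks), "chunks,", rounds_needed(len(chunks)), "rounds")
    print("fits 60 frames/s at 4 ms per chunk:", frame_fits(len(chunks), 0.004, 60))
```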
The other approach is similar to chunking, but either the
code or data stays in place. For instance, an application
applies the same algorithm to a stream of data. The code is
loaded once into an SPE, and then data is moved in and out as
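it's processed. The flip side is a chunk of data that stays in place while it's transformed by one piece of code, then another and another. Double-buffering reduces the amount of data or code that can be resident at once, because two buffers must share the local memory, but it may improve efficiency by letting a transfer overlap with processing.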
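A minimal sketch of double-buffering, assuming a single background thread stands in for the DMA engine: while one buffer is being processed, the next one is already being fetched. The function name and the fetch and compute callbacks are hypothetical, and real SPE code would issue DMA commands rather than use a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_stream(chunk_ids, fetch, compute):
    """Apply compute to each fetched chunk, prefetching one chunk ahead."""
    results = []
    if not chunk_ids:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:  # stands in for a DMA engine
        pending = dma.submit(fetch, chunk_ids[0])   # fill the first buffer
        for next_id in chunk_ids[1:]:
            current = pending.result()              # wait for the buffer in flight
            pending = dma.submit(fetch, next_id)    # start filling the other buffer
            results.append(compute(current))        # compute overlaps the transfer
        results.append(compute(pending.result()))   # drain the last buffer
    return results


if __name__ == "__main__":
    fetch = lambda i: list(range(i * 4, i * 4 + 4))  # pretend transfer
    print(process_stream([0, 1, 2], fetch, sum))
```

At most two chunks are alive at any moment, the one being computed and the one in flight, which is why each buffer can be only about half the size of a single-buffer scheme.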
This swapping approach was quite common in the past when
memory was at a premium. Think back to Fortran COMMON
statements or Basic program switching on minicomputers.
Code and data can be pushed into SPEs or pulled in by code
running on the SPEs. This type of software-based caching can
vary significantly from one application to another, but it puts
the control into the hands of programmers instead of the
hardware. Huge benefits can be derived from software
caching and SPE communication if the hardware is brought to
bear on a problem. IBM optimized a ray tracing program that
caches seven different kinds of data blocks among multiple
SPEs. The end result: performance improved by almost an
order of magnitude.
Of course, the goal of compiler designers like those at
CodePlay is to automate the partitioning, swapping, etc. (see
"C/C++ Compiler Targets Multicore Chips"). The company's Sieve system targets platforms such
as IBM's Cell, but it works equally well for SMP platforms (see "Going Multicore With Sieve").
The compiler hides the underlying differences. Conventional multithreaded programming tools remain useful, but there's
always room for improvement (see "Multithreading: It's Not
New!"). Middleware (e.g., OpenMP) usage
will likely increase to take advantage of the greater processing
power, especially with multiplayer games.
Both the Xbox 360 and PlayStation 3 incorporate additional acceleration features, especially within their respective
GPUs. In both cases, though, it's a matter of balance to make
optimum use of these features.
BALANCING ACT
Clinton Keith, chief technical officer
for Vivendi Software's High Moon Studios, notes that the difficulty in targeting games to different platforms is in matching
the application with the system hardware. High Moon Studios developed the popular Darkwatch game that runs on
both the PlayStation and Xbox platforms.
Keith indicated that bottleneck identification is critical to
enhancing system performance. Once identified, the bottlenecks can be addressed. This means changing where and how
computation is performed, depending on the architecture.
For example, CPU/GPU communication is constrained on
both the PlayStation 3 and Xbox 360. Memory and DMA
bandwidth are often more of an issue than raw computational
performance. That's something developers should consider
when designing the system because it may mean that
spare processor cycles are available for other chores.
Game developers have long taken shortcuts and made approximations because game platforms lacked the performance needed to provide a computationally accurate real-time gaming
environment. Only recently has more complex simulation
become feasible.
Backgrounds started as fixed images, grew to sprite objects,
and then became objects with more density and complexity. A
tree may appear lifelike from one angle, but not from another.
Shadows may not be realistic, and so on. Incorporate more algorithms and processing power, and the difference between the
apparent and real instances becomes smaller.
Platforms like the Xbox 360 and PlayStation 3 open up the
possibility of modeling objects like trees. Such a model could use a rule-based system running on an SPE that might otherwise
be idle. Other possibilities include moving computation currently found on
some GPUs to the CPU (or SPE). Taking
advantage of more computation power
may only be part of the issue, though.
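The article doesn't say which rule-based system would model a tree. One common choice is an L-system, in which a string of symbols is rewritten by rules and later interpreted as drawing commands. The sketch below assumes that technique; the rule set and the symbol meanings are hypothetical.

```python
def expand(axiom, rules, generations):
    """Rewrite every symbol in parallel, once per generation."""
    state = axiom
    for _ in range(generations):
        state = "".join(rules.get(symbol, symbol) for symbol in state)
    return state


# Hypothetical rule set: F = grow a segment, [ and ] = start and end a branch,
# + and - = turn one way or the other.
TREE_RULES = {"F": "F[+F]F[-F]"}

if __name__ == "__main__":
    for generation in range(3):
        print(generation, expand("F", TREE_RULES, generation))
```

Each generation multiplies the number of segments by four with this rule, so detail grows quickly from a tiny description. That suits a unit with little local memory and cycles to spare.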
Loading, memory, and bandwidth
also come into play, so moving code and
data to different parts of the system may
open up opportunities. It may be possible to reduce the amount of information
that flows between the CPU complex
and GPU, improving overall system efficiency. For example, a CPU can "chew
down" polygons into triangles handled
more efficiently by the GPU. This, in
turn, may free up cycles for other tasks.
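One simple way to "chew down" a polygon is fan triangulation, sketched here under the assumption that the polygon is convex and its vertices are listed in order. The function name is hypothetical, and concave polygons need a different method.

```python
def fan_triangulate(polygon):
    """Split a convex polygon (ordered vertices) into triangles sharing vertex 0."""
    if len(polygon) < 3:
        raise ValueError("a polygon needs at least three vertices")
    anchor = polygon[0]
    return [(anchor, polygon[i], polygon[i + 1])
            for i in range(1, len(polygon) - 1)]


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    for triangle in fan_triangulate(square):
        print(triangle)
```

A polygon with n vertices yields n - 2 triangles.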
Many tasks like the tree simulations
are becoming possible with this new
hardware. High on the list is improved
accuracy, particularly when it comes to
simulating the physics of a game. However, gaming platforms can't take
advantage of new hardware like Ageia's
PhysX chip (see "BFG Technologies And Ageia Make Physics Fun").
But they can apply their multicore
performance to the task, making them
significantly better than earlier gaming
platforms and most PC-based solutions.
Chips like Ageia's will still have the
edge, just like GPUs have the edge for
graphics, so they may show up in future
gaming platforms.
Other key areas that benefit from the new
hardware are scripting language execution and artificial-intelligence support. The latter would allow
flocking algorithms to be used, thus
enhancing the realism.
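For readers unfamiliar with flocking, here is a minimal sketch of one update step in the style of the classic "boids" rules (cohesion, alignment, separation). The article doesn't specify an algorithm, so the rules, the weights, and the data layout are all assumptions chosen for illustration.

```python
import math


def flock_step(boids, dt=1.0, cohesion=0.01, alignment=0.05,
               separation=0.1, min_gap=1.0):
    """Advance every boid one step; each boid is ((x, y), (vx, vy))."""
    updated = []
    for i, ((x, y), (vx, vy)) in enumerate(boids):
        others = [b for j, b in enumerate(boids) if j != i]
        dvx = dvy = 0.0
        if others:
            n = len(others)
            # Cohesion: steer toward the average position of the others.
            dvx += (sum(p[0] for p, _ in others) / n - x) * cohesion
            dvy += (sum(p[1] for p, _ in others) / n - y) * cohesion
            # Alignment: nudge velocity toward the others' average velocity.
            dvx += (sum(v[0] for _, v in others) / n - vx) * alignment
            dvy += (sum(v[1] for _, v in others) / n - vy) * alignment
            # Separation: push away from any neighbor closer than min_gap.
            for (ox, oy), _ in others:
                if math.hypot(ox - x, oy - y) < min_gap:
                    dvx += (x - ox) * separation
                    dvy += (y - oy) * separation
        vx, vy = vx + dvx, vy + dvy
        updated.append(((x + vx * dt, y + vy * dt), (vx, vy)))
    return updated


if __name__ == "__main__":
    flock = [((0.0, 0.0), (1.0, 0.0)), ((10.0, 0.0), (0.0, 1.0))]
    for _ in range(3):
        flock = flock_step(flock)
        print(flock)
```

Every boid's update reads only the previous step's state, so the per-boid work could in principle be divided among cores or SPEs. This naive version compares every boid with every other one, so its cost grows with the square of the flock size.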
Even more possibilities open up as
multiplayer games are added to the mix.
Using a single system to coordinate the
other systems involved limits overall
system utilization. Balancing ease of
programming, system response, and system utilization is a complex task that's
made more difficult with the need to
address multithreaded applications running on these platforms.
Keith notes that developers are only
beginning to tap the potential of the
Xbox 360 and PlayStation 3. It will take a couple of years,
plus tools like CodePlay's Sieve, to really see what they can do. Developers at
companies such as High Moon Studios
are already exploring the possibilities
when using up to 32 hardware threads.
In the meantime, expect some stunning
and aggressively intelligent games to
emerge on these platforms.
NEED MORE INFORMATION?
•Ageia
www.ageia.com
•BFG Technologies
www.bfgtech.com
•CodePlay
www.codeplay.com
•IBM
www.ibm.com
•Microsoft
www.microsoft.com
•Sony
www.sony.com
•Vivendi
www.vivendi.com