Electronics History: AMD’s Unified Shader GPU

This article is part of the Electronics History series: The Graphics Chip Chronicles.

ATI had been experimenting with the concept of a unified shader for several years leading up to its acquisition by AMD. ATI, like NVIDIA and Microsoft, knew it was wasteful to dedicate a shader or array of shaders to one specific function, only to have it sit idle whenever that function wasn’t needed. Making all shaders inside the graphics processing unit (GPU) identical would not only simplify manufacturing, but it would also increase the functionality and efficiency of a GPU by at least 50%.

The tricky part was scheduling and routing: How is a shader supposed to know when it’s a transformation (vertex) shader and when it’s a lighting (pixel) shader? The answer—and everyone knew it—was that the shader doesn’t care. An instruction is an instruction, and data is data. Just get on with it.

However, unified shading architecture hardware needed some form of load balancing and dynamic scheduling to ensure that all of the computational units (shaders) were kept working as much as possible.
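The payoff of load balancing can be illustrated with a rough sketch that compares a fixed split of vertex and pixel units against a single unified pool. The unit counts, job counts, and function names below are hypothetical, and the model assumes every job takes one cycle with no dependencies between jobs, which real hardware does not guarantee.

```python
import math

def cycles_dedicated(vertex_jobs, pixel_jobs, vertex_units, pixel_units):
    """Cycles needed when each unit can only run its own kind of job."""
    vertex_cycles = math.ceil(vertex_jobs / vertex_units)
    pixel_cycles = math.ceil(pixel_jobs / pixel_units)
    # Both pools run side by side; the work is done when the slower pool finishes.
    return max(vertex_cycles, pixel_cycles)

def cycles_unified(vertex_jobs, pixel_jobs, total_units):
    """Cycles needed when any unit can run any job."""
    return math.ceil((vertex_jobs + pixel_jobs) / total_units)

def utilization(total_jobs, total_units, cycles):
    """Fraction of unit-cycles that did useful work."""
    return total_jobs / (total_units * cycles)

# Hypothetical pixel-heavy workload on 8 vertex units + 16 pixel units.
vertex_jobs, pixel_jobs = 80, 1600
dedicated = cycles_dedicated(vertex_jobs, pixel_jobs, 8, 16)
unified = cycles_unified(vertex_jobs, pixel_jobs, 24)
print(dedicated, utilization(vertex_jobs + pixel_jobs, 24, dedicated))
print(unified, utilization(vertex_jobs + pixel_jobs, 24, unified))
```

In this made-up case the dedicated vertex units finish early and then sit idle, while the unified pool keeps every unit busy for the whole run.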

The console iGPU group inside ATI, composed of the engineers who developed the first GPU at the heart of the Nintendo 64, accepted the challenge. The group, which joined ATI as part of the ArtX acquisition, developed a unified-shader GPU for the Xbox 360 in 2004. The chip was called Xenos.

Inside Xenos: The GPU at the Heart of the Xbox 360

The pipeline stages in the Xenos GPU weren’t all that different from those of other GPUs that lacked the unified shader model. That was the result of instructions being handled at the register level instead of through the API. It saved developers from having to learn new approaches, which would have taken time and disrupted their traditional coding techniques.

The Xenos GPU also featured a separate cached location so that it could report its state to the CPU as quickly as possible. Microsoft called it the “tail-pointer write-back.” It was all about keeping the two components from interrupting each other while the CPU updated the L2 cache and the GPU pulled data from it. According to Microsoft, that routine provided a theoretical bandwidth of 18 GB/s.
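The general idea can be sketched as a ring buffer in which the consumer publishes its read position to a location the producer can poll, so neither side has to interrupt the other. This is a generic illustration only: the class, its methods, and the buffer size are hypothetical and do not describe the actual Xbox 360 mechanism.

```python
class CommandRing:
    """Single-producer, single-consumer ring with a written-back tail pointer."""

    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.head = 0  # next slot the producer (CPU role) will write
        self.tail = 0  # consumer (GPU role) progress, visible to the producer

    def free_slots(self):
        # One slot stays empty so that head == tail always means "empty".
        return self.size - 1 - (self.head - self.tail) % self.size

    def produce(self, cmd):
        """Producer polls the tail; it never overwrites unread entries."""
        if self.free_slots() == 0:
            return False
        self.buf[self.head] = cmd
        self.head = (self.head + 1) % self.size
        return True

    def consume(self):
        """Consumer reads one entry, then writes its new tail back."""
        if self.head == self.tail:
            return None
        cmd = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size  # the "write-back" step
        return cmd

ring = CommandRing(4)
for cmd in ("a", "b", "c"):
    ring.produce(cmd)
print(ring.produce("d"))  # ring is full, so the producer must wait
print(ring.consume())     # consumer frees a slot and publishes its tail
print(ring.produce("d"))  # producer sees the new tail and can continue
```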

The GPU in the Xbox 360 was a customized version of ATI’s R520, which was a revolutionary design at the time. In keeping with the “X” prefix and echoing the IBM processor, ATI called the GPU the Xenos. The Xenos, code-named C1, came with 10 MB of internal DRAM and 512 MB of 700-MHz GDDR3 RAM. ATI’s R520 GPU used the R500 architecture, which was based on a 90-nm production process at TSMC, giving it a silicon area of 288 mm² that housed 321 million transistors.

The arithmetic logic units (ALUs) inside the GPU worked with 32-bit IEEE 754 floating-point numbers, with the typical graphics simplifications of rounding modes, denormalized numbers (flushed to zero on reads), exception handling, and not-a-number handling. These units were capable of vector (including dot product) and scalar operations with single-cycle throughput. That is, all operations issued every cycle, giving the GPU a peak processing rate of 96 shader calculations per cycle while fetching textures and vertices.
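To make "flush to zero on reads" concrete, here is a minimal sketch of what flushing a denormalized 32-bit value looks like at the bit level. The function name is hypothetical, and the code only emulates the idea in software; it is not a description of how the hardware implemented it.

```python
import struct

def flush_denormal_f32(x):
    """Return x rounded to float32, with denormals replaced by signed zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        # Denormal: exponent field is zero but the fraction is not.
        # Keep only the sign bit, which yields +0.0 or -0.0.
        bits &= 0x80000000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(flush_denormal_f32(1e-40))      # denormal in float32, flushed to 0.0
print(flush_denormal_f32(2.0**-126))  # smallest normal float32, left unchanged
print(flush_denormal_f32(1.5))        # ordinary value, left unchanged
```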

The GPU had eight vertex shader units supporting the VS3.0 Shader Model of DirectX 9.0. Every unit could process one 128-bit vector instruction plus one 32-bit scalar instruction for each clock cycle. Combined, the eight vertex shader units could transform up to two vertices every clock cycle. The Xenos was the first GPU to process 10 billion vertex shader instructions per second. The vertex shader units supported dynamic flow control instructions such as branches, loops, and subroutines.
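As a rough consistency check on these figures, the clock rate implied by 10 billion instructions per second can be worked out from the unit count and the issue width. The function name is hypothetical, and the result is only what the arithmetic implies, not a stated specification.

```python
def implied_clock_hz(instructions_per_second, units, instructions_per_unit_per_clock):
    """Clock rate at which the given units reach the given instruction rate."""
    return instructions_per_second / (units * instructions_per_unit_per_clock)

# Eight units, each issuing one vector plus one scalar instruction per clock.
clock = implied_clock_hz(10e9, units=8, instructions_per_unit_per_clock=2)
print(f"{clock / 1e6:.0f} MHz")  # prints "625 MHz"
```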

At the time, one of the most important new features of DirectX 9.0 was the support for floating-point processing and data formats known as 4×32 float. Compared with the integer formats used in previous API versions, floating-point formats provided much higher precision, range, and flexibility.

Xenos: The First of Many Unified Shader GPUs

For Microsoft, the Xenos GPU was proof of the promise of the unified shader; it integrated the technology into the proprietary version of Direct3D 9.0 used in the Xbox 360. The Xbox 360 ran on a semi-custom version of Direct3D that accommodated the additional functions needed for the Xenos GPU. As a result, Direct3D influenced the design of ATI’s TeraScale Xenos architecture, and Xenos influenced future revisions of Direct3D beginning with version 10.

The Xbox 360 was introduced in late 2005.

ATI (and subsequently AMD—same team, just different company badges) had a very close relationship with Microsoft. NVIDIA did, too, and still does. Every company selling GPUs needed to work closely with Microsoft since the APIs are what made the latest features of the GPU accessible to game developers. And Microsoft, as the API builder, wanted the latest performance features offered by the GPUs.

In late 2006, a year after the Xbox 360 launch, NVIDIA introduced its GeForce 8800 GTX, the first GPU available with unified shaders via a common API: DirectX 10 with Shader Model 4.0. The new unified shader used a very long instruction word (VLIW) architecture where the core executes operations in parallel.

[Figure: Block diagram of the Xbox 360 GPU.]

In a unified architecture, a shader cluster is organized into five stream processing units. Each stream processing unit can retire one finished single-precision floating-point instruction per clock, as well as a dot product (DP, special-cased by combining ALUs) and an integer ADD. The fifth unit is more complex and can also handle special transcendental functions such as sine and cosine. Each shader cluster can execute up to six instructions per clock cycle, consisting of five shading instructions plus one branch.
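The per-clock limits described above can be sketched as a simple greedy packer that groups operations into bundles of at most five shading operations (only one of them transcendental) plus one branch. This is an illustrative sketch, not a real shader compiler: the operation labels and function names are hypothetical, and it assumes all operations are independent, so it does no dependency checking.

```python
MAX_SHADING_OPS = 5     # five stream processing units per cluster
MAX_TRANSCENDENTAL = 1  # only the fifth unit handles sine, cosine, etc.
MAX_BRANCH = 1          # one branch per clock alongside the shading ops

def fits(op, counts):
    """Check whether op can join a bundle with the given slot usage."""
    if op == "branch":
        return counts["branch"] < MAX_BRANCH
    if counts["alu"] + counts["trans"] >= MAX_SHADING_OPS:
        return False
    if op == "trans":
        return counts["trans"] < MAX_TRANSCENDENTAL
    return True  # plain "alu" op with a free unit

def pack_bundles(ops):
    """Greedily pack independent ops into per-clock bundles."""
    for op in ops:
        if op not in ("alu", "trans", "branch"):
            raise ValueError(f"unknown op kind: {op!r}")
    bundles = []
    pending = list(ops)
    while pending:
        counts = {"alu": 0, "trans": 0, "branch": 0}
        bundle, leftover = [], []
        for op in pending:
            if fits(op, counts):
                bundle.append(op)
                counts[op] += 1
            else:
                leftover.append(op)
        bundles.append(bundle)
        pending = leftover
    return bundles

ops = ["alu"] * 6 + ["trans", "trans", "branch"]
for bundle in pack_bundles(ops):
    print(len(bundle), bundle)
```

With this input the first bundle reaches the six-instruction peak, while the two transcendental operations are forced into separate bundles because they compete for the same unit.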

Several months after NVIDIA’s push into unified shaders, AMD introduced its TeraScale architecture in the PC. The R600 series of TeraScale PC GPUs represented ATI’s second-generation unified shader GPU and were designed to be fully compatible with Shader Model 4.0 and Microsoft’s DirectX 10.0 API. They were implemented on AMD’s Radeon HD 2000-series add-in boards (AIBs).

TeraScale replaced ATI’s Xenos fixed-pipeline, hardware-scheduled unified shader. It was intended to compete with NVIDIA’s first unified shader microarchitecture, Tesla.

The R600 GPUs were manufactured on the 80- and 65-nm nodes. TeraScale was also used in AMD’s accelerated processing units (APUs) code-named Brazos, Llano, Trinity, and Richland.

Unified shaders ushered in a new era for the GPU and firmly established it as a computing element. Because the shaders were no longer function-specific and were simply 32-bit floating-point processors, the enormous processing power of large arrays of shaders in a single-instruction multiple-data (SIMD) architecture put the GPU on the map in a wide range of new markets, including science, medicine, and manufacturing.