Expected to make its first commercial appearance in Sony's next-generation entertainment system, PlayStation 3, the CELL Processor is a highly parallel compute engine jointly developed by Sony, Toshiba, and IBM Corp. It was unveiled earlier this week at the IEEE International Solid-State Circuits Conference. The first-generation CELL chip can support concurrent real-time and conventional computing applications. In the test lab, chip prototypes operated correctly at frequencies well over 4 GHz and delivered a single-precision compute throughput of over 256 GFLOPS.
At the heart of the chip is a new 64-bit IBM Power architecture processor called a Power Processor Element (PPE) that can run multiple operating systems (including Linux) and two simultaneous program threads. The PPE implements the Power instruction set architecture but has a leaner microarchitecture than previous implementations. Included in the PPE are 32-kbyte L1 data and instruction caches and a double-precision floating-point multiplier. In addition, the PPE is supported by an on-chip 512-kbyte L2 cache (Fig. 1). The PPE performs all control and coordination operations for the rest of the chip.
The chip also features eight streaming processors called synergistic processor elements (SPEs). Each SPE has its own local memory and can run an independent instruction stream. All SPEs are connected to the PPE and the rest of the chip via a high-bandwidth quad-ring bus called the element interconnect bus (EIB). The four 16-byte-wide data rings that make up the EIB can, in total, transfer up to 96 bytes per cycle. Ten simultaneous instruction threads are possible: two on the PPE and one on each of the eight SPEs. Together, the PPE and SPEs can also handle over 128 outstanding memory requests.
SPEs represent the first implementation of a new processor architecture designed to accelerate media and streaming workloads. Optimized for power efficiency and area, the SPEs are well suited for multicore implementations that can take advantage of parallelism. Load and store instructions for the SPEs are performed within a local address space served by a 256-kbyte Local Store (LS) memory that's attached to each SPE. A 128-bit (16-byte) data bus connects the LS memory to each SPE.
Internally, the SPE is a single-instruction/multiple-data processor that can be programmed in high-level languages such as C or C++ with intrinsics. Most instructions process 128-bit operands (divided into four 32-bit words). Two pipelines in the SPE provide different compute functions. The "odd" pipe contains a permute unit and a channel unit, which perform bit operations (shift, rotate, gather, etc.) and channel read/write operations. The "even" pipe performs fixed-point or single-precision floating-point computations on three 16-byte operands, delivering a 16-byte result. There is also a heavily pipelined double-precision floating-point unit in the SPE.
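To make the operand layout concrete, here is a minimal Python sketch of one three-operand, four-lane instruction and of the peak-throughput arithmetic it implies. It rests on assumptions that are not in the announcement: that the three-operand instruction is a multiply-add, that each lane's multiply-add counts as two floating-point operations, and that every SPE completes one such instruction per cycle. The function names are hypothetical, and the arithmetic is done in Python doubles and then rounded to 32 bits, so it only approximates real hardware rounding.

```python
import struct

LANES = 4  # a 128-bit operand holds four 32-bit words


def pack_f32x4(values):
    """Pack four Python floats into one 16-byte (128-bit) operand."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lanes")
    return struct.pack("<4f", *values)


def unpack_f32x4(operand):
    """Split a 16-byte operand back into its four 32-bit float lanes."""
    return list(struct.unpack("<4f", operand))


def simd_madd(a, b, c):
    """Lane-wise a*b + c: three 16-byte operands in, one 16-byte result out.

    Assumption: the three-operand instruction is modeled as a multiply-add.
    """
    lanes = [
        x * y + z
        for x, y, z in zip(unpack_f32x4(a), unpack_f32x4(b), unpack_f32x4(c))
    ]
    return pack_f32x4(lanes)


def peak_gflops(num_spes, clock_ghz, flops_per_lane=2):
    """Peak single-precision rate if every SPE retires one madd per cycle.

    Assumption: a multiply-add counts as two floating-point operations.
    """
    return num_spes * LANES * flops_per_lane * clock_ghz


result = simd_madd(
    pack_f32x4([1.0, 2.0, 3.0, 4.0]),
    pack_f32x4([0.5, 0.5, 0.5, 0.5]),
    pack_f32x4([1.0, 1.0, 1.0, 1.0]),
)
print(unpack_f32x4(result))
print(peak_gflops(num_spes=8, clock_ghz=4.0))
```

Under these assumptions, eight SPEs at exactly 4 GHz work out to 256 GFLOPS, which is consistent with the reported figure of over 256 GFLOPS at clock rates above 4 GHz.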
In all, there are seven execution units in the SPE, and the SPE can issue up to two instructions per clock cycle across them. A direct-memory-access engine in the SPE lets software schedule data transfers in parallel with core execution. This overlap hides memory latency and allows the SPE to achieve a high memory bandwidth, and thus deliver a high throughput.
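The benefit of overlapping transfers with computation can be illustrated with a simple timing model. The sketch below is not taken from the CELL design documents. It assumes a double-buffering scheme with two buffers, one DMA transfer in flight at a time, and fixed per-block transfer and compute times in arbitrary units. The function names and the example numbers are hypothetical.

```python
def serial_time(num_blocks, dma_time, compute_time):
    """No overlap: each block is fetched, then processed, one after another."""
    return num_blocks * (dma_time + compute_time)


def overlapped_time(num_blocks, dma_time, compute_time):
    """Double buffering: block i+1 is fetched while block i is processed.

    The first fetch and the last compute step cannot be hidden. Every step
    in between costs whichever of the two activities is slower.
    """
    if num_blocks == 0:
        return 0
    steady_state = (num_blocks - 1) * max(dma_time, compute_time)
    return dma_time + steady_state + compute_time


# Hypothetical workload: 8 blocks, 3 time units to transfer, 5 to compute.
print(serial_time(8, 3, 5))
print(overlapped_time(8, 3, 5))
```

In this model, the transfer time disappears from the total whenever computing a block takes at least as long as fetching the next one, so the processor is limited by compute rather than by memory latency.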
External DRAM connects to the chip through a Rambus XDR interface that operates at 3.2 GHz, and the host data and control bus interfaces are based on the Rambus 6.4-GHz Flex I/O (formerly referred to as Redwood). There are two XDR memory interfaces, each consisting of 72 differential pairs, and the Flex I/O host bus interface contains 96 pin pairs. Between the DRAM and host data interfaces, the CELL processor has an aggregate I/O transfer capability of over 100 Gbytes/s.
To handle the high data transfer rates, the chip packs in a large number of I/O, power, and ground connections to ensure signal integrity. A total of 2965 C4-style "bumps" connect the chip to a low-cost organic package that has about 1300 signal, power, and ground contacts. The chip itself contains about 234 million transistors interconnected with eight levels of copper metallization and will be manufactured with a 90-nm partially depleted silicon-on-insulator process (Fig. 2). Total chip area is about 221 mm². Extensive test capability is also included on the chip in the form of a block called the Pervasive unit that supports test, monitoring, and debug functions.
Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz. Including the eight SPEs, the PPE, and other logic, the CELL processor will dissipate close to 15 W at 2 GHz, about double that at 3 GHz, and perhaps double that again at 4 GHz.
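A quick calculation shows what these figures imply. The sketch below uses the per-SPE numbers as quoted and derives the 3-GHz and 4-GHz chip totals from the "about double" wording, so the 30-W and 60-W values are approximations rather than reported measurements. The names are hypothetical.

```python
NUM_SPES = 8

# Approximate power in watts, keyed by clock frequency in GHz.
SPE_WATTS = {2: 1.0, 3: 2.0, 4: 4.0}
# 15 W is quoted; 30 W and 60 W are derived from "about double" each step.
CHIP_WATTS = {2: 15.0, 3: 30.0, 4: 60.0}


def spe_share(clock_ghz):
    """Fraction of total chip power attributed to the eight SPEs."""
    return NUM_SPES * SPE_WATTS[clock_ghz] / CHIP_WATTS[clock_ghz]


def spe_watts_per_ghz(clock_ghz):
    """Power cost of each GHz of SPE clock rate at a given frequency."""
    return SPE_WATTS[clock_ghz] / clock_ghz


for ghz in (2, 3, 4):
    print(ghz, round(spe_share(ghz), 3), round(spe_watts_per_ghz(ghz), 3))
```

On these rough numbers, the eight SPEs account for a little over half of the chip's power at each clock rate, and the power per gigahertz doubles between 2 GHz and 4 GHz, so power grows faster than frequency.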
Because of local heating caused by individual processing units, IBM designers applied local thermal sensing schemes and control mechanisms to achieve an aggressive low-cost thermal design. A linear sensor and 10 local digital thermal sensors are embedded in the CELL processor to provide warnings of any temperature increases and to trigger various thermal protection schemes.