Multicore central processing units (CPUs) and graphics processing units (GPUs) are used in an astounding variety of computational platforms, including supercomputers, desktop and laptop PCs, and smartphones. All of these platforms take advantage of inexpensive compute fabrics built with thousands of multicore sockets.
However, per-socket input-output (I/O) becomes increasingly limited as tens of CPU cores and hundreds of GPU cores compete for finite per-socket memory and bus bandwidth. If I/O problems could be solved by adding gates rather than pins, multicore I/O could also be accelerated by Moore’s Law, a doubling of transistor density roughly every 18 months, rather than being constrained by Amdahl’s Law, which states that the speedup gained by parallelizing a computational task is limited by the part of the task that cannot be parallelized.
Figure 1 illustrates the interface gap between computational rates and I/O rates, using five generations of Nvidia GPUs as relevant examples. The Nvidia GeForce 8800GT has 112 floating-point cores and supports I/O rates of up to 20 Gbytes/s. The Nvidia Quadro Fx580 contains 256 improved floating-point cores and supports I/O rates of almost 36 Gbytes/s.
In this specific example, the GPU’s computational performance tripled, while its I/O rate increased by only 1.8 times. Computational elements scale with Moore’s Law, while I/O scales with package size. As more cores are added to CPUs and GPUs, package pin counts fall ever farther behind.
Figure 2 illustrates both the multicore benefit and the multicore I/O challenge. The “one-core profile” shows the duration of I/O versus the duration of computation using a single core.
N cores can solve this hypothetical application’s computations up to N times faster, assuming the computation does not run up against Amdahl’s Law. If 10% of a task must be performed sequentially, for example, the overall speedup can never exceed 10 times, no matter how many cores are added. Finally, Figure 2’s lower bars illustrate the improvement from I/O compression and decompression, which speeds up the I/O-related part of the task.
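The trade-off between cores and I/O can be sketched with a few lines of arithmetic. The example below is only a rough model, not the method behind Figure 2: it assumes that I/O and computation do not overlap, that compression simply divides the I/O time by a compression ratio, and that compression itself costs no time. The function names and the sample durations (4 units of I/O, 8 units of computation) are hypothetical.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's Law: overall speedup when only part of a task is parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def task_time(io_time: float, compute_time: float, cores: int,
              serial_fraction: float = 0.0,
              compression_ratio: float = 1.0) -> float:
    """Assumed model: total time is I/O time plus computation time.

    More cores shrink only the computation; compression (modeled here
    as free) shrinks only the I/O.
    """
    compute = compute_time / amdahl_speedup(serial_fraction, cores)
    return io_time / compression_ratio + compute


if __name__ == "__main__":
    # With 10% sequential work, speedup approaches 10x but never reaches it.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.10, n), 2))
    # prints: 1 1.0 / 10 5.26 / 100 9.17 / 1000 9.91 (one pair per line)

    # Hypothetical profile: 4 time units of I/O, 8 of computation.
    print(task_time(4, 8, cores=1))                         # 12.0
    print(task_time(4, 8, cores=8))                         # 5.0
    print(task_time(4, 8, cores=8, compression_ratio=2.0))  # 3.0
```

In this made-up profile, eight cores cut the total from 12 units to 5, after which I/O accounts for four-fifths of the run time. A 2:1 compression ratio then removes more time than any further increase in core count could.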
Specific I/O Bottlenecks
CPU and GPU vendors regularly try to overcome I/O restrictions by increasing interface bandwidths and signaling rates. As of 2011, CPU DDR3 memories can sustain 18 Gbytes/s across 64 pins, and GPUs boast GDDR5 memories with more than 120 Gbytes/s across 384 pins. Similarly, optical networks are now deploying InfiniBand links at 40 Gbits/s, while 10-Gbit/s Ethernet, already widely deployed in the Internet infrastructure, is now moving toward 100 Gbits/s. With this much available bandwidth, how can there possibly be a multicore I/O problem?
The problem lies in the “pins per core” limitation. When single-core Intel CPUs first reached 3 GHz in 2004, an Intel socket had about 800 balls, with 500 for power and ground and 300 for I/O. In 2004, the DDR2 memory interface delivered about 5 Gbytes/s, with PCI Express and Ethernet providing an additional 4 Gbytes/s.
In contrast, today’s Intel Sandy Bridge processor has a 1155-pin package, where roughly 500 pins are dedicated to I/O. All four Sandy Bridge cores share dual DDR3 controllers that operate at 1.33 Gtransfers/s across 204 pins per dual-inline memory module (DIMM), providing a maximum transfer rate of 21 Gbytes/s.
A 16-lane general-purpose PCI Express interface provides 8 Gbytes/s of bandwidth. Divided among the four cores, the aggregate Sandy Bridge DDR3 data rate is just above 5 Gbytes/s per core, which is no faster than the per-core memory rate in 2004. And Sandy Bridge’s PCI Express bandwidth has only doubled since 2004.
The I/O problem is similarly challenging for the latest generation of Fermi GPUs, whose 448 cores share a 16-lane PCI Express Gen2 interface of 8 Gbytes/s and a 384-pin GDDR5 memory interface of 120 Gbytes/s. On average, each Fermi core can exchange only about 268 Mbytes/s with GDDR5 memory and less than 18 Mbytes/s across PCI Express.
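The per-core figures come from dividing each interface's quoted bandwidth by the number of cores that share it. The short calculation below reproduces that division using the bandwidths and core counts given above. It assumes decimal units (1 Gbyte = 1000 Mbytes) and an even split among cores, and the helper name is hypothetical.

```python
def per_core_mbytes(total_gbytes_per_s: float, cores: int) -> float:
    """Average bandwidth per core in Mbytes/s, assuming an even split."""
    return total_gbytes_per_s * 1000.0 / cores


# (description, shared bandwidth in Gbytes/s, cores sharing it)
interfaces = [
    ("2004 single-core CPU, DDR2", 5, 1),
    ("Sandy Bridge, DDR3", 21, 4),
    ("Fermi, GDDR5", 120, 448),
    ("Fermi, PCI Express Gen2", 8, 448),
]

for name, gbytes, cores in interfaces:
    print(f"{name}: {per_core_mbytes(gbytes, cores):.1f} Mbytes/s per core")
# 2004 single-core CPU, DDR2: 5000.0 Mbytes/s per core
# Sandy Bridge, DDR3: 5250.0 Mbytes/s per core
# Fermi, GDDR5: 267.9 Mbytes/s per core
# Fermi, PCI Express Gen2: 17.9 Mbytes/s per core
```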
New memory and interface standards won’t make much difference. DDR4 memory with its 45 Gbytes/s won’t become mainstream until 2015, while PCI Express Gen3 will only double PCI Express Gen2 bandwidth starting in 2012. For both CPUs and GPUs, I/O rates continually fall behind the geometrical rise in the number of cores per socket.
A New Idea: Numerical Compression
Now that we’ve described why improving multicore I/O is important, let’s consider how compression and decompression might be integrated into existing multicore designs to help reduce this bottleneck. Figure 3 illustrates several locations where compression (“C” blocks) and decompression (“D” blocks) could be added to a multicore CPU to reduce various I/O bottlenecks.
Figure 3 shows a generic six-core CPU whose cores are connected to each other, and to a memory and peripheral interface subsystem, via a high-speed, on-chip ring. The front-side bus block includes two DDRx memory controllers (up to 18 Gbytes/s per DDRx DIMM) and at least 16 lanes of PCI Express Gen2 (up to 8 Gbytes/s). CPU compression and decompression could be added in at least three on-chip locations:
- at each core-to-ring interface
- to each DDRx memory interface controller
- to the PCI Express interface controller
Conceptually, compression would be invoked on every compressible transaction as each core writes data to any other core, to off-chip memory, or to off-chip peripherals. Compressed data would then be decompressed just before it is delivered to each core from other cores, from off-chip memory, or from off-chip peripherals.
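The compress-on-write, decompress-on-read flow can be mimicked in software. The sketch below is not the algorithm inside the "C" and "D" blocks, which is not specified here. It substitutes a simple lossless stand-in (delta encoding followed by zlib's deflate) and assumes 32-bit signed integer samples whose successive differences also fit in 32 bits. The function names and the test signal are hypothetical.

```python
import math
import zlib
from array import array


def compress_block(samples: array) -> bytes:
    """Stand-in for a "C" block: delta-encode the integers, then deflate.

    Assumes typecode "i" (C int, 32 bits on common platforms) and that
    each difference between neighbors fits in that type.
    """
    deltas = array("i", samples)  # copy; the first sample is kept as-is
    for i in range(1, len(samples)):
        deltas[i] = samples[i] - samples[i - 1]
    return zlib.compress(deltas.tobytes())


def decompress_block(payload: bytes) -> array:
    """Stand-in for a "D" block: inflate, then undo the delta encoding."""
    values = array("i")
    values.frombytes(zlib.decompress(payload))
    for i in range(1, len(values)):
        values[i] += values[i - 1]  # running sum restores the samples
    return values


if __name__ == "__main__":
    # Hypothetical slowly varying integer signal.
    samples = array("i", (round(10000 * math.sin(i / 50)) for i in range(4096)))
    payload = compress_block(samples)            # done as the core writes
    restored = decompress_block(payload)         # done just before delivery
    assert restored == samples                   # lossless round trip
    print(len(samples.tobytes()), "bytes raw,", len(payload), "bytes compressed")
```

Fewer bytes cross the interface between the two calls, which is the point of placing the blocks at the core-to-ring, memory, and PCI Express boundaries. A hardware block would also have to keep pace with the interface's data rate, which this software stand-in makes no attempt to model.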
Let’s get specific about what kind of data could be compressed by the generic compress and decompress blocks. CPU and GPU vendors have already added dedicated compress and decompress accelerator intellectual property (IP) blocks to their existing chips. What’s unique about the “C” and “D” blocks in Figure 3?
First, existing compression accelerators are limited to compressing or decompressing consumer media. These existing blocks accelerate well-known compression algorithms such as MP3 for music, JPEG2000 for photos, and H.264 for video. However, these algorithms can only compress or decompress audio, image, and video files.
In contrast, the “C” and “D” blocks are designed to process any numerical data—integers and floating-point values. Since all CPUs and GPUs include dedicated hardware for numerical processing, it seems reasonable to conclude that numerical data makes up a significant part of CPU and GPU workloads.
In fact, given the increasing interest in cloud computing, the computational loads of modern CPUs and GPUs are tilting towards numerical processing applications. Hybrid CPU + GPU chips, such as Intel’s Knights Ferry and AMD’s Fusion, are aimed at high-end servers and high-performance computing (HPC), both of which perform lots of number-crunching.
Given Nvidia’s and AMD/ATI’s marketing efforts to grow the use of GPUs for low-cost, high-speed computation, the GPU market is increasingly focused on numerical computing for non-graphics applications. With GPU computing, Nvidia and AMD offer a “supercomputer on a desk” with Cray-like compute capabilities at the cost and power consumption of a desktop PC.
In summary, multicore numerical compression and decompression reduce I/O bottlenecks for a large and growing class of applications that include not only media, graphics, and video, but also HPC and cloud computing.