[Technology Report]
Smaller Servers, Larger Performance
Thanks to the latest architectures, designers can pack more processors into less space. But parallel processing may have to wait until the software catches up with the hardware.
It all began in 1952, when the ILLIAC I (Illinois Automatic Computer) graced the stage at the University of Illinois. By 1956, this machine had more compute power than all of Bell Labs. That's not bad for a 4.5-ton, 10- by 2- by 8.5-ft box filled with more than 2800 vacuum tubes and 64-kword drum storage. Eventually, the infamous ILLIAC IV vector processor incorporated 256 processors in its design.
The latest single-chip, multicore processors run rings around these dinosaurs. But the search for faster, better solutions continues unabated. The industry has made great progress in larger symmetrical multiprocessing (SMP) systems, and typical high-end servers host over two dozen processors. Multiple cores per processor effectively increase the number of processors. Yet moving into the hundreds to thousands range requires a change of architecture.
One such architecture is NUMA, or non-uniform memory access, which retains a common memory that every node can address. A node's local memory remains the fastest, and access times grow as the memory being accessed sits farther from the node. NUMA's big problem, though, is programming.
The NUMA architecture has worked well in AMD's Opteron. Each chip has its own memory interface, but its HyperTransport links can be used to access memory attached to other chips. Because the links are fast, this works well when the memory being accessed is attached to an adjacent chip. But the approach behaves like a typical NUMA system when hundreds of nodes are used.
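To see why data placement dominates NUMA programming, consider a rough cost model in which every extra hop between nodes adds a fixed delay. The sketch below is illustrative only: the latency constants and the function names are assumptions made for the example, not measured Opteron or HyperTransport figures.

```python
from typing import Dict

# Placeholder numbers chosen for illustration; they are NOT measurements.
LOCAL_NS = 60.0     # assumed latency of a node's own memory
PER_HOP_NS = 40.0   # assumed extra latency per interconnect hop


def access_latency_ns(hops: int) -> float:
    """Latency of one memory access that crosses `hops` links (0 = local)."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return LOCAL_NS + PER_HOP_NS * hops


def average_latency_ns(hop_fractions: Dict[int, float]) -> float:
    """Weighted average latency, given the fraction of accesses per hop count."""
    total = sum(hop_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(frac * access_latency_ns(hops)
               for hops, frac in hop_fractions.items())


if __name__ == "__main__":
    # Mostly local accesses versus data scattered across distant nodes.
    well_placed = {0: 0.9, 1: 0.1}
    poorly_placed = {0: 0.25, 1: 0.25, 2: 0.5}
    print(average_latency_ns(well_placed))
    print(average_latency_ns(poorly_placed))
```

Under these assumed numbers, the scattered layout averages well over one and a half times the latency of the mostly local one. The gap widens as the hop count grows, which is the situation a system with hundreds of nodes faces.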
A mixture of different application requirements has yielded a plethora of designs. These range from massive supercomputer complexes that are tied together by high-speed fabrics to clusters of blade servers connected by Ethernet.
Compute engines these days are based on standard platforms such as Intel's EM64T and IA64 processors, AMD's Athlon 64 and Opteron, and Sun's UltraSPARC processors. Similarly, standards like Ethernet, Serial RapidIO (sRIO), and InfiniBand provide the interconnect fabric. And in software, standards are slowly improving the developer's ability to employ these hardware features.
Super Apps
High-performance computing (HPC) tends to cover everything these days, from supercomputing applications like weather prediction and earthquake modeling to clusters of Web servers. Systems like the Cray XT3 use the hypercube architecture to take advantage of dual-core AMD Opteron processors (Fig. 1 and Fig. 2). Hypercubes offer scalability, but dataflow and routing become issues that programmers must address.
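The routing issue is easier to picture with the textbook hypercube model, in which each node carries a binary address and links join nodes whose addresses differ in exactly one bit. The sketch below shows generic dimension-order routing on such a cube. It is an illustration of the topology only, not a description of how the Cray XT3 routes traffic, and the function names are hypothetical.

```python
from typing import List


def neighbors(node: int, dimensions: int) -> List[int]:
    """Nodes directly linked to `node`: flip each address bit in turn."""
    return [node ^ (1 << d) for d in range(dimensions)]


def hop_count(src: int, dst: int) -> int:
    """Minimum hops equals the number of differing address bits."""
    return bin(src ^ dst).count("1")


def route(src: int, dst: int, dimensions: int) -> List[int]:
    """Dimension-order route: fix differing bits from lowest to highest."""
    limit = 1 << dimensions
    if not (0 <= src < limit and 0 <= dst < limit):
        raise ValueError("node address out of range for this cube")
    path = [src]
    current = src
    for d in range(dimensions):
        bit = 1 << d
        if (current ^ dst) & bit:
            current ^= bit          # traverse the link along dimension d
            path.append(current)
    return path


if __name__ == "__main__":
    # A 3-dimensional cube has 8 nodes, each with 3 links.
    print(neighbors(0b000, 3))
    print(route(0b000, 0b101, 3))
```

In this model a cube of dimension d connects 2^d nodes with at most d hops between any pair, which is where the scalability comes from. The cost is that the software, or the router, has to decide which dimension to cross next at every step.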
Different connection architectures like the hypercube have been giving way to fabric interconnects like Ethernet, sRIO, InfiniBand, and ASI (Advanced Switching Interconnect). These standards-based solutions cost less. Also, their performance has improved steadily as products mature.
InfiniBand, one of the most mature of these products, has found a niche in HPC. Some of the largest and fastest supercomputers are based on an InfiniBand interconnect. Mellanox's 480-Gbit/s InfiniScale III switch chip can be found at the center of many of these fabrics (see "Switch-Chip Fuels Third-Generation InfiniBand" at www.electronicdesign.com, ED Online 5999). It can be configured as an eight-port, 30-Gbit/s, 12x InfiniBand switch or as numerous 4x ports. Its low 200-ns latency is critical to efficient HPC applications.
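The switch figures are easy to cross-check with a little arithmetic. The sketch below assumes single-data-rate InfiniBand signaling at 2.5 Gbits/s per lane and assumes that the 480-Gbit/s total counts both directions of every port. The 4x port count it prints is derived from that arithmetic rather than taken from the article, and the function names are hypothetical.

```python
LANE_GBITS = 2.5  # single-data-rate InfiniBand signaling rate per lane


def port_rate_gbits(lanes: int) -> float:
    """Signaling rate of one port in one direction."""
    return lanes * LANE_GBITS


def aggregate_gbits(ports: int, lanes: int, directions: int = 2) -> float:
    """Total switch bandwidth, counting both directions by default (assumption)."""
    return ports * port_rate_gbits(lanes) * directions


def ports_at_width(wide_ports: int, wide_lanes: int, narrow_lanes: int) -> int:
    """How many narrower ports the same number of lanes could form."""
    return (wide_ports * wide_lanes) // narrow_lanes


if __name__ == "__main__":
    print(port_rate_gbits(12))        # one 12x port
    print(aggregate_gbits(8, 12))     # eight 12x ports, both directions
    print(ports_at_width(8, 12, 4))   # the same lanes regrouped as 4x ports
```

With these assumptions, a 12x port works out to 30 Gbits/s and eight of them to 480 Gbits/s, matching the quoted figures. Regrouping the same 96 lanes as 4x ports would give 24 ports at 10 Gbits/s each.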
Of course, even InfiniBand can go one better with devices like the PathScale 10X-MR PCI Express adapter (see "InfiniBand Hits 10M Messages/s" at ED Online 12359). Its connectionless architecture avoids the queue pairs used with the usual OpenIB stack, allowing a node to handle up to 10 million messages/s.
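A message rate like that leaves very little time per message. The back-of-the-envelope sketch below converts the rate into a per-message time budget and, as a purely illustrative assumption, pairs it with the 200-ns switch latency quoted earlier to estimate how many messages would be in transit at once. The function names are hypothetical, and the pairing of the two figures is not something either vendor states.

```python
def per_message_budget_ns(messages_per_s: float) -> float:
    """Average time available to handle one message at a sustained rate."""
    if messages_per_s <= 0:
        raise ValueError("rate must be positive")
    return 1e9 / messages_per_s


def messages_in_flight(messages_per_s: float, latency_ns: float) -> float:
    """Rate times latency: messages in transit at any instant (Little's law)."""
    return messages_per_s * latency_ns * 1e-9


if __name__ == "__main__":
    rate = 10e6            # 10 million messages/s, from the article
    switch_latency = 200   # ns, the switch latency quoted for InfiniScale III
    print(per_message_budget_ns(rate))
    print(messages_in_flight(rate, switch_latency))
```

At 10 million messages/s the budget is 100 ns per message, half the quoted switch latency. A node can therefore sustain that rate only if messages overlap in the fabric rather than completing one at a time, which gives some sense of why trimming per-message overhead in the stack matters.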