
Smart Connectivity Module Pushes the Limits of PCIe

Feb. 14, 2024
We look inside the Aries smart cable module from Astera Labs, and at why long-range PCIe connectivity has become such a critical factor in the data center.

Check out our coverage of DesignCon 2024.

High-bandwidth, low-latency connectivity is a must-have in modern data centers. The only challenge? As more CPUs, GPUs, AI accelerators, and memory chips are piled into the data center, they’re being driven further away from each other even as they need to work more closely together to run AI and other heavy-duty workloads. Given the situation, robust connectivity with extra-long reach is becoming key.

Astera Labs, a semiconductor connectivity startup backed by Intel and other technology giants, is trying to solve the bottleneck with its new Aries smart cable modules (SCMs) for PCI Express (PCIe) and Compute Express Link (CXL). The company put the smart connectivity modules on display for the first time at DesignCon 2024.

The modules are designed to add Astera’s Aries DSP retimers to flexible copper cables, paving the way for long-range connectivity of processors and accelerators over PCIe Gen 5 or memory over the CXL 2.0 standard.

Typically, the server’s motherboard and/or the accelerator and riser cards that plug into it use PCIe and CXL retimers to ferry signals over longer distances than the copper traces on a printed circuit board (PCB) or passive direct-attach copper (DAC) cables can manage on their own. But retimers can also be placed directly inside the cables, specifically inside the connector at one end, turning them into so-called active electrical cables (AECs).

Astera Labs said the Aries SCMs, purpose-built for AI and cloud data centers, make it possible to connect GPUs over far longer distances and even between racks to create larger AI clusters.

Retimers: A Critical Piece of the AI Puzzle

Today, GPUs are the gold standard for AI workloads in the data center, and they’re typically packaged with CPUs in a 4:1 ratio. These accelerators sit on separate PCBs and communicate between servers in a single rack over short distances—and increasingly with other racks over long connections—to form larger clusters for AI acceleration.

The connective tissue in these clusters is PCIe, the most widely used high-speed serial interconnect bus in servers due to its high bandwidth and low latency. PCIe is used within the server itself to connect CPUs to GPUs, CPUs to network interface cards (NICs), CPUs to accelerators, and CPUs to memory. It’s also frequently used to attach the CPU to GPUs, memory, and storage located within the same rack by using board-to-board connectors and/or cables.

As AI workloads take over more of the data center, the demand for high-speed connectivity is climbing. Thus, more companies are making the move to PCIe Gen 5, which doubles the data rate of PCIe Gen 4 from 16 GT/s to 32 GT/s. The extra performance comes at a cost: the faster signaling degrades signal integrity, largely due to insertion loss (IL) in the channel. That limits how far a signal can travel through the CPU’s package, the PCB, and the connectors and cables, all of which weaken it along the way.

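For a rough sense of what that doubling means in practice, here’s a minimal back-of-the-envelope sketch (an illustration, not Astera’s figures) of the raw one-direction throughput of a single lane and a x16 link at each generation. It assumes the 128b/130b line encoding used since PCIe Gen 3 and ignores packet and protocol overhead:

```python
# Back-of-the-envelope PCIe throughput estimate (illustrative only).
# Assumes 128b/130b line encoding (PCIe Gen 3 and later) and ignores
# protocol overhead such as packet headers, flow control, and replays.

ENCODING_EFFICIENCY = 128 / 130  # usable bits per transferred bit

def link_bandwidth_gbytes(transfer_rate_gt_s: float, lanes: int) -> float:
    """Approximate one-direction payload bandwidth in GB/s."""
    usable_gbit_s = transfer_rate_gt_s * ENCODING_EFFICIENCY * lanes
    return usable_gbit_s / 8  # bits -> bytes

for gen, rate in [("Gen 4", 16.0), ("Gen 5", 32.0)]:
    per_lane = link_bandwidth_gbytes(rate, lanes=1)
    x16 = link_bandwidth_gbytes(rate, lanes=16)
    print(f"PCIe {gen}: ~{per_lane:.2f} GB/s per lane, ~{x16:.1f} GB/s for x16")

# PCIe Gen 4: ~1.97 GB/s per lane, ~31.5 GB/s for x16
# PCIe Gen 5: ~3.94 GB/s per lane, ~63.0 GB/s for x16
```
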
Astera rolled out its Aries family of retimers to resolve these signal-integrity issues and deliver the full 32-GT/s data rate of the PCIe Gen 5 standard, with less than 10 ns of added latency, across a host of CPU, GPU, networking, storage, and switch SoCs. These retimers give customers more flexibility to arrange the various systems inside data centers. The company said the Aries DSPs are already used in systems based on NVIDIA’s H100 and AMD’s MI300X GPUs.

As Astera points out, PCIe retimers are becoming a small but important part of AI system infrastructure. Since distributing signals and data around the data center costs a not-insignificant amount of power and latency, companies on the front lines of the AI boom are cramming more GPU and AI accelerator cards into each rack, and they use retimer ICs to stitch all of these systems together over PCIe Gen 5.

But they are hitting roadblocks, Thad Omura, chief business officer at Astera Labs, told Electronic Design. This is primarily due to the limits of existing connectivity technologies and the other unfortunate realities of modern data centers.

New power-hungry AI accelerators such as the H100 GPU consume up to 700 W, while the latest server CPUs and other AI silicon are close behind. That drives power-per-rack specifications up to 90 kW, far beyond the 15 to 30 kW per rack that most data centers can supply. As power demands continue to climb, Omura said, companies are spreading servers over more racks to funnel power to them more effectively.

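To see why those numbers collide, consider a minimal sketch of a rack power budget. The 8-GPU server configuration and the 2-kW allowance for CPUs and other overhead are illustrative assumptions, not figures from Astera or Omura:

```python
# Rough rack power budget estimate (illustrative assumptions only).

GPU_POWER_W = 700          # e.g., an H100-class accelerator at full load
GPUS_PER_SERVER = 8        # assumed 8-GPU AI server
CPU_AND_OVERHEAD_W = 2000  # assumed CPUs, NICs, fans, losses per server

def server_power_kw() -> float:
    """Estimated power draw of one fully loaded AI server, in kW."""
    return (GPU_POWER_W * GPUS_PER_SERVER + CPU_AND_OVERHEAD_W) / 1000

def servers_per_rack(rack_budget_kw: float) -> int:
    """How many such servers fit under a given rack power budget."""
    return int(rack_budget_kw // server_power_kw())

print(f"One 8-GPU server: ~{server_power_kw():.1f} kW")
for budget in (15, 30, 90):
    print(f"A {budget}-kW rack fits about {servers_per_rack(budget)} of these servers")

# One 8-GPU server: ~7.6 kW
# A 15-kW rack fits about 1 of these servers
# A 30-kW rack fits about 3 of these servers
# A 90-kW rack fits about 11 of these servers
```
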
The other challenge lies in keeping them cool, said Omura. Dissipating heat from these power-hungry, high-performance AI accelerators is tricky at best. Thus, it’s important to install them further away from each other to prevent excess heat from sapping their performance.

AEC: A New Alternative to Passive Copper Cables

As the core building blocks inside data centers are driven further apart, wired connectivity is becoming a bigger piece of the puzzle.

The problem, as Astera Labs points out, is that the passive copper cables employed in data centers can only transfer data up to three meters over PCIe Gen 5, which is sufficient to send data around a single column of servers.

But using the Aries SCM to relocate the PCIe retimer from the circuit board or accelerator card inside the server into the cable itself extends the reach to up to seven meters. As a result, Astera said, the Aries-based AEC makes it possible to connect the CPUs, GPUs, and other processing resources in one rack to the processors or memory in another without resorting to optical connectivity.

One of the tradeoffs with using PCIe and CXL to span longer distances in the data center is that it adds a degree of latency for each meter traveled by the signal. Despite that, Astera said the Aries SCM opens the door for you to lash together larger clusters of AI accelerators while still spreading out the power and cooling over several racks at a time. Upgrading from passive to active cables also raises costs. But, as Omura noted, the longer range is worth it for many of its customers. “Active PCIe cables have emerged as a critical connectivity solution for GPU clusters and low-latency memory fabrics for AI infrastructure that has physically outgrown a single rack enclosure.”

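For a rough feel of that latency tradeoff, here’s a minimal sketch comparing a 3-m passive DAC with a 7-m Aries-based AEC. The roughly 5-ns/m propagation delay in copper and the 10-ns retimer hop are ballpark assumptions, not measured figures from Astera:

```python
# Ballpark cable-latency comparison (illustrative assumptions only).

PROPAGATION_NS_PER_M = 5.0  # assumes signal speed of roughly 2/3 c in copper
RETIMER_LATENCY_NS = 10.0   # assumed worst-case latency of one retimer hop

def passive_dac_latency_ns(length_m: float) -> float:
    """Flight time through a passive copper cable."""
    return length_m * PROPAGATION_NS_PER_M

def active_aec_latency_ns(length_m: float) -> float:
    """Flight time plus one retimer hop inside the cable's connector."""
    return length_m * PROPAGATION_NS_PER_M + RETIMER_LATENCY_NS

print(f"3-m passive DAC: ~{passive_dac_latency_ns(3):.0f} ns")
print(f"7-m active AEC:  ~{active_aec_latency_ns(7):.0f} ns")

# 3-m passive DAC: ~15 ns
# 7-m active AEC:  ~45 ns
```

Even under these assumptions, the added delay amounts to tens of nanoseconds, which is consistent with Omura’s point that the longer reach is worth the cost for many customers.
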
In a bid to wring more performance out of AI infrastructure with less total cost of ownership (TCO), many hyperscalers are also exploring the potential of memory expansion and pooling over the CXL protocol. According to Astera, which also sells CXL smart memory controllers to many of these companies, the Aries SCM can be used to bridge longer distances between the CPUs, GPUs, and shared DRAM over CXL.

Nathan Brookwood, of market research firm Insight 64, said in a statement: “Just when hyperscalers were struggling with the problem of attaching more GPUs than could fit in a single data-center rack, Astera Labs comes along with a cabling solution that enables users to build multi-rack systems with hundreds of GPUs in new AI infrastructure.”

On top of that, Astera’s Aries modules can work with copper cables over shorter distances or with optical cables to span the largest data centers, allowing hyperscalers to deploy AI at cloud scale.

Interconnects: The Next Big Battle in Silicon

As high-performance connectivity becomes one of the leading battlegrounds in silicon, Astera Labs is trying to come out on top. Its chips are already on the bill of materials (BOM) of all the leading hyperscalers.

The company has several core chip families focused on solving different data bottlenecks within the data center. Its offerings include the “Leo” family of smart memory controllers, which use the CXL protocol to create a shared, coherent memory space between CPUs, GPUs, and other accelerators, and the “Taurus” series of SCMs for server-to-switch and switch-to-switch Ethernet interconnects running from 200 to 800 Gb/s over long distances.

Unlike general-purpose smart cables, the Aries SCM is supported by COSMOS, the company’s software platform for monitoring and managing data-center connectivity from the device level up to the full system. The Aries SCM is also backward-compatible with previous generations of the PCIe bus.

Instead of supplying the copper cables itself, Astera Labs plans to sell Aries SCMs to technology firms and other OEMs, which can then work with cable manufacturers to build them into products.

Brian Kirk, CTO of Amphenol, said the company is collaborating with Astera Labs, with plans to put the Aries SCM into its PCIe and CXL cable assemblies to help handle AI and other data-hungry workloads in data centers.

Check out more of our coverage of DesignCon 2024.

About the Author

James Morra | Senior Editor

James Morra is a senior editor for Electronic Design, covering the semiconductor industry and new technology trends, with a focus on power management. He also reports on the business behind electrical engineering, including the electronics supply chain. He joined Electronic Design in 2015 and is based in Chicago, Illinois.
