AMD's New 128-Core CPU Elevates Efficiency of Cloud Data Centers
AMD is raising the stakes in the cloud market with a new family of server processors, code-named Bergamo. Each member packs up to 128 x86 CPU cores designed to excel at power efficiency.
The Santa Clara, Calif.-based company said Bergamo is unique in that it uses a cloud-focused variant of its Zen 4 core, which debuted for the data-center market last year in its more general-purpose Genoa CPUs. AMD rolled out the EPYC 97X4 family at its data center and AI technology event in San Francisco, where Electronic Design was in attendance.
CEO Lisa Su said the Zen 4c core is based on the same microarchitecture as its Zen 4. But it’s specifically designed to be more compact and power-efficient than the base Zen 4 core, where the company’s focus was instead on increasing performance per core. Per AMD, it brings to the table up to 2.7X better energy efficiency and enables up to 3X more containers per server than rival silicon.
By pursuing a new path in the design of the CPU core, the company said that it can integrate up to 128 CPU cores per Bergamo chip, while its stock Genoa server CPU comes with a 96-core limit.
"The optimum design point for these processors is different than general-purpose computing," said Su, adding that Bergamo is currently shipping to hyperscale customers, including Meta. “They are very throughput-oriented, and they benefit from the highest density and the best energy efficiency.”
Head in the Clouds
With the Bergamo EPYCs, AMD is looking to build on its rising prospects in cloud data centers.
“Every major cloud provider has deployed EPYC for their internal and customer-facing instances,” said Su, including the most prominent U.S. public cloud providers Amazon, Google, and Microsoft.
AMD’s wins are denting Intel’s dominance of the data-center market, where a single CPU can cost more than $10,000. The company’s revival has been powered to a large degree by its Zen CPU architecture, introduced in 2017.
But given that cloud-computing giants are among the most prolific buyers of server silicon, it makes sense for the company to tailor its chips—at a more fundamental level—to their unique needs.
AMD said Bergamo is specifically made for “born in the cloud” workloads based on microservices—split into smaller chunks of code running in separate “virtual” CPUs that occupy a single, physical CPU (Fig. 1). Hefty CPU cores bundled with large amounts of cache tend to not be the best fit for these types of workloads.
Cloud service vendors rent out the computing power in their data centers core by core. As a result, more CPU cores mean they can run more virtual machines and containers on a single box of hardware, letting them cycle through many more workloads at once.
Power efficiency is the other name of the game for leaders in the cloud market. Electricity is one of their largest ongoing costs, affecting both the total cost of ownership (TCO) of their data centers and their sustainability goals.
By leveraging the Zen 4c core, AMD stated Bergamo can integrate up to 128 cores—with up to 256 threads when it uses simultaneous multithreading (SMT)—clocking in at base speeds of 2.25 GHz.
Looking ahead, Bergamo will have to contend with other cloud-focused chip designs, ranging from Ampere’s Arm-compatible CPUs that will eventually feature up to 192 cores to Intel’s 144-core Sierra Forest CPU due out in 2024.
The Core of the Matter
AMD said the new cloud-focused core at the heart of Bergamo was shaped by all of those factors.
The underlying microarchitecture of the Zen 4c core is identical to the Zen 4, giving it all of the same advanced features and even about the same performance in terms of instructions per clock (IPC).
The new CPU core is the result of a different physical implementation of Zen 4, along with a different performance-per-watt profile. AMD said it started from the same register transfer level (RTL) as the standard Zen 4. However, it re-implemented the physical design to save precious power and space inside the processor, enabling the cores to be bundled together more tightly on the die.
Mike Clark, head architect of AMD’s Zen core, said the cloud-native core is “logically” identical to Zen 4. But the company fine-tuned the floorplan of the CPU to favor denser logic over clock speeds.
High clock frequencies are a tradeoff with area. Genoa, more of a general-purpose server CPU, clocks in at a maximum frequency of 4.1 GHz to handle the wide range of workloads that take place in the average data center. AMD said it dialed down the maximum clock speed of Bergamo, which features a single-core boost frequency of 3.1 GHz.
While the Zen 4 core and the L2 cache inside it fit into a 3.84-mm² rectangle of silicon, the area of Bergamo’s cloud-focused Zen 4c measures a mere 2.48 mm², translating to a space savings of 35% (Fig. 2).
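AMD’s 35% figure follows directly from the two die areas quoted above. A quick sketch of the arithmetic (the figures come from the article; the computation is purely illustrative):

```python
# Die-area comparison between the standard Zen 4 core and the
# cloud-focused Zen 4c core (both areas include the per-core L2 cache).
zen4_area_mm2 = 3.84   # Zen 4 core + L2 cache
zen4c_area_mm2 = 2.48  # Zen 4c core + L2 cache

# Fractional area saved by the Zen 4c implementation
savings = 1 - zen4c_area_mm2 / zen4_area_mm2
print(f"Area savings: {savings:.1%}")  # ≈ 35.4%, matching AMD's claim
```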
With everything closer together inside the CPU core, it takes less power to transfer signals around it. That opened the door for the company to strip out many of the power-hungry transistors the Zen 4 core uses to ferry signals over longer distances, said Clark.
Outside the core itself, the other major difference is a reduction in cache memory. While the L2 cache in the cloud-focused Zen core remains the same, the L3 cache shared by all of the CPU cores in a chiplet decreases by 50%, from 4 MB to 2 MB per core, to save additional die area.
The abundance of cores and reduced speeds give Bergamo better energy efficiency than Genoa, which is a huge consideration in cloud data centers and other large installations of servers.
Fewer Chiplets, More Cores
Bergamo is based on the same chiplet architecture as Genoa (Fig. 3). But the improvements to the core and cache made it possible to double the number of cores per core complex die (CCD)—to 16.
AMD said it fuses eight of these chiplets together for a total of 128 Zen 4c cores in a single package that fits up to 82 billion transistors. The denser core arrangement contrasts with Genoa, which co-packages 12 CPU chiplets with up to eight cores each, for a total of up to 90 billion transistors. In Bergamo, the 16 CPU cores are subdivided into a pair of eight-core clusters that each have access to 16 MB of L3 cache. So, while every chiplet contains the same amount of L3 cache as the base Genoa, twice as many CPU cores have to share it.
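The chiplet arithmetic above can be tallied in a few lines. The numbers come straight from the article; the snippet is only an illustration of how the core counts and per-core cache figures relate:

```python
# Chiplet layouts as described by AMD: Bergamo uses fewer, denser CCDs.
bergamo_ccds, bergamo_cores_per_ccd = 8, 16
genoa_ccds, genoa_cores_per_ccd = 12, 8

bergamo_cores = bergamo_ccds * bergamo_cores_per_ccd   # 128 cores
genoa_cores = genoa_ccds * genoa_cores_per_ccd         # 96 cores

# Each Bergamo CCD holds two 8-core clusters with 16 MB of L3 apiece,
# so per-chiplet L3 matches Genoa's 32 MB, but per-core L3 is halved.
l3_per_ccd_mb = 2 * 16
bergamo_l3_per_core = l3_per_ccd_mb / bergamo_cores_per_ccd  # 2 MB
genoa_l3_per_core = l3_per_ccd_mb / genoa_cores_per_ccd      # 4 MB

print(bergamo_cores, genoa_cores, bergamo_l3_per_core, genoa_l3_per_core)
```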
Despite the differences with Genoa, the same 6-nm I/O die with AMD’s Infinity Fabric sits at the center of Bergamo, giving it the same 12 channels of DDR5-4800, 128 lanes of PCIe Gen 5, and Genoa’s other connectivity properties (Fig. 4). Compute Express Link (CXL) 1.1, the new cache-coherent interconnect for accelerators and memory expansion that runs on PCIe, is also supported.
Importantly, the Bergamo CPU fits into the same thermal and power envelope as the more general-purpose Genoa—with a TDP of 360 W—said Kevin Lepak, head of server SoC and system architecture for AMD’s EPYCs.
Because it shares the same architecture, AMD said Bergamo is software-compatible with Genoa. The CPU is socket-compatible, too, which makes it relatively easy to replace the current EPYCs.
Robert Hormuth, corporate vice president leading architecture and strategy for AMD’s data center unit, said the improvements under the hood give Bergamo a leg up on Intel. He claims Bergamo can run cloud-native workloads up to 2.6X faster than Intel’s latest Xeon Scalable chips, previously code-named Sapphire Rapids.
While high-end server chips tend to be costly, AMD said that Bergamo’s high performance-per-watt can help reduce the operating costs of data centers in the long run. The flagship SKU in the family, the EPYC 9754, with 128 cores and 256 threads, costs $11,900.
The company is also supplying a 128-core single-threaded Bergamo CPU, the EPYC 9754S, priced at $10,200 (Fig. 5).
The Fall of Milan
While AMD’s latest silicon is targeted specifically at the cloud market, Meta intends to deploy Bergamo in data centers that run its social-media and messaging apps WhatsApp, Instagram, and Facebook.
Meta also apparently played a part in designing the Bergamo CPU. Alexis Black Bjorlin, VP of infrastructure at Meta, said it worked closely with AMD to fine-tune the server chip for its specific workloads, from the dense compute chiplets and core-to-cache ratios to power management and manufacturing “optimizations” that help it load a larger number of these servers into a rack.
“We are seeing significant performance improvements with Bergamo over Milan on the order of two and a half times,” said Bjorlin, citing the company’s pre-Genoa generation of EPYCs.