Alibaba Cloud has revealed the design of an Ethernet-based network it created specifically to carry traffic for training large language models – and has used in production for eight months.
The Chinese Cloud also revealed that its choice of Ethernet was informed by a desire to avoid vendor lock-in and leverage "the power of the entire Ethernet Alliance for faster evolution" – a decision that backs arguments made by a collection of vendors who are trying to attack Nvidia's networking business.
Alibaba's plans were revealed on the GitHub page of Ennan Zhai – an Alibaba Cloud senior staff engineer and research scientist focused on network research. Zhai posted a paper [PDF] to be presented at August's SIGCOMM conference – the annual get-together of the Association for Computing Machinery's special interest group on data communications.
Titled "Alibaba HPN: A Data Center Network for Large Language Model Training," the paper opens with the observation that traffic cloud computing traffic "… generates millions of small flows (eg lower than 10Gbit/sec)," while LLM training "produces a small number of periodic, bursty flows (eg 400Gbit/sec) on each host."
That traffic pattern means Equal-Cost Multi-Path (ECMP) routing – a commonly used method of sending packets to a single destination over multiple paths – becomes predisposed to hash polarization, a phenomenon in which load balancing struggles and usable bandwidth can be significantly reduced.
Alibaba Cloud's homebrew alternative, named "High Performance Network" (HPN), "avoids hash polarization by decreasing the occurrences of ECMP, but also greatly reduces the search space for path selection, thus allowing us to precisely select network paths capable of holding elephant flows."
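To see the underlying problem, consider the toy Python sketch below – the hash function, addresses, and flow counts are illustrative, not Alibaba's. When ECMP hashes millions of small flows, per-path loads average out; with only a handful of 400Gbit/sec elephant flows, hash collisions routinely overload some links while leaving others idle. Polarization compounds the effect when successive switches compute correlated hashes over the same header fields.

```python
import zlib
from collections import Counter

# Toy ECMP model -- illustrative only; real switches use vendor-specific
# hash functions computed over the packet's 5-tuple.
def ecmp_pick(five_tuple: tuple, num_paths: int) -> int:
    return zlib.crc32(repr(five_tuple).encode()) % num_paths

def load_per_path(flows, num_paths):
    """Sum the rate (Gbit/sec) of the flows hashed onto each path."""
    load = Counter({p: 0 for p in range(num_paths)})
    for five_tuple, rate in flows:
        load[ecmp_pick(five_tuple, num_paths)] += rate
    return sorted(load.values())

NUM_PATHS = 8

# Cloud-style traffic: huge numbers of small flows average out nicely.
small = [((f"10.1.{i % 256}.{i // 256 % 256}", "10.0.0.1",
           1024 + i % 60000, 443, 6), 0.01) for i in range(100_000)]

# LLM-style traffic: eight 400Gbit/sec elephant flows. With only eight
# flows across eight paths, collisions almost always put two flows on one
# path and none on another.
elephants = [((f"10.2.0.{i}", "10.0.0.2", 49152 + i, 4791, 17), 400)
             for i in range(8)]

print("small flows, per-path load:", load_per_path(small, NUM_PATHS))
print("elephants,   per-path load:", load_per_path(elephants, NUM_PATHS))
```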
HPN also addresses the fact that GPUs need to work in sync while training LLMs, which makes AI infrastructure sensitive to single points of failure – especially top-of-rack switches.
Alibaba's network design therefore uses a pair of switches – but not in the stacked configuration suggested by switch vendors.
Crammed full of cards
The paper explains that each host Alibaba Cloud uses for LLM training contains eight GPUs and nine network interface cards (NICs), each with a pair of 200Gbit/sec ports. One of the NICs handles housekeeping traffic on a "frontend network."
The other eight NICs form a "backend network" that carries training traffic, while GPUs within a host communicate directly over an intra-host network running at 400–900GB/sec (bidirectional). On the backend network, each NIC serves a single GPU – an arrangement Alibaba Cloud terms "rails" – so each accelerator gets "a dedicated 400Gb/sec of RDMA network throughput, resulting in a total bandwidth of 3.2Tb/sec."
"Such a design aims to maximize the utilization of the GPU's PCIe capabilities (PCIe Gen5×16), thus pushing the network send/receive capacity to the limit," the paper states.
Each port on the NICs connects to a different top-of-rack switch, to avoid single points of failure.
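To make the arithmetic concrete, here's a minimal Python sketch of the per-host layout as the paper describes it – the class and field names are hypothetical, not Alibaba's:

```python
from dataclasses import dataclass

# Illustrative model of the per-host layout described in the paper;
# names are hypothetical, not taken from Alibaba's code.
@dataclass
class RailNic:
    gpu: int              # the single GPU this "rail" NIC serves
    port_gbit: int = 200  # per-port rate, Gbit/sec
    ports: int = 2        # each port uplinks to a DIFFERENT ToR switch

GPUS_PER_HOST = 8
backend_nics = [RailNic(gpu=g) for g in range(GPUS_PER_HOST)]
# The ninth NIC (omitted here) serves the frontend network for housekeeping.

per_gpu = backend_nics[0].port_gbit * backend_nics[0].ports
per_host = sum(n.port_gbit * n.ports for n in backend_nics)

print(f"per-GPU RDMA throughput: {per_gpu} Gbit/sec")          # 400
print(f"per-host total:          {per_host / 1000} Tbit/sec")  # 3.2

# Dual-ToR wiring: port 0 of every NIC runs to one top-of-rack switch and
# port 1 to the other, so losing a single ToR halves a host's bandwidth
# instead of severing it.
```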
The Chinese Cloud's remarks about its preference for Ethernet will be music to the ears of AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft. All of those vendors recently signed up for the Ultra Accelerator Link consortium – an effort to challenge Nvidia's NVLink networking biz. Intel and AMD have argued that the consortium – and other advanced networking efforts like Ultra Ethernet – represents a better way to network AI workloads, because open standards always win in the long run by enabling easier innovation.
But while Alibaba Cloud's HPN design is based on Ethernet, it still uses Nvidia tech. The GPU champ's NVLink provides the intra-host network (which has more bandwidth than the network between hosts), and Nvidia's "rail-optimized" design approach – under which each network interface card connects to a different set of top-of-rack switches – is also in place.
Single-chip switches rule at Alibaba
The paper also makes many mentions of a "51.2Tb/sec Ethernet single-chip switch (first released in early 2023)" in Alibaba Cloud's top-of-rack switches. Two devices meet that description: Broadcom's Tomahawk 5 ASIC, which shipped in March 2023, and Cisco's G200, which arrived in June of the same year. The reference to "early 2023" suggests Alibaba Cloud went with Broadcom.
Whatever's inside Alibaba's switches, the paper reveals that the Chinese Cloud has a preference for switches powered by a single chip.
"There have been multi-chip chassis switches supporting higher bandwidth capacity," the paper states, before noting that "Alibaba Cloud's long-term experience in operating datacenter networks reveals that multi-chip chassis switches introduce more stability risks than single-chip switches."
The company's fleet of single-chip switches, it's revealed, outnumbers multi-chip models 32.6 to one. And those multi-chip switches experience critical hardware failures at a rate 3.77x higher than single-chip switches.
DIY heatsink needed
While Alibaba Cloud adores single-chip switches – and enjoys the fact that the 51.2Tbit/sec units it adopted deliver double the throughput of their predecessors while consuming only 45 percent more power – the new models don't run any cooler.
If the chips warm beyond 105°C the switches become prone to shutting down, and Alibaba Cloud could not find a switch vendor offering cooling capable of keeping them below that threshold.
It therefore created its own vapor chamber heat sink.
"By optimizing the wick structure and deploying more wicked pillars at the center of the chip, heat could be carried out more efficiently," the paper explains.
Datacenter design disclosed
All of the above is built into "pods" that house 15,000 GPUs apiece, with each pod residing in a single datacenter building.
"All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs," the paper reveals, adding "In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building."
All optical fiber runs within the building are shorter than 100 meters, which allows the "use of lower-cost multi-mode optical transceivers (cutting 70 percent cost compared with single-mode optical transceivers)."
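For the sizing claim, here's a back-of-the-envelope check in Python – the per-GPU wattage is an inference from the quoted figures, not a number from the paper:

```python
# Back-of-the-envelope pod sizing from the figures quoted above.
building_power_mw = 18    # per-building power constraint
gpus_per_pod = 15_000     # one pod per 18MW building

watts_per_gpu = building_power_mw * 1e6 / gpus_per_pod
hosts_per_pod = gpus_per_pod // 8  # eight GPUs per host

print(f"power budget: ~{watts_per_gpu:.0f} W per GPU")    # ~1,200 W, covering
                                                          # host, network, and
                                                          # cooling overhead
print(f"hosts: {hosts_per_pod} eight-GPU hosts per pod")  # 1,875
```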
It's not all sweetness and light: the paper admits that "HPN introduces extra designs … making wiring much more complex."
"Especially at the nascent stage of constructing HPN, on-site staff make a lot of wiring mistakes." That means extra testing is needed.
The paper also notes that forwarding capacity of a single Ethernet chip doubles every two years. Alibaba Cloud is therefore already "designing the next-generation network architecture equipping the higher capacity single-chip switch."
"In the land construction planning of our next-generation datacenters, the total power constraints for a single building have been adjusted to cover more GPUs. Thus, when the new datacenter is delivered, it can be directly equipped with 102.4Tbit/sec single-chip switches and the next-generation HPN."
The paper also notes that training LLMs with hundreds of billions of parameters "relies on a large-scale distributed training cluster, typically equipped with tens of thousands of GPUs."
Alibaba Cloud's own Qwen model comes in a variant with 110 billion parameters – which suggests the company operates an awful lot of pods running HPN, and a very large number of GPUs. It will need many more as its models and datacenters become larger and more numerous.
Source: https://www.theregister.com/2024/06/27/alibaba_network_datacenter_designs_revealed/