NVIDIA’s 800G Ethernet switch powers xAI’s Colossus AI supercomputer

Nov. 5, 2024
The cluster has maintained 95% data throughput and zero application latency degradation or packet loss due to flow collisions—performance NVIDIA says previously was available only via InfiniBand.

NVIDIA recently reached a major networking milestone. xAI’s Colossus supercomputer cluster in Memphis, Tennessee, comprising 100,000 NVIDIA Hopper GPUs, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform.

NVIDIA said the platform, which xAI uses for its Remote Direct Memory Access (RDMA) network, could “deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet.”

Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models. Chatbots are offered as a feature for X Premium subscribers. xAI is doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

What’s even more compelling about this is the timeline.

Systems of this size typically take many months to years to build, but xAI and NVIDIA built the supporting facility and supercomputer in just 122 days. It took 19 days from the time the first rack rolled onto the floor until training began.

“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”

Maintaining low latency was also a factor.

NVIDIA said that across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput, enabled by Spectrum-X congestion control.

The Spectrum SN5600 supports port speeds of up to 800 Gbits/sec and is based on the Spectrum-4 switch ASIC. xAI is pairing the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs.

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

