Cisco shapes its strategy for Ethernet-based AI networks
Cisco is on a mission to make sure Ethernet is the chief underpinning for artificial intelligence networks now and in the future.
It has been a huge contributor to Ethernet development in the IEEE and other industry groups over the years, and now it’s one of the core vendors driving the Ultra Ethernet Consortium (UEC), a group that’s working to develop physical, link, transport and software layer advances for Ethernet to make it more capable of supporting AI infrastructures.
“Organizations are sitting on massive amounts of data that they are trying to make more accessible and gain value from faster, and they are looking at AI technology now,” said Thomas Scheibe, vice president of product management with Cisco’s cloud networking, Nexus & ACI product line.
“Customers want to know what they need to do now on the networking side to be able to run the huge clusters of GPUs they need and handle the volumes of data they create. And for most customers, it’s going to be Ethernet,” Scheibe said.
To that end, Cisco has put together a blueprint defining how organizations can use existing data center Ethernet networks to support AI workloads now.
Advancing Nexus 9000 features
A core component of Cisco’s AI blueprint is its Nexus 9000 data center switches, which support up to 25.6Tbps of bandwidth per ASIC and “have the hardware and software capabilities available today to provide the right latency, congestion management mechanisms, and telemetry to meet the requirements of AI/ML applications,” Cisco wrote in its Data Center Networking Blueprint for AI/ML Applications. “Coupled with tools such as Cisco Nexus Dashboard Insights for visibility and Nexus Dashboard Fabric Controller for automation, Cisco Nexus 9000 switches become ideal platforms to build a high-performance AI/ML network fabric.”
Two technologies that enable AI networking on Nexus switches are supported in the NX-OS operating system: Remote Direct Memory Access over Converged Ethernet version 2 (RoCEv2) and Explicit Congestion Notification (ECN), Scheibe said.
RoCEv2 is a high-performance networking technology that lets data transfer directly between the memory of two devices without involving the server CPU. Because it carries RDMA traffic inside UDP/IP packets, that traffic can be routed across Layer 3 networks, reducing latency and complexity as well as boosting throughput.
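To make the encapsulation concrete, here is a minimal Python sketch of how an RDMA message might be wrapped for RoCEv2: a 12-byte InfiniBand Base Transport Header (BTH) carried inside a UDP datagram addressed to port 4791. The function names are hypothetical, the flag bits are left at zero, and the IP and Ethernet headers and the trailing ICRC checksum are omitted, so this is an illustration of the layering rather than a working RoCE stack.

```python
import struct

ROCEV2_UDP_PORT = 4791          # UDP destination port registered for RoCEv2
OPCODE_RDMA_WRITE_ONLY = 0x0A   # "RC RDMA WRITE Only" in the InfiniBand opcode table


def build_bth(opcode, dest_qp, psn, pkey=0xFFFF):
    """Pack a 12-byte Base Transport Header (hypothetical helper, flags zeroed)."""
    return struct.pack(
        "!BBHII",
        opcode,
        0,                    # SE / M / PadCnt / TVer bits, all zero in this sketch
        pkey,                 # partition key, default value
        dest_qp & 0xFFFFFF,   # top byte reserved, low 24 bits = destination queue pair
        psn & 0xFFFFFF,       # top byte = AckReq + reserved, low 24 bits = sequence number
    )


def build_udp(src_port, payload):
    """Prepend an 8-byte UDP header; checksum 0 means 'not computed' over IPv4."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, length, 0) + payload


# One RDMA WRITE carrying four bytes of application data.
bth = build_bth(OPCODE_RDMA_WRITE_ONLY, dest_qp=0x12, psn=1)
datagram = build_udp(src_port=49152, payload=bth + b"data")

src, dst, length, _ = struct.unpack("!HHHH", datagram[:8])
print(f"UDP {src} -> {dst}, {length} bytes, BTH opcode 0x{datagram[8]:02X}")
```

Because the outer headers are ordinary UDP/IP, any Layer 3 switch can route the packet without understanding RDMA.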
ECN helps enable a lossless Ethernet network: switches mark packets when they detect congestion building, and endpoints throttle back their sending rate before packets get dropped. Lossless Ethernet is a key requirement not only for AI networking but also for today’s VoIP and video environments, Scheibe noted.
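The feedback loop behind ECN can be sketched in a few lines of Python. This is a toy model, not Cisco's implementation: the thresholds, rates, and the sender's reaction (cut the rate in proportion to the marking probability, otherwise probe upward) are all assumptions chosen for illustration, and the "no ECN" case models a sender that simply ignores congestion.

```python
def mark_probability(queue_depth, min_th, max_th):
    """Marking ramps linearly from 0 at min_th to 1 at max_th (WRED-style)."""
    if queue_depth <= min_th:
        return 0.0
    if queue_depth >= max_th:
        return 1.0
    return (queue_depth - min_th) / (max_th - min_th)


def simulate(rounds, ecn_enabled, send_rate=120.0, drain_rate=100.0,
             min_th=50, max_th=150, buffer_limit=300):
    """Return (dropped units, peak queue depth) for one sender and one queue."""
    queue, drops, peak = 0.0, 0.0, 0.0
    for _ in range(rounds):
        queue = max(queue + send_rate - drain_rate, 0.0)
        if queue > buffer_limit:            # buffer overflow: the excess is dropped
            drops += queue - buffer_limit
            queue = buffer_limit
        peak = max(peak, queue)
        p = mark_probability(queue, min_th, max_th) if ecn_enabled else 0.0
        if p > 0:
            send_rate *= 1 - p / 2          # back off in proportion to marking
        else:
            send_rate += 5.0                # no marks seen: probe for more bandwidth
    return drops, peak


drops_ecn, peak_ecn = simulate(200, ecn_enabled=True)
drops_plain, peak_plain = simulate(200, ecn_enabled=False)
print(f"with ECN:    drops={drops_ecn:.0f}, peak queue={peak_ecn:.0f}")
print(f"without ECN: drops={drops_plain:.0f}, peak queue={peak_plain:.0f}")
```

In this model the ECN-aware sender keeps the queue well below the buffer limit, while the sender that ignores marks overflows the buffer and loses traffic.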
Another tool, Priority Flow Control (PFC), pauses traffic for a given priority class on a link when buffers fill; it can help control congestion in Layer 3-based networks and plays an important role in overall congestion management.
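The pause-and-resume behavior of PFC can be illustrated with a small Python model. The class name, the XOFF/XON threshold values, and the traffic rates below are hypothetical, and the model ignores the extra buffer headroom a real switch needs for packets already in flight when the pause frame is sent.

```python
from dataclasses import dataclass


@dataclass
class LosslessQueue:
    """One priority class on one port, with pause (xoff) and resume (xon) thresholds."""
    xoff: int
    xon: int
    depth: int = 0
    paused: bool = False

    def step(self, arrivals, departures):
        """Advance one time step; return 'PAUSE', 'RESUME', or None."""
        if self.paused:
            arrivals = 0                     # upstream neighbor honors the pause
        self.depth = max(self.depth + arrivals - departures, 0)
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            return "PAUSE"
        if self.paused and self.depth <= self.xon:
            self.paused = False
            return "RESUME"
        return None


# A RoCE priority class receiving 30 units per step but draining only 10.
queue = LosslessQueue(xoff=100, xon=40)
events, peak = [], 0
for t in range(14):
    signal = queue.step(arrivals=30, departures=10)
    peak = max(peak, queue.depth)
    if signal:
        events.append((t, signal))

print(events)   # [(4, 'PAUSE'), (10, 'RESUME'), (13, 'PAUSE')]
print(peak)     # 100
```

Only the congested priority class is paused; traffic in other classes on the same link would keep flowing, which is what distinguishes PFC from a plain Ethernet pause.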
Taken together, these technologies let an Ethernet network prioritize certain sets of workloads, such as AI workloads that cannot tolerate dropped packets, so that they always get network priority even when there is congestion, Scheibe said.
“These technologies can be implemented in Nexus networks today, and customers can tune their environments to handle their workload mix,” Scheibe said. “There is ongoing work to handle larger and more AI workloads, and there are other techniques that can be used to make sure customers can easily distribute them across available bandwidth.”
Cisco has also published scripts so customers can automate specific settings across the network to set up this fabric and simplify configurations, Scheibe said.
In addition, Nexus 9000 switches come with built-in telemetry capabilities that can be used to correlate issues in the network and help optimize it for RoCEv2 transport, Cisco stated.
“The Cisco Nexus 9000 family of switches provides hardware flow telemetry information through flow table and flow table events. With these features, every packet traversing the switch can be accounted for, observed, and correlated with behavior such as micro-bursts or packet drops,” Cisco wrote. Customers can export this data to the Cisco Nexus Dashboard Insights management package and show the data per-device, per-interface, down to per-flow level granularity, according to Cisco.
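As a rough illustration of what correlating telemetry with micro-bursts could look like, the Python sketch below scans a series of queue-depth samples and flags short spikes above a threshold. The sample format, threshold, and function name are hypothetical; they do not reflect the actual export format of Nexus flow telemetry or Nexus Dashboard Insights.

```python
def find_microbursts(samples, threshold, max_len):
    """Return (start, end, peak) for runs above threshold lasting at most max_len samples."""
    bursts = []
    start = None
    # A trailing 0 acts as a sentinel so a run that ends the series is still closed.
    for i, depth in enumerate(list(samples) + [0]):
        if depth > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start <= max_len:         # short spike: a micro-burst
                bursts.append((start, i - 1, max(samples[start:i])))
            start = None                     # longer runs are sustained congestion
    return bursts


# Hypothetical queue-depth samples for one interface, taken at a fixed fine interval.
depths = [2, 3, 90, 95, 4, 1, 80, 85, 88, 92, 96, 3]
print(find_microbursts(depths, threshold=50, max_len=3))   # [(2, 3, 95)]
```

The two-sample spike is reported as a micro-burst, while the five-sample run is left out because it looks like sustained congestion rather than a burst.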
Beyond the Nexus 9000
Another element of Cisco’s AI network infrastructure is its new high-end programmable Silicon One processors, which are aimed at large-scale AI/ML infrastructures for enterprises and hyperscalers.
Cisco added the 5nm 51.2Tbps Silicon One G200 and 25.6Tbps G202 to its now 13-member Silicon One family. The processors can be customized for routing or switching from a single chipset, eliminating the need for different silicon architectures for each network function. This is accomplished with a common operating system, P4 programmable forwarding code, and an SDK.
The new devices, positioned at the top of the Silicon One family, will bring networking enhancements that make them ideal for demanding AI/ML deployments or other highly distributed applications, Cisco said.
Core to the Silicon One system is its support for enhanced Ethernet features, such as improved flow control and congestion awareness and avoidance.
The system also includes advanced load-balancing capabilities and “packet-spraying” that spreads traffic across multiple GPUs or switches to avoid congestion and improve latency. Hardware-based link-failure recovery also helps ensure the network operates at peak efficiency, according to Cisco.
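The benefit of packet-spraying is easiest to see next to conventional per-flow hashing. The Python sketch below is an assumption-laden comparison, not Silicon One's algorithm: it places a handful of large flows onto parallel links either by hashing the flow name (every packet of a flow follows one link) or by spraying packets round-robin across all links. The flow names, packet counts, and link count are made up.

```python
import zlib


def ecmp_load(flows, num_links):
    """Per-flow hashing: all packets of a flow land on the same link."""
    load = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        load[link] += packets
    return load


def spray_load(flows, num_links):
    """Per-packet spraying: packets are distributed round-robin over every link."""
    load = [0] * num_links
    next_link = 0
    for packets in flows.values():
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % num_links
    return load


# A few long-lived, high-volume flows, as is typical between GPUs during training.
flows = {"gpu0->gpu4": 1000, "gpu1->gpu5": 1000, "gpu2->gpu6": 1000, "gpu3->gpu7": 1000}
hashed = ecmp_load(flows, num_links=4)
sprayed = spray_load(flows, num_links=4)
print("per-flow hashing:", hashed)
print("packet spraying: ", sprayed)
```

Spraying always yields the most even spread possible, whereas hashing can stack several large flows on one link. The trade-off is that sprayed packets may arrive out of order, so the receiving side has to tolerate or correct reordering.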
Combining these enhanced Ethernet technologies and taking them a step further ultimately lets customers set up what Cisco calls a Scheduled Fabric. In a Scheduled Fabric, the physical components – chips, optics, switches – are tied together like one big modular chassis and communicate with each other to provide optimal scheduling behavior and much higher bandwidth throughput, especially for flows like AI/ML, Cisco said.
Data-center sustainability focus
While AI seems all-encompassing these days, there are other topics that are challenging data center network operators.
For example, customers are looking to efficiently expand existing data center networks to handle larger workloads, so they want to find the best way to integrate 400G into the network, Scheibe said.
Two other major challenges are reducing data center power consumption and increasing sustainability practices, Scheibe said.
“Organizations are looking for help on getting a baseline on how much power they are using and learning what their current carbon footprint is so they can make informed decisions on how to move forward,” Scheibe said.
Cisco Nexus Cloud offers a Network Energy Utilization service that gives customers an idea of a data center’s environmental impact.
Recently, Cisco announced that the Nexus Dashboard will provide real-time and historical insights for power consumption of all IT equipment in the data center and estimate the energy footprint of data center operations.
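The arithmetic behind such an estimate is straightforward, as the Python sketch below shows. It is not how Nexus Dashboard computes its figures: the function names, the power readings, and the emissions factor of 0.4 kg CO2 per kWh are placeholder assumptions, and a real baseline would use the local grid's published factor.

```python
def energy_kwh(power_watts, interval_seconds):
    """Integrate evenly spaced power readings (watts) into kilowatt-hours."""
    joules = sum(power_watts) * interval_seconds
    return joules / 3_600_000        # 1 kWh = 3.6 million joules


def carbon_kg(kwh, kg_co2_per_kwh):
    """Convert energy to an emissions estimate using a grid emissions factor."""
    return kwh * kg_co2_per_kwh


# Hypothetical readings for one switch: 1,000 W sampled every 15 minutes for an hour.
readings = [1000, 1000, 1000, 1000]
used = energy_kwh(readings, interval_seconds=900)
print(f"{used:.2f} kWh, about {carbon_kg(used, 0.4):.2f} kg CO2")
```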
Nexus Dashboard will also provide the AI Data Center Blueprint for Networking, which will show enterprises developing AI-based applications how to set up their networks to handle the additional transaction load. For example, it will detail how to implement InfiniBand-to-Ethernet network migrations and large-scale machine-learning fabrics.