800G Ethernet support has recently been added to Juniper’s PTX and QFX platforms, marking a significant step towards the adoption of Ethernet as a primary technology for AI networking.
In a move designed to augment the routers at the heart of data center, enterprise, and service provider systems, Juniper Networks is expanding its AI network strategy. The company’s PTX routing and QFX switching platforms, now able to support 800G Ethernet, are poised to accommodate future Ethernet-based AI networking environments.
“Our 800GE platforms are designed to manage AI training workloads effectively,” said Julius Francis, Senior Director of Product Marketing and Strategy at Juniper. “We aim to enhance these platforms further, allowing them to fulfill a larger array of WAN use cases and advancing network capacity and density resolution.”
Francis noted that service providers, cloud providers, and large corporations often face the challenging task of achieving sustainable automation while providing sufficient capacity and scale across congestion hotspots such as metro aggregation, peering, core networks, data center interconnects (DCI), and DCI edge.
Achieving optimal GPU efficiency to reduce job completion times is essential for managing AI costs for both enterprises and cloud providers, Francis said.
“Traditionally, InfiniBand has been the go-to networking technology in the AI networking ecosystem, known for its performance yet hindered by its higher cost and limited availability compared to Ethernet – the most prevalent Layer 2 (L2) technology globally,” Francis said.
Juniper is now offering an Ethernet-based alternative with 400GE and 800GE options in its PTX and QFX platforms, which are enhanced by Apstra AIOps. Apstra is Juniper’s intent-based data center software that maintains a real-time repository of configuration, telemetry, security and validation information to ensure a network is doing what the organization wants it to do.
Juniper recently tightened ties between Apstra and its AI-Native Networking Platform, which is anchored by the vendor’s cloud-based, natural language Mist AI and Marvis virtual network assistant (VNA) technology.
Juniper’s PTX and QFX platforms, which run the Junos operating system, are expected to lead its AI networking initiative. With support for a high-radix routing architecture, deep buffers, and a cell-based switch fabric, these platforms are well suited to spine or leaf roles in AI data center networks, Francis said.
Other features of the PTX and QFX platforms tailored to AI data center networking include efficient, deep-buffered interfaces, a scalable cell-based fabric architecture, virtual output queue (VOQ) scheduling, RDMA over Converged Ethernet (RoCEv2), adaptive load balancing, and integrated IPFIX and in-band network telemetry metadata (INT-MD), according to Francis. Juniper’s PTX series also offers IP over Dense Wavelength Division Multiplexing (IPoDWDM) as part of the company’s Converged Optical Routing Architecture (CORA).
In a recent blog post, Amit Bhardwaj, Juniper’s Vice President of Product Management, wrote: “Transitioning from conventional, compartmentalized IP and optical control planes to a unified mesh architecture can significantly enhance network utilization and sustainability. CORA simplifies network layers, liberates unutilized WDM capacity, and removes the necessity for external transponders in many applications – permitting up to 54% power savings and a 55% reduction in carbon emissions.”
Juniper is expected to be a major player in AI networking, joining competitors such as Cisco and Arista, which are also developing technology to handle AI workloads.
A core part of Cisco’s AI blueprint is its Nexus 9000 data center switches, which support up to 25.6Tbps of bandwidth per ASIC. The switches have the hardware and software capabilities needed to provide the right latency, congestion management mechanisms, and telemetry for AI/ML applications, according to Cisco’s Data Center Networking Blueprint for AI/ML Applications. Combined with Nexus Dashboard Insights for visibility and Nexus Dashboard Fabric Controller for automation, the Nexus 9000 switches are well suited to building a high-performance AI/ML network fabric, Cisco says.
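To put the 25.6Tbps per-ASIC figure in context, the short sketch below converts it into a rough upper bound on full-rate ports at common Ethernet speeds. It is back-of-the-envelope arithmetic only: the function name `max_full_rate_ports` is illustrative, and real port counts depend on SerDes lanes, breakout options, and the specific switch model, none of which are modeled here.

```python
def max_full_rate_ports(asic_gbps: int, port_gbps: int) -> int:
    """Upper bound on ports an ASIC can serve at line rate.

    Simplifying assumption: all ASIC bandwidth is usable for front-panel
    ports and every port runs at the same speed.
    """
    if port_gbps <= 0:
        raise ValueError("port speed must be positive")
    return asic_gbps // port_gbps


ASIC_GBPS = 25_600  # 25.6Tbps, the per-ASIC figure quoted above

for speed in (100, 400, 800):
    ports = max_full_rate_ports(ASIC_GBPS, speed)
    print(f"{speed}G ports at line rate: up to {ports}")
```

Under these assumptions, doubling the port speed from 400G to 800G halves the number of line-rate ports a single ASIC can serve.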
Cisco’s AI network infrastructure also includes its new high-end programmable Silicon One processors, which are aimed at large-scale AI/ML infrastructures for enterprises and hyperscalers. A key part of Silicon One is its support for advanced Ethernet features such as enhanced flow control and congestion awareness and avoidance. The system also features advanced load-balancing capabilities and “packet-spraying,” which distributes traffic across multiple GPUs or switches to prevent congestion and reduce latency. Recovery from link failures is also handled in hardware, which helps keep the network operating at peak efficiency, according to Cisco.
These enhanced Ethernet technologies can be taken a step further, allowing customers to set up what Cisco calls a Scheduled Fabric. In a Scheduled Fabric, the physical components (switches, chips, and optics) work together like a single large modular chassis. They communicate with one another to provide optimal scheduling behavior and much higher bandwidth throughput, especially for flows such as AI/ML, according to Cisco.
Arista, meanwhile, has outlined its AI networking technology, called AI Spine. AI Spine, which is controlled by Arista EOS, uses data center switches with deep packet buffers and networking software to manage AI traffic efficiently.
Arista’s AI Spine is built on its 7800R3 Series data center switches, which offer up to 460Tbps at the high end, along with support for numerous 40Gbps, 50Gbps, 100Gbps, or 400Gbps interfaces and 384GB of deep buffering. AI Spine is intended to enable high-speed, reliable, low-latency, Ethernet-based networks capable of connecting thousands of GPUs at speeds from 100Gbps to 800Gbps, according to Arista.
Commenting on the competitive landscape, Juniper’s Francis pointed to the challenge of handling the small number of very large data flows that are common in AI workloads, which is a major problem for conventional network designs that rely on per-flow load balancing. He stressed the need for effective management and appropriate capacity allocation within network fabrics to support AI training jobs. Failing to find and fix network bottlenecks and inefficiencies can result in significant wasted spending on AI infrastructure, he warned.
Francis added that existing custom, scheduled Ethernet fabrics can improve resource distribution, but they bring their own operational and visibility problems, along with strong vendor dependency similar to that of InfiniBand fabrics.
To tackle AI networking challenges, Francis advocates open-standard, interoperable Ethernet fabrics. This approach prioritizes operational enhancements that address the differing needs of various types of AI workloads, he explained.
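The problem with per-flow load balancing and a handful of large flows can be illustrated with a small simulation. The sketch below is a simplified model, not any vendor’s implementation: it hashes each flow’s 5-tuple to pick an uplink (as ECMP-style per-flow balancing does) and compares the resulting link loads with an idealized even spread of the same traffic. The addresses, port numbers, flow sizes, and function names are all hypothetical; UDP port 4791 is the standard RoCEv2 destination port.

```python
import hashlib


def ecmp_link(flow: tuple, num_links: int) -> int:
    """Pick an uplink by hashing the flow's 5-tuple (per-flow balancing)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def per_flow_load(flows: list, num_links: int) -> list:
    """Load per link (Gbps) when every packet of a flow follows one link."""
    load = [0.0] * num_links
    for flow, gbps in flows:
        load[ecmp_link(flow, num_links)] += gbps
    return load


def evenly_spread_load(flows: list, num_links: int) -> list:
    """Idealized load per link if traffic were spread evenly over all links."""
    total = sum(gbps for _, gbps in flows)
    return [total / num_links] * num_links


# Hypothetical workload: 8 large RoCEv2 flows of 400 Gbps each.
# 5-tuple: (src IP, dst IP, protocol, src port, dst port); 17 = UDP.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791), 400.0)
    for i in range(8)
]
NUM_LINKS = 8

hashed = per_flow_load(flows, NUM_LINKS)
spread = evenly_spread_load(flows, NUM_LINKS)
print("per-flow hashing:", hashed, "busiest link:", max(hashed))
print("even spreading:  ", spread, "busiest link:", max(spread))
```

With so few flows, the hash will typically place more than one flow on the same link while leaving other links idle, so the busiest link under per-flow hashing can never be lighter than under an even spread and is usually heavier. That imbalance is what techniques such as adaptive load balancing and packet spraying aim to reduce.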
“Whether implemented in fixed form factors or large chassis switches suitable for multiplanar, multistage Clos architectures, or high-radix spine topologies, Ethernet offers the most cost-effective and flexible solution for data center technology,” Francis said. Clos is a multistage network topology that Juniper uses to build large data center fabrics. Juniper pairs it with its EVPN-VXLAN fabric to offer increased network scalability and segmentation.
“As a converged technology, Ethernet fabrics support multivendor integration and operations, providing a range of design options to achieve the desired balance of performance, resiliency, and cost efficiency for the back-end networking of AI data centers and their broader AI infrastructures.”
Juniper’s AI technology was one of the core reasons Hewlett Packard Enterprise recently said it would acquire Juniper Networks for $14 billion. Networking will become the new core business and architecture foundation for HPE’s hybrid cloud and AI solutions delivered through the company’s GreenLake hybrid cloud platform, the companies stated.
The combined company will offer secure, end-to-end AI-native solutions that are built on the foundation of cloud, high performance, and experience-first, and will also have the ability to collect, analyze, and act on aggregated telemetry across a broader installed base, according to the companies. The deal is expected to close by early 2025 at the latest.