Arista Networks is developing a software-based agent designed to streamline the connection between network and server systems in large AI clusters.
The project is a partnership with Nvidia and uses its BlueField-3 SuperNIC, which is tailored for large-scale AI workloads and supports 400Gbps of bandwidth using RDMA over Converged Ethernet (RoCE) to maximize GPU server throughput.
The AI agent is built on the Extensible Operating System (EOS), Arista's flagship network operating system, which runs all of its switches and routers. It brings network features and connected GPUs under a single point of management.
Because it runs on Arista switches, the EOS AI agent can be extended to directly attached NICs and servers to enable a single point of control and visibility across an AI data center, according to Arista CEO Jayshree Ullal, who wrote a blog post about the new agent. “This remote AI agent, situated directly on an Nvidia BlueField-3 SuperNIC or running on the server, gathering telemetry from the SuperNIC, permits EOS, on the network switch, to configure, monitor, and debug network issues on the server, securing end-to-end network configuration and QoS consistency,” stated Ullal.
The remote agent, deployed on the AI network interface card (NIC) or server, turns the switch into the hub of the AI network for configuring, monitoring, and troubleshooting issues with AI hosts and GPUs, Ullal said, giving a single point of control and visibility. The remote agent also ensures uniform configuration and end-to-end traffic tuning of the AI network as one homogenous entity.
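Arista has not published the agent's interfaces, so the following is purely a hypothetical sketch of the pattern Ullal describes: an agent on the NIC or server samples counters and reports them to the switch, which aggregates them into a single point of visibility. All class, field, and counter names here are invented for illustration and do not come from EOS or the BlueField-3 API.

```python
# Hypothetical illustration of the remote-agent telemetry flow described
# above. Names are invented; this is not Arista's or Nvidia's API.
from dataclasses import dataclass, field


@dataclass
class NicTelemetry:
    """One sample of counters an agent might gather from a SuperNIC."""
    nic_id: str
    rdma_bytes_tx: int
    rdma_bytes_rx: int
    congestion_events: int


@dataclass
class SwitchCollector:
    """Stands in for EOS on the switch, aggregating agent reports so the
    switch becomes the single point of visibility for the AI network."""
    reports: dict = field(default_factory=dict)

    def ingest(self, sample: NicTelemetry) -> None:
        # Keep the latest report per NIC.
        self.reports[sample.nic_id] = sample

    def congested_nics(self, threshold: int) -> list:
        """Flag NICs whose congestion counters exceed a threshold, so
        issues can be debugged from one central place."""
        return [nid for nid, s in self.reports.items()
                if s.congestion_events > threshold]


collector = SwitchCollector()
collector.ingest(NicTelemetry("nic-a", 10_000, 9_500, 2))
collector.ingest(NicTelemetry("nic-b", 12_000, 11_000, 40))
print(collector.congested_nics(threshold=10))  # -> ['nic-b']
```

The point of the sketch is the direction of the data flow: telemetry moves from host-side agents up to the switch, rather than operators polling each server individually.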
Communication between EOS running in the network and the remote agent on the host allows host and network behaviors to be tracked and reported, and failures to be pinpointed, according to Ullal. It also means EOS can update the network topology directly, centralizing topology discovery while preserving the familiar Arista EOS configuration and management constructs across all Arista Etherlink platforms and partners, Ullal explained.
Arista's Etherlink technology will be supported across a range of products, including 800G systems and line cards, and will be compatible with the specifications set forth by the Ultra Ethernet Consortium.
Ullal explained that this technology suite is necessary because of surging AI dataset growth: it will help customers coordinate a complicated mesh of components, including GPUs, NICs, switches, optics, and cables, in large AI deployments.
“The demand for larger AI training models (LLMs) necessitates data parallelization. As these models grow larger, the number of necessary GPUs cannot match the immense parameter count and dataset size. Parallelization in AI, utilizing data, models, or pipelines, is dependent on the network linking these GPUs. Exchanging and calculating global gradients to adjust the model’s weights requires efficient teamwork from all components of the AI workflow. This includes GPUs, NICs, interconnecting accessories such as optics/cables, storage systems, and primarily the network tying everything together.”
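The “global gradients” exchange Ullal refers to is, at its core, an all-reduce: each GPU computes a gradient on its own shard of the data, the network averages those gradients, and every worker applies the identical update. A minimal single-process Python sketch of that step, with toy numbers and no real training framework, looks like this:

```python
# Sketch of data-parallel training's gradient exchange: each worker
# computes a local gradient on its data shard, an all-reduce averages
# the gradients (the step the network fabric carries in a real
# cluster), and all workers apply the same averaged update.

def allreduce_mean(grads_per_worker):
    """Average gradients elementwise across workers."""
    n = len(grads_per_worker)
    dim = len(grads_per_worker[0])
    return [sum(g[i] for g in grads_per_worker) / n for i in range(dim)]


# Two workers, each holding a gradient for the same 3 parameters,
# computed from their own shard of the dataset.
worker_grads = [
    [2.0, -4.0, 10.0],   # gradient from worker 0's shard
    [4.0, -2.0, 30.0],   # gradient from worker 1's shard
]

avg = allreduce_mean(worker_grads)
print(avg)  # -> [3.0, -3.0, 20.0]

# Every worker then applies the identical SGD step, keeping model
# replicas in sync: w <- w - lr * avg_gradient.
lr = 0.1
weights = [1.0, 1.0, 1.0]
weights = [w - lr * g for w, g in zip(weights, avg)]
```

In a real cluster this averaging is bandwidth-intensive and latency-sensitive, which is why, as the quote argues, the network linking the GPUs becomes the limiting component of the whole workflow.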
According to Ullal, eliminating isolated network silos allows the whole system to work in harmony and reach peak performance. “The AI Center excels by removing silos, allowing for coordinated performance fine-tuning, troubleshooting, and operation, where the centralized network plays an essential part in forming and powering this interconnected system.”
Arista plans to unveil its AI agent technology at the 10th anniversary of its IPO at the NYSE on June 5th, with customer trials anticipated in the second half of 2024.
In a related development, Arista announced a partnership with Vast Data to provide customers with high-performance infrastructure for AI development.
Vast, launched in 2019, offers a comprehensive package of storage, database, and computing services aimed at managing the development of large-scale AI workloads in data centers and cloud environments, according to the vendor.
As part of the agreement, Vast has certified Arista switches to operate within its environment, and the two companies will collaborate to integrate security and management technologies for AI-based infrastructure. Customers, according to Vast, will be able to track how AI data moves from edge to core to cloud.
The Vast Data Platform optimizes organizations’ data sets using proprietary Similarity data reduction and compression techniques, which the vendor says can significantly reduce power consumption and increase efficiency.
In addition to Arista, Vast works with a variety of partners, including Nvidia, whose BlueField-3 SuperNIC it supports, and Hewlett Packard Enterprise, which incorporates Vast’s technology in its HPE GreenLake for File Storage service.