Networking startup Enfabrica is making the rounds at trade shows to demonstrate its new networking products, which are targeted at the heavy data throughput that AI workloads demand.
Enfabrica’s Accelerated Compute Fabric SuperNIC (ACF-S) silicon is designed to deliver higher bandwidth, lower latency, greater resiliency, and more programmatic control for data center operators running data-intensive AI and HPC workloads.
The company came out of stealth mode last year, announcing a $125 million funding round led by Atreides Management with support from Nvidia – which is also in the smartNIC business with its BlueField line – as well as several venture firms.
Shrijeet Mukherjee, who previously headed up networking platforms and architecture at Google, started the company in 2020 with CEO Rochan Sankar, previously a director of engineering at Broadcom. The two zeroed in on what they say is a problem with networking hardware: that it is built on 20-year-old designs that are just fine for CPUs but not adequate for GPU networking.
“Data center networking evolved to handle incoming traffic by distributing it across numerous nodes. However, the advent of AI and ML has introduced new challenges to this framework,” said Mukherjee, Enfabrica’s chief development officer.
According to Enfabrica, traditional data center setups face issues with sprawling server networking components and rigid connections that restrict bandwidth and fault tolerance. AI environments exacerbate these issues, with data transfers across GPUs necessitating multiple relay points, which are susceptible to congestion and cause uneven load distributions. Moreover, when a GPU link fails, it halts the entire process.
“Current supercomputers are not designed to be highly fault tolerant, requiring significant efforts to manage failures effectively,” explained Mukherjee.
Enfabrica aims to enhance fault tolerance in network designs by implementing systems that are not confined to point-to-point connections. This allows for multiple routing paths, enabling balanced load distribution and maintaining function even when some connections fail.
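The multipath idea can be illustrated with a toy sketch. Everything here is an illustrative assumption, not Enfabrica’s implementation: the point is only that with multiple routes between endpoints, traffic can be balanced across healthy links and keeps flowing when one fails, unlike a point-to-point design where a failed link halts the transfer.

```python
class MultipathFabric:
    """Toy model of multipath load balancing: transfers are spread
    across healthy links, and traffic continues when a link fails."""

    def __init__(self, links):
        self.healthy = set(links)                # links currently up
        self.sent = {link: 0 for link in links}  # bytes sent per link

    def fail(self, link):
        # A point-to-point design would stall here; multipath routes around it.
        self.healthy.discard(link)

    def send(self, nbytes):
        if not self.healthy:
            raise RuntimeError("no healthy links")
        # Pick the least-loaded healthy link (a simple balancing policy).
        link = min(self.healthy, key=lambda l: self.sent[l])
        self.sent[link] += nbytes
        return link

fabric = MultipathFabric(["link0", "link1", "link2", "link3"])
for _ in range(8):
    fabric.send(1024)          # evenly balanced: 2048 bytes per link
fabric.fail("link2")           # one link goes down...
fabric.send(1024)              # ...traffic still flows on the rest
```

Real fabrics make this decision per packet or per flow in hardware, but the failure semantics are the same: load spreads across the surviving paths instead of the job stopping.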
“Data centers currently operate on a model where a dual-socket system is ideal for processing tasks. This system works well until the requirements exceed the capabilities of those two sockets, leading to inefficiencies,” explained Mukherjee.
“After much consideration, we realized that a fundamental architectural change was necessary. The solution had to originate from a silicon-based company, one that could innovate rapidly and comprehensively around the needs of modern systems,” added Mukherjee.
ACF-S provides multi-terabit switching and bridging among diverse compute and memory resources on a single silicon die, without modifying existing physical interfaces, protocols, or software layers above the device drivers. This reduces the number of devices required in current AI clusters, along with their I/O latency and power consumption.
Moreover, the technology gives any accelerator direct, unimpeded access to local CXL.mem DDR5 DRAM through CXL memory bridging, expanding the memory capacity available within a single GPU rack to more than 50 times that of a typical GPU’s native High-Bandwidth Memory (HBM).
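The 50x figure can be sanity-checked with rough back-of-envelope numbers. The HBM capacity, DIMM size, and pool size below are illustrative assumptions, not figures from Enfabrica:

```python
# Back-of-envelope check of the memory-expansion claim.
# Assumed figures (not from Enfabrica): a GPU with 80 GB of HBM,
# and a CXL.mem pool of DDR5 DIMMs reachable through the NIC.
hbm_per_gpu_gb = 80
ddr5_dimm_gb = 128          # one high-capacity DDR5 RDIMM
dimms_in_pool = 32          # DIMMs reachable via CXL bridging

pool_gb = ddr5_dimm_gb * dimms_in_pool   # 4096 GB of pooled DRAM
expansion = pool_gb / hbm_per_gpu_gb     # capacity relative to HBM
print(f"{pool_gb} GB pool ≈ {expansion:.0f}x one GPU's HBM")
# → 4096 GB pool ≈ 51x one GPU's HBM
```

Under these assumptions a modest pool of commodity DDR5 already clears the 50x mark, which is the economic argument for bridging DRAM over CXL rather than scaling HBM.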
Enfabrica has showcased the technology at several conferences, including Hot Chips, AI Summit, AI Hardware & Edge AI Summit, and Gestalt IT AI Tech Field Day. Its next presentation is scheduled for SuperComputing 2024, November 17-22 in Atlanta.
The company has yet to announce the shipping dates for its products.