Google this week unveiled technical details of its method for efficiently transferring upwards of 1.2 exabytes of data (more than a billion gigabytes) every day.
Details of a data copy service known as Effingo were contained in a technical paper that was presented during a delegate session at Sigcomm 2024, the annual conference of the ACM Special Interest Group on Data Communication, which wrapped up today in Sydney, Australia.
Its authors note in the paper that “WAN bandwidth is never too broad — and the speed of light stubbornly constant. These two fundamental constraints force globally distributed systems to carefully replicate data close to where they are processed or served. A large organization owning such systems adds dimensions of complexity with ever-changing network topologies, strict requirements on failure domains, multiple competing transfers, and layers of software and hardware with multiple kinds of quotas.”
According to the seven Google scientists who authored the report, on a typical day Effingo transfers over an exabyte of data among dozens of clusters spread over multiple continents, and serves more than 10,000 users.
They called managed data transfer “an enabler, an unsung hero of large-scale, globally distributed systems” because it “reduces the network latency from across-globe hundreds to in-continent dozens of milliseconds.” This, they said, enables “the illusion of interactive work” for users.
However, they wrote, the goal of most data transfer management systems “is to transfer when it is optimal to do so — in contrast to a last-minute transfer at the moment data needs to be consumed. Such systems provide a standard interface to the resources, an interface that mediates between users’ needs, budgets and system goals.”
Effingo is different in that it “has requirements and features uncommon in reported large-scale data transfer systems.” Rather than optimizing for transfer time, it optimizes for smooth bandwidth usage while controlling network costs by, for example, optimizing the copy tree to minimize the use of expensive links such as subsea cables.
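The article does not spell out how that copy tree is built, but the underlying idea of preferring cheap in-continent hops over pricey subsea crossings can be illustrated with a minimum-cost copy tree. The sketch below is purely illustrative: the cluster names, link costs, and Prim-style construction are assumptions for the sake of the example, not Effingo’s actual logic.

```python
import heapq

# Hypothetical per-unit link costs; subsea links are priced much higher
# than in-continent links. Names and numbers are illustrative only.
LINK_COST = {
    ("us-east", "us-west"): 1,
    ("us-west", "asia-east"): 20,    # subsea cable
    ("us-east", "europe-west"): 15,  # subsea cable
    ("europe-west", "asia-east"): 18,
    ("asia-east", "asia-south"): 2,
}

def cost(a, b):
    # Treat links as bidirectional; unknown pairs are effectively unusable.
    return LINK_COST.get((a, b), LINK_COST.get((b, a), float("inf")))

def copy_tree(source, destinations):
    """Grow a copy tree from `source` that reaches every destination,
    always attaching the next cluster over the cheapest available link
    (Prim-style). Clusters already in the tree fan the data out further,
    so an expensive link is crossed at most once."""
    clusters = {source, *destinations}
    in_tree = {source}
    edges = []  # (cost, parent, child)
    frontier = [(cost(source, d), source, d) for d in clusters - in_tree]
    heapq.heapify(frontier)
    while in_tree != clusters and frontier:
        c, parent, child = heapq.heappop(frontier)
        if child in in_tree:
            continue
        in_tree.add(child)
        edges.append((c, parent, child))
        for nxt in clusters - in_tree:
            heapq.heappush(frontier, (cost(child, nxt), child, nxt))
    return edges

if __name__ == "__main__":
    tree = copy_tree("us-east",
                     ["us-west", "asia-east", "asia-south", "europe-west"])
    for c, parent, child in tree:
        print(f"{parent} -> {child} (cost {c})")
```

Run against these toy costs, each subsea link appears in the tree at most once, and clusters that have already received the data relay it onward over cheaper regional links.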
Its other design requirements included client isolation, which prevents one client’s transfers from affecting those of other clients; isolated failure domains, so that a copy between two clusters never depends on a third cluster; data residency constraints that prohibit copies from being made to any location not explicitly specified by the client; and data integrity checks to prevent data loss or corruption. The system must also continue to operate even when its dependencies are slow or temporarily unavailable.
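The article presents these as requirements rather than an interface, but the residency and integrity constraints map naturally onto a request contract. The following sketch is hypothetical (the CopyRequest fields and helper functions are invented for illustration, not Effingo’s API): a request names every permitted destination explicitly and carries a digest that can be verified after the copy lands.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class CopyRequest:
    """Hypothetical shape of a transfer request honoring the constraints
    described above; not Effingo's actual interface."""
    client_id: str                   # used for per-client isolation and quotas
    source_cluster: str
    destination_clusters: list[str]  # every replica location is named explicitly
    sha256: str                      # expected digest for integrity checks

def validate_residency(request: CopyRequest, allowed: set[str]) -> None:
    """Reject any destination the client did not explicitly allow,
    so data never lands outside its residency boundary."""
    disallowed = set(request.destination_clusters) - allowed
    if disallowed:
        raise ValueError(f"residency violation: {sorted(disallowed)}")

def verify_integrity(request: CopyRequest, copied_bytes: bytes) -> None:
    """Compare the digest of the bytes that arrived with the digest the
    client declared, catching loss or corruption in transit."""
    actual = hashlib.sha256(copied_bytes).hexdigest()
    if actual != request.sha256:
        raise ValueError("integrity check failed: digest mismatch")
```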
The paper provides details of how Google achieved each of these goals, with a section on lessons learned chronicling Effingo’s evolution. It emphasizes, however, that Effingo is still a work in progress and is continuously evolving. The authors said that Google plans to improve CPU usage during cross-data center transfers, improve integration with resource management systems, and enhance the control loop to let it scale out transfers faster.
Nabeel Sherif, principal advisory director at Info-Tech Research Group, sees great value in the service. He said today, “While there might be considerations around cost and sustainability for such a resource- and network-intensive use case, the ability for organizations to greatly increase the scale and distance of their georedundancy means being able to achieve better user experiences as well as removing some of the limitations of making data accessible to applications that don’t sit very close by.”
This, he said, “can be a game changer in both the areas of business continuity, global reach for web applications, and many other types of collaborations.”