On Tuesday at the ACM SIGCOMM 2015 conference in London, Google shared details, and I do mean details, on the way their internal datacentre networks are (or at least were) constructed. In a paper titled Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network [PDF] the authors describe how Google’s approach to large-scale datacentre networking has evolved from 2004 to at least 2012.
Networking folk and architects concerned with enterprise data networking could do worse than to read this paper.
The paper talks about failure as well as success. The initial attempt at building a Clos network, Firehose 1.0, didn’t work and never made it to production. The paper talks about why, and explains what was learned in order to build the next version, Firehose 1.1. This is knowledge sharing at its best. Too many people think that successful deployments or products spring fully formed from the mind of a lone genius, when reality is vastly different.
The paper also covers practical and operational considerations. Alternative network topologies are mentioned (HyperX, DCell, BCube, Jellyfish; see the paper for citations), but the authors found their cabling, management, and routing complexity outweighed the benefits. I also like that the Neighbor Discovery protocol treats humans mis-cabling switches (either at install or during maintenance) as a design consideration. Mistakes happen, so we need to take that into account. I still see far too many technical designs that ignore human fallibility and maintenance, and focus too much on the initial build.
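To make the idea concrete, here is a rough sketch of that kind of miscabling check. The names and data structures are mine, not Google's actual Neighbor Discovery protocol: each switch advertises its identity on every port, and the receiver compares what it hears against a cabling blueprint.

```python
# Hypothetical sketch of a neighbour-discovery miscabling check (not Google's
# actual protocol). The blueprint maps (local_switch, local_port) to the
# expected (peer_switch, peer_port) at the far end of the cable.

BLUEPRINT = {
    ("edge-1", 1): ("spine-1", 7),
    ("edge-1", 2): ("spine-2", 7),
}

def check_link(local_switch, local_port, heard_switch, heard_port):
    """Return None if the link matches the blueprint, otherwise a
    description of the miscabling for an operator (or automation)."""
    expected = BLUEPRINT.get((local_switch, local_port))
    if expected is None:
        return f"{local_switch}:{local_port} is not in the blueprint"
    if expected != (heard_switch, heard_port):
        return (f"{local_switch}:{local_port} expected "
                f"{expected[0]}:{expected[1]}, "
                f"heard {heard_switch}:{heard_port}")
    return None  # correctly cabled

# A correctly cabled link passes; a swapped cable is flagged:
print(check_link("edge-1", 1, "spine-1", 7))  # None
print(check_link("edge-1", 2, "spine-1", 7))  # mismatch description
```

The point is that the check runs continuously in the control plane, so a cable swapped during maintenance is caught immediately rather than discovered later as a mysterious traffic anomaly.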
A Note of Caution
While there is much to learn here, don’t go blindly aping these techniques. The paper is very clear about Google’s special circumstances, which make many of these choices sensible where they otherwise wouldn’t be. For example, Google has a fairly homogeneous environment with relatively few protocols to support. That means they can make choices that you and I can’t in an enterprise datacentre.
Google is operating at a scale that 99.8765% (made up number) of organisations are not. You are not Google, and I am utterly sick of people calling a 40-ish node cluster “web-scale”. This paper is talking about a datacentre-wide cluster with thousands of nodes in the network alone and, as of 2012, a bisection bandwidth of 1.3 petabits per second. Your network is nowhere near that.
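To put that figure in perspective, a quick back-of-envelope calculation (my arithmetic and my assumed NIC speed, not figures from the paper):

```python
# Back-of-envelope: what 1.3 Pb/s of bisection bandwidth implies.
bisection_bps = 1.3e15   # 1.3 petabits per second, as of 2012
server_nic_bps = 10e9    # assume a 10 Gb/s server NIC (common circa 2012)

# How many servers could talk across the bisection at full NIC line
# rate, all at once:
servers_at_line_rate = bisection_bps / server_nic_bps
print(int(servers_at_line_rate))  # 130000
```

That is on the order of a hundred thousand servers communicating flat out simultaneously. If that number doesn’t describe your environment, neither do many of Google’s design trade-offs.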
Instead, look at the bits about maintainability: being able to upgrade the network as technology changes (because it marches ever onwards), and being able to deploy a partially populated configuration and add components as needed. Pay particular attention, at the end of section 3.3, to the reasoning behind changing how a partially populated network is first constructed and then filled in.
There’s loads more in there. I really do urge you to read it.