Understanding lifecycle management complexity of datacenter topologies

Understanding lifecycle management complexity of datacenter topologies Zhang et al., NSDI’19

There has been plenty of interesting research on network topologies for datacenters, with Clos-like tree topologies and expander-based graph topologies both shown to scale using widely deployed hardware. This research tends to focus on performance properties such as throughput and latency, together with resilience to failures. Important as these are, note that they’re also what’s right in front of you as a designer, and relatively easy to measure. The great thing about today’s paper is that the authors look beneath the surface to consider the less visible but still very important “lifecycle management” implications of topology design. In networking, this translates into how easy it is to physically deploy the network, and how easy it is to subsequently expand it. They find a way to quantify the associated lifecycle management costs, and then use this to help drive the design of a new class of topologies, called FatClique.

… we show that existing topology classes have low lifecycle management complexity by some measures, but not by others. Motivated by this, we design a new class of topologies, FatClique, that, while being performance-equivalent to existing topologies, is comparable to, or better than them by all our lifecycle management complexity metrics.

Now, there’s probably only a relatively small subset of The Morning Paper readers involved in designing and deploying datacenter network topologies. So my challenge to you as you read through this paper is to think about where the hidden complexity and costs are in your own systems. Would you do things differently if these were made more visible? It would be great to see more emphasis, for example, on things like developer experience (DX) and operational simplicity – in my experience these kinds of attributes can have an outsize impact on the long-term success of a system.

Anyway, let’s get back to cables and switches…

Physically deploying network topologies

When it comes to laying out a network topology for real in a datacenter, you need to think about packaging, placement, and bundling. Packaging is how you group things together, e.g. the arrangement of switches in racks, and placement concerns how these racks are physically placed on the datacenter floor. Placement in turn determines the kinds of cabling you need and, for optical cables, the power of the transceivers. Within a rack we might package several connected switches into a single chassis using a backplane. At the other end of the scale, blocks are larger units of co-placement and packaging that combine several racks. With all those connections, it makes things a lot easier to group multiple fibres connecting the same two endpoints (racks) into bundles, which contain a fixed number of identical-length fibres.

Manufacturing bundles is simpler than manufacturing individual fibres, and handling such bundles significantly simplifies operational complexity.

Patch panels make bundling easier by providing a convenient aggregation point to create and route bundles. Bundles and fibres are physically routed through the datacenter on cable trays. The trays themselves have capacity constraints of course.

Here’s an example of a logical Clos topology and its physical instantiation:

The authors identify three key metrics that together capture much of the deployment complexity in a topology:

  1. The number of switches. More switches equals more packaging complexity.
  2. The number of patch panels, which is a function of topological structure and a good proxy for wiring complexity.
  3. The number of bundle types. This metric captures the other important part of wiring complexity – how many distinct bundle types are needed. A bundle type is represented by its capacity (how many fibres it contains) and its length.

These complexity measures are complete. The number of cable trays, the design of the chassis, and the number of racks can be derived from the number of switches (and the number of servers and the datacenter floor dimensions, which are inputs to the topology design). The number of cables and transceivers can be derived from the number of patch panels.
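
To make the bundle type metric concrete, here’s a small sketch (my own, not from the paper) of how the three deployment metrics might be counted from a description of the physical layout. A bundle groups the fibres connecting the same pair of racks, and a bundle type is a distinct (capacity, length) pair; the Fibre record and the example numbers below are purely hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Fibre:
    rack_a: str     # rack at one end
    rack_b: str     # rack at the other end
    length_m: int   # physical cable length in metres

def deployment_complexity(num_switches, num_patch_panels, fibres):
    """Return the three metrics: switches, patch panels, and bundle types."""
    # Group fibres that connect the same pair of racks (at the same length)
    # into a single bundle, counting how many fibres each bundle contains.
    bundles = Counter()
    for f in fibres:
        endpoints = tuple(sorted((f.rack_a, f.rack_b)))
        bundles[(endpoints, f.length_m)] += 1
    # A bundle type is identified by its capacity (fibre count) and length.
    bundle_types = {(count, length) for (_, length), count in bundles.items()}
    return num_switches, num_patch_panels, len(bundle_types)

# Two rack pairs joined by 4 x 10m fibres each, and one pair by 8 x 25m
# fibres: two distinct bundle types in total.
fibres = ([Fibre("r1", "r2", 10)] * 4
          + [Fibre("r3", "r4", 10)] * 4
          + [Fibre("r5", "r6", 25)] * 8)
print(deployment_complexity(num_switches=12, num_patch_panels=3, fibres=fibres))
# -> (12, 3, 2)
```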

Here’s how representative Clos and Expander (Jellyfish) topologies for the same number of servers stack up against these metrics:

The expander graph topology shows much higher deployment complexity in terms of the number of bundle types. Clos also exposes far fewer ports outside of a rack (it has better port hiding).

Expanding existing networks

When you want to expand an existing network, first you need to buy all the new gear and lay it out on the datacenter floor, and then you can begin a re-wiring process. This is all going on with live traffic flowing, so expansion is carried out in steps. During each step, the capacity of the topology is guaranteed to be at least some percentage of the existing topology’s capacity. This percentage is sometimes known as the expansion SLO.

During a step, existing links to be re-wired are drained, then human operators physically rewire links at patch panels. The new links are tested and then undrained (strange word!), i.e., brought into service.
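
As a rough illustration (my own framing, not the paper’s expansion algorithm), you can think of an expansion as splitting the links to be moved into batches small enough to respect the SLO, with each batch then going through the drain → rewire → test → undrain sequence. The budget parameter and the link names below are hypothetical.

```python
def plan_expansion(links_to_rewire, budget):
    """Split the links into per-step batches of at most `budget` links,
    where `budget` is how many links the SLO allows out of service at once."""
    if budget < 1:
        raise ValueError("the expansion SLO leaves no headroom to drain links")
    return [links_to_rewire[i:i + budget]
            for i in range(0, len(links_to_rewire), budget)]

def run_step(step, batch):
    # Each step applies the same sequence to its batch of links:
    #   drain   - shift live traffic off the links
    #   rewire  - human operators move the fibres at the patch panels
    #   test    - verify the new wiring before it carries traffic
    #   undrain - bring the rewired links back into service
    for phase in ("drain", "rewire", "test", "undrain"):
        print(f"step {step}: {phase} {len(batch)} links")

links = [f"link-{i}" for i in range(100)]           # hypothetical links to move
for step, batch in enumerate(plan_expansion(links, budget=20), start=1):
    run_step(step, batch)                            # 5 steps of 20 links each
```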

For example, here’s a logical expansion (top row) and its physical realisation:

The most time-consuming part of all this is the physical rewiring. The two metrics that capture expansion complexity are therefore:

  1. The number of expansion steps, and
  2. The average number of rewired links in a patch panel rack.

Here’s how Clos and Expander stack up on those metrics for the same networks we saw earlier:

This time the victory goes to Expander (Jellyfish). Jellyfish has a much higher north-to-south capacity ratio. Northbound links exit a block, and southbound links are to/from servers within a block. “Fat edges” have more northbound than southbound links, and the extra capacity means you can accomplish more movement in each step.

Clos topologies re-wire more links in each patch panel during an expansion step and require many steps because they have a low north-south capacity ratio.
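
Here’s a toy model (mine, for intuition only, and much cruder than the paper’s analysis) of why the ratio matters: if a block delivers min(live northbound links, southbound links) of capacity, and the SLO requires some fraction of the original capacity at all times, then spare northbound links can be drained almost for free and the expansion completes in far fewer steps.

```python
import math

def drain_budget(north, south, slo=0.95):
    # Delivered capacity is modelled as min(live northbound, southbound)
    # links; it must stay at or above slo * its original value.
    original = min(north, south)
    return north - math.ceil(slo * original)

def steps_needed(north, south, links_to_rewire, slo=0.95):
    return math.ceil(links_to_rewire / max(drain_budget(north, south, slo), 1))

print(steps_needed(north=1024, south=512, links_to_rewire=256))  # fat edge: 1 step
print(steps_needed(north=512,  south=512, links_to_rewire=256))  # 1:1 ratio: 11 steps
```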

Enter the FatClique

Inspired by these insights, the authors define a new class of topologies called FatClique, which combines the hierarchical structure of Clos with the edge expansion capabilities of expander graphs.

There are three levels in the hierarchy. A clique of switches forms a sub-block. Cliques of sub-blocks come together to form blocks. And cliques of blocks come together to form the full FatClique topology.
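
To make the hierarchy concrete, here’s a highly simplified sketch (my own, ignoring the paper’s sizing and wiring rules) that generates links for the three levels, treating each level as a full clique with one representative link per pair:

```python
from itertools import combinations

def fatclique_links(num_blocks, subblocks_per_block, switches_per_subblock):
    """Generate (switch, switch) links; a switch id is (block, sub-block, index)."""
    links = []
    for b in range(num_blocks):
        for s in range(subblocks_per_block):
            # Level 1: the switches of a sub-block form a clique.
            switches = [(b, s, w) for w in range(switches_per_subblock)]
            links += list(combinations(switches, 2))
        # Level 2: the sub-blocks of a block form a clique
        # (one representative link per pair in this sketch).
        for s1, s2 in combinations(range(subblocks_per_block), 2):
            links.append(((b, s1, 0), (b, s2, 0)))
    # Level 3: the blocks themselves form a clique (again one link per pair).
    for b1, b2 in combinations(range(num_blocks), 2):
        links.append(((b1, 0, 0), (b2, 0, 0)))
    return links

links = fatclique_links(num_blocks=3, subblocks_per_block=2, switches_per_subblock=4)
print(len(links))   # 36 intra-sub-block + 3 intra-block + 3 inter-block = 42
```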

Four key design variables determine the particular instantiation of a FatClique topology: the number of ports in a switch that connect to servers; the number of ports in a switch that connect to other sub-blocks in a block; the number of switches in a sub-block; and the number of sub-blocks in a block. A synthesis algorithm takes a set of six input constraints (see §5.1) and determines the values for these four design variables.
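
As a back-of-the-envelope illustration (assumptions entirely mine; the paper’s synthesis algorithm is considerably more involved), simple counting shows how these design variables determine the size of the resulting topology. Note that the number of blocks is not one of the four variables, so it’s fixed here purely for the example:

```python
def fatclique_size(server_ports_per_switch,     # design variable 1
                   subblock_ports_per_switch,   # design variable 2
                   switches_per_subblock,       # design variable 3
                   subblocks_per_block,         # design variable 4
                   num_blocks):                 # not a design variable; fixed for illustration
    switches = num_blocks * subblocks_per_block * switches_per_subblock
    servers = switches * server_ports_per_switch
    # Uplink capacity from one sub-block into the rest of its block.
    subblock_uplinks = switches_per_subblock * subblock_ports_per_switch
    return switches, servers, subblock_uplinks

print(fatclique_size(8, 4, 16, 8, num_blocks=32))   # -> (4096, 32768, 64)
```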

There is plenty more detail in section 5 of the paper which I don’t have the space to do justice to here.

FatClique vs Clos vs Expander

The evaluation compares FatClique to Clos, Xpander, and Jellyfish at different network topology sizes, as shown in the table below.

Here’s how they stack up against the complexity metrics and the associated cabling costs. The paper’s charts cover: the number of switches, the number of patch panels, the number of bundle types, cabling cost, the number of expansion steps, and the average number of rewired links.

We find that FatClique is the best at most scales by all our complexity metrics. (The one exception is that at small and medium scales, Clos has slightly fewer patch panels). It uses 50% fewer switches and 33% fewer patch panels than Clos at large scale, and has a 23% lower cabling cost (an estimate we were able to derive from published cable prices). Finally, FatClique can permit fast expansion while degrading network capacity by small amounts (2.5-10%): at these levels, Clos can take 5x longer to expand the topology, and each step of Clos expansion can take longer than FatClique because the number of links to be rewired at each step per patch panel can be 30-50% higher.

The one thing I couldn’t find in the evaluation is any data to back up the opening claim that FatClique achieves all of this “while being performance-equivalent to existing topologies.”

The last word

As the management complexity of networks increases, the importance of designing for manageability will increase in the coming years. Our paper is only a first step in this direction…