The Network is Reliable | the morning paper

The Network is Reliable – Bailis and Kingsbury 2014

This must be the easiest paper summary to write of the series so far. The network is reliable? Oh no it isn’t…

OK, here’s a little more detail 🙂

Network reliability matters because an unreliable network prevents us from having reliable communication, and that in turn makes building distributed systems really hard. (Fallacy #1 in Peter Deutsch’s ‘Eight fallacies of distributed computing’ is ‘The network is reliable’.)

In fact, if we look at this list for a moment, we can see that the top three fallacies all correspond to interesting failure modes in networks, as well as being things you have to take into account on the happy path. Sudden large latency spikes, for example, can be very disruptive to fault detectors, and noisy network neighbours might consume lots of your bandwidth.

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

No. 5 is also something we’ve had to learn to accommodate to a whole new degree with cloud deployments. Topology changing all the time is the norm.
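To make the earlier point about latency spikes and fault detectors a little more concrete, here is a minimal sketch of a fixed-timeout heartbeat detector. Everything in it (the function name `suspected_gaps`, the 1000 ms heartbeat interval, the 1500 ms timeout, the 2500 ms spike) is invented for illustration and does not come from the paper. The peer never stops sending heartbeats, yet a single delayed message is enough for the detector to suspect it.

```python
def suspected_gaps(arrival_times_ms, timeout_ms):
    """Return (last_seen, next_arrival) pairs where the silence between two
    heartbeats exceeded the timeout, i.e. where a fixed-timeout detector
    would have suspected the peer of having failed."""
    suspicions = []
    last_seen = arrival_times_ms[0]
    for arrival in arrival_times_ms[1:]:
        if arrival - last_seen > timeout_ms:
            suspicions.append((last_seen, arrival))
        last_seen = arrival
    return suspicions


# Hypothetical scenario: a healthy peer sends a heartbeat every 1000 ms.
send_times_ms = [0, 1000, 2000, 3000, 4000, 5000]

# Normal one-way latency is 10 ms, but the fourth heartbeat hits a 2500 ms spike.
latencies_ms = [10, 10, 10, 2500, 10, 10]
arrivals_ms = sorted(s + l for s, l in zip(send_times_ms, latencies_ms))

# With a 1500 ms timeout the detector suspects a peer that was alive throughout.
suspicions = suspected_gaps(arrivals_ms, timeout_ms=1500)

# The same detector stays quiet when latency is uniformly low.
calm_arrivals_ms = [s + 10 for s in send_times_ms]
calm_suspicions = suspected_gaps(calm_arrivals_ms, timeout_ms=1500)
```

Raising the timeout would hide this particular spike, at the price of reacting more slowly to real failures, which is exactly the kind of trade-off the latency fallacy glosses over.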

If we can assume that ‘the network is reliable enough,’ then it might make sense as an engineering trade-off to become unavailable in the event of a partition. I’ve heard the argument made that ‘partitions only happen in the cloud’ and that we never really see them in practice in our own data centers. Conscious trade-off or not, the work of one of the authors of this paper, Kyle Kingsbury, on the wonderful Jepsen Reports shows that many real-world systems struggle with partitions.
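One common way to ‘become unavailable in the event of a partition’ is to refuse work unless a strict majority of the cluster is reachable. The sketch below is only an illustration of that trade-off under my own assumptions: the function names `has_quorum` and `handle_write` are hypothetical, and the paper does not prescribe this (or any) particular mechanism.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if this side of a partition can see a strict majority of the
    cluster (reachable_nodes counts the local node itself)."""
    return reachable_nodes > cluster_size // 2


def handle_write(reachable_nodes, cluster_size):
    """Accept a write only on the majority side; otherwise give up
    availability rather than risk two sides diverging."""
    if not has_quorum(reachable_nodes, cluster_size):
        return "unavailable"
    return "accepted"


# A hypothetical 5-node cluster split 3/2: only the larger side keeps serving.
majority_side = handle_write(reachable_nodes=3, cluster_size=5)
minority_side = handle_write(reachable_nodes=2, cluster_size=5)

# A 4-node cluster split 2/2: neither side has a majority, so both stop.
even_split = handle_write(reachable_nodes=2, cluster_size=4)
```

If partitions really are rare, the unavailable side costs you very little. If they are not, this is downtime you have signed up for, which is why the actual failure rate matters.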

The degree of reliability in deployment environments is critical in robust systems design and directly determines the kinds of operations that systems can reliably perform without waiting. Unfortunately, the degree to which networks are actually reliable in the real world is the subject of considerable and evolving debate. Some people have claimed that networks are reliable (or that partitions are rare enough in practice) and that we are too concerned with designing for theoretical failure modes. Conversely, others attest that partitions do occur in their deployments…

We have some pretty good statistics on disk, host, and rack failure rates, but not so much on network failures. Yet network failures can be much more disruptive than a disk failure for example.

As a result, much of what we believe about the failure modes of real-world distributed systems is founded on guesswork and rumor. Sysadmins and developers will swap stories over beer, but detailed, public postmortems and comprehensive surveys of network availability are few and far between. In this article, we’d like to informally bring a few of these stories (which, in most cases, are unabashedly anecdotal) together…

There follows a long collection of stories of every kind of network failure you can imagine. It’s the cumulative effect that gets you as you read through: you start off on page 2 thinking ‘yeah ok, isolated scenario(s)’, but the stories keep coming and coming and coming. By the time you get to page 12, you’ve probably come to the conclusion that Murphy must have been working as a network engineer at the time he formulated his famous law!

Split-brains, partitions, device failures, link failures, high rates of packet loss, maintenance and admin issues, and more are all in here – resulting in comedies of errors in the systems built on top of them that would be fit for a Christmas pantomime if the consequences weren’t so severe!

This reads like a skit from a comedy show:

This 90-second network partition caused file servers using Pacemaker and DRBD (Distributed Replicated Block Device) for HA (high availability) failover to declare each other dead, and to issue STONITH (shoot the other node in the head) messages to one another. The network partition delayed delivery of those messages, causing some file-server pairs to believe they were both active. When the network recovered, both nodes shot each other at the same time. With both nodes dead, files belonging to the pair were unavailable.
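The sequence of events in that story can be replayed in a few lines. This is a toy model under my own assumptions, not Pacemaker or DRBD code: the `Node` class, the `simulate_partition` function, and the node names are all made up, and the ‘network’ is just a list of delayed messages.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True
    active: bool = False


def simulate_partition(primary, secondary):
    """Replay the story: both nodes declare the peer dead during the
    partition, and the delayed STONITH messages land once it heals.
    Returns True if both nodes considered themselves active mid-partition."""
    delayed_stonith = []  # (sender, target) pairs held up by the partition

    # During the partition each node stops hearing from its peer, declares it
    # dead, sends a STONITH message, and takes over as the active node.
    for node, peer in ((primary, secondary), (secondary, primary)):
        delayed_stonith.append((node, peer))
        node.active = True
    both_active = primary.active and secondary.active

    # The network recovers and the queued messages are delivered together.
    # They were already in flight, so each one takes effect even though its
    # sender is being shot at the same moment.
    for _sender, target in delayed_stonith:
        target.alive = False
        target.active = False

    return both_active


fs_a = Node("fs-a", active=True)  # hypothetical file-server pair
fs_b = Node("fs-b")
split_brain = simulate_partition(fs_a, fs_b)
survivors = [n.name for n in (fs_a, fs_b) if n.alive]
```

In this toy run `split_brain` is true while the partition lasts and `survivors` is empty afterwards, mirroring the outcome described above: a pair that was briefly double-active and then entirely unavailable.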

You really do need to read through the paper to get the overall impression; a summary cannot do it justice.

As a consequence of all this:

Split-brain is not an academic concern: it happens to all kinds of systems—sometimes for days on end. Partitions deserve serious consideration… It’s important to consider this risk before a partition occurs, because it’s much easier to make decisions about partition behavior on a whiteboard than to redesign, reengineer, and upgrade a complex system in a production environment—especially when it’s throwing errors at your users. For some applications, failure is an option—but you should characterize and explicitly account for it as a part of your design. Finally, given the additional latency and coordination benefits of partition-aware designs, you might just find that accounting for these partitions delivers benefits in the average case as well.

The authors also acknowledge that there might be reliable networks out there:

On the other hand, some networks really are reliable. Engineers at major financial firms have anecdotally reported that despite putting serious effort into designing systems that gracefully tolerate partitions, their networks rarely, if ever, exhibit partition behavior. Cautious engineering and aggressive network advances (along with lots of money) can prevent outages. Moreover, in this article, we have presented failure scenarios; we acknowledge it’s much harder to demonstrate that network failures have not occurred.

But if I were thinking about future architectures and cloud deployments, I wouldn’t want to rely on it ;).