Growing a protocol

Growing a protocol Ramasubramanian et al., HotCloud’17

I’ve been really enjoying working my way through a selection of the HotCloud papers – they’re relatively short, thought-provoking, and designed to promote discussion (each paper has a set of discussion questions at the very end – great if you’re looking at them in a group of some kind). Today’s paper is certainly no exception. The setting is a collaboration between Elastic and Peter Alvaro’s research group at the University of California. The question under investigation is how best to implement and evolve distributed system protocols without introducing unintended bugs.

In this paper, we argue that tool support for implementing and evolving fault-tolerant distributed systems needs to be rethought. We advocate exploration of the (sparse) middle ground between existing testing techniques practically inadequate for addressing fault tolerance concerns and traditional verification techniques ill-suited to the continual evolution of real-world evaluations.

The status quo

If you want to solve a distributed systems problem these days, you can generally start by picking up a proven protocol for the issue at hand. For example, Paxos, Raft, Primary/Backup, Chain Replication, reliable broadcast, and so on. So we might naively expect that…

…modern system designers can merely take mechanisms “off the shelf” and enjoy the guarantees of hardened subsystems while constructing otherwise novel applications. Any practitioner, however, will quickly identify this as a fallacy. Even initial protocol implementations tend to differ significantly from their specification.

(A sentiment echoed in the wonderful ‘Use of Formal Methods at Amazon Web Services’ paper).

But it’s really the ‘little’ changes over the lifetime of the system, the protocol optimisations over time, that can have outsize unintended consequences. We end up with a system that is, for example, “essentially Primary/Backup.” And while it may have the essence of Primary/Backup, does it retain the guarantees? It’s very hard to know.

Such a circumstance places implementors in the bad position of deriving false confidence from assertions that their implementation is “essentially Primary/Backup”.

What we normally rely on to ensure that changes to our system don’t introduce new bugs is regression testing: “regression testing techniques ensure future optimisations do not re-introduce bugs previously encountered in early stages of system development.” In regression testing we check that inputs known to result in bad behaviour in the past no longer do.

In the ideal case, the team would be using some kind of formal verification, and specifications would be co-evolved with the code and invariants proved afresh. In practice, this rarely happens. The correctness guarantees determined at initial verification time erode as protocols evolve.

The authors point out that regression testing alone is not sufficient to assert fault tolerance properties. Inputs that trigger bugs in one version of a protocol are not guaranteed to trigger the same bug in a different version. In distributed systems, a large class of bugs are tied to the execution schedule, not (just) to the inputs.

As a result, regression testing, as we currently employ it, is fundamentally too weak to prevent fault tolerance regression bugs. Root cause analysis is similarly inadequate, because a set of faults triggering bugs in later versions may fail to do so in an earlier version.

This difference between input-based and schedule-based approaches necessitates the use of a different approach for fault tolerance testing and verification.

We are left wanting something that works like verification, but feels like testing.

Elastic meets LDFI

The team at Elastic faced a problem like the one we just described. They had a data replication protocol based on Primary/Backup, and were looking to introduce a faster variant that could synchronise individual operations rather than relying on file copying. The Elastic team then made a couple of really smart moves:

  • “Since this was a new algorithm, Elastic was looking for ways to formally verify it.”
  • “Elastic engaged our research team because they wanted a technique that strikes a balance between formal verification and testing – in particular the strong correctness guarantees of the former and the agility of the latter.”

The project used Lineage Driven Fault Injection (LDFI), which builds a model based on good system execution and explores only fault scenarios capable of forcing the system into a bad state. (See the earlier write-up on The Morning Paper for more details).

We implemented a sequence of versions of the replication protocol and used LDFI to incrementally verify them as part of continuous integration.

LDFI helped to find existing bugs relating to the implementation of the new protocol, and also demonstrated the safety (or otherwise!) of a couple of subsequent optimisations.

There are many instances in the software development cycle for a bug to be introduced, the first of which is when a protocol specification is converted to an implementation. During our case study, we found a bug which manifested precisely from such a transaction scenario.

(The bug related to a primary failover with partially replicated writes in progress, see section 4.1 for details). LDFI generated the scenario automatically, rather than requiring testers to have the foresight to manually derive such test cases. Originally discovered in the context of concurrent writes, subsequent analysis showed that with the right inputs a variation can occur with just a single write.

… the two bugs are similar, but do not manifest from the same fault scenarios. This reinforces the claim from our motivating example that techniques such as root cause analysis as they are generally deployed would not be effective in reasoning about the fault tolerance properties of distributed systems.

A seemingly minor optimisation was explored to avoid extra processing: requests have monotonically increasing sequence numbers, and if a sequence number associated with a write request had been seen before, instead of processing it (again?), the payload should be dropped and the request simply acknowledged. It turns out there’s a problem with this scheme that can occur during primary failover:

Fortunately, LDFI quickly and automatically discovers such a scenario by using the initial successful execution to test fault scenarios that may cause failures… Optimization carries the risk of introducing entirely new bugs capable of breaking the end-to-end properties of the system, which is best handled by verification-based tools.

A second, substantially more complicated optimisation turned out not to produce any counterexamples when run against LDFI. Seemingly simple optimisations may break guarantees, while more complex ones may not – simplicity is not a guarantee of correctness.

Improving distributed software quality

Our experience at Elastic suggests approaches like LDFI are a step towards improving the state of the art in distributed software quality.

Where can we go from here? The team set out a number of future directions:

  • Integration with semantic-aware software model checkers: “An ideal tool solution would combine the best features of LDFI (which automatically builds models of domain knowledge, but ignore concurrency and asynchrony) with state-of-the-art combined approaches such as SAMC, since we know from Fischer et al. (FLP) that some of the most fundamental difficulties of distributed systems exist at the intersection of partial failure and asynchrony!”
  • Exploiting the CALM theorem: “we are developing a prototype system that combines the Lineage-Driven approach (utilizing explanations of what went right to reason about what could go wrong) and CALM analysis (using static analysis to prove commutativity of message processing logic) to simultaneously prune the space of faults and re-orderings.”
  • Using probabilistic system models to embrace rather than mask the inherent uncertainties in distributed executions.
  • Improving the debugging experience using provenance to reason about distributed executions (“a young research area capable of radical growth”)

Our experience using LDFI at Elastic suggests the provision of high-level explanations of how a system achieves (or fails to achieve) good outcomes are a good starting point for taming the complexity of distributed debugging.

There is wide scope for improving tools for implementing, evolving, and debugging distributed software:

The state of the art is so desperately poor that it should be easy for the research community to make an impact!


The authors of the paper would love your feedback. Do you agree that classical software quality techniques such as regression testing and root cause analysis do not extend to distributed systems in their current form? Can LDFI serve as a bridge between verification and testing, as explored in this real-world application? What other tools should the team be building, and what impact could the LDFI approach have on them?