Minimizing Faulty Executions of Distributed Systems

Minimizing Faulty Executions of Distributed Systems – Scott et al.

Now that we’ve spent a couple of days looking at test case minimizing for sequential systems, we’re ready to tackle Colin Scott et al.’s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin’s blog post on Fuzzing Raft for Fun and Publication.

Even simple code can contain bugs (e.g., crashes due to unexpected input). But the developers of distributed systems face additional challenges, such as concurrency, asynchrony, and partial failure, which require them to consider all possible ways that non-determinism might manifest itself. Since the number of event orderings a distributed system may encounter grows exponentially with the number of events, bugs are commonplace.

An effective technique for uncovering bugs is fuzzing – injecting sequences of message deliveries and failures etc., though bugs still turn up in production too. In both cases, “the resulting executions can
contain a large number of events, most of which are not relevant to triggering the bug. For the developer responsible for fixing the bug, understanding which of those events caused their system to arrive at the unsafe state can be challenging and time consuming.”

What would help the developer of course, is reducing that sequence of events down to the minimum necessary to reproduce the bug before they begin to work on it.

In this paper we address the problem of automatically minimizing executions of distributed systems.We specifically focus on executions generated by fuzz testing, but we also illustrate how our techniques can be applied to production executions.

The authors implemented their ideas in a framework called DEMi (Distributed Execution Minimizer), which works with Akka-based programs, though the concepts can also be applied outside of that context.

Our tool, Distributed Execution Minimizer (DEMi), is implemented in ∼14,000 lines of Scala. We have applied DEMi to akka-raft, an open source Raft implementation, and Apache Spark, a widely used data analytics framework. Across 11 known and discovered bugs, DEMi reduces the number of events in faulty execution traces by as much as 97%, and provides results that are up to 16 times smaller than those produced by the state-of-the-art blackbox minimization technique. The results we find for these two very different distributed systems leave us optimistic that these techniques, along with adequate visibility into events (either through a framework such as Akka, or through custom monitoring), can be applied successfully to a wider range of systems.

DEMi assumes a system model of N independent single-threaded processes communicating through messages. Events may be external (injection of messages from outside the system, a process crash, or initial start of a process), or internal (delivery of a message from a network buffer to a process). From a starting configuration (known process states, and empty network buffers), we can apply a schedule of events, both internal and external.

A fuzzing tool generates a sequence of events that causes a failure. Minimization then proceeds in three stages:

  1. The first goal of minimization is to find the Minimal Causal Sequence (MCS) of external events drawn from this sequence such that removing any one of them no longer triggers the fault. “We start by minimizing external (input) events because they are the first level of abstraction that developers reason about. “
  2. Given this MCS, the next phase is to minimize the internal events by searching for smaller schedules containing the same external events as the MCS, and that still trigger the violation.
  3. Finally, DEMi tries to minimizes the data payloads of the external messages.

During phase one, DEMi initially uses Delta Debugging to drive minimization of the set of external events:

To find MCSes in a reasonable time, we split schedule exploration into two parts. We start by using delta debugging, an input minimization algorithm similar to binary search, to prune extraneous external events. Delta debugging works by picking subsequences of the external events, and checking whether each subsequence still triggers the safety violation. We assume the user gives us a maximum time budget, and we spread this time budget evenly across each subsequence’s exploration.

For each sub-sequence of external events to be tested by ddmin, it is necessary to explore the space of possible interleavings of internal and external events to see whether these trigger the failure. And there are a lot of those…

Once delta debugging has selected an external event sequence E’, we need to check whether E’ can result in the invariant violation. This requires that we enumerate all schedules that contain E’ as a subsequence, and test whether any of these schedules result in the invariant violation. Unfortunately, the number of possible schedulesis exponential in the number of events, so pruning this
schedule space is essential to allow finishing in a timely manner.

Many events in a schedule are commutative – if both e1 and e2 are buffered for independent delivery to independent processes p1 and p1, the order in which they get delivered makes no difference. Dynamic Partial Order Reduction (DPOR) is ‘a well-studied technique for pruning commutative schedules from the search space.’ Given two commutative schedules, it will explore only one.

Unfortunately, even when using DPOR, the task of enumerating all possible schedules containing E as a subsequence remains intractable.

Therefore this phase is time-boxed, and a number of heuristics (scheduling strategies) are used to guide DPOR to look in the most profitable places early on, in order to maximise the chances that invariant violations are discovered quickly. DPOR uses backtracking to explore multiple non-equivalent schedules from a given prefix of events. The scheduling strategies prioritize the order in which DPOR explores backtrack points.

The first strategy requires the developer to provide a ‘message-fingerprint function’ (implementation of equals) that can be used to test for content equivalence of two messages but ignores fields such as authentication cookies and sequence numbers that are considered to be relevant in determining the system’s control flow (what these fields are varies by system of course). This equivalence can be used to further reduce the number of schedules to be explored.

The second strategy is to maintain the ‘happens-before’ relationships in the external events. “We realize this observation by starting our exploration with a single, uniquely defined schedule for each external event subsequence: deliver only messages whose fingerprints match those in the original execution, in the exact same order that they appeared in the original execution.”

The third strategy is to choose the next schedule by favouring those whose message types match those in the original execution, in the same order.

Once the external events have been minimized, the same techniques are used to prune the internal events. Finally, if the application developer provides a function for separating the components of messages, DEMi iteratively removes message components until a 1-minimal set of message components is arrived at that still triggers the fault.

In summary, we first apply delta debugging (Min) to prune external events. To check each external event subsequence chosen by delta debugging, we use a stateful version of DPOR. We leverage data independence by applying a user-defined message fingerprint function to deprioritize certain message contents. We first try exploring a uniquely defined schedule that closely matches the original execution. To overcome inflation due to history-dependent message contents, we explore subsequent schedules by choosing backtrack points that match the types of messages from the original execution. We spend the remaining time budget attempting to minimize internal events, and wherever possible, we seek to shrink external message contents.

The DEMi implementation is based on Akka and AspectJ. It is currently limited to a single JVM (this is not a fundamental limit), and to applications built on top of the Akka framework. As we saw earlier, in evaluation DEMi was very effective at finding and then minimizing failure-inducing execution traces in akka-raft and Apache Spark.

Here’s a summary of the evaluation results:

Bug found or reproduced? #events: Total(External) Minimized #events time (secs)
raft-45 reproduced 1140(108) 37(8) 498
raft-46 reproduced 1730(108) 61(8) 250
raft-56 found 780(108) 23(8) 197
raft-58a found 2850(108) 226(31) 43345
raft-58b found 1500(208) 40(9) 42
raft-42 reproduced 1710(208) 180(21) 10558
raft-66 found 400(68) 77(15) 334
spark-2294 reproduced 1000(30) 40(3) 97
spark-2294-caching reproduced 700(3) 51(3) 270
spark-3150 reproduced 600(20) 14(3) 26
spark-9256 found 600(20) 16(3) 15

The DEMi artifact itself could by adapted to other RPC libraries besides Akka, using interposition frame- works such as AspectJ, or by manually adding hooks to the library. The core of DEMi should not need to be changed significantly.