SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems

SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems – Leesatapornwongsa et al. 2014

This is the second of three papers we’ll be looking at this week on the theme of verifying correctness of, and catching bugs in, distributed systems. Yesterday we saw the Statecall Policy Language and its associated tool chain, which can verify the behaviour of automata at a single node in a distributed system. Today we look at the harder problem of distributed model checking.

The last five years have seen a rise of software model checkers targeted for distributed systems; for brevity, we categorize such systems as dmck (distributed system model checker). Dmck works by exercising all possible sequences of events (e.g., different reorderings of messages), and thereby pushing the target system into corner-case situations and unearthing hard-to-find bugs. To address the state-space explosion problem, existing dmcks adopt advanced state reduction techniques such as dynamic partial order reduction (DPOR), making them mature and highly practical for checking large-scale systems.

However, they still fall short when it comes to checking behaviour in the presence of failures – which is a problem since, to quote Mike Burrows from the Chubby paper, “readers will be unsurprised to learn that the fail-over code, which is exercised far less often than other parts of the system, has been a rich source of interesting bugs.” The authors provide further evidence to back up this claim:

We scanned through thousands of issues from their (ZooKeeper, Hadoop, Cassandra) bug repositories. We then tagged complex reliability bugs that can only be caught by a dmck (i.e., bugs that can occur only on specific orderings of events). We found 94 dmck-catchable bugs. Our major finding is that 50% of them are deep bugs (require complex re-ordering of not only messages but also crashes and reboots).

SAMC solves this problem, enabling the discovery of deep bugs that may also involve crash scenarios, such as the following, reported against ZooKeeper:

ZooKeeper Bug #335:

  1. Nodes A, B, C start with latest txid #10 and elect B as leader
  2. B crashes
  3. Leader election re-run; C becomes leader
  4. Client writes data; A and C commit new txid-value pair {#11:X}
  5. A crashes before committing tx #11
  6. C loses quorum
  7. C crashes
  8. A reboots and B reboots
  9. A becomes leader
  10. Client updates data; A and B commit a new txid-value pair {#11:Y}
  11. C reboots after A’s new tx commit
  12. C synchronizes with A; C notifies A of {#11:X}
  13. A replies to C the “diff” starting with tx #12 (excluding tx {#11:Y}!)
  14. Violation: permanent data inconsistency as A and B have {#11:Y} and C has {#11:X}

The authors’ prototype ‘SAMPRO’ runs to just under 11 Kloc of Java code. It was used to test 10 versions (both old and current) of ZooKeeper, Hadoop/YARN, and Cassandra across 7 different protocols (leader election, atomic broadcast, cluster management, speculative execution, read/write, hinted handoff, and gossiper). For validation, versions with known deep bugs were tested (SAMC found them all without prompting), as well as the latest versions:

We take 12 old real-world deep bugs that require multiple crashes and reboots (some involve as high as 3 crashes and 3 reboots) and show that SAMC can find the bugs one to three orders of magnitude faster compared to state-of-the-art techniques such as black-box DPOR, random+DPOR, and pure random. We show that this speed saves tens of hours of testing time. More importantly, some deep bugs cannot be reached by non-SAMC approaches, even after 2 days; here, SAMC’s speed-up factor is potentially much higher. We also found 2 new bugs in the latest version of ZooKeeper and Hadoop.

Some of these deep bugs were found with less than 5 minutes of testing; the hardest required 576 executions to find, each lasting approximately 40 seconds (that’s about 6.5 hours). If any one of these bugs occurred ‘in the wild’ – and you were lucky enough to detect it in the first place – I’m betting it would take you many more than 6.5 hours to track it down!

Hopefully this is enough to convince you that the approach is more than just an academic exercise and has applicability for real-world systems. Integration with target systems is made as non-intrusive as possible – and on a personal note I’m very pleased to see they use AspectJ for this.

We design SAMPRO to be highly portable; we do not modify the target code base significantly as we leverage a mature interposition technology, AspectJ, for interposing network messages and timeouts… SAMPRO is also equipped with crash and reboot scripts specific to the target systems. The tester can specify a budget of the maximum number of crashes and reboots to inject per execution.
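To make the interposition idea concrete, here’s a minimal sketch of what an AspectJ interposition point could look like. The Node, Message, and Scheduler names are my own assumptions for illustration, not SAMPRO’s actual API:

    // Hypothetical AspectJ aspect illustrating dmck-style interposition.
    // It intercepts an assumed Node.sendMessage(Message) call and holds
    // the message until a dmck-side Scheduler enables it.
    public aspect MessageInterposition {
        pointcut messageSend(Message m) :
            call(void Node.sendMessage(Message)) && args(m);

        void around(Message m) : messageSend(m) {
            Scheduler.register(m);          // hand the event over to the dmck
            Scheduler.waitUntilEnabled(m);  // block until the dmck enables it
            proceed(m);                     // then perform the original send
        }
    }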

The secret to SAMC is carefully controlling state-space explosion, keeping the area the model checker has to explore as contained as possible. If you’re interested in the details of how it works, read on!

A short introduction to dmcks and the prior art

[A] dmck inserts an interposition layer in each node of the target system with the purpose of controlling all important events (e.g., network messages, timeouts) and preventing the target system from processing the events until the dmck enables them. A main dmck mechanism is the permutation of events; the goal is to push the target system into all possible ordering scenarios. For example, the dmck can enforce abcd ordering in one execution, bcad in another, and so on… To model check a specific protocol, dmck starts a workload driver (which restarts the whole system, runs specific workloads, etc.). Then, dmck generates many (typically hundreds/thousands) executions; an execution (or a path) is a specific ordering of events that dmck enables (e.g., abcd, dbca) from an initial state to a termination point… Dmck continuously runs state checks (e.g., safety checks) to verify the system’s correctness.
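To picture the basic mechanism, here’s a deliberately naive sketch of such an exploration loop in Java (all names are hypothetical; real dmcks are considerably more sophisticated):

    import java.util.*;

    // Naive dmck exploration: recursively try every ordering of the pending
    // events, replaying the target system once per complete ordering. With n
    // pending events this explores n! paths – hence the explosion noted below.
    class NaiveDmck {
        static class Event { String id; }

        void explore(List<Event> pending, List<Event> path) {
            if (pending.isEmpty()) {       // termination point: one full execution
                replayAndCheck(path);
                return;
            }
            for (Event e : new ArrayList<>(pending)) { // branch on each next event
                pending.remove(e);
                path.add(e);
                explore(pending, path);
                path.remove(path.size() - 1);          // backtrack
                pending.add(e);
            }
        }

        void replayAndCheck(List<Event> path) {
            // Restart the system via the workload driver, enable events in
            // 'path' order, and run state (safety) checks throughout.
        }
    }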

The challenge with this basic approach is that the state space to be explored is huge! So for the approach to be viable, it’s necessary to find ways to reduce it. MODIST uses a black-box approach called dynamic partial order reduction (DPOR) to determine that certain events are independent, and thus that their execution order does not matter.

Fortunately, the notion of event independence can be mapped to distributed system properties. For example, MODIST specifies this reduction policy: a message to be processed by a given node is independent of other concurrent messages destined to other nodes (based on vector clocks).
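A rough sketch of that reduction policy (the types and fields here are assumed for illustration; this is not MODIST’s actual code):

    // Two concurrent messages bound for different nodes commute, so only
    // one of their two interleavings needs to be explored.
    class Dpor {
        static class Message { int destination; int[] clock; }

        static boolean independent(Message m1, Message m2) {
            return m1.destination != m2.destination && concurrent(m1, m2);
        }

        // Causally concurrent: neither vector clock dominates the other.
        static boolean concurrent(Message a, Message b) {
            return !dominates(a.clock, b.clock) && !dominates(b.clock, a.clock);
        }

        static boolean dominates(int[] x, int[] y) {
            boolean strictlyGreater = false;
            for (int i = 0; i < x.length; i++) {
                if (x[i] < y[i]) return false;
                if (x[i] > y[i]) strictlyGreater = true;
            }
            return strictlyGreater;
        }
    }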

After DPOR, the next advance came with dynamic interface reduction (DIR).

The reduction intuition behind DIR is that different thread interleavings often lead to the same global events (e.g., a node sends the same messages regardless of how threads are interleaved in that node).

Symmetry reduction exploits symmetry in the system (surprise!):

Other than the aforementioned methods, symmetry is another foundational reduction policy. Symmetry-based methods exploit the architectural symmetry present in the target system. For example, in a ring of nodes, one can rotate the ring without affecting the behavior of the system. Symmetry is powerful, but we find no existing dmcks that adopt symmetry.

None of the above turns out to be good enough though: these dmcks don’t find deep bugs quickly enough, and they don’t scale when failure events are included.

Last week we saw the theme of exploiting application semantics at the datastore layer in order to break through to new levels of performance and scalability. SAMC applies the same concept to dmck, using a white-box model with some knowledge of application semantics to take dmcks to the next level of performance and scalability.

SAMC’s semantic-aware reduction patterns

SAMC is built on top of sound model checking foundations such as DPOR and symmetry. We name these foundations as mechanisms because a dmck must specify accordingly what events are dependent/independent and symmetrical, which in SAMC will be done by the reduction policies and protocol-specific rules. Our main contribution lies within our four novel semantic-aware reduction policies: local-message independence (LMI), crash-message independence (CMI), crash recovery symmetry (CRS), and reboot synchronization symmetry (RSS). To the best of our knowledge, none of these approaches have been introduced in the literature. At the heart of these policies are generic event processing patterns (i.e., patterns of how messages, crashes, and reboots are processed by distributed systems). Our policies and patterns are simple and powerful; they can be applied to many different distributed systems. Testers can extract the patterns from their target protocols (e.g., leader election, atomic broadcast) and write protocol-specific rules in a few lines of code.

Let’s look at Local Message Independence in a little detail to give you a flavour, and then I’ll briefly explain the other three policies. Local messages are defined as messages currently in flight to a given node. Four message-processing patterns are identified: on receipt of a given message, and depending on its current local state, a node may:

  • Discard the message (take no action)
  • Increment a counter (a special case since this is so common in distributed algorithms)
  • Update the local state to some constant value
  • Modify the local state in some other way

The user specifies predicates (p_discard, p_increment, p_constant, and p_modify – abbreviated pd, pi, pc, and pm) that take a message and local state, and indicate when these processing patterns apply.

Based on these patterns, we can apply LMI in the following ways. (1) m1 is independent of m2 if pd is true on either of m1 and m2. That is, if m1 (or m2) will be discarded, then it does not need to be re-ordered with other messages. (2) m1 is independent of m2 if pi (or pc) is true on both m1 and m2. That is, the re-orderings do not matter because the local state is monotonically increasing by one (or changed to the same constant). (3) m1 and m2 are dependent if pm is true on m1 and pd is not true on m2. That is, since both messages modify the local state in unique ways, then the re-orderings can be “interesting” and hence should be exercised. All these rules are continuously evaluated before every event is enabled. If multiple cases are true, dependency has higher precedence than independence.
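A minimal sketch of that decision procedure, with the four tester-supplied predicates passed in (the Predicate interface and all names here are my own assumptions, not SAMPRO internals):

    // LMI dependency decision for two pending messages m1, m2 destined for
    // the same node, whose current local state is s.
    class Lmi {
        static class Message { /* protocol message fields */ }
        static class State   { /* node-local state fields */ }

        interface Predicate { boolean test(Message m, State s); }

        static boolean dependent(Message m1, Message m2, State s,
                                 Predicate pd, Predicate pi,
                                 Predicate pc, Predicate pm) {
            // Rule 3 first: dependency takes precedence over independence.
            if (pm.test(m1, s) && !pd.test(m2, s)) return true;
            if (pm.test(m2, s) && !pd.test(m1, s)) return true;
            // Rule 1: a message that will be discarded need not be reordered.
            if (pd.test(m1, s) || pd.test(m2, s)) return false;
            // Rule 2: both increment a counter, or both set the same constant.
            if (pi.test(m1, s) && pi.test(m2, s)) return false;
            if (pc.test(m1, s) && pc.test(m2, s)) return false;
            // Otherwise, conservatively explore both orders.
            return true;
        }
    }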

The remaining reduction policies are:

  • Crash-Message Independence (CMI) specifies the independence relationship between a crash to be injected and the outstanding messages. The tester is required to distinguish cases where a crash may lead to new message sends from cases where it can only affect local state. For example, in a quorum system, a follower crash that still leaves a quorum intact may only impact local state, whereas a crash that breaks the quorum will trigger election messages. In the cases that only affect local state, SAMC can skip redundant re-orderings of crashes with respect to outstanding messages (see the sketch after this list).

  • Crash Recovery Symmetry (CRS) exploits symmetry to reduce the number of crash faults that need to be injected:

The intuition is that some crashes often lead to symmetrical recovery behaviors. For example, let’s assume a 4-node system with node roles FFFL. At this state, crashing the first or second or third node perhaps leads to the same recovery since all of them are followers, and thereby injecting one follower crash could be enough. Further on, if the system enters a slightly different state, FFLF, crashing any of the followers might give the same result as above.

  • Reboot Synchronization Symmetry (RSS): “The intuition behind reboot synchronization symmetry (RSS) is similar to that of CRS. When a node reboots, it typically synchronizes itself with the peers. However, a reboot will not lead to a new scenario if the current state of the system is similar to the state when the node crashed.”
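As a toy illustration of the kind of CMI rule a tester might write for a quorum-based protocol (the names and quorum arithmetic are mine, not taken from SAMPRO):

    // A follower crash that still leaves a majority alive only affects local
    // state, so it need not be reordered against the outstanding messages.
    class Cmi {
        static boolean crashOnlyAffectsLocalState(boolean crashingLeader,
                                                  int liveNodes, int clusterSize) {
            if (crashingLeader) return false;         // a leader crash triggers an election
            return (liveNodes - 1) > clusterSize / 2; // quorum survives the follower crash
        }
    }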

Pattern extraction

Given these reduction policies and their generic event-processing patterns, a software tester must write protocol-specific rules by extracting the patterns from their target systems.

Given the patterns described in previous sections, a tester must perform what we call an “extraction” phase. Here, the tester must extract the patterns from the target system and write protocol-specific rules specifically by filling in the predicates and abstractions as defined in previous sections… Currently, the extraction phase is manual; we leave automated approaches as future work… The processing patterns only cover high-level semantics; testers just fill in the predicates and abstractions but no more details. In fact, simple semantics are enough to significantly help dmck go faster to deeper states.

Here’s an example of the Local Message Independence specification for Zookeeper’s Leader Election protocol:

    bool pd : !newVote(m, s)
    
    bool pm : newVote(m, s)
    
    bool newVote(m, s) :
       if (m.ep > s.ep) 
         ret 1;
       else if (m.ep == s.ep) 
         if (m.tx > s.tx) 
           ret 1;
         else if (m.tx == s.tx && m.lid > s.lid) 
           ret 1;
           
       ret 0;      

I’m not a fan of the syntax here, but hopefully you get the idea of the level at which things need to be specified. (pd is the discard predicate, and pm is the modify predicate; ep = epoch, tx = latest tx id, lid = leader id).
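For those who’d rather see it in Java, here’s a direct translation of the specification above (class and method names are mine):

    // Plain-Java rendering of the ZooKeeper leader election predicates.
    class ZkLeaderElectionRules {
        static class Vote { long ep, tx, lid; } // epoch, latest txid, leader id

        // A message is a "new vote" if it beats the local state on
        // (epoch, txid, leader id), compared lexicographically.
        static boolean newVote(Vote m, Vote s) {
            if (m.ep > s.ep) return true;
            if (m.ep == s.ep) {
                if (m.tx > s.tx) return true;
                if (m.tx == s.tx && m.lid > s.lid) return true;
            }
            return false;
        }

        static boolean pd(Vote m, Vote s) { return !newVote(m, s); } // discard
        static boolean pm(Vote m, Vote s) { return  newVote(m, s); } // modify
    }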

The protocol-specific rules are written in only 35 LOC/protocol on average. This shows the simplicity of applying SAMC reduction policies across different systems and protocols; all the rigorous state exploration and reduction are automatically done by SAMPRO.