ZooKeeper’s Atomic Broadcast Protocol: Theory and practice – Medeiros 2012.
This is part 7 in a ten-part series on consensus and replication.
Perhaps after reading yesterday’s paper on Zab you feel like you’ve got a good high-level understanding of how ZooKeeper atomic broadcast works under the covers. Unfortunately I’ve got news for you – you don’t!* It’s not your fault though – it turns out the ZooKeeper implementation (as of 2012 at least, I haven’t read the latest source code) doesn’t match what’s described in the paper in some key areas. And it’s not the case that it drifted from the paper version over time; it seems that the implementation never matched the paper. This, at least, is what today’s paper choice claims.
(*) Unless you’ve independently kept up to date with the implementation, or already read today’s selection.
Even though bugs are being actively fixed, experience shows us how distributed coordination problems must be dealt (with) with caution. In fact, the sole purpose of ZooKeeper is to properly handle those problems… ZooKeeper’s development, however, has also experienced these quoted problems. Some of these problems came from the difference between the implemented protocol and the published protocol from Junqueira et al. Apparently, this difference has existed since the beginning of the development, which suggests early optimization attempts.
The first part of the paper is a summary of the Zab protocol as we examined it yesterday. Recall that the protocol has three key phases, and that yesterday’s paper glossed over how a leader is chosen (“there’s an oracle”). We could consider the leader election part to be a ‘phase 0’:
At the beginning of Phase 1, the peer inspects its vote and decides whether it should become a follower or a leader. For this reason, leader election is sometimes called Phase 0.
We’ll pick up the story in section 4 of the paper, “Implementation.”
Apache ZooKeeper is written in Java, and the version we have used for studying the implementation was 3.3.3. Version 3.3.4 is the latest stable version (to this date), but this has very little differences in the Zab layer. Recent unstable versions have significant changes, though. Most of the source code is dedicated to ZooKeeper’s storage functions and client communication. Classes responsible for Zab are deep inside the implementation…
(As of the date of this blog post, the most recent stable release is 3.4.6).
The Java implementation of Zab roughly follows Algorithms 1, 2, and 3. Several optimizations were added to the source code, which make the actual implementation look significantly different from what we have seen in the previous section [original paper]. In particular, the default leader election algorithm for Phase 0 is tightly coupled with the implementation of Phase 1.
This algorithm is known as Fast Leader Election (FLE) and attempts to minimize state transfers on leader election by electing the peer with the most up-to-date history. As a consequence, there ends up being no clear distinction between phases 1 and 2 of the protocol, so these are conceptually grouped together into what is called the recovery phase. The recovery phase comes right after phase 0 (FLE) and assumes that the leader already has the latest history in the quorum. The revised recovery phase works as follows:
- When the leader receives a FOLLOWERINFO message from a follower, it sends a NEWLEADER response with the leader’s most recent transaction id.
- The leader then sends one of three messages, SNAP, DIFF, or TRUNC, intended to synchronize state between the leader and the follower. If the follower is too far out of date, a SNAP message containing a complete snapshot of the state is sent. If the follower is only slightly out of date, a DIFF message is sent with just the proposals the follower is missing. If the follower is ahead of the leader, a TRUNC message is sent (a sketch of this decision appears below).
- When a follower receives the NEWLEADER message, it checks that the epoch in the message is later than the follower’s own – and transitions to the election state to execute phase 0 if it is not.
- When a follower receives SNAP it copies the snapshot to its database and commits the changes. If it receives DIFF it accepts all proposals in the diff and commits the changes. If the follower receives TRUNC (with a transaction id), it aborts all proposals it knows of from the TRUNC id onwards.
- Following the processing of SNAP, DIFF, or TRUNC, the follower sends back a new leader acknowledgement.
- Once the leader has a quorum of acknowledgements, the protocol transitions to phase 3.
The purpose of this synchronization is to keep the replicas in a mutually consistent state. In order to do so, committed transactions in any replica must be committed in all other replicas, in the same order. Furthermore, proposed transactions that should not be committed anymore must be abandoned so that no peer commits them. Messages SNAP and DIFF take care of the former case, while TRUNC is responsible for the latter.
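To make the choice between SNAP, DIFF, and TRUNC a little more concrete, here is a minimal Java sketch of the leader-side decision under the assumptions above. The class, method, and parameter names (RecoverySync, decide, leaderOldestLoggedZxid) are my own illustrative inventions, not the actual ZooKeeper classes – in the real server the equivalent logic is buried deep inside the implementation, as the paper notes.

```java
// Illustrative sketch only: not ZooKeeper's real classes or field names.
public class RecoverySync {

    enum SyncType { SNAP, DIFF, TRUNC }

    static final class SyncDecision {
        final SyncType type;
        final long zxid; // for DIFF: where to diff from; for TRUNC: where to truncate to

        SyncDecision(SyncType type, long zxid) {
            this.type = type;
            this.zxid = zxid;
        }
    }

    /**
     * Decide how to bring a follower up to date, given the follower's last
     * transaction id and the range of transactions the leader still retains.
     */
    static SyncDecision decide(long followerLastZxid,
                               long leaderOldestLoggedZxid,
                               long leaderLastZxid) {
        if (followerLastZxid > leaderLastZxid) {
            // Follower is ahead of the leader: tell it to abandon (truncate)
            // everything after the leader's last transaction.
            return new SyncDecision(SyncType.TRUNC, leaderLastZxid);
        } else if (followerLastZxid >= leaderOldestLoggedZxid) {
            // Follower is only slightly behind: send just the missing proposals.
            return new SyncDecision(SyncType.DIFF, followerLastZxid);
        } else {
            // Follower is too far behind: send a full snapshot of the state.
            return new SyncDecision(SyncType.SNAP, leaderLastZxid);
        }
    }
}
```

The point of the three-way split is to keep state transfer as small as possible: a follower that is nearly up to date gets only the proposals it missed, and only a badly lagging follower pays the cost of a full snapshot.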
Let us now turn our attention to the FLE protocol itself.
In FLE, peers in the election state vote for other peers for the purpose of electing a leader with the latest history. Peers exchange notifications about their votes, and they update their own vote when a peer with more recent history is discovered. A local execution of FLE will terminate returning a vote for a single peer and then transition to Recovery Phase. If the vote was for the peer itself, it shifts to state leading (and following itself), otherwise it goes to state following.
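As a rough illustration of the “peer with more recent history wins” rule, here is a sketch of the vote ordering a peer might apply when deciding whether to adopt another peer’s vote. The Vote class and the supersedes method are illustrative names of my own, not ZooKeeper’s FastLeaderElection code; the epoch, then zxid, then server-id tie-breaking order follows the usual description of FLE.

```java
// Illustrative sketch of FLE's vote ordering; not the real implementation.
public final class Vote {
    final long serverId; // id of the peer being proposed as leader
    final long lastZxid; // last transaction id in that peer's history
    final long epoch;    // that peer's most recent epoch

    Vote(long serverId, long lastZxid, long epoch) {
        this.serverId = serverId;
        this.lastZxid = lastZxid;
        this.epoch = epoch;
    }

    /**
     * Returns true if the candidate vote should replace the current vote:
     * prefer the higher epoch, then the more recent history (higher zxid),
     * and break any remaining tie with the larger server id.
     */
    static boolean supersedes(Vote candidate, Vote current) {
        if (candidate.epoch != current.epoch) {
            return candidate.epoch > current.epoch;
        }
        if (candidate.lastZxid != current.lastZxid) {
            return candidate.lastZxid > current.lastZxid;
        }
        return candidate.serverId > current.serverId;
    }
}
```

Because any remaining tie is broken deterministically by server id, peers comparing the same set of notifications converge on the same vote, which is what lets a local execution of FLE terminate with a single winner.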
“Some problems have emerged from the implemented protocol due to differences from the designed protocol…” Two of these are described: one related to a looping behaviour where the system can get stuck bouncing between the FLE and Recovery phases, and one relating to the way proposals are abandoned on receipt of a TRUNC message.
Getting these protocols right is hard. We have previously seen how Amazon and Google use formal techniques to uncover deep bugs in distributed systems. In “SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems,” Leesatapornwongsa et al. describe a very promising technique that found a number of known-to-exist deep bugs in ZooKeeper, and also some new ones (as well as in YARN and in Cassandra). We’ll take a look at that work in a future edition of The Morning Paper.