Viewstamped replication: A new primary copy method to support highly available distributed systems – Oki & Liskov ’88.
Given a set of co-operating nodes that form a group, how can we replicate information to group members and maintain a consistent “one copy serializability” property as group members come and go? Oki and Liskov introduce two interlocking protocols: one that describes the behaviour of the group while membership and leadership are stable, and one that describes how the group transitions to a new membership and/or leadership arrangement. The combination of a given set of members (called cohorts in the paper) with one designated as leader (called the primary) is called a view. Each view has a unique id, and every event that occurs within a view has a unique (logical) timestamp. The combination of a view id and event timestamp is known as a viewstamp, whence the paper derives its name.
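To make the idea concrete, here's a minimal sketch of a viewstamp as a value type – the field names and the plain integer view id are my own simplification, not the paper's representation:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Viewstamp:
    """A viewstamp: the id of a view concatenated with a logical timestamp
    generated within that view. Comparison is lexicographic: first by view id,
    then by timestamp. (A plain int view id is my simplification; the paper
    constructs ids so that they are unique and totally ordered.)"""
    view_id: int    # identifier of the view the event occurred in
    timestamp: int  # logical event counter within that view

# Later events within the same view, and any event in a later view,
# both compare as greater.
assert Viewstamp(3, 7) < Viewstamp(3, 8) < Viewstamp(4, 1)
```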
The notion of ‘membership’ is worth some clarification. The idea is that there is a fixed set of members in the group which, when all is functioning normally, will all be able to communicate with each other. One of these will be the primary. View changes occur if there is a network partition or a node fails – the underlying assumption is that this is temporary and network connectivity will be restored, or the failing node restarted, eventually. Adding completely new members, or permanently decommissioning existing ones, is out of scope for this paper (but we’ll cover ‘Viewstamped Replication Revisited’ later this week, which shows how to accommodate this).
The hopes for distributed systems back in the late ’80s seem to be modest:
One of the potential benefits of distributed systems is their use in providing highly-available services, that is, services that are likely to be up and accessible when needed.
(Emphasis mine). You’ll find lots of references to objects and RPCs (of its time), but one thing that really strikes me about the paper is that the whole work was produced without ever fully implementing the ideas! “We believe that our method will perform better than a non-replicated system…”, “At present we are implementing our method, we will be able to run experiments about system performance when our implementation is complete.” So many seemingly great ideas fall into the gap between theory and practice that my instinct is to reserve judgement until a proven implementation is available. Of course though, this work turned out to be no mere ‘castle in the air.’
The network and process models are the familiar ones: non-Byzantine, fail-stop processes, with lost, delayed, duplicated, and out-of-order message delivery.
Our replication method assumes a model of computation in which a distributed program consists of modules, each of which resides at a single node of the network. Each module contains within it both data objects and code that manipulates the objects; modules can recover from crashes with some of their state intact. No other module can access the data objects of another module directly. Instead, each module provides procedures that can be used to access its objects; modules communicate by means of remote procedure calls. Modules that make calls are called clients; the called module is a server.
Each module has a number of replica instances – typically 3 to 5 – called cohorts. One of these is designated the primary which processes requests from clients and participates in a two-phase commit protocol. The others are passive backups and simply receive state information from the primary.
Over time the communication capability within a group may change as cohorts or communication links fail and recover. To reflect this changing situation, each cohort runs in a view. A view is a set of cohorts that are (or were) capable of communicating with each other, together with an indication of which cohort is the primary; it is a subset of the configuration and must contain a majority of group members… Because a timestamp is meaningful only within a view, we introduce viewstamps. A viewstamp is simply a timestamp concatenated with the viewid of the view in which the timestamp was generated. We guarantee that for each viewstamp v in its history, the cohort’s state reflects event e from view v.id iff e’s timestamp is less than or equal to v’s timestamp.
Transaction processing guarantees serializability, and that a transaction can commit only if its events are known to a majority of cohorts. The view change algorithm guarantees that events known to a majority of cohorts survive into subsequent views.
The transaction processing model assumed is a distributed transaction model with 2PC. The paper discusses the client acting as coordinator (not a design choice I would make), but you can also move the coordination to a server-side component without affecting the fundamental algorithm:
If the client is not replicated, it is still desirable for the coordinator to be highly available, since this can reduce the “window of vulnerability” in two-phase commit. This can be accomplished by providing a replicated “coordinator-server.” The client communicates with such a server when it starts a transaction, and when it commits or aborts the transaction. The coordinator-server carries out two-phase commit as described above on the client’s behalf.
There are two key differences to the regular transaction protocol: viewstamps are used to determine whether a transaction can commit, and instead of writing information to stable storage during the 2PC, the primary sends it to the backups over a communication buffer that “provides reliable delivery of event records to all backups in the primary’s view.”
How can we reconcile this ‘reliable delivery of event records to all backups in timestamp order’ with the opening assumptions about lossy, delayed, and out-of-order message delivery? The answer is that these guarantees are made within a view – if the communication buffer fails to deliver a message then the assumption is that a crash or communication failure has occurred, which will cause a view change.
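Here's a rough sketch of what forcing the buffer might look like from the primary's point of view, under my own assumptions – the hook functions and the concrete timeout are placeholders, not anything specified in the paper:

```python
import time
from typing import Callable, Optional, Set, Tuple

def force(last_viewstamp,
          backups: Set[str],
          sub_majority: int,
          send_records: Callable[[str], None],
          recv_ack: Callable[[float], Optional[Tuple[str, object]]],
          timeout: float = 5.0) -> bool:
    """Block until a sub-majority of backups have acknowledged every event
    record up to last_viewstamp, so that (counting the primary itself) a
    majority of the configuration knows the forced events.

    send_records(backup) resends the buffered records to one backup in
    timestamp order; recv_ack(seconds) returns the next (backup, viewstamp)
    acknowledgement or None on timeout. Both are hypothetical hooks, and the
    5 second default stands in for the paper's "fairly long" timeouts.
    """
    for backup in backups:
        send_records(backup)

    acked: Set[str] = set()
    deadline = time.monotonic() + timeout
    while len(acked) < sub_majority:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False            # treat as a crash or partition: the caller
                                    # should initiate a view change
        ack = recv_ack(remaining)
        if ack is None:
            continue                # no ack yet; loop re-checks the deadline
        backup, acked_viewstamp = ack
        if acked_viewstamp >= last_viewstamp:
            acked.add(backup)       # this backup now knows everything we forced
    return True
```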
A 2PC coordinator first sends a prepare message to all participants in the transaction (this set is maintained in a pset as remote procedure calls are made within the scope of the transaction). When a participant (primary for a module group) receives a prepare message it first checks whether it knows about all the calls made during the transaction in its own history. If it does not, it refuses the prepare. If it does, it forces its buffer to ensure that all of the completed-call events are known within a majority of the module instances (i.e. to a sub-majority of its backups), and then replies with an accept.
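Putting that together, a participant's handling of a prepare might look something like the following sketch – the interface and names are mine; only the decision logic comes from the paper:

```python
from typing import Callable, Set

def handle_prepare(calls_in_history: Set[str],
                   calls_expected_by_coordinator: Set[str],
                   force_completed_call_records: Callable[[], bool]) -> str:
    """Sketch of a participant primary's response to a prepare message.

    calls_in_history: call ids this primary knows about for the transaction.
    calls_expected_by_coordinator: call ids the coordinator believes were made here.
    force_completed_call_records: forces the completed-call event records to a
    sub-majority of backups (e.g. the force() sketch above), returning success.
    """
    # 1. Refuse if we don't know about every call made at us during the
    #    transaction (some work may have been lost across a view change).
    if not calls_expected_by_coordinator.issubset(calls_in_history):
        return "refuse"

    # 2. Force the completed-call events so a majority of the configuration
    #    knows them before we vote yes.
    if not force_completed_call_records():
        return "refuse"

    return "accept"
```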
If all participants agree to commit, the coordinator commits the transaction. If any participant votes to abort, the transaction is of course aborted. If the coordinator does not hear back from participants after repeated tries, it first refreshes its cached knowledge of group membership, and then retries the prepare phase if there is a more recent view. It aborts if not. Full details of the processing rules, including how remote procedure calls should be processed, are given in the paper.
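A sketch of the coordinator's side, again with hypothetical hooks standing in for the real messaging and with plain integer view ids as a simplification:

```python
from typing import Callable, Dict, Iterable, Optional

def run_two_phase_commit(pset: Iterable[str],
                         send_prepare: Callable[[str], Optional[str]],
                         send_commit: Callable[[str], None],
                         send_abort: Callable[[str], None],
                         refresh_view: Callable[[], int],
                         cached_view_id: int) -> str:
    """Coordinator side of 2PC with the retry-on-newer-view behaviour described
    above. send_prepare returns "accept", "refuse", or None if the participant
    never answered; all of these hooks are my own, not the paper's interfaces."""
    participants = list(pset)
    while True:
        votes: Dict[str, Optional[str]] = {p: send_prepare(p) for p in participants}

        if all(v == "accept" for v in votes.values()):
            for p in participants:
                send_commit(p)
            return "committed"

        if any(v == "refuse" for v in votes.values()):
            break                       # someone voted no: abort

        # No refusals, but some participants never replied: refresh our cached
        # knowledge of group membership, and retry only if a newer view exists.
        latest_view_id = refresh_view()
        if latest_view_id == cached_view_id:
            break                       # no newer view: give up and abort
        cached_view_id = latest_view_id

    for p in participants:
        send_abort(p)
    return "aborted"
```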
Changing views
Whereas the transaction processing protocol is a slightly modified form of what has gone before, the view change ideas are new.
Transaction processing depends upon forcing information to backups so that a majority of cohorts know about particular events. The job of the view change algorithm is to ensure that events known to a majority of cohorts survive into subsequent views. It does this by ensuring that every view contains at least a majority of cohorts and by starting up the new view in the latest possible state.
This relies on an observation upon which Paxos also depends – if a majority know a certain piece of information x, then any subsequent majority must also include at least one member that knows x.
If every view has at least a majority of cohorts, then it contains at least one cohort that knows about any event that was forced to a majority of cohorts. Thus we need only make sure that the state of the new view includes what that cohort knows. This is done using viewstamps: the state of the cohort with the highest viewstamp for the previous view is used to initialize the state in the new view. This scheme works because event records are sent to the backups in timestamp order, and therefore a cohort with a later viewstamp for some view knows everything known to a cohort with an earlier viewstamp for that view.
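As a quick illustration of the majority-intersection observation, here's a tiny self-contained check over a five-member group (the cohort names are just placeholders):

```python
from itertools import combinations

# Any two majorities of the same group must share at least one member, so a new
# view (which contains a majority) always includes someone who knows each event
# that was forced to a majority in an earlier view.
cohorts = {"a", "b", "c", "d", "e"}
majority = len(cohorts) // 2 + 1
for old_majority in combinations(sorted(cohorts), majority):
    for new_majority in combinations(sorted(cohorts), majority):
        assert set(old_majority) & set(new_majority), "two majorities always intersect"
```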
Cohorts exchange keep-alive messages. Any cohort that detects a change (a member returning to communication, loss of communication to an existing member, or the cohort itself recovering from a crash) initiates the view change protocol in the manager role, with the other cohorts as underlings.
A new view manager creates a new view id, sends invitations to join the view to all other cohorts, and awaits their responses. If the view manager receives an invitation with a higher view id in the meantime, it accepts the invitation and becomes an underling. When cohorts receive the view manager’s invitation, they accept it if the view id is higher than any they have seen before, and become underlings. Acceptances can take one of two forms: if the cohort is up to date, it sends a normal acceptance containing its current viewstamp and an indication of whether or not it is the primary in the current view; if the cohort is not up to date, it sends a crash-accept response.
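A sketch of how a cohort might decide its response to an invitation, with field names of my own invention:

```python
from typing import Optional

def respond_to_invitation(invited_view_id: int,
                          highest_view_id_seen: int,
                          current_viewstamp,
                          was_primary: bool,
                          up_to_date: bool) -> Optional[dict]:
    """Decide a cohort's reply to a view manager's invitation. Returning a
    record means the cohort accepts and becomes an underling; returning None
    means the invitation is stale and is ignored."""
    if invited_view_id <= highest_view_id_seen:
        return None                                  # we've already seen a higher view id

    if not up_to_date:
        # e.g. recovering from a crash: buffered (non-stable) state may be lost,
        # so we can't vouch for what we knew in the old view.
        return {"kind": "crash-accept", "view_id": invited_view_id}

    return {"kind": "accept",
            "view_id": invited_view_id,
            "viewstamp": current_viewstamp,          # how far our state has progressed
            "was_primary": was_primary}
```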
Once all cohorts have responded, or the timeout expires, the new view manager examines the responses to see if it is possible to form a new view.
View formation can succeed only if two conditions are satisfied: at least a majority of cohorts must have accepted the invitation, and at least one of them must know all forced information from previous views. The latter condition may not be true if some acceptances are of the “crashed” variety…. If the view can be formed, the cohort returning the largest viewstamp (in a “normal” acceptance) is selected as the new primary; the old primary of that view is selected if possible, since this causes minimal disruption in the system.
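Here's my reading of the view-formation decision as a sketch; the second check is a simplification of the paper's condition that some normal acceptance must cover all forced information from previous views:

```python
from typing import List, Optional

def try_form_view(acceptances: List[dict], group_size: int) -> Optional[dict]:
    """Sketch of the view manager's decision once responses are in (or the
    timeout has expired). Acceptance records have the shape used in the
    previous sketch."""
    if len(acceptances) <= group_size // 2:
        return None                 # fewer than a majority accepted: no view

    normal = [a for a in acceptances if a["kind"] == "accept"]
    if not normal:
        return None                 # only crash-accepts: nobody can vouch for the forced events

    # Event records reach backups in timestamp order, so the cohort with the
    # largest viewstamp knows everything the others know; prefer the old
    # primary when it is equally up to date, to minimise disruption.
    new_primary = max(normal, key=lambda a: (a["viewstamp"], a["was_primary"]))
    return {"primary": new_primary, "members": acceptances}
```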
Underling cohorts call await-view to find out the decision. If no response is forthcoming within a specified timeout the cohort becomes a view manager and attempts to form a new view.
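In sketch form (the hooks and the timeout value are placeholders of my own choosing):

```python
from typing import Callable, Optional

def await_view(recv_decision: Callable[[float], Optional[dict]],
               become_view_manager: Callable[[], None],
               timeout: float = 10.0) -> Optional[dict]:
    """An underling waiting for the view manager's decision."""
    decision = recv_decision(timeout)   # blocks up to `timeout` seconds, None if nothing arrives
    if decision is None:
        become_view_manager()           # give up on this manager and start our own view change
    return decision
```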
The system performs correctly even if there are several active primaries. This situation could arise when there is a partition and the old primary is slow to notice the need for a view change and continues to respond to client requests even after the new view is formed. The old primary will not be able to prepare and commit user transactions, however, since it cannot force their effects to the backups. If the same cohort is the primary both before and after the view change, then no user work is lost in the change. Otherwise, we guarantee the following: Transactions that prepared in the old view will be able to commit, and those that committed will still be committed.
The authors note that this algorithm is not tolerant of lost messages and slow responses. “Fairly long timeouts” are recommended to mitigate this. The core algorithm also assumes that minimum information is written to stable storage (relying on replication to preserve information after a crash). Therefore in a “catastrophe” that causes several cohorts all to crash simultaneously, information loss is possible.
Note that a catastrophe does not cause a group to enter a new view missing some needed information. Rather, it causes the algorithm to never again form a new view.
Protection against catastrophe can be added by, for example, using stable storage at the primary.
I covered this paper since it precedes Paxos (Paxos and Viewstamped Replication were developed independently at about the same time, in the late ’80s) and is often cited. Later in the week we’re going to look at the 2012 paper “Viewstamped Replication Revisited” which is (for this reader at least) a much clearer exposition of the ideas and includes advances from the intervening 24 years. If you’re only going to read one paper on VR, I’d recommend that one.