Mergeable persistent data structures

Mergeable persistent data structures – Farinier et al. 2014

Irmin is part of the MirageOS project that was the subject of yesterday’s paper, where it is also the basis for a Git-like persistent file system used for the OS.

What if you could version-control a (mutable) persistent data structure, inspect its history, clone a remote state and revert it to a previous state? We find that using the existing tools around source-code DVCS and applying them to application data results in significant benefits to developing distributed applications. Instead of structuring our system like a conventional database, we instead provide the functionality as a library that can be linked and customised directly by the application logic.

Irmin is an OCaml implementation of this idea, using the Git repository format:

Irmin uses the same concepts as Git: clone, push, pull, branch, rebase, … and exposes them as a library that can be used by any OCaml application. It also features a bidirectional mapping with the Git format: any changes committed by an application are reflected in the associated Git repository, and any manual change in the Git repository is reflected back in the application via Irmin. This lets users use the usual Git tools (tig, gitk, . . .) to inspect and manipulate the data structure.

I’m reminded of the work on differential sync first pointed out to me by some of my previous colleauges at SpringSource who exploited the technique in a JSON context. Differential sync bases itself on the ‘diff’ mechanism though, whereas Irmin exploits the full DVCS machinery. If you used diff and patch to implement merge in Irmin, you’d have an interesting combination of the two…

There’s a little teaser up front in the paper that ‘all this comes at a reasonable cost.’ What is that cost? About a factor of 5 it later turns out.

Conflicts are handled by one of three mechanisms:

  1. By using CRDTs (conflict-free)
  2. By providing a data-structure specific merge operation
  3. By pushing the problem onto the developer of apps built using the store

Irmin has an underlying append-only block store, described as a key/value store but the user is not in control of the keys, so it feels to me more like a Blob store – the key returned from a store operation is simply a hash of the stored value. Built on top of this is a mutable tag store that is a genuine key/value store.

[the tag store…] is a key/value store, where keys are names created by users and values are keys from the block store, and can be seen as a set of named pointers to keys in the block store. This store is expected to be local to each replica and very small. The tag store is central for higher-level algorithms such as synchronisation and garbage collection.

In this paper we are treated to an in-depth look at how merge is supported for two particular data structures – using the 2nd conflict-resolution strategy detailed above – a queue and a rope. (Details of other aspects of Irmin are deferred to other works).

Mergeable Queue

I had to read over this section several times to follow the discussion, so I hope that I’ve been able to faithfully capture key aspects of the design here. First remember that everything is ultimately stored in Irmin’s append-only store. A single queue is then internally implemented as two linked lists. One list (the push list) is the list that items are added to when a new value is pushed onto the queue. The second list (the pop list) is used when elements are to be popped from the list. If the pop list is empty at the time a pop request comes in then a normalization process moves all entries from the push list to the pop list for efficient subsequent popping.

I can’t quite reconcile the data structures that we’re shown with the prose, but key to the merge operation is that the index structure maintains push and pop elements. These are not the heads of the push and pop lists, but instead counts of the number of push and pop operations that have happened on the queue. The merge algorithm implies a FIFO queue, though I don’t think this is explicitly stated anywhere.

Given this, take an old queue that has seen, for example, 6 pushes and 2 pops (so has current length 4), and two diverged replicas q1 and q2. The respective push and pop counts for q1 and q2 stand at (8,2) and (7,4). So we know that q1 has seen 2 pushes and no pops compared to the old version, and q2 has seen 1 push and 2 pops. Proceed as follows:

  • Pop two elements from q1, to represent the pops that happened in q2 but are not reflected in q1 yet, yielding q1′
  • Remove from q2 elements that are now in both q1 and q2 as a result of inheritance from old. The old queue had length 4, and only 2 additional items have been popped in the q2 branch, so we need to pop 2 more elements from q2 to remove these possible duplicates, yielding q2′
  • The new merged queue is obtained by concatenating q1′ and ‘q2’.

Mergeable Rope

A rope is a data structure that is used for efficiently storing and manipulating a very long string. A rope is a binary tree having leaf nodes that contain a short string. Each node has an index equal to the sum of the length of each string in its left subtree. Thus a node with two children divides the whole string into two parts: the left subtree stores the first part of the string and the right subtree stores the second part. The binary tree is crossed from the root to leaf each time the string has to be accessed. It can be done in log(n) time if the tree is balanced.

(The MirageOS Git-like fs is based on the rope structure).

Ropes beat Strings when it comes to concatenate, insert, and delete operations (O(log n) vs O(n)), trading off for slightly more expensive set/get and split operations (O(log n) vs O(1)). A Rope does not have to hold string data, they can also be used for any other type of data that supports these operations (and merge).

Here’s a quick sketch of how merge works: first determine the smallest subtree where merges occured. Now recursively compare sub-branches. If the sub-branch is the same in both the of the diverged replicas, then use it. If one replica’s branch has diverged from the common ancestor, but the other replica’s one has not, keep the one with the changes in it. If there are changes on both sides, flatten the sub-trees into a user-provided container with a data type specific merge operation defined on it.

In this way, many changes can be automatically merged without needing to invoke the user-supplied function. Rebalancing the tree makes merge more expensive so it is done as infrequently as possible, and with as small a scope as possible each time.