Edelweiss: Automatic storage reclamation for distributed programming

Edelweiss: Automatic storage reclamation for distributed programming – Conway et al. 2014

This is the final selection from Peter Alvaro is his desert island paper week, and what a great paper to finish on!

Please don’t let the title of this paper put you off! To be honest, the title didn’t really grab my attention at first – but the contents of the paper, well that’s a whole different story. To me, Edelweiss is a very exciting development, and a signpost towards a better way of building distributed systems.

Earlier in my career, I had the good fortune to be embedded inside an outstanding technical team including two engineers I’ve subsequently worked with over an extended period: Glyn Normington and Steve Powell. The project we were working on involved a major enhancement to a TP monitor (good old-fashioned transaction processing). Glyn and Steve are both very experienced in the pragmatic application of formal methods, and on this project something wonderful (to my eyes) happened: for a key part of the system a simple formal specification was produced (using CSP if I recall correctly) and shown to exhibit the desired behaviour. This simple model was then progressively refined to a more space & time efficient implementation model, showing equivalence at each step and hence the continued correctness of the more complex models.

This is what Edelweiss does – but for code. You write simple code to express what the system should do, and Edelweiss introduces a series of refinements that preserve the user-visible behaviour and improve efficiency (in this case, the efficiency of Event Log Exchange (ELE) based systems). Fans of CQRS and Event Sourcing take note!

Both practitioners and academics have proposed writing systems by using immutable state whenever possible. Rather than directly modifying shared state, processes instead accumulate and exchange immutable logs of messages or events, a model we call Event Log Exchange (ELE). Previously learned information is never replaced or deleted, but can simply be masked by recording new facts that indicate that the previous information is no longer useful.

ELE-based designs have a number of advantages, which space prevents me from exploring here, but also bring with them the problem of ever-growing event logs.

To avoid this, ELE designs typically include a background “garbage collection” or “checkpointing” protocol that reclaims in- formation that is no longer useful. This pattern of logging and background reclamation is widespread in the distributed storage and data management literature, having been applied to many core techniques including reliable broadcast and update propagation, group communication, key-value storage, distributed file systems, causal consistency, quorum consensus, transaction management, and multi-version concurrency control.

These schemes are typically built from scratch by hand and are subtle and hard to get right. Edelweiss is a sublanguage of Bloom that omits primitives for mutating and deleting data (since ELE applications don’t do that).

Edelweiss programs describe how local knowledge contributes to the distributed computation. The system computes the complementary garbage collection logic: that is, it automatically and safely discards data that will never be useful in the future.

As a side-benefit of reading the paper, you will get a wonderful and concise introduction to Bloom itself, which is necessary to follow the examples. Starting from a baseline of unreliable message delivery, we are then treated to a succession of impressively short programs for reliable unicast, reliable broadcast, a key-value store, a causally-consistent key-value store, and atomic registers.

The largest of these is the distributed key-value store with causal-consistency which requires 19 lines of Edelweiss rules. (And the resulting efficient implementation generated by Edelweiss is equivalent to a 62 line vanilla Bloom program). Compare both to a recent C++ implementation that required about 13Kloc!

The garbage collection schemes generated by Edelweiss are often similar to hand-written schemes proposed in the literature for each design. Moreover, removing the need for hand-crafted garbage collection schemes simplifies program design — the resulting programs are more declarative, and the programmer can focus on solving their domain problem rather than worrying about storage.

I have so many highlights in my copy of the paper that it would be impossible to share them all with you here without almost duplicating the original. So instead let me give you a flavour of the Edelweiss / Bloom language and describe the refinement rules Edelweiss uses. If you like what you see, I strongly encourage you to go on and read the full paper – you’re missing out if you don’t!

Bloom and Edelweiss

A Bloom cluster is a set of nodes. Each node has collections and rules. Collections have a schema and comprise of a set of facts (tuples) like a relation in Datalog. There are several collection types in Bloom – table and channel being the two you’ll see most in this paper. Table is a persistent collection, and channel collections allow async communication: a fact inserted into a channel is delivered to a remote node at a non-deterministic time in the future.

Nodes run an event loop: in each timestep inbound messages are received and added to local collections. Then the rules are evaluated over those collections to produce new values. If these denote outbound messages (facts added to channels) then the messages are sent. Then the node returns to listening for inbound messages.

Let’s look at the simple reliable unicast program which builds on top of the underlying unacknowledged asynchronous communication.

    01  class Unicast
    02      include Bud
    03
    04      state do
    05          channel :chn, [:id] => [@addr, :val]
    06          table :sbuf, [:id] => [:addr, :val]
    07          table :rbuf, sbuf.schema
    08      end
    09
    10      bloom do
    11          chn <~ sbuf
    12          rbuf <= chn
    13      end
    14  end

The state block (lines 04-08) defines the collections, and the bloom block (lines 10-13) defines the rules.

The chn collection (line 05) stores (id, addr, val) facts (tuples). id is the key, and the @ symbol in front of addr indicates that this column specifies the location to which channel messages will be sent.

The sbuf collection (line 06) stores (id, addr, val) facts and acts as the send buffer of information we have transmitted. The rbuf collection (line 07) is the receive buffer for messages we have received.

In each timestamp we match facts in input collections on the rhs of rules, and add new facts to the collections named in the lhs of rules.

At line 11, the <~ operator signifies that the rhs will be merged into the lhs collection at some non-deterministic future time. So this rule copies facts from the send buffer into the channel for asynchronous delivery.

In line 12, any incoming messages (facts) received in the channel collection are added to the receive buffer.

This is the complete program, and is enough to give you a feel for the declarative programming style of Bloom/Edelweiss.

An Edelweiss open source implementation is available.

Refinement 1: Avoiding Redundant Messages

The beauty of this Edelweiss program is how clear and simple it is to reason about the behaviour. The externally visible behaviour is exactly what we want, but the internal efficiency leaves something to be desired. For example, do we really need to grow our send buffer indefinitely, and insert every fact in it into the communication channel on every timestep?

Edelweiss's ARM (Avoidance of Redundant Messages) refinement rule can detect this inefficiency and automatically rewrite the program to avoid duplicate transmissions. Edelweiss cannot change the program semantics (reliable) delivery, so figures out that it needs to introduce an ack mechanism. It can infer that acks only need to contain message ids (the key of chn).

The result of this Edelweiss refinement is as if the programmer had added the following two collections:

    table :chn_approx, [:id]
    channel :chn_ack, [:@sender, :id]

and updated the bloom rules as follows:

    bloom do
        chn  :id)
        rbuf <= chn
        chn_ack <~ chn {|c| [c.source_addr, c.id]}
        chn_approx <= chn_ack.payloads
    end

which is already noticeably more complex than the initial program (though still pretty impressive compared to non-Bloom language implementations!).

Refinement 2: Positive Difference Reclamation (DR+)

The ARM refinement avoided duplicate transmission, but the send buffer still grows indefinitely – we’d like to be able to reclaim acknowledged messages.

Edelweiss’s positive difference reclamation (DR+) refinement is able to infer that sbuf is only referenced by a single rule (the ‘notin’ operator statement in the first line of the rewritten bloom block after applying ARM). sbuf is on the ‘positive’ side of this operator (adding new facts to chn_approx cannot cause things previously sent to need to be resent). The DR+ refinement therefore adds the following rule to the program:

sbuf <- (sbuf * chn_approx).lefts(:id => :id)

(Remember that the program text for the user is still the simple listing we first saw).

Refinement 3: Range Compression

We can also reduce the storage used by the chn_approx collection. We don’t actually need to keep every fact (id) in the collection, we just need to know which ids it is safe to remove from the send buffer. If ids are assigned from a totally ordered sequence, and we expect all messages to be eventually delivered then we would end up with a set of ids ranging from k (initial value) to n (last sent value). With unordered async delivery there may be gaps in this range at any point in time, so we represent it with a range tree by storing sets of pairs (k0,k1), … , (km, kn) where each pair represents a gap-free sequence.

Edelweiss automatically applies range compression to outbound channel messages ‘when profitable.’ The programmer can also explicitly request it for a collection by using the new ‘range’ collection type.

Refinement 4: Punctuations

In the paper you’ll see an extension of the reliable unicast program to reliable broadcast. When ARM and range compression are applied in this new context something very interesting happens:

Recall that for reliable unicast, chn_approx will eventually be range-compressed to a single value, effectively yielding a logical clock. With broadcast, chn_approx will contain a single “clock” value for each node in the broadcast group and hence behaves similarly to a vector clock. That is, the combination of ARM and range compression essentially “discovers” the relationship between event histories and logical clocks! Edelweiss allows programmers to simply manipulate sets of immutable events; it then automatically produces the corresponding “clock” management code.

With broadcast, it is harder to reclaim storage from chn_approx because we need an ack from every recipient before it is safe to do so. Edelweiss handles this by adding the concept of punctuations. A punctuation is a guarantee that no more tuples matching a predicate will appear in a collection. Edelweiss introduces a new sealed collection type which indicates a collection whose contents are fixed after system initialization. This enables Edelweiss to automatically emit a punctuation for the collection. When a collection is not sealed, using epochs can enable the same optimization.

Refinement 5: Negative Difference Reclamation (DR-)

Negative difference reclamation is applied to an expression of the form X.notin(Y, :A => :B) where X and Y are persistent and A is a key of X. It reclaims matching tuples from X and Y at the expense of keeping a range compressed set of X-keys. See the key-value store in the paper for full details.

Edelweiss in practice

See the full paper for a great set of Edelweiss example programs, culminating in a program for atomic read/write registers with snapshot isolation.

While traditional designs utilize mutable storage, we show how a mutable register interface can be implemented via an Edelweiss program that accumulates an immutable event log. We then extend the program to support atomic writes to multiple registers and multi-register reads that reflect a consistent snapshot. For all of these programs, Edelweiss enables automatic and safe garbage collection, generating space-efficient implementations that are semantically equivalent to the original programs.

Future directions include generalizing Edelweiss to join semilattices to capture other kinds of growth beyond just accumulation of facts in a collection – for example, monotonically increasing integers, or boolean values that move from false to true.