Derflow: Distributed Deterministic Dataflow programming for Erlang

Derflow: Distributed Deterministic Dataflow programming for Erlang – Bravo et al. 2014

Today’s choice is part of the work of the SyncFree European research project on large-scale computation without synchronisation.

Non-determinism makes it very difficult to reason about distributed applications. So Bravo et al. figured life might be easier if we could just make them deterministic instead. How do you do that? Just go with Derflow…

Concurrent programs in non-deterministic languages are notoriously hard to prove correct and have led to well-known disasters…. we believe that limiting the ability to write non-deterministic code provides a reasonable alternative to exhaustively checking our applications for correctness.

Derflow is based on the idea of a distributed single-assignment data store, and is built on top of riak core.

Given the same input values, a program written in deterministic dataflow style will always return the same output values, or never return. These input values can be data streams as well, which is a natural generalization of functional programming to the concurrent setting. Our proposed solution provides a distributed deterministic dataflow solution which operates transparently over distributed Erlang, providing the ability to have highly-available, fault-tolerant, deterministic computations.

The basic Derflow model is easy to understand. Think of a single assignment store where the keys are variable identifiers, and the values are either as-yet-unassigned, a reference to another variable, or an Erlang term. Once you’ve assigned a value to a variable, you can never change it.

There’s also a whole bunch of hidden metadata associated with each variable that the system keeps track of:

  • what variables are bound to this one
  • what processes are waiting for the variable to be bound to a value
  • if a variable is part of a stream, what the ‘next’ value in the stream is (see below)
  • any processes waiting on a reader before setting the value (to support lazy production of values)
  • any monitors tracking the availability/reachability of the variable (to handle partitions etc.)

The basic programming model consists of declare() (creates a new variable and returns its id), bind(x,v) (bind a variable x to value v) , and read(x). A process reading an unbound variable will block until a value is bound. Derflow also makes use of Erlang’s spawn to support concurrency.

Streaming data is supported by treating the stream as a list of dataflow variables:

Streams are a useful technique which allow threads, or processes, to communicate and synchronize in concurrent programming. A stream is represented here as a list of dataflow variables, with an unbound dataflow variable as the final element of the list.

The operation produce extends the tail of the stream by binding a new value to the last element, and creating a new unbound last variable (‘next’) which it returns. consume reads an element from the stream and also returns the id of the next variable in the stream.

Failures need to be carefully handled if they are not to introduce non-determinism.

determinism and dataflow variables provide a very useful property for failure handling: redundant computation will not affect the correctness of a deterministic dataflow program. We propose a failure handling model where failed processes or temporarily unreachable processes, can be restarted while still providing the guarantees of the deterministic programming model.

If a computing process fails, it is just re-executed. In the Derflow model, duplicate processing cannot alter the outcome. (Though of course this relies on programs being side-effect free too). If a process is blocked because a variable it is waiting to read never becomes available, then restarts have to cascade more deeply:

The re-execution of blocked process will result in the process immediately blocking again. Therefore we must provide a way to identify dependencies between processes and dataflow variables in order to provide a deterministic restart strategy which guarantees progress. A common strategy to ensure progress in this situation is to restart the process that declared the failed dataflow variable. In addition, all the processes depending on the restarted process should also be restarted.

The implementation of Derflow builds on riak core. mnesia was considered but rejected, in part because of its behaviour under network partitions:

Problems arise in the presence of network partitions where the mnesia nodes on either side of the network partition are able to make progress independently. Currently, no mechanisms exist for reconciling the changes made to the database when nodes reconnect, nor reasoning about concurrent or causally influenced operations.

Several examples of Derflow programs are given, which show that the outcome of the program is independent of the degree of concurrency used in computing it.

In Derflow, any function that uses dataflow variables can be run in a different process while keeping the final result same. Thus, programmers can transparently add concurrency to their programs (either parallelism or distribution) in a secure way without thinking about data races and possible bugs.

However, this concurrency is not managed automatically via the runtime – programmers must explicitly specify it (contrast this to e.g. an execution planner for datalog).

The semantics of Derflow programs are very clearly explained in the paper, and the deterministic model of concurrent execution certainly looks interesting. Yet I found it hard to assess just from this one paper whether Derflow would be an interesting way of writing real systems, or whether the restrictions would feel limiting. Perhaps the answers to this question are found in the literature on Kahn Process Networks on which Derflow is based.

Deterministic dataflow was first proposed by Gilles Kahn in 1974, in a programming model that is now known as Kahn networks. In 1977, a lazy version of this same model was proposed by Kahn and David MacQueen. However, up until recently this model has never become part of mainstream concurrent programming. This may be due to either the model’s inability to express non-determinism or the simultaneous invention of two other models for handling concurrent programming: the actor model (message passing) and monitors (shared state).

(Note that Derflow also introduces a mechanism to support explicit introduction of small amounts of non-determinism where you really need them).