Blazes: Coordination analysis for distributed programs – Alvaro et al. 2014
For many practitioners, distributed consistency is the most critical issue for system performance and manageability at scale.
In Blazes, Alvaro et al. take a fresh look at ‘an urgent issue for distributed systems developers,’ namely the correctness and efficiency of distributed consistency mechanisms for fault-tolerant services. We need enough coordination to avoid anomalies, but ideally no more than that since otherwise the coordination overhead will reduce performance. So how do we know how much is enough, and how can we avoid over-coordinating?
Blazes analyzes a dataflow-based model of your distributed system (based on components and their interactions). If the program is written in Bloom, the model can be inferred from the program itself. For Java, a small set of annotations is used to specify the model's characteristics. From the model, Blazes figures out where coordination will be required, and generates the code necessary to perform that coordination (though details of the generated code are not provided in this paper).
It can be inferred that a ‘Blazes engine’ takes as input a model description, and produces as output a coordination specification which can in turn be fed into a code generator. This sounds like a good addition to the distributed systems engineer’s toolkit, and a nice complement to tools such as TLA+ (as used by Amazon), regardless of the particular embodiment of source model format or code generation. Indeed, the analysis results will surely be of interest even without the code generation.
The key intuition exploited by BLAZES is that even when components are order-sensitive, it is often possible to avoid the cost of global ordering without sacrificing consistency. In many cases, BLAZES can ensure consistent outcomes via a more efficient and manageable protocol of asynchronous point-to-point communication between producers and consumers — called sealing — that indicates when partitions of a stream have stopped changing.
A component is a possibly stateful unit of computation that consumes input streams and produces output streams. Streams are unbounded, unordered collections of messages. Components and streams are logical constructs and may be physically embodied in multiple instances. An important additional part of the model is the ability to embed punctuations in a stream (supporting windows).
A punctuation guarantees that the producer will generate no more messages within a particular logical partition of the stream.
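To make this concrete, here is a minimal sketch (my own illustration, not Blazes code; the Message and Seal types are hypothetical) of a producer that tags each message with a logical partition key and then emits a punctuation promising that nothing more will follow for that partition:

```python
from dataclasses import dataclass
from typing import Iterator, List, Union

@dataclass(frozen=True)
class Message:
    key: str      # logical partition of the stream (e.g. a session id)
    payload: str

@dataclass(frozen=True)
class Seal:
    key: str      # promise: no more messages will follow for this partition

def produce(session_id: str, events: List[str]) -> Iterator[Union[Message, Seal]]:
    """Emit every message for one partition, then punctuate that partition."""
    for e in events:
        yield Message(key=session_id, payload=e)
    # The punctuation guarantees the producer will send nothing more for this key.
    yield Seal(key=session_id)

for item in produce("session-42", ["click", "click", "purchase"]):
    print(item)
```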
If you want to be able to tolerate the loss of a given instance of state, you must either be able to recreate that state from something else you still have (a replay strategy), or you must have another copy of it somewhere else (replication). Fault tolerance therefore requires that a distributed system support at least one of these two strategies: replay or replication. Non-deterministic messaging interacts with this in subtle ways to introduce various kinds of system anomalies in the absence of sufficient coordination.
- Run anomalies occur when a component produces different outputs in different runs over the same inputs.
- Instance anomalies occur when replicated component instances produce different outputs in the same execution, over the same inputs.
- Divergent anomalies are those where the state of multiple replicas becomes permanently inconsistent.
If a component uses monotonic logic then, per the CALM theorem, it will produce deterministic results despite non-determinism in input delivery order. That sounds like the kind of thing it would be useful to know when figuring out how much coordination is required.
We call a dataflow component confluent if it produces the same set of outputs for all orderings of its inputs. Confluent components never exhibit any of the three dataflow anomalies listed above. Confluence is a property of the behavior of components—monotonicity (a property of program logic) is a sufficient condition for confluence.
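As a toy illustration of the difference (my own, not from the paper): a component that accumulates a set of values is confluent, since set union is insensitive to delivery order, while a ‘last writer wins’ register is order-sensitive and exhibits exactly the run anomaly described above.

```python
import random

def set_union(messages):
    """Confluent: accumulating into a set is insensitive to input order,
    so every ordering of the same inputs yields the same output."""
    acc = set()
    for m in messages:
        acc.add(m)
    return acc

def last_writer_wins(messages):
    """Order-sensitive: the result is whichever message happened to arrive last,
    so different orderings of the same inputs can give different outputs."""
    state = None
    for m in messages:
        state = m
    return state

msgs = ["a", "b", "c"]
for _ in range(3):
    shuffled = random.sample(msgs, k=len(msgs))
    # The set output is stable across shuffles; the register's output is not.
    print(sorted(set_union(shuffled)), last_writer_wins(shuffled))
```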
For non-confluent components, we can use either sequencing (imposing an order on message delivery) or sealing (withholding a component’s outputs for a partition of its input until a punctuation indicates that the partition is complete, which is what makes punctuated streams so useful).
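Here is a consumer-side sketch of sealing (again illustrative only, not the coordination code Blazes generates): inputs are buffered per partition, and an order-sensitive result for a partition is released only once that partition's seal arrives, so no global ordering of messages is required.

```python
from collections import defaultdict

def sealed_consumer(stream):
    """Buffer messages per partition key and release a partition's output only
    after its seal punctuation arrives; no global message ordering is needed."""
    buffers = defaultdict(list)
    for kind, key, value in stream:        # events look like ('msg'|'seal', key, value)
        if kind == "msg":
            buffers[key].append(value)
        else:                              # 'seal': partition `key` is complete
            yield key, sorted(buffers.pop(key))

events = [
    ("msg", "p1", 3), ("msg", "p2", 9), ("msg", "p1", 1),
    ("seal", "p1", None), ("msg", "p2", 4), ("seal", "p2", None),
]
print(list(sealed_consumer(events)))       # [('p1', [1, 3]), ('p2', [4, 9])]
```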
In a Blazes model, components are annotated as either Confluent (C) or Order-sensitive (O), and as either a read path (R, stateless) or a write path (W, stateful).
The CR annotation indicates that a path through a component is confluent and stateless; that is, it produces deterministic output regardless of its input order, and its inputs do not modify the component’s state. CW denotes a path that is confluent and stateful. The annotations OR_gate and OW_gate denote non-confluent paths that are stateless or stateful, respectively.
The optional gate subscript identifies the attributes that serve as partition keys, defining the partitions over which the component operates. CRDTs, for example, would be modelled as CW.
Streams may be annotated with a Seal(key) annotation which indicates that the stream is punctuated on the subset key of the stream attributes. The Rep annotation indicates that a stream is replicated.
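Putting the annotations together, a model for a small two-component dataflow might look something like the sketch below. The spec format, component names, and stream names are all hypothetical; the post does not show Blazes' concrete input syntax.

```python
# Hypothetical model of a two-component dataflow, for illustration only.
# Component paths carry Blazes-style annotations; streams carry Seal/Rep annotations.
dataflow = {
    "components": {
        # A stateless filter: confluent and read-only on its single path.
        "Filter":     {"clicks -> filtered": "CR"},
        # A per-session aggregator: order-sensitive and stateful,
        # but partitioned on session_id (its 'gate').
        "Aggregator": {"filtered -> reports": "OW[session_id]"},
    },
    "streams": {
        "clicks":   ["Rep"],                # replicated input stream
        "filtered": ["Seal(session_id)"],   # punctuated on session_id
        "reports":  [],
    },
}
print(dataflow["components"]["Aggregator"])
```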
Blazes uses component and stream annotations to determine if a given dataflow is guaranteed to produce deterministic outcomes; if it cannot make this guarantee, it augments the program with coordination code.
When Blazes finds potential for anomalies, it selects the minimal coordination strategy needed to avoid them.
Blazes will automatically repair dataflows that are not confluent or convergent by constraining how messages are delivered to certain components. When possible, BLAZES will recognize the compatibility between sealed streams and component semantics, synthesizing a seal-based strategy that avoids global coordination. Otherwise, it will enforce a total order on message delivery to those components.
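The strategy selection can be caricatured as a simple decision procedure. This is my own sketch of the rules as described above, not the actual Blazes analysis:

```python
def choose_coordination(path_annotation: str, input_sealed_on_gate: bool) -> str:
    """Pick the cheapest strategy that still guarantees deterministic outcomes.

    path_annotation: 'CR', 'CW', 'OR' or 'OW' for the path being analysed.
    input_sealed_on_gate: True if the input stream is punctuated (Seal(key))
    compatibly with the component's partition key (its 'gate').
    """
    if path_annotation in ("CR", "CW"):
        return "no coordination (confluent path)"
    if input_sealed_on_gate:
        return "seal-based coordination (per-partition punctuations)"
    return "ordering-based coordination (total order on message delivery)"

# A confluent path needs nothing; an order-sensitive path over a compatibly
# sealed stream avoids global ordering; otherwise delivery must be ordered.
print(choose_coordination("CW", input_sealed_on_gate=False))
print(choose_coordination("OW", input_sealed_on_gate=True))
print(choose_coordination("OW", input_sealed_on_gate=False))
```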
By eliminating ‘over-cautious’ coordination, Blazes improves program performance (throughput):
The overhead of conservatively deploying a transactional topology is considerable. The uncoordinated dataflow has a peak throughput roughly 1.8 times that of its coordinated counterpart in a 5-node deployment. As we scale up the cluster to 20 nodes, the difference in throughput grows to 3×.
With Blazes taking care of coordination, Alvaro et al. argue that placement becomes one of the biggest remaining issues a distributed systems engineer needs to focus on.
Rules of thumb regarding data placement strategies typically involve predicting patterns of access that exhibit spatial and temporal locality; data items that are accessed together should be near one another, and data items accessed frequently should be cached.
Finally, for a given model, Blazes can determine the coordination required. But how do we know if we have a good model in the first place? Might another formulation be better?
Some design patterns emerge from our discussion. The first is that, when possible, replication should be placed upstream of confluent components. Since they are tolerant of all input orders, inexpensive replication strategies (like gossip) are sufficient to ensure confluent outputs. Similarly, caches should be placed downstream of confluent components. Since such components never retract outputs, simple, append-only caching logic may be used. More challenging and compelling is the possibility of capturing these design principles into a compiler and automatically rewriting dataflows.
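As a final illustration of the ‘caches downstream of confluent components’ pattern (my own sketch, not from the paper): because a confluent upstream never retracts an output, the cache needs only append logic and no invalidation.

```python
class AppendOnlyCache:
    """Cache sitting downstream of a confluent component: since confluent
    components never retract outputs, entries are only ever added and no
    invalidation logic is required."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, value):
        # A confluent upstream never sends a conflicting value for the same key,
        # so a simple first-write-wins insert is enough.
        self._entries.setdefault(key, value)

    def get(self, key):
        return self._entries.get(key)

cache = AppendOnlyCache()
cache.insert("user:1", {"a", "b"})
print(cache.get("user:1"))
```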