Out of the Fire Swamp* – Part I, ‘The Data Crisis’

(*) with apologies to Moseley, Marks, and Westley.

Something a little different to the regular paper reviews for the next three days. Inspired by yesterday’s ‘Consistency without Borders,’ and somewhat dismayed by what we learned in ‘Feral Concurrency Control’, I’m going to attempt to pull together a bigger picture, to the extent that I can see it, that emerges out of many of the papers we’ve been studying. There is more to say here than I can shoehorn into the review of any one single paper!

The Data Crisis

We have a “data crisis”. Most of the applications that we write that work with persistent data are broken, and the situation seems to be getting worse. How did we get into this state and what can we do about it?

Let’s consider some of the evidence.

Even applications that work with a traditional RDBMS are prone to integrity violations

  • The gold standard of serializable ACID transactions in a trusted RDBMS is either unavailable, or not used in practice even when it is available. (See table 2 in Highly Available Transactions: Virtues and Limitations for a good summary of default and maximum isolation levels in a variety of common databases). Oracle’s strongest isolation level, for example, is Snapshot Isolation, which still permits anomalies. The default isolation level for databases is very rarely serializable: by definition, none of the weaker-than-serializable isolation models guarantee serializability, but the performance benefits are considered to outweigh the potential costs of anomalies.
  • A relational database that either doesn’t offer serializability, or is configured to use a lesser isolation level, will be vulnerable to anomalies – and we can even build models to predict how many. It was a great piece of marketing to call them ‘anomalies.’ When you say ‘will be vulnerable to violation of application integrity constraints’ it doesn’t sound nearly so cute and harmless.
  • As we saw in Feral Concurrency Control, programmers don’t even take advantage of the guarantees that relational databases do give. Feral application-level concurrency control, implemented through validation and association annotations with scant use of transactions, is becoming increasingly common. Fundamental integrity constraints such as uniqueness are broken under such a model (even if you did have a database offering serializability on the backend).
  • I’ve yet to meet a programmer who has performed a careful analysis of all the anomalies that could occur as a result of the selected isolation level and chosen concurrency control mechanisms, and validated those anomalies against the business transactions and integrity constraints to ensure that either they were acceptable or a recovery strategy was in place if they were to occur. This is very difficult to do in general, so the more common approach is just to pretend that anomalies don’t exist.
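The uniqueness violation described above can be shown deterministically with a few lines of code. This is an illustrative sketch, not taken from any real framework: `users` stands in for a table with no database-level UNIQUE constraint, and `check_unique` plays the role of a Rails-style `validates_uniqueness_of` annotation (a plain SELECT before the INSERT).

```python
# Feral uniqueness check (validate-then-insert), shown with the
# interleaving that concurrency makes possible: two in-flight requests
# both validate before either inserts, so both validations pass.

users = []  # stand-in for a table with no UNIQUE constraint

def check_unique(email):
    # application-level validation: a read, separate from the write
    return email not in users

def insert(email):
    users.append(email)

email = "alice@example.com"
ok_a = check_unique(email)   # request A validates: True
ok_b = check_unique(email)   # request B validates: True
if ok_a:
    insert(email)            # request A inserts
if ok_b:
    insert(email)            # request B inserts a duplicate

print(users.count(email))    # → 2: the "unique" invariant is violated
```

A database-enforced constraint closes this window because the check and the write happen atomically inside the store; the feral version splits them across two round trips.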

It gets far worse when we venture into the land of eventual consistency

This of course is all just the relational database world, our ‘safe place’ where we like to think everything is consistent (it isn’t). Then along came the cloud and the CAP theorem. We need availability, low-latency, partition tolerance, and scalability. Because we couldn’t avoid partitions we rushed headlong into a world of eventual consistency.

Your database is fried from all the guarantees that lied. I know you tried, but do you have any ethics or pride? / You think you built a database but all you made was a dark place where writes never use disk space.
Kelly Sommers, tweet storm excerpts, April 2015.

Two constraints we need to learn to live with

Can’t we have safety and performance? (and availability, and partition-tolerance?). What’s the way out of this mess? How do I build a modern application on a stack I can put into production without exposing myself to data loss and corruption?

There are two very interesting and relevant results from some of the papers that we’ve looked at in The ‘Morning Paper’.

  1. We’re always going to have anomalies. In any always available convergent system we can’t get any stronger consistency than causal+ consistency. (See Consistency, Availability, and Convergence).
  2. We can’t solve the data crisis solely by building a better datastore – we’re going to need some cooperation at a higher level. (Consistency Without Borders, Feral Concurrency Control).

Let’s unpack those statements a little bit:

Anomalies are Inevitable

We didn’t get better at dealing with failure by pretending it doesn’t happen. Quite the opposite in fact – we realised that the more components in a system, the more likely failure is, to the point where we need to explicitly design for it in the understanding that failure will happen. And to force our hands we introduced Chaos Monkeys and a Simian Army.

We’re not going to get better at dealing with application integrity violations by pretending they don’t happen. Quite the opposite in fact – the more concurrent operations we have (and the weaker the isolation level), the more likely application integrity violations are, to the point where we need to explicitly design for them in the understanding that application integrity violations will happen. And to force our hands, perhaps the Simian Army needs a new monkey – Consistency Monkey! Liveness failures are easy to spot (a process dies and becomes unresponsive, bad things happen). Safety failures lurk in the shadows; we need to work to make them equally visible.

What does it mean to design for application integrity violations? First off, let’s assume that we’ve done everything practical to avoid as many of them as possible, and secondly let’s assume that we know which anomalies we remain vulnerable to (I’ll come back to these topics later). Pat Helland then gives us a model for thinking about this situation based on Memories, Guesses, and Apologies. You do the best you can (guesses) given your knowledge of the current state of the world (memories), and if it turns out you got something wrong, you apologize.

Note that you can’t apologize if you don’t know you’ve done something wrong! So we need a process that can detect integrity violations when they occur, and trigger the apology mechanism. If we’re smart about it we can automate the detection of violations and invocation of a user-defined apology callback – but what it means to apologize will always be specific to the application and needs to be coded by the developer.

Instead of trading off performance and data corruption, we need to be trading off performance and the rate of apologies we need to issue.
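The memories/guesses/apologies loop can be sketched in a few lines. Everything here is illustrative (the `Store` class, the `on_violation` callback, the overselling scenario are all my own names, not Helland's): the system accepts operations based on its local view, checks its invariants after the fact, and hands violations to an application-defined apology callback.

```python
# Minimal sketch of memories / guesses / apologies: act on local
# knowledge, detect invariant violations afterwards, and invoke an
# application-defined compensation.

class Store:
    def __init__(self, on_violation):
        self.stock = {"widget": 1}        # memories: local view of state
        self.sold = []
        self.on_violation = on_violation  # apology: app-defined callback

    def sell(self, item, customer):
        # guess: accept the sale based on the (possibly stale) local view
        self.sold.append((item, customer))
        self.stock[item] -= 1

    def check_invariants(self):
        # detection: stock must never go negative; run after convergence
        for item, count in self.stock.items():
            if count < 0:
                self.on_violation(item, self.sold[-1][1])

apologies = []
store = Store(on_violation=lambda item, customer:
              apologies.append(f"refund {customer} for {item}"))

store.sell("widget", "alice")   # two requests both saw one widget
store.sell("widget", "bob")     # in stock, so both were accepted
store.check_invariants()
print(apologies)                # → ['refund bob for widget']
```

The detection and callback plumbing is generic and could be provided by the platform; only the body of the apology (refund, email, manual escalation) is application code.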

We can’t just build a better datastore

Reasoning about consistency at the I/O level as sequences of reads and writes is too low level and too far divorced from the application programmer’s concerns. Application programmers find the mapping too obscure to be of practical use in figuring out, for example, what anomalies might occur and what they might mean in application terms. And as we’ve seen, so long as we stick with that model, programmers will most likely continue to eschew what the database provides and implement feral concurrency controls. At the same time, loss of semantic information about what the application is really trying to achieve leads to inefficiencies due to over-conservative coordination. Building a better datastore without also building a better way for applications to interact with the datastore isn’t going to solve the data crisis.

It all starts with the application

If you reflect on some of the papers we’ve been looking at, an interesting pattern starts to emerge that is very application-centric:

  • We get maximum performance out of distributed systems by minimizing the amount of coordination required. And the way to do that is through an application-level understanding of the application’s invariants. This is the lesson of I-Confluence.
  • We can’t get stronger consistency than causal+ consistency. Out of the box this can give us the ALP of ALPS (availability, low-latency, partition-tolerance) but it struggles with Scalability under the default potential causality model. To get scalability back, we need to move to an explicit causality model. In other words, we need the application to explicitly tell us the causal dependencies of a write.
  • To compensate for anomalies that do occur, we need an apology mechanism. Apologies can only be defined at the application level.
  • Programmers have voted with their feet and told us that they want to control application integrity in tandem with the domain model.
  • Reducing the need for coordination through the adoption of distributed data types (e.g. CRDTs and Cloud Types) requires the cooperation of the application programmer.
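As a concrete instance of the last point, here is a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). Each replica increments only its own slot, and merging is an element-wise max, which is commutative, associative, and idempotent, so replicas converge without coordination. The cooperation required of the programmer is choosing a counter with these semantics (no decrement, no arbitrary assignment) instead of a plain integer.

```python
# G-Counter: a coordination-free replicated counter. State is a map
# from replica id to that replica's local increment count.

def increment(state, replica_id):
    state = dict(state)
    state[replica_id] = state.get(replica_id, 0) + 1
    return state

def merge(a, b):
    # element-wise max: commutative, associative, idempotent,
    # so merges can be applied in any order, any number of times
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas update concurrently, then exchange state.
r1 = increment({}, "r1")          # {'r1': 1}
r2 = increment({}, "r2")          # {'r2': 1}
r2 = increment(r2, "r2")          # {'r2': 2}

assert merge(r1, r2) == merge(r2, r1)   # order-independent
print(value(merge(r1, r2)))             # → 3
```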

In Part II we’ll explore further what it could mean to address the data crisis from the application down.

13 thoughts on “Out of the Fire Swamp* – Part I, ‘The Data Crisis’”

    1. Not to make them do the consistency, etc. that a data store does (should do?), any more than we make an app developer write the software that keeps their processes running (we rely on systemd or supervisord or whatever for that). Rather, it is that they define the conditions of success and failure and the repair (“apology”), and then we should have systems that manage that for us.

      1. Actually, I was referring to making programmers “define the conditions of success.”

        This solves a lot of problems, but is it practical?

        For example, I ask my graduate students each semester which software testing tools they use at school or at work. 99 out of 100 don’t use any testing tools (except for maybe JUnit, which automates the easy part.)

        We could instead ask programmers to define correctness assertions. But programmers have demonstrated that they don’t want to identify and verify assertions (using whatever means) any more than they want to use testing tools.

        I wonder: if programmers don’t want to make the effort to learn about the available levels of consistency, or even read the manual to determine the default (automated!) consistency level, why would they make the (large) effort to determine the necessary invariants? Professors are able and willing to do it, but is everyone else?

      2. I think that finding a good model that makes this easy and natural remains an important challenge here. But there is some hope – the validation and association annotations and DSLs that are highly popular are forms of expressing application invariants – they’re just not always very well founded at the moment…

      3. I would counter that “the best programmers always do testing *first*”, but the truth is that even the best sometimes do not, and we have to think about the large masses, not the top 5-10% (if that).

        The big challenge of testing is that it is “extra” work. It isn’t really, when it saves you oodles of time down the road, but it distracts from what they really want to do: program (i.e. build stuff).

        But defining an application domain model is part and parcel of the building part: think Java annotations or Rails models. You already do it. If we can leverage that, perhaps extend/complete it without adding to the burden (make it in the critical path it already is in without extending the length of that path), and then provide tools that leverage that, it could be usable (even by the masses).

  1. Adrian, this was an excellent piece. All of the database consistency models have involved not just complex dealings with the data store (as you outlined), but a “trust me” structure. I think this is part of why devs have gone elsewhere. They don’t want the headache, nor do they fully trust it.

    But if you actually had a model at the application that describes what consistency is, you could begin to rely on the data store more with less complexity, because you can validate its state programmatically and automatically.

    If I can do a “write this transaction”, and know that some other component is checking “is everything in a consistent state as defined by the architect/developer,” then I feel far more comfortable with it. I would still need to design and build “apology” mechanisms, of course.

    I think this is why “build for failure” is so much easier on the infrastructure level. The definition of failure there is easy: process (or host) failed. And recovery is easy: start it again. Whether it is a process managed by systemd or fleet, or a container managed by kubernetes or rancher, the simplicity of definition of “good state” and therefore the ease of “apologizing” makes it, well, easy.

    Can we define (relatively easily) both a “valid state” check and a “good apology” model that developers/architects can take advantage of?
