Skip to content

Highly Available Transactions: Virtues and Limitations

November 7, 2014

Highly Available Transactions: Virtues and Limitations – Bailis et. al 2014.

Since yesterday we looked at the Boom Hierarchy, it seemed fitting today to take a selection from the BOOM project (no relation). Thus earning me the Basil Brush award 😉

What a great paper this is, I have so many highlights and annotations on my copy that it will be very hard to do it justice in a short summary. I highly encourage you to download and read the full thing (link above as always).

…despite its narrow scope, the CAP Theorem is often misconstrued as a broad result regarding the ability to provide ACID database properties with high availability; this misunderstanding has led to substantial confusion regarding replica consistency, transactional isolation, and high availability.

In a nutshell, what Bailis et. al. show is that you don’t need to radically alter your application design in order to get high availability. In fact, if you’re using for example Oracle 11g or Postgres with the default isolation level (read commited), you can continue to enjoy the exact same isolation level with a highly available system. It’s the implementation choices of previous generation systems that constrain them.

While weak isolation levels do not provide serializability for general-purpose transactions, they are apparently strong enough to deliver acceptable behaviour to many application programmers and are substantially stronger than the semantics provided by current highly available systems.

The definition of high availability used in the paper is: “guarantees a response from each non-failing server in the presence of arbitrary network partitions between them”. This is important to understand – if you can reach any server, you should be able to get an answer.

…although many implementations of highly available transaction (HAT) semantics are not highly available, this is an artifact of the implementations rather than an inherent property of the semantics.

HAT implementations can provide the same guarantees as you’re used to with a traditional RDBMS on its default setting, but with scale-out support and 1-3 orders of magnitude lower latency on current infrastrucure!

A further refinement to the basic HA definition is the notion of ‘sticky availability’ (or affinity, as we used to call it in in CICS and other systems of that generation). Stickiness means that a client goes back to the same server so that it can always read its writes. (One way of achieving this is to act as a ‘server’ itself via a client-side cache). Note a subtlety here introduced by partitioning:

in a partially-replicated system, where servers are replicas for subsets of the set of data items, a client must maintain stickiness with a single logical copy of the database, which may consist of multiple physical servers.

After a very readable discussion of isolation models, we end up with the following summary:

  • Achievable with high availability:
    read uncommitted; read committed; monotonic atomic view; item cut isolation; predicate cut isolation; writes-follow-reads; monotonic reads; monotonic writes
  • Achievable with sticky availability: read-your-writes, causal consistency, pipeline random access memory
  • Only achievable in unavailable models:
    cursor stability; snapshot isolation; repeatable read; one-copy serializability; recency guarantees

the main cost of high availability and low latency comes in the inability to prevent Lost Update, WriteSkew, and provide recency bounds.

Bailis et al. then go on to implement a proof-of-concept HAT system with monotonic atomic view (MAV) isolation.

HAT systems can provide useful semantics without substantial performance penalties. In particular, our MAV algorithm can achieve throughput competitive with eventual consistency at the expense of increased disk and network utilization. Perhaps more importantly, all HAT algorithms circumvent high WAN latencies inevitable with non-HAT implemenations.

This is a very important result. As always though, it’s good to understand the fineprint. In particular, ACID has a ‘D’ (for durable) in it. And there’s an important short paragraph on durability in the middle of the paper:

A client requiring that its transactions’ effects survive F server faults requires that the client be able to contact at least F+1 non-failing replicas before committing. This affects availability and, according to the Gilbert and Lynch definition we have adopted, F > 1 fault tolerance is not achieveable with high availability.

In other words, in the N,R,W model (N nodes, read from R, write to W), this paper is concerned with systems where R=W=1. This would rule out for example systems that don’t (synchronously) keep a durable record of a transaction if you want to avoid data loss even on the loss of just a single node. So you might want a system where W>1 (for example), which could still give very good availability, but not ‘HA’ as defined in the paper. It’s trade-offs all the way down…

In the conclusions, Bailis et al. touch on a topic that is dear to my heart: how do we program such systems? (as opposed to how do we build them).

… we believe there is considerable work to be done to improve the programmability of highly available systems. Isolation and data consistency are means by which application-level consistency is achieved but are typically not end goals for end-user applications. Our results hint that a mix of HAT and non-HAT semantics (the latter used sparingly) is required for practical applications, but the decision to employ each and the system architectures for a hybrid approach remain open problems.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: