Building on Quicksand – Helland & Campbell 2009
Last week we looked at Consistency Analysis in Bloom, and Coordination Avoidance in Database Systems. A common theme in both is that collaboration between the application (an understanding of application-level semantics) and the datastore is key to unlocking the next level of performance. We can see the same direction with CRDTs, which move the focus to a higher level of abstraction than simply reading and writing state. In ‘Building on Quicksand’ (which predates all of the above) Helland & Campbell discuss the inevitability of this direction when pursuing the ALPS properties.
(Log shipping) is well known to most readers. A classic database system has a process that reads the log and ships it to a backup data-center. The normal implementation of this mechanism commits transactions at the primary system (acknowledging the user’s commit request) and asynchronously ships the log. The backup database replays the log, constantly playing catch-up.
We’ve not strayed very far into exotic NoSQL models or anything here, but introducing that little bit of asynchrony has important implications…
Log shipping is asynchronous to the response to the client. This inherently opens up a window in which the work is acknowledged to the client but it has not yet been shipped to the backup. A failure of the primary during this window will lock the work inside the primary for an unknown period of time. The backup will move ahead without knowledge of the locked up work. In most deployments of log-shipping, this is not considered in the application design. It is assumed that this window is rare and that it is unnecessary to plan for it. Bad stuff just happens if you get unlucky.
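To make that window concrete, here is a minimal Python sketch of asynchronous log shipping (my own illustration, not code from the paper): the commit is acknowledged before the log record reaches the backup, so a primary failure in between strands acknowledged work.

```python
import queue
import threading

class Backup:
    def __init__(self):
        self.log = []

    def replay(self, txn):
        self.log.append(txn)          # constantly playing catch-up

class Primary:
    def __init__(self, backup):
        self.log = []                 # the authoritative, durable log
        self.ship_queue = queue.Queue()
        self.backup = backup
        threading.Thread(target=self._shipper, daemon=True).start()

    def commit(self, txn):
        self.log.append(txn)          # commit locally...
        self.ship_queue.put(txn)      # ...schedule shipping, without waiting
        return "ACK"                  # the user sees success *now*

    def _shipper(self):
        while True:
            txn = self.ship_queue.get()
            # A primary failure before this line runs (for an already
            # acknowledged txn) locks the work inside the primary: the
            # backup moves ahead without it.
            self.backup.replay(txn)
```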
When we change from synchronous checkpointing to asynchronous in order to save latency, we lose the notion of an authoritative truth. In such a situation, business rules become probabilistic, and you will need a mechanism to sort out mistakes once they are discovered.
In many applications, it is possible to express the business rules of the application and allow different operations to be reordered in their execution. When the operations occur on separate systems and wind their way to each other, it is essential to preserve these business rules. Example business rules may be: “Don’t overbook the airplane by more than 15%” or “Don’t overdraw the checking account.”
Commutative operations can be allowed to execute independently so long as they don’t violate the rules of the system. This works when we rise up to the level of application operations; if our abstraction operates solely in terms of reads and writes, it doesn’t:
The layering of an arbitrary application atop a storage subsystem inhibits reordering. Only when commutative operations are used can we achieve the desired loose coupling. Application operations can be commutative. WRITE is not commutative. Storage (i.e. READ and WRITE) is an annoying abstraction…
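A toy bank-balance example of the difference (mine, not the paper’s): application-level credits and debits commute, but the blind WRITEs a storage layer sees do not.

```python
# Application-level operations commute...
def credit(balance, amount):
    return balance + amount

def debit(balance, amount):
    return balance - amount

assert credit(debit(100, 30), 50) == debit(credit(100, 50), 30)  # both 120

# ...but the WRITEs they turn into at the storage layer do not:
# last writer wins, whatever order the writes arrive in.
store = {"balance": 100}

store["balance"] = 100 - 30   # replica A writes the balance it computed
store["balance"] = 100 + 50   # replica B, unaware of A, overwrites it

assert store["balance"] == 150   # the debit has silently vanished
```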
At the business level we can make informed decisions about the degree of consistency that is required.
Note that it is possible to have multiple business rules with different guarantees. Some operations can choose classic consistency over availability (i.e. they will slow down, eat the latency, and make darn sure before promising). Other operations can be more cavalier. Some examples: Locally clear a check if the face value is less than $10,000. If it exceeds $10,000, double check with all the replicas to make sure it clears; Schedule the shipment of a “Harry Potter” book based on a local opinion of the inventory. In contrast, the one and only one Gutenberg bible requires strict coordination! The major point is that availability (and its cousins offline and latency-reduction) may be traded off with classic notions of consistency. This tradeoff may frequently be applied across many different aspects at many levels of granularity within a single application.
(and operating solely at the storage level, we can’t).
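The check-clearing example from the quote might look something like the sketch below (the `clear` and `approves` interfaces are hypothetical, invented here for illustration):

```python
LOCAL_CLEAR_LIMIT = 10_000   # the threshold from the quoted example

def clear_check(check, local_replica, all_replicas):
    # Small checks: be cavalier -- clear locally, stay available,
    # keep latency low, accept a small risk of being wrong.
    if check["amount"] < LOCAL_CLEAR_LIMIT:
        return local_replica.clear(check)
    # Large checks: choose classic consistency -- eat the latency
    # and make darn sure with every replica before promising.
    if all(replica.approves(check) for replica in all_replicas):
        return local_replica.clear(check)
    return "DECLINED"
```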
In fact, the authors argue that as soon as we allow partial knowledge, the best computing model is one of memories, guesses, and apologies.
- A local replica hopefully remembers what it has seen!
- When an application takes a decision based on local information, it may be wrong. “In any system that allows a degradation of the absolute truth, any action is, at best, a guess.” It might be a good guess (highly probable to be correct), but it’s still a guess.
- Therefore mistakes will sometimes be made, and you need a mechanism to apologise. This could be the triggering of a manual apology process requiring human intervention, or it could be automated (a minimal sketch of the whole flow follows below).
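Here is the memories/guesses/apologies loop in miniature (an illustrative sketch under invented names, not the paper’s code): the local count is a memory, acting on it is a guess, and reconciliation turns wrong guesses into queued apologies rather than upfront coordination.

```python
import queue

apologies = queue.Queue()   # could equally escalate to a human

def reserve(item, customer, local_inventory, confirmed_orders):
    """Guess: act on this replica's memory of the inventory."""
    if local_inventory.get(item, 0) > 0:
        local_inventory[item] -= 1
        confirmed_orders.append({"item": item, "customer": customer})
        return "CONFIRMED"            # probably right, but still a guess
    return "UNAVAILABLE"

def reconcile(confirmed_orders, true_inventory):
    """Later, with fuller knowledge, discover wrong guesses and apologise."""
    for order in confirmed_orders:
        if true_inventory.get(order["item"], 0) > 0:
            true_inventory[order["item"]] -= 1
        else:
            apologies.put(f"Sorry {order['customer']}, "
                          f"{order['item']} was oversold")
```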
Every business that operates in the real world needs an apology mechanism regardless of consistency models, because bad things sometimes happen in the real world outside of the control of our computers! For example, a delivery vehicle may be involved in an accident and unable to make a scheduled delivery.
Consider a system for reserving resources (e.g. selling items from inventory), in which multiple replicas process reservation requests and may sometimes be out of communication. There are two basic models, and which you prefer is a business decision:
- You can over-provision by pre-allocating to each replica a quota of the overall inventory. In this model you will never sell something you don’t have in stock, but you will very likely be left with excess inventory at some replicas.
- You can over-book by allowing independent allocation without strict coordination. This allows for the possibility that you will sometimes promise something you can’t deliver, and you will need a mechanism to apologise. (Both models are sketched in code below.)
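A side-by-side sketch of the two models (the stock figures and replica names are invented for illustration):

```python
TOTAL_STOCK = 100

# 1. Over-provision: carve the stock into per-replica quotas up front.
#    Never oversold, but one replica may turn customers away while
#    another still holds unsold quota.
quotas = {"replica_a": 50, "replica_b": 50}

def sell_provisioned(replica):
    if quotas[replica] > 0:
        quotas[replica] -= 1
        return "SOLD"
    return "OUT_OF_STOCK"   # even if the other replica has stock left

# 2. Over-book: every replica sells against its own local view of the
#    total. Higher availability, but reconciliation may reveal that
#    more than 100 units were promised -- and then we need an apology.
local_views = {"replica_a": TOTAL_STOCK, "replica_b": TOTAL_STOCK}

def sell_overbooked(replica):
    if local_views[replica] > 0:
        local_views[replica] -= 1
        return "SOLD"       # a guess, pending reconciliation
    return "OUT_OF_STOCK"
```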
Consider a case where the only book in inventory is scheduled for delivery. Due to an over-provisioning scheme, there is no confusion about the inventory and the book is promised to a customer. In preparing the book for shipment, it is run over by the forklift in the warehouse. So, over-provisioning notwithstanding, you need to apologize! Even if the computer systems are perfect, business includes apologizing because stuff will go wrong!
(The seat reservation pattern is a common compromise between over-provisioning and over-booking. A lease on a resource is granted while a client decides whether to proceed with a transaction. If the client has not completed the transaction by the time the lease expires, the resource is returned to the pool and the client must start again).
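A minimal sketch of the seat reservation pattern (names and the five-minute lease are my own illustrative choices):

```python
import time

class SeatPool:
    def __init__(self, seats, lease_seconds=300):
        self.free = set(seats)
        self.leases = {}              # seat -> lease expiry time
        self.lease_seconds = lease_seconds

    def hold(self, seat):
        """Grant a lease while the client decides whether to buy."""
        self._expire()
        if seat in self.free:
            self.free.remove(seat)
            self.leases[seat] = time.monotonic() + self.lease_seconds
            return True
        return False

    def confirm(self, seat):
        """Complete the purchase; fails if the lease already expired."""
        self._expire()
        return self.leases.pop(seat, None) is not None

    def _expire(self):
        now = time.monotonic()
        for seat, deadline in list(self.leases.items()):
            if deadline < now:        # client took too long:
                del self.leases[seat]
                self.free.add(seat)   # seat goes back to the pool
```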
Additional examples are given of the Amazon shopping cart built on Dynamo, and a banking application that clears cheques. These both support commutative operations that can be processed at different replicas in different orders.
Storage systems alone cannot provide the commutativity we need to create robust systems that function with asynchronous checkpointing. We need the business operations to reorder. Amazon’s Dynamo does not do this by itself. The shopping cart application on top of the Dynamo storage system is responsible for the semantics of eventual consistency and commutativity. The authors think it is time for us to move past the examination of eventual consistency in terms of updates and storage systems. The real action comes when examining application based operation semantics.
The classic definition of ACID is Atomic, Consistent, Isolated, and Durable. Its goal is to give the illusion that there’s only one computer that is doing nothing else while a particular transaction is processing. What the examples we’ve been discussing point toward is a new kind of ACID, ACID 2.0: Associative, Commutative, Idempotent, and Distributed.
The goal for ACID2.0 is to succeed if the pieces of the work happen: At least once; Anywhere in the system; In any order. This defines a new KIND of consistency. The individual steps happen at one or more systems. The application is explicitly tolerant of work happening out of order. It is tolerant of the work happening more than once per machine, too.
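A grow-only set gives perhaps the smallest possible demonstration of these properties (my example, not the paper’s): because set union is associative, commutative, and idempotent, fragments of work can arrive at least once, anywhere, in any order, and every replica still converges on the same state.

```python
from functools import reduce
from itertools import permutations

def merge(a, b):
    return a | b   # set union: associative, commutative, idempotent

# Fragments of work, with one delivered twice (at-least-once delivery).
fragments = [{"debit#1"}, {"credit#7"}, {"debit#1"}]

# Apply the fragments in every possible order and collect the outcomes.
results = {frozenset(reduce(merge, p, set())) for p in permutations(fragments)}
assert len(results) == 1   # any order, any repetition: one final state
```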
With the old ACID you need to use either pessimistic or optimistic concurrency control mechanisms, which “tend to be fragile…”. However:
When the application is constrained to the additional requirements of commutativity and associativity, the world gets a LOT easier. No longer must the state be checkpointed across failure units in a synchronous fashion. Instead, it is possible to be very lazy about the sharing of information. This opens up offline, slow links, low quality datacenters, and more. Surprisingly, we find that many common business practices comply with these constraints. Looking at the business operations from the standpoint of how work has traditionally been performed shows many examples supportive of this approach. It appears we in database-land have gotten so attached to our abstractions of READ and WRITE that we forgot to look at what normal people do for inspiration.
Where can we find inspiration for designing ACID 2.0 systems?
The real world is rife with algorithms for idempotence, commutativity, and associativity. They are part of the lubrication of real world business and of the applications we must support on our fault tolerant platforms. A major trick is to look for mechanisms to create equivalence of the operation or resource. […] Whenever the authors struggle with explaining how to implement loosely-coupled solutions, we look to how things were done before computers.