Life Beyond Distributed Transactions

Life Beyond Distributed Transactions: An Apostate’s Opinion – Pat Helland, 2007

It takes real skill to strip something back to its essence and explain it clearly in such a way that the ramifications become apparent. In my view Pat Helland pulls this off admirably in this paper and helps the reader think more deeply about the fundamental issues.

I wish I’d been there to hear Pat’s talk at reactconf earlier this week. For now, those of us who couldn’t be there will have to wait for the slides to become available.

This paper explores and names some of the practical approaches used in the implementations of large-scale mission-critical applications in a world which rejects distributed transactions.

Why the rejection? (a) people aren’t building large-scale systems with distributed transactions in practice, and (b) if they do try to, ‘the projects founder because the performance costs and fragility make them impractical.’

Everything in the paper flows from one simple thought experiment:

…assume the number of customers, purchasable entities, orders, shipments, health-care patients, taxpayers, bank accounts, and all other business concepts manipulated by the application grow significantly larger over time (to an almost-infinite number). Typically, the individual things do not get significantly larger; we simply get more and more of them.

Now what do you do? Clearly you can’t fit this on a single machine, so you are forced to spread what formerly ran on a single or small number of machines across a larger number of machines.

…scaling implies using a new abstraction called an ‘entity’ as you write your program. An entity lives on a single machine at a time, and the application can only manipulate one entity at a time. A consequence of almost-infinite scaling is that this abstraction must be exposed to the developer of business logic.

As you read through the paper it becomes clear that what Helland is referring to as an ‘entity’ is not simply a ‘persistent object,’ as an application developer might naturally read the term, but Eric Evans’ concept of an ‘aggregate entity.’

Each entity has a unique identifier or key. Entities represent disjoint sets of data. Each datum resides in exactly one entity. The data of an entity never overlaps the data of another entity.

Entities are the boundary of atomicity

An entity lives within a single scope of serializability, and because of this we are always guaranteed to be able to do atomic transactions within the confines of a single entity. Furthermore:

A scale-agnostic programming abstraction must have the notion of entity as the boundary of atomicity.

You can’t have a transaction that spans more than one entity, for the simple reason that you have no way of guaranteeing multiple entities will reside on the same system. (Remember, we’re talking about aggregate entities here).
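To make the shape of an aggregate entity concrete, here is a minimal sketch (the Order and OrderLine names, and the repository, are my own illustration rather than anything from the paper). Everything reachable from the entity key lives together and commits together; anything owned by a different entity is referenced by key only, never held as an object reference.

[code lang=text]
// Hypothetical aggregate entity: all the data it owns is reachable from the
// entity key and commits as a unit; other entities are referenced by key only.
import java.util.ArrayList;
import java.util.List;

final class OrderLine {
    final String productId;   // a reference to another entity, held by key only
    final int quantity;
    OrderLine(String productId, int quantity) {
        this.productId = productId;
        this.quantity = quantity;
    }
}

final class Order {
    final String orderId;      // the entity key
    private final List<OrderLine> lines = new ArrayList<>();

    Order(String orderId) { this.orderId = orderId; }

    // Invariants are enforced inside the one entity, because this is the
    // only scope in which atomicity is guaranteed.
    void addLine(OrderLine line) {
        if (lines.size() >= 100) {
            throw new IllegalStateException("too many lines for one order");
        }
        lines.add(line);
    }
}

interface OrderRepository {
    Order load(String orderId);
    void save(Order order);    // atomic: one entity, one scope of serializability
}
[/code]

Note that save(order) is the widest atomic operation on offer; there is no save(orderA, orderB) in this world.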

There are no constraints on the representation of the entity. It may be stored as SQL records, XML documents (this was 2007, remember!), files, blobs, or anything else that is convenient and appropriate for the app’s needs. One possible representation is as a collection of SQL records (potentially across many tables) whose primary key begins with the entity key.

Let’s pause for a minute and think about the implications. If you are using a SQL store to persist state, and an aggregate entity is mapped to multiple tables, then to provide a single atomic scope of serializability for that aggregate entity we need to ensure that all of the related rows across those tables are colocated. That in turn means we need to use the same partitioning strategy for all of those tables.

At Pivotal we worked on a distributed store offering a SQL interface, called GemFire XD. It was for exactly this reason that GemFire XD allowed you to specify partitioning and colocation rules for tables. For example:

[code lang=text]
CREATE TABLE COUNTRIES
(
  COUNTRY VARCHAR(26) NOT NULL CONSTRAINT COUNTRIES_UNQ_NM UNIQUE,
  COUNTRY_ISO_CODE CHAR(2) NOT NULL CONSTRAINT COUNTRIES_PK PRIMARY KEY,
  REGION VARCHAR(26),
  CONSTRAINT COUNTRIES_UC
    CHECK (COUNTRY_ISO_CODE = UPPER(COUNTRY_ISO_CODE))
) PARTITION BY PRIMARY KEY;

CREATE TABLE CITIES
(
  CITY_ID INTEGER NOT NULL CONSTRAINT CITIES_PK PRIMARY KEY,
  CITY_NAME VARCHAR(24) NOT NULL,
  COUNTRY VARCHAR(26) NOT NULL,
  AIRPORT VARCHAR(3),
  LANGUAGE VARCHAR(16),
  C_COUNTRY_ISO_CODE CHAR(2) CONSTRAINT COUNTRIES_FK
    REFERENCES COUNTRIES (COUNTRY_ISO_CODE)
) PARTITION BY COLUMN (C_COUNTRY_ISO_CODE)
  COLOCATE WITH (COUNTRIES);
[/code]

This result also helps to explain some of the popularity of document stores, where documents with embedded data naturally model the aggregate entity as the unit of serializability and distribution. For example, the MongoDB documentation states:

In MongoDB, write operations are atomic at the document level, and no single write operation can atomically affect more than one document or more than one collection. A denormalized data model with embedded data combines all related data for a represented entity in a single document.
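For example, if an order aggregate is modelled as one document with its line items embedded, a single write creates or replaces the whole entity atomically. Here is a rough sketch using the MongoDB Java driver (the database, collection, and field names are made up for illustration):

[code lang=text]
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

public class OrderDocumentExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // The whole aggregate lives in one document, keyed by the entity key,
            // so this single write is atomic for the entire entity.
            Document order = new Document("_id", "order-42")
                .append("customerId", "cust-7")
                .append("lines", Arrays.asList(
                    new Document("productId", "p-1").append("quantity", 2),
                    new Document("productId", "p-9").append("quantity", 1)));

            orders.insertOne(order);
        }
    }
}
[/code]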

Alternative representations such as an entity-based Event Store also fit nicely into this paradigm.

We also have to accept that alternate indices can’t reside within the same scope of serializability as the primary index of a given entity (see the paper for details).

The scale-agnostic application program can’t atomically update an entity and its alternate index! The upper-layer scale-agnostic application must be designed to understand that alternate indices may be out of sync with the entity accessed with its primary index (i.e. entity key).

I like CQRS here for making this distinction very explicit (and providing denormalized views for efficiency where useful too).
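On the read side that might look something like the following sketch (a hypothetical name index of my own invention, maintained outside the entity’s scope of serializability): query the index first, then re-read each entity by its key, accepting that the two may disagree.

[code lang=text]
import java.util.ArrayList;
import java.util.List;

final class Customer {
    final String customerId;   // the entity key
    final String name;
    Customer(String customerId, String name) {
        this.customerId = customerId;
        this.name = name;
    }
}

// The alternate index lives outside the entity's scope of serializability,
// so the keys it returns may be stale.
interface CustomerNameIndex {
    List<String> customerIdsForName(String name);
}

// Authoritative access, by entity key.
interface CustomerRepository {
    Customer load(String customerId);
}

final class CustomerQueries {
    private final CustomerNameIndex index;
    private final CustomerRepository repository;

    CustomerQueries(CustomerNameIndex index, CustomerRepository repository) {
        this.index = index;
        this.repository = repository;
    }

    // Query via the alternate index, then re-check against the entity itself:
    // the index may point at entities that have since been renamed or removed.
    List<Customer> findByName(String name) {
        List<Customer> result = new ArrayList<>();
        for (String id : index.customerIdsForName(name)) {
            Customer c = repository.load(id);
            if (c != null && name.equals(c.name)) {
                result.add(c);
            }
        }
        return result;
    }
}
[/code]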

Entities are the unit of addressing

…the size of the set of entities within a single application is growing larger than can fit in a single data-store. Each individual entity fits in a store but the set of them all does not. Increasingly, the stateless application is routing to fetch the entity based upon some partitioning scheme. Second, the fetching and partitioning scheme is being separated into the lower-layers of the application and deliberately isolated from the upper-layers of the application responsible for business logic. This is effectively driving towards the message destination being the entity key.

Messages are addressed to entities. So we are looking at a distributed system with (aggregate) entities encapsulating state, each individually addressable via messaging. The word is never used in the paper, but to me this screams ‘distributed actors,’ with each aggregate entity represented by an actor.
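In code, the addressing contract might look something like this (again a sketch of my own, not an API from the paper): a message carries the destination entity key and a unique message id, and an entity handles one message at a time.

[code lang=text]
// A message is addressed to an entity key, not to a machine; the sender
// needs no knowledge of where the entity currently lives.
final class Envelope {
    final String toEntityKey;   // destination aggregate, e.g. "order-42"
    final String messageId;     // unique id, used later for idempotence
    final Object payload;
    Envelope(String toEntityKey, String messageId, Object payload) {
        this.toEntityKey = toEntityKey;
        this.messageId = messageId;
        this.payload = payload;
    }
}

// Actor-like contract: an entity processes one message at a time, so all of
// its state changes for that message fall within one atomic step.
interface EntityHandler {
    void handle(Envelope message);
}
[/code]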

Messaging is asynchronous with respect to sending transactions.

It would be horribly complex for an application developer to send a message while working on a transaction, have the message sent, and then the transaction abort. This would mean that you have no memory of causing something to happen and yet it does happen! For this reason, transactional enqueuing of messages is de rigueur.
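One common way to get transactional enqueuing is the ‘outbox’ pattern: the outgoing message is inserted into a local table in the same transaction as the entity update, and a separate delivery process sends it only after the commit. A JDBC-flavoured sketch, with table and column names invented for illustration:

[code lang=text]
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class OrderCommands {

    // Update the entity and enqueue the outgoing message in one local
    // transaction: if the transaction aborts, the message never existed.
    void confirmOrder(Connection conn, String orderId, String messageId) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement updateOrder = conn.prepareStatement(
                 "UPDATE orders SET status = 'CONFIRMED' WHERE order_id = ?");
             PreparedStatement enqueue = conn.prepareStatement(
                 "INSERT INTO outbox (message_id, to_entity_key, payload) VALUES (?, ?, ?)")) {

            updateOrder.setString(1, orderId);
            updateOrder.executeUpdate();

            enqueue.setString(1, messageId);
            enqueue.setString(2, "shipment-for-" + orderId);
            enqueue.setString(3, "{\"type\":\"ShipOrder\",\"orderId\":\"" + orderId + "\"}");
            enqueue.executeUpdate();

            conn.commit();   // entity change and enqueued message succeed or fail together
        } catch (SQLException e) {
            conn.rollback(); // neither the update nor the message survives
            throw e;
        }
    }
}
[/code]

A background sender then reads the outbox and delivers the messages, retrying as needed; that retrying is one of the reasons the receiving side has to be idempotent.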

As the system scales, and entities move, messages have to follow. Moreover:

As entities move, the clarity of a FIFO queue between the sender and the destination is occasionally disrupted. Messages are repeated. Later messages arrive before earlier ones. Life gets messier. For these reasons, we see scale-agnostic applications are evolving to support idempotent processing of all application-visible messaging. This implies reordering in message delivery, too.
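Here is a minimal sketch of what idempotent, order-tolerant processing can look like inside an entity (my own illustration): the entity records which message ids it has already seen from each partner, and because that memory is part of the entity’s own state it commits atomically with the message’s effect.

[code lang=text]
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Per-partner memory of which messages have already been processed.
final class PartnerActivity {
    private final Map<String, Set<String>> processedByPartner = new HashMap<>();

    // Returns true only the first time a (partner, messageId) pair is seen.
    boolean markProcessed(String partnerKey, String messageId) {
        return processedByPartner
            .computeIfAbsent(partnerKey, k -> new HashSet<>())
            .add(messageId);
    }
}

final class ShipmentEntity {
    private final PartnerActivity activity = new PartnerActivity();
    private int itemsReceived = 0;

    // Safe under duplicate and reordered delivery: the effect is applied once,
    // whenever the message happens to arrive.
    void onItemReceived(String partnerKey, String messageId) {
        if (!activity.markProcessed(partnerKey, messageId)) {
            return;   // duplicate delivery, already applied
        }
        itemsReceived++;
    }
}
[/code]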

Helland goes on to describe the activity tracking needed to manage the state of conversations on a partner-by-partner basis. These activity workflows (small ‘w’) need to reach agreement in the absence of atomicity.

Workflow to reach decisions, as has been discussed for many years, functions within activities within entities. It is the fine-grained nature of workflow that is surprising as one looks at almost-infinite scaling.

Two-layer architectures

Helland proposes there should be (at least) two layers in each scalable application.

The lower layer of the application understands the fact that more computers get added to make the system scale. In addition to other work, it manages the mapping of the upper layer’s code to the physical machines and their locations. The lower layer is scale-aware in that it understands this mapping. We are presuming that the lower layer provides a scale-agnostic programming abstraction to the upper layer.

The upper layer, meanwhile, gets to stay blissfully unaware of all this:

By sticking to the scale-agnostic programming abstraction, the upper layer of application code is written without worrying about scaling issues.

Given everything we’ve covered so far about entities and messaging, you can see how this could be achieved. The lower-layer discussion puts me in mind of Mesos and the frameworks that plug into it.
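As a toy illustration of the split (my own sketch, not code from the paper): the scale-aware lower layer owns the mapping from entity key to machine, while the scale-agnostic upper layer only ever speaks in terms of entity keys.

[code lang=text]
import java.util.List;

// Scale-aware lower layer: owns the entity-key-to-machine mapping. Real
// systems use consistent hashing or a directory so that entities can move
// as machines come and go; plain hash-mod is shown only for brevity.
final class ScaleAwareRouter {
    private volatile List<String> nodes;   // e.g. ["host-a:7000", "host-b:7000"]

    ScaleAwareRouter(List<String> nodes) { this.nodes = nodes; }

    String nodeFor(String entityKey) {
        int bucket = Math.floorMod(entityKey.hashCode(), nodes.size());
        return nodes.get(bucket);
    }

    void rebalance(List<String> newNodes) {
        nodes = newNodes;   // entities (and their pending messages) must follow
    }
}

// Scale-agnostic upper layer: never sees machines, only entity keys.
interface ScaleAgnosticBus {
    void send(String toEntityKey, Object payload);
}
[/code]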

In conclusion

Best left to Pat Helland in his own words:

Today, we see new design pressures foisted onto programmers that simply want to solve business problems. Their realities are taking them into a world of almost-infinite scaling and forcing them into design problems largely unrelated to the real business at hand.

In a few years we are likely to see the development of new middleware or platforms which provide automated management of these applications and eliminate the scaling challenges for applications developed within a stylized programming paradigm.

And all written in 2007, remember.