Scalable Atomic Visibility with RAMP Transactions
Scalable Atomic Visibility with RAMP Transactions – Bailis et al. 2014
RAMP transactions came up last week as part of the secret sauce in Coordination avoidance in database systems that contributed to a 25x improvement on the TPC-C benchmark. So what exactly are RAMP transactions and why might we need them?
As soon as you partition your database across multiple servers, things start to get interesting. We’d like to maintain atomic isolation – either all of a transaction’s effects are visible or none are – for transactions that span partitions…
The status quo for these multi-partition atomic transactions provides an uncomfortable choice between algorithms that are fast but deliver inconsistent results and algorithms that deliver consistent results but are often slow and unavailable under failure.
A lot of implemented systems have chosen to go with the fast-and-furious option resulting in incorrect behaviour for cases where atomic visibility matters. The RAMP (Read Atomic Multiple Partition) transaction models introduced in this paper show that you can have performance and scalability of transactions spanning multiple partitions with atomic visibility.
…data stores like Bigtable, Dynamo, and many popular “NoSQL” and even some “NewSQL” stores do not provide transactional guarantees for multi-item operations. The designers of these Internet-scale, real-world systems have made a conscious decision to provide scalability at the expense of multi-partition transactional semantics. Our goal with RAMP transactions is to preserve this scalability but deliver correct, atomically visible behavior for the use cases we have described.
Under evaluation, the RAMP algorithms did not degrade substantially under contention, and scaled linearly to over 7.1 million operations per second on 100 servers.
Bad things that can happen when you don’t have atomic multi-partition isolation
Without atomic isolation foreign key constraints, secondary indexing, and materialized view maintenance can all break!
Data models often represent bi-directional relationships as two distinct uni-directional relationships. “For example, in TAO, a user performing a ‘like’ action on a Facebook page produces updates to both the LIKES and LIKED_BY associations.”
These applications require foreign key maintenance and often, due to their unidirectional relationships, multi-entity update and access.
Without atomic isolation broken bi-directional relationships, and dangling or incorrect references can surface.
With data partitioned across servers by primary key, access by secondary attributes becomes more challenging.
There are two dominant strategies for distributed secondary indexing. First, the local secondary index approach co-locates secondary indexes and primary data, so each server contains a secondary index that only references (and indexes) data stored on its server. This allows easy, single-server updates but requires contacting every partition for secondary attribute lookups (write-one, read-all), compromising scalability for read-heavy workloads. Alternatively, the global secondary index approach locates secondary indexes (which may be partitioned, but by a secondary attribute) separately from primary data. This alternative allows fast secondary lookups (read-one) but requires multi-partition update (at least write-two)
Real-world services tend to use either local secondary indexing (non-scalable but correct), or non-atomic (scalable but incorrect) global indexes. In the latter cases queries involving the secondary attributes can return records that shouldn’t match, and omit ones that should.
Without atomic isolation, materialized views can diverge from the base data. For example, a count may become inaccurate.
With RAMP transactions, base data and views can be updated atomically. The physical maintenance of a view depends on its specification, but RAMP transactions provide appropriate concurrency control primitives for ensuring that changes are delivered to the materialized view partition. For select-project views, a simple solution is to treat the view as a separate table and perform maintenance as needed: new rows can be inserted/deleted according to the specification, and, if necessary, the view can be (re-)computed on demand (i.e., lazy view maintenance). For more complex views, such as counters, users can execute RAMP transactions over specialized data structures such as the CRDT G-Counter.
Consider databases that are partitioned over multiple servers. Each item has a single logical copy stored on one of those partitions, which one can be calculated using the item itself (e.g. primary key). In order to achieve scalability the author’s identify two key properties that must be preserved: synchronization independence, and partition independence.
Synchronization independence ensures that one client’s transactions cannot cause another client’s to block and that, if a client can contact the partition responsible for each item in its transaction, the transaction will eventually commit (or abort of its own volition).
(Also known as transactional availability).
Partition independence ensures that, in order to execute a transaction, a client never has to contact partitions that its transaction does not access. Thus, a partition failure only affects transactions that access items contained on the partition. This also reduces load on servers not directly involved in a transaction’s execution. In the distributed systems literature, partition independence for replicated data is called replica availability or genuine partial replication.
A third constraint is that the metadata required to achieve synchronization and partition independence is not too large: “there are many potential solutions for providing atomic visibility that rely on storing prohibitive amounts of state.”
The RAMP transaction algorithms
You may be wondering why I keep referring to algorithms (plural). This is because the authors actually define three RAMP variants: RAMP-Fast, RAMP-Small, and RAMP-Hybrid. These trade-off between performance and the amount of metadata that needs to be kept.
At a high level, RAMP transactions allow reads and writes to proceed concurrently. This provides excellent performance but, in turn, introduces a race condition: one transaction might only read a subset of another transaction’s writes, violating RA (i.e., fractured reads might occur). Instead of preventing this race (hampering scalability), RAMP readers autonomously detect the race (using metadata attached to each data item) and fetch any missing, in-flight writes from their respective partitions. To make sure that readers never have to block for writes to arrive at a partition, writers use a two-phase (atomic commitment) protocol that ensures that once a write is visible to readers on one partition, any other writes in the transaction are present on and, if appropriately identified by version, readable from their respective partitions.
RAMP-Fast stores metadata in the form of write sets (thus the overhead is linear in transaction size), and has one RTT for reads in the best case (two in the worst case). RAMP-Small uses constant size metadata (it only stores the transaction timestamp) but always requires two RTT for reads. RAMP-Hybrid takes the same write set information as RAMP-Fast, but encodes it in a Bloom filter. With no false positives from the filter, Ramp-Hybrid would therefore behave as RAMP-Fast. And with all false positives, it behaves as RAMP-Small. All of the variants require two RTTs/transaction for writes.
The two-phase atomic commitment protocol used by RAMP ensures readers never block waiting for writes to arrive. It is known that every atomic commitment protocol may block during failures.
Blocked writes instead act as “resource leaks” on partitions: partitions will retain prepared versions indefinitely unless action is taken. To “free” these leaks, RAMP servers can use the Cooperative Termination Protocol (CTP). CTP can always complete the transaction except when every partition has performed PREPARE but no partition has performed COMMIT… Compared to alternatives (e.g. replicating clients), we have found CTP to be both lightweight and effective.
There is of course much more detail in the full paper, which I encourage you to go on and read. Section 6 on Related Work contains a nice short summary of isolation guarantees in the wild. “In recent years, many ‘NoSQL’ designs have avoided cross-partition transactions entirely, effectively providing Read Uncommitted isolation…”