Fast In-memory Transaction Processing using RDMA and HTM

Fast In-memory Transaction Processing using RDMA and HTM – Wei et al. 2015

This paper tries to answer a natural question: with advanced processor features and fast interconnects, can we build a transaction processing system that is at least one order of magnitude faster than the state-of-the-art systems without using such features?

The authors build a distributed transaction system, DrTM, that exploits Hardware Transactional Memory (HTM) and Remote Direct Memory Access (RDMA):

Hardware transactional memory (HTM) has recently come to the mass market in the form of Intel’s restricted transactional memory (RTM). The features like atomicity, consistency and isolation (ACI) make it very promising for database transactions. Meanwhile, RDMA, which provides direct memory access (DMA) to the memory of a remote machine, has recently gained considerable interests in the systems community.

With a 6-node cluster (20 cores/server), DrTM achieves 5.52M transactions per second on TPC-C. It’s interesting to compare that number to RIFL that we looked at last week – from Figure 14 in the RIFL paper we can see that RAMCloud with its kernel bypass transport (fastest configuration) does about 1250 txns per minute (about 21 tps) on TPC-C with a 6-node cluster (only 4 cores/server in this setup). RAMCloud is tuned for low latency of course, and has an average latency of 1ms. Using I-Confluence analysis, and extrapolating from the figures in Coordination Avoidance in Database Systems, Bailis et al. achieve about 480K transactions per second on TPC-C with a 6-node cluster (32 vcpus/server). That system scaled linearly up to 12.7M tps (with a 200 node cluster). We’d have to debate whether the hardware was comparable, etc. (well, clearly it is not because one system is using direct hardware support, though fewer cores per server) – but an order of magnitude difference is significant. How fast could we go if we implemented an I-Confluent system on top of HTM and RDMA???

To get this level of performance, DrTM depends on moving as much concurrency control as possible into HTM. One key challenge is HTM's limited working set size; DrTM uses transaction chopping to keep large transactions within it. A second key challenge is that RDMA cannot be used within an HTM region:

DrTM addresses this with a concurrency control protocol that combines HTM and two-phase locking (2PL) to preserve serializability. Specifically, DrTM uses RDMA-based compare-and-swap (CAS) to lock and fetch the corresponding database records from remote machines before starting an HTM transaction. Thanks to the strong consistency of RDMA and the strong atomicity of HTM, any concurrent conflicting transactions on a remote machine will be aborted. DrTM leverages this property to preserve serializability among distributed transactions. To guarantee forward progress, DrTM further provides contention management by leveraging the fallback handler of HTM to prevent possible deadlock and livelock.

HTM (RTM)

Intel’s Restricted Transactional Memory (RTM) provides strong atomicity within a single machine, where a non-transactional code will unconditionally abort a transaction when their accesses conflict. RTM uses the first-level cache to track the write set and an implementation-specific structure to track the read set, and relies on the cache coherence protocol to detect conflicts. Upon a conflict, at least one transaction will be aborted. RTM provides a set of interfaces including XBEGIN, XEND and XABORT, which will begin, end and abort a transaction accordingly.

The read/write set of an RTM transaction is limited in size due to the private cache and buffers on the CPU that are used to support it. Abort rates increase significantly as working set sizes increase, and a transaction that exceeds the hardware capacity will always be aborted. Any use of network I/O will also cause the transaction to be aborted.
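
For concreteness, here is a minimal sketch of how code typically drives RTM through the compiler intrinsics, with the retry-then-fallback structure that the contention management mentioned above builds on. The helper names and retry budget are my own, not DrTM's.

```cpp
// A minimal RTM sketch (not the paper's code). Compile with RTM support, e.g. g++ -mrtm.
#include <immintrin.h>
#include <mutex>

static std::mutex fallback_lock;   // software fallback path
constexpr int kMaxRetries = 3;     // assumption: small bounded retry budget

void atomic_increment(long* counter) {
    for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
        unsigned status = _xbegin();              // XBEGIN
        if (status == _XBEGIN_STARTED) {
            // Inside the hardware transaction: the cache tracks our read/write
            // set, and any conflicting access from another core aborts us.
            *counter += 1;
            _xend();                              // XEND: attempt to commit
            return;
        }
        // Aborted: `status` encodes the cause (conflict, capacity, explicit
        // XABORT, ...). Simply retry a few times before giving up on HTM.
    }
    // Fallback handler: use an ordinary lock. (A production elision scheme
    // also reads the lock state inside the transaction so the two paths
    // serialize correctly; omitted here for brevity.)
    std::lock_guard<std::mutex> guard(fallback_lock);
    *counter += 1;
}
```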

RDMA

Remote Direct Memory Access (RDMA) is a networking feature to provide cross-machine accesses with high speed, low latency and low CPU overhead. Much prior work has demonstrated the benefit of using RDMA for in-memory stores and computing platforms.

RDMA offers three communication options. In order of increasing performance these are: IP emulation, which lets socket-based code run unmodified; message passing using SEND/RECV verbs; and 'one-sided RDMA', which provides one-way direct access to the memory of another machine, bypassing its CPU. One-sided RDMA provides only read, write, and two atomic operations: fetch_and_add and compare_and_swap.
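
As a rough illustration of what those one-sided verbs look like at the API level, here is a hedged sketch using libibverbs (generic verbs usage, not DrTM's code). It assumes an already-connected reliable (RC) queue pair and that buffer registration and remote-key exchange have happened out of band.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Post a one-sided RDMA READ: fetch `len` bytes from the remote machine's
// memory without involving its CPU.
int post_rdma_read(ibv_qp* qp, void* local_buf, uint32_t lkey, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = len;
    sge.lkey   = lkey;

    ibv_send_wr wr{}, *bad = nullptr;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // generate a completion
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);          // completion polled elsewhere
}

// Post a one-sided compare-and-swap on a 64-bit word in remote memory; this is
// the kind of primitive DrTM uses to acquire locks on remote records.
int post_rdma_cas(ibv_qp* qp, uint64_t* local_result, uint32_t lkey,
                  uint64_t remote_addr, uint32_t rkey,
                  uint64_t expected, uint64_t desired) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_result);  // old value lands here
    sge.length = sizeof(uint64_t);
    sge.lkey   = lkey;

    ibv_send_wr wr{}, *bad = nullptr;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;   // must be 8-byte aligned
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = expected;      // swap only if remote word == expected
    wr.wr.atomic.swap        = desired;
    return ibv_post_send(qp, &wr, &bad);
}
```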

DrTM

DrTM partitions data into shards spread across many machines connected by RDMA. It uses one worker thread per core; each thread executes and commits a single transaction at a time. DrTM exposes a partitioned global address space, though a process still needs to distinguish between local and remote accesses. Remote access is primarily via one-sided RDMA operations for efficiency.

The memory store layer of DrTM provides a general key-value store interface to the upper layers:

To optimize for different access patterns, DrTM provides both an ordered store in the form of a B+ tree and an unordered store in the form of a hash table. For the ordered store, we use the B+ tree in DBX, which uses HTM to protect the major B+ tree operations and was shown to have comparable performance with state-of-the-art concurrent B+ tree. For the unordered store, we further design and implement a highly optimized hash table based on RDMA and HTM.
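
As a mental model of what that layering might look like, here is a hypothetical sketch of the kind of key-value interface the memory store layer could expose to the transaction layer; the names and signatures are my own, not DrTM's.

```cpp
#include <cstdint>
#include <cstddef>

struct MemStore {
    virtual ~MemStore() = default;
    // Local operations, intended to run inside an HTM region.
    virtual bool get(uint64_t key, void* val_out, size_t len)      = 0;
    virtual bool put(uint64_t key, const void* val, size_t len)    = 0;
    virtual bool insert(uint64_t key, const void* val, size_t len) = 0;
    virtual bool remove(uint64_t key)                              = 0;
};

// Unordered store: the RDMA/HTM-friendly hash table (remote reads via one-sided RDMA).
struct HashStore : MemStore { /* ... */ };

// Ordered store: an HTM-protected B+ tree (as in DBX), additionally supporting range scans.
struct BTreeStore : MemStore {
    void scan(uint64_t lo, uint64_t hi /*, per-record callback */) { /* ... */ }
};
```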

There is prior work on RDMA-optimised hash-tables, but nothing that exploits the combination of HTM and RDMA.

DrTM leverages the strong atomicity of HTM and strong consistency of RDMA to design an HTM/RDMA-friendly hash table. First, DrTM decouples the race detection from the hash table by leveraging the strong atomicity of HTM, where all local operations (e.g., READ/WRITE/INSERT/DELETE) on key-value pairs are protected by HTM transactions and thus any conflicting accesses will abort the HTM transaction. This significantly simplifies the data structures and operations for race detection. Second, DrTM uses one-sided RDMA operations to perform both READ and WRITE to remote key-value pairs without involving the host machine. Finally, DrTM separates keys and values as well as its metadata into decoupled memory region, resulting in two-level lookups like Pilaf. This makes it efficient to leverage one-sided RDMA READ for lookups, as one RDMA READ can fetch a cluster of keys. Further, the separated key-value pair makes it possible to implement RDMA-friendly, location-based and host-transparent caching.

DrTM uses cluster chaining for its hash table. When caching lookup results, DrTM caches the key's location rather than its value. This minimizes the lookup cost while still retaining strongly consistent reads and writes.
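
To make the layout concrete, here is a rough sketch of how a cluster-chained hash table with separated key and value regions might be laid out; the structure and field names are my interpretation, not the paper's code.

```cpp
#include <cstdint>

constexpr int kSlotsPerCluster = 4;   // assumption: sized so one RDMA READ fetches a whole cluster

struct Slot {                 // lives in the (RDMA-readable) key/metadata region
    uint64_t key;
    uint64_t value_offset;    // location of the value in the separate value region
    uint64_t incarnation;     // detects concurrent relocation or deletion
};

struct Cluster {              // one hash bucket: a small array of slots plus a chain pointer
    Slot     slots[kSlotsPerCluster];
    uint64_t next_cluster;    // offset of the next cluster in the chain (0 = none)
};

// Host-transparent location cache: remembers where a key's value lives so later
// accesses can skip the lookup READs and fetch the value directly.
struct CachedLocation {
    uint16_t node_id;         // which machine holds the record
    uint64_t value_offset;    // where its value sits in that machine's value region
};
```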

To support distributed transactions, DrTM combines HTM with a higher-level two-phase locking (2PL) protocol:

…to preserve serializability among conflicting transactions on multiple nodes, we design a 2PL-like protocol to coordinate accesses to the same database records from local and remote worker threads. To bridge HTM (which essentially uses OCC) and 2PL, DrTM implements the exclusive and shared locks using one-sided RDMA operations, which are cache-coherent with local accesses and thus provide strong consistency with HTM.

The challenge is that any RDMA operation inside an HTM transaction will automatically cause it to abort.

To this end, DrTM uses 2PL to safely accumulate all remote records into a local cache prior to the actual execution in an HTM transaction, and write back the committed updates to other machines until the local commit of the HTM transaction or discard temporal updates after an HTM abort.

DrTM therefore requires prior knowledge of the read/write sets of transactions for locking and prefetching in the 'start' phase. For typical OLTP workloads such as TPC-C this is not normally a problem.
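
Putting the pieces together, here is a simplified sketch of the shape of a distributed transaction under this scheme: lock and fetch remote records with one-sided RDMA before the HTM region, run the transaction logic locally inside HTM, then write back and unlock. The helper functions and types are hypothetical stand-ins, not DrTM's API.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <vector>

struct RemoteRef { uint16_t node; uint64_t addr; };   // where a remote record lives
struct Record    { uint64_t key; uint64_t value; };

// Placeholder RDMA helpers; real versions would issue one-sided CAS/READ/WRITE.
bool rdma_lock_and_fetch(const RemoteRef&, Record* out) { out->value = 0; return true; }
void rdma_writeback_and_unlock(const RemoteRef&, const Record&) {}
void rdma_unlock(const RemoteRef&) {}

// Run the transaction body over the locally cached records inside an HTM region.
bool htm_execute(std::vector<Record>& working_set) {
    unsigned status = _xbegin();
    if (status != _XBEGIN_STARTED) return false;      // real code retries or falls back
    for (auto& r : working_set) r.value += 1;         // placeholder transaction logic
    _xend();
    return true;
}

bool run_distributed_txn(const std::vector<RemoteRef>& remote_set) {
    std::vector<Record> cache(remote_set.size());

    // Start phase (2PL growing): lock and prefetch every remote record before
    // entering HTM, since any RDMA operation inside HTM would abort it.
    for (size_t i = 0; i < remote_set.size(); ++i) {
        if (!rdma_lock_and_fetch(remote_set[i], &cache[i])) {
            for (size_t j = 0; j < i; ++j) rdma_unlock(remote_set[j]);
            return false;                             // lock conflict: give up and retry later
        }
    }

    // Local execution: all reads and writes, including those on the cached
    // remote records, happen inside a single hardware transaction.
    bool committed = htm_execute(cache);

    // Commit phase (2PL shrinking): only after the local HTM commit are the
    // updates written back to their home nodes and the remote locks released.
    for (size_t i = 0; i < remote_set.size(); ++i) {
        if (committed) rdma_writeback_and_unlock(remote_set[i], cache[i]);
        else           rdma_unlock(remote_set[i]);
    }
    return committed;
}
```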

On each individual machine DrTM uses HTM to provide transaction support. To mitigate the working set size limitations, DrTM uses transaction chopping 'with optimisations' to decompose larger transactions into smaller pieces when needed. For read-only transactions with very large read sets, DrTM provides a separate scheme that executes them without HTM. DrTM transactions support strict serializability.

Currently DrTM provides durability, but not high availability:

DrTM currently preserves durability rather than availability in case of machine failures, as done in recent in-memory databases. How to provide availability, e.g., through efficiently replicated logging, will be our future work.

The source code of DrTM will soon be available at http://ipads.se.sjtu.edu.cn/drtm.