Mojim: A Reliable and Highly-Available Non-Volatile Memory System

Mojim: A Reliable and Highly-Available Non-Volatile Memory System – Zhang et al. 2015

This is the second in a series of posts looking at the latest research from the recently held ASPLOS 15 conference.

It seems like we’ve been anticipating NVMM (Non-volatile main memory) for a while now; and there has been plenty of research looking at the implications for different parts of the stack. Today’s paper caught my eye amongst other things because it connects NVMM to systems we understand and use today – most notably MongoDB – to show us what the potential benefits and trade-offs might be.

Fast, non-volatile memory technologies such as phase change memory (PCM), spin-transfer torque magnetic memories (STTMs), and the memristor are poised to radically alter the performance landscape for storage systems. They will blur the line between storage and memory, forcing designers to rethink how volatile and non-volatile data interact and how to manage non-volatile memories as reliable storage. Attaching NVMs directly to processors will produce non-volatile main memories (NVMMs), exposing the performance, flexibility, and persistence of these memories to applications. However, taking full advantage of NVMMs’ potential will require changes in system software.

These NVM technologies are byte-addressable, with performance approaching that of DRAM. When attached to main memory they provide a raw storage medium that is orders of magnitude faster than modern disks and SSDs. NVMM also degrades much more slowly than flash – good for about 100M write cycles – and can retain data for 100s of years (see for example PCM on Wikipedia). For the experiments conducted in this paper, the expected performance of NVMM is as follows: read latency 300 ns, read bandwidth 5 GB/s, write barrier delay 1 microsecond, write bandwidth 1.6 GB/s.

Impressive as these numbers are, I’ve learned it’s often more informative to look at the ratios than the absolute numbers. If everything improves at roughly the same rate, then systems get faster but the fundamental trade-offs made during system design still hold. But when the ratios change, then the optimal way to build systems probably changes too. Instead of ‘numbers every programmer should know’ (which keep changing year on year of course), I argue that the real value is in ‘ratios every programmer should know’ (which also change, but more slowly). What kind of ratios might we be interested in? Latency, bandwidth, capacity, and cost come to mind as the four big ones. If you want to do performance modeling the absolute numbers matter of course, but if you want to make system design trade-offs then the ratios are key. With capacity, we’re not just interested in the ratio of typical storage capacity from one medium to the next, but also in the percentage of application data that can easily fit within that capacity. The 5-minute rule and its derivatives (+10 yrs, +20 yrs) are well-known examples of these kinds of system design trade-offs.

Watch the ratios change:

For traditional storage systems with slow hard disks and SSDs, the performance overhead of replication is small relative to the cost of accessing a hard drive or SSD, even with complex protocols for strong consistency. With NVMMs, however, the networking round trips and software overhead involved in these techniques threaten to outstrip the low-latency benefit of using NVMMs in the first place. Even for systems with weak consistency, increasing the rate of reconciliation between inconsistent copies of data can threaten performance… Existing replication mechanisms built for these slower storage media have software and networking performance overhead that would obscure the performance benefits that NVMM could provide.
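To put rough numbers on that, here’s a quick back-of-the-envelope check in C. The NVMM read latency is the paper’s emulation figure from above; the SSD read latency and the replication round-trip time are my own ballpark assumptions, so the exact ratios are illustrative only:

```c
/* Back-of-the-envelope: a replication round trip expressed as a multiple of
 * one device access. NVMM read latency is the paper's emulation figure; the
 * SSD latency and network round trip are assumed ballpark values. */
#include <stdio.h>

int main(void) {
    const double nvmm_read_ns = 300.0;      /* paper's emulated NVMM read latency */
    const double ssd_read_ns  = 100000.0;   /* assumption: ~100 us SSD read */
    const double net_rtt_ns   = 10000.0;    /* assumption: ~10 us replication round trip */

    printf("SSD : replication RTT = %.2fx one device read\n", net_rtt_ns / ssd_read_ns);
    printf("NVMM: replication RTT = %.2fx one device read\n", net_rtt_ns / nvmm_read_ns);
    return 0;
}
```

Against a slow device the round trip is in the noise (about 0.1x a read here); against NVMM it dominates (over 30x a read), which is exactly why Mojim puts so much effort into a lean replication path.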

For NVMM to serve as a reliable primary storage mechanism (and not just a caching layer) we need to address reliability and availability requirements. The two main storage approaches for this, replication and erasure coding, both make the fundamental assumption that storage is slow. In addition, NVMM is directly addressed like memory, not via I/O system calls as a storage device.

We propose Mojim, a system that provides replicated, reliable, and highly-available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Mojim allows applications to build data persistence abstractions ranging from simple log-based systems to complex transactions.

Mojim is an operating system service that presents a file-system interface for ‘flexibility in application usage models.’ With a memory-mapped (mmap’d) file, it maps the NVMM pages corresponding to the file directly into the application’s address space rather than paging them in and out of the kernel’s buffer cache. A primary Mojim node supports reads and writes to data in a Mojim region, which is replicated to a mirror node, and optionally to a second tier of backup nodes as well. The mirror and backup nodes support reads only. Mojim supports a range of replication modes and protocols enabling trade-offs to be made between reliability, availability, consistency, and cost.
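Because Mojim exposes NVMM through the file-system interface, getting direct load/store access to a region looks just like ordinary memory-mapped file I/O. Here’s a minimal sketch; the mount point /mnt/mojim, file name, and size are hypothetical:

```c
/* Minimal sketch: map a file from a (hypothetical) Mojim mount and update it
 * with ordinary store instructions. Path and size are illustrative only. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 4096;
    int fd = open("/mnt/mojim/data", O_CREAT | O_RDWR, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, len) != 0) return 1;

    /* Mojim maps the NVMM pages backing the file straight into our address
     * space, so this pointer refers to the persistent data itself. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "hello, non-volatile world");  /* a plain memory store */

    munmap(p, len);
    close(fd);
    return 0;
}
```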

Synchronisation occurs on msync and gmsync system calls:

Mojim leverages the existing msync system call to specify a sync point that applies to a single, contiguous address range. The semantics of Mojim’s msync correspond to conventional msync, and applications that use msync will work correctly without modification under Mojim. Mojim allows an application to specify a fine-grained memory region in the msync API and replicates it atomically, while traditional msync flushes page-aligned memory regions to persistent storage and does not provide atomicity guarantees. Mojim’s gmsync adds the ability to specify multiple memory regions for the sync point to replicate, allowing for more flexibility than msync.
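In code, a sync point might look something like the sketch below, continuing with a mapping like the one above. msync is the standard call; the gmsync prototype is my guess at the shape of the interface implied by the description (a list of address ranges replicated atomically as one sync point), not the actual Mojim API:

```c
/* Sketch of Mojim sync points over an mmap'd region. msync() is the standard
 * call; the gmsync() prototype is a guess at the interface the paper
 * describes (a list of address ranges), not the actual Mojim API. */
#include <stddef.h>
#include <sys/mman.h>

struct region { void *addr; size_t len; };

/* Hypothetical: atomically replicate several (possibly fine-grained) regions. */
extern int gmsync(struct region *regions, int nregions, int flags);

void commit_two_fields(char *base) {
    /* Two updates that must become durable and replicated together. */
    base[0]    = 'A';
    base[4096] = 'B';

    struct region regs[] = {
        { base,        1 },
        { base + 4096, 1 },
    };
    gmsync(regs, 2, 0);          /* one sync point covering both regions */
}

void commit_one_field(char *base) {
    base[0] = 'C';
    msync(base, 1, MS_SYNC);     /* conventional msync also works under Mojim */
}
```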

The primary and mirror nodes are connected with a high-speed Infiniband link:

Mojim uses Infiniband (IB), a high-performance switched network that supports RDMA. RDMA is crucial because it allows the primary node to transfer data directly into the mirror node’s NVMM without requiring additional buffering, copying, or cache flushes.

Infiniband is also the preferred connection to the second-tier backup nodes, but a mode is offered that uses Ethernet for these links as an alternative.

Consider the unreplicated (single machine) case: you have to flush (msync) a modified memory region to ensure persistence. This model has poor availability and is only as reliable as the NVMM device; data corruption is also still possible if a crash happens in the middle of a change set. With Mojim’s M-sync mode, which provides strong consistency between the primary and mirror node, reliability and availability are improved, and perhaps surprisingly, so is performance – latency is reduced by 40-45%!

Our evaluation results show that M-sync offers performance comparable to or better than (the unreplicated case) because flushing CPU caches is often more expensive than pushing the data over RDMA.

In this model, there is still a small (approx. 450 microsecond) window of vulnerability after a mirror node failure during which a primary node failure can still result in data loss. This vulnerability can be eliminated by flushing data from the primary node’s caches before returning from an msync or gmsync call. Other permutations supported by Mojim include asynchronous (rather than synchronous) replication to the mirror node; the ability to use SSDs for the mirror node, keeping just the log in NVMM (to reduce costs); the ability to sync to a second tier of backup nodes; and the ability to sync to a tier of backup nodes over Ethernet.
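On the cache-flushing point: closing the window means pushing the updated cache lines out of the primary’s CPU caches before the sync call returns. The paper doesn’t spell out the mechanism, but on x86 it would look roughly like the following (illustrative only, not Mojim’s code):

```c
/* Illustrative x86 flush of a dirty region's cache lines, roughly what
 * "flushing data from the primary node's caches" involves. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

static void flush_region(const void *addr, size_t len) {
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);   /* evict each line to (non-volatile) memory */
    _mm_mfence();                       /* order the flushes before anything that follows */
}
```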

Comparing DRAM and emulated NVMM, the performance with emulated NVMM for all schemes is close to that with DRAM, indicating that the performance degradation of NVMM relative to DRAM has only a very small effect on application-level performance.

The authors ported the PMFS file system, Google Hash Table, and MongoDB. Porting PMFS required changes to just 20 lines of code, and the Google hash table port only 18 lines:

Google hash table is an open source implementation of sparse and dense hash tables. Our Mojim-enabled version of the hash table stores its data in mmap’d PMFS files and performs msync at each insert and delete operation to let Mojim replicate the data. Porting the Google hash table to Mojim requires changes to just 18 lines of code.

MongoDB stores its data in memory-mapped files and performs memory loads and stores for data access, making it a strong match for Mojim. Furthermore, by taking advantage of Mojim’s gmsync capability and its reliability guarantees, it is possible to remove explicit journaling and still achieve the same consistency level.

To guarantee the same atomicity of client requests as available through MongoDB, we modify the storage engine of MongoDB to keep track of all writes to the data file and group the written memory regions belonging to the same client request into a gmsync call. In total, this change requires modifying 117 lines of MongoDB.
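Schematically, the change amounts to something like the sketch below. The struct region and gmsync declarations are the same hypothetical ones as in the earlier sketch; this illustrates the pattern described above, not MongoDB’s actual code:

```c
/* Illustration of the pattern: record every write made on behalf of one
 * client request, then issue a single gmsync for the whole group. */
#include <stddef.h>

struct region { void *addr; size_t len; };
extern int gmsync(struct region *regions, int nregions, int flags);  /* hypothetical */

#define MAX_REGIONS 64

struct write_group {
    struct region regs[MAX_REGIONS];
    int           n;
};

static void track_write(struct write_group *g, void *addr, size_t len) {
    if (g->n < MAX_REGIONS)
        g->regs[g->n++] = (struct region){ addr, len };
}

static void commit_request(struct write_group *g) {
    gmsync(g->regs, g->n, 0);   /* one atomic sync point per client request */
    g->n = 0;
}
```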

Those 117 lines give quite a big payback:

MongoDB with Mojim outperforms the MongoDB replication method REPLICAS SAFE by 3.7 to 3.9×. This performance gain is due to Mojim’s efficient replication protocol and networking stack. Mojim also outperforms the un-replicated JOURNALED MongoDB by 56 to 59× and the un-replicated FSYNC SAFE by 701 to 741×. JOURNALED flushes journal content for each client write request. FSYNC SAFE performs fsync of the data file after each write operation to guarantee data reliability without journaling. Both these operations are expensive.

The authors also evaluated recovery times following a crash. With all data held in NVMM, a 5GB region can be recovered in 1.9 seconds (1TB in 6.5 minutes). When keeping only the mirror’s log in NVMM, and the main data on SSDs, recovery of 5GB takes 17 seconds.