# Data Tiering in Heterogeneous Memory Systems

Data Tiering in Heterogeneous Memory Systems – Dulloor et al. 2016

Another fantastic EuroSys 2016 paper for today, and one with results that are of great importance in understanding the cost and performance implications of the new generation of non-volatile memory (NVM) heading to our data centers soon. We also get some great insight into the amount of change that might be necessary in existing applications (e.g. in-memory datastores) to start taking effective advantage of the new hardware.

### Optimising applications for hybrid memory systems with X-Mem

X-Mem provides a modified version of the malloc API (xmalloc) that enables the programmer to specify a tag for the data structure being allocated. X-Mem then profiles a run of the application to determine the benefit of placing a given application data structure in DRAM vs NVM. This profiling step results in a tag-to-priority map which is then used by the X-Mem runtime to determine the optimal placement of data structures.

Here’s an example where the application has three data structures (tree, hash, and buffer) each with their own tag. Memory allocations for each of these data structures are managed by a separate jemalloc instance. The tags used by the application are assigned internal priorities used to decide placement.
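To make the tagging idea concrete, here's a minimal sketch in C. The `xmalloc_sketch` function, the tag-map helpers, and the priority numbers are all illustrative stand-ins (the paper's actual xmalloc signature and runtime are not reproduced here); the point is simply that each allocation carries a tag, and the tag resolves to a profiled priority that the runtime can use to pick a placement (e.g. a per-tag jemalloc instance):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAGS 8

/* Tag-to-priority map, as produced by the profiling step.
 * The priorities here are assumed numbers for illustration. */
typedef struct { const char *tag; int priority; } tag_entry_t;

static tag_entry_t tag_map[MAX_TAGS];
static int n_tags = 0;

static void set_priority(const char *tag, int priority) {
    tag_map[n_tags].tag = tag;
    tag_map[n_tags].priority = priority;
    n_tags++;
}

static int priority_of(const char *tag) {
    for (int i = 0; i < n_tags; i++)
        if (strcmp(tag_map[i].tag, tag) == 0) return tag_map[i].priority;
    return 0; /* unknown tags default to lowest priority */
}

/* Toy stand-in for a tagged allocator: allocate, and surface the tag's
 * priority, which a real runtime would use to route the allocation to
 * the data structure's own allocator instance. */
static void *xmalloc_sketch(const char *tag, size_t size, int *prio_out) {
    *prio_out = priority_of(tag);
    return malloc(size);
}
```

A caller would then allocate with `xmalloc_sketch("tree", sizeof(node), &prio)` and so on, one tag per data structure.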

We use a greedy algorithm to maximize the total benefit by first sorting the (memory) regions into a list ordered by decreasing benefit, and then placing regions starting from the head of the list into DRAM until no more space is left in DRAM. This algorithm is optimal because any placement that does not include one of the chosen regions can be further improved by replacing a region not in our chosen set with the missing region.
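The greedy placement described above can be sketched in a few lines of C. The struct fields and helper names are mine, not the paper's, and the benefit values in the usage example are made up:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *tag;    /* application-assigned tag, e.g. "tree" */
    size_t size;        /* region size in bytes */
    double benefit;     /* profiled benefit of DRAM placement */
    int in_dram;        /* decision: 1 = place in DRAM, 0 = NVM */
} region_t;

/* Comparator: sort regions by decreasing benefit. */
static int by_benefit_desc(const void *a, const void *b) {
    double d = ((const region_t *)b)->benefit - ((const region_t *)a)->benefit;
    return (d > 0) - (d < 0);
}

/* Walk the sorted list from the head, placing each region into DRAM
 * if it still fits; everything else spills to NVM. */
static void place(region_t *regions, size_t n, size_t dram_capacity) {
    qsort(regions, n, sizeof *regions, by_benefit_desc);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (used + regions[i].size <= dram_capacity) {
            regions[i].in_dram = 1;
            used += regions[i].size;
        } else {
            regions[i].in_dram = 0;
        }
    }
}
```

With three regions `tree` (benefit 9), `hash` (benefit 5), and `buffer` (benefit 1), and DRAM capacity for only two of them, the two highest-benefit regions land in DRAM and `buffer` goes to NVM.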

Regions may migrate between DRAM and NVM at runtime as new memory regions are added. X-Mem reissues an mbind call with the desired mapping for the region, and lets the Linux operating system do the heavy lifting of actually migrating the pages. This process leaves the application-visible virtual addresses unchanged, so no modifications are needed to the application.
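For reference, the Linux call in question is `mbind(2)`, which binds a virtual address range to a memory-policy nodemask and (with `MPOL_MF_MOVE`) asks the kernel to migrate already-allocated pages. The sketch below only builds the nodemask argument; the node numbering (DRAM as node 0, NVM exposed as node 1) is an assumption for illustration, and the real call shown in the comment requires a NUMA-capable Linux system:

```c
#include <assert.h>
#include <stdio.h>

/* On a real system the migration step would look roughly like
 * (Linux, <numaif.h>):
 *
 *   mbind(region_base, region_len, MPOL_BIND,
 *         &nodemask, maxnode, MPOL_MF_MOVE);
 *
 * which rebinds the region's pages to the given node without
 * changing any virtual addresses. */

/* Build a single-node nodemask: one bit per NUMA node. */
static unsigned long nodemask_for(int node) {
    return 1UL << node;
}
```

Because the kernel migrates physical pages underneath a fixed virtual mapping, the application's pointers into the region remain valid throughout.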

When assigning priorities to tags, the analysis needs to take into account not only the frequency of access to data structures, but also the access patterns. The latter is important because the processor works hard to hide latency, and the access patterns play a big role in how well it can do that.

The authors consider random, pointer chasing, and streaming access patterns. For random accesses, the processor can use its execution window to detect and issue multiple requests simultaneously. In pointer chasing, the processor issues random accesses but the address for each access depends on the loaded value of a previous one, forcing the processor to stall until the previous load is complete. In streaming access, the instruction stream consists of a striding access pattern that the prefetcher can detect and use to issue requests ahead of time.
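The structural difference between the two extremes is easy to see in code. The skeleton below (mine, not the paper's) contrasts a streaming pass, where the next address is known in advance and the prefetcher can run ahead, with a pointer chase, where each load's address comes from the value just loaded, serializing the loads. Real latency microbenchmarks would use a randomly shuffled permutation cycle for `next[]` to defeat the prefetcher; a sequential cycle is used here only to keep the example checkable:

```c
#include <assert.h>
#include <stddef.h>

#define N 1024

/* Streaming: sequential, stride-1 accesses; addresses are predictable,
 * so hardware prefetchers can hide most of the memory latency. */
static long stream_sum(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Pointer chasing: next[] encodes a cycle, and each index is the value
 * just loaded, so every load must wait for the previous one to finish. */
static size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    for (size_t k = 0; k < steps; k++) i = next[i];
    return i;
}
```

Both loops touch the same N elements; it is only the dependence of each address on the previous load that makes the chase so much slower on real hardware.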

In Figure 5 we vary the latency and bandwidth characteristics of memory in our NVM performance emulator (§5.1). Pointer chasing experiences the maximum amount of stall latency, equal to the actual physical latency to memory. In the case of random access, out-of-order execution is effective at hiding some of the latency to memory. In the case of streaming access, the prefetcher is even more effective at hiding latency to memory. The order of magnitude difference in the performance of these access patterns validates our observation that the frequency of accesses alone is not sufficient to determine the “hotness” of a data structure, and illustrates the need for the more sophisticated classification of memory accesses proposed in this paper.

There are of course many more details on the inner workings of X-Mem and the performance benefit calculations used to determine where to place data structures in the paper itself.