Split-Level IO Scheduling

Split-Level IO Scheduling – Yang et al. 2015

The central idea in today’s paper is pretty simple: block-level I/O schedulers (the most common kind) lack the higher level information necessary to perform write-reordering and accurate accounting, whereas system-call level schedulers have the appropriate context but lack the low-level knowledge needed to build efficient schedulers – so why not create a scheduling framework that can intercept at both the block and system call levels to get the best of both worlds? This is what the authors call ‘split-level IO scheduling.’

By implementing a judiciously selected set of handlers at key junctures within the storage stack (namely, at the system-call, page-cache, and block layers), a developer can implement a scheduling discipline with full control over behavior and with no loss in high- or low-level information. Split schedulers can determine which processes issued I/O (via graph tags that track causality across levels) and accurately estimate I/O costs. Furthermore, memory notifications make schedulers aware of write work as soon as possible (not tens of seconds later when writeback occurs). Finally, split schedulers can prevent file systems from imposing orderings that are contrary to scheduling goals.

On top of this split-level framework, the authors build a fair queuing scheduler that reduces priority misallocations by 28x compared to the Linux CFQ scheduler; a deadline-based scheduler that reduces tail latencies by 4x; and a resource limiting scheduler that improves isolation by 6x for some workloads. This latter scheduler can improve isolation for virtual machines and for HDFS, and the deadline based scheduler provides a solution to fsync-induced freezes in database systems.

[A] block-level framework fails to support correct cause mapping (due to write delegation such as journaling and delayed allocation) or control over reordering (due to file-system ordering requirements). [A] system-call framework solves these two problems, but fails to provide enough information to schedulers for accurate cost estimation because it lacks low-level knowledge. These problems are general to many file systems; even if journals are not used, similar issues arise from the ordering constraints imposed by other mechanisms such as copy-on-write techniques or soft up- dates. Our split framework meets all the needs in Table 1 (cause mapping, cost-estimation, and reordering) by incorporating ideas from the other two frameworks and exposing additional memory-related hooks.

The ideas apply generally, but the implementation in the paper is integrated with the ext4 and XFS filesystems in Linux. Three split-level schedulers are built on top: Actually-Fair Queuing (AFQ) – 950 loc; Split-Deadline – 750 loc; and Split-Token – 950 loc.

Actually Fair Queuing

AFQ allocates I/O fairly among processes according to their priorties. It has performance similar to Linux’s CFQ, while avoiding the priority inversions that can happen with CFQ.

AFQ employs a two-level scheduling strategy. Reads are handled at the block level and writes (and calls that cause writes, such as fsync) are handled at the system-call level. This design allows reads to hit the cache while protecting writes from journal entanglement. Beneath the journal, low-priority blocks may be prerequisites for high-priority fsync calls, so writes at the block level are dispatched immediately. AFQ chooses I/O requests to dequeue at the block and system-call levels using the stride algorithm . Whenever a block request is dispatched to disk, AFQ charges the responsible processes for the disk I/O. The I/O cost is based on a simple seek model.


[Linux’s] Block-Deadline scheduler does poorly when trying to limit tail latencies, due to its inability to reorder block I/Os in the presence of filesystem ordering requirements. Split-level scheduling, with system-call scheduling capabilities and memory-state knowledge, is better suited to this task. We implement the Split-Deadline scheduler by modifying the Linux deadline scheduler (Block-Deadline)…

To show the benefits for real databases, SQLite and PostgreSQL are measured with both Split-Deadline and Block-Deadline.

…when running on top of Block-Deadline, 4% of transactions fail to meet their latency target, and over 1% take longer than 500ms. After further inspection, we found that the latency spikes happen at the end of each checkpoint period, when the system begins to flush a large amount of dirty data to disk using fsync. Such data flushing interferes with foreground I/Os, causes long transaction latency and low system throughput. The database community has long experienced this “fsync freeze” problem, and has no great solution for it. We show next that Split-Deadline provides a simple solution to this problem…. it effectively eliminates tail latency: 99.99% of the transactions are completed within 15 ms.


[In Split-Token] throttled processes are given tokens at a set rate. I/O costs tokens, I/O is blocked if there are no tokens, and the number of tokens that may be held is capped. Split-Token throttles a process’s system-call writes and block-level reads if and only if the number of tokens is negative. System-call reads are never throttled (to utilize the cache). Block writes are never throttled (to avoid entanglement). Our implementation uses memory-level and block-level hooks for accounting. The scheduler promptly charges tokens as soon as buffers are dirtied, and then revises when the writes are later flushed to the block level (§3.2), charging more tokens (or refunding them) based on amplification and sequentiality. Tokens represent bytes, so accounting normalizes the cost of an I/O pattern to the equivalent amount of sequential I/O (e.g., 1 MB of random I/O may be counted as 10 MB).

Under evaluation with QEMU, split-token is much better than the SCS scheduler at eliminating noisy I/O neighbour problems. The authors also evaluate split-token in an HDFS context:

To show that local split scheduling is a useful foundation to provide isolation in a distributed environment, we integrate HDFS with Split-Token to provide isolation to HDFS clients. We modify the client-to-worker protocol so workers know which account should be billed for disk I/O generated by the handling of a particular RPC call. Account information is propagated down to Split-Token and across to other workers (for pipelined writes)… We conclude that local scheduling can be used to meet distributed isolation goals; however, throttled applications may get worse-than-expected performance if the system is not well balanced.

In conclusion…

While our experiments indicate that simple layering must be abandoned, we need not sacrifice modularity. In our split framework, the scheduler operates across all layers, but is still abstracted behind a collection of handlers. This approach is relatively clean, and enables pluggable scheduling. Supporting a new scheduling goal simply involves writing a new scheduler plug-in, not re-engineering the entire storage system. Our hope is that split-level scheduling will inspire future vertical integration in storage stacks. Our source code is available at http://research.cs.wisc.edu/adsl/Software/split.