IX: A protected dataplane operating system for high throughput and low latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency – Belay et al., OSDI 2014

This is the second of Simon Peter’s recommended papers in the ‘Data Center OS Design’ Research for Practice guide. Like Arrakis, IX splits the operating system into a control plane and data plane for networking. To quote Simon Peter:

The papers differ in how network I/O policy is enforced. Arrakis reaches for utmost performance by relying on hardware to enforce per-application maximum I/O rates and allowed communication peers. IX trades performance for software control over network I/O, thus allowing the precise enforcement of the I/O behavior of a particular network protocol, such as TCP congestion control.

Belay et al. describe four challenges facing large-scale data center applications:

  1. Microsecond tail latency, required to enable rich interactions between large numbers of services without impacting the user experience. When user requests can involve hundreds of servers, the impact of The Tail at Scale means that each service node should ideally provide tight bounds on its 99th-percentile latency.
  2. High packet rates sustainable under large numbers of concurrent connections and high connection churn.
  3. Isolation between applications when multiple services share servers.
  4. Resource efficiency.

This should be possible to address with the wealth of hardware resources in modern servers…

Unfortunately, commodity operating systems have been designed under very different hardware assumptions. Kernel schedulers, networking APIs, and network stacks have been designed under the assumptions of multiple applications sharing a single processing core and packet inter-arrival times being many times higher than the latency of interrupts and system calls. As a result, such operating systems trade off both latency and throughput in favor of fine-grained resource scheduling.

The overheads associated with this limit throughput. Furthermore, requests between service tiers of datacenter apps often consist of small packets, meaning that common NIC hardware optimizations have only a marginal impact. IX draws inspiration from middleboxes (firewalls, load-balancers, and software routers, for example) that also need microsecond latency and high packet rates, and extends their approach to support untrusted, general-purpose applications.

There are four principles that underpin the design of IX:

  1. Separation and protection of control plane and data plane
  2. Run to completion with adaptive batching
  3. Native zero-copy API with explicit control flow
  4. Flow consistent, synchronization free processing

Control Plane and Data Plane

The control plane is responsible for resource configuration, provisioning, scheduling, and monitoring. The data plane runs the networking stack and application logic. A data plane is allocated a dedicated core, large memory pages, and dedicated NIC queues. Each IX dataplane supports a single multithreaded application, operating as a single address-space OS.

We use modern virtualization hardware to provide three-way isolation between the control plane, the dataplane, and untrusted user code.

Within the dataplane there are elastic threads which initiate and consume network I/O, and background threads to perform application-specific tasks. Elastic threads must not block, and have exclusive use of a core or hardware thread. Multiple background threads may share an allocated hardware thread. The implementation is based on Dune and requires the VT-x virtualization features of Intel x86-64 systems. “However, it could be ported to any architecture with virtualization support.”
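
To make the threading model a little more concrete, here is a rough sketch of an elastic thread with exclusive use of a hardware thread alongside background threads sharing another. It uses ordinary POSIX/Linux calls rather than anything IX-specific; the core numbers and function names are purely illustrative.

    /* Illustrative sketch only: pinning an 'elastic' thread to a dedicated
     * hardware thread while background threads share another core, written
     * with ordinary POSIX/Linux calls. Core numbers are made up; IX itself
     * does this inside its Dune-based dataplane, not via pthreads. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pin_to_core(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    static void *elastic_loop(void *arg)
    {
        (void)arg;
        for (;;) {
            /* Never blocks: poll the dedicated NIC queues and run
             * protocol + application processing to completion. */
        }
        return NULL;
    }

    static void *background_work(void *arg)
    {
        (void)arg;
        /* Application-specific housekeeping; may block, shares its core. */
        return NULL;
    }

    int main(void)
    {
        pthread_t elastic, bg1, bg2;

        pthread_create(&elastic, NULL, elastic_loop, NULL);
        pin_to_core(elastic, 1);              /* exclusive hardware thread */

        pthread_create(&bg1, NULL, background_work, NULL);
        pthread_create(&bg2, NULL, background_work, NULL);
        pin_to_core(bg1, 2);                  /* background threads share  */
        pin_to_core(bg2, 2);                  /* a single hardware thread  */

        pthread_join(elastic, NULL);
        return 0;
    }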

Run to completion and adaptive batching

IX dataplanes run to completion all stages needed to receive and transmit a packet, interleaving protocol processing (kernel mode) and application logic (user mode) at well-defined transition points.

This avoids any need for intermediate buffering. Batching is applied at every stage of the network stack, including system calls and queues. This leaps out at me as something traditionally at odds with low latency. However, there are mechanisms in place to combat this: batching is adaptive and only ever occurs in the presence of congestion, and upper bounds are set on the number of batched packets.

Using batching only on congestion allows us to minimize the impact on latency, while bounding the batch size prevents the live set from exceeding cache capacities and avoids transmit queue starvation.

At higher loads with their test workload, setting the batch size to 16 as opposed to 1 improved throughput by 29%.
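
As a sketch of the idea, a run-to-completion loop with bounded, adaptive batching might look something like the following; the helper functions are placeholders of my own, not IX system calls:

    #define MAX_BATCH 16   /* upper bound keeps the live set within cache capacity */

    struct pkt { int len; };   /* minimal stand-in for a packet descriptor */

    /* Placeholder helpers: stand-ins for the dataplane's RX poll, combined
     * protocol + application processing, and TX flush. Not real IX calls. */
    static int  poll_rx_queue(struct pkt **pkts, int max) { (void)pkts; (void)max; return 0; }
    static void process_to_completion(struct pkt *p)      { (void)p; }
    static void flush_tx(void)                            { }

    static void elastic_thread_loop(void)
    {
        struct pkt *batch[MAX_BATCH];

        for (;;) {
            /* Under light load the poll finds few (often 0 or 1) packets, so
             * batching adds no queueing delay; the batch only fills up when
             * packets have already accumulated, i.e. under congestion. */
            int n = poll_rx_queue(batch, MAX_BATCH);

            for (int i = 0; i < n; i++)
                process_to_completion(batch[i]);   /* protocol + app logic,
                                                      no intermediate buffering */
            flush_tx();
        }
    }

The MAX_BATCH of 16 in the sketch simply mirrors the bound used in the experiment above.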

Zero-copy API

IX eschews the POSIX API in favour of true zero-copy operations in both directions.

  The dataplane and application cooperatively manage the message buffer pool. Incoming packets are mapped read-only into the application, which may hold onto message buffers and return them to the dataplane at a later point. The application sends to the dataplane scatter/gather lists of memory locations for transmission but, since contents are not copied, the application must keep the content immutable until the peer acknowledges reception.

The dataplane enforces flow-control correctness, while the application controls transmit buffering.
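
To give a flavour of the zero-copy contract, here is a small hypothetical sketch; the structure and function names (sg_entry, dp_sendv, dp_recv_done) are my own stand-ins rather than the actual IX API:

    #include <string.h>

    /* Hypothetical stand-ins for the dataplane calls described above;
     * these are not the real IX system-call names. */
    struct sg_entry {
        const void *base;    /* application-owned memory; must stay immutable
                                until the peer acknowledges reception        */
        size_t      len;
    };

    static void dp_sendv(long flow, const struct sg_entry *sg, int nr)
    {
        /* In IX the dataplane would transmit directly from these buffers;
         * nothing is copied. Here we just pretend. */
        (void)flow; (void)sg; (void)nr;
    }

    static void dp_recv_done(const void *mbuf)
    {
        /* Hand a read-only incoming message buffer back to the shared pool
         * once the application has finished with it. */
        (void)mbuf;
    }

    static void send_response(long flow, const char *header, const char *body)
    {
        /* Zero-copy transmit: pass pointers, not copies. Neither buffer may
         * be modified or freed until the data has been acknowledged. */
        struct sg_entry sg[2] = {
            { header, strlen(header) },
            { body,   strlen(body)   },
        };
        dp_sendv(flow, sg, 2);
    }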

Synchronization free processing

We use multi-queue NICs with receive-side scaling (RSS [43]) to provide flow-consistent hashing of incoming traffic to distinct hardware queues. Each hardware thread (hyperthread) serves a single receive and transmit queue per NIC, eliminating the need for synchronization and coherence traffic between cores in the networking stack.

It is the job of the control plane to map RSS flow groups to queues to balance traffic.

Elastic threads in the dataplane also operate in a synchronization- and coherence-free manner in the common case. System call implementations are designed to be commutative, following the Scalable Commutativity Law. Each elastic thread manages its own dedicated resources, and the use of flow-consistent hashing at the NICs ensures that each thread operates on a disjoint subset of TCP flows.
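
To see why this removes the need for locks, here is a toy illustration of flow-consistent dispatch. Real NICs compute a Toeplitz hash over these header fields (that is what RSS does); the simple mixer below just shows the idea that every packet of a flow lands on the same queue, and hence on the same core and the same per-thread state:

    #include <stdint.h>

    /* Toy flow-consistent dispatch: every packet of a given TCP 4-tuple maps
     * to the same queue, so the thread serving that queue owns the flow's
     * state outright and never takes a lock. Real NICs use a Toeplitz hash
     * over the same fields (RSS); this mixer is only for illustration. */
    struct flow_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static unsigned pick_queue(const struct flow_tuple *f, unsigned nr_queues)
    {
        uint32_t h = f->src_ip ^ f->dst_ip ^
                     (((uint32_t)f->src_port << 16) | f->dst_port);
        h ^= h >> 16;          /* cheap avalanche step            */
        h *= 0x9e3779b1u;      /* multiply by an odd 32-bit const */
        h ^= h >> 16;
        return h % nr_queues;  /* same flow -> same queue -> same core */
    }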

Evaluation

The authors evaluate IX against a baseline Linux kernel and against mTCP. IX achieves higher ‘goodput’ than both of these in the NetPIPE ping-pong benchmark with a 10GbE network setup, which is an indication of its lower latency.

In a throughput and scalability test, IX needed only 3 cores to saturate a 10GbE link, whereas mTCP required all 8 cores. IX achieved 1.9x the throughput of mTCP, and 8.8x that of Linux. IX achieved line rate and was limited only by the 10GbE bandwidth. When testing ability to handle a large number of concurrent connections, IX performed 10x better than Linux at its peak.

Finally, the team test two memcached workloads and report average and 99th-percentile latency at different throughputs. The Linux baseline setup was carefully tuned, following established practice, so that 99th-percentile latency did not exceed 200-500µs. The ETC workload represents the highest-capacity deployment at Facebook, and the USR workload is the deployment with the most GET requests (99%).

Table 2 shows the results for a 99%-ile SLA of under 500μs: