Slim: OS kernel support for a low-overhead container overlay network

Slim: OS kernel support for a low-overhead container overlay network Zhuo et al., NSDI’19

Container overlay networks rely on packet transformations, with each packet traversing the networking stack twice on its way from the sending container to the receiving container.

There are CPU, throughput, and latency overheads associated with those traversals.

In this paper, we ask whether we can design and implement a container overlay network, where packets go through the OS kernel’s network stack only once. This requires us to remove packet transformation from the overlay network’s data-plane. Instead, we implement network virtualization by manipulating connection-level metadata at connection setup time, saving CPU cycles and reducing packet latency.

Slim comes with some caveats: it requires a kernel module for secure deployment, has longer connection establishment times, doesn’t fit with packet-based network policies, and only handles TCP traffic. For UDP, ICMP, and for its own service discovery, it also relies on an existing container overlay network (Weave Net). But for longer lasting connections managed using connection-based network policies it delivers some impressive results:

  • memcached throughput up by 71%, with latency reduced by 42%, and CPU utilisation reduced by 56%.
  • Nginx CPU utilisation reduced by 22-24%
  • PostgreSQL CPU utilisation reduced by 22%
  • Apache Kafka CPU utilisation reduced by 10%

Since Slim both builds on and compares to Weave Net, I should say at this point that Weave Net was the very first open source project from Weaveworks, the “GitOps for Kubernetes” company. Accel is an investor in Weaveworks, and I am also a personal investor. If you’re using Kubernetes, you should definitely check them out. Anyway, on with the show…

Container networking

In theory there are four possible modes for container networking: a bridge mode for containers on the same host; host mode, in which containers use the IP address of their host network interface; macvlan mode (or similar hardware mechanisms), giving each container its own IP address; and overlay mode, in which each container is given its own virtual network interface and each application has its own network namespace.

In practice, there are management and deployment challenges with the bridge, host, and macvlan approaches, so overlay networks such as Weave Net are the preferred solution.

Overlay packets are encapsulated with host network headers when routed on the host network. This lets the container overlay network have its own IP address space and network configuration that is disjoint from that of the host network; each can be managed completely independently. Many container overlay network solutions are available today, such as Weave, Flannel, and Docker Overlay, all of which share similar internal architectures.
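These overlays typically encapsulate packets with VXLAN (or a similar UDP tunnel): Weave's fast datapath, Flannel's default backend, and Docker Overlay all use it. As a rough sketch of what encapsulation adds to every packet, here is the VXLAN header layout from RFC 7348 plus the outer headers; this is illustrative, not Slim's or Weave's code:

```c
/* Sketch of the extra bytes a VXLAN-style overlay adds to every packet.
 * Field layout follows RFC 7348; purely illustrative. */
#include <stdint.h>
#include <stdio.h>

struct vxlan_hdr {
    uint8_t  flags;        /* bit 3 set => VNI field is valid */
    uint8_t  reserved1[3];
    uint8_t  vni[3];       /* 24-bit virtual network identifier */
    uint8_t  reserved2;
};

int main(void) {
    /* Outer headers wrapped around every overlay packet on the wire: */
    size_t outer = 14 /* Ethernet */ + 20 /* IPv4 */ + 8 /* UDP */
                 + sizeof(struct vxlan_hdr); /* 8-byte VXLAN header */
    printf("per-packet encapsulation overhead: %zu bytes\n", outer); /* 50 */
    return 0;
}
```

Those 50 bytes are the wire overhead; the CPU overhead of building and stripping them on every packet is what the rest of this post is about.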

Overlay network overheads

The overheads in overlay networking come from the per-packet processing inside the OS kernel: delivering a packet on the overlay network requires one extra traversal of the network stack and also packet encapsulation and decapsulation.

To give an example, the authors took test measurements comparing Weave Net (in fast datapath mode) against host networking.

In this test, compared to direct host networking, for two containers on the same host (Intra) the overlay network reduces throughput by 23% and increases latency by 34%. For containers communicating across hosts (Inter), throughput reduces by 48% and latency increases by 85%. The overheads are lower when communicating on the same host since packet encapsulation is not required.

Compared to host networking, the CPU utilisation also increases by 93%.

There are several known techniques to reduce the data plane overhead. Packet steering creates multiple queues, one per CPU core, for a network interface and uses consistent hashing to map packets to the queues. In this way, packets in the same network connection are processed on a single core. Different cores never access the same queue, removing the overhead of multi-core synchronization.
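To make the steering idea concrete, here is a minimal sketch using a toy FNV-1a hash in place of the kernel's real flow hash; the only property that matters is that the hash is stable per flow, so all packets of one connection land on one per-core queue:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A connection's 4-tuple: every packet of the flow carries the same one. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Toy FNV-1a hash over the 4-tuple (the kernel uses its own flow hash). */
static uint32_t flow_hash(const struct flow *f) {
    const uint8_t *p = (const uint8_t *)f;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof *f; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

int main(void) {
    struct flow f = { 0x0a000001, 0x0a000002, 12345, 80 };
    unsigned ncores = 8;
    /* All packets of this flow map to the same queue, hence the same core,
     * so no two cores ever contend on one flow's packets. */
    printf("queue = %u\n", flow_hash(&f) % ncores);
    return 0;
}
```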

The authors integrated the above Receive Packet Steering (RPS), together with an enhancement called Receive Flow Steering (RFS), which further ensures that interrupt processing occurs on the same core as the application, into Weave Net. With this enhancement, throughput is within 9% of that achieved with host networking, but it makes almost no difference to latency.
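For reference, stock Linux already exposes RPS and RFS knobs through sysfs and procfs. A minimal sketch of switching them on from userspace (the device name eth0, the CPU mask, and the table sizes are placeholder choices; writes require root):

```c
/* Sketch: enabling RPS/RFS on a stock Linux host via sysfs/procfs.
 * The paper integrates equivalent steering into Weave Net itself. */
#include <stdio.h>

static int write_str(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    return fclose(f);
}

int main(void) {
    /* RPS: allow packets from rx queue 0 to be steered to CPUs 0-7 (mask ff). */
    write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "ff");
    /* RFS: size the global flow table, plus this queue's share of it. */
    write_str("/proc/sys/net/core/rps_sock_flow_entries", "32768");
    write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt", "4096");
    return 0;
}
```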

Introducing Slim

The big idea in Slim is to reduce CPU utilisation and latency overheads by having packets traverse the network stack only once. That means you can’t do per-packet processing. Instead, Slim works at the connection level.

Slim virtualizes the network by manipulating connection-level metadata. SlimSocket exposes the POSIX socket interface to application binaries to intercept invocations of socket-related system calls. When SlimSocket detects an application is trying to set up a connection, it sends a request to SlimRouter. After SlimRouter sets up the network connection, it passes access to the connection as a file descriptor to the process inside the container. The application inside the container then uses the host namespace file descriptor to send/receive packets directly to/from the host network. Because SlimSocket has the exact same interface as the POSIX socket, and Slim dynamically links SlimSocket into the application, the application binary need not be modified.
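To make the mechanism concrete, here is a minimal sketch (mine, not the authors' code) of the two key ingredients: an LD_PRELOAD-style shim that intercepts connect(), and file descriptor passing over a Unix domain socket via SCM_RIGHTS. The control socket path and wire format are invented for illustration:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

static int (*real_connect)(int, const struct sockaddr *, socklen_t);

/* Hypothetical control channel: reach the router over a Unix socket and
 * send it the destination the app asked for (path and wire format are
 * assumptions for illustration). */
static int router_request(const struct sockaddr *addr, socklen_t addrlen) {
    struct sockaddr_un ra = { .sun_family = AF_UNIX,
                              .sun_path   = "/var/run/slimrouter.sock" };
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s < 0) return -1;
    if (real_connect(s, (struct sockaddr *)&ra, sizeof ra) < 0 ||
        send(s, addr, addrlen, 0) != (ssize_t)addrlen) {
        close(s);
        return -1;
    }
    return s;
}

/* Receive one file descriptor over the control channel via SCM_RIGHTS. */
static int recv_fd(int ctrl) {
    char b;
    struct iovec iov = { .iov_base = &b, .iov_len = 1 };
    union { struct cmsghdr h; char space[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = &u, .msg_controllen = sizeof u };
    if (recvmsg(ctrl, &msg, 0) <= 0) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}

/* The shim: same signature as libc connect(), so the app needs no changes. */
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen) {
    if (!real_connect)
        real_connect = dlsym(RTLD_NEXT, "connect");

    int ctrl = router_request(addr, addrlen);
    if (ctrl < 0)                      /* router unavailable: plain connect */
        return real_connect(sockfd, addr, addrlen);

    int host_fd = recv_fd(ctrl);       /* host-namespace connection */
    close(ctrl);
    if (host_fd < 0)
        return real_connect(sockfd, addr, addrlen);

    dup2(host_fd, sockfd);             /* app keeps using its original fd */
    close(host_fd);
    return 0;
}
```

Built as a shared object (gcc -shared -fPIC slimshim.c -o slimshim.so -ldl) and injected with LD_PRELOAD, a shim like this sits underneath an unmodified binary, which is exactly the property SlimSocket relies on.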

Given that Slim is out of the picture once the connection is established, a separate mechanism is needed to support control plane and data plane policies. SlimRouter stores control plane policies and enforces them at connection setup time. If the policy changes, SlimRouter scans existing connections and removes the file descriptors for any connection violating the new policy. This requires the support of a kernel module, SlimKernModule. To avoid containers learning the IP addresses of host machines, SlimKernModule (in secure mode) also prohibits unsafe system calls on file descriptors (e.g. getpeername). Existing kernel mechanisms are used to enforce data plane policies.
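To illustrate why such calls need handling at all: the fd the application holds is a host-namespace socket, so a plain getpeername would return the physical host address. A userspace shim can hide this by answering from a table of virtual addresses recorded at setup time, as in the sketch below (the table and how it gets populated are assumptions). The catch, and the reason secure deployments need SlimKernModule, is that an application can bypass any userspace shim by issuing the raw system call directly:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/socket.h>

/* Toy per-process table mapping fd -> virtual (overlay) peer address.
 * Assumption: the connect()/accept() shims fill this in at setup time. */
static struct { int fd; struct sockaddr_storage peer; socklen_t len; } vmap[1024];
static int vmap_n;

int getpeername(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    static int (*real_getpeername)(int, struct sockaddr *, socklen_t *);
    if (!real_getpeername)
        real_getpeername = dlsym(RTLD_NEXT, "getpeername");

    /* If this fd is a host connection set up by the shim, report the
     * overlay peer address rather than leaking the physical one. */
    for (int i = 0; i < vmap_n; i++) {
        if (vmap[i].fd == sockfd) {
            socklen_t n = vmap[i].len < *addrlen ? vmap[i].len : *addrlen;
            memcpy(addr, &vmap[i].peer, n);
            *addrlen = n;
            return 0;
        }
    }
    return real_getpeername(sockfd, addr, addrlen);
}
```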

When Slim is used with blocking I/O, calls to the POSIX socket interface are intercepted by the SlimSocket shim and forwarded to SlimRouter. For non-blocking I/O (e.g., select, epoll), Slim also intercepts these API calls and maintains mappings for epoll file descriptor sets. SlimRouter needs to know the IP address and port mappings in order to establish connections, and it gets them using a Weave Net overlay network!

When the client calls connect, it actually creates an overlay network connection on the standard container overlay network. When the server receives an incoming connection on the standard overlay network, SlimSocket queries SlimRouter for the physical IP address and port and sends them to the client side inside the overlay connection. In secure mode (§4.3), the result queried from SlimRouter is encrypted. SlimSocket on the client side sends the physical IP address and port (encrypted if in secure mode) to its SlimRouter, and the SlimRouter establishes the host connection. This means connection setup takes longer in Slim than on container overlay networks based on packet transformation.
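Here is a sketch of the metadata that changes hands over the ordinary overlay connection before the host connection replaces it. The struct layout is an assumption; the paper specifies only that the server returns its physical IP and port, encrypted in secure mode:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <unistd.h>

struct slim_hello {              /* server -> client, over the overlay */
    uint32_t physical_ip;        /* host IP, network byte order */
    uint16_t physical_port;      /* host port, network byte order */
    uint8_t  secure;             /* 1 => fields above are encrypted blobs */
};

/* Server side: after accepting the overlay connection, tell the client
 * where the real host endpoint lives (values here are illustrative and
 * would come from SlimRouter). */
static int send_physical_endpoint(int overlay_fd) {
    struct slim_hello h = {0};
    inet_pton(AF_INET, "192.168.1.10", &h.physical_ip);
    h.physical_port = htons(9000);
    h.secure = 0;
    return write(overlay_fd, &h, sizeof h) == (ssize_t)sizeof h ? 0 : -1;
}
```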

Weave Net is also used for packets that require data plane handling such as ICMP and UDP.

Evaluation

Microbenchmarks compare Slim to Weave Net with RFS. Creating a TCP connection with Weave Net takes 270 µs. With Slim it takes 556 µs (440 µs in insecure mode). For applications with persistent connections, this additional overhead will not be significant. Compared to Weave Net, Slim reduces CPU overhead by 22-41%.

Slim and Weave Net are then further compared on four application workloads based on Memcached, Nginx, PostgreSQL, and Apache Kafka respectively.

For Memcached, Slim improves throughput by 71% and reduces latency by 42%, with 25% lower CPU utilisation.

For Nginx and PostgreSQL, the main advantage of Slim is reduced CPU utilisation (around a 22% reduction). For Kafka the CPU utilisation reduction is around 10%, but latency is also reduced by 28%.

Slim’s source code is available at https://github.com/danyangz/slim.