The Design and Implementation of Open vSwitch

The Design and Implementation of Open vSwitch – Pfaff et al. 2015

Another selection from this month’s NSDI 2015 programme, this time from the operational systems track. What inspired the creation of Open vSwitch? What has most influenced its design? And what’s next?

As virtualized (or containerized) workloads grew, physically provisioning networks to support them became far too slow and limiting. This led to the emergence of network virtualization:

In network virtualization, virtual switches become the primary provider of network services for VMs, leaving physical datacenter networks with transportation of IP tunneled packets between hypervisors. This approach allows the virtual networks to be decoupled from their underlying physical networks, and by leveraging the flexibility of general purpose processors, virtual switches can provide VMs, their tenants, and administrators with logical network abstractions, services and tools identical to dedicated physical networks.

Early virtual switches were not flexible enough for the rapidly changing world, and this allowed Open vSwitch to quickly gain popularity:

Today, on Linux, its original platform, Open vSwitch works with most hypervisors and container systems, including Xen, KVM, and Docker. Open vSwitch also works “out of the box” on the FreeBSD and NetBSD operating systems and ports to the VMware ESXi and Microsoft Hyper-V hypervisors are underway.

Open vSwitch strives to balance the performance needed by production workloads with the programmability required by network virtualization. In this last respect, Open vSwitch is an implementation of the OpenFlow switch specification. From the introduction to that specification:

An OpenFlow Switch consists of one or more flow tables and a group table, which perform packet lookups and forwarding, and an OpenFlow channel to an external controller. The switch communicates with the controller and the controller manages the switch via the OpenFlow protocol. Using the OpenFlow protocol, the controller can add, update, and delete flow entries in flow tables, both reactively (in response to packets) and proactively. Each flow table in the switch contains a set of flow entries; each flow entry consists of match fields, counters, and a set of instructions to apply to matching packets.
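To make the structure in that quote concrete, here is a minimal sketch (in C, the language Open vSwitch itself is written in) of a single flow table whose entries carry match fields, counters, and an action hook, with lookup returning the highest-priority matching entry. The struct layout, the handful of header fields, and the linear scan are purely illustrative; they are not the OpenFlow wire format or Open vSwitch's own data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct match {                         /* wildcarded header match */
    uint32_t ip_src, ip_src_mask;
    uint32_t ip_dst, ip_dst_mask;
    uint16_t tp_dst;                   /* 0 means "any" in this sketch */
};

struct flow_entry {
    struct match match;                /* match fields */
    int priority;                      /* higher wins */
    uint64_t n_packets, n_bytes;       /* counters */
    void (*apply)(void *pkt);          /* stand-in for the instruction set */
};

struct packet_hdr {
    uint32_t ip_src, ip_dst;
    uint16_t tp_dst;
    size_t len;
};

static bool match_covers(const struct match *m, const struct packet_hdr *h)
{
    return (h->ip_src & m->ip_src_mask) == (m->ip_src & m->ip_src_mask)
        && (h->ip_dst & m->ip_dst_mask) == (m->ip_dst & m->ip_dst_mask)
        && (m->tp_dst == 0 || m->tp_dst == h->tp_dst);
}

/* Scan one flow table and return the highest-priority matching entry,
 * or NULL, which OpenFlow treats as a table miss. */
struct flow_entry *flow_table_lookup(struct flow_entry *entries, size_t n,
                                     const struct packet_hdr *hdr)
{
    struct flow_entry *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (match_covers(&entries[i].match, hdr)
            && (!best || entries[i].priority > best->priority)) {
            best = &entries[i];
        }
    }
    if (best) {                        /* update the entry's counters */
        best->n_packets++;
        best->n_bytes += hdr->len;
    }
    return best;
}
```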

Open vSwitch needs to maximise the resources available for user workloads, cope with networks where a switch may have thousands of peers in a mesh of point-to-point IP tunnels between hypervisors, handle constant changes to forwarding state as VMs boot, migrate, and shut down, and efficiently manage long packet processing pipelines.

The flexibility of OpenFlow was essential in the early days of SDN but it quickly became evident that advanced use cases, such as network virtualization, result in long packet processing pipelines, and thus higher classification load than traditionally seen in virtual switches. To prevent Open vSwitch from consuming more hypervisor resources than competitive virtual switches, it was forced to implement flow caching.

The dominant factors in the design of Open vSwitch appear to be the trade-offs between userspace and in-kernel functionality, and the flow caching mechanism for fast routing of packets.

The user-space daemon ovs-vswitchd is essentially the same across all environments. The datapath kernel module is written specifically for each host OS for performance.

In 2007, when the development of the code that would become Open vSwitch started on Linux, only in-kernel packet forwarding could realistically achieve good performance, so the initial implementation put all OpenFlow processing into a kernel module.

But developing in the kernel and distributing and updating kernel modules made this impractical, and it became clear that an in-kernel OpenFlow implementation would not be accepted upstream into Linux. So all of the OpenFlow processing was moved into a user-space daemon, and the kernel module became a much simpler cache with hash-based lookup, called a microflow cache.

This allowed radical simplification, by implementing the kernel module as a simple hash table rather than as a complicated, generic packet classifier, supporting arbitrary fields and masking. In this design, cache entries are extremely fine-grained and match at most packets of a single transport connection: even for a single transport connection, a change in network path and hence in IP TTL field would result in a miss, and would divert a packet to userspace, which consulted the actual OpenFlow flow table to decide how to forward it.
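A rough sketch of that original microflow design follows, under some loudly labelled assumptions: a hypothetical key layout, a toy direct-mapped table with no collision handling or eviction policy, and a stub standing in for the upcall to the userspace daemon. The point it illustrates is the one above: the cache matches on so many fields (TTL included) that any difference produces a miss and a trip to userspace.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Every field the cache matches on; zero-initialise keys (memset) so the
 * struct's padding bytes hash and compare consistently.  Hypothetical
 * layout -- the real datapath key covers more protocols. */
struct microflow_key {
    uint8_t  eth_src[6], eth_dst[6];
    uint16_t eth_type;
    uint32_t ip_src, ip_dst;
    uint8_t  ip_proto, ip_ttl;         /* TTL included: a path change misses */
    uint16_t tp_src, tp_dst;
};

struct microflow_entry {
    struct microflow_key key;
    int in_use;
    int actions;                       /* stand-in for the cached action list */
};

#define MICRO_BUCKETS 1024
static struct microflow_entry cache[MICRO_BUCKETS];  /* direct-mapped */

static uint32_t key_hash(const struct microflow_key *k)
{
    const uint8_t *p = (const uint8_t *) k;
    uint32_t h = 2166136261u;          /* FNV-1a over the whole key */
    for (size_t i = 0; i < sizeof *k; i++) {
        h = (h ^ p[i]) * 16777619u;
    }
    return h;
}

/* Hypothetical stand-in for the slow path: the real system queues the
 * packet to ovs-vswitchd, which runs the full OpenFlow pipeline. */
static int upcall_to_userspace(const struct microflow_key *k)
{
    (void) k;
    return 0;
}

/* Exact-match lookup: a hit executes the cached actions; any field
 * difference is a miss, handled in userspace, whose decision is then
 * cached for the next packet of the same microflow. */
int microflow_forward(const struct microflow_key *k)
{
    struct microflow_entry *e = &cache[key_hash(k) % MICRO_BUCKETS];
    if (e->in_use && memcmp(&e->key, k, sizeof *k) == 0) {
        return e->actions;             /* fast path */
    }
    e->key = *k;                       /* slow path: ask userspace, then cache */
    e->actions = upcall_to_userspace(k);
    e->in_use = 1;
    return e->actions;
}
```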

Although the microflow cache worked well in many situations, it unfortunately suffered from serious performance degradation when faced with large numbers of short-lived connections. This resulted in a compromise split between kernel and userspace:

We replaced the microflow cache with a megaflow cache. The megaflow cache is a single flow lookup table that supports generic matching, i.e., it supports caching forwarding decisions for larger aggregates of traffic than connections. While it more closely resembles a generic OpenFlow table than the microflow cache does, due to its support for arbitrary packet field matching, it is still strictly simpler and lighter in runtime for two primary reasons. First, it does not have priorities, which speeds up packet classification… Second, there is only one megaflow classifier, instead of a pipeline of them, so userspace installs megaflow entries that collapse together the behavior of all relevant OpenFlow tables.
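The megaflow lookup described there is a tuple space search: one exact-match table per distinct wildcard mask, probed with the masked packet key. Below is a compressed sketch with a cut-down, hypothetical flow key and a linear array standing in for each per-mask hash table; because there are no priorities, the first hit in any table can be returned immediately.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cut-down, hypothetical flow key; the real megaflow key spans metadata
 * and the full L2/L3/L4 headers. */
struct mega_key {
    uint32_t ip_src, ip_dst;
    uint16_t tp_dst;
    uint8_t  ip_proto;
    uint8_t  pad;                      /* keep masking/memcmp well defined */
};

struct mega_entry {
    struct mega_key masked_key;        /* key with wildcarded bits zeroed */
    int actions;
};

/* One table per distinct wildcard mask.  The real classifier hashes the
 * masked key into a per-mask hash table; a linear array keeps the sketch
 * short. */
struct mega_table {
    struct mega_key mask;              /* which bits this table matches on */
    const struct mega_entry *entries;
    size_t n;
};

static void mask_key(struct mega_key *dst, const struct mega_key *key,
                     const struct mega_key *mask)
{
    const uint8_t *k = (const uint8_t *) key, *m = (const uint8_t *) mask;
    uint8_t *d = (uint8_t *) dst;
    for (size_t i = 0; i < sizeof *dst; i++) {
        d[i] = k[i] & m[i];            /* zero the bits this mask ignores */
    }
}

/* Probe each mask's table in turn.  There are no priorities, so the
 * first hit anywhere wins; a miss in every table means a trip to
 * userspace and the full OpenFlow pipeline. */
int megaflow_lookup(const struct mega_table *tables, size_t n_tables,
                    const struct mega_key *key, int *actions)
{
    for (size_t t = 0; t < n_tables; t++) {
        struct mega_key masked;
        mask_key(&masked, key, &tables[t].mask);
        for (size_t i = 0; i < tables[t].n; i++) {
            if (memcmp(&masked, &tables[t].entries[i].masked_key,
                       sizeof masked) == 0) {
                *actions = tables[t].entries[i].actions;
                return 1;              /* hit */
            }
        }
    }
    return 0;                          /* miss */
}
```

Lookup cost therefore grows with the number of distinct masks in use, which is part of why the optimisations described next matter.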

The microflow cache is retained as a first-level cache, consulted before the megaflow cache. Several optimisations were also made to megaflow classification, including staging hash lookups across the metadata, L2, L3, and L4 field groups (so that a lookup which fails at an early stage never examines, and so never needs to match on, the higher-layer fields) and a trie structure to optimise matching on IPv4 and IPv6 prefixes.
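Putting the two levels together, here is one plausible wiring consistent with the description above, again with hypothetical names and stubbed-out helpers (mega_table_probe, megaflow_search, upcall_to_userspace): the first-level, exact-match cache remembers which megaflow table matched a given microflow last time, so the common case is one hash probe in the microflow cache plus one probe of a single megaflow table, and only a miss in both levels sends the packet to ovs-vswitchd. The staged hashing and prefix tries are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct flow_key { uint32_t ip_src, ip_dst; uint16_t tp_src, tp_dst; };

/* Hypothetical stand-ins: a per-mask megaflow table (see the previous
 * sketch), a probe of one such table, a search over all of them, and the
 * upcall that hands a packet to ovs-vswitchd. */
struct mega_table { int placeholder; };

static int mega_table_probe(const struct mega_table *t,
                            const struct flow_key *k, int *actions)
{
    (void) t; (void) k; (void) actions;
    return 0;                          /* returns 1 on a hit */
}

static int megaflow_search(const struct mega_table *tables, size_t n,
                           const struct flow_key *k, int *actions,
                           size_t *which_table)
{
    (void) tables; (void) n; (void) k; (void) actions; (void) which_table;
    return 0;                          /* returns 1 on a hit */
}

static int upcall_to_userspace(const struct flow_key *k)
{
    (void) k;
    return 0;                          /* actions chosen by userspace */
}

/* First-level cache: exact match on the microflow, remembering which
 * megaflow mask matched it last time. */
#define EMC_SLOTS 256
struct emc_entry { struct flow_key key; size_t table_idx; int valid; };
static struct emc_entry emc[EMC_SLOTS];

static uint32_t key_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *) k;
    uint32_t h = 2166136261u;          /* FNV-1a over the key bytes */
    for (size_t i = 0; i < sizeof *k; i++) {
        h = (h ^ p[i]) * 16777619u;
    }
    return h;
}

int datapath_forward(const struct mega_table *tables, size_t n_tables,
                     const struct flow_key *k)
{
    struct emc_entry *e = &emc[key_hash(k) % EMC_SLOTS];
    int actions;
    size_t which;

    /* Level 1: if this exact microflow was seen recently, probe only the
     * megaflow table that matched it last time. */
    if (e->valid && e->table_idx < n_tables
        && memcmp(&e->key, k, sizeof *k) == 0
        && mega_table_probe(&tables[e->table_idx], k, &actions)) {
        return actions;
    }

    /* Level 2: fall back to probing every mask's table. */
    if (megaflow_search(tables, n_tables, k, &actions, &which)) {
        e->key = *k;                   /* remember the winning mask */
        e->table_idx = which;
        e->valid = 1;
        return actions;
    }

    /* Miss in both levels: full OpenFlow processing in userspace, which
     * also installs a new megaflow covering this traffic. */
    return upcall_to_userspace(k);
}
```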

An evaluation of a dataset containing 24 hours of data from 1,000 hypervisors running Open vSwitch showed that small megaflow caches are sufficient in practice (50% of hypervisors had fewer than 107 megaflows; at the 99th percentile there were 7,033), and that cache hit rates are high (97.7%). 80% of the hypervisors used 5% or less of their CPU cycles on ovs-vswitchd (the cost of the kernel module is not easily separable). With a simple configuration, performance is equivalent to the in-kernel Linux Bridge; as more rules are added, Open vSwitch performs much better and introduces less overhead.

Future directions include improving network performance through user space networking with DPDK.

Improving virtual switch performance through userspace networking is a timely topic due to NFV. In this model, packets are passed directly from the NIC to the VM with minimal intervention by the hypervisor userspace/kernel, typically through shared memory between the NIC, the virtual switch, and the VMs. To this end, there is an ongoing effort to add both DPDK and netmap support to Open vSwitch. Early tests indicate that the Open vSwitch caching architecture brings similar benefits in this userspace context as it does in the kernel flow cache.

The Linux community is also working to reduce the overhead of going through the kernel.

The authors conclude:

Open vSwitch has simple origins but its performance has been gradually optimized to match the requirements of multi-tenant datacenter workloads, which has necessitated a more complex design. Given its operating environment, we anticipate no change of course but expect its design only to become more distinct from traditional network appliances over time.