Heracles: Improving Resource Efficiency at Scale

Heracles: Improving Resource Efficiency at Scale – Lo et al. 2015

Until recently, scaling from Moore’s law provided higher compute per dollar with every server generation, allowing datacenters to scale without raising the cost. However, with several imminent challenges in technology scaling, alternate approaches are needed.

Those approaches involve increasing server utilization, which is still surprisingly low (10-50% in many commercial datacenters). Improving on this requires sharing resources between different applications, and in particular between latency-critical (LC) and best-effort (BE) tasks. Using latency-critical and batch workloads from Google, Heracles demonstrates an unprecedented 90% average server utilisation without latency violations.

Increasing utilisation has follow-on benefits for power consumption and for TCO.

Looking at the power utilization, Heracles allows significant improvements to energy efficiency. Consider the 20% load case: EMU (Effective Machine Utilisation) was raised by a significant amount, from 20% to 60%-90%. However, the CPU power only increased from 60% to 80%. This translates to an energy efficiency gain of 2.3-3.4x. Overall, Heracles achieves significant gains in resource efficiency across all loads for the LC task without causing SLO violations.
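The 2.3-3.4x figure can be reproduced with a little arithmetic. The sketch below assumes that energy efficiency here means useful work (EMU) per unit of CPU power, with both expressed as fractions of their maximum; that reading is my assumption rather than a definition taken from the paper, and the function name is made up for illustration.

```python
def efficiency_gain(emu_before, power_before, emu_after, power_after):
    """Ratio of (EMU per unit of CPU power) after colocation to before.

    Assumption: energy efficiency is modelled as EMU divided by CPU power,
    with both inputs given as fractions (0.20 means 20%).
    """
    return (emu_after / power_after) / (emu_before / power_before)


# 20% load case from the quote: EMU goes from 20% to 60%-90%,
# while CPU power goes from 60% to 80%.
low = efficiency_gain(0.20, 0.60, 0.60, 0.80)
high = efficiency_gain(0.20, 0.60, 0.90, 0.80)
print(low, high)
```

Under that assumption the two ends of the range come out at about 2.25x and 3.4x, which lines up with the quoted 2.3-3.4x.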

The increased power costs per CPU are more than offset by the overall throughput and TCO savings.

Heracles’ ability to raise utilization to 90% translates to a 15% throughput/TCO improvement over the baseline. This improvement includes the cost of the additional power consumption at higher utilization…. If we assume a cluster for LC workloads utilized at an average of 20%, as many industry studies suggest, Heracles can achieve a 306% increase in throughput/TCO.

In other words, Google can run their datacenters at about a third of the cost of a typical company, and the gap keeps widening. Yes, they use Borg and containers, but Heracles takes resource sharing to a whole new level, and to do that, Google needed to take isolation to a whole new level.

The challenge lies with latency-critical applications and the difficulty of sharing resources without violating Service Level Objectives (SLOs).

These user-facing services are typically scaled across thousands of servers and access distributed state stored in memory or Flash across these servers. While their load varies significantly due to diurnal patterns and unpredictable spikes in user accesses, it is difficult to consolidate load on a subset of highly utilized servers because the application state does not fit in a small number of servers and moving state is expensive. The cost of such underutilization can be significant.

Why can’t we just put non-latency sensitive tasks (best effort) on the same machines and boot them off when things get tight?

A promising way to improve efficiency is to launch best-effort batch (BE) tasks on the same servers and exploit any resources underutilized by LC workloads… The main challenge of this approach is interference between colocated workloads on shared resources such as caches, memory, I/O channels, and network links. LC tasks operate with strict service level objectives (SLOs) on tail latency, and even small amounts of interference can cause significant SLO violations.

Hence most work on utilisation to date has focused on throughput workloads. Heracles exploits new hardware isolation features as well as software isolation to enable resource sharing between LC and BE tasks:

Recently introduced hardware features for cache isolation and fine-grained power control allow us to improve colocation. This work aims to enable aggressive colocation of LC workloads and BE jobs by automatically coordinating multiple hardware and software isolation mechanisms in modern servers. We focus on two hardware mechanisms, shared cache partitioning and fine-grained power/frequency settings, and two software mechanisms, core/thread scheduling and network traffic control. Our goal is to eliminate SLO violations at all levels of load for the LC job while maximizing the throughput for BE tasks.

Primary Causes of Interference

Before we can improve isolation, we need to understand what the primary sources of interference are:

The primary shared resources are the cores in one or more CPU sockets. A static partitioning of cores to LC and BE tasks is not flexible enough. Intel HyperThreads further complicate matters as a HyperThread executing a BE task can interfere with an LC task on instruction bandwidth, shared L1/L2 cache, and TLBs.
Interference on the shared last-level cache (LLC) is detrimental for colocated tasks. “To address this issue, Intel has recently introduced LLC cache partitioning in server chips. This functionality is called Cache Allocation Technology (CAT), and it enables way-partitioning of a highly-associative LLC into several subsets of smaller associativity.”
Latency Critical services put pressure on DRAM bandwidth, and hence are sensitive to DRAM bandwidth interference. There are no hardware isolation mechanisms to deal with this in commercially available chips.
Network traffic interference can also be an issue and requires the use of dynamic traffic control mechanisms.
Finally, power is an additional source of interference between colocated tasks:

All modern multi-core chips have some form of dynamic overclocking, such as Turbo Boost in Intel chips and Turbo Core in AMD chips. These techniques opportunistically raise the operating frequency of the processor chip higher than the nominal frequency in the presence of power headroom… the performance of LC tasks can suffer from unexpected drops in frequency due to colocated tasks. […] A dynamic solution that adjusts the allocation of power between cores is needed to ensure that LC cores run at a guaranteed minimum frequency while maximizing the frequency of cores for BE tasks.

Not considered by Heracles (because the LC workloads under consideration by the authors do not use disks or SSDs), storage can also be a source of interference:

The LC workloads we evaluated do not use disks or SSDs in order to meet their aggressive latency targets. Nevertheless, disk and SSD isolation is quite similar to network isolation. Thus, the same principles and controls used to mitigate network interference still apply.

The challenge in managing all of this is that the mechanisms interact:

A major challenge with colocation is cross-resource interactions. A BE task can cause interference in all the shared resources discussed. Similarly, many LC tasks are sensitive to interference on multiple resources. Therefore, it is not sufficient to manage one source of interference: all potential sources need to be monitored and carefully isolated if need be. In addition, interference sources interact with each other… In theory, the number of possible interactions scales with the square of the number of interference sources, making this a very difficult problem.

Isolation in Heracles

Heracles can place one LC workload with several BE tasks. “Since BE tasks are abundant, this is sufficient to raise utilization in many datacenters. We leave colocation of multiple LC workloads to future work.” It uses a hierarchy of controllers managing four different isolation mechanisms.

Cores

For core isolation, Heracles uses Linux’s cpuset cgroups to pin the LC workload to one set of cores and BE tasks to another set (software mechanism). This mechanism is necessary, since in §3 we showed that core sharing is detrimental to latency SLO. Moreover, the number of cores per server is increasing, making core segregation finer-grained. The allocation of cores to tasks is done dynamically. The speed of core (re)allocation is limited by how fast Linux can migrate tasks to other cores, typically in the tens of milliseconds.

A top-level controller polls the load and tail-latency of the LC tasks every 15 seconds. When the load drops below 80%, BE execution is enabled. If the latency slack – the difference between the SLO and the current tail latency – is negative, then BE execution is disabled.

If slack is less than 10%, the subcontrollers are instructed to disallow growth for BE tasks in order to maintain a safety margin. If slack drops below 5%, the subcontroller for cores is instructed to switch cores from BE tasks to the LC workload.
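The rules above can be sketched as a single decision function. This is a minimal illustration, not the paper's algorithm: it assumes slack is normalised as a fraction of the SLO, that BE execution keeps its previous state when load is at or above 80% (the text above does not say what happens there), and the `Decision` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    be_enabled: bool          # may BE tasks run at all?
    be_growth_allowed: bool   # may subcontrollers give BE more resources?
    shift_cores_to_lc: bool   # should cores move from BE back to LC?


def top_level_decision(load, tail_latency, slo, be_enabled):
    """One polling step of a top-level controller (sketch).

    load is the LC load as a fraction of peak; tail_latency and slo share
    the same unit. Assumption: slack is expressed as a fraction of the SLO.
    """
    slack = (slo - tail_latency) / slo
    if slack < 0:
        # SLO already violated: stop all BE execution.
        return Decision(False, False, False)
    if load < 0.80:
        be_enabled = True
    if not be_enabled:
        # Assumption: at or above 80% load we keep BE in its previous state.
        return Decision(False, False, False)
    return Decision(
        be_enabled=True,
        be_growth_allowed=slack >= 0.10,  # below 10% slack: freeze BE growth
        shift_cores_to_lc=slack < 0.05,   # below 5% slack: give cores back
    )
```

In a real deployment something like this would be called once per 15-second polling interval, with the returned decision handed to the subcontrollers.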

Last-Level Cache

Heracles uses the Cache Allocation Technology (CAT) available in recent Intel chips (hardware mechanism). CAT implements way-partitioning of the shared LLC. In a highly-associative LLC, this allows us to define non-overlapping partitions at the granularity of a few percent of the total LLC capacity. We use one partition for the LC workload and a second partition for all BE tasks. Partition sizes can be adjusted dynamically by programming model specific registers (MSRs), with changes taking effect in a few milliseconds.
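To make way-partitioning a little more concrete: CAT partitions are described by capacity bitmasks with one bit per cache way, and Intel documents that the set bits in a mask must be contiguous. The sketch below only computes two non-overlapping masks; it does not touch any MSRs. The 20-way LLC in the usage note is an assumed example (each way would then be 5% of capacity), and the function name is hypothetical.

```python
def cat_masks(total_ways, be_ways):
    """Return (lc_mask, be_mask): two contiguous, non-overlapping way masks.

    BE tasks get the be_ways low-order ways; the LC workload gets the rest.
    """
    if not 1 <= be_ways < total_ways:
        raise ValueError("each partition needs at least one way")
    be_mask = (1 << be_ways) - 1                  # low-order ways for BE
    lc_mask = ((1 << total_ways) - 1) ^ be_mask   # all remaining ways for LC
    return lc_mask, be_mask


# Assumed example: a 20-way LLC with 2 ways (10%) handed to BE tasks.
lc_mask, be_mask = cat_masks(20, 2)
print(hex(lc_mask), hex(be_mask))
```

Growing the BE partition is then just a matter of recomputing both masks with a larger `be_ways` and reprogramming them.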

DRAM

We enforce DRAM bandwidth limits in the following manner: we implement a software monitor that periodically tracks the total bandwidth usage through performance counters and estimates the bandwidth used by the LC and BE jobs. If the LC workload does not receive sufficient bandwidth, Heracles scales down the number of cores that BE jobs use.

DRAM controllers provide registers that track bandwidth usage. When total usage reaches 90% of peak streaming DRAM bandwidth, cores are removed from BE tasks.

Heracles uses an offline model that describes the DRAM bandwidth used by the latency-sensitive workloads at various loads, core, and LLC allocations. We verified that this model needs to be regenerated only when there are significant changes in the workload structure and that small deviations are fine. There is no need for any offline profiling of the BE tasks, which can vary widely compared to the better managed and understood LC workloads. There is also no need for offline analysis of interactions between latency-sensitive and best effort tasks. Once we have hardware support for per-core DRAM bandwidth accounting, we can eliminate this offline model.
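A rough sketch of the bandwidth check described above follows. It assumes the offline model has already been consulted to give the LC workload's expected bandwidth at the current load and allocation, attributes whatever is left of the measured total to BE tasks, and removes one BE core per check when the total passes 90% of peak. The one-core-at-a-time step and the function name are my assumptions.

```python
def dram_check(total_bw, peak_bw, lc_model_bw, be_cores):
    """Return (new_be_cores, estimated_be_bw) after one monitoring interval.

    total_bw:    measured total DRAM bandwidth (from performance counters)
    peak_bw:     peak streaming DRAM bandwidth of the machine
    lc_model_bw: LC bandwidth predicted by the offline model
    All bandwidths share one unit, e.g. GB/s.
    """
    # With no per-core accounting, BE usage is estimated by subtraction.
    be_bw = max(0.0, total_bw - lc_model_bw)
    if total_bw > 0.90 * peak_bw and be_cores > 0:
        # Saturation: take a core away from BE (assumed step size of one).
        be_cores -= 1
    return be_cores, be_bw
```

The subtraction is exactly what per-core DRAM bandwidth accounting in hardware would make unnecessary.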

Combined Core and Cache Allocation Controller

Heracles uses a single subcontroller for core and cache allocation due to the strong coupling between core count, LLC needs, and memory bandwidth needs. If there was a direct way to isolate memory bandwidth, we would use independent controllers.

BE tasks are added to the mix in two phases: first their share of LLC is increased, and then their share of cores…

Initially, a BE job is given one core and 10% of the LLC and starts in the GROW_LLC phase. Its LLC allocation is increased as long as the LC workload meets its SLO, bandwidth saturation is avoided, and the BE task benefits. The next phase (GROW_CORES) grows the number of cores for the BE job. Heracles will reassign cores from the LC to the BE job one at a time, each time checking for DRAM bandwidth saturation and SLO violations for the LC workload. If bandwidth saturation occurs first, the subcontroller will return to the GROW_LLC phase. The process repeats until an optimal configuration has been converged upon. The search also terminates on a signal from the top-level controller indicating the end to growth or the disabling of BE jobs. The typical convergence time is about 30 seconds.
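The two-phase search can be pictured as a small state machine. The sketch below follows the prose description only: the 5% LLC step, the choice to hold steady in GROW_CORES when the SLO is under pressure, and all names other than the two phase names are assumptions for illustration.

```python
from enum import Enum


class Phase(Enum):
    GROW_LLC = "GROW_LLC"
    GROW_CORES = "GROW_CORES"


LLC_STEP_PCT = 5  # assumed step size; not given in the text


def next_step(phase, be_cores, be_llc_pct, slo_met, bw_saturated, be_benefits):
    """Advance the core/cache subcontroller by one step (sketch).

    Returns the new (phase, be_cores, be_llc_pct).
    """
    if phase is Phase.GROW_LLC:
        if slo_met and not bw_saturated and be_benefits:
            return phase, be_cores, be_llc_pct + LLC_STEP_PCT
        # No further gain from cache (or no headroom): try cores next.
        return Phase.GROW_CORES, be_cores, be_llc_pct
    # GROW_CORES phase
    if bw_saturated:
        # Bandwidth saturated first: go back to adjusting the LLC split.
        return Phase.GROW_LLC, be_cores, be_llc_pct
    if slo_met:
        return phase, be_cores + 1, be_llc_pct  # move one core from LC to BE
    return phase, be_cores, be_llc_pct          # assumed: hold under SLO pressure


# A BE job starts with one core and 10% of the LLC in the GROW_LLC phase.
state = (Phase.GROW_LLC, 1, 10)
state = next_step(*state, slo_met=True, bw_saturated=False, be_benefits=True)
print(state)
```

A complete controller would also stop the search when the top-level controller disallows growth or disables BE jobs, which is left out here.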

Power

Heracles uses CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS (hardware features). RAPL is used to monitor CPU power at the per-socket level, while per-core DVFS is used to redistribute power amongst cores. Per-core DVFS setting changes go into effect within a few milliseconds. The frequency steps are in 100MHz and span the entire operating frequency range of the processor, including Turbo Boost frequencies.

When we’re at full power and the LC workload is running at too low a frequency, power is stolen from BE tasks.

When the operating power is close to the TDP (Thermal Design Power) and the frequency of the cores running the LC workload is too low, it uses per-core DVFS to lower the frequency of cores running BE tasks in order to shift the power budget to cores running LC tasks.
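As a sketch of that power-shifting rule: the function below lowers the BE frequency by one 100MHz step when socket power is near TDP and the LC cores are running below their guaranteed frequency. Treating "close to the TDP" as 90% of TDP, and raising the BE frequency again when there is headroom, are both assumptions on my part, as are all the names.

```python
STEP_MHZ = 100  # DVFS step size mentioned in the text


def adjust_be_frequency(socket_power_w, tdp_w, lc_freq_mhz, lc_min_freq_mhz,
                        be_freq_mhz, be_floor_mhz, be_max_mhz, near_tdp=0.90):
    """Return the next frequency setting for BE cores (sketch).

    near_tdp is the assumed fraction of TDP that counts as "close to TDP".
    """
    near_limit = socket_power_w > near_tdp * tdp_w
    lc_too_slow = lc_freq_mhz < lc_min_freq_mhz
    if near_limit and lc_too_slow:
        # Steal power from BE: one step down, but never below the floor.
        return max(be_floor_mhz, be_freq_mhz - STEP_MHZ)
    if not near_limit and not lc_too_slow:
        # Assumed counterpart: hand headroom back to BE, one step at a time.
        return min(be_max_mhz, be_freq_mhz + STEP_MHZ)
    return be_freq_mhz
```

Socket power would come from RAPL readings and the LC frequency from CPU frequency monitoring, with the returned value applied through per-core DVFS.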

Network Isolation

Heracles uses Linux traffic control (software mechanism). Specifically we use the qdisc scheduler with hierarchical token bucket queueing discipline (HTB) to enforce bandwidth limits for outgoing traffic from the BE tasks. The bandwidth limits are set by limiting the maximum traffic burst rate for the BE jobs (ceil parameter in HTB parlance). The LC job does not have any limits set on it. HTB can be updated very frequently, with the new bandwidth limits taking effect in less than hundreds of milliseconds.
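For a feel of what updating the BE limit might look like, the sketch below computes a `ceil` for the BE class and builds (but does not run) a `tc class change` command for it. The formula for the limit (link rate minus current LC traffic minus a 5% headroom) is an assumption for illustration, and the device name, class id, and the small guaranteed `rate` are hypothetical placeholders.

```python
def be_ceil_mbit(link_mbit, lc_mbit, headroom_pct=5):
    """Bandwidth ceiling for BE traffic in Mbit/s (assumed formula).

    Leaves the LC job its current usage plus a safety headroom.
    """
    headroom = link_mbit * headroom_pct // 100
    return max(1, link_mbit - lc_mbit - headroom)


def htb_change_cmd(dev, classid, ceil_mbit, rate_mbit=1):
    """Build the argument list for updating an existing HTB class.

    Only the ceil (maximum burst rate) is meant to constrain BE traffic;
    the low rate is a placeholder for HTB's required guaranteed rate.
    """
    return ["tc", "class", "change", "dev", dev, "parent", "1:",
            "classid", classid, "htb",
            "rate", f"{rate_mbit}mbit", "ceil", f"{ceil_mbit}mbit"]


# Assumed example: 10 Gbit/s link with the LC job currently using 3 Gbit/s.
cmd = htb_change_cmd("eth0", "1:20", be_ceil_mbit(10000, 3000))
print(" ".join(cmd))
```

The list form is convenient because it can be passed straight to `subprocess.run` on a host where the HTB qdisc and class already exist.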

Dealing with combined effects

Even with this precise level of control, Heracles still has to find the right settings for a given workload. This is treated as an optimisation problem, with an objective of maximising utilisation while satisfying SLO constraints. The insight that makes this all tractable is that the interference between resource control methods mostly occurs when a resource is saturated….

If Heracles can prevent any shared resource from saturating, then it can decompose the high-dimensional optimization problem into many smaller and independent problems of one or two dimensions each. Then each sub-problem can be solved using sound optimization methods, such as gradient descent.

Conclusions

We evaluated Heracles and several latency-critical and batch workloads used in production at Google on real hardware and demonstrated an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job. Through coordinated management of several isolation mechanisms, Heracles enables colocation of tasks that previously would cause SLO violations. Compared to power-saving mechanisms alone, Heracles increases overall cost efficiency substantially through increased utilization.

The bar keeps getting higher and higher. Containers are just one small piece of the overall puzzle. Still want to run your own datacenters?