Blade: A data center garbage collector

Blade: A data center garbage collector – Terei & Levy 2015

Thanks to Justin Mason (@jmason) for bringing today’s choice to my attention.

GC times are a major cause of latency in the tail – Blade aims to fix this. By taking a distributed systems perspective rather than just a single node view, Blade collaborates with the application to move load away from a node that is about to garbage collect. It’s a simple scheme, but very effective. As a bonus, it works best with the highest throughput ‘stop-the-world’ garbage collection algorithm – which also happens to be one of the simplest.

Recently, there has been an increasing push for low-latency at the tail in distributed systems. This has arisen from the needs of modern data center applications which consist of hundreds of software services, deployed across thousands of machines. For example, a single Facebook page load can involve fetching hundreds of results from their distributed caching layer, while a Bing search consists of 15 stages and involves thousands of servers in some of them. These applications require latency in microseconds with tight tail guarantees…. Unfortunately, garbage collection introduces high tail-latencies due to long pause times.

Java and Go are both garbage collected.

While many distributed systems require average and tail-latencies in microseconds, garbage collection pause times can range from milliseconds for small workloads to seconds for large workloads.

Blade interfaces with the runtime system in a very simple manner. The application registers a Blade callback handler that is invoked when the runtime wants to schedule a gc, and gets access to a method for starting a gc on demand. The gc handler callback is passed a collection id (so that calls to startGC are idempotent), the current heap size, and an indication of how long the gc might take. The handler can either request that the gc start immediately, or hold the collection until it explicitly calls startGC. And that’s it!
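
To make that concrete, here’s a minimal Go sketch of what such an interface could look like – the names and types are mine for illustration, the paper doesn’t prescribe this exact API:

```go
package blade

import "time"

// GCRequest describes a pending collection, as passed to the
// application's handler. (Illustrative types, not the paper's API.)
type GCRequest struct {
	CollectionID   uint64        // makes repeated StartGC calls idempotent
	HeapSize       uint64        // current heap size in bytes
	EstimatedPause time.Duration // indication of how long the gc might take
}

// Handler is the callback registered by the application. Returning
// true asks the runtime to collect immediately; returning false holds
// the collection until the application calls StartGC itself.
type Handler func(req GCRequest) (startNow bool)

// RegisterHandler installs the application's gc callback.
func RegisterHandler(h Handler) { /* runtime hook */ }

// StartGC begins the previously announced collection; calling it more
// than once with the same CollectionID has no further effect.
func StartGC(collectionID uint64) { /* runtime hook */ }
```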

Consider a load-balanced HTTP service. When a node wants to gc, it can contact the load balancer and ask to be taken out of the traffic distribution for a period. Once the gc has completed, traffic can be sent back to the node again. A very simple idea, but very effective.

The approach is to have a HTTP server explicitly notify the load-balancer when it needs to perform a collection, and then wait for the load-balancer to schedule it. Once the collection has been scheduled, the load-balancer will not send any new requests to the HTTP server, and the HTTP server will finish any outstanding requests. Once all requests are drained, it can start the collection, and once finished, notify the load-balancer and begin receiving new requests. In most situations, the load-balancer will schedule a HTTP server to collect immediately. However, it may decide to delay the collection if a critical number of other HTTP servers are currently down for collection.
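
Roughly, the server side of that exchange might look like the following sketch in Go; the LoadBalancer interface and the startGC hook are hypothetical stand-ins for whatever control channel and runtime integration a real deployment would use:

```go
package webgc

import (
	"sync"
	"time"
)

// Hypothetical stand-ins for the Blade request and the load-balancer
// control channel; neither is a real API.
type GCRequest struct {
	CollectionID   uint64
	EstimatedPause time.Duration
}

type LoadBalancer interface {
	AskToCollect(expected time.Duration) // blocks until the LB schedules us
	DoneCollecting()                     // put us back into the traffic distribution
}

type Server struct {
	lb       LoadBalancer
	inflight sync.WaitGroup            // tracks outstanding HTTP requests
	startGC  func(collectionID uint64) // the runtime's startGC hook
}

// HandleGC is the Blade callback: instead of collecting immediately,
// the server waits to be scheduled by the load-balancer, drains its
// outstanding requests, collects, and then rejoins the cluster.
func (s *Server) HandleGC(req GCRequest) (startNow bool) {
	go func() {
		s.lb.AskToCollect(req.EstimatedPause) // taken out of rotation (possibly delayed)
		s.inflight.Wait()                     // finish any outstanding requests
		s.startGC(req.CollectionID)           // the stop-the-world pause happens here
		s.lb.DoneCollecting()                 // start receiving new requests again
	}()
	return false // hold the collection; we call startGC ourselves
}
```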

Blade was integrated into the Go Gorilla Web Toolkit, and tests performed on a simple three-node cluster. The following table shows the dramatic impact on tail latency, with all times in ms (note how the tail latency is effectively hidden if you only consider mean and/or median latency). The GC off column is a baseline comparison in which garbage collection is switched off entirely.

                  GC off    Blade     GC on
Mean               2.312    2.311     2.403
Median             2.296    2.294     2.297
Std. Dev.          0.579    0.582     3.395
Max                7.847    7.443   164.206
Avg. GC pause      0.0     12.243    12.339

Tail latency is reduced from 164ms to 7ms, and gc overhead has no impact so long as the remaining servers are below their peak capacity. Above this, the cost of gc will still be evenly spread across all active servers, eliminating the tail effect.

A second example concerns integration into a Raft-based consensus system – etcd is chosen.

We use BLADE with Raft as follows. First, when a follower needs to GC, we follow a protocol similar to the HTTP load-balanced cluster. The follower notifies the leader of its intention to GC and waits to be scheduled. The leader schedules the collection as long as doing so will leave enough servers running for a majority to be formed and progress made. We only consider servers offline due to GC for this, as servers down for other reasons could be down for an arbitrary amount of time…. The second situation, when a server is acting as leader for the cluster, is more interesting. Since the cluster cannot make progress when the leader is unavailable, we switch leaders before collecting. Once the leadership has been transferred, the old leader (now a follower) runs the same algorithm as presented previously for followers.
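
The leader-side scheduling rule is easy to sketch (illustrative only – this is not etcd’s actual code):

```go
package raftgc

// Cluster is a toy view of the Raft membership from the leader's
// perspective.
type Cluster struct {
	Size      int             // total number of servers in the cluster
	GCOffline map[string]bool // servers currently paused for collection
}

// CanScheduleGC reports whether taking node offline for a collection
// still leaves enough servers for a majority to be formed. Servers
// down for other reasons are deliberately not counted, as in the paper.
func (c *Cluster) CanScheduleGC(node string) bool {
	offline := len(c.GCOffline)
	if !c.GCOffline[node] {
		offline++ // this node would join the GC-offline set
	}
	majority := c.Size/2 + 1
	return c.Size-offline >= majority
}
```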

The following table shows the impact when this scheme is integrated into etcd (response times in ms):

Percentile     Blade    GC off    GC on
95th            0.52      0.54     0.54
99th            0.57      0.57     0.56
99.9th          0.62      0.67    28.81
99.99th         0.70      1.03    86.98
99.999th        0.80      1.08    94.22
99.9999th       0.97      1.12    95.96

The reason BLADE even outperforms the GC-Off configuration is due to the penalty GC-Off pays from the extra system calls and lost locality from requesting new memory rather than ever recycling it.

Summary of GC schemes

The paper also includes a nice summary of the main approaches to GC.

Stop-the-world (as used by Go, Ruby, and Oracle’s JVM by default), with 10-40ms of pause time per GB of heap:

Stop-the-world collectors are the oldest, simplest and highest throughput collectors available [22]. A STW GC works by first completely stopping the application, then, starting from a root set of pointers (registers, stacks, global variables), tracing out the application’s live set.
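
As a reminder of what that tracing looks like, here’s a toy mark phase in Go. Real collectors walk the runtime’s heap, of course, not a slice of structs:

```go
package marksweep

// object is a toy heap node; anything left unmarked after mark()
// returns is unreachable and can be reclaimed by a sweep phase.
type object struct {
	marked   bool
	children []*object
}

// mark traces out the live set reachable from the roots, using an
// explicit work stack rather than recursion.
func mark(roots []*object) {
	stack := append([]*object(nil), roots...)
	for len(stack) > 0 {
		obj := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if obj == nil || obj.marked {
			continue
		}
		obj.marked = true
		stack = append(stack, obj.children...)
	}
}
```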

Concurrent

Concurrent collectors attempt to reduce the pause time caused by STW collectors by enabling the GC to run concurrently with application threads. They achieve this by using techniques such as read and write barriers to detect and fix concurrent modifications to the heap while tracing live data and/or relocating objects.

Pause times may come down to a few milliseconds, but concurrent collectors also reduce throughput by 10-40% and increase memory usage by around 20%.
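
The barriers mentioned above are easy to picture. Here’s an illustrative Dijkstra-style insertion write barrier (a sketch only – real barriers are emitted by the compiler around every pointer store):

```go
package barrier

// obj is a toy heap node; marked means the collector has already
// seen it during the current marking cycle.
type obj struct {
	marked bool
}

var grayQueue []*obj // objects the collector still has to scan

// writeBarrier shades any pointer newly stored into the heap while
// marking is in progress, so the object it refers to cannot be missed
// by the concurrent trace.
func writeBarrier(slot **obj, newVal *obj, marking bool) {
	if marking && newVal != nil && !newVal.marked {
		newVal.marked = true
		grayQueue = append(grayQueue, newVal) // collector will scan it later
	}
	*slot = newVal
}
```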

Real-time

Garbage collectors designed for real-time systems take the approaches of concurrent collectors even further, many offering the ability to bound pause times. The best collectors can achieve bounds in the tens of microseconds [50, 51], however doing so comes at a high throughput cost ranging from 30%–100% overheads, and generally increased heap sizes of around 20%.

Reference Counting

A completely different approach to tracing garbage collection is reference counting. Each object has an integer attached to it counting the number of incoming references; once that count reaches zero, the object can be freed. Its largely predictable behaviour and simple implementation make it common: Python, Objective-C and Swift, for example, all use reference counting.

On average reference counting has 30% lower throughput than tracing collectors.
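
For completeness, a minimal reference-counting sketch (ignoring cycles, which plain reference counting cannot reclaim):

```go
package refcount

// rcObject carries a count of incoming references and is reclaimed
// (here, finalized via free) as soon as the count drops to zero.
type rcObject struct {
	refs int
	free func() // cleanup to run when the last reference is dropped
}

func (o *rcObject) retain() { o.refs++ }

func (o *rcObject) release() {
	o.refs--
	if o.refs == 0 && o.free != nil {
		o.free()
	}
}
```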

The good news for users of Blade is that the simplest to implement (and highest throughput) scheme works best with it:

Another important outcome from using BLADE in a distributed system is the changed requirements for the garbage collector. As BLADE deals with the latency impact, in most situations a concurrent collector will no longer be the best fit. Instead, a simpler and high-throughput stop-the-world collector is best suited. These collectors are already readily available in most languages, unlike high-performance concurrent collectors.