Large-scale cluster management at Google with Borg

Large-scale cluster management at Google with Borg – Verma et al. 2015

Borg has been running all of Google’s workloads for the last ten years, and the lessons learned from Borg are being packaged into Kubernetes so that the rest of the world can benefit from them. An important paper, then, as the rest of us rush towards a container-based world.

The cluster management system we internally call Borg admits, schedules, starts, restarts, and monitors the full range of applications that Google runs. This paper explains how…

And when they say “the full range of applications,” that’s a pretty impressive list:

Many application frameworks have been built on top of Borg over the last few years, including our internal MapReduce system, FlumeJava, Millwheel, and Pregel. Most of these have a controller that submits a master job and one or more worker jobs; the first two play a similar role to YARN’s application manager. Our distributed storage systems such as GFS and its successor CFS, Bigtable, and Megastore all run on Borg.

(Also, don’t forget Gmail, Google Docs, etc.)

There’s a wealth of great information in this paper, so I can only highlight a subset here. Let’s take a look at Borg from a few different perspectives: Borg as a warehouse-scale computer OS, Borg from a user’s viewpoint, and Borg as the inspiration for Kubernetes.

Borg as a warehouse-scale computer OS

A datacenter site has multiple buildings; inside a single building is a cluster connected by a high-performance datacenter-scale network fabric. A cluster usually hosts one large Borg cell. The median cell size is about 10,000 machines, but of course some cells are much larger. The machines within a cell are quite heterogeneous in nature, and Borg hides these differences from applications.

50% of our machines run 9 or more tasks; a 90%ile machine has about 25 tasks and will be running about 4500 threads. Although sharing machines between applications increases utilization, it also requires good mechanisms to prevent tasks from interfering with one another. This applies to both security and performance. We use a Linux chroot jail as the primary security isolation mechanism between multiple tasks on the same machine… all Borg tasks run inside a Linux cgroup-based resource container.
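
As a concrete (if simplified) illustration of cgroup-based containment, the sketch below confines a process with CPU and memory limits via the cgroup v2 filesystem. It assumes cgroup v2 is mounted at /sys/fs/cgroup and that the caller has the privileges to create groups there; the group name and limit values are illustrative, and this is not how the Borglet actually does it.

```go
// Sketch: confine an already-running process with cgroup v2 limits,
// in the spirit of Borg's cgroup-based resource containers.
// Assumes cgroup v2 at /sys/fs/cgroup; paths and limits are illustrative.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func confine(pid int, cpuQuotaMicros, memMaxBytes int64) error {
	dir := filepath.Join("/sys/fs/cgroup", fmt.Sprintf("task-%d", pid))
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// cpu.max takes "<quota> <period>" in microseconds; memory.max is in bytes.
	limits := map[string]string{
		"cpu.max":    fmt.Sprintf("%d 100000", cpuQuotaMicros),
		"memory.max": fmt.Sprintf("%d", memMaxBytes),
	}
	for file, value := range limits {
		if err := os.WriteFile(filepath.Join(dir, file), []byte(value), 0o644); err != nil {
			return err
		}
	}
	// Writing the pid into cgroup.procs applies the limits to the whole process.
	return os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(fmt.Sprintf("%d", pid)), 0o644)
}

func main() {
	// 0.5 CPU and 512 MiB for this process (illustrative values only).
	if err := confine(os.Getpid(), 50000, 512<<20); err != nil {
		fmt.Fprintln(os.Stderr, "confine:", err)
	}
}
```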

Each machine in the cell runs the Borg agent (Borglet). The Borgmaster coordinates all activity in the cell and is implemented as a five-machine Paxos group. It works with a separate scheduler process that asynchronously scans the pending queue of tasks and assigns them to machines.

We are not sure where the ultimate scalability limit to Borg’s centralized architecture will come from; so far, every time we have approached a limit, we’ve managed to eliminate it. A single Borgmaster can manage many thousands of machines in a cell, and several cells have arrival rates above 10 000 tasks per minute. A busy Borgmaster uses 10–14 CPU cores and up to 50GiB RAM.

The scoring algorithm which the scheduler uses to help decide on work placements is carefully designed with scalability in mind – see the full paper for details. In fact, cells could in theory be even larger than they are – the limits are not defined by Borg’s scalability, but by a desire to minimize the impact of faults.
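
Purely as an illustration of the shape of that problem, here is a toy placement loop in Go: a feasibility check followed by scoring of the feasible machines. The scoring formula (penalising placements that strand one resource dimension) is invented for this sketch and is not Borg's actual hybrid score; see §3.2 of the paper for the real thing.

```go
// Toy illustration of feasibility checking plus scoring, loosely in the
// shape of Borg's scheduler (§3.2). The scoring formula here is invented.
package main

import "fmt"

type Machine struct {
	Name               string
	FreeCPU, FreeRAM   float64 // cores and GiB currently free
	TotalCPU, TotalRAM float64
}

type Task struct{ CPU, RAM float64 }

// pickMachine returns the feasible machine with the best (lowest) score.
func pickMachine(t Task, machines []Machine) (string, bool) {
	best, bestScore := "", 0.0
	for _, m := range machines {
		if m.FreeCPU < t.CPU || m.FreeRAM < t.RAM { // feasibility check
			continue
		}
		// Penalise placements that strand resources: lots of leftover CPU
		// with almost no RAM (or vice versa) is hard to use later.
		leftCPU := (m.FreeCPU - t.CPU) / m.TotalCPU
		leftRAM := (m.FreeRAM - t.RAM) / m.TotalRAM
		score := abs(leftCPU - leftRAM)
		if best == "" || score < bestScore {
			best, bestScore = m.Name, score
		}
	}
	return best, best != ""
}

func abs(x float64) float64 {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	machines := []Machine{
		{"m1", 10, 20, 32, 128},
		{"m2", 2, 60, 32, 128},
	}
	if name, ok := pickMachine(Task{CPU: 2, RAM: 8}, machines); ok {
		fmt.Println("placed on", name)
	}
}
```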

Borgmaster uses a combination of techniques that enable it to achieve 99.99% availability in practice: replication for machine failures; admission control to avoid overload; and deploying instances using simple, low-level tools to minimize external dependencies. Each cell is independent of the others to minimize the chance of correlated operator errors and failure propagation. These goals, not scalability limitations, are the primary argument against larger cells.

When looking at Borg, it’s helpful to understand that many decisions were taken with a goal of improving utilization:

One of Borg’s primary goals is to make efficient use of Google’s fleet of machines, which represents a significant financial investment: increasing utilization by a few percentage points can save millions of dollars.

Some consequences of this goal include:

  • Mixing of production and non-production workloads in a single cell (segregating them would require 20-30% more machines in the median cell). Non-production workloads optimistically take advantage of the cautious resource reservations made by production jobs.
  • Cells are shared by thousands of users. Experimentation with moving larger users to dedicated cells showed that this would require 2-16x as many cells, and about 20-150% more machines in total.
  • The creation of large cells. These both allow large computations to be run, and also reduce fragmentation. Again, using smaller cells would require significantly more machines.
  • Allowing fine-grained resource requests rather than limiting choices to a few pre-determined sizes or buckets. Experiments showed that the latter approach would require 30-50% more resources in the median case.
  • A resource reclamation scheme that allocates over-provisioned resources to non-production workloads – about 20% of the total workload runs in this way (a sketch of the idea follows this list).
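
As a rough illustration of that reclamation bullet: §5.5 of the paper describes estimating each prod task's future usage (its "reservation") and offering the gap between the requested limit and the reservation to lower-priority work. The sketch below captures the shape of that idea; the smoothing constants and safety margin are my own invention, not Borg's.

```go
// Sketch of resource reclamation: the difference between a prod task's
// requested limit and its estimated future usage ("reservation") can be
// offered to non-prod work. Constants here are illustrative only.
package main

import "fmt"

type TaskUsage struct {
	Limit       float64 // what the job asked for
	Reservation float64 // smoothed estimate of what it actually needs
}

// observe updates the reservation from a recent usage sample, reacting
// quickly to increases and decaying slowly, with a safety margin on top.
func (t *TaskUsage) observe(usage float64) {
	const margin = 1.15 // illustrative safety margin
	est := usage * margin
	if est > t.Reservation {
		t.Reservation = est // react fast to increases
	} else {
		t.Reservation = 0.9*t.Reservation + 0.1*est // decay slowly
	}
	if t.Reservation > t.Limit {
		t.Reservation = t.Limit
	}
}

func (t *TaskUsage) reclaimable() float64 { return t.Limit - t.Reservation }

func main() {
	t := TaskUsage{Limit: 8, Reservation: 8} // start pessimistically at the limit
	for _, u := range []float64{3.0, 2.5, 2.8, 3.2} {
		t.observe(u)
	}
	fmt.Printf("reclaimable: %.2f cores\n", t.reclaimable())
}
```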

Interestingly, data locality is not supported by the Borg scheduler: presumably the cluster network is good enough, and the extra scheduling constraints would also work against the utilization goals.

Borg from a user’s perspective

Borg’s users are Google developers and system administrators (site reliability engineers or SREs) who run Google’s applications and services. Users submit their work to Borg in the form of jobs, each of which consists of one or more tasks that all run the same program (binary)… A Borg job’s properties include its name, owner, and the number of tasks it has. A job can have constraints to force its tasks to run on machines with particular attributes such as processor architecture, OS version, or an external IP address.

“Each task maps to a set of Linux processes running in a container on a machine.” Tasks can also have properties – such as their resource requirements. Borg programs are statically linked which reduces dependencies on the runtime environment.
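
To make the job/task split concrete, here is a rough sketch of how the entities described above might be modelled; all of the field names are mine, not Borg's.

```go
// Illustrative shape of a job and its tasks; field names are invented.
package main

import "fmt"

type Resources struct {
	CPUCores  float64
	RAMBytes  int64
	DiskBytes int64
}

type Task struct {
	Index     int
	Resources Resources // fine-grained, per-task requirements
}

type Job struct {
	Name        string
	Owner       string
	Priority    int               // a small positive integer (see priorities below)
	Constraints map[string]string // machine attributes the tasks must have
	Tasks       []Task            // every task runs the same binary
}

func main() {
	job := Job{
		Name:        "hello-world",
		Owner:       "someone",
		Priority:    100,
		Constraints: map[string]string{"arch": "x86_64"},
	}
	for i := 0; i < 5; i++ {
		job.Tasks = append(job.Tasks, Task{
			Index:     i,
			Resources: Resources{CPUCores: 0.5, RAMBytes: 512 << 20},
		})
	}
	fmt.Printf("%s: %d tasks\n", job.Name, len(job.Tasks))
}
```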

Users operate on jobs by issuing remote procedure calls (RPCs) to Borg, most commonly from a command-line tool, other Borg jobs, or our monitoring systems (§2.6). Most job descriptions are written in the declarative configuration language BCL. This is a variant of GCL, which generates protobuf files, extended with some Borg-specific keywords.

Running tasks can be updated by pushing a new configuration to Borg and requesting an update. Tasks can ask for a SIGTERM notification before they are pre-empted by a SIGKILL.
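
A task that opts into the SIGTERM notification might handle it along these lines; the clean-up steps shown are illustrative, and the only contract the paper describes is that the SIGTERM arrives before the SIGKILL.

```go
// Sketch of the SIGTERM-then-SIGKILL contract: a task asks to be notified
// with SIGTERM so it can clean up before the eventual SIGKILL.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)

	go func() {
		<-term
		fmt.Println("SIGTERM received: flushing state, finishing in-flight work")
		// ... checkpoint, drain connections, deregister from naming ...
		os.Exit(0)
	}()

	// Simulated main work loop; SIGKILL cannot be caught, so anything
	// important must happen inside the grace period above.
	for {
		time.Sleep(time.Second)
	}
}
```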

Quota and priority control what gets run when and where. The first gate is controlled by quota:

Quota is used to decide which jobs to admit for scheduling. Quota is expressed as a vector of resource quantities (CPU, RAM, disk, etc.) at a given priority, for a period of time (typically months). The quantities specify the maximum amount of resources that a user’s job requests can ask for at a time (e.g., “20 TiB of RAM at prod priority from now until the end of July in cell xx”). Quota-checking is part of admission control, not scheduling: jobs with insufficient quota are immediately rejected upon submission.
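
Since quota-checking is admission control rather than scheduling, it amounts to comparing a requested resource vector against what remains of the user's quota at that priority, and rejecting the job outright if it doesn't fit. A minimal sketch, with made-up types and numbers:

```go
// Sketch of quota-as-admission-control: a resource vector at a given
// priority band is checked against remaining quota and rejected if it
// doesn't fit. Types and numbers are illustrative, not Borg's.
package main

import (
	"errors"
	"fmt"
)

type Vector struct{ CPU, RAMTiB, DiskTiB float64 }

// remaining quota per (user, priority band)
var quota = map[string]map[string]Vector{
	"alice": {"prod": {CPU: 5000, RAMTiB: 20, DiskTiB: 100}},
}

func admit(user, band string, req Vector) error {
	q, ok := quota[user][band]
	if !ok || req.CPU > q.CPU || req.RAMTiB > q.RAMTiB || req.DiskTiB > q.DiskTiB {
		return errors.New("insufficient quota: rejected at submission")
	}
	q.CPU -= req.CPU
	q.RAMTiB -= req.RAMTiB
	q.DiskTiB -= req.DiskTiB
	quota[user][band] = q
	return nil // the job proceeds to the scheduler's pending queue
}

func main() {
	fmt.Println(admit("alice", "prod", Vector{CPU: 200, RAMTiB: 1, DiskTiB: 2}))  // accepted
	fmt.Println(admit("alice", "prod", Vector{CPU: 9000, RAMTiB: 1, DiskTiB: 2})) // rejected
}
```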

The language used to describe quotas strongly suggests an internal cost-accounting and chargeback mechanism: “Higher-priority quota costs more than quota at lower-priority,” “we encourage users to purchase no more quota than they need, many users overbuy because it insulates them against future shortages when their application’s user base grows. We respond to this by over-selling quota at lower priority levels.”

Quota allocation is handled outside of Borg, and is intimately tied to our physical capacity planning, whose results are reflected in the price and availability of quota in different datacenters.

Once a job has passed quota control, its priority determines its actual scheduling:

Every job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that involves preempting (killing) the latter. Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free). For this paper, prod jobs are the ones in the monitoring and production bands.

(Note that ‘monitoring’ takes the very highest priority – if you lose visibility into what is going on, things can start to head downhill very rapidly…).
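
Preemption itself is mechanically simple: a pending higher-priority task may evict enough strictly lower-priority tasks from a machine to make room. The toy function below shows that shape; the band values and eviction order are illustrative rather than Borg's actual policy.

```go
// Toy illustration of priority-based preemption: evict the lowest-priority
// tasks (all strictly below the incoming priority) until the new task fits.
package main

import (
	"fmt"
	"sort"
)

type RunningTask struct {
	Name     string
	Priority int // higher number = more important
	CPU      float64
}

func evictFor(needCPU float64, incomingPrio int, freeCPU float64, running []RunningTask) []RunningTask {
	sort.Slice(running, func(i, j int) bool { return running[i].Priority < running[j].Priority })
	var victims []RunningTask
	for _, t := range running {
		if freeCPU >= needCPU {
			break
		}
		if t.Priority >= incomingPrio {
			break // never preempt equal or higher priority work
		}
		victims = append(victims, t)
		freeCPU += t.CPU
	}
	if freeCPU < needCPU {
		return nil // cannot fit even with preemption
	}
	return victims
}

func main() {
	running := []RunningTask{{"batch-1", 100, 4}, {"besteffort-1", 25, 2}}
	fmt.Println(evictFor(5, 360 /* an illustrative prod priority */, 1, running))
}
```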

Each task gets its own stable Borg Name Service (BNS) name, which is written into Chubby, and a unique DNS name is derived from it. Tasks have built-in health monitoring:

Almost every task run under Borg contains a built-in HTTP server that publishes information about the health of the task and thousands of performance metrics (e.g., RPC latencies). Borg monitors the health-check URL and restarts tasks that do not respond promptly or return an HTTP error code. Other data is tracked by monitoring tools for dashboards and alerts on service level objective (SLO) violations.
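
The built-in health endpoint is just an HTTP server inside the task that the machine agent can poll. A minimal sketch follows; the /healthz and /metrics paths and the metric shown are illustrative, not Borg's actual conventions.

```go
// Sketch of a task's built-in health/metrics HTTP server: the agent polls
// a URL and restarts tasks that fail to answer or return an error code.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Return non-200 (or hang) and the agent will eventually restart us.
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Thousands of metrics in the real thing; one illustrative line here.
		fmt.Fprintln(w, "rpc_latency_ms{method=\"Lookup\"} 3.2")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```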

“Sigma” provides a web-based user interface through which users manage and view their jobs and tasks. Everything is recorded in a store called ‘Infrastore,’ “a scalable read-only data store with an interactive SQL-like interface via Dremel.”

This data is used for usage-based charging, debugging job and system failures, and long-term capacity planning. It also provided the data for the Google cluster workload trace.

Powered by these tools, Google’s SREs manage ‘a few tens of thousands of machines’ per person.

Borg as the inspiration for Kubernetes

Virtually all of Google’s cluster workloads have switched to use Borg over the past decade. We continue to evolve it, and have applied the lessons we learned from it to Kubernetes.

  • Kubernetes’ labels are a more flexible mechanism than job names in Borg (although of course, labels can also be used to simulate job name behaviour).
  • Kubernetes allocates one IP per pod and service, which alleviates the port management challenges that arise from Borg’s IP-per-machine policy.
  • Borg’s alloc mechanism for grouping tasks inspired Kubernetes’ pods.
  • Since Google quickly found out that cluster management is more than just task management (the applications that run on Borg benefit from many services such as load balancing and naming), Kubernetes supports a service abstraction.
  • Operating Borg taught Google that debugging information needs to be surfaced to all users. Hence,

Kubernetes aims to replicate many of Borg’s introspection techniques. For example, it ships with tools such as cAdvisor for resource monitoring, and log aggregation based on Elasticsearch/Kibana and Fluentd. The master can be queried for a snapshot of its objects’ state. Kubernetes has a unified mechanism that all components can use to record events (e.g., a pod being scheduled, a container failing) that are made available to clients.

  • Borg evolved into the kernel of an ecosystem of services that coordinate to manage user jobs.

The Kubernetes architecture goes further: it has an API server at its core that is responsible only for processing requests and manipulating the underlying state objects. The cluster management logic is built as small, composable micro-services that are clients of this API server, such as the replication controller, which maintains the desired number of replicas of a pod in the face of failures, and the node controller, which manages the machine lifecycle.
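
That controller pattern, small reconciliation loops acting as ordinary clients of the API server, is easy to caricature. The sketch below shows the replication-controller idea of driving observed state towards desired state; it is a standalone toy, not the real Kubernetes client-go machinery.

```go
// Caricature of a replication controller's reconcile loop: compare the
// desired replica count with what is running and create/delete to converge.
package main

import (
	"fmt"
	"time"
)

type apiServer struct {
	desired int      // desired replica count for one pod template
	running []string // names of currently running pods
}

func (s *apiServer) createPod(name string) { s.running = append(s.running, name) }
func (s *apiServer) deletePod()            { s.running = s.running[:len(s.running)-1] }

// reconcile drives the observed state towards the desired state.
func reconcile(s *apiServer) {
	for len(s.running) < s.desired {
		s.createPod(fmt.Sprintf("pod-%d", len(s.running)))
	}
	for len(s.running) > s.desired {
		s.deletePod()
	}
}

func main() {
	s := &apiServer{desired: 3}
	for i := 0; i < 3; i++ { // a real controller watches for changes rather than polling
		reconcile(s)
		fmt.Println("running:", s.running)
		time.Sleep(10 * time.Millisecond)
	}
}
```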