Dapper, A Large Scale Distributed Systems Tracing Infrastructure

Dapper, A Large Scale Distributed Systems Tracing Infrastructure – Sigelman et al. (Google) 2010

I’m going to dedicate the rest of this week to a series of papers addressing the important question of “how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?” As we’ll see, one important lesson is that the overall system structure has to be inferred – under constant deployments it doesn’t make sense to talk about a central master configuration file. Other important techniques include comprehensive instrumentation, sampling (to cope with the volume of data), centralised storage for information, and an API with tools built on top for exploring the results.

First up is Google’s Dapper. Hard to believe this paper is five years old! There are three supporting systems referenced in it: one for capturing debug information, one for exceptions, and of course Dapper itself for collecting traces. The basic idea of Dapper is easy to grasp; it’s the necessity of something like Dapper, and the uses to which it is put, that are the really interesting part of this story for me.

The short version:

  • Google uses a very common set of building blocks across all of its software, so by instrumenting these building blocks Dapper is able to automatically generate a lot of useful trace information without any application involvement. Dapper provides a trace context associated with a thread, and tracks the context across async callbacks and RPCs (see the sketch after this list).
  • Programmers may optionally supplement the trace information via an annotation model to put application-specific information into the trace logs.
  • Traces are sampled using an adaptive sampling rate (storing everything would require too much storage and network traffic, as well as introducing too much application overhead).
  • Traces are stored in BigTable, with one row in a trace table dedicated to each trace (correlation) id.

And importantly,

  • The Dapper team provided an API onto the tracing infrastructure, which has enabled many useful tools to be built on top.
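To make the thread-associated trace context in the first bullet concrete, here is a minimal sketch of the kind of propagation the paper describes: the context lives in thread-local storage, is copied into outgoing RPC metadata, and is captured when work is deferred to an async callback. This is an illustration of the idea, not Dapper’s actual implementation; all names here (TraceContext, start_span, inject, wrap_callback) are mine.

```python
# A minimal sketch, not Dapper's code: thread-local trace context that is
# copied into RPC metadata and captured for async callbacks.
import threading
import secrets

_local = threading.local()

class TraceContext:
    def __init__(self, trace_id, span_id, parent_id=None):
        self.trace_id = trace_id      # shared by every span in the trace
        self.span_id = span_id        # identifies this unit of work
        self.parent_id = parent_id    # causal edge back to the parent span

def start_span():
    """Create a child of the current thread's span, or a new root trace."""
    parent = getattr(_local, "ctx", None)
    trace_id = parent.trace_id if parent else secrets.randbits(64)
    parent_id = parent.span_id if parent else None
    ctx = TraceContext(trace_id, secrets.randbits(64), parent_id)
    _local.ctx = ctx
    return ctx

def inject(rpc_metadata):
    """Copy the current context into outgoing RPC metadata (a dict here)."""
    ctx = _local.ctx
    rpc_metadata.update(trace_id=ctx.trace_id, parent_id=ctx.span_id)

def wrap_callback(fn):
    """Capture the context so an async callback runs under the same trace."""
    ctx = getattr(_local, "ctx", None)
    def runner(*args, **kwargs):
        _local.ctx = ctx
        return fn(*args, **kwargs)
    return runner
```

On the receiving side of an RPC the server would do the inverse: read the trace and parent ids out of the metadata and install them as its own thread-local context before starting its span, so every span downstream shares the root’s trace id.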

Motivation

Why do we need distributed tracing in the first place?

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities…. Understanding system behavior in this context requires observing related activities across many different programs and machines.

Suppose you’re trying to troubleshoot such an application. The first problem is that it’s hard to even pin down which services are used: “new services and pieces may be added and modified from week to week, both to add user-visible features and to improve other aspects such as performance or security.” And since the general model is that different teams have responsibility for different services, it’s unlikely that anyone is an expert in the internals of all of them. Finally, all of this takes place on shared infrastructure: “services and machines may be shared simultaneously by many different clients, so a performance artifact may be due to the behavior of another application.”

When systems involve not just dozens of subsystems but dozens of engineering teams, even our best and most experienced engineers routinely guess wrong about the root cause of poor end-to-end performance. In such situations, Dapper can furnish much-needed facts and is able to answer many important performance questions conclusively.

Lots of services, different teams responsible for them, continuous deployment – it’s not just Google that builds systems in this way any more. These are of course some of the same patterns underpinning the microservices approach. Thus the Dapper paper also serves as an illustration of the necessity of a distributed tracing infrastructure in a microservices world. In such a world, it’s impossible to know in advance what you’re going to need – hence monitoring needs to be ubiquitous and continuously deployed. This in turn means that it must be low overhead. It also rules out any reliance on developers to manually insert and maintain all of the required trace points:

A tracing infrastructure that relies on active collaboration from application-level developers in order to function becomes extremely fragile, and is often broken due to instrumentation bugs or omissions, therefore violating the ubiquity requirement.

Implementation

A Dapper trace represents a tree of activity across multiple services. Each trace is identified by a probabilistically unique 64-bit integer known as its trace id. Nodes in the trace tree are called spans, and represent the basic unit of work. Edges between spans indicate causality. Spans have a human-readable span name as well as a span id and a parent id. Developers may supplement the automatically provided trace information through an annotation mechanism. Currently, 70% of all Dapper spans and 90% of all Dapper traces have at least one application-specified annotation. Safeguards are in place to avoid overwhelming the system:

In order to protect Dapper users from accidental overzealous logging, individual trace spans have a configurable upper-bound on their total annotation volume. Application-level annotations are not able to displace the structural span or RPC information regardless of application behavior.
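As a rough illustration of the span record and the annotation safeguard just quoted, here is a sketch; the field names and the byte budget are assumptions of mine, not Dapper’s actual values.

```python
# Illustrative span record with a capped annotation budget (values invented).
import time

MAX_ANNOTATION_BYTES = 4096  # hypothetical per-span annotation budget

class Span:
    def __init__(self, trace_id, span_id, parent_id, name):
        # Structural fields: annotations can never displace these.
        self.trace_id = trace_id
        self.span_id = span_id
        self.parent_id = parent_id
        self.name = name            # human-readable span name
        self.start = time.time()
        self.end = None
        self.annotations = []       # optional, application-supplied
        self._annotation_bytes = 0

    def annotate(self, message):
        """Add an application-level annotation, subject to the volume cap."""
        size = len(message.encode("utf-8"))
        if self._annotation_bytes + size > MAX_ANNOTATION_BYTES:
            return  # drop: overzealous logging must not overwhelm the span
        self.annotations.append((time.time(), message))
        self._annotation_bytes += size
```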

Span data is written to local log files, and then pulled from there by Dapper daemons, sent over a collection infrastructure, and finally ends up being written into BigTable. “A trace is laid out as a single Bigtable row, with each column corresponding to a span.” Trace logging and collection happens out-of-band from the main request flow.
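The row layout is easy to picture with a toy stand-in for Bigtable (a dict of dicts rather than a real Bigtable client): the trace id is the row key and each span occupies its own column.

```python
# Toy stand-in for the storage layout described above; not a real Bigtable
# client, just an illustration of the keying scheme.
trace_table = {}   # row key: trace id -> {column: span id -> span record}

def store_span(trace_id, span_id, span_record):
    """Each span lands in its own column of the trace's row."""
    trace_table.setdefault(trace_id, {})[span_id] = span_record

def load_trace(trace_id):
    """Reading a single row yields every collected span of the trace."""
    return trace_table.get(trace_id, {})
```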

The median latency for trace data collection – that is, the time it takes data to propagate from instrumented application binaries to the central repository – is less than 15 seconds. The 98th percentile latency is itself bimodal over time; approximately 75% of the time, 98th percentile collection latency is less than two minutes, but the other approximately 25% of the time it can grow to be many hours.

In terms of overhead, the most expensive thing Dapper does is writing trace entries to disk. This overhead is kept to a minimum by batching writes and executing them asynchronously. In a worst-case scenario test, Dapper never used more than 0.3% of one core of a production machine. Dapper also has a very small memory footprint, and accounts for less than 0.01% of the network traffic in Google’s production environment.
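Here is a sketch of how logging can be kept off the request path, assuming a queue-plus-background-thread design; the batch size and flush interval are invented for illustration, since the paper doesn’t give these details.

```python
# Sketch: spans are queued in memory on the request path; a background
# thread batches them to the local log file, which the Dapper daemon
# later scrapes. Parameters below are illustrative.
import json
import queue
import threading

class AsyncSpanLogger:
    def __init__(self, path, batch_size=64, flush_interval=1.0):
        self.path = path
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.queue = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def log(self, span_record):
        """Called on the request path: enqueue only, never touch disk."""
        self.queue.put(span_record)

    def _writer(self):
        batch = []
        while True:
            try:
                batch.append(self.queue.get(timeout=self.flush_interval))
                timed_out = False
            except queue.Empty:
                timed_out = True
            # Flush when a batch is full, or when traffic goes quiet.
            if batch and (timed_out or len(batch) >= self.batch_size):
                with open(self.path, "a") as f:
                    for record in batch:
                        f.write(json.dumps(record) + "\n")
                batch.clear()
```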

Key to keeping the overhead low is collecting only a sample of traces:

The Dapper overhead attributed to any given process is proportional to the number of traces that process samples per unit time. The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured…. We are in the process of deploying an adaptive sampling scheme that is parameterized not by a uniform sampling probability, but by a desired rate of sampled traces per unit time. This way, workloads with low traffic automatically increase their sampling rate while those with very high traffic will lower it so that overheads remain under control.
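The two schemes in the quote can be sketched as follows; the adaptive controller (a target rate of sampled traces per second, with the probability adjusted periodically) is my own illustration of the stated idea rather than Dapper’s actual algorithm.

```python
# Uniform vs. adaptive sampling decisions (illustrative, not Dapper's code).
import random

UNIFORM_PROBABILITY = 1.0 / 1024   # the first-generation scheme

def uniform_should_sample():
    return random.random() < UNIFORM_PROBABILITY

class AdaptiveSampler:
    """Aim for a fixed number of sampled traces per second, whatever the load."""
    def __init__(self, target_per_sec):
        self.target = target_per_sec
        self.probability = 1.0       # low-traffic workloads sample everything
        self.candidates = 0

    def should_sample(self):
        self.candidates += 1
        return random.random() < self.probability

    def adjust(self, interval_sec):
        # Called once per interval: scale the probability so that
        # candidates * probability ~= target * interval.
        rate = self.candidates / interval_sec
        self.probability = min(1.0, self.target / rate) if rate > 0 else 1.0
        self.candidates = 0
```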

Is sampling just 1 in 1024 traces really enough to catch problems?

New Dapper users often wonder if low sampling probabilities – often as low as 0.01% for high-traffic services – will interfere with their analyses. Our experience at Google leads us to believe that, for high-throughput services, aggressive sampling does not hinder most important analyses. If a notable execution pattern surfaces once in such systems, it will surface thousands of times. Services with lower volume – perhaps dozens rather than tens of thousands of requests per second – can afford to trace every request; this is what motivated our decision to move towards adaptive sampling rates.

A second level of sampling happens when writing to BigTable. Without it, the trace infrastructure comes close to saturating BigTable:

Our production clusters presently generate more than 1 terabyte of sampled trace data per day. Dapper users would like trace data to remain available for at least two weeks after it was initially logged from a production process. The benefits of increased trace data density must then be weighed against the cost of machines and disk storage for the Dapper repositories. Sampling a high fraction of requests also brings the Dapper collectors uncomfortably close to the write throughput limit for the Dapper Bigtable repository.

Write sampling is based on the trace id:

We leverage the fact that all spans for a given trace – though they may be spread across thousands of distinct host machines – share a common trace id. For each span seen in the collection system, we hash the associated trace id as a scalar z, where 0 ≤ z ≤ 1. If z is less than our collection sampling coefficient, we keep the span and write it to the Bigtable. Otherwise, we discard it. By depending on the trace id for our sampling decision, we either sample or discard entire traces rather than individual spans within traces.
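A small sketch of that collection-time decision, with an arbitrary hash function and coefficient standing in for whatever Dapper actually uses:

```python
# Hash the trace id to a scalar z in [0, 1) and keep the span only if z is
# below the collection sampling coefficient. Hash choice and coefficient
# are illustrative.
import hashlib

COLLECTION_COEFFICIENT = 0.1   # illustrative value

def keep_span(trace_id: int) -> bool:
    digest = hashlib.sha256(trace_id.to_bytes(8, "big")).digest()
    z = int.from_bytes(digest[:8], "big") / 2**64   # deterministic z in [0, 1)
    return z < COLLECTION_COEFFICIENT
```

Because the decision is a deterministic function of the trace id alone, every collector keeps or discards a given trace consistently, so traces survive whole or not at all.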

What is it good for?

The Dapper team built an API for querying trace records supporting access by trace id, bulk processing of trace records, and indexed access allowing for lookup by service name, host machine, and timestamp in that order. An interactive web-based UI sits on top of this API.
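The three access patterns map naturally onto something like the following sketch (the function names are mine, not the actual DAPI), reusing the toy trace_table from earlier:

```python
# Illustrative query layer: access by trace id, bulk processing, and a
# composite (service, host, timestamp) index. Not the real DAPI.
def get_trace(trace_table, trace_id):
    """Direct access by trace id."""
    return trace_table.get(trace_id)

def map_over_traces(trace_table, fn):
    """Bulk access: apply a user function to every trace (MapReduce-style)."""
    return [fn(trace_id, spans) for trace_id, spans in trace_table.items()]

def lookup(index, service, host=None, start=None, end=None):
    """Indexed access: entries keyed by (service, host, timestamp), in that order."""
    for (svc, hst, ts), trace_id in sorted(index):
        if svc != service:
            continue
        if host is not None and hst != host:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        yield trace_id
```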

According to our logs, roughly 200 different Google engineers use the Dapper UI on a typical weekday; over the course of the week, accordingly, there are approximately 750-1000 distinct users. Those numbers are consistent from month to month modulo internal announcements of new features. It is common for users to send out links to specific traces of interest which will inevitably generate much one-time, short-duration traffic in the trace inspector.

Since Dapper shows what is really going on, as opposed to the hunches that engineers might have, it has found many uses inside of Google. The Ads Review team used Dapper extensively to help them rewrite one of their services from the ground up, and credit it with helping them achieve a two-order-of-magnitude improvement in latency. Causality captured in trace trees can also be used to infer service dependencies:

At any given time, a typical computing cluster at Google is host to thousands of logical “jobs”; sets of processes performing a common function. Google maintains many such clusters, of course, and indeed we find that the jobs in one computing cluster often depend on jobs in other clusters. Because dependencies between jobs change dynamically, it is not possible to infer all inter-service dependencies through configuration information alone. Still, various processes within the company require accurate service dependency information in order to identify bottlenecks and plan service moves among other things. Google’s appropriately-named “Service Dependencies” project has made use of trace annotations and the DAPI MapReduce interface in an effort to automate service dependency determination.

(Note here the passing reference to the impossibility of maintaining a central/master service dependency configuration file).
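The inference itself comes down to walking the parent/child edges in each trace and recording the cases where an edge crosses a service boundary. A sequential sketch of that per-trace logic (the real project uses the DAPI MapReduce interface; the per-span service field is an assumption of mine):

```python
# Infer (caller service -> callee service) edges from span causality.
from collections import defaultdict

def dependencies_from_trace(spans):
    """spans: {span_id: {"parent_id": ..., "service": ...}}.
    Yields (calling service, called service) edges implied by the trace tree."""
    for span in spans.values():
        parent = spans.get(span["parent_id"])
        if parent and parent["service"] != span["service"]:
            yield parent["service"], span["service"]

def aggregate(traces):
    """Count observed edges across many traces (the 'reduce' step)."""
    counts = defaultdict(int)
    for spans in traces:
        for edge in dependencies_from_trace(spans):
            counts[edge] += 1
    return counts
```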

Dapper is used to provide context for exception traces:

Google maintains a service which continually collects and centralizes exception reports from running processes. If these exceptions occurred in the context of a sampled Dapper trace, the appropriate trace and span ids are included as metadata in the exception report. The frontend to the exception monitoring service then provides links from specific exception reports to their respective distributed traces.
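The linkage amounts to carrying the ids of the ambient trace into the exception report whenever one exists; a tiny sketch, reusing the TraceContext sketch from earlier in this post:

```python
# If the exception occurred inside a sampled trace, record the trace and
# span ids so the exception frontend can link to the distributed trace.
def build_exception_report(exc, context=None):
    report = {"type": type(exc).__name__, "message": str(exc)}
    if context is not None:                 # only sampled requests carry one
        report["trace_id"] = context.trace_id
        report["span_id"] = context.span_id
    return report
```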

Dapper has also been used to understand tail latency, the network usage of different services, and to determine causes of load in shared storage systems.