
Flexible Paxos: Quorum intersection revisited

September 27, 2016

Flexible Paxos: Quorum intersection revisited Howard et al., 2016

Paxos has been around for 18 years now (26 by another count), and has been extensively studied. (For some background, see the two-week mini-series on consensus that I put together last year). In this paper, Howard et al. make a simple(?) observation that has significant consequences for improving the fault-tolerance and throughput of Paxos-based systems. After so much time and so much study, you can’t help but admire that.

Paxos proceeds in two phases: in the first phase a proposer is chosen (through a quorum), known as the leader; in the second phase the leader proposes a value to be agreed upon by the participants (acceptors) and brings a quorum to agreement. Paxos (like many other consensus systems) uses majority agreement for its quorums (i.e., at least N/2 + 1 out of N acceptors), which guarantees an intersection between the participants in phase one and phase two. Since we normally want to make agreement over a series of values (referred to in the literature as slots), we need to run a distinct instance of Paxos each time. Multi-Paxos recognises that the first phase (leader election) is independent of the proposal phase, so why not do that once and then retain that leader through a whole series of phase two agreements?

Here’s the key new insight: Paxos is more conservative than necessary in requiring a majority quorum in each phase. All we actually need to guarantee is that the phase one and phase two quorums intersect. We don’t need to guarantee that phase one quorums across successive rounds intersect, nor that phase two quorums intersect. If Q1 represents the participants in the first phase quorum, and Q2 the participants in the second phase quorum, then |Q1| + |Q2| > N (for N overall participants) will provide the needed guarantee. And that gets really interesting when you consider that (hopefully) leader elections happen much less frequently than phase 2 slot-decisions. So you can trade off a larger quorum when you do need to elect a new leader against a smaller quorum requirement during regular operation. The authors call this variant Flexible Paxos (FPaxos). If you’re like me, you’ll want to see some kind of proof or justification for the quorum intersection claim – I spent the first 8 pages of the paper hungry for it! It’s in section 5, starting on page 9. There’s also a model-checked formal specification in TLA+ in the appendix.
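To make the arithmetic concrete, here’s a minimal Python sketch (my own, not from the paper) that checks the FPaxos condition |Q1| + |Q2| > N and prints the trade-offs available in a five-node cluster:

# Minimal illustration of the FPaxos quorum-size condition; not from the paper.
# With N acceptors, any phase-1 quorum of size q1 and any phase-2 quorum of
# size q2 must share at least one acceptor whenever q1 + q2 > N.

def quorums_intersect(n, q1, q2):
    return q1 + q2 > n

N = 5
for q2 in range(1, N + 1):
    q1 = N - q2 + 1                     # smallest Q1 that still guarantees intersection
    assert quorums_intersect(N, q1, q2)
    print(f"N={N}: |Q1|={q1}, |Q2|={q2}  (classic Paxos uses |Q1|=|Q2|={N // 2 + 1})")

Classic (Multi-)Paxos is simply the row where both quorums are majorities.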

In this paper, we demonstrate that Paxos, which lies at the foundation of many production systems, is conservative. Specifically, we observe that each of the phases of Paxos may use non-intersecting quorums. Majority quorums are not necessary as intersection is required only across phases.

It has previously been noted that Paxos could be generalized to replace majority quorums with any quorum system guaranteeing any two quorums will have a non-empty intersection. FPaxos takes this generalisation one step further by relaxing the conditions under which such guarantees are needed in the first place.

Howard et al. are (to the best of their knowledge) the first to prove and implement the generalization. While they were preparing their publication, Sougoumarane independently made the same observation and released a blog post summarising it last month.

Why does all this matter?

The fundamental theorem of quorum intersection states that [the resilience of a quorum system] is inversely proportional to the load on (hence the throughput of) the participants. Therefore, with Paxos and its intersecting quorums, one can only hope to increase throughput by reducing the resilience, or vice versa…. by weakening the quorum intersection requirement, we can break away from the inherent trade off between resilience and performance.

Howard et al. go on to illustrate the practical implications of the relaxed quorum requirement for FPaxos using majority quorums, simple quorums, and grid quorums. These however can be considered “naive” quorum systems, and FPaxos actually opens up a whole new world:

There already exists an extensive literature on quorum systems from the fields of databases and data replication, which can now be more efficiently applied to the field of consensus. Interesting example systems include weighted voting, hierarchies, and crumbling walls.

[And that’s another addition to my known unknowns list😉 ].

Majority quorums

If we stick with majority quorums, FPaxos allows us to make a simple improvement in the case when the number of acceptors n is even. Instead of needing n/2 + 1 participants in Q2, we can reduce this to n/2.

Such a change would be trivial to implement and by reducing the number of acceptors required to participate in replication, we can reduce latency and improve throughput. Furthermore, we have also improved the fault tolerance of the system. As with Paxos, if at most n/2 - 1 failures occur then we are guaranteed to be able to make progress. However unlike with Paxos, if exactly n/2 acceptors fail and the leader is still up then we are able to continue to make progress and suffer no loss of availability.

Simple quorums

Simple quorums are for me the most natural way of understanding the benefits of FPaxos. “We will use the term simple quorums to refer to a quorum system where any acceptor is able to participate in a quorum and each acceptor’s participation is counted equally. Simple quorums are a straightforward generalization of majority quorums.”

Since we require only that |Q1| + |Q2| > N, and we know that phase 2 happens a lot more often than phase one, we can reduce the quorum size requirement in phase 2 and make a corresponding increase in phase 1. Take a classic Paxos deployment with 5 replicas. Whereas traditionally each phase requires at least three nodes (acceptors) to be up, we can tweak this to require four acceptors in a leader election quorum, but only two acceptors for Q2.

FPaxos will always be able to handle up to |Q2| – 1 failures. However, if between |Q2| and N – |Q2| failures occur, we can continue replication until a new leader is required.

If we choose 5 replicas with |Q1| = 4 and |Q2| = 2, we will therefore always be able to handle any single failure, and we’ll be able to continue replication with the loss of up to three replicas, so long as we don’t need a new leader.

The evaluation includes results showing how the latency and throughput of FPaxos compare to regular Paxos with varying Q2 quorum sizes (see the figures in the paper).

Grid quorums

Grid quorum schemes arrange the N nodes into a matrix of N1 columns by N2 rows, where N1 × N2 = N and quorums are composed of rows and columns. As with many other quorum systems, grid quorums restrict which combinations of acceptors can form valid quorums. This restriction allows us to reduce the size of quorums whilst still ensuring that they intersect.

Whereas with simple quorums any reduction in |Q2| must be paid for by a corresponding increase in |Q1|, with grid quorums we can make different trade-offs between quorum size, quorum selection flexibility, and fault tolerance.

Since Paxos requires all quorums to intersect, one suitable scheme would be to require one row and one column to form a quorum (scheme (a) in the paper’s figure).

In FPaxos we can safely reduce our quorums to one row of size N1 for Q1, and one column of size N2 for Q2. This construction is interesting as quorums from the same phase will never intersect, and may be useful in practice for evenly distributing the load of FPaxos across a group of acceptors.
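To see why this works, here’s a small Python sketch (mine, not the paper’s) that builds the grid and checks that every row quorum meets every column quorum in exactly one acceptor, while two quorums from the same phase never intersect:

# Sketch of FPaxos grid quorums (illustrative, not the paper's code).
# Arrange N = N1 * N2 acceptors in a grid; use any one row (size N1) as Q1
# and any one column (size N2) as Q2.

N1, N2 = 4, 3                       # 4 columns x 3 rows = 12 acceptors
grid = [[r * N1 + c for c in range(N1)] for r in range(N2)]

rows    = [set(row) for row in grid]                                # candidate Q1s
columns = [set(grid[r][c] for r in range(N2)) for c in range(N1)]   # candidate Q2s

# Every phase-1 quorum intersects every phase-2 quorum...
assert all(len(q1 & q2) == 1 for q1 in rows for q2 in columns)
# ...but quorums from the same phase never need to intersect.
assert all(len(a & b) == 0 for a in rows for b in rows if a is not b)
print("every row (Q1) meets every column (Q2) in exactly one acceptor")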

A further enhancement

FPaxos only requires that a given Q1 will intersect with all Q2’s with lower proposal numbers. Given a mechanism to learn which Q2s have participated in lower numbered proposals, we have further flexibility in choosing a Q1 quorum.

The implications of this enhancement can be far reaching. For example, in a system of N = 100 nodes, a leader may start by announcing a fixed Q2 of size f+1, and all higher proposal numbers (and readers) will need to intersect with only this Q2. This allows us to tolerate N – f failures…

(And many other interesting variations are possible).

The bottom line

Generalizing existing systems to use FPaxos should be quite straightforward. Exposing replication (phase 2) quorum size to developers would allow them to choose their own trade-off between failure tolerance and steady state latency… [Secondly], by no longer requiring replication quorums to intersect, we have removed an important limit on scalability. Through smart quorum construction and pragmatic system design, we believe a new breed of scalable, resilient, and performant consensus algorithms is now possible.

Time evolving graph processing at scale

September 26, 2016

Time evolving graph processing at scale Iyer et al., GRADES 2016

Here’s a new (June 2016) paper from the distinguished AMPLab group at Berkeley that really gave me cause to reflect. The work addresses the problem of performing graph computations on graphs that are constantly changing (because updates flow in, such as a new follower in a social graph). Many graphs of interest are constantly evolving – in part, that’s what makes them interesting. You could always take a snapshot of e.g. the graph as it was at the end of the previous day and compute on that, but some applications need more up-to-date results (e.g. detecting traffic hotspots in cellular networks), and many applications would benefit from real-time results. GraphTau is a solution to this problem, implemented on top of GraphX, which is in turn implemented on top of Spark’s RDDs. It’s a convergence of stream processing and graph processing.

I’m seeing a lot of streaming recently, and a lot of convergence. That topic probably warrants a separate post. In the meantime…

Graph-structured data is on the rise, in size, complexity and the dynamism they exhibit. From social networks to telecommunication networks, applications that generate graph-structured data are ubiquitous… the dynamic nature of these datasets gives them a unique characteristic – the graph-structure underlying the data evolves over time. Unbounded, real-time data is fast becoming the norm, and thus it is important to process these time-evolving graph-structured datasets efficiently.

(Aside: applications generating graph-structured data certainly are ubiquitous – pretty much any relational database has graph structure the minute you introduce foreign keys. It’s applications that generate graph-structured data and require extensive traversals or graph-specific computations that we’re really interested in here).

For time-evolving graph-structured datasets, the authors identify three core requirements:

  1. The ability to execute iterative graph algorithms in real-time
  2. The ability to combine graph-structured data with unstructured and tabular data
  3. The ability to run analytics over windows of input data

While some specialized systems for evolving-graph processing exist, these do not support the second and third requirements.  GraphTau is “the first time-evolving graph processing system, to our knowledge, built on a general purpose dataflow framework.” GraphTau is built on top of GraphX, which maintains graphs internally as a pair of RDDs: a vertex collection and an edge collection.

(Note that Apache Flink has Gelly, which builds graph processing on top of a streaming dataflow core, but does not support iterative processing over evolving graphs to the best of my knowledge.)

The main idea in GraphTau is to treat time-evolving graphs as a series of consistent graph snapshots, and dynamic graph computations as a series of deterministic batch computations on discrete time intervals. A graph snapshot is simply a regular graph, stored as two RDDs, the vertex RDD and the edge RDD. We define GraphStream as a sequence of immutable, partitioned datasets (graphs represented as two RDDs) that can be acted on by deterministic operators. User programs manipulate GraphStreams to produce new GraphStreams, as well as intermediate state in the form of RDDs or graphs.

A DeltaRDD is an RDD whose elements are updates that need to be applied to a graph. A stream of such updates is called a DeltaDStream.  GraphStreams can be built from a DeltaDStream or directly from a vertex DStream and an edge DStream.

There are two supported computational models, called pause-shift-resume and online rectification.

Pause-shift-resume

Some classes of graph algorithms can cope with the graph being modified while the algorithm is still converging. For example, if a graph changes during an evaluation of PageRank you’ll still get an answer, which studies have shown will be within a reasonable error of the answer you’d get if you restarted the algorithm from scratch on the now-current graph.

Under these conditions, the pause-shift-resume (PSR) model is appropriate.

In this model, GraphTau starts running a graph algorithm as soon as the first snapshot of a graph is available. Upon the availability of a new snapshot, it pauses the computation on the current graph, shifts the algorithm specific data to the new snapshot, and resumes the computation on the new graph.
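Here’s a small, self-contained Python simulation of that idea (my own illustration, not GraphTau’s API): PageRank starts on one snapshot, a new snapshot arrives part-way through, and the accumulated ranks are shifted over and the iteration resumed rather than restarted.

# Illustration of pause-shift-resume (PSR); my own sketch, not GraphTau's API.

def pagerank_step(ranks, out_edges, damping=0.85):
    """One synchronous PageRank iteration over an adjacency dict."""
    n = len(out_edges)
    incoming = {v: 0.0 for v in out_edges}
    for u, nbrs in out_edges.items():
        if nbrs:
            share = ranks[u] / len(nbrs)
            for v in nbrs:
                incoming[v] += share
    return {v: (1 - damping) / n + damping * incoming[v] for v in out_edges}

snapshot_1 = {0: [1], 1: [2], 2: [0], 3: [0]}        # initial graph
snapshot_2 = {0: [1], 1: [2], 2: [0], 3: [0, 2]}     # same graph after a new edge arrives

ranks = {v: 1.0 / len(snapshot_1) for v in snapshot_1}
for _ in range(5):                                   # computation in progress...
    ranks = pagerank_step(ranks, snapshot_1)

# New snapshot arrives: pause, shift the algorithm state (ranks) over, resume.
for _ in range(20):
    ranks = pagerank_step(ranks, snapshot_2)

print({v: round(r, 3) for v, r in ranks.items()})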

Online rectification

Algorithms such as connected-components will produce incorrect results under the PSR model (consider an edge or vertex that is removed during processing).

For such algorithms, GraphTau proposes the online rectification model. In this model, GraphTau rectifies the errors caused by the underlying graph modifications in an online fashion using minimal state.

In the connected component example, it is necessary for every vertex to keep track of its component id over time. The approach works for any algorithm based on label propagation, at the expense of needing to keep algorithm-specific state.

The question of time

GraphStream splits time into non-overlapping intervals, and stores all the inputs received during these intervals in batches (worker nodes are synchronized using NTP). Such intervals are based on receive time; there is also an option to process based on external timestamps (event time), which requires either the introduction of a limited slack time to wait for late events, or application-specific code to correct for late records.

Each interval’s updates reflect all of the input received until then, despite the fact that the DeltaRDD and its updated graph snapshot are distributed across nodes. As long as we process the whole batch consistently (e.g. ordered by timestamps), we will get a consistent snapshot. This makes distributed state much easier to reason about, and gives the equivalent of “exactly once” processing of the graph updates even with faults or stragglers.

GraphStream inherits its recovery mechanisms from GraphX and its RDDs: parallel recovery of lost state and speculative execution.

Programming with GraphTau

The GraphStream interface supports transform, merge, streamingBSP, and forEachGraph operations, as well as an updateLocalState operator to allow for event processing and state tracking.

  • mergeByWindow merges all graphs from a sliding window of past time intervals into one graph
  • forEachGraph applies a function to each graph generated from the GraphStream
  • transformWith combines two graph streams with various join and cogroup operators.
  • the streamingBSP operator supports differential computation

This [streamingBSP] operator enables efficient implementation of a large class of incremental algorithms on time-evolving graphs. We signal the availability of the new graph snapshot using a variable in the driver program. In each iteration of Pregel, we check whether a new graph is available. If so, we do not proceed to the next iteration on the current graph. Instead, we resume computation on the new graph reusing the result, where only vertices in the new active set continue message passing. The new active set is a function of the old active set and the changes between the new graph and the old graph. For a large class of algorithms (e.g. incremental PageRank), the new active set includes vertices from the old set, any new vertices, and vertices with edge additions and deletions.

The paper includes a PageRank example written against this API (see the code listing there).

Even on a simple six-node graph where one edge is added after 10 iterations, this saves 13/34 iterations overall.

Here’s another example GraphTau program, showing the ability to unify data and graph stream processing.

This example computes top users in terms of triangle counts from a Twitter attention graph. A DStream ds is created from the external Twitter feed, and then a GraphStream is built from it. Triangle count is applied to each graph snapshot, and then we compute the number of times a user is a top user over a sliding window of ten seconds, outputting results every second.
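To give a feel for the computation (plain Python standing in for the GraphStream/DStream API, which is only mimicked here), the following batches a stream of edges into one-second intervals, counts triangles per vertex over a sliding ten-second window, and reports the top users each second:

# Sketch of the windowed triangle-count analysis in plain Python; illustrative only.
from collections import Counter, deque

def triangle_counts(edges):
    """Per-vertex triangle participation counts for an undirected edge set."""
    edges = {frozenset(e) for e in edges if e[0] != e[1]}
    adj = {}
    for u, v in (tuple(e) for e in edges):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = Counter()
    for u, v in (tuple(e) for e in edges):
        for w in adj[u] & adj[v]:       # a common neighbour closes a triangle
            counts[u] += 1
            counts[v] += 1
            counts[w] += 1
    # Each triangle is discovered once from each of its 3 edges, so divide by 3.
    return {v: c // 3 for v, c in counts.items()}

# Simulated attention graph: one batch of (follower, followed) edges per second.
batches = [
    [("a", "b"), ("b", "c"), ("c", "a")],
    [("c", "d"), ("d", "a")],
    [("b", "d")],
]

window = deque(maxlen=10)               # sliding window of the last 10 one-second batches
for t, batch in enumerate(batches):
    window.append(batch)
    snapshot_edges = [e for b in window for e in b]
    top = sorted(triangle_counts(snapshot_edges).items(), key=lambda kv: -kv[1])[:3]
    print(f"t={t}s top users by triangle count: {top}")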

A preliminary evaluation shows that GraphTau’s performance matches or exceeds that of specialized systems on a streaming connected-components task based on a cellular dataset.

The last word…

In this paper, we presented GraphTau, a time-evolving graph processing system built on a dataflow framework that addresses this demand. GraphTau represents time-evolving graphs as a series of consistent graph snapshots. On these snapshots, GraphTau enables two computational models, the Pause-Shift-Resume model and the Online Rectification model, which allow the application of differential computation to a wide variety of graph algorithms. These techniques enable GraphTau to achieve significant performance improvements.

Texture networks: feed-forward synthesis of textures and stylized images

September 23, 2016

Texture Networks: Feed-forward synthesis of textures and stylized images Ulyanov et al., arXiv, March 2016

During the summer break I mostly stayed away from news feeds and twitter, which induces terrible FOMO (Fear Of Missing Out) to start with. What great research was published / discussed that I missed? Was there a major industry announcement I’m completely ignorant of? One thing I’m glad I didn’t miss was the Prisma app that produces quite beautiful stylized versions of photos from your smartphone. It’s a great example of deep technology behind a simple interface, and also of the rapid packaging and exploitation of research results – today’s choice is the paper describing the technology breakthrough that makes Prisma possible, and it was released to arXiv in March 2016. The source code and models described in the paper can also be found on GitHub.

Gatys et al. recently (2015) showed that deep networks can generate beautiful textures and stylized images from a single texture example. If you want to style a lot of images though (to provide styling-as-a-service for example), you’ll find that their technique is slow and uses a lot of memory. To generate images of equivalent quality, an implementation of Gatys et al. required about 10 seconds and 1.1GB of memory, whereas the approach described by Ulyanov et al. in this paper requires about 20ms and only 170MB of memory. Significantly faster and cheaper therefore, and although the algorithm doesn’t quite match the results of Gatys et al. for all images, it’s still very good.

Just in case you haven’t seen it, here are some examples. First, generating textures in the style of a sample image:

And combining a style image with a content image:

If you download the app, you can play with examples using your own photos.

One of the possibilities I’m personally excited about is the opportunities the image creation speed opens up for applying the technique to movies. I like the images, but when I saw the movies created by Ruder et al. using an extension of the Gatys technique I was really blown away [paper, explanation and video]. Update: I just learned about the Artisto app that does this for you!

High-level approach

In general, one may look at the process of generating an image x as the problem of drawing a sample from a certain distribution p(x). In texture synthesis, the distribution is induced by an example texture instance x0, such that we can write x ~ p(x|x0). In style transfer, the distribution is induced by an image x0 representative of the visual style (e.g. an impressionist painting) and a second image x1 representative of the visual content (e.g. a boat), such that x ~ p(x|x0,x1).

Gatys et al. cast this as an optimisation problem looking to minimise the difference between certain image statistics of the generated image, and the statistics of the example image(s). They use an iterative optimisation procedure with back propagation to gradually change the values of the pixels in the generated image until the desired statistics are achieved.

In contrast, in the texture networks approach a feed-forward generation network produces the image, which requires only a single evaluation of the network and does not incur the cost of backpropagation.

A separate generator network is trained for each texture or style and, once trained, it can synthesize an arbitrary number of images of arbitrary size in an efficient feed-forward manner.

The loss function used in training the generator network is derived from Gatys et al. and compares image statistics extracted from a fixed pre-trained descriptor CNN. This is used to measure the mismatch between the prototype texture and the generated image. The texture loss function compares feature activations across all spatial locations. A similar content loss function compares feature activations at corresponding spatial locations, and therefore preserves spatial information.

Analogously to Gatys et al. we use the texture loss alone when training a generator network for texture synthesis, and we use a weighted combination of the texture loss and the content loss when training a generator network for stylization.
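As a rough sketch of the two losses (using PyTorch; the paper’s descriptor network is a pre-trained CNN, and the layer choices and normalisation below are illustrative rather than the paper’s exact formulation):

# Sketch of Gram-matrix texture loss and feature-map content loss (illustrative).
import torch

def gram(features):
    """Gram matrix of a (batch, channels, h, w) activation tensor."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(gen_feats, style_feats):
    # Compare second-order statistics: spatial positions are averaged away.
    return sum(torch.mean((gram(g) - gram(s)) ** 2) for g, s in zip(gen_feats, style_feats))

def content_loss(gen_feats, content_feats):
    # Compare activations at corresponding spatial locations: layout is preserved.
    return sum(torch.mean((g - c) ** 2) for g, c in zip(gen_feats, content_feats))

# total stylization loss = texture_loss(...) + alpha * content_loss(...) for some weight alpha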

Textures

A texture generator network is trained to transform a noise vector sampled from a certain distribution into texture samples that match, according to the texture loss function, a certain prototype texture x0, a three colour channel tensor.

We experimented with several architectures for the generator network g… we found that multi-scale architectures result in images with small texture loss and better perceptual quality while using fewer parameters and training faster. […] Each random noise tensor is first processed by a sequence of convolutional and non-linear activation layers, then upsampled by a factor of two, and finally concatenated as additional feature channels to the partially processed tensor from the scale below.


Each convolutional block contains three convolutional layers containing respectively 3×3, 3×3, and 1×1 filters applied using circular convolution to remove boundary effects. Each convolutional layer is followed by a ReLU activation layer.
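Here’s roughly what one such block might look like in PyTorch (channel counts and framework here are illustrative, not the authors’ released code):

# Sketch of a multi-scale generator convolutional block (illustrative).
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3, 3x3, 1x1 convolutions with circular padding, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, padding_mode='circular'),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, padding_mode='circular'),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
        nn.ReLU(),
    )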

When learning with stochastic gradient descent, each iteration draws a mini-batch of noise vectors, performs a forward evaluation of the generator network to obtain the corresponding images, and computes the loss vs x0.

… After that, the gradient of the texture loss with respect to the generator network parameters θ is computed using backpropagation, and the gradient is used to update the parameters.

Styling

For stylized image generator networks, the network is modified to take as input, in addition to the noise vector z, the image y to which the noise should be applied.

The generator network is then trained to output an image x that is close in content to y and in texture/style to a reference texture x0.

The architecture is the same as that used for texture synthesis, with the important difference that the noise tensors at the K scales are concatenated (as additional feature channels) with downsampled versions of the input image y. The learning objective is to minimize the combination of the content and texture loss.

In practice, we found that learning is surprisingly resilient to overfitting and that it suffices to approximate the distribution on natural images with a very small pool of images (e.g 16).

Broader applicability

The success of this approach highlights the suitability of feed-forward networks for complex data generation and for solving complex tasks in general. The key to this success is the use of complex loss functions that involve different feed-forward architectures serving as “experts” assessing the performance of the feed-forward generator.

Why should I trust you? Explaining the predictions of any classifier

September 22, 2016

“Why Should I Trust You?” Explaining the Predictions of Any Classifier Ribeiro et al., KDD 2016

You’ve trained a classifier and it’s performing well on the validation set – but does the model exhibit sound judgement or is it making decisions based on spurious criteria? Can we trust the model in the real world? And can we trust a prediction (classification) it makes well enough to act on it? Can we explain why the model made the decision it did, even if the inner workings of the model are not easily understandable by humans? These are the questions that Ribeiro et al. pose in this paper, and they answer them by building LIME – an algorithm to explain the predictions of any classifier, and SP-LIME, a method for building trust in the predictions of a model overall. Another really nice result is that by explaining to a human how the model made a certain prediction, the human is able to give feedback on whether the reasoning is ‘sound’ and suggest features to remove from the model – this leads to classifiers that generalize much better to real world data.

Consider two classifiers (Algorithm 1 and Algorithm 2 in the figure below) both trained to determine whether a document is about Christianity or atheism. Algorithm 2 performs much better in hold-out tests, but when we see why it is making its decisions, we realise it is actually much worse…

Magenta words are those contributing to the atheism class, green for Christianity. The second algorithm is basing its decision on “Posting”, “Host”, “Re” and “nntp” – words that have no connection to either Christianity or atheism, but happen to feature heavily in the headers of newsgroup postings about atheism in the training set.

What makes a good explanation?

It must be easily understandable by a human!

For example, if hundreds or thousands of features significantly contribute to a prediction, it is not reasonable to expect any user to comprehend why the prediction was made, even if individual weights can be inspected.

And it must meaningfully connect input variables to the response:

…which is not necessarily true of the features used by the model, and thus the “input variables” in the explanation may need to be different than the features.

Furthermore, an explanation must have local fidelity: it should correspond to how the model behaves in the vicinity of the instance being predicted.

The ideal explainer should also be able to explain any model, and thus be model-agnostic.

A key insight – local interpretation

Creating a globally faithful interpreter of a model’s decisions might require a complete description of the model itself. But to explain an individual decision we only need to understand how it behaves in a small local region. The idea reminds me a little bit of differentiation – overall the shape of the curve may be very complex, but if we look at just a small part we can figure out the gradient in that region.

Here’s a toy example from the paper – the true decision boundary in the model is represented by the blue/pink background. In the immediate vicinity of the decision (the bold red cross) though we can learn a much simpler explanation that is locally faithful even if not globally faithful.

The LIME algorithm produces Local Interpretable Model-agnostic Explanations.

The overall goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier.

For text classification, an interpretable representation could be a vector indicating the presence or absence of a word, even though the classifier may use more complex word embeddings. For image classification, an interpretable representation might be a binary vector indicating the ‘presence’ or ‘absence’ of a contiguous patch of similar pixels.

LIME works by drawing samples in the vicinity of the input to be explained and learning a linear classifier using locally weighted square loss, with a limit K set on the number of interpretable features.
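Here’s a compact sketch of that loop for a binary “word present/absent” representation (my own simplification using scikit-learn; the authors’ released implementation does more, e.g. K-Lasso feature selection and mapping interpretable features back to the classifier’s input space):

# Simplified LIME-style local explanation (illustrative).
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(predict_proba, x, num_samples=5000, kernel_width=0.75, K=6, seed=0):
    """x: binary vector over interpretable features (e.g. word presence/absence).
    predict_proba: black-box function mapping a batch of such vectors to the
    probability of the class being explained."""
    rng = np.random.default_rng(seed)
    d = len(x)
    # Perturb the instance by randomly switching off features that are present in x.
    Z = rng.integers(0, 2, size=(num_samples, d)) * x
    Z[0] = x                                        # keep the original instance in the sample
    y = predict_proba(Z)                            # query the black box on the perturbations
    # Weight perturbed samples by proximity to the original instance.
    distance = (Z != x).sum(axis=1) / d
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Fit a weighted linear model locally and keep the K largest-magnitude features.
    local_model = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    top = np.argsort(-np.abs(local_model.coef_))[:K]
    return [(int(i), float(local_model.coef_[i])) for i in top]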

Since [the algorithm] produces an explanation for an individual prediction, its complexity does not depend on the size of the dataset, but instead on time to compute f(x) [a model prediction] and on the number of samples N. In practice, explaining random forests with 1000 trees using scikit-learn on a laptop with N = 5000 takes under 3 seconds without any optimizations such as using gpus or parallelization. Explaining each prediction of the Inception network for image classification takes around 10 minutes.

From local explanation to model trust

The central idea here is that if we understand and trust the reasoning behind an individual prediction, and we repeat this process for a number of predictions that give good coverage of the input space, then we can start to build global trust in the model itself.

We propose to give a global understanding of the model by explaining a set of individual instances. This approach is still model agnostic, and is complementary to computing summary statistics such as held-out accuracy. Even though explanations of multiple instances can be insightful, these instances need to be selected judiciously, since users may not have the time to examine a large number of explanations. We represent the time/patience that humans have by a budget B that denotes the number of explanations they are willing to look at in order to understand a model. Given a set of instances X, we define the pick step as the task of selecting B instances for the user to inspect.

Examining the instances in X, we know which features are locally important in making each prediction. Features that are locally important for many instances are globally important. The B instances are picked so as to cover the globally important features first, and to avoid redundancy in explanation between them.
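A minimal sketch of that pick step (a greedy simplification of the paper’s submodular pick, with illustrative names):

# Greedy coverage-based pick of B instances to show a user (illustrative).
import numpy as np

def submodular_pick(W, budget):
    """W[i, j] = |weight| of interpretable feature j in the explanation of instance i.
    Greedily pick `budget` instances that cover the globally important features."""
    importance = np.sqrt(W.sum(axis=0))              # global importance of each feature
    chosen, covered = [], np.zeros(W.shape[1], dtype=bool)
    for _ in range(budget):
        candidates = [i for i in range(W.shape[0]) if i not in chosen]
        if not candidates:
            break
        gains = [(importance[covered | (W[i] > 0)].sum(), i) for i in candidates]
        best = max(gains)[1]                         # instance with the best coverage gain
        chosen.append(best)
        covered |= W[best] > 0
    return chosen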

With a little help from my friends

Using human subjects recruited via Amazon Mechanical Turk – by no means machine learning experts, but with a basic knowledge of religion – the team provided explanations for the predictions of two different models classifying documents as atheist or Christian and asked the subjects which would generalize better (perform the best in the real world). Using LIME coupled with the mechanism just described to create representative instances, the human subjects were able to choose the correct model 89% of the time.

A second experiment asked Amazon Mechanical Turk users to identify which words from the explanations should be removed from subsequent training, for the worst classifier.

If one notes that a classifier is untrustworthy, a common task in machine learning is feature engineering, i.e. modifying the set of features and retraining in order to improve generalization. Explanations can aid in this process by presenting the important features, particularly for removing features that the users feel do not generalize.

The users are not ML experts, and don’t know anything about the dataset. Starting with 10 users, 10 classifiers are trained (one for each subject, with their suggested words removed). These are presented to five users each, resulting in another 50 classifiers. Each of these is presented to five users, giving 250 final models.

It is clear… that the crowd workers are able to improve the model by removing features they deem unimportant for the task… Each subject took an average of 3.6 minutes per round of cleaning, resulting in just under 11 minutes to produce a classifier that generalizes much better to real world data.

High agreement among users on the words to be removed indicates that users converge on similar correct models. “This evaluation is an example of how explanations make it easy to improve an untrustworthy classifier – in this case easy enough that machine learning knowledge is not required.”

The Morning Paper on Operability

September 21, 2016

I gave a 30 minute talk at the Operability.io conference yesterday on the topic of “The Morning Paper meets operability.” In a first for me, I initially prepared the talk as a long blog post, and then created a set of supporting slides at the end. Today’s post is the text of that talk – no new papers therefore, but hopefully a fun reminder of some of the good stuff we’ve looked at over the last couple of years. Plenty of other research I could have mentioned too of course, if only I had more time…!

Abstract

The Morning Paper is a daily blog covering fundamental research and the latest results in computer science. Over the last two years Adrian has covered over 400 papers, and in this talk he will introduce a selection of lessons learned and relevant research results that might help us tame some of the complexities involved in operating modern applications and infrastructure. From failure sketching to pivot tracing, provenance to profiling, expect a rapid-fire succession of ideas and research designed to inspire you never to settle for the status quo.

Introduction

Hi, my name’s Adrian Colyer. I’m a venture partner with Accel here in London where it’s my job to help find and build great technology companies in Europe. Prior to that I held CTO roles at Pivotal, VMware, and SpringSource. I also write a blog called ‘The Morning Paper’ where I post a summary of a computer science research paper every weekday. I’ve just entered my third year of doing this, which equates to over 400 paper write-ups.

When starting to think about this talk, my impression was that the majority of papers (or at least, of the ones that I read) don’t really talk about operational issues. But I was pleasantly surprised, as I looked back through my collection, just how many papers there were that touched on issues relevant to operations. For the next 25 minutes or so I’m going to share some of my personal highlights with you, and I hope that something here catches your interest and inspires you to dig deeper.

Operability starts with design and development

If I could only share one paper with you today, it would be this one: “On designing and deploying internet-scale services” by James Hamilton from LISA ’07. I could easily spend the full half-an-hour just on this. Hamilton reminds us that operability is not something you bolt-on at the end, but is an integral part of the design of a system:

We have long believed that 80% of operations issues originate in design and development… when systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues however, either have their genesis in design and development or are best solved there.

When you think about it, that’s a powerful argument for DevOps, being made back in 2007. Here’s another quote from the same paper talking about the need for service teams that have shared responsibility for dev and ops:

If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.

Hamilton’s three tenets for operable services will seem very familiar to us today; even so, it’s good advice, and it’s interesting to see what Hamilton considers the most important high-level things to be:

  1. Expect failures to happen regularly and handle them gracefully
  2. Keep things as simple as possible
  3. Automate everything

Beneath these tenets there’s so much advice packed into the paper that I worked through it and turned it into a checklist that I’ve made available as a gist on GitHub.

Perhaps now you can see why we could easily spend the whole half hour just on this one paper! There’s more I want to share with you though.

Figuring out what’s going on – the big picture

I read a collection of papers on the topic of “how the hell do I know what is going on in my distributed system / microservices deployment?” One of the classics here is Google’s Dapper paper. In it, Google describe how they instrument everything in order to generate trace information, then sample those traces and store the samples in a database with an API on top for querying.

Here are three lessons I took from the paper:

  1. The system configuration and behaviour have to be inferred – under constant deployments it doesn’t make sense to talk about a central master config file.

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities…. Understanding system behavior in this context requires observing related activities across many different programs and machines.

  2. Even your best engineers often get it wrong when they’re working from guesses and intuition.

When systems involve not just dozens of subsystems but dozens of engineering teams, even our best and most experienced engineers routinely guess wrong about the root cause of poor end-to-end performance. In such situations, Dapper can furnish much-needed facts and is able to answer many important performance questions conclusively.

  3. If your tracing infrastructure relies on application developers ‘actively collaborating’ it will become extremely fragile and often broken / out of date. I’ve previously studied the same phenomenon in large commercial software systems during my time working as the lead of the AspectJ project.

Google get around this last issue by having common shared infrastructure components that everyone uses, so they can add instrumentation there. But many companies don’t, including Facebook, so what can they do?

Facebook’s Mystery Machine may hold some of the answers.

… many systems are like the Facebook systems we study; they grow organically over time in a culture that favors innovation over standardization (e.g., “move fast and break things” is a well-known Facebook slogan). There is broad diversity in programming languages, communication middleware, execution environments, and scheduling mechanisms. Adding instrumentation retroactively to such an infrastructure is a Herculean task.

Never fear, the logs are here!

Our key observation is that the sheer volume of requests handled by modern services allows us to gather observations of the order in which messages are logged over a tremendous number of requests.

Log messages are required to have:

  • a request identifier
  • a host identifier
  • a host-local timestamp
  • a unique event label

The output from all the logs is captured in an UberTrace system, which samples a small fraction for end-to-end tracing.

The Mystery Machine uses the traces generated by UberTrace to create a causal model of how software components interact during the end-to-end processing of a Facebook request. It then uses the causal model to perform several types of distributed systems performance analysis…

  • Finding the critical path
  • Quantifying slack for segments not on the critical path (this knowledge is later used to improve average latency by making connections with considerable slack lower priority)
  • Identifying segments correlated with performance anomalies

You might also be interested in Gorilla, an in-memory TSDB that Facebook built to handle the 1 trillion data points per day gathered by their monitoring system. Gorilla can store 26 hours of data in memory with HBase used for longer-term storage. One of the tools built on top of Gorilla is a time series search that can look for metrics that show a correlation with errors, helping engineers look for root causes.

Facebook’s Mystery Machine needs a request id in log records that can be used to correlate entries in a trace, but what if you don’t even have that?

lprof makes the absolute best out of whatever you do have in your logs, and is capable of inferring causal dependencies to identify individual requests and profile their performance behaviour. It can track service requests across helper threads and remote service invocations.

The information in the logs is sufficiently rich to allow the recovering of the inherent structure of the dispersed and intermingled log output messages, thus enabling useful performance profilers like lprof.

The lprof implementation works with Java and combines static bytecode analysis with log statement analysis. The static analysis looks at log printing statements and generates regular expressions to match log statements produced by them, and does dataflow analysis to determine which variables included in a log statement are modified and which aren’t. The ones that aren’t are candidate request identifiers… A temporal analysis (e.g. message A must come before message B) is used to help in the case where there is no unique id per request. It also uncovers communication between threads.

The team analysed HDFS, Cassandra, YARN, and HBase in this way.

Our evaluation shows that lprof can accurately attribute 88% of the log messages from widely-used production quality distributed systems and is helpful in debugging 65% of the sampled real-world performance anomalies.

But what if the information you need isn’t in the logs in the first place? There always seems to be either too much information so that you drown in a sea of irrelevance, or too little in the crucial area where you really need it. Pivot tracing enables dynamically configuring and installing of monitoring at runtime, with low system overhead and causality capture between events across multiple processes and applications. It uses a load-time weaver (similar in spirit to AspectJ) and a custom query language for the information you want to extract. Overheads are low at around 0.3%, and there are some excellent examples of pivot tracing query results being used to solve real bugs in Hadoop.

Narrowing it down

Failure Sketching is a technique for automated root cause analysis of in-production failures.

The greatest challenge is posed by bugs that only recur in production and cannot be reproduced in-house. Diagnosing the root cause and fixing such bugs is truly hard. In [57] developers noted: “We don’t have tools for the once every 24 hours bug in a 100 machine cluster”. An informal poll on Quora asked “What is a coder’s worst nightmare?,” and the answers were “The bug only occurs in production and can’t be replicated locally,” and “The cause of the bug is unknown.”

Let me show you what failure sketching does for you, and then we can talk about how it does it. Here’s an example of an automatically created failure sketch, using the authors’ tool called Gist, for a bug in Apache which is caused by a double free:

It makes me smile that the authors apologise for the greyed-out statements that do appear in their generated sketch, but would not be part of the ideal sketch. I’d take this kind of help any day!

So how do we get to such a simple thing? The starting point for Gist is the program source code and binary, and a stack trace. So you know where you have a problem, but you don’t know why. With many systems that automatically catch and report exceptions, you can see how easy it would be to automate starting the Gist process…

Gist then performs a static analysis of the program code to figure out what the possible causes might be. If you really want to know, “Gist uses an interprocedural, path-insensitive and flow-sensitive backward slicing algorithm.”

Gist then instructs its runtime (embedded in the operating system) to instrument the program and gather additional traces. Then it sits there and waits. The program continues to execute, and Gist observes failing and successful runs and determines the difference between them. From this analysis, it produces the failure sketches for developers to investigate.

Gist uses Intel Processor Trace hardware support to ensure it has incredibly low overhead and can be left in place to monitor all requests while hunting out a bug, thus ensuring it never misses the failure when it does occur. Using Intel PT also means that the software being diagnosed does not have to be changed in any way.

We evaluated our Gist prototype using 11 failures from 7 different programs including Apache, SQLite, and Memcached. The Gist prototype managed to automatically build failure sketches with an average accuracy of 96% for all the failures while incurring an average performance overhead of 3.74%. On average, Gist incurs 166x less runtime performance overhead than a state-of-the-art record/replay system.

That’s really neat, and I hope it makes its way to a system near you soon. But in the meantime, if you don’t have a modified kernel and Intel PT hardware to hand, here’s a handy technique that can help to reduce a failing test case to the minimum needed to reproduce the problem. The base idea has been around a long time, and it’s called delta debugging (simplifying and isolating failure-inducing input) …

It can take a program input (e.g. an HTML page that causes a browser crash, or a program that causes a compiler crash) and automatically reduce it down to just what is essential to reproduce the bug, without any understanding of the meaning of the input text at all.

Take for example this C program that crashed the gcc compiler:


#define SIZE 20

double mult(double z[], int n)
{
  int i, j ;
  i = 0;
  for (j = 0; j < n; j++) {
    i = i + j + 1;
    z[i] = z[i] * (z[0] + 1.0);
  }
  return z[n];
}

void copy(double to[], double from[], int count)
{
  int n = (count + 7) / 8;
  switch (count % 8) do {
    case 0: *to++ = *from++;
    case 7: *to++ = *from++;
    case 6: *to++ = *from++;
    case 5: *to++ = *from++;
    case 4: *to++ = *from++;
    case 3: *to++ = *from++;
    case 2: *to++ = *from++;
    case 1: *to++ = *from++;
  } while(--n > 0);
  return mult(to,2);
}

int main(int argc, char *argv[])
{
  double x[SIZE], y[SIZE];
  double *px = x;
  
  while (px < x + SIZE)
    *px++ = (px - x) * (SIZE + 1.0);
  return copy(y,x,SIZE);
}

Without understanding anything about the syntax or semantics of C (so for example the minimizer has no understanding about whether or not a change it tries results in a valid C program) this test case was automatically minimized to:


t(double z[],int n){int i,j;for(;;){i=i+j+1;z[i]=z[i]*z[0]+0;}return z[n];}

An 894-line HTML document that crashed Mozilla on printing was reduced down to just “<SELECT>” in the same way. What we’re looking for is a 1-minimal test case: one where removing any single remaining change makes the failure disappear. For the C and HTML examples, the unit of change is a single character. Delta debugging operates a little bit like a greedy search:

If the failing test case only contains 1 unit of change, then it is minimal by definition. Otherwise we can begin with a binary search…

  • Partition the failing input (sequence of changes) into two subsets of similar size, let’s call them Δ1 and Δ2 (reducing with n=2 subsets)
  • Test Δ1 and Δ2
  • If Δ1 reproduces the failure, we can reduce to Δ1 and continue the search in this subset. We can do likewise with Δ2 if that reproduces the failure.

But what if neither Δ1 nor Δ2 reproduces the failure? Perhaps both tests pass, or perhaps they result in input that is not well-formed from the perspective of the program under test and cause the test to be ‘unresolved.’ We have two possible strategies open to us:

  1. We can test larger subsets of the failing input – this will increase the chances that the test fails, but gives us a slower progression to the 1-minimal solution.
  2. We can test smaller subsets of the failing input – this has a lower chance of producing a failing test, but gives us faster progression if it does.

The ddmin algorithm cleverly combines both of these and proceeds as follows (a Python sketch follows the list):

  • Partition the failing input into a larger number of subsets (twice as many as on the previous attempt). Test each small subset Δi, as well as its (large) complement: failing input – Δi. From testing each Δi and its complement we get four possible outcomes:
  • If the test of any Δi reproduces the failure, then we can reduce to this subset and continue with this as our new input (reducing with n=2 subsets).
  • If the test of any Δi complement reproduces the failure, then continue from this complement and reduce it with n-1 subsets.
  • If no test reproduces the failure, and any subset contained more than one change, then try again with finer-grained subsets. Step up to 2n if possible, and if 2n > the number of changes in the input, then just divide into subsets of one change each.
  • If no test reproduces the failure, and all subsets contain only one change, we are done – we have a 1-minimal test input.
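Here’s a compact Python sketch of ddmin (my own simplification of Zeller & Hildebrandt’s algorithm; `test` returns True when a candidate input still reproduces the failure):

# Simplified ddmin (illustrative sketch of delta debugging).

def ddmin(failing_input, test):
    data, n = list(failing_input), 2
    while len(data) >= 2:
        # Split into n roughly equal subsets.
        subsets = [data[len(data) * i // n : len(data) * (i + 1) // n] for i in range(n)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if test(subset):                            # a small subset alone still fails
                data, n, reduced = subset, 2, True
                break
            if len(subsets) > 2 and test(complement):   # its complement still fails
                data, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(data):
                break                                   # already at single-change granularity
            n = min(2 * n, len(data))                   # refine: try smaller subsets
    return data

# e.g. minimizing a crashing HTML document character by character (hypothetical test oracle):
# minimal = ''.join(ddmin(html_text, lambda chars: crashes_browser(''.join(chars))))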

There’s a variation on delta debugging called Hierarchical Delta Debugging that can exploit structure in hierarchical input sources (HTML, XML, JSON, programs, etc.) to find minimal test cases more quickly, and to produce minimized versions that look more natural to developers.

For the C program example we just saw, ddmin requires 680 tests to minimize it, whereas HDD can do so with only 86 tests and produces something much easier to read. HDD works by successively finding 1-minimal sets of nodes at each level of the tree, starting at the top level. An HDD implementation in Python is just 150 lines of code.


mult(double *z, int n) 
{
  int i;
  int j;
  for(;;) {
    i = i + j + 1;
    z[i] = z[i] * (z[0] + 0);
  }
}

DEMi (the Distributed Execution Minimizer – “Minimizing Faulty Executions of Distributed Systems”) takes this idea and extends it to distributed systems, where the vast state space makes it really hard to track down problems.

DEMi starts with a sequence of events known to cause a failure (for example, generated by a fuzzing tool), and then tries to reduce it to the minimum needed to reproduce the failure. There are three stages:

  1. Finding the minimal causal sequence of external events such that removing any one of them no longer triggers the fault. DEMi uses a combination of delta debugging and something called Dynamic Partial Order Reduction which prunes commutative schedules from the search space. A number of heuristics described in the paper help to guide the search.
  2. Minimize internal (in-process) events by searching for smaller schedules that still contain the same external sequence of events, and that still trigger the violation.
  3. Try to minimize the data payloads of the external messages.

The current DEMi implementation is built on top of Akka and AspectJ. It proved very effective at finding and minimizing failure-inducing execution traces in akka-raft and Apache Spark.

Bug                  Found or reproduced?   Total #events (external)   Minimized #events (external)   Time (secs)
raft-45              reproduced             1140 (108)                 37 (8)                         498
raft-46              reproduced             1730 (108)                 61 (8)                         250
raft-56              found                  780 (108)                  23 (8)                         197
raft-58a             found                  2850 (108)                 226 (31)                       43345
raft-58b             found                  1500 (208)                 40 (9)                         42
raft-42              reproduced             1710 (208)                 180 (21)                       10558
raft-66              found                  400 (68)                   77 (15)                        334
spark-2294           reproduced             1000 (30)                  40 (3)                         97
spark-2294-caching   reproduced             700 (3)                    51 (3)                         270
spark-3150           reproduced             600 (20)                   14 (3)                         26
spark-9256           found                  600 (20)                   16 (3)                         15

Why oh why

We just looked at some approaches for narrowing down the causes of failures. DBSherlock is a system that can help to explain the causes of performance anomalies. It’s integrated into the DBSeer open source framework and works with transactional databases, although the approach could easily be applied in other domains too.

The starting point for DBSherlock is a performance anomaly captured by the metrics DBSeer is collecting every second. A user can highlight a region of interest and ask “what caused this?”

DBSherlock then looks at all of the time series data that has been gathered and tries to figure out a set of predicates that best separate the abnormal region from normal regions. For example, an anomaly caused by a network slowdown may be explained by:

network send < 10KB
and network receive < 10KB
and client wait times > 100ms
and cpu usage < 5

DBSherlock also includes causal models (which it learns through interaction with users) so that it may be able to give a high-level diagnosis such as “Rotation of the redo log file.”

DBSherlock starts by discretizing the time series within the ranges of interest, and then labels the partitions as normal or abnormal.

This will be noisy, so a filling and filtering process helps to find a sharper edge separating normal and abnormal regions.

In the first phase all partitions with a non-empty neighbour of the opposite type are marked and then switched to empty.

In the second phase the gaps are filled, with a configurable bias towards marking partitions as normal.

If at the end of this process the average normal and average abnormal values are separated by more than some threshold, a predicate is generated.
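A toy version of that predicate-generation step might look like this (heavily simplified; the threshold and normalisation here are illustrative, not the paper’s exact criterion):

# Toy DBSherlock-style predicate generation for a single metric (illustrative).
import numpy as np

def candidate_predicate(name, values, abnormal_mask, threshold=2.0):
    """values: one metric sampled over time; abnormal_mask marks the user-selected anomaly region."""
    normal, abnormal = values[~abnormal_mask], values[abnormal_mask]
    gap = abs(abnormal.mean() - normal.mean())
    if gap < threshold * normal.std():        # averages not separated enough: no predicate
        return None
    op = '<' if abnormal.mean() < normal.mean() else '>'
    cut = (abnormal.mean() + normal.mean()) / 2
    return f"{name} {op} {cut:.1f}"

rng = np.random.default_rng(1)
cpu = np.concatenate([rng.normal(80, 3, 60), rng.normal(5, 3, 30)])  # anomaly: cpu usage collapses
mask = np.zeros(90, dtype=bool)
mask[60:] = True
print(candidate_predicate("cpu usage", cpu, mask))   # -> something like "cpu usage < 42.5"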

If a DB admin reviews the predicates and after investigation confirms the true cause has a higher level explanation (e.g. the rotation of the redo log file example), then this is added to the system as a causal model and linked with the predicates. For every future diagnosis, DBSherlock calculates the confidence of its causal models and offers them as possible explanations when the confidence is above some configurable threshold.

Our extensive experiments show that our algorithm is highly effective in identifying the correct explanations and is more accurate than the state-of-the-art algorithm. As a much needed tool for coping with the increasing complexity of today’s DBMS, DBSherlock is released as an open-source module in our workload management toolkit.

Here are some examples of the types of anomalies explained by DBSherlock in the experiments:

Closing the loop

We started this talk by discussing the value of bringing dev and ops together. In “Runtime Metric Meets Developer,” Cito et al. have an interesting twist on this that they call ‘feedback-driven development.’

A unifying theme of many ongoing trends in software engineering is a blurring of the boundaries between building and operating software products. In this paper, we explore what we consider to be the logical next step in this succession: integrating runtime monitoring data from production deployments of the software into the tools developers utilize in their daily workflows (i.e., IDEs) to enable tighter feedback loops. We refer to this notion as feedback-driven development (FDD).

They collect operations data and map it onto a source code artefact graph, which is then used to annotate and inform development.

The early examples mostly centre around the display of profiling-style information, but based on operational data rather than a locally run profiler. It’s an interesting idea to watch though, and you could easily imagine information about runtime exceptions and so on showing up too.

Timeless advice

We’re almost out of time, so let me close with one last paper – written by a medical doctor but oh so applicable to IT as well, How Complex Systems Fail. It’s packed with great advice and insights, including this one:

The complexity of complex systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure, they are regarded as a minor factor during operations. Complex systems therefore run in degraded mode as their normal mode of operation.

The path to more robust system performance, says Cook, is designing systems where operators can more easily discern the edge of the envelope between acceptable failures and pending catastrophe, and how their actions move the system performance towards or away from that edge. I hope you go on to check out the paper after this talk, it’s well worth it.

Outro

So that concludes our whirlwind tour through a selection of papers that speak to operational issues: from designing for operability to figuring out what’s going on, narrowing in on root causes, and explaining behaviour.

If you like this kind of thing, you can follow The Morning Paper at https://blog.acolyer.org, or follow me on Twitter (I’m @adriancolyer) where I announce each day’s paper. On the blog you’ll also find an option to subscribe to the mailing list edition if you’d prefer to have the paper write-up delivered straight to your inbox.


Mastering the game of Go with deep neural networks and tree search

September 20, 2016

Mastering the Game of Go with Deep Neural Networks and Tree Search Silver, Huang et al., Nature vol 529, 2016

Pretty much everyone has heard about AlphaGo’s tremendous Go playing success beating the European champion by 5 games to 0. In all the excitement at the time, less was written about how AlphaGo actually worked under the covers. Today’s paper choice is a real treat since it gives us that inside look, and by virtue of being published in Nature, it explains things in a way that makes them accessible to a broader audience than just deep learning specialists.

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence due to its enormous search space and the difficulty of evaluating board positions and moves. We introduce a new approach to computer Go that uses value networks to evaluate board positions and policy networks to select moves.

The scale of the challenge

In theory Go is a game of perfect information: a search tree containing b^d sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length), should tell us the perfect play in each position. Unfortunately, for Go b is around 250 and d is around 150. Exhaustive search is not possible!
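To get a feel for the scale (my arithmetic, not a figure from the paper):

    b^d ≈ 250^150 = 10^(150 × log10 250) ≈ 10^360

which is vastly more sequences than could ever be enumerated (for comparison, the observable universe contains roughly 10^80 atoms).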

Solution overview

There are two basic tactics we can use to make the problem more tractable: reduce the depth of the tree, and/or reduce the breadth of the tree.

The depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function that predicts the outcome from state s.

We can reduce the breadth of the tree by sampling actions from a policy that is a probability distribution over possible moves in a given position.

Monte-Carlo rollouts search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon and Scrabble, and weak amateur level play in Go.

A Monte-Carlo tree search (MCTS) uses Monte-Carlo rollouts to estimate the value of each state in the search tree. The more simulations you do, the more accurate the values become…

The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function. The strongest current Go programs are based on MCTS…
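As a rough illustration of the rollout idea, here is a toy sketch in Python. This is not AlphaGo’s implementation, and the game interface (legal_moves, play, is_terminal, outcome) is an assumption made up for the example:

    import random

    # Toy Monte-Carlo rollout evaluation: play many games to the end using
    # a fast policy and average the outcomes. Illustrative only.

    def rollout_value(game, state, policy, n_rollouts=100):
        total = 0.0
        for _ in range(n_rollouts):
            s = state
            while not game.is_terminal(s):
                moves = game.legal_moves(s)
                weights = [policy(s, m) for m in moves]   # sample from policy p
                s = game.play(s, random.choices(moves, weights=weights)[0])
            total += game.outcome(s)                      # +1 win, -1 loss
        return total / n_rollouts                         # averaged evaluation

MCTS goes further by building a tree over the first few moves and using these rollout evaluations (plus, in AlphaGo’s case, the value network) to decide which branches to explore next.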

AlphaGo uses deep neural networks to learn a value network used to reduce search depth, and a policy network used to reduce search breadth. The board is expressed as a 19×19 image and convolutional layers are used to construct a representation of the position that becomes the input to these networks. AlphaGo combines the policy and value networks with MCTS to construct the final gameplay engine.

The policy network takes a representation of the board, passes it through the convolutional layers, and outputs a probability distribution over legal moves:

The value network also passes the board representation through convolutional layers, and outputs a scalar value predicting the expected outcome.
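As a purely illustrative sketch of what such a pair of networks might look like (using PyTorch for convenience; the layer counts, channel sizes and number of input feature planes below are placeholders, not the architecture reported in the paper):

    import torch
    import torch.nn as nn

    # Illustrative policy and value heads over a 19x19 board representation.
    # Layer counts and channel sizes are placeholders.

    class PolicyNet(nn.Module):
        def __init__(self, in_planes=48, channels=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, 1, 1),            # one logit per board point
            )

        def forward(self, board):                      # board: (N, in_planes, 19, 19)
            logits = self.conv(board).flatten(1)       # (N, 361)
            return torch.softmax(logits, dim=1)        # distribution over moves

    class ValueNet(nn.Module):
        def __init__(self, in_planes=48, channels=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Linear(channels * 19 * 19, 256), nn.ReLU(),
                nn.Linear(256, 1), nn.Tanh(),          # scalar outcome prediction
            )

        def forward(self, board):
            return self.head(self.conv(board))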

The neural networks are trained using a three-stage pipeline:

  1. A supervised learning policy network is trained directly from expert human moves.
  2. A reinforcement learning policy network improves upon the network from stage one by playing games against itself. “This adjusts the policy towards the correct goal of winning games, rather than maximising predictive accuracy.”
  3. A value network is trained to predict the winner of games played by the RL policy network of stage 2 against itself.

Training stage 1: Supervised learning

A 13-layer policy network was trained from 30 million positions. The network alternates between convolutional layers and rectifier non-linearities. A final softmax layer outputs the probability distribution over all legal moves. The network predicted expert moves with an accuracy of 57%, and small improvements in accuracy led to large improvements in playing strength.
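In training terms this stage is a straightforward classification problem: maximise the probability the network assigns to the move the human expert actually played. A minimal sketch, reusing the hypothetical PolicyNet from the earlier sketch:

    import torch
    import torch.nn.functional as F

    # Supervised step: cross-entropy against the expert's move.
    # expert_moves holds the index of the played board point for each example.

    def sl_step(policy_net, optimizer, boards, expert_moves):
        probs = policy_net(boards)                           # (N, 361)
        loss = F.nll_loss(torch.log(probs + 1e-9), expert_moves)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()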

In addition, a faster but less accurate rollout policy network was trained using a linear softmax of small pattern features. It achieved 24.2% accuracy, but was able to select a move in 2μs rather than 3ms for the policy network.

Training stage 2: Reinforcement learning through game play

The RL network is identical in structure to the SL network from stage one, with weights initialised to the same values. Games are played between the policy network and randomly selected previous iterations of the policy network.
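A hedged sketch of how such a self-play update might look, using a REINFORCE-style policy gradient (not necessarily the paper’s exact update rule). Here play_game is a hypothetical helper that returns the probabilities of the moves the network chose, with gradients attached, together with the final outcome z of +1 or -1 from the network’s perspective:

    import random
    import torch

    # Illustrative self-play policy-gradient step.

    def self_play_step(policy_net, opponent_pool, optimizer, play_game):
        opponent = random.choice(opponent_pool)        # a previous iteration
        move_probs, z = play_game(policy_net, opponent)
        # Increase the log-probability of the chosen moves if the game was
        # won (z = +1), decrease it if the game was lost (z = -1).
        loss = -z * sum(torch.log(p) for p in move_probs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()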

When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi, a sophisticated Monte-Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state-of-the-art, based only on supervised learning of convolutional networks, won 11% of games against Pachi and 12% against a slightly weaker program Fuego.

Training stage 3: Reinforcement learning for position evaluation

Ideally we’d like to know the value function under perfect play, but the closest we can get is to estimate it using the RL policy network from stage 2.

We generated a new self-play dataset consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated.

A single evaluation of the trained function approached the accuracy of Monte-Carlo rollouts using the RL policy network, but using 15,000 times less computation.
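The training objective for this stage amounts to regressing the value network’s prediction towards the eventual game outcome. A minimal sketch, again reusing the hypothetical ValueNet from earlier (illustrative only):

    import torch
    import torch.nn.functional as F

    # Regress the value prediction v(s) towards the observed outcome z.
    # positions: batch of board tensors; outcomes: +1 / -1 final results.

    def value_step(value_net, optimizer, positions, outcomes):
        pred = value_net(positions).squeeze(1)        # (N,)
        loss = F.mse_loss(pred, outcomes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()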

Your move!

AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search. Figure 3 in the paper provides a very clear illustration of how this works:


How good?

We know that AlphaGo beat the European Go champion 5-0 using a distributed version of AlphaGo with 40 search threads, 1202 CPUs, and 176 GPUs. A single machine version of AlphaGo is still many dan ranks stronger than any previous Go program – winning 494 out of 495 games against other leading Go programs.

What next?

Go is exemplary in many ways of the difficulties faced by artificial intelligence: a challenging decision-making task; an intractable search space; and an optimal solution so complex it appears infeasible to directly approximate using a policy or value function. The previous major breakthrough in computer Go, the introduction of Monte-Carlo tree search, led to corresponding advances in many other domains: for example general game-playing, classical planning, partially observed planning, scheduling, and constraint satisfaction. By combining tree search with policy and value networks, AlphaGo has finally reached a professional level in Go, providing hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.

Deep neural networks for YouTube recommendations

September 19, 2016

Deep Neural Networks for YouTube Recommendations Covington et al, RecSys ’16

The lovely people at InfoQ have been very kind to The Morning Paper, producing beautiful looking “Quarterly Editions.” Today’s paper choice was first highlighted to me by InfoQ’s very own Charles Humble. In it, Google describe how they overhauled the YouTube recommendation system using deep learning. The description of the system is certainly of interest, and we’ll get to that in a moment. The thing that strikes me most about the paper though is how it represents a changing of the guard as we enter the deep learning era…

In conjunction with other product areas across Google, YouTube has undergone a fundamental paradigm shift towards using deep learning as a general purpose solution for nearly all learning problems… In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning.

The YouTube system is built on top of Google Brain, or as we now know it, TensorFlow. To give an idea of scale, the models learn approximately one billion parameters and are trained on hundreds of billions of examples. The basic problem is posed as “given this user’s YouTube activity history, which videos are they most likely to watch next?” The system is structured in two parts. First, a candidate generation model selects a small number (hundreds) of candidate videos from the overall corpus, and then a ranking model scores these candidates. The highest scoring videos are presented to the user as recommendations.

The two-stage approach to recommendation allows us to make recommendations from a very large corpus (millions) of videos while still being certain that the small number of videos appearing on the device are personalized and engaging for the user.

Dealing with this kind of scale is one of the three key challenges for YouTube. The other two are freshness and noise.

The freshness challenge reflects the fact that YouTube has a very dynamic corpus, with hours of video uploaded every second, and that users prefer newer content:

Recommending recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance. In addition to the first-order effect of simply recommending new videos that users want to watch, there is a critical secondary phenomenon of bootstrapping and propagating viral content.

Since they’re trained on historical data, machine learning systems can exhibit an implicit bias towards the past. To correct for this, the age of a training example is included as a feature during training, and when making predictions it is set to zero. This makes a big difference, as the following graph shows: the blue line shows the model predictions without the age feature, the red line shows them with it, and the green line is the actual distribution.
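The trick itself is trivial to apply; a sketch (feature names made up for illustration):

    # 'Example age' trick: include the age of the training example as a
    # feature during training, then zero it out at serving time so the model
    # predicts as if at the very end of the training window.

    def make_features(video_features, user_features, example_age_days, serving=False):
        age = 0.0 if serving else example_age_days
        return video_features + user_features + [age]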

The noisy data challenge is due to the fact that “we rarely observe the ground truth of user satisfaction and instead model noisy implicit feedback signals.” Watch time is the ultimate arbiter:

Ranking by click-through rate often promotes deceptive videos that the user does not complete (“clickbait”) whereas watch time better captures engagement.

Candidate Generation

Candidate generation is inspired by continuous bag of words language models used in word vectors. A high dimensional embedding is learned for each video, and these are fed into a feed-forward neural network. The input vector combines a user’s watch history, search history, demographic and geographic features.

To compute the ‘watch vector’ subset of the input, a sequence of video IDs is mapped to a sequence of embeddings (video embeddings are learned jointly with all other model parameters) and then these are simply averaged. A similar technique is used to average tokenized search queries via word embeddings.

The input layer is followed by several layers of fully connected Rectified Linear Units (ReLU).

The task of the deep neural network is to learn user embeddings u as a function of the user’s history and context that are useful for discriminating among videos with a softmax classifier.
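Putting those pieces together, here is a rough sketch of what the candidate-generation tower might look like (PyTorch used purely for illustration; all dimensions and layer counts are placeholders, not the paper’s configuration):

    import torch
    import torch.nn as nn

    # Illustrative candidate-generation tower: average the embeddings of the
    # watch and search histories, concatenate with dense user features, pass
    # through ReLU layers to produce the user embedding u, and train with a
    # softmax over videos.

    class CandidateGenerator(nn.Module):
        def __init__(self, n_videos, n_tokens, embed_dim=256, n_dense=16, hidden=512):
            super().__init__()
            self.video_embed = nn.Embedding(n_videos, embed_dim)
            self.token_embed = nn.Embedding(n_tokens, embed_dim)
            self.tower = nn.Sequential(
                nn.Linear(2 * embed_dim + n_dense, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, embed_dim),                 # user embedding u
            )
            self.classifier = nn.Linear(embed_dim, n_videos)  # softmax over videos

        def forward(self, watched_ids, search_token_ids, dense_features):
            watch_vec = self.video_embed(watched_ids).mean(dim=1)
            search_vec = self.token_embed(search_token_ids).mean(dim=1)
            x = torch.cat([watch_vec, search_vec, dense_features], dim=1)
            u = self.tower(x)
            return u, torch.log_softmax(self.classifier(u), dim=1)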

At serving time, choosing the N most likely classes (i.e., videos) to pass to the ranking stage can be reduced to a nearest neighbour search.
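In brute-force form that amounts to scoring every candidate video by the dot product of its embedding with the user embedding and keeping the top N. A sketch (illustrative only; at YouTube scale an approximate nearest-neighbour index would be used rather than scoring every video):

    import torch

    # Brute-force top-N retrieval: score all videos against the user
    # embedding and keep the N highest-scoring ones.

    def top_n_candidates(user_embedding, video_embeddings, n=100):
        scores = video_embeddings @ user_embedding     # (num_videos,)
        return torch.topk(scores, n).indices           # best first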

Best performance was obtained by always predicting a user’s ‘next watch’ given their history so far (rather than holding out a random item in the history and predicting it from the other items). This better matches the natural consumption pattern of videos:

Episodic series are usually watched sequentially and users often discover artists in a genre beginning with the most broadly popular before focusing on smaller niches.

Ranking

The ranking process works with the much smaller set (hundreds) of candidate videos produced by the first phase. It is therefore feasible to take advantage of many more features describing the video and the user’s relation to it.

We use a deep neural network with similar architecture as candidate generation to assign an independent score to each video impression using logistic regression. The list of videos is then sorted by score and returned to the user.

The ranking models use hundreds of features, and significant effort still needs to be expended engineering these:

Despite the promise of deep learning to alleviate the burden of engineering features by hand, the nature of our raw data does not easily lend itself to be input directly into feedforward neural networks. We still expend considerable engineering resources transforming user and video data into useful features. The main challenge is in representing a temporal sequence of user actions and how these actions relate to the video impression being scored.

The most important signals are those describing a user’s interaction with the item itself and other similar items. The overall goal of the ranking model is to predict expected watch time:

Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video. To predict expected watch time we use the technique of weighted logistic regression, which was developed for this purpose.
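As I read it, the weighting works roughly like this (my own simplified rendering, not the paper’s exact formulation): clicked impressions are weighted by the observed watch time, unclicked impressions get unit weight, so the learned odds are pushed towards expected watch time rather than plain click probability.

    import torch
    import torch.nn.functional as F

    # Sketch of weighted logistic regression for watch-time prediction.
    # clicked: 0/1 float labels; watch_time_secs: 0 for unclicked impressions.

    def weighted_lr_loss(logits, clicked, watch_time_secs):
        weights = torch.where(clicked > 0, watch_time_secs,
                              torch.ones_like(watch_time_secs))
        return F.binary_cross_entropy_with_logits(logits, clicked, weight=weights)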

For both candidate generation and ranking, increasing the width and depth of hidden layers improves results. Even though ranking is a classic machine learning problem, the deep learning approach still outperformed previous linear and tree-based methods for watch time prediction. For the full details, see the paper as linked at the top of this post.