Skip to content

Orca: differential bug localization in large-scale services

October 19, 2018

Orca: differential bug localization in large-scale services Bhagwan et al., OSDI’18

Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus of this paper is on the use of Orca with ‘Orion,’ where Orion is a codename given to a ‘large enterprise email and collaboration service that supports several millions of users, run across hundreds of thousands of machines, and serves millions of requests per second.’ We could it ‘Office 365’ perhaps? Like REPT, Orca won a best paper award (meaning MR scooped 2 out of the three awards at OSDI this year!).

Orca is designed to support on-call engineers (OCEs) in quickly figuring out the change (commit) that introduced a bug to a service so that it can be backed out. (Fixes can come later!). That’s a much harder task than it sounds in highly dynamic and fast moving environments. In ‘Orion’ for example there are many developers concurrently committing code. Post review the changes are eligible for inclusion in a build. An administrator periodically creates new builds combining multiple commits. A build is the unit of deployment for the service, and may contain from one to hundreds of commits.

There’s a staged roll-out process where builds move through rings. A ring is just a pre-determined set of machines that all run the same build. Builds are first deployed onto the smallest ring, ring 0, and monitored. When it is considered safe the build will progress to the next ring, and so on until it is finally deployed world wide.

Roughly half of all Orion’s alerts are caused by bugs introduced through commits. An OCE trying to trace an alert back to a commit may need to reason through hundreds of commits across a hierarchy of builds. Orca reduced the OCE workload by 3x when going through this process. No wonder then that it seems to be spreading rapidly within Microsoft.

Example post-deployment bugs

Over a period of eight months, we analyzed various post-deployment bugs and the buggy source-code that caused them. Table 2 (below) outlines a few characteristic issues.

The commits column in the table above shows the number of candidate commits that an OCE had to consider while triaging the issue. Post-deployment bugs fall pre-dominantly into the following categories:

  • Bugs specific to certain environments (aka ‘works for me!’)
  • Bugs due to uncaptured dependencies (e.g. a server-side implementation is modified but the corresponding client change has not been made)
  • Bugs that introduce performance overheads (that only emerge once a large number of users are active)
  • Bugs in the user-interface whereby a UI feature starts misbehaving and customers are complaining.

Studying the bugs and observing the OCEs gave a number of insights that informed Orca’s design:

  • Often the same meaningful terms occur both in the symptom and the cause. E.g. in bug no 1 in the above table the symptom is that the People Suggestion feature has stopped working, and the commit introducing the problem has a variable ‘suggest’.
  • Testing and anomaly detection algorithms don’t always find a bug immediately. A bug can start surfacing in a new build, despite being first introduced in a much older build. For example bug 3 in the table appeared in a build that contained 160 commits, but the root cause was in the previous build with 41 commits.
  • Builds may contain hundreds of commits, so manually attributing bugs can be a long task.
  • There are thousands of probes in the system, and probe failures and detections are continuously logged.

Based on these insights Orca was designed as a custom search engine using the symptom as the query text. A build provenance graph is maintained so that Orca knows the candidate set of commits to work through, and a custom ranking function is used to help rank commits in the search results based on a prediction of risk in the commit. Finally, the information already gathered from the logged probe failures give a rich indication of likely symptom search terms, and these can be used to track the frequency of terms in query and use Inverse Query Frequency ICF (cf. Inverse Document Frequency, IDF) in search ranking.

In addition to being used by OCEs to search symptoms, multiple groups have also integrated Orca with their alerting system to get a list of suspect commits for an alert and include it in the alert itself.

How Orca works

The input query to Orca is a symptom of the bug… The idea is to search for this query through changes in code or configurations that could have caused this bug. Thus the “documents” that the tool searches are properties that changed with a commit, such as the names of files that were added, removed or modified, commit and review comments, modified code and modified configuration parameters. Orca’s output is a ranked list of commits, with the most likely buggy commit displayed first.

The first step is to tokenize the terms from the presenting symptoms, using heuristics specially built for code and log messages (e.g. splitting large strings according to Camel-casing). In addition to single tokens, some n-gram tokens are also created. After tokenization, stop words are filtered out, as well as commonly used or irrelevant terms found through analysing about 8 million alerts in Orion’s log store (e.g., the ‘Exception’ postfix on class names).

Now we need to find the set of commits to include in the search. Orca creates and maintains a build provenance graph. The graph captures dependencies between builds and across rings.

Builds that are considered stable in a given ring can be promoted to the next ring, which gives rise to an inter-ring edge. Interestingly not all fixes roll forward through this process though. A critical bug found in a build in a later ring can be fixed directly in that ring and then the fix is back-ported to earlier rings. Orca uses the graph to trace a path back from the the build in which the symptom is exhibiting, to the origin build in ring 0. The candidate list of commits to search includes all commits in the builds on this path, together with any back-ported commits.

Given the set of commits, Orca uses differential code analysis to prune the search space. ASTs of the old and new versions of the code are compared to discover relevant parts of the source that have been added, removed and modified in a commit. The analysis finds differences in classes, methods, references, conditions, and loops. For all of the changed entities, a heuristic determines what information to include in the delta. For example, if two lines in a function have been changed the diff will include the entire text of the two changed lines (old version and new version) as well as the name of the function. This approach catches higher-level structures that a straight lexical analysis of the diff would miss.

The output of the differential code analysis is search for tokens from the tokenised symptom description, using TF-IQF as a ‘relevance’ score. These scores are first computed on a per-file basis and then aggregated to give an overall score for the commit. In the evaluation, ‘max’ was found to work well as the evaluation function.

We can now return the ranked search results. However, the authors found that very often multiple commits had the same or very similar scores. To break ties between commits with the same scores, a commit risk prediction model is used.

Commit risk prediction

We have built a regression tree-based model that, given a commit, outputs a risk value for it which falls between 0 and 1. this is based on data we have collected for around 93,000 commits made over 2 years. Commits that caused bugs in deployed are labeled ‘risky’ while those that did not, we labeled ‘safe.’ We have put in considerable effort into engineering the features for this task…

This is a super-interesting area in its own right. The paper only includes a brief summary of the main features that inform this model:

  • Developers who are new to the organisation and the code base tend to create more post-deployment bugs. So there are several experience-related features in the model.
  • Files mostly changed by a single developer tend to have fewer bugs than files touched by several developers. So the model includes features capturing whether a commit includes files with many owners or few.
  • Certain code-paths when touched tend to cause more post-deployment bugs than others, so the model includes features capturing this.
  • Features such as file types changed, number of lines changed, and number of reviewer comments capture the complexity of the commit.


Since its deployment for ‘Orion’ in October 2017, Orca has been used to triage 4,400 issues. The authors collected detailed information for 48 of these to help assess the contribution that Orca makes to the process. Recall that an OCE is presented with a symptom and may have to search through hundreds of commits to find the culprit. The following table shows how often Orca highlights the correct commit in a top-n list of results (look at the top line, and the ‘ALL’ columns):

As deployed, Orca presents 10 search results. So the correct commit is automatically highlighted to the OCE as part of this list in 77% of cases. The build provenance graph contributes 8% to the recall accuracy, and the commit risk-based ranking contributes 11%.

The impact on overall OCE workload is significant. The following chart shows the number of expected commits an OCE investigates with and without Orca for the 48 issues in the study.

… using Orca causes a 6.5x reduction in median OCE workload and a 3x (67%) reduction in the OCE’s average workload… For the 4 bugs that were caught only because of the build provenance graph, the OCE had to investigate an average of 59.4 commits without Orca, and only 1.25 commits with it. This is a 47.5x improvement.

A closing thought:

Though we describe Orca in the context of a large service and post-deployment bugs, we believe the techniques we have used also apply generically to many Continuous Integration / Continuous Deployment pipelines. This is based on our experience with multiple services that Orca is operational on within our organization.

REPT: reverse debugging of failures in deployed software

October 17, 2018

REPT: reverse debugging of failures in deployed software Cui et al., OSDI’18

REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix bugs, a developer wants to be able to follow the path leading up to the point of failure. Not just the control flow, but also the data values involved. This is known as reverse debugging. What’s so clever about REPT is that it enables reverse debugging without the high overheads of record/replay systems (typically up to 200%). It combines low overhead hardware tracing (Intel PT) to record a programs control flow, with a novel binary analysis technique to recover (a very good percentage of) data flow information. Evaluated on 16 real-world bugs in software such as Chrome, Apache, PHP, and Python, REPT enabled effective reverse debugging for 14 of them, including 2 concurrency bugs.

REPTs offline binary analysis and reverse debugging is integrated into WinDbg, and the Windows Error Reporting service (WER) is enhanced to support REPT so that developers can request Intel PT enriched memory dumps.

We have received anecdotal stories from Microsoft developers in using REPT to successfully debug failures reported to WER. The very first production bug that was successfully resolved with the help of REPT had been left unfixed for almost two years because developers could not reproduce the crash locally… With the reverse engineering enabled by REPT, the developer was able to step through the function based on the reconstructed execution history and quickly find out the root cause and fix the bug. In summary, a two-year-old bug was fixed in just a few minutes thanks to REPT.

Reverse debugging and Intel PT

To be practical for production deployment REPT needs to:

  1. Impose a very low runtime performance overhead
  2. Recover execution history accurately and efficiently
  3. Work with unmodified source code/binaries
  4. Apply to broad classes of bugs including e.g. concurrency bugs

To address the first requirement REPT uses the per-thread circular buffer mode of Intel PT to capture traces (whole execution tracing introduces too much performance overhead). When a traced process fails, its final state and the recorded Intel PT traces are saved in a single memory dump. The overhead of Intel PT in this mode has been shown to be below 2% for a range of applications.

Starting from this memory dump which gives us the control flow and final memory state the main task of REPTs offline analysis is now to recover the data flow. Working solely from the memory dump avoids the need for any modifications to existing programs. Performing analysis on instructions at the binary level means REPT is agnostic to programming languages and compilers.

Recovery challenges

In an ideal world, given the final memory state and the instruction sequence we would simply walk backwards over the instructions and recover previous states. In practice, things aren’t quite so simple.

  • Many instructions are irreversible (e.g. xor rax, rax) since they destroy information.
  • Many destination addresses for memory writes cannot be determined statically.
  • It’s not always possible to infer the order of concurrent shared memory accesses.

REPT has to overcome all of these hurdles.

Reversing instructions

Some instructions are reversible, so we can move backwards through the instruction sequence and recover state based on them. For example, if we have state { rax=3, rbx= 1} and the instruction add rax, rbx then the state immediately prior to execution of that instruction must have been { rax=2, rbx=1}.

Many instructions are irreversible. But if we have been able to recover some state earlier in the instruction sequence we may be able to work forwards from that. REPT combines these two ideas, working iteratively backward and forward through the instruction sequence to recover data values, until no new values can be recovered.

Consider the following example:

In iteration 1 we start with the state {rax=3, rbx= 0} and work backwards through the instruction sequence, filling in what we can infer (e.g., that rax had value 3 after the execution of I_2) and marking as unknown (‘?’) what we can’t.

Iteration 2 begins at the start of the instruction sequence and works forward. Here we can deduce that rbx must be 1 after the execution of instruction I_1, and hence that it must be one after the execution of instruction I_2 also.

Iteration 3 works backwards again, and this time we’re able to fill in another gap in our knowledge and deduce that rax must be 2 after the execution of I_1, and so must also have been 2 prior to the execution of I_1 as well.

Recovering memory writes

So far we’ve been dealing with instructions that don’t access memory. Things get a little more complex with memory accesses in the picture.

For a memory write instruction whose destination is unknown we cannot correctly update the value for the destination memory. A missing update may introduce an obsolete value, which would negatively impact subsequent analysis.

REPT uses error correction to help figure out write destinations. It uses memory values that are possibly valid (assuming they haven’t been overwritten) to infer other values using the procedure we just examined. If the inferred values later turn out to be invalid based on conflicts then we start an error correction process. Consider the following example, where ‘g’ represents the memory address of a global variable:

In the first iteration we work backwards as before, and since we don’t know the value of rbx for I_4 we don’t change any memory values.

In the second iteration we do the forward analysis and uncover that rax must be 1 after I_2. That generates a conflict when we come to infer the value of rax after instruction I_3. We think it should be 3+1 = 4, but we already have a stored value of 3!

Our analysis keeps the original value of 3 because it was inferred from the final program state which we assume is correct. In the third iteration of the backward analysis, based on rax’s value before and after the instruction I_3, we can recover [g]’s value to be 2.

While performing the forward and backward analysis, REPT maintains a data inference graph capturing everything we know about registers and memory locations accessed by instructions. There are separate nodes in the graph for reading from an address/register (use nodes) and for writing (def nodes). A value edge from node A to node B means that REPT’s uses A’s value to infer B’s. An address edge from node A to node B means that A’s value is used to compute B’s address.

Within the graph, value conflicts occur when an inferred value does not match the existing value (the example we just walked through). An edge conflict occurs when a newly identified def node of a memory location breaks a previously assumed def-use relationship. When either type of conflict is detected the invalidation process is run. Starting from the initial node we propagate ‘unknown’ (see section 3.3.2 in the paper for details) to remove incorrect assumptions, and then we get back to our forward and backward analysis phases again.

Handling concurrency

We have yet another issue when multiple instructions sequences are executed simultaneously on multiple cores, because there are a very large number of ways to order those instructions.

The timing information from Intel PT is pretty good, and we can use that to establish a partial order among instructions sequences (if one subsequence’s start time is after another subsequence’s end time we can infer their relative execution order). That may still leave us with a (smaller) set of concurrent subsequences.

Given multiple instruction sequences executed simultaneously on multiple cores, REPT first divides them into subsequences, then merges them into a single conceptual instruction sequence based on the inferred orders. For two subsequences whose order cannot be inferred, REPT arbitrarily inserts one before the other in the newly constructed sequence.

The arbitrary selection can only matter if there are writes, in which case REPT removes inference edges in the constructed graph to limit the use of the uncertain memory accesses during inference.

Our observation is that REPT’s analysis works as long as there are no two separate concurrent writes such that one affects the inference of another’s destination. We acknowledge that this possibility exists and depends on the granularity of timing information. Given the timestamp granularity supported by modern hardware, we deem this as a rare case in practice.


REPT is evaluated on failures caused by 16 real-world bugs as shown in the table below.

Intel PT was configured at the most fine-grained timestamp logging level, and a circular buffer of 256K bytes per thread. Where applicable, Time Travel Debugging (TTD) was used to capture and establish ground truth of data values.

Starting with experiments on a single thread, REPT is able to correctly infer values with high accuracy (above 90% for tens of thousands of instructions). PHP-2012-2386 is an outlier here as it involves a large number of memory allocations right before the program failure.

REPT is then evaluated on two race condition bugs (Pbzip2 and Python-315330) to test multi-thread accuracy. For Pbzip2 95.33% of values between the defect and the failure are recovered correctly. For Python-315330 only 75.72% of values are recovered. This lower score is attributed to the large number of instructions (over half a million) between the defect and the failure in the Python case.

The runtime overhead of Intel PT is below 2%. The offline analysis running on a single quad-core machine with 16GB RAM completes in under 20 seconds for all 14 bugs that REPT can successfully address.

There are 2 out of the 16 bugs were REPT is unsuccessful. REPT doesn’t work for Chrome-776677 because the collected trace contains an in-place code update for jitted code which fails Intel PT trace parsing. REPT doesn’t work for LibreOffice-88914 because it is a deadlock bug triggering an infinite loop that fills the circular buffer and causes the prior program history to be lost.

In addition to reverse debugging, we believe one can leverage the execution history recovered by REPT to perform automatic root cause analysis. The challenge is that the data recovery of REPT is not perfect, so the research question is how to perform automatic root cause analysis based on the imperfect information provided by REPT.

Capturing and enhancing in situ system observability for failure detection

October 15, 2018

Capturing and enhancing in situ system observability for failure detection Huang et al., OSDI’18

The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is to log and try to recover from a failed call at the client, while a totally separate failure detection infrastructure is responsible for figuring out whether or not things are working as desired. What Panorama does is turn clients into observers and reporters of the components they call, using these observations to determine component health. It works really well!

Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7s, whereas existing approaches detect only one of them in under 300s.

Panaroma is open source and available at

Combating gray failures with multiple observers

Panaroma is primarily design to catch gray failures, in which components and systems offer degraded performance but typically don’t crash-stop. One example of such a failure is a ZooKeeper cluster that could no longer service write requests event though the leader was still actively exchanging heartbeat messages with its followers. Detecting gray failures typically requires observing a component from multiple perspectives.

In this paper, we advocate detecting complex production failures by enhancing observability (a measure of how well component’s internal states can be inferred from their external interactions)… when a component becomes unhealthy, the issue is likely observable through its effects on the execution of some, if not all, other components.

In the ZooKeeper incident example, even though the ZooKeeper heartbeat detectors did not uncover the partial failure, the request time-outs in the client Cassandra process were certainly noticed!

Client processes are well placed to observe issues in providers, but their exception handling code typically looks something like the fragment below, handling an error (or at least logging it!), but not reporting it to any health management system.

Collaborative failure detection is not a new idea, but detection is normally done within peers or layers.

Panorama pushes the detection scope to an extreme by allowing any thread in any process to report evidence, regardless of its role, layer, or subsystem. The resulting diverse sources of evidence enhance the observability of complex failures.

Panorama’s high-level design

Instead of separate monitoring code measuring superficial signals, Panaroma enhances existing client code that lies near the boundaries between components (i.e., inter-thread and inter-process calls). Based on a static analysis, applications are instrumented to capture and report health observations, which are stored in a co-located local observation store (LOS). A local decision engine in the LOS makes judgements about the status of observed components’ status. Then a central verdict server allows easy querying of judgements and arbitrates amongst the decentralised LOSes.

Components using Panorama register with a local Panorama instance and receive a handle to be used for reporting. The Panorama LOS maintains a watch list of subjects being observed by its reporters. Observations about a subject are stored locally in a dedicated table, and also propagated to all LOSes with that subject in their watch list.

The local decision engine analyses the observation log for a subject (which will include both local and remote observations) and records a verdict in the verdict table. The verdict is not propagated because the verdict algorithm is deterministic.

The decision process groups observations about a subject by observer, and then inspects the observations in each group starting with the most recent. For each context (think capture point) in the group a status will be assigned, which is unhealthy if the latest status is unhealthy or the healthy status does not have a recent majority. The summaries across all observers are then aggregated and status decided using a simple majority.

Because these reports come from errors and successes in the execution paths of requestor components instead of artificial, non-service signals, our experience suggests that a simple decision algorithm suffices to detect complex failures.

The status of a component can be HEALTHY, DEAD, and a few levels of UNHEALTHY. It can also be PENDING, which arises when indirection and asynchrony interject between a requestor and provider…

Dealing with indirection and async calls: observability patterns

Say a provider responds immediately to a request, and then goes on to do work in a background thread which crashes. The scheme as discussed so far won’t cope with that scenario. In fact, the simple synchronous request-response pattern is only one of four communication patterns the Panorama team need to deal with:

(a) synchronous request response
(b) indirect request, direct reply
(c) direct request, indirect reply
(d) indirect request and reply

Each of these patterns have a different effect on observability. “Placing observation hooks without considering the effects of indirection can cause incompleteness (though not inaccuracy) in failure detection.


Panorama includes an offline tool that statically analyses a program’s source code (using Soot), finds critical points for observation capture, and injects hooks for reporting observations (using AspectJ – yay!). A typical example of an observation point would be an exception handler invoked when an exception occurs at an observation boundary. We also need positive (i.e. success) observations too of course. Observations are reported to the Panorama library where they are buffered and then sent in one aggregate message.

The indirection patterns are handled by instrumenting both the origin and the sink of split requests.

When we have an observation at an origin, but not yet a corresponding sink, this gives rise to the PENDING status we saw earlier. We the sink receives positive evidence a HEALTHY observation will follow.

We find that the ob-origin and ob-sink separation is useful in detecting not only issues involving indirection but also liveness issues.


Panorama was applied to ZooKeeper, HDFS, HBase, and Cassandra, at both the process and thread level. The integration required minimal changes.

In applying the analyzer to a system, we need annotations about what boundary-crossing methods to start with, what methods involve indirection, and what patterns it uses. The annotation effort to support this is moderate (see Table below). HDFS requires the most annotation effort, which took one author about 1.5 days to understand the HDFS source code, identify the interfaces and write the annotation specification.

The following figure shows the number of raw observations reported over time from two instrumented processes in ZooKeeper.

Detecting crash failures

Panorama is designed to detect complex failures, not just fail-stop errors, but it had better be able to detect fail-stop issues as well! The authors injected a number of fail-stop faults and measured how long it took Panorama to report them compared to the detection time of the relevant system’s built-in fault detection.

Panorama is always competitive with the built-in detection, and sometimes much faster.

Detecting gray failures

For gray failure testing, the authors reproduced 15 real-world production gray failures across all four systems. Each of these caused severe service disruption when they originally occurred, and in each case the system was still perceived as healthy so no recovery actions were taken.

Panorama detects the gray failure in all 15 cases, with detection times between 0.2s and 7s, and most less than 3s.

In addition to detecting the gray failures, Panorama also does a great job of pinpointing the cause of the problem with detailed context and observer information.

False positives?

Because Panorama can gather observations from any component in a system, there is a potential concern that noisy observations will lead to many false alarms. But, empirically, we find that this does not happen….

An observed ZooKeeper ran for 25 hours with multiple clients running a variety of workloads non-stop. In total 797,219 verdicts were generated, of which all but 705 (0.8%) were HEALTHY. All of the negative observations were made during the first 22 seconds of the run when the system was bootstrapping and unstable, the remainder of the 25 hours saw no reported faults.


When a Panorama instance is active, it consumed about 0.7% of CPU on average, and up to around 50MB of memory for a highly active instance. The latency increase and throughput decrease for each instrumented system is below 3%.

The last word

Panorama proposes a new way of building a failure detecting service by constructing in-situ observers. The evaluation results demonstrate the effectiveness of leveraging observability for detecting complex production failures.

Panorama looks to be the perfect complement to any chaos engineering initiative.

Automatic discovery of tactics in spatio-temporal soccer match data

October 12, 2018

Automatic discovery of tactics in spatio-temporal soccer match data Decroos et al., KDD’18

Here’s a fun paper to end the week. Data collection from sporting events is now widespread. This fuels an endless thirst for team and player statistics. In terms of football (which shall refer to the game of soccer throughout this write-up) that leads to metrics such as completed passes, distance covered, intercepts, shots-on-goal, and so on. Decroos et al. want to go one level deeper though, and use the data to uncover team tactics. The state of the art today for tactical analysis still involves watching hours of video footage.

This paper focuses on the problem of detecting tactics from professional soccer matches based on spatio-temporal data.

Now when I think of tactics, a key component in my mind is the team shape and movement of players off the ball. Unfortunately Decroos et al., don’t have the data available to analyse that. So they have to do what they can based on more limited information.

Our dataset consists of event data for the English Premier League for the 2015/2016 season. This event data was manually collected by humans who watch video feeds of the matches through special annotation software. Each time an event happens on the pitch, the human annotates the event with, amongst others, a timestamp, the location, an appropriate type (e.g. foul, pass, or cross) and the players who are involved. Depending on the type of the event, additional information is available, for example, the end location and type of a pass or the outcome of a tackle.

Overall the dataset contains 652,907 events with 39 different event types.

A match is represented as a sequence of such events. Can we do anything useful with those event streams to gain insights into the way a team plays?

The approach taken by the authors has five key steps:

  1. The event stream for a match is divided into a series of phases of play.
  2. The phases are then clustered based on their spatio-temporal component
  3. The clusters are ranked, with input from the analyst as to which features are of most importance to them
  4. Pattern mining is then applied to the clusters to discover frequent sequential patterns
  5. The discovered patterns are ranked

Determining phases of play

A phase is simply a sequence of consecutive events. A new phase starts whenever one of the following two conditions is met:

  1. There is a pause of at least 10 seconds between events, or
  2. Possession switches from one team to the other

Here’s an example phase from a Manchester City game:

And this chart shows the distribution of phase lengths. The follow-on analysis only considers phases with at least three events in them.

Clustering phases

The goal of the second step is to identify similar spatio-temporal phase via clustering. We do this for two reasons. One, this helps reduce the space of possible patterns that we need to search in step four. Two, a team is likely to employ multiple different attacking tactics, such as corners, attacking through the middle, down the flank, each of which will be characterized by different spatial characteristics. Clustering gives us a natural way to divide the data along these lines.

It’s a standard clustering algorithm in which each element starts out in a cluster of one, and then clusters are iteratively merged together until a stop criteria is met. In each iteration the two closest clusters are combined, where the distance between two clusters is taken as the distance between the two elements (one from each cluster) that are furthest away from each other.

The interesting twist is what to use as the distance function between event sequences? We’re interested in similar sequences of events, for which dynamic time warping (DTW) is a standard approach. It’s not a true distance function as it doesn’t satisfy the triangle equality, but it seems to work well enough for this purpose.

Ranking clusters

Typically, the quality of clusters is judged by statistics such as average pairwise distance, maximal pairwise distance and minimal pairwise distance. However, these evaluation functions are less likely to be relevant to a domain expert. (For example,) a soccer coach might be most interested in a cluster with phases that frequently lead to shots and goals.

In the current work, clusters are ranked based on the number of shots they contain.

Mining for patterns

Having identified clusters of similar sequences it’s now time to look for patterns within them.

We employ the CM-SPADE pattern miner, which is a more conventional sequential pattern mining algorithm found in the SPMF toolbox. This pattern miner is more restrictive in the type of learned patterns, but offers better scalability in terms of speed and memory.

Before we can hand a sequence of to the pattern miner we need to convert the sequence of events into a sequence of itemsets (unordered sets). Most pattern miners like to work with discrete data, so the x and y location of an event causes difficulties. The solution is to discretise the location information, which is done in a very coarse manner as follows:

Itemsets are constructed by combining the event type representation and discretised location. That gives human interpretable representations such as following for the Manchester City phase of play we saw earlier:

Ranking patterns

Final ranking of patterns is done by assigning a weight to each event type indicating its relative importance (e.g. shots on goal have a larger weight) and then ranking by considering the types of events in the pattern, the length of the pattern, and the pattern’s support. Shots have weight 2, passes have weight 0.5, and every other event has weight 1.

Does it work?

The evaluation is based on data from all 38 matches of the 2015/16 season. The top 10 clusters for each team are then mined for patterns.

Here’s an example of clusters found for Manchester City.

A closer look at one of these clusters shows a clear attacking pattern starting from the right flank.

Zooming in on three teams, Arsenal, Leicester City, and Manchester United the technique is sufficient to highlight differences in their play. For example, Leicester City generate many more shots from the left flank than the right flank, and very few top ranking sequences that start with a goal kick. They also have a number of clusters where the phase of play starts in the opponent’s half of midfield, and these generate 64 shots between them. “This indicates a direct counter-attacking style with shorter sequences.”

Arsenal’s play involves long sequences of passing the ball around, with lots of action through the midfield:

Manchester City’s style falls somewhere between the two:

The following table compares the top 10 clusters for each team:

The patterns generated by our approach were presented to a company that provides data-driven advice to soccer clubs and soccer associations with respect to player recruitment and opponent analysis. The company has expressed an interest in building a product based on this approach and implementing it in the near future to be included in their services.

Regarding my comments at the start of this write-up on the movement of players on and off the ball being a key part of tactics (think e.g. of the formation of a defensive line, how high up the pitch a team forms that, how quickly they press-up, etc.), the authors have the following to say: “it would be interesting and much more informative to have full optical-tracking data for all players and the ball. However, tackling such a setting would require radically different techniques.” I’m sure such data will eventually be forthcoming, and I look forward to seeing what can be done with it. E.g., could we build a predictive model for the behaviour of each individual player? And could we then combine those in a multi-agent environment and use reinforcement learning to discover effective tactics against them? Now that would be interesting!

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding

October 10, 2018

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding Hundman et al., KDD’18

How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite).

Spacecraft are exceptionally complex and expensive machines with thousands of telemetry channels detailing aspects such as temperature, radiation, power, instrumentation, and computational activities. Monitoring these channels is an important and necessary component of spacecraft operations given their complexity and cost. In an environment where a failure to detect and respond to potential hazards could result in the full or partial loss of spacecraft, anomaly detection is a critical tool to alert operations engineers of unexpected behavior.

Anomaly detection systems deployed today typically consist of tiered alarms indicating when values fall outside of pre-defined limits. There are also limited instances of expert systems and nearest-neighbour based approaches being tried, but their limitations prevented widespread adoption. A more accurate and scalable approach to anomaly detection that makes better use of limited engineering resources is required.

Any such system needs to work with data that is highly context dependent and often heterogeneous, noisy, and high-dimensional. Any anomalies reported should come with a degree of interpretability to aid diagnosis, and a balance must be struck between false positives and false negatives. This paper focuses on reported anomalies and telemetry data for the Soil Moisture Active Passive (SMAP) satellite and the Mars Science Laboratory (MSL) rover, Curiosity.

An open source implementation of the the methods described in this paper, together with incident data and telemetry, is available at

LSTMs to the rescue?

In the terminology of this paper there are three categories of anomaly of interest:

  • Point anomalies are single values that fall within low-density regions of values
  • Collective anomalies are sequences of values that are anomalous, whereas any single value in the sequence by itself might not be, and
  • Contextual anomalies are single values that do not fall within low-density regions of values overall (i.e., they’re not ‘out of limits’), but they are still anomalous with regards to the sequence of local values they are embedded in.

Compared to previous generations of automated anomaly detection, recent advances in deep learning, compute capacity, and neural network architectures hold promise of a performance breakthrough:

LSTMs and related RNNs represent a significant leap forward in efficiently processing and prioritizing historical information valuable for future prediction…. The inherent properties of LSTMs makes them an ideal candidate for anomaly detection tasks involving time-series, non-linear numeric streams of data. LSTMs are capable of learning the relationship between past data values and current data values and representing that relationship in the form of learned weights.

Other advantages of LSTMs for anomaly detection include:

  • they can handle multivariate time-series data without the need for dimensionality reduction
  • they don’t require domain knowledge of the specific application, allowing for generalisation
  • they can model complex non-linear feature interactions
  • there is no need to specify time windows for shared data values

NASA’s LSTM-based system has three parts:

  1. LSTM-based value prediction, learned in an unsupervised fashion from normal command and telemetry sequences.
  2. An unsupervised thresholding method that looks at the differences between the value predictions and the actual values to determine whether prediction errors are indicative of true anomalies
  3. A set of filters to further mitigate false positives

Value prediction

The LSTM models used for value prediction use input sequences of length 250 and are configured as follows:

A single model is created for each telemetry channel and used to predict values for that channel only. This helps with traceability when investigating reported anomalies (versus predicting m values from a single model) and avoids LSTM problems with accurately predicting m-dimensional outputs for large m.

The time series inputs comprise the telemetry values together with a one-hot encoding of the commands sent to the spacecraft.

Each model is tasked with predicting only the next value (i.e., an output sequence of length 1).

Dynamic error thresholds

Taking the delta between the predicted value and the actual next value in the telemetry stream gives us a stream of prediction errors. A window of h historical error values is then used to smooth these errors using an exponentially-weighted moving average:

The set of errors are smoothed to dampen spikes in errors that frequently occur with LSTM-based predictions — abrupt changes in values are often not perfectly predicted and result in sharp spikes in error values even when this behavior is normal.

Given the smoothed values, we can now apply a simple threshold test to detect candidate anomalies. But how should we set the threshold?

Let the sequence of smoothed error values be \mathbf{e}_s. The threshold \epsilon is selected from the set

\displaystyle \mathbf{\epsilon} = \mu (\mathbf{e}_s) + z \sigma (\mathbf{e}_s)

using the argmax function, where z is drawn from an ordered set of positive values representing the number of standard deviations above \mu (\mathbf{e}). The evaluation used { 2.5, 3.0, 3.5, … 10.0 }.

Once argmax(\mathbf{\epsilon}) has been determined from this set, it is used to compute an anomaly score indicating the severity of the anomaly.

In simple terms, a threshold is found that, if all values above are removed, would cause the greatest percent decrease in the mean and standard deviation of the smoothed errors. The function also penalises for having larger numbers of anomalous values and sequences to prevent overly greedy behavior. Then the highest smoothed error in each sequence of anomalous errors is given a normalized score based on its distance from the chosen threshold.

Mitigating false positives

Since false positives are expensive a further pruning stage is now applied. We map max over each anomaly-containing smoothed window of prediction errors, and finally add the maximum smoothed error that isn’t anomalous to the resulting set, \mathbf{e}_{max}. Now \mathbf{e}_{max} is sorted in descending order. From the sorted sequence, the percentage decrease d^{(i)} at each step i is calculated. E.g., if we have a max sequence (10, 8, 6, 3, 1) then the first percentage decrease will be (10-8)/10 = 20%.

A threshold p is now chosen to represent the minimum percent different between subsequent anomalies (p = 0.13 in the evaluation). When a decrease d^{(i)} exceeds the threshold p, all preceding sequences remain flagged as anomalous. If a decrease does not exceed p, and nor do any subsequent decreases, then the following smoothed error sequences are reclassified as nominal.

With sufficient historical data, it is also possible to set a minimum anomaly score threshold such that any anomaly scoring below this is reclassified as nominal.


The evaluation is done base on telemetry data from SMAP and MARS, coupled with information on anomalies from Incident, Surprise, Anomaly (ISA) reports. For each unique stream of data containing one or more anomalous sequences a 5 day window of data is constructed around the primary anomaly. This window of data becomes part of the test data set. A further 2 day window immediately prior to the start of the test window is used for training.

Telemetry values are aggregated into one minute windows and evaluated in batches of 70 minutes, which matches the downlink schedule for SMAP. The size of the smoothing window, h , is set at 2100. Here are the results for each spacecraft:

The LSTM models achieved an average normalized absolute error of 5.9% predicting telemetry values one time step ahead…. The high proportion of contextual anomalies (41%) provides further justification for the use of LSTMs and prediction-based methods over methods that ignore temporal information. Only a small subset of the contextual anomalies — those where anomalous telemetry values happen to fall in low-density regions — could theoretically be detected using limit-based or density-based approaches.

MSL performs a much wider set of behaviours than SMAP, with varying regularity, which explains the lower precision and recall for MSL ISAs.


The methods presented in this paper have been implemented into a system that is currently being piloted by SMAP operations engineers. Over 700 channels are being monitored in near real-time as data is downloaded from the spacecraft and models are trained offline every three days with early stopping. We have successfully identified several confirmed anomalies since the initial deployment in October 2017.

The major issue being worked on now is to reduce the number of false positives (the bane of many an alerting system!): “Investigation of even a couple of false positives can deter users and therefore achieving high precision with over a million telemetry values being processed per day is essential for adoption.”

Also in future work is investigating the interactions and dependencies between telemetry channels, giving a view onto the correlation of anomalies across channels (which feels like it also ought to help reduce the false positive rate as well).

Online parameter selection for web-based ranking problems

October 8, 2018

Online parameter selection for web-based ranking problems Agarwal et al., KDD’18

Last week we looked at production systems from Facebook, Airbnb, and Snap Inc., today it’s the turned of LinkedIn. This paper describes the system and model that LinkedIn use to determine the items to be shown in a user’s feed:

It replaces previous hand-tuning of the feature mix and ranking:

In the past, a developer spent days to accurately estimate [the feature weights] with desired behavior with every significant drift and intervention. Using the approach described in this paper, we have completely eliminated the laborious manual tuning, significantly increased developer productivity as well as obtained better Pareto optimal solutions.

Here are representative results comparing the impact on three key metrics for a setting found by the new algorithm over a period of 36 hours, vs the best setting that could be found in 7 days of hand-tuning:

Viral Actions is the most important metric for the LinkedIn feed, ‘with far reaching consequences from more subsequent sessions to more revenue.’ The algorithm is tasked with maximising this metric while not allowing others to drop by more than 1%. (Increases in the other metrics as well are preferred of course).

The overall method is general enough to be applied in settings outside of LinkedIn too.

The LinkedIn feed

The LinkedIn feed includes a variety of content types such as professional updates shared by a member’s connections, native ads posted by advertisers, job recommendations, professional news, and new connection and course recommendations.

The central question here is what items should be shown in the feed of a given user, and in what order.

As each content type usually is associated with a business metric to optimize for, an efficient feed ranking model needs to consider the tradeoff amongst the various metrics…

The three main metrics detailed in the paper are:

  • Viral Actions (VA): the fraction of feed sessions where a user has liked, commented on, or shared a feed item.
  • Engaged Feed Sessions (EFS): the fraction of the total number of sessions where a user was active on the feed.
  • Job Applies (JA): the fraction of sessions where the user applied for a recommended job on the feed.

The ‘feed ecosystem’ (i.e., all the systems providing candidate feed items) is very dynamic. A Feed Mixer first calls each of the downstream feed providing services to retrieve the top-ranked suggestions per feed-type, and then it combines and re-ranks the suggestions across content types to give the final feed.

Each downstream service also provides probability measures for each suggested item, indicating the probability that the member will take the desired action when shown the item. “The estimates of these probability measures are from independent response prediction models and available to the Feed-Mixer in real-time at the time of scoring.”

Multi-objective optimisation

We have multiple different objectives to optimise for (VA, EFS, JA, and so on), which puts us in the world of Multi-Objective Optimization (MOO).

A reasonable approach for balancing multiple objectives is to rank items using weighted linear combinations of the predicted objective utilities.

That’s about as simple as it gets. So we can compute the score of an eligible item with a linear combination like this:

Where P_{metric} is the probability of the member m taking the desired action when shown the item u. A weight vector \mathbf{x} = (x_{EFS}, x_{JA}) controls the balance of the business metrics VA, EFS, and JA. So far so good, there’s just one small issue remaining:

Finding the weights (often called tuning or control parameters) is challenging. Practitioners often rely on running multiple A/B experiments with various values of the control parameters and select the one that works best based on the observed values of the metrics in the experiments. This is a manual and time-consuming process which often needs to be repeated when there is a change in the environment.

We’d like an automated way of finding a good set of control parameters instead.

Thompson sampling

One metric is selected as the primary (VA) in this case, and the other metrics are secondary. The objective therefore is to maximise utility based on the primary metric under the constraint that the utility in the secondary metrics stays above a pre-specified threshold (no more than a 1% drop in the LI case).

With a little massaging the linear combination equation is redefined using Lagrangian weights:

\sigma_{\epsilon}(z) is a smoothing function which evaluates close to 1 for positive z, and close to 0 for negative z. \lambda_i are weights for each of the i candidate items. So what’s going on here is a function that scores a candidate item by summing the \lambda_is for the constraints it satisfies, and adding the utility score in the primary metric.

The trick is to let the value of \lambda_i help us identify how many constraints are satisfied or violated.

This is achieved by setting \lambda_i such that the range the resulting score ends up telling us how many constraints were violated (see section 3 in the paper for details). Doing this we end up with a familiar global optimisation of an unknown function problem.

Whether a member takes the desired (per metric) action in a session when their feed is served with parameter \mathbf{x} is modelled as a random variable. After observing the aggregated data for a session this enables us to estimate the respective utilities.

The optimisation problem is solved through a modified version of Thompson sampling. The Thompson Sampling approach samples a new data point \mathbf{x}_t at iteration t from the posterior of the maximiser of f (F_t) given the data we have at stage t-1. The process stops once the variance of the distribution of F_t becomes small, and mode(F_t) is returned as the estimate of the global maximiser. To prevent convergence to a local optimum, the LinkedIn system also explores the entire region \mathcal{X} uniformly (as opposed to just sampling from the posterior) at every iteration t with some probability \epsilon > 0.

The basic algorithm is shown below:

The production system also includes an enhancement to this algorithm as follows: the system searches over a set of grid of values in the entire input space (Sobol points); once it has converged to a small enough region the search space is reduced and more points are generated within it (zooming). The algorithm is stopped when the objective value does not change further through the zooming mechanism.

Figure 4, below, shows a nice example of the algorithm in action.


  • (a) the algorithm is initialised by performing quasi-random sampling for the first 10 iterations
  • (b) within just 5 more iterations, the algorithm is starting to converge
  • (c) but wait! Stopping early at point (b) would be a mistake as with just one more random sampling step the distribution starts to shift and soon we escape the local optimum and a new peak emerges.
  • (d) now the initial two peaks have disappeared and we have just a single peak
  • (e) by iteration 30 the distribution has become very stable

This experiment shows the importance of choosing the \epsilon -greedy approach; without this adjustment to Thompson Sampling, there is a high probability of getting stuck at a local maximum instead of converging to the true global maximum.

Production system

We have successfully implemented the algorithm in practice and it is deployed in the engineering stack for LinkedIn Feed. We achieved substantial metric gains from this technique and drastically improved developer productivity.

The overall architecture includes online and offline components as shown below.

The offline component runs once per hour, calculating the metrics (utilities) based on accumulated tracking data. The metric observations are then used to update the posterior distribution of the maximiser. The resulting distribution is stored in a Voldemort-based KVS.

When a member visits the LinkedIn feed, the online parameter sampling module returns a sample (i.e., a set of weights to configure the feed) from the stored distribution. The sample is based on an MD5 of the user’s memberId, ensuring consistent samples for the same parameter distribution for the same user (so that the feed doesn’t jump around with page refreshes etc.). The sampled parameter value is fed into the MOO models to rank the items received from the feed services, and the resulting ranked list is sent to the UI.

I know you’ll be back: interpretable new user clustering and churn prediction on a mobile social application

October 5, 2018

I know you’ll be back: interpretable new user clustering and churn prediction on a mobile social application Yang et al., KDD’18

Churn rates (how fast users abandon your app / service) are really important in modelling a business. If the churn rate is too high, it’s hard to maintain growth. Since acquiring new customers is also typically much more expensive than expanding in existing accounts, churn hits you twice over. So it’s really important to understand what causes users to churn in your business, and ideally to be able to predict users likely to churn so that you can intervene. This paper describes ClusChurn, the churn prediction system deployed at Snapchat.

It has been argued for decades that acquiring new users is often more expensive than keeping old ones. Surprisingly, however, user retention and its core component, churn prediction, have received much less attention from the research community… While this paper focuses on the Snapchat data as a comprehensive example, the techniques developed here can be readily leveraged for other online platforms, where users interact with the platform functions as well as other users.

The central idea in ClusChurn is to cluster users into clusters that are meaningful from a business perspective, and then use the assignment of a given user to a cluster to help predict churn. The challenge is that we want to tell quickly whether a new user is likely to churn or not, so we can’t hang around for loads of data to see what cluster they belong to first. Thus Snap jointly learn both the user type (cluster) and user churn likelihood. A prototype implementation of ClusChurn based on PyTorch is available on GitHub.

Understanding Snapchat users – automated discovery of user types

The journey begins with an analysis of new user behaviour on Snapchat to see if meaningful user types (clusters) exist in the data. This information will inform the churn prediction algorithm developed in the next step. The analysis is done on two weeks of anonymous user data from a selected country, chosen because it provides a relatively small and complete network. There are 0.5M new users in the dataset (adding 250K users a week therefore!), and about 40M users totals with approximately 700M links.

The clustering is based on two sets of features: user features associated with their daily activities (e.g. the number of chats they send and receive); and ego-network features describing the structure of the friend network centred on the user. The user features are summarised in the table below. There are two ego-network features, user degree (how many friends a user has), and friend density. The density is calculated as the number of actual links between a user’s friends divided by the number of possible links. The ten user features plus the two network features gives a total of 12 features per day. Each of these 12 features is captured every day for two weeks for each user, given a 12-dimensional time series of length 14.

The daily activity metrics are all over the place (e.g, see below for a plot of ‘chat received’ counts). So to make things more manageable two parameters are derived: \mu is the mean of the daily measures, capturing activity volume, and l is the lag(1) of the daily measures capturing burstiness (“both measures are commonly used in time series analysis“).

Looking at the aggregate measures (e.g. for chats received, below) we see curves with different steepness and inflection points.

These can be modelled by a sigmoid function with parameters q and \phi capturing the shape of the curve.

For each of the 12 attributes, we can therefore compute the four parameters (\mu, l, q, \phi) giving us a 12×4 feature matrix.

Within the Snapchat network for the country, the authors discovered a smaller group of well-connected popular users, the core of the network. When new users make friends, many of their new friends are within the core. The hunch is that the ways new users connect to the core should be significant for user clustering and churn prediction.

With all this information in hand, we’re looking for a clustering framework that can provide insight into user types such that the type information can be readily turned into actionable items to facilitate downstream applications such as fast-response and targeted user retention.

Clustering is a three-step process:

  1. For each feature (e.g. chats received) k-means clustering with Silhouette analysis is used to create K clusters, where K is chosen by the algorithm. For each user we record which cluster they belong to for each feature. This process helps to find meaningful types of user with respect to each feature individually (e.g. users who constantly receive lots of chat messages).
  2. For each user, we derive a new feature combination vector where each element is the cluster centre of the cluster the user belongs to for that feature. (E.g. if the user is in cluster 1 for the first feature, with cluster centre c_1, and in cluster 3 for the second feature, with cluster centre c_3, then the first two elements of the feature combination vector will be c_1, c_3). The purpose of this step is to reduce the influence of noise and outliers.
  3. Now we apply k-means clustering with Silhouette analysis again, but this time on the feature combination vectors. “The multi-feature clustering results are the typical combinations of single-dimensional clusters, which are inherently interpretable.

Looking at the ego-network features in particular, three types of users emerge: type 1 users have relatively large network sizes and high densities; type 2 users have relatively small network sizes and low densities; and type 2 users have minimal values on both measures:

The type of network a user has is strongly correlated with their position in the whole social network. Type 1 users are mostly ‘tendrils’ with about 58% of direct friends in the core, type 2 users are primarily outsiders with about 20% of direct friends in the core, and type 3 users are mostly disconnected with almost no friends in the core.

Combining new users’ network properties with their daily activities, we finally come up with six cohorts of user types, which is also automatically discovered by our algorithm without any prior knowledge. Looking into the user clusters, we find their different combinations of features quite meaningful, regarding both users’ daily activities and ego-network structures. Subsequently, we are able to give the user types intuitive names, which are shown in Table 2.

If you look at the churn rates for different user types you can see that user type provides a lot of valuable insights. For example, the main difference between All-star users and Chatters is their activity on stories and lens; being active in these functions indicates a much lower churn rate.

(Yes, I think we did just rediscover that users with higher engagement churn less! But, with one level deeper insight telling us what kinds of engagement matter most…).

Predicting churn in new users

We now know that the cluster (user type) of new users is highly indicative of their likely churn. Unfortunately, analysis of real data shows that new users are most likely to churn at the very beginning of their journey. I.e., before we’ve had a chance to accumulate lots of data to place them in a cluster.

The goal is to accurately predict the likelihood of churn by looking at users’ very initial behaviors, while also providing insight into possible reasons behind their churn… However, as we only get access to the initial several days instead of the whole two-week behaviors, user types are also unknown and should be jointly inferred with user churn.

The churn and user-type inference model is developed via experiments done on an anonymous internal dataset with 37M users and 697M bi-directional links. Logistic regression and random forest churn prediction models are used as the baseline for comparisons. The resulting model is developed in three stages:

  1. The foundation is an LSTM sequence-to-sequence model applied to the multi-dimensional user and ego-network feature vectors. (The 12×14 vectors described at the top of this piece). A linear projection using the sigmoid function is connected to the output of the last LSTM layer to produce a user churn prediction.
  2. To deal with sparse, skewed, and correlated activity data, an activity embedding layer is added in front of the LSTM layers. Empirically, a single fully connected layer proved to be sufficient.
  3. During training, we do know the real user types as computed by the three-step clustering process. Given K user types, the next twist is to train K sub-LSTMs, one for each user type.

We parallelize the K sub-LSTMs and merge them through attention to jointly infer hidden user types and user churn.

A positive attention weight is assigned to each user type to indicate the probability that the user is of a particular type. This weight is computed as a similarity of the corresponding typed output sequence (from the sub-LSTM) and a global unique typing vector jointly learned during the training process.

The following figure shows how the churn prediction accuracies improves (and compares to the baseline) as we add in these various features.


Deployment at Snap

ClusChurn is in production deployment at Snap Inc., where it delivers real-time analysis and prediction results to multiple systems including user modelling, growth, retention, and so on. In future work the authors hope to understand the connections between the different user types, as well as modelling user evolution patterns (how they change type over time) and how such evolution affects their activity and churn rate.