Delayed impact of fair machine learning Liu et al., ICML’18

“Delayed impact of fair machine learning” won a best paper award at ICML this year. It’s not an easy read (at least it wasn’t for me), but fortunately it’s possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn’t. Most recently we looked at the equal opportunity and equalized odds models.

The underlying assumption of course for studied fairness models is that the fairness criteria promote the long-term well-being of those groups they aim to protect. The big result in this paper is that you can easily up end ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to consider not just (for example) the opportunity that someone in a disadvantaged group has to qualify for a loan, but also what happens in the future as a result of that loan being made. If the borrower eventually defaults, then they will also see a decline in their credit score, which will make it harder for the borrower to obtain additional loans in the future. A successful lending event on the other hand may increase the credit score for the borrower.

Consequential decisions… reshape the population over time. Lending practices, for example, can shift the distribution of debt and wealth in the population. Job advertisements allocate opportunity. School admissions shape the level of education in a community.

If we want to manage outcomes in the presence of a feedback loop, my mind immediately jumps to control theory. Liu et al. don’t go all the way there in this paper, but they do introduce “a one-step feedback model of decision making that exposes how decisions change the underlying population over time.

We argue that without a careful model of delayed outcomes, we cannot foresee the impact a fairness criterion would have if enforced as a constraint on a classification system. However, if such an accurate outcome model is available, we show that there are more direct ways to optimize for positive outcomes than via existing fairness criteria.

### Fairness criteria

Three main selection policies are used as the basis of the study, and to keep things simple we’ll consider a setting in which the population is divided into two groups, A and B. Group A is considered to be the disadvantaged group (“A” for Advantaged would have made a lot more sense!). An institution makes a binary decision for each individual in each group, called selection (e.g., should this individual be offered a loan).

• In the unconstrained or max profit model a lender makes loans so long as they are profitable, without any other consideration.
• In the demographic parity model the institution maximises its utility subject to the constraint that the institution selects from both groups at an equal rate.
• In the equal opportunity model utility is maximised subject to the constraint that individuals with a successful outcome (e.g. repay their loan) are selected at an equal rate across both groups.

See ‘Equality of opportunity in supervised learning’ for a discussion of the equal opportunity model and why it is preferred to demographic parity.

### The outcome curve

The authors study what happens to the average (mean) score of a group under different selection policies. A policy is said to cause active harm if it causes the average score to go down, stagnation if the average score does not change, and improvement if the average score goes up. It’s also useful to compare a policy to some baseline, for which we’ll use the maximum profit / utility model. Compared to this baseline, a policy does relative harm if the average score under the policy is less than it would be under the maximum utility model, and relative improvement if the average score under the policy is greater than it would be under the maximum utility model.

Given a policy, there is some threshold score above which members of a group will be selected. We can think of the threshold as determining a selection rate, the proportion of a group selected by the policy. As the selection rate varies, so does the impact on the average score over time. We can plot this in an outcome curve.

The outcome curve shows three regions of interest: relative improvement, relative harm, and active harm. If the selection rate for a group under the maximum utility model is $\beta^{MaxUtil}$, then reducing the selection rate below this does relative harm to the group – i.e., the average score goes down. Intuitively, there are less positive outcomes for the group to drive a positive feedback loop (e.g. people receiving loans and paying them back).

If we increase the selection rate above the maximum utility (from the perspective of the institution remember) point, then to start with the group sees a relative improvement in their average scores. This holds so long as the additionally selected individuals have more positive outcomes than negative. The selection rate at which relative improvement is maximised is denoted by $\beta^{*}$.

If we push the selection rate too high though, then negative outcomes (loan defaults) rise and the average score improves less. If this improvement falls below the maximum utility point, we’re back in a region of relative harm. The rate above which relative improvement switches back to relative harm is denoted by $\bar{\beta}$.

Push the selection rate even higher still (beyond $\beta_0$, and the impact of the additional negative outcomes can mean that not only does the group see relative harm, but also active harm, i.e., they are worse off in absolute terms.

### Killing with kindness?

Under analysis, both demographic parity and equal opportunity fairness criteria can lead to all possible outcomes (improvement, stagnation, and decline). Interestingly, “under a mild assumption, the institution’s optimal unconstrained selected policy can never lead to decline“. So it’s therefore possible to have a fairness intervention with the unintended consequence of leaving the disadvantaged group worse off than they were before. There are also some settings where demographic parity causes decline, but equal opportunity does not. Furthermore, there are some settings where equal opportunity causes relative harm, but demographic parity does not.

If we assume that under each fairness criteria the institution otherwise maximises utility, then we can determine the selection rates under each policy from utility curves.

Whether the policies show relative harm, relative improvement, or active harm, depends on where the selection rates $\beta^{EqOpt}$ and $\beta^{DemParity}$ fall with respect to $\bar{\beta}$ and $\beta_0$.

### Outcome based fairness

…fairness criteria may actively harm disadvantaged groups. It is thus natural to consider a modified decision rule which involves the explicit maximization of [the average score of the disadvantaged group]… This objective shifts the focus to outcome modeling, highlighting the importance of domain specific knowledge.

For example, an institution may decide to set a selection rate for the disadvantaged group that maximises the improvement in their average score, subject to a constraint that the utility loss compared to the baseline is less than some threshold.

### A FICO example

The authors look at the impact of fairness constraints on FICO scores (credit worthiness) using a sample of 301,536 scores from 2003. Individuals were labeled as defaulted if they failed to pay a debt for at least 90 days on at least one account in the ensuing 18-24 month period. With a loss/profit ratio of -10, no fairness criteria surpass the active harm rate. But at a loss/profit ratio of -4, demographic parity overloans and causes active harm.

This behavior stems from a discrepancy in the outcome and profit curves for each population. While incentives for the bank and positive results for individuals are somewhat aligned for the majority group, under fairness constraints, they are more heavily misaligned in the minority group, as seen in the graphs (left) above.

Bounding data races in space and time Dolan et al., PLDI’18

Yesterday we looked at the case for memory models supporting local data-race-freedom (local DRF). In today’s post we’ll push deeper into the paper and look at a memory model which does just that.

Consider a memory store $S$ which maps locations to values. For non-atomic locations it’s not enough just to store a single memory value at the location. Instead non-atomic locations are mapped to histories H, where a history is a finite map from timestamps to values, and timestamps are totally ordered.

Every thread has a frontier, F, which is a map from non-atomic locations to timestamps.

Intuitively, each thread’s frontier records, for each nonatomic location, the latest write known to the thread. More recent writes may have occurred, but are not guaranteed to be visible.

For atomic locations, the memory store records a pair (F,x), with a single data value x rather than a history. F is a frontier, which will be merged with the frontiers of threads that operate on the location. This is expressed in the rules READ-AT and WRITE-AT in the operational semantics below. The READ-AT rule, for example, should be read as “when a thread with frontier F reads a memory location A with content (Frontier-A, x), then the memory location is unchanged and the new frontier of the thread is F union Frontier-A”.

(Enlarge)

The rules for reading and writing non-atomic locations are given by READ-NA and WRITE-NA. A thread can read an arbitrary element of the history of a non-atomic location, so long as the timestamp is greater than or equal to the timestamp in the thread’s frontier. A write to a non-atomic location by a thread must specify a timestamp later than that in the writing thread’s frontier.

Note a subtlety here: the timestamp need not be later than everything else in the history, merely later than any other write known to the writing thread.

The reduction rules for the small-step operational semantics operate on machine configurations consisting of a store S and a program P. A program is a finite set of threads, each with a frontier and an expression. Silent transitions are those that reduce expressions without involving memory operations. For transitions that involve memory (rule MEMORY), then the memory operation involved determines the thread’s new frontier and the new contents of the memory location.

A trace is a sequence of machine transitions starting from an initial state in which all locations have a timestamp 0. A trace is sequentially consistent if it contains no weak transitions. A weak transition is a machine step performing a memory operation at a non-atomic location where one of the following holds:

• the thread does not read the latest write to the location (reads a value with timestamp greater than its frontier, but not the very latest)
• the thread writes a value to the location with a timestamp greater than the thread’s frontier, but not greater than the largest timestamp in the location history (i.e., it is not the latest write to that location)

Two transitions conflict if they access the same atomic location and at least one is a write. If there is no happens-before relation between the two transitions then we have a data race. A happens-before relation can take one of two forms: either the transitions happen on the same thread, or the location involved is atomic and at least one of the transitions is a write.

If instead of considering the whole memory store we focus in on a subset of locations L, then a transition is L-sequential if it is not a weak transition, or if it is a weak transition but not on a location in L.

A machine is L-stable if all of the L-sequential transitions in its traces are data race free.

Intuitively, M is L-stable if there are no data races on L in progress when the program reaches state M. There may be data races before reaching M, and there may be data races after reaching M, but there are no data races between one operation before M and one operation afterwards.

From an L-stable machine M, the local DRF theorem tells us that we can reason locally over a set of transitions starting from M. If all of those transitions are L-sequential then the program is data race free through those transitions.

### Mapping to hardware

To facilitate mapping the operational model defined above to hardware (x86 and ARMv8) the authors also define an axiomatic semantics in terms of read and write events. Events are generated during program execution and form an event graph G. Three binary relations are defined over events:

• program order (po): E1 po E2 if E1 and E2 occur on the same thread and E1 is ordered before E2 on that thread.
• reads from (rf): E2 rf E1 if E1 is a write event and E2 is a read event, and both events have the same location and value
• coherance (co): E1 co E2 if E1 and E2 are write events to the same location

In addition, a from-reads (fr) relation is defined in terms of rf and co. E1 fr E2 if E1 reads a value which is later overwritten by E2.

The axiomatic semantics are mapped to the operational semantics to show that they define the same memory model.

We now show that our memory model can be compiled efficiently to the x86 and ARMv8 memory models, and define the optimisations that are valid for a compiler to do… In our memory model, not all of the po relation is relevant, which is an important property for both compiler and hardware optimisations. For instance, there is no constraint that two nonatomic reads must be done in program order.

There are some constraints on program (re-)order(ing) though:

• operations must not be moved before prior atomic operations
• operations must must be moved after subsequent atomic writes
• prior reads must not be moved after subsequent writes
• conflicting operations must not be reordered.

These constraints still permit a number of peephole optimisations including:

• Sequentialisation: [P || Q] =&gt; [P ; Q] (which is surprisingly invalid under C++ and Java memory models)
• Redundant load (elimination): [r1 = a; r2 = a;] =&gt; [r1 = a; r2 = r1]
• Store forwarding: [a = x; r1 = a] =&gt; [a = x; r1 = x]
• Dead store (elimination): [a = x; a = y] =&gt; [a = y]
• Common subexpression elimination: [r1 = a*2; r2 = b; r3 = a*2] =&gt; [r1 = a*2; r3 = r1; r2 = b]
• Loop-invariant code motion: [while(...) { a = b; r1 = c*c; ...}] =&gt; [r2 = c*c; while(...) { a = b; r1 = r2; ... }]
• Dead store elimination through reordering: [a = 1; b = c; a = 2] =&gt; [b = c; a = 1; a = 2] =&gt; [b = c; a = 2]
• Constant propagation via store forwarding: [a = 1; b = c; r = a] =&gt; [b = c; a = 1; r = a] =&gt; [b = c; a = 1; r = 1]

Furthermore, if a program fragment satisfies the local DRF property, then the compiler is free to apply any optimisations valid for sequential programs, not just the ones permitted by the memory model, to that program fragment.

Compilation to x86 looks like this (with details in section 7.3 of the paper):

And compilation to ARMv8 is as follows (details in section 7.3 of the paper):

In all cases compilation is sound: every behaviour that the hardware model allows of a compiled program is allowed by the software model of the original program.

### What price do we pay for local DRF?

We focus on quantifying the cost of nonatomic accesses on ARM and POWER architectures, since nonatomics in our memory model are free on x86. We leave the evaluation of the performance of our atomics for future work. Our aim is to evaluate the suitability of the memory model to be the default one in Multicore OCaml, a parallel extension of OCaml with thread-local minor heaps and shared major heap.

A suite of OCaml benchmarks are run looking at the overheads of the two ARMv8 compilation schemes (one using branch-after-load, and one using fence-before-store). The comparison also includes a strong release/acquire model which is strictly stronger than both.

On average, branch-after-load is 2.5% slower than the (non-parallel) OCaml baseline, and fence-before-store is 0.6% slower.

Our memory model is simpler than prevalent mainstream models and enjoys strong reasoning principles even in the presence of data races, while remaining efficiently implementable even on relaxed architectures… In future work we plan to extend our currently spartan model with other types of atomics. In particular, release-acquire atomics would be a useful extension: they are strong enough to describe many parallel programming idioms, yet weak enough to be relatively cheaply implementable.

Bounding data races in space and time Dolan et al., PLDI’18

Are you happy with your programming language’s memory model? In this beautifully written paper, Dolan et al. point out some of the unexpected behaviours that can arise in mainstream memory models (C++, Java) and why we might want to strive for something better. Then they show a comprehensible (!) memory model that offers good performance and supports local reasoning. The work is being done to provide a foundation for the multicore implementation of OCaml, but should be of interest much more broadly. There’s so much here that it’s worth taking our time over it, so I’m going to spread my write-up over a number of posts.

Today we’ll be looking at the concept of local data-race-freedom (local DRF) and why we might want this property in a programming language.

### Mainstream memory models don’t support local reasoning

Modern processors and compilers have all sorts of trickery they can deploy to make your program run faster. The optimisations don’t always play well with parallel execution though.

To benefit from these optimisations, mainstream languages such as C++ and Java have adopted complicated memory models which specify which of these relaxed behaviours programs may observe. However, these models are difficult to program against directly.

The data-race-freedom theorems enable programmers to reason about program behaviour globally. If all variables used for synchronisation between threads are atomic, and there are no concurrent accesses to non-atomic variables, then the DRF theorems guarantee no relaxed behaviours will be observed. Break the rules though, and all bets are off. What Dolan et al. highlight (we’ll see some examples in just a moment), is that there are many surprising cases where you think a piece of local code cannot possibly have a race, but in fact it does. You can’t easily apply modular or local reasoning.

This lack of local reasoning is a big issue, especially in safe languages (such as OCaml) that have compositional semantics: you should be able to understand parts of the program in isolation and reason about what happens when you put them together.

Let’s look at three C++/Java examples from section 2 of the paper which highlight the issue.

#### Data races in space

To support local reasoning, we’d like data races to be bound in space. This is true for Java, but not for C++. Consider the statement b = a + 10, executing in an environment in which there are no other accesses to a or b. Surely we can rely on b equaling a + 10 once this statement has executed?? It turns out this isn’t true!

Meet c. c is a nonatomic global variable that happens to be written to by two different threads:

If the elided computation in the first thread is pure (writes no memory, has no side-effects), then the compiler might optimise this code by introducing a new temporary variable t, like so:

If the elided computation also causes register pressure, we might need to save t somewhere. One option is to rematerialise t from c, since it’s value is already stored there…

Now we’ve introduced a data race on c that impacts our single line of code, and b = a + 10 can result in b being 1, regardless of the value of a!!

The particular optimisation used in step 3 here is just illustrative, and in fact is not generally implemented although it has been proposed for addition to LLVM. There are plenty of other similar races to worry about though.

#### Data races in time

To support local reasoning, we’d also like data races to be bound in time. This is not the case in both C++ and Java. Data races, it turns out, can time-travel! You might reasonably expect a data-race in the past to affect present behaviour, but so can data races in the future!!

As a first example, consider the program b = a; c = a; with no accesses to a, b, or c happening concurrently with the execution of those statements. We’d expect as a post-condition that b == c. However, this is not guaranteed to be true. Let flag be an atomic (volatile in Java) boolean, initially false. Now we’ll have two threads of execution using flag, like this:

After execution, we see that f is true. Therefore we know that both of the writes to a happen-before both of the reads (the read and write of flag synchronise). But due to aliasing effects, it’s still possible in Java to end up with b == 1 and c == 2. A compiler may optimise b = a to b = 2 without applying the same optimisation to c = a. Now if the first thread wins the race to set a before the flag, we have b not equal to c!

So, the effect of data races in Java is not bounded in time, because the memory model permits reads to return inconsistent values because of a data race that happened in the past. Surprisingly, non-sequential behaviour can also occur because of data races in the future…

Take a class class C { int x; }, and the following program sequence: C c = new C(); c.x = 42; a = c.x;. By creating a brand new instance of C, we eliminate the possibility of any historic data races. Surely we can rely on a being 42 after this sequence executes? In C++ and in Java, we can’t!

Given the two threads of execution below, then in the first thread the read of c.x and the write of g operate on separate locations. This means the Java memory model permits them to be reordered. If that happens we can now be exposed to a race whereby c.x can end up with the value 7.

### From global DRF to local DRF

The kind of examples we’ve just been looking at illustrate how hard it can be to figure out all of the potential behaviours in a parallel program. It’s a wonder things work in practice as well as they do! In all of the examples above, with the conditions specified that there are no other accesses to the variables in question, our (or is it just me?) intuition is that all of these fragments should be data race free. But that intuition is wrong. Instead of trying to balance all of the possibilities introduced by optimisations in our heads when looking at a fragment of code, could there instead be a model that makes our intuitions correct?

We propose a local DRF property which states that data races are bounded in space and time: accesses to variables are not affected by data races on other variables, data races in the past, or data races in the future. In particular, the following intuitive property holds:

If a location a is read twice by the same thread, and there are no concurrent writes to a, then both reads return the same value.

With such a local DRF property, all of the examples above have the expected behaviour. With only the standard global DRF property however, we would also need to show that there are no data races on any variables at any time during the program’s execution.

Where we’re heading is to be able to take a fragment of code such as the examples above, and extract L, the set of all locations accessed by the fragment. Then we must show that the machine state before executing the fragments is L-Stable (doesn’t have any data races in progress on locations in L). In this case L-Stability follows from the assumptions given for each example.

Local DRF then states that the fragments will have sequential behaviour at least until the next data race on L. Since the fragments themselves contain no data races on L, this implies that they have the expected behaviour.

I.e., I can reason about their behaviour locally. That sounds like a property I’d like my programming language to have!

In part II tomorrow, we’ll look at how local DRF can be achieved in practice.

HHVM is a virtual machine for PHP and Hack (a PHP extension) which is used to power Facebook’s website among others. Today’s paper choice describes the second generation HHVM implementation, which delivered a 21.7% performance boost when running the Facebook website compared to the previous HHVM implementation.

…the PHP code base that runs the Facebook website includes tens of millions of lines of source code, which are translated to hundreds of megabytes of machine code during execution.

I’m clearly suffering from an over-simplified understanding of what the Facebook web application actually does, but at the same time if I asked you to write a Facebook clone for just the website (not the backing services, not the mobile apps, etc.), would your initial estimate be on the order of tens of millions of lines of code???!

### HHVM high-level overview

The starting point for HHVM is source code in PHP or Hack. Hack is a PHP dialect used by Facebook and includes support for a richer set of type hints. From the perspective of HHVM though the two languages are fundamentally equivalent. In particular, Hack’s type hints are only used by a static type checker and are discarded by the HHVM runtime.

The reason for discarding these richer type hints at runtime is that Hack’s gradual type system is unsound. The static type checker is optimistic and it ignores many dynamic features of the language. Therefore, even if a program type checks, different types can be observed at runtime…

Source code is parsed by the HipHop compiler (hphpc) to produce an abstract syntax tree (AST). HHVM v2 adds a new HipHop Bytecode-to-Bytecode compiler (hhbbc) which translates the AST into HHBC, the bytecode of HHVM. A number of analyses and optimisations are performed during the translation into HHBC, all of which are done ‘ahead-of-time,’ i.e., before the program actually executes.

HHBC is a stack-based bytecode, meaning that the majority of the bytecode instructions either consume inputs from or push results onto the evaluation stack, or both.

At runtime, HHVM combines a traditional threaded interpreter and a JIT compiler. An important feature of the JIT design is its use of side-exits which use on-stack replacement (OSR) so that compiled code execution can always stop at any (bytecode-level) program point and transfer control back to the interpreter, the runtime system, or to another compiled code region.

This ability to always fallback to the interpreter means that the JIT is free to focus on the mainline use cases and avoid all of weird corner cases that can occur in a dynamic language (e.g., multiplying a string by an array!). For large-scale applications, it also helps to manage CPU and memory overheads incurred by JIT compilation since not everything needs to be JIT-compiled.

The core of the paper describes the design and implementation of the new HHVM JIT.

### The new HHVM JIT

The HHVM JIT is designed around four main principles:

1. Type specialisation
2. Profile-guided optimisations
3. Side exits
4. Region-based compilation

Although each of these principles has been used by a number of previous systems, the combination of these four principles in a dynamic-language JIT compiler is novel to the best of our knowledge.

Type specialisation uses type information —which can be obtained either statically or dynamically— to generate type specific code. For example, code specialised to integers (which are not reference counted) can be more efficient. There’s a trade-off to be made here between the performance boost of specialised code, and the overheads of generating lots of type-specific routines (over-specialisation).

One of the key sources of dynamic type information that the HHVM JIT uses is profiling.

Profile-guided optimizations (PGO), also called feeback-driven optimizations (FDO), are a class of optimizations that can be either enabled or enhanced by the use of runtime information.

Beyond informing type specialisation, the profile information is also used for code layout and method dispatch.

Side-exits (aka uncommon traps) allow the JIT to specialize the code for common types without supporting the uncommon types that may not even be possible at runtime.

Code generation in HHVM is based on regions (region-based compilation), where regions are not restricted to entire methods, single basic blocks or straightline traces. Region-based compilation works well with type specialisation, profile-guide optimisations, and side-exits: profiling highlights regions that are then used to compile generated code for common types, with side-exits to fall back to interpreted execution in less common cases.

The data structure used to represent regions at the bytecode level is a region descriptor. Region descriptors contain a control-flow graph (CFG), in which each node is a basic-block region used for profiling. For every block, a region descriptor contains:

• the sequence of bytecode instructions
• preconditions (type guards)
• post-conditions (known types at the end of the translation)
• type constraints (encoding information about how much knowledge is required for each input type in order to generate efficient code).

Region descriptors are lowered into a HipHop Intermediate Representation (HHIR), a typed representation using static single assignment (SSA) form. The type information comes from the hhbbc compiler and from profiling data. The HHIR level is where most of the code optimisations happen.

At a level below HHIR is a Virtual Assembly (Vasm) representation. It’s instruction set is very close to machine code, but supports an infinite number of virtual registers.

There are three different compilation modes within the HHVM JIT: live translations (the only kind supported in the first generation implementation), profiling translations (basic optimisations plus the insertion of block profiling), and fully optimised translations.

### Important optimisations

The full optimisation pipeline applies optimisations at each level of abstraction, as summarised in the figure below.

Prior to the region-based optimisations, there are also a couple of important whole-program optimisations.

#### Whole-program

• Function sorting places functions that commonly call each other close by in memory. A weighted directed call graph is built for the program, using basic block profile counters to inform the weights. Then the functions are clustered and sorted. The algorithm is fast enough to handle the large amount of code generated to support the Facebook website in under one second.
• Huge pages: once the hot code is clustered in the code cache, this in then mapped onto huge pages.

#### Region-level

• For live and profiling translations, regions are formed by inspecting the VM state and flowing live types. For optimised translations the collected profile information is used. A single function may be broken into multiple regions, and conversely a region may contain portions of multiple functions (through partial inlining). Each function is associated with a weighted CFG for all the blocks within it, which are grouped into regions using a depth-first search with blocks and arcs optionally skipped based on their weights.
• Guard relaxation helps to avoid over-specialisation, where the compiler specialises code excessively, leading to code bloat. As the name suggests, guard relaxation weakens type guard pre-conditions in some circumstances to allow more types to go through. Of course, the corresponding generated code will be slightly less efficient as it must now cope with more types – but it can still be much better than fully general code. As an example, an integer-only guard may be relaxed to allow integers and doubles.

#### HHIR-level

• Partial inlining is used to inline common simple methods such as getters and setters (it’s partial because full inlining can lead to code bloat).
• PHP uses reference counting to manage memory for common types such as strings, objects, and arrays. Naive reference counting solutions can lead to significant overheads. The HHVM JIT uses reference counting elimination to avoid reference counting where possible. The essence of the idea is to sink increment instructions in the call graph where it is provable safe. If an increment instructions sinks such that it is now next to a decrement instructions, then the pair can be eliminated.
• Method dispatching is optimised using information from the profiler. Specialised HHIR is emitted to (a) devirtualize momomorphic calls, (b) optimise calls from a common base class, and (c) optimise calls where the methods implement the same interface.

### Evaluation

The evaluation is based on measurements taken from the Facebook website running on HHVM. The top line results can be seen in the following figure, which takes a bit of interpreting:

The ‘JIT-region’ bar shows the performance of the full optimisation path, normalised to 100%. The previous generation of the HHVM JIT only support live translation, and it’s performance is represented by the ‘JIT-Tracelet’ bar. The live translation path achieves 82.2% of the full optimisation performance, or put another way, the region JIT in HHVM v2 provides about a 21.7% speedup. The performance achieved by the profile translations only is show in the ‘JIT-profile’ bar (about 39.8% of the full optimisation performance). Even the profile translation on its own generates code that is 3.1x faster than interpreted code.

The contribution of the various optimisations can be seen in the following chart, which explores the relative slowdown with selectively disabling individual optimisations.

BLeak: Automatically debugging memory leaks in web applications Vilk & Berger, PLDI’18

BLeak is a Browser Leak debugger that finds memory leaks in web applications. You can use BLeak to test your own applications by following the instructions at http://bleak-detector.org.

Guided by BLeak, we identify and fix over 50 memory leaks in popular libraries and apps including Airbnb, AngularJS, Google Analytics, Google Maps SDK, and jQuery. BLeak’s median precision is 100%; fixing the leaks it identifies reduces heap growth by an average of 94%, saving from 0.5MB to 8MB per round trip.

### Why are web application memory leaks so problematic?

Memory leaks in web applications are a pervasive problem. They lead to higher garbage collection frequency and overhead, reduced application responsiveness, and even browser tab crashes. Existing memory leak detection approaches don’t work well in the browser environment though:

• Staleness detection assumes leaked memory is rarely touched, but web apps regularly interact with leaked state (e.g. via event listeners).
• Growth-based technique assume that leaked objects are uniquely owned, or that leaked objects from strongly connected components in the heap graph. In a web application, objects frequently have multiple owners, and all roads lead back to window.
• Techniques that depend on static type information are not applicable since JavaScript is dynamically typed.

The current state of the art is manual processing of heap snapshots. As we show, this approach does not effectively identify leaking objects or provide useful diagnostic information, and it thus does little to help developers locate and fix memory leaks.

Consider the following extract from the Firefox debugger, which leaks memory every time a developer opens a source file since the event listeners are never removed. (The leak was found using BLeak of course!).

Using heap snapshots to try and detect diagnose the problem leaves developers staring at heap snapshots like this:

The top item in the heap snapshot, Array conflates all arrays in the application under one heading, and the (array) entry is referring to internal V8 arrays not under the application’s direct control. The primary leak is actually the of the Preview object, but it appears low on this list and has a small retained size, making it an unlikely target for investigation using this view. And even if a developer does decide to look into the Preview entry, finding the responsible code based on the retaining path information in the snapshot is not easy:

### An overview of BLeak

Instead of poring over heap snapshots, for the same memory leak as above BLeak will produce diagnostic output that looks like this:

The developer is pointed directly to the problematic source code.

BLeak automatically detects, ranks, and diagnoses memory leaks. The central observation is that in modern web apps users frequently return to the same visual state. For example, the news feed in Facebook, the property listing page in Airbnb, and the inbox view in Gmail.

… these round trips can be viewed as an oracle to identify leaks. Because visits to the same (approximate) visual stat should consume roughly the same amount of memory, sustained memory growth between visits is a strong indicator of a memory leak.

To use BLeak, you need to provide a short driver loop that cycles the application through the desired visual states. The loop is specified as an array of objects, each providing a check function to validate state preconditions, and a next function to transition to the next state. The final transition must take you back to the initial state, so that the cycle can repeat.

An example makes this much easier to understand. In the Firefox debugger case, the debugger opens to a view with a list of all documents in the application being debugged. Clicking on the first document in the list (main.js) opens it in the text editor. Closing the editor takes us back to the list view again. Here’s the loop specification:

Appendix B contains the driver loops used for a variety of applications including Airbnb. You can see that the loops are fairly easy to write.

BLeak uses the provided loop script to drive the application in a loop (eight iterations by default).

After each visit to the first visual state in the loop, BLeak takes a heap snapshot and tracks specific paths from GC roots that are continually growing. BLeak treats a path as growing if the object identified by that path gains more outgoing references (e.g., when an array expands or when properties are added to an object).

In the Firefox debugger example, BLeak finds four leak roots: an array within the codeMirror object that contains scroll event listeners, and event listener lists for mouseover, mouseup, and mousedown events on the DOM element containing the text editor.

The detected leak roots are then ranked to help developers prioritise their attention. Ranking is done via a leak share metric which first prunes objects in the graph reachable by non-leak roots, and then splits the credit for the remaining objects equally among the leak roots that retain them.

The final piece of the puzzle is pinpointing the sources of the detected leaks in order to produce the report we saw earlier.

BLeak … reloads the application and uses its proxy to transparently rewrite all of the JavaScript on the page, exposing otherwise-hidden edges in the heap as object properties. BLeak uses JavaScript reflection to instrument identified leak roots to capture stack traces when they grow and when they are overwritten (not just where they were allocated). With this instrumentation in place, BLeak uses the developer-provided script to run one final iteration of the loop to collect stack tracks. These stack traces directly zero in on the code responsible for leak growth.

### BLeak in action

BLeak is evaluated using a corpus of give popular web application that in turn use a variety of libraries (React, AngularJS, Google Analytics, etc.):

• Airbnb (looping between the page listing all services on Airbnb, and the page listing only homes and rooms for rent)
• Piwik 3.0.2 – an open source analytics platform. BLeak repeatedly visits the main dashboard page.
• Loomio 1.8.66 – a collaborative platform for group decision making. BLeak transitions between a group page listing all threads in a group, and the first thread listed on the page.
• Mailpile v1.0.0 – an open source mail client. BLeak cycles between the inbox and the first four emails in the inbox.
• Firefox debugger – while debugging Mozilla’s SensorWeb. The BLeak loop repeatedly opens and closes main.js in the debugger’s text editor.

Driver loops for all of these programs can be found in Appendix B, and the full list of detected leaks in these applications can be found in Appendix A.

Overall, BLeak finds 59 distinct memory leaks across the five applications, all of which were unknown to application developers. 27 of the discovered leaks were actually in libraries used by the web applications (6 of the 27 had been independently diagnosed by the library developers and had pending fixes). That leaves 32 memory leaks detected directly in the application code.

We reported all 32 new memory leaks to the relevant developers along with our fixes; 16 are now fixed, and 4 have fixes in code review. We find new leaks in popular applications and libraries including Airbnb, AngularJS, Google Maps SDK, Google Tag Manager, and Google Analytics.

The key results are summarised in the table below.

• BLeak has an average precision of 96.8%, and a median precision of 100%. Overall there were only three false positives, all caused by an object that continuously grows until some threshold or timeout is reached – increasing the number of round trips would have removed these .
• BLeak accurately pinpoints the code responsible for the leaks in all but one case (in which it fails to record a stack trace).
• Guide by the BLeak reports, the authors were able to fix every memory leak. It took about 15 minutes per leak to implement a fix.
• BLeak locates, ranks, and diagnoses memory leaks in less than seven minutes on these applications. Most of the time goes on receiving and parsing Chrome’s JSON-based heap snapshots.
• On average, fixing the memory leaks that BLeak reports eliminates over 93%h of all heap growth.
• Of the memory leaks BLeak finds, at least 77% would not be found with a staleness-based approach.

We show that BLeak has high precision and finds numerous previously-unknown memory leaks in web applications and libraries. BLeak is open source and is available for download at http://bleak-detector.org.

It’s time to start a new term on #themorningpaper. I read my very first #themorningpaper on the 30th July 2014 (“Why functional programming matters”, Hughes 1990) and since then, bar three scheduled breaks a year, I’ve been reading a research paper every weekday. Since the 8th October 2014, I’ve also been posting a write-up of the day’s paper on The Morning Paper blog. There’s not always an exact 1:1 correspondence between a post and a paper, but it’s pretty good. On that basis, you can now find somewhere in the order of 840 paper write-ups on the blog, and we’re racing towards the 1,000 mark!

People often ask, and yes, it’s a lot of work to curate papers, read them, and post the write-ups! Somewhere on the order of 15-20 hours a week (and all outside of my regular work commitments). I’ve got a tremendous amount of value out of this habit and I intend to keep it going. However, starting again this term it suddenly felt like a big load to carry – might be something to do with the fact that we’re moving house in a couple of weeks! So to keep things feeling fun for me and avoid it turning into a chore, I’m giving myself permission not to post a paper every single weekday. That puts a little more flexibility into my schedule on busy weeks, and makes space for a few more fitness-related activities as well (feeding your mind, but neglecting your health and fitness isn’t smart!). Some weeks there’ll still be one paper a day, some weeks there might only be one, I expect I’ll settle somewhere around 3 papers a week for the time being. (That feels like a light load after 4 years of doing 5!). Every paper will still go out to this blog, the mailing list, and to twitter so if you follow one of those sources, you won’t miss anything.

Some of you may be saying “thank goodness, now I’ll have more of a chance of keeping up!”. For a select few readers, I know you’ve developed a habit of reading The Morning Paper every day, incorporated into your routines somewhere. Unless you’ve been following along since day 1, there are hundreds of paper write-ups in the archives that can help to fill in the gaps. Another great thing to do would be to use that time to branch out on your own a little – follow up on one or two threads from things that have sparked your interest in The Morning Paper and hopefully discover some great new papers and research that way too. (But be warned, it can be addictive!). If you find something you really enjoy let me know – I’m always on the look out for great papers to cover.

“A Computer Science research paper most weekdays” doesn’t have quite the same ring as “a Computer Science research paper every weekday,” but keeping things sustainable hopefully means we can keep this thing going for many years to come!

And yes, I’ve got paper write-ups scheduled for every day the rest of this week ;).

It’s time to take my regular summer break from writing The Morning Paper – normal service will resume again on Monday 6th August. I’ll be topping up my paper backlog and scouting out interesting research during the break. If you’ve seen any great papers I haven’t already covered and that you think ‘The Morning Paper’ readers would enjoy, please do send them my way.

In the meantime, here are a few picks from papers we’ve covered this term in case you missed any of them. (In order of publication on the blog)

Many thanks to all of you that follow along and support the blog – the accountability keeps me sticking at it when I’m snowed under, and the joy of sharing great research wouldn’t be the same without you!