Fast and precise type checking for JavaScript Chaudhuri et al., OOPSLA’17

In this paper we present the design and implementation of Flow, a fast and precise type checker for JavaScript that is used by thousands of developers on millions of lines of code at Facebook every day.

In a pretty dense 30 pages, ‘Fast and precise type checking for JavaScript’ takes you through exactly how Facebook’s Flow works (although even then, some details are deferred to the extended edition!). I can’t read a paper packed with concise judgements and syntax such as this one now without thinking of Guy Steele’s wonderful recent talk “It’s time for a new old language.” It makes me feel a little better when struggling to get to grips with what the authors are really saying! Rather than unpacking the formal definitions here (which feels like it might take a small book and many hours!!), I’m going to focus in this write-up on giving you a high-level feel for what Flow does under the covers, and how it does it.

### Why Flow? Motivation and goals

Evolving and growing a JavaScript codebase is notoriously challenging. Developers spend a lot of time debugging silly mistakes — like mistyped property names, out-of-order arguments, references to missing values, checks that never fail due to implicit conversions, and so on — and worse, unraveling assumptions and guarantees in code written by others.

The main internal repository at Facebook contains around 13M LoC of JavaScript, spanning about 122K files. That’s a lot of JavaScript! All of that code is covered by Flow, a static type checker for JavaScript that Facebook have been using for the past three years. Flow therefore has to be practical at scale and usable on real-world projects:

• The type checker must be able to cover large parts of a codebase without requiring too many changes in the code.
• The type checker must give fast responses even on a large codebase. “Developers do not want any noticeable ‘compile-time’ latency in their normal workflow, because that would defeat the whole purpose of using JavaScript.”

Using Flow, developers (or developer tools) get precise answers to code intelligence queries, and Flow can catch a large number of common bugs with few false positives.

To meet Flow’s goals, the designers made three key decisions:

1. Common JavaScript idioms such as x = x || 0 are precisely modelled. To be able to handle cases like this requires support for type refinements (more on that soon).
2. Reflection and other legacy patterns that appear in a relatively small fraction of codebases are not explicitly focused on. Flow analyses source code, rather than first translating it to e.g. ES5, or even ES3.
3. The constraint-based analysis is modularised to support parallel computation so that queries can be answered in well under a second even on codebases with millions of lines of code.

The related work section of the paper (§11) provides a nice comparison of the differences in design choices between Flow and TypeScript from the perspective of the Flow team:

Unlike Flow, [TypeScript] focuses only on finding “likely errors” without caring about soundness. Type inference in TypeScript is mostly local and in some cases contextual; it doesn’t perform a global type inference like Flow, so in general more annotations are needed… Furthermore, even with fully annotated programs, TypeScript misses type errors because of unsound typing rules… Unsoundness is a deliberate choice in TypeScript and Dart, motivated by the desire to balance convenience with bug-finding. But we have enough anecdotal evidence from developers at Facebook that focusing on soundness is not only useful but also desirable, and does not necessarily imply inconvenience.

(We recently looked at ‘To Type or Not to Type’ where the authors found that Flow and TypeScript had roughly equivalent power in their ability to detect bugs.)

### High level overview

In Flow, the set of runtime values that a variable may contain is described by its type (nothing new there!). The secret sauce of Flow is using runtime tests in the code to refine types.

For example, consider the following code:

	function pipe(x, f) {
	  if (f != null) { f(x); }
	}


Seeing the test f != null, Flow refines the type of f to filter out null, and therefore knows that the value null cannot reach the call.

Refinements are useful with algebraic data types too. In this idiom, records of different shapes have a common property that specifies the ‘constructor,’ and other properties that depend on the constructor value. For example,

	var nil = { kind: "nil" };
	var cons = (head, tail) => {
	  return { kind: "cons", head, tail };
	};


Now when Flow sees a function such as this:

	function sum(list) {
	  if (list.kind === "cons") {
	    return list.head + sum(list.tail);
	  }
	  return 0;
	}


It refines the type of list following the test list.kind === "cons" so that it knows the only objects reaching the head and tail property accesses on the following line are guaranteed to have those properties.
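Since these snippets are plain JavaScript (Flow layers checking on top of unmodified syntax), we can exercise the idiom end-to-end; a quick runnable sketch with the cons branch of sum filled in:

```javascript
// Lists encoded as tagged records, as in the snippets above.
var nil = { kind: "nil" };
var cons = (head, tail) => ({ kind: "cons", head, tail });

function sum(list) {
  if (list.kind === "cons") {
    // Only records tagged "cons" reach these property accesses.
    return list.head + sum(list.tail);
  }
  return 0;
}

console.log(sum(cons(1, cons(2, cons(3, nil))))); // 6
```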

When Flow sees an idiom such as x = x || nil (the var nil from our previous code snippet, not to be confused with null here!), it models the assignment by merging the refined type of x with the type of nil and updating the type of x with it.
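For instance (a made-up length function, reusing the nil/cons encoding above):

```javascript
var nil = { kind: "nil" };
var cons = (head, tail) => ({ kind: "cons", head, tail });

function length(list) {
  // The x = x || nil idiom: Flow merges the refined type of list
  // with the type of nil, so null/undefined cannot flow past this line.
  list = list || nil;
  var n = 0;
  while (list.kind === "cons") { n += 1; list = list.tail; }
  return n;
}

console.log(length(undefined));                 // 0
console.log(length(cons("a", cons("b", nil)))); // 2
```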

Refinements can also be invalidated by assignments (for example, x = null;).

### Flow of constraints

Section 2 of the paper provides a formal definition of inference supporting refinement in FlowCore, a minimal subset of JavaScript including functions, mutable variables, primitive values, and records. The core components of the FlowCore constraint system are types, effects, and constraints. Effects track variable updates – each language term is associated with an effect, which roughly describes the set of variables that are (re)assigned with the term.

When Flow sees the line of code var nil = { kind: "nil" }; it records that nil is of the form $\{ kind: \alpha_1 \}$ where $\alpha_1$ is a type variable, and also the constraint $\text{"nil"} \leq \alpha_1$.

For the function cons,

	var cons = (head, tail) => {
	  return { kind: "cons", head, tail };
	};


We get a type for cons of $(\alpha_2, \alpha_3) \rightarrow \{ kind: \alpha_4, head: \alpha_5, tail: \alpha_6 \}$ and additional constraints $\{ \text{"cons"} \leq \alpha_4, \alpha_2 \leq \alpha_5, \alpha_3 \leq \alpha_6 \}$. And so the process continues, constructing a flow network.

Thinking of our system as a dataflow analysis framework, constraint generation amounts to setting up a flow network. The next step is to allow the system to stabilize under a set of appropriate flow functions.

For every new constraint that gets generated, all eligible constraint propagation rules are applied until a fixpoint is reached. Once the fixpoint is reached, Flow can either discover inconsistencies, or prove the absence thereof. Inconsistencies correspond to potential bugs in the use of various operations.
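To give a feel for the flavour (this is my own toy simplification, nothing like Flow's actual representation), constraints can be closed transitively through type variables until nothing new is derivable:

```javascript
// Constraints are pairs [lower, upper]: "lower flows into upper".
// Propagation rule: if (t ≤ α) and (α ≤ u) for a type variable α, add (t ≤ u).
function solve(pairs) {
  const set = new Set(pairs.map(p => p.join(" ≤ ")));
  let changed = true;
  while (changed) {            // iterate to a fixpoint
    changed = false;
    for (const a of [...set]) {
      const [l1, u1] = a.split(" ≤ ");
      if (!u1.startsWith("α")) continue;
      for (const b of [...set]) {
        const [l2, u2] = b.split(" ≤ ");
        if (l2 !== u1) continue;
        const c = `${l1} ≤ ${u2}`;
        if (!set.has(c)) { set.add(c); changed = true; }
      }
    }
  }
  return set;
}

// "nil" ≤ α1 and α1 ≤ number-use propagate to "nil" ≤ number-use:
// an inconsistency Flow would surface as a type error.
const solved = solve([['"nil"', "α1"], ["α1", "number-use"]]);
console.log(solved.has('"nil" ≤ number-use')); // true
```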

Section 4 in the paper briefly introduces a runtime semantics for FlowCore, and section 5 proves type safety via the introduction of a declarative type system closely matching the type inference system described above.

### Type annotations

Type annotations $\tau$ follow a similar grammar to types, except that there are no type variables, types can appear anywhere type variables could appear, and there are no effects. We consider a type annotation to be just another kind of type use, that expects some type of values. In other words, like everything else we can formulate type checking with flow constraints.

### Modules and dependencies

Flow analyses a code base module-by-module (file-by-file). Each file is analysed separately once all files it depends on have been analysed. This strategy supports incremental and parallel analysis of large code bases.

The key idea is to demand a “signature” for every module. We ensure that types inferred for the expressions a module exports do not contain type variables — wherever they do, we demand type annotations… Requiring annotations for module interfaces is much better than requiring per-function annotations… Independently, having a signature for every module turns out to be a desirable choice for software engineering. It is considered good practice for documentation (files that import the module can simply look up its signature instead of its implementation), as well as error localization (blames for errors do not cross module boundaries).

In the 13M lines of JavaScript code in the main internal Facebook repository, about 29% of all locations where annotations could potentially be used actually do have annotations. The value of Flow’s type refinement mechanism was shown by turning it off – this led to more than 145K spurious errors being reported!

Virtual machine warmup blows hot and cold Barrett et al., OOPSLA’17

(With thanks to Prof. Richard Jones at Kent University who first pointed this paper out to me.)

Yesterday we saw the recommendations of Georges et al. for determining when a (Java) virtual machine has reached a steady state and benchmarks can be taken. Kalibera and Jones later provided a more accurate manual process. In ‘Virtual machine warmup blows hot and cold,’ Barrett et al. provide a fully-automated approach to determining when a steady state has been reached, and also whether or not that steady state represents peak performance. Their investigation applies to VMs across a range of languages: Java, JavaScript, Python, Lua, PHP, and Ruby.

Our results suggest that much real-world VM benchmarking, which nearly all relies on assuming that benchmarks do reach a steady state of peak performance, is likely to be partly or wholly misleading. Since microbenchmarks similar to those in this paper are often used in isolation to gauge the efficacy of VM optimisations, it is also likely that ineffective, or deleterious, optimisations may have been incorrectly judged as improving performance and included in VMs.

If you’re simply allowing a VM to run a small number (e.g., 10) of iterations and then expecting it to be warmed-up and in a steady state by then, you’re definitely doing it wrong!

To gather their data, the authors use a very carefully controlled experiment design. Because of the level of detail and isolation it took around three person-years to design and implement the experiments. The repeatable experiment artefacts are available online at https://archive.org/download/softdev_warmup_experiment_artefacts/v0.8/.

### Benchmarking methodology

The basis for benchmarking is a set of microbenchmarks from the Computer Languages Benchmarks Game (CLBG): binary trees, spectralnorm, n-body, fasta, and fannkuch-redux. These small benchmarks are widely used by VM authors as optimisation targets. Versions of these for C, Java, JavaScript, Python, Lua, and Ruby are taken from Bolz and Tratt 2015.

On each process run, 2000 iterations of the microbenchmark are executed. There are 30 process executions overall, so we have a total of 30 x 2000 iterations. The authors go to great lengths to eliminate any other factors which may contribute to variance in the benchmarking results. For example, the machines are clean rebooted to bring them into a known state before each process run, networking is disabled, daemons are disabled, there is no file I/O, and so on. They even ensure that the machines run benchmarks within a safe temperature range (to avoid the effects of temperature-based CPU limiters in CPUs). Full details are in section 3, and section 7 on ‘threats to validity’ outlines even more steps that were taken in an attempt to obtain results that are as accurate and reliable as possible. Suffice to say, you can easily start to see where some of those three person-years went!

The main hypothesis under investigation is:

• H1 Small, deterministic programs reach a steady state of peak performance.

And as a secondary hypothesis:

• H2 Moderately different hardware and operating systems have little effect on warmup.

If the expected pattern of warm-up followed by steady-state peak performance is not observed, then the third hypothesis is:

• H3 Non-warmup process executions are largely due to JIT compilation or GC events.

Benchmarks are run on GCC 4.9.3, Graal 0.18, HHVM 3.15.3 (PHP), JRuby + Truffle 9.1.2.0, HotSpot 8u112b15 (Java), LuaJIT 2.0.4, PyPy 5.6.0, and V8 5.4.500.43 (JavaScript). Three different benchmarking machines were used in order to test H2. Linux (1240v5) and Linux (4790) have the same OS (with the same packages and updates etc.) but different hardware. Linux (4790) and OpenBSD (4790) have the same hardware but different operating systems.

### Changepoint analysis

Each in-process run results in time series data of length 2000. A technique called statistical changepoint analysis is used to analyse the data and classify the results. Prior to this analysis the data is pre-processed to remove outliers, defined as any point after the first 200 that is outside the median ±3x (90%ile – 10%ile). Overall, 0.3% of all data points are classified as outliers under this definition.
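The outlier rule is easy to state in code; a sketch (I've assumed a nearest-rank percentile here, which the paper doesn't pin down):

```javascript
// A point is an outlier if it lies outside median ± 3 × (90%ile − 10%ile).
function percentile(sorted, p) {
  const i = Math.max(0, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.min(sorted.length - 1, i)];
}

function outlierBounds(times) {
  const s = [...times].sort((a, b) => a - b);
  const spread = percentile(s, 0.9) - percentile(s, 0.1);
  const median = percentile(s, 0.5);
  return [median - 3 * spread, median + 3 * spread];
}

// 20 iteration timings near 1s, plus one anomalous reading at 100s:
const times = Array.from({ length: 20 }, (_, i) => 1 + i * 0.01).concat([100]);
const [, hi] = outlierBounds(times);
console.log(100 > hi); // true: the 100s reading is classed as an outlier
```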

We use changepoint analysis to determine if and when warmup has occurred. Formally, a changepoint is a point in time where the statistical properties of prior data are different to the statistical properties of subsequent data; the data between two changepoints is a changepoint segment. Changepoint analysis is a computationally challenging problem, requiring consideration of large numbers of possible changepoints.

The authors use the PELT algorithm which reduces the complexity to O(n). Changepoint detection is based on both the mean and variance of in-process iterations. Changepoint segments that have means within a small threshold (0.001 seconds) are considered equivalent. In addition a segment will be considered equivalent to the final segment if its mean is within variance($s_f$) seconds of the final segment mean. (Treating the variance of the final segment, $s_f$, as if it was a measure of seconds, not seconds squared). This variance-based threshold is used to account for the cumulative effect of external events during a run.
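Those two equivalence tests can be phrased directly (a sketch, assuming each segment's mean and variance have already been computed from the changepoint analysis):

```javascript
// Two changepoint segments are equivalent if their means are within 0.001s;
// additionally, a segment counts as equivalent to the final segment if its
// mean is within variance(s_f) (read as seconds) of the final segment's mean.
function equivalentToFinal(seg, finalSeg) {
  const d = Math.abs(seg.mean - finalSeg.mean);
  return d <= 0.001 || d <= finalSeg.variance;
}

console.log(equivalentToFinal({ mean: 1.0005 }, { mean: 1.0, variance: 1e-4 })); // true
console.log(equivalentToFinal({ mean: 1.2 },    { mean: 1.0, variance: 1e-4 })); // false
```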

If hypothesis H1 holds, then we will see warm-up segment(s) followed by one or more steady-state segments, with the final segment being the fastest. A benchmark is said to reach a steady state if all segments which cover the last 500 in-process iterations are considered equivalent to the final segment.

• When the last 500 in-process iterations are not equivalent to the final segment then we say that the benchmark had no steady state.
• If a steady state is reached and all segments are equivalent then the benchmark is classified as flat.
• If a steady state is reached and at least one segment is faster than the final segment then the benchmark is classified as slowdown.
• If a steady state is reached and it is not flat or a slowdown, then we have the classic warmup pattern.

Flat and warmup benchmarks are considered ‘good,’ while slowdown and no steady state benchmarks are ‘bad.’
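The classification rules above can be sketched as follows (taking as given, for each segment, its mean and whether it is equivalent to the final segment):

```javascript
// segments: changepoint segments in order, each of the shape
// { mean, equivalentToFinal, coversLast500 }.
function classify(segments) {
  const last = segments[segments.length - 1];
  const steady = segments
    .filter(s => s.coversLast500)
    .every(s => s.equivalentToFinal);
  if (!steady) return "no steady state";
  if (segments.every(s => s.equivalentToFinal)) return "flat";
  if (segments.some(s => s.mean < last.mean)) return "slowdown";
  return "warmup";
}

// A slow warmup segment followed by a faster steady state:
console.log(classify([
  { mean: 2.0, equivalentToFinal: false, coversLast500: false },
  { mean: 1.0, equivalentToFinal: true,  coversLast500: true },
])); // warmup
```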

That deals with a single process run of 2000 in-process iterations. If all process executions for a given (VM, benchmark) pair have the same classification, then the pair is classified the same way and said to be consistent, otherwise the pair is classified as inconsistent.

In the charts and diagrams in the paper, you’ll see these various categories represented by symbols like this:

Our results consist of data for 3660 process executions and 7,320,000 in-process iterations. Table 1 (below) summarises the (VM, benchmark) pairs and process executions for each benchmarking machine.

Note that for (VM, benchmark) pairs, at best 37% of pairs show flat or warmup patterns, and another 6.5% are good inconsistent. The biggest proportion by far is ‘bad inconsistent.’

This latter figure clearly shows a widespread lack of predictability: in almost half of cases, the same benchmark on the same VM on the same machine has more than one performance characteristic. It is tempting to pick one of these performance characteristics – VM benchmarking sometimes reports the fastest process execution, for example – but it is important to note that all of these performance characteristics are valid and may be experienced by real-world users.

Here’s a breakdown just for one machine, showing that only n-body and spectralnorm come close to ‘good’ warmup behaviour on all machines.


VMs seem to mostly either reach a steady state quickly (often in 10 or fewer in-process iterations) or take hundreds of in-process iterations. The latter are troubling because previous benchmarking methodologies will not have run the benchmarks long enough to see the steady state emerge.

Since there are so many cases that do not fit the expected warmup pattern, the authors investigated these to see if H3 holds: that these cases are mostly due to JIT compilation or GC.

The relatively few results we have with GC and JIT compilation events, and the lack of a clear message from them, means that we feel unable to validate or invalidate Hypothesis H3. Whilst some non-warmups are plausibly explained by GC or JIT compilation events, many are not, at least on HotSpot and PyPy. When there is no clear correlation, we have very little idea of a likely cause of the unexpected behaviour.

### What now for virtual machine benchmarking?

The results above undermine the VM benchmarking orthodoxy of benchmarks quickly and consistently reaching a steady state after a fixed number of iterations.

We believe that, in all practical cases, this means that one must use an automated approach to analysing each process execution individually. The open-source changepoint analysis approach presented in this paper is one such option.

• Benchmark results should present both warm-up times and steady-state performance. “There are cases in our results where, for a given benchmark, two or more VMs have steady state performance within 2x of each other, but warmup differs by 100-1000x.”
• In-process iterations should be run for around 0.5s, with a minimum acceptable time of 0.1s.
• It is hard to know exactly how many in-process iterations to run, but around 500 can be used most of the time, while occasionally using larger numbers (e.g. 1,500) to see if longer-term stability has been affected.
• Good results should be obtained with 10 process executions, occasionally running higher numbers to identify infrequent performance issues.

Statistically rigorous Java performance evaluation Georges et al., OOPSLA’07

This paper won the 10-year most influential paper award at OOPSLA this year. Many of the papers we look at on this blog include some kind of performance evaluation. As Georges et al. show, without good experimental design and statistical rigour it can be hard to draw any firm conclusions, and worse, you may reach misleading or incorrect conclusions! The paper is set in the context of Java performance evaluation, but the lessons apply much more broadly.

Benchmarking is at the heart of experimental computer science research and development… As such, it is absolutely crucial to have a rigorous benchmarking methodology. A non-rigorous methodology may skew the overall picture, and may even lead to incorrect conclusions. And this may drive research and development in a non-productive direction, or may lead to a non-optimal product brought to market.

A good benchmark needs a well chosen and well motivated experimental design. In addition, it needs a sound performance evaluation methodology.

… a performance evaluation methodology needs to adequately deal with the non-determinism in the experimental setup… Prevalent data analysis approaches for dealing with non-determinism are not statistically rigorous enough.

In the context of Java, sources of non-determinism include JIT compilation, thread scheduling, and garbage collection, for example. For many benchmarks run today on cloud platforms, non-determinism in the underlying cloud platform can also be a significant factor.

### Problems with performance evaluations

Common at the time of publication (it would be interesting to do a similar assessment of more recent papers) was a method whereby a number of performance runs – e.g. 30 – would be done, and the best performance number (smallest execution time) reported. This was in accordance with SPECjvm98 reporting rules for example. Here’s an example of doing this for five different garbage collectors.

CopyMS and GenCopy seem to perform about the same, and SemiSpace clearly outperforms GenCopy.

Here are the same experiment results, but reported using a statistically rigorous method reporting 95% confidence intervals.

Now we see that GenCopy significantly outperforms CopyMS, and that SemiSpace and GenCopy have overlapping confidence intervals – the difference between them could be simply due to random performance variations in the system under measurement.

After surveying 50 Java performance papers the authors conclude that there is little consensus on what methodology to use. Table 1 below summarises some of the methodologies used in those papers:

Suppose you use a non-rigorous methodology and report a single number such as best, average, worst etc., out of a set of runs. In a pairwise comparison, you might say there was a meaningful performance difference if the delta between the two systems was greater than some threshold $\theta$. Alternatively, using a statistically rigorous methodology and reporting confidence intervals, it may be that you see:

• overlapping intervals
• non-overlapping intervals, in the same order as the non-rigorous methodology
• non-overlapping intervals, in a different order to the non-rigorous methodology

This leads to six different cases – in only one of which can you truly rely on the results from the non-rigorous approach:

The authors run a series of tests and find that all prevalent methods can be misleading in a substantial fraction of comparisons between alternatives – up to 16%. Incorrect conclusions even occur in up to 3% of comparisons. (And if you really must report a single number, mean and median do better than best, second best, or worst).

There are many more examples in section 6 of the paper.

### Statistics to the rescue

We advocate adding statistical rigor to performance evaluation studies of managed runtime systems, and in particular Java systems. The motivation for statistically rigorous data analysis is that statistics, and in particular confidence intervals, enable one to determine whether differences observed in measurements are due to random fluctuations in the measurements or due to actual differences in the alternatives compared against each other.

Section 3 in the paper is my favourite part, and it essentially consists of a mini-stats tutorial for people doing benchmarks.

If we could get an exact repeatable number out of every performance run, life would be much more straightforward. Unfortunately we can’t do that due to non-determinism (‘random errors’) in the process. So we need to control for perturbing events unrelated to what the experiment is trying to measure. As a first step, the authors recommend discarding extreme outliers. With that done, we want to compute a confidence interval.

In each experiment, a number of samples is taken from an underlying population. A confidence interval for the mean derived from these samples then quantifies the range of values that have a given probability of including the actual population mean.

A confidence interval $[c_1, c_2]$ is defined such that the probability of $\mu$ being between $c_1$ and $c_2$ equals $1 - \alpha$, where $\alpha$ is the significance level and $(1 - \alpha)$ is the confidence level.

A 90% confidence interval means that there is a 90% probability that the actual distribution mean of the underlying population is within the confidence interval. For the same data, if we want to be more confident that the true mean lies within the interval, say a 95% confidence interval, then it follows that we would need to make the interval wider.

Ideally we would take at least 30 samples such that we can build upon the central limit theorem. With a target significance level $\alpha$ chosen in advance, we can then determine $c_1$ and $c_2$ so that the probability of the true mean being in the interval equals $1 - \alpha$. It looks like this:

$\displaystyle c_1 = \bar{x} - z_{1 - \alpha/2} \frac{s}{\sqrt{n}} \qquad c_2 = \bar{x} + z_{1 - \alpha/2} \frac{s}{\sqrt{n}}$

Where $s$ is the standard deviation of the sample, $\bar{x}$ the mean, and $z_{1 - \alpha / 2}$ is obtained from a pre-computed table.

A basic assumption made in the above derivation is that the sample variance $s^2$ provides a good estimate of the actual variance $\sigma^2$… This is generally the case for experiments with a large number of samples, e.g., $n \geq 30$. However, for a relatively small number of samples, which is typically assumed to mean $n < 30$, the sample variance $s^2$ can be significantly different from the actual variance $\sigma^2$.

In this case, we can use Student’s t-distribution instead and compute the interval as:

$\displaystyle c_1 = \bar{x} - t_{1 - \alpha/2;n-1} \frac{s}{\sqrt{n}} \qquad c_2 = \bar{x} + t_{1 - \alpha/2;n-1} \frac{s}{\sqrt{n}}$

The value $t_{1 - \alpha / 2;n-1}$ is typically obtained from a pre-computed table. As the number of measurements $n$ increases, the Student t-distribution approaches the Gaussian distribution.
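For the common $n \geq 30$ case the interval is easy to compute directly; a sketch (95% level, with the quantile $z_{0.975} \approx 1.96$ hard-coded):

```javascript
// Confidence interval for the mean: x̄ ± z(1 − α/2) · s / √n.
function confidenceInterval(samples, z = 1.96 /* 95% confidence */) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  // Sample variance s² (n − 1 in the denominator).
  const s2 = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const half = z * Math.sqrt(s2 / n);
  return [mean - half, mean + half];
}

// 30 synthetic measurements cycling through 10..14 (mean 12):
const xs = Array.from({ length: 30 }, (_, i) => 10 + (i % 5));
const [c1, c2] = confidenceInterval(xs);
console.log(c1.toFixed(2), c2.toFixed(2)); // 11.49 12.51
```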

#### Comparing alternatives

Thus far we know how to compute confidence intervals for the mean of a single system. If we compare two systems and their confidence intervals overlap, then we cannot conclude that the differences seen in the mean values are not due to random fluctuations in the measurements. If the confidence intervals do not overlap, we conclude that there is a statistically significant difference at the chosen confidence level.

The paper shows the formula for computing a confidence interval for the difference of two means (see section 3.3). If this interval includes zero, we can conclude, at the confidence level chosen, that there is no statistically significant difference between the two alternatives.
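A sketch of that check (the exact formula is in §3.3 of the paper; the large-sample z form below is my assumption):

```javascript
function meanVar(xs) {
  const n = xs.length;
  const m = xs.reduce((a, b) => a + b, 0) / n;
  const v = xs.reduce((a, b) => a + (b - m) ** 2, 0) / (n - 1);
  return { m, v, n };
}

// CI for x̄1 − x̄2: (x̄1 − x̄2) ± z · √(s1²/n1 + s2²/n2).
function diffInterval(xs, ys, z = 1.96) {
  const a = meanVar(xs), b = meanVar(ys);
  const half = z * Math.sqrt(a.v / a.n + b.v / b.n);
  return [a.m - b.m - half, a.m - b.m + half];
}

// Two alternatives whose means differ by only 0.1s:
const A = Array.from({ length: 30 }, (_, i) => 10 + (i % 5));
const B = Array.from({ length: 30 }, (_, i) => 10.1 + (i % 5));
const [lo, hi] = diffInterval(A, B);
console.log(lo < 0 && 0 < hi); // true: no statistically significant difference
```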

If we want to compare more than two alternatives, then we can use a technique called Analysis of Variance (ANOVA).

ANOVA separates the total variation in a set of measurements into a component due to random fluctuations in the measurements and a component due to the actual differences among the alternatives… If the variation between the alternatives is larger than the variation within each alternative, then it can be concluded that there is a statistically significant difference between the alternatives.

The ANOVA test doesn’t tell us which of the alternatives the statistically significant difference is between, if there is one! The Tukey HSD (Honestly Significant Difference) test can be used for this.

With ANOVA we can vary one input variable within an experiment. Multi-factor ANOVA enables you to study the effect of multiple input variables and all their interactions. Multivariate ANOVA (MANOVA) enables you to draw conclusions across multiple benchmarks.

### Recommendations

Using the more complex analyses, such as multi-factor ANOVA and MANOVA, raises two concerns. First, their output is often non-intuitive and in many cases hard to understand without deep background in statistics. Second, as mentioned before, doing all the measurements required as input to the analyses can be very time-consuming up to the point where it becomes intractable.

Section 4 of the paper therefore introduces a practical yet still statistically rigorous set of recommendations for Java performance evaluation.

To measure start-up performance:

• Measure the execution time of multiple VM invocations, each running a single benchmark iteration.
• Discard the first VM invocation and retain only the subsequent measurements. This ensures libraries are loaded when doing the measurements.
• Compute the confidence interval for a given confidence level. If there are more than 30 measurements, use the standard normal $z$-statistic, otherwise use the Student $t$-statistic.

To measure steady-state performance:

• Consider $p$ VM invocations, each invocation running at most $q$ benchmark iterations. Suppose we want to retain $k$ measurements per invocation.
• For each VM invocation $i$ determine the iteration $s_i$ where steady-state performance is reached, i.e., once the coefficient of variation (standard deviation divided by the mean) of the $k$ iterations ($s_i - k$ to $s_i$) falls below a preset threshold, say 0.01 or 0.02.
• For each VM invocation, compute the mean $\bar{x}_i$ of the $k$ benchmark iterations under steady-state: $\bar{x}_i = \frac{1}{k} \sum_{j = s_i-k}^{s_i} x_{ij}$.
• Compute the confidence interval for a given confidence level across the computed means from the different VM invocations. The overall mean is $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} \bar{x}_i$, and the confidence interval is computed over the $\bar{x}_i$ measurements.
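The per-invocation steady-state detection step can be sketched like so (window size k and the threshold are parameters; the sliding-window indexing is my reading of the recipe):

```javascript
// Return the (0-based) iteration index s_i at which steady state is reached:
// the first point where the coefficient of variation (sd / mean) of the last
// k iteration timings falls below the threshold, or -1 if it never does.
function steadyStateIndex(times, k, threshold = 0.02) {
  for (let i = k; i <= times.length; i++) {
    const w = times.slice(i - k, i);
    const m = w.reduce((a, b) => a + b, 0) / k;
    const sd = Math.sqrt(w.reduce((a, b) => a + (b - m) ** 2, 0) / (k - 1));
    if (sd / m < threshold) return i - 1;
  }
  return -1;
}

// Noisy warmup, then stable timings:
const run = [5, 4, 3, 2, 1, 1, 1, 1, 1];
console.log(steadyStateIndex(run, 4)); // 7
```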

We’ll be talking more about the notion of ‘steady-state’ tomorrow – especially with micro-benchmarks.

For more on critically reading evaluation sections in papers in general, see “The truth, the whole truth, and nothing but the truth: a pragmatic guide to assessing empirical evaluations.”

Logging has become an overloaded term. In this paper logging is used in the context of recording information about the execution of a piece of software, for the purposes of aiding troubleshooting. For these kinds of logging statements there always seems to be a trade-off between log verbosity, logging overhead, and the log actually containing enough useful information to help you diagnose a problem that occurs in the wild. As developers of the system, we tend to put logging statements in places where we think they’ll be useful – often as a retrospective action after a problem occurred that couldn’t easily be diagnosed!

So it’s interesting to step back for a moment and consider this: if you were starting to instrument a system from scratch, what would an optimal set of logging statements look like? Where would you place those statements and what criteria would you use to decide? If we assume great filtering and searching tools such that log verbosity is less of an issue, then a strawman would be to log every possible branch of execution. That’s optimal from the perspective of having the information needed in the log in order to diagnose any given problem. But we don’t do that because the performance overhead would be unacceptable. This gives us a clue for reframing the problem: what is the most useful information to log given a maximum logging overhead constraint?

This paper presents Log20, a tool that determines a near optimal placement of log printing statements under the constraint of adding less than a specified amount of performance overhead. Log20 does this in an automated way without any human involvement.

Throughout the paper, the abbreviation LPS is used to refer to a Log Printing Statement.

### Logging the manual way

Here’s an analysis of the number of LPSes at different severity levels in Cassandra, Hadoop, HBase, and ZooKeeper, indicating that developers seem to find plenty of value in recording non-error information in addition to error conditions:

Looking at the revision history of these systems, and following some of the discussion in bug reports, reveals that:

• Logging statements are often added retrospectively – 21,642 revisions have the sole purpose of adding log statements, presumably after finding they were needed to diagnose a problem.
• Balancing information and overhead is hard – at what level should a given piece of information be logged? Battles rage back and forth on this in comment threads (and commits!). 2,105 commits only modify a severity level.
• Setting the right verbosity level is also subjective – whether something is an Error or Info, for example, can depend on the particular workload.

### How much information is that set of log statements providing?

The first and most significant puzzle piece on the journey towards optimal logging is figuring out how much information we’re getting from a particular log statement. Given the placement of a set of logging statements, the authors use an entropy-based model to capture the amount of uncertainty (unpredictability) that remains in figuring out which execution path was taken. We want to place log statements in such a way that entropy is minimised.

Log20 considers execution paths at the block level. That is, an execution path is a sequence of the blocks that the system traversed at runtime. Consider this program:

Here are some possible execution paths for the program, where blocks are identified by the line number on which they start:

Log20 samples the production system to collect path profiles. Let $p(x)$ be the number of occurrences of path $x$ divided by the total number of paths sampled in the system. In other words, $p(x)$ is an estimate of the probability of observing execution path $x$. Using Shannon’s entropy we can measure the overall unpredictability in the system as:
$\displaystyle H(X) = - \sum_{x \in X} p(x) \log_2 p(x)$
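To make that concrete, here's a minimal sketch (hypothetical data and helper names, not Log20's code) that estimates $H(X)$ from a sample of execution paths, each represented as a tuple of block ids:

```python
from collections import Counter
from math import log2

def path_entropy(sampled_paths):
    """Shannon entropy H(X) of sampled execution paths, where p(x) is
    each distinct path's empirical frequency in the sample."""
    total = len(sampled_paths)
    return -sum((c / total) * log2(c / total)
                for c in Counter(sampled_paths).values())

# Four equally likely paths leave 2 bits of uncertainty.
paths = [(2, 3), (2, 5), (2, 7), (2, 9)]
print(path_entropy(paths))  # 2.0
```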

We instrument a subset of the blocks. When execution follows a given path, it produces a log sequence containing log entries for the instrumented blocks only. Given a log sequence and a placement of log statements, it’s possible therefore that multiple execution paths may give rise to the same log sequence. As a trivial example, suppose that in our placement we have just one LPS in the block on line 2 – then any of the paths $P_1$ through $P_9$ will result in the same log sequence.

Let the Possible Paths of a log sequence $L_i$, $PP(L_i) = \{P_i, \ldots, P_j\}$, be the set of paths that would output $L_i$ when executed.

Given a placement of log statements $S$, we can use entropy to model how much information we are getting from those log statements. Consider a particular log output $L$; the entropy $H_L$ is:
$\displaystyle H_{L}(X) = - \sum_{x \in PP(L)} \frac{p(x)}{p(L)} \log_2 \frac{p(x)}{p(L)}$

Where $p(L)$ is the probability of the program taking a path that outputs L. $\frac{p(x)}{p(L)}$ is therefore telling us the probability that we took path $x$ given that we saw $L$, $P(x | L)$.

Now consider all possible log outputs produced by the placement $S$. We can measure the entropy of the placement, $H_S$, as $H_{S}(X) = \sum_{L \in O(S)} p(L)H_L$, where $O(S)$ is the set of all possible log sequences under placement $S$. This reduces to:
$\displaystyle H_{S}(X) = - \sum_{x \in X} p(x) \log_2 \frac{p(x)}{p(L_x)}$
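The same idea in a minimal sketch (my own helper names; each path is a tuple of block ids, and a placement is the set of instrumented blocks): paths that emit the same log sequence $L$ remain indistinguishable, and each contributes $-p(x)\log_2(p(x)/p(L))$ to $H_S$.

```python
from collections import Counter, defaultdict
from math import log2

def placement_entropy(sampled_paths, placement):
    """Remaining entropy H_S when only the blocks in `placement` log."""
    total = len(sampled_paths)
    p = {path: c / total for path, c in Counter(sampled_paths).items()}
    log_seq = lambda path: tuple(b for b in path if b in placement)
    pL = defaultdict(float)  # p(L): probability of each observable sequence
    for path, prob in p.items():
        pL[log_seq(path)] += prob
    # The + 0.0 normalises an IEEE -0.0 result to 0.0.
    return -sum(prob * log2(prob / pL[log_seq(path)])
                for path, prob in p.items()) + 0.0

paths = [(2, 3), (2, 5), (2, 7), (2, 9)]
print(placement_entropy(paths, set()))         # 2.0 — log nothing, learn nothing
print(placement_entropy(paths, {3, 5, 7, 9}))  # 0.0 — every path disambiguated
```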

### What’s the overhead of a log statement?

If we assume a fixed cost each time we emit a log statement, then the cost of a given log statement placement is directly proportional to the number of times we expect it to be executed. We can figure this out from the production sampling, and assign each block a weight $w$ representing that cost.
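For instance (a hypothetical helper, where each sampled path is a tuple of block ids), the weight of a block can be estimated as the expected number of log entries per execution if that block were instrumented:

```python
from collections import Counter

def block_weights(sampled_paths):
    """Weight of each block: the expected number of log entries emitted
    per execution if that block alone were instrumented, estimated as
    its average number of occurrences across the sampled paths."""
    total = len(sampled_paths)
    counts = Counter(b for path in sampled_paths for b in path)
    return {b: c / total for b, c in counts.items()}

# Block 2 runs in every sampled path; each branch block runs in one of four.
print(block_weights([(2, 3), (2, 5), (2, 7), (2, 9)]))
# {2: 1.0, 3: 0.25, 5: 0.25, 7: 0.25, 9: 0.25}
```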

### The Log20 placement algorithm

Given a set of basic blocks, BB, where each block has a weight, $w$, the problem of placement is to find a subset of BB, $S \subset BB$, such that the sum of the weights of all basic blocks in $S$ is under a threshold, $W_T$, and entropy $H_S$ is minimized.

A brute force search is O(2^N), where N is the number of basic blocks, so that’s not going to work! Instead Log20 uses a greedy approximation algorithm. Sort the basic blocks in ascending order of weight (i.e., cheapest to instrument first). Considering them in this order, add them to the current placement if and only if adding the block under consideration both reduces the entropy and causes us to remain under the weight threshold.
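A minimal sketch of that greedy loop, assuming an $H_S$ estimator built from the formulas above (all names here are mine, not Log20's):

```python
from collections import Counter, defaultdict
from math import log2

def placement_entropy(sampled_paths, placement):
    """H_S: entropy remaining when only blocks in `placement` log."""
    total = len(sampled_paths)
    p = {path: c / total for path, c in Counter(sampled_paths).items()}
    seq = lambda path: tuple(b for b in path if b in placement)
    pL = defaultdict(float)
    for path, prob in p.items():
        pL[seq(path)] += prob
    return -sum(prob * log2(prob / pL[seq(path)])
                for path, prob in p.items()) + 0.0

def greedy_placement(blocks, weight, sampled_paths, W_T):
    """Cheapest-first greedy: accept a block only if it both lowers the
    entropy and keeps the total expected logging cost under W_T."""
    placement, total_w = set(), 0.0
    best = placement_entropy(sampled_paths, placement)
    for b in sorted(blocks, key=lambda b: weight[b]):
        if total_w + weight[b] > W_T:
            continue
        h = placement_entropy(sampled_paths, placement | {b})
        if h < best:
            placement, total_w, best = placement | {b}, total_w + weight[b], h
    return placement, best

paths = [(2, 3), (2, 5), (2, 7), (2, 9)]
weights = {2: 1.0, 3: 0.25, 5: 0.25, 7: 0.25, 9: 0.25}
placement, h = greedy_placement([2, 3, 5, 7, 9], weights, paths, W_T=0.75)
print(sorted(placement), h)  # [3, 5, 7] 0.0
```

With a budget of 0.75 expected entries per execution, three of the cheap branch blocks fit, and together they fully disambiguate the four paths; instrumenting block 2 would add cost without reducing entropy.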

One nice consequence of this is that all of the very rarely executed (and hence likely to be buggy) code paths tend to get instrumented.

Considering the example program we looked at earlier, with a weight threshold of 1.00 (on average there should be no more than 1.00 log entries printed by an execution path), then a single LPS should be placed at line 3 giving entropy 2.67. With a budget of 2.00, logging should be placed at lines 3 and 7.

Section 4.3 in the paper details an extension to the scheme I have just described which considers also logging variable values in LPSes. When these disambiguate later execution paths, logging such a value can reduce the number of downstream log statements required.

### Implementation details

Log20 comprises an instrumentation library, a tracing library used for both request tracing and logging, and an LPS placement generator using the algorithm we just examined. The instrumentation library uses Soot for bytecode instrumentation.

The tracing library has low overhead and consists of a scheduler and multiple logging containers (one per thread), each with a 4MB memory buffer. Log entries are of the form timestamp MethodID#BBID, threadID plus any variable values. In the evaluation, each logging invocation takes 43ns on average, compared to 1.5 microseconds for Log4j.

If you’re feeling brave, you can even have Log20 dynamically adjust the placement of log statements at runtime based on continued sampling of traces.

### Evaluation

The following charts show the relationship between overhead and entropy when using Log20 with HDFS, HBase, Hadoop YARN, and ZooKeeper. You can also see where the current manual instrumentation sits.

In HDFS, Log20’s placement can reduce the entropy from 6.41 to 0.91 with fewer than two log entries per request. Intuitively, this means that with two log entries per request, Log20 can reduce the number of possible paths from 2^6.41 (85) to 2^0.91 (2)… Log20 is substantially more efficient in disambiguating code paths compared to the existing placements.

For the same information level as the existing Info logs, Log20 needs only 0.08 entries per request, versus 1.58 in the current system. If we consider Debug, Log20 needs just 0.32 log entries per request to achieve the same entropy as HDFS’s current 2434.92 log entries per request!

The real-world usefulness of Log20 for debugging was evaluated by looking at 41 randomly selected user-reported HDFS failures. Log20 was configured to hit the same performance threshold as the existing logs. Log20’s output is helpful in debugging 28 out of the 41. The existing log instrumentation helps in debugging 27.

After all that work, it’s a little disappointing that for the same performance cost, Log20 doesn’t do much better overall. However, when we zoom into cold or rarely executed paths (see (B) above), Log20 does indeed give much better coverage.

Guided by information theory, [Log20] measures how effective each logging statement is in disambiguating code paths. We have shown that the placement strategy inferred by Log20 is significantly more efficient in path disambiguation than the placement of log printing statements in existing programs.

My VM is lighter (and safer) than your container Manco et al., SOSP’17

Can we have the improved isolation of VMs, with the efficiency of containers? In today’s paper choice the authors investigate the boundaries of Xen-based VM performance. They find and eliminate bottlenecks when launching large numbers of lightweight VMs (both unikernels and minimal Linux VMs). The resulting system is called LightVM and with a minimal unikernel image, it’s possible to boot a VM in 4ms. For comparison, fork/exec on Linux takes approximately 1ms. On the same system, Docker containers start in about 150ms.

These results are obtained when the LightVM guest is a unikernel. You’re probably only going to create a unikernel in specialised cases. (One interesting such case in the paper is a Micropython-based unikernel that can be used to support serverless function execution.) The authors also created an automated build system called Tinyx for creating minimalistic Linux VM images targeted at running a single application. If we look at boot times for a Tinyx VM as compared to Docker, the performance is very close up to about 250 VMs/containers per core.

Beyond that point, Docker starts to edge it, since even the idle minimal Linux distribution created by TinyX does run some occasional background tasks.

### How does Xen scale with number of VMs, and where are the bottlenecks?

As the following chart shows, the biggest single factor limiting the scalability and performance of virtualisation is the size of the guest VMs. To produce that chart, a unikernel VM was booted from ramdisk, with varying sizes of binary objects injected into the uncompressed image file. So all the effects are due to image size.

So if we want fast booting, we know that image size is going to matter. We’ve looked at unikernels on The Morning Paper before, and they give you the smallest possible guest image. In this paper, the authors use Mini-OS to create a variety of unikernels, including the ‘daytime’ unikernel implementing a TCP service that returns the current time. This is 480KB uncompressed, and runs in 3.6MB of RAM. This unikernel is used to test the lower bound of memory consumption for VMs.

Making your own unikernel image based on Mini-OS is probably more work than many people are prepared to do though, so the authors also created Tinyx.

Tinyx is an automated build system that creates minimalistic Linux VM images targeted at running a single application. The tool builds, in essence, a VM consisting of a minimalistic, Linux-based distribution along with an optimized Linux kernel. It provides a middle point between a highly specialized unikernel, which has the best performance but requires porting of applications to a minimalistic OS, and a full-fledged general-purpose OS VM that supports a large number of applications out of the box but incurs performance overheads.

Tinyx creates kernel images that are half the size of typical Debian kernels, and have significantly smaller runtime memory usage (1.6MB for Tinyx vs 8MB for Debian).

Using the small VM images thus obtained, we can probe the behaviour of Xen itself when launching lots of VMs. When launching 1000 guests, here are the boot and create times for Debian (minimal install), Tinyx, and MiniOS (unikernel), and for comparison on the same hardware: Docker containers and simple process creation.

As we keep creating VMs, the creation time increases noticeably (note the logarithmic scale): it takes 42s, 10s and 700ms to create the thousandth Debian, Tinyx, and unikernel guest, respectively.

As the size of the VM decreases, the creation time is responsible for ever larger portions of the overall time taken to get to availability. To understand where all the time was going, the team instrumented Xen to reveal this picture:

XenStore interaction and device creation dominate. Of these, the device creation overhead is fairly constant, but the XenStore overhead grows superlinearly.

### The design of LightVM

Our target is to achieve VM boot times comparable to process startup times. Xen has not been engineered for this objective, as the results in the previous section show, and the root of these problems is deeper than just inefficient code. For instance, one fundamental problem with the XenStore is its centralized, filesystem-like API which is simply too slow for use during VM creation and boot, requiring tens of interrupts and privilege domain crossings.

I bet it was hard to conceive of anyone launching 1000 guest VMs when that design was first created!

LightVM redesigns the Xen control plane with a lean driver called noxs (for ‘no XenStore’) that replaces the XenStore and allows direct communication between front-end and back-end drivers via shared memory.

LightVM also keeps on hand a pool of pre-prepared VM shells, through which all the processing common to all VMs is done in the background. When a VM creation command is issued, a suitable shell fitting the VM requirements is taken from the pool and only the final initialisation steps such as loading the kernel image into memory and finalising device initialisation need to be done.

Device creation in standard Xen ends up calling bash scripts, which is a slow process. LightVM replaces this with a binary daemon that executes a pre-defined setup with no forking or bash scripts.

### Performance

We saw the boot times for LightVM with a variety of images at the start of this post. Furthermore, LightVM can save a VM in around 30ms, and restore it in 20ms. Standard Xen needs 128ms and 550ms respectively.

Unikernel memory usage is fairly close to Docker containers. Tinyx needs more, but only 22GB more across 1000 guests. That’s a small fraction of the RAM of current servers.

CPU usage for VMs can also be on a par with containers, so long as the VMs are trimmed to include only the necessary functionality:

### Use cases

The authors present four different use cases where LightVM + lightweight VMs can shine.

In all the following scenarios, using containers would help performance but weaken isolation, while using full-blown VMs would provide the same isolation as lightweight VMs, but with poorer performance.

1. Personal firewalls per mobile user, running in mobile gateways at or near cellular base stations (mobile edge computing – MEC). Here a ClickOS unikernel image is used, and 8000 firewalls can be run on a 64-core AMD machine with 10ms boot times. A single machine running LightVM at the edge in this way can run personalized firewalls for all users in a cell without becoming a bottleneck.
2. Just-in-time service instantiation in mobile edge computing (similar to JITSU).
3. High-density TLS termination at CDNs, which requires the long-term secret key of the content provider. Hence strong isolation between different content providers’ proxies is desirable.
4. Creation of a lightweight compute service such as AWS Lambda. For this use case they use a Micropython-based unikernel to run computations written in Python. It takes about 1.3ms to boot and start executing a function. When the system is deliberately stressed with more requests arriving than the test machine can cope with, service time goes up fairly linearly until about 800 VMs.

The use cases we presented show that there is a real need for lightweight virtualization, and that it is possible to simultaneously achieve both good isolation and performance on par with or better than containers.

A recent post on the Google Cloud Platform blog, ‘Demystifying container vs VM-based security: security in plaintext’ provides an interesting perspective on container security and isolation, from a company that have been running a container-based infrastructure for a very long time.

DeepXplore: automated whitebox testing of deep learning systems Pei et al., SOSP’17

The state space of deep learning systems is vast. As we’ve seen with adversarial examples, that creates opportunity to deliberately craft inputs that fool a trained network. Forget adversarial examples for a moment though, what about the opportunity for good old-fashioned bugs to hide within that space? Experience with distributed systems tells us that there are likely to be plenty! And that raises an interesting question: how do you test a DNN? And by test here, I mean the kind of testing designed to deliberately find and expose bugs in corner cases etc., not the more common (in ML circles) definition of a test set that mostly focuses on demonstrating that the mainline cases work well.

How do you even know when you have done enough testing? Code coverage is clearly not a useful metric here: even one or a very small number of inputs to a DNN might achieve 100% code coverage. Instead the authors define a neuron coverage metric that looks at the percentage of neurons which are activated during execution of a test suite.
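As a rough sketch of the metric (with a made-up activation-record representation; DeepXplore instruments real Keras/TensorFlow models), neuron coverage can be computed like this:

```python
def neuron_coverage(activations_per_input, t=0.0):
    """Fraction of all neurons whose activation exceeds threshold t for
    at least one input in the test suite.

    activations_per_input: one dict per test input, mapping a
    (layer, neuron) id to that neuron's activation value.
    """
    seen, covered = set(), set()
    for acts in activations_per_input:
        for neuron, value in acts.items():
            seen.add(neuron)
            if value > t:
                covered.add(neuron)
    return len(covered) / len(seen)

suite = [
    {("h1", 0): 0.9, ("h1", 1): -0.2, ("out", 0): 0.3},
    {("h1", 0): 0.1, ("h1", 1): 0.4, ("out", 0): -0.5},
]
print(neuron_coverage(suite))         # 1.0 — every neuron fires for some input
print(neuron_coverage(suite, t=0.5))  # 0.3333... — only ("h1", 0) exceeds 0.5
```

Note how a single input can already light up a large fraction of neurons, which is exactly why the threshold and the suite-wide union matter.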

Even more fundamentally perhaps, how do you know what the correct output should be for a given test case? Using manually labelled real-world test data it’s very hard to get good coverage. Instead, DeepXplore uses differential testing: the notion of comparing multiple different systems and seeing where their outputs differ. Although not mentioned in the paper, this is very close to the idea of an ensemble, but here we’re deliberately looking for cases where the ensemble is not unanimous. When we use an ensemble, we’re essentially saying “I know that no one model is likely to be uniformly good everywhere, so let me use a collection of models and majority voting to improve accuracy overall.” Differences between the models on some inputs are treated as normal and expected. DeepXplore looks at things differently: if n other models all make one prediction for a given input, and one model makes a different prediction, then that model has a bug.

And what happens when you explicitly test state-of-the-art DL models in this way?

DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop.

That’s worth repeating: give me your fully-trained state-of-the-art deep learning system, and in one second on my laptop I can find a way to break it!!

The following test case found by DeepXplore nicely highlights why this is a problem:

Unfortunately, DL systems, despite their impressive capabilities, often demonstrate unexpected or incorrect behaviors in corner cases for several reasons such as biased training data, overfitting, and underfitting of the models… Therefore, safety- and security-critical DL systems, just like traditional software, must be tested systematically for different corner cases to detect and fix ideally any potential flaws or undesired behaviors.

### How DeepXplore works

DeepXplore is based on differential testing, so it requires at least two DNNs with the same functionality. Depending on what you’re trying to do, that may be a limiting factor. “However, our evaluation shows that in most cases multiple different DNNs, for a given problem, are easily available as developers often define and train their own DNNs for customisation and improved accuracy.” Given two or more models, DeepXplore begins to explore the model space starting from a seed set of test inputs:

DeepXplore takes unlabeled test inputs as seeds and generates new tests that cover a large number of neurons (i.e., activates them to a value above a customizable threshold) while causing the tested DNNs to behave differently.

While generating new tests, DeepXplore tries to maximise both neuron coverage and the discovery of tests that cause the DNNs to behave differently. Both goals are necessary for thorough testing that exposes erroneous corner cases. It’s also possible to give DeepXplore constraints that ensure generated test cases stay within given bounds (for example, image pixel values must be between 0 and 255).

Consider a toy example of two DNNs designed to classify images as either cars or faces. We might start out in a situation like this where both networks classify an image as car with high probability:

DeepXplore then tries to maximise the chances of finding differential behaviour by modifying the input such that one network continues to classify the input as a car, while the other one thinks it is a face.

In essence, it is probing for parts of the input space that lie between the decision boundaries of the DNNs:

Remember that the network has already been trained at this point, so the weights are fixed. DeepXplore solves a joint optimisation problem for neuron coverage and differential behaviour maximisation using gradient ascent:

First, we compute the gradient of the outputs of the neurons in both the output and hidden layers with the input value as a variable and the weight parameter as a constant. Such gradients can be computed efficiently for most DNNs… Next, we iteratively perform gradient ascent to modify the test input toward maximising the objective function of the joint optimization problem…

The joint objective function looks like this:

• In the first term, we try to move the input in a direction such that one model makes a different prediction while the others maintain their current prediction. The $\lambda_1$ hyperparameter balances the relative importance of these two factors.
• In the second term, we try to maximise the activation of an inactive neuron to push it above a threshold $t$.
• The $\lambda_2$ hyperparameter balances the relative importance of these two objectives.

Let $s$ be the step size in the gradient ascent algorithm. Then after the _i_th iteration we have $\mathbf{x}_{i+1} = \mathbf{x}_i + s \circ \mathbf{grad}$. The gradient $\mathbf{grad}$ will be modified if necessary to ensure that the new input still satisfies any specified domain constraints.
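As a tiny illustration (my own helper; DeepXplore computes the gradient by backpropagation through the fixed-weight network), the constrained update for pixel inputs might look like:

```python
def constrained_ascent_step(x, grad, s, lo=0.0, hi=255.0):
    """One gradient-ascent iteration x_{i+1} = x_i + s * grad, applied
    elementwise, with the result clamped back into the valid pixel
    range [lo, hi] as a simple stand-in for domain constraints."""
    return [min(hi, max(lo, xi + s * gi)) for xi, gi in zip(x, grad)]

print(constrained_ascent_step([0.0, 120.0, 250.0], [-1.0, 2.0, 3.0], s=10.0))
# [0.0, 140.0, 255.0] — first and last components clamped to the pixel range
```

Richer constraints (such as only brightening the image, or only touching a small rectangle) can be imposed the same way, by projecting the perturbed input back into the allowed set after each step.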

The full algorithm looks like this:

The full implementation is 7,086 lines of Python code using Keras and TensorFlow.

### Testing DNNs with DeepXplore

DeepXplore is tested using three DNNs for each of five different datasets:

For the image based problem domains three different constraints are explored:

• The first constraint simulates the effect of different lighting conditions: DeepXplore can make the image darker or brighter, but cannot change the content. In the figure below, the top row shows original seed inputs, and the bottom row shows difference inducing test inputs found by DeepXplore. The arrows highlight which way the self-driving car decides to turn.

• The second constraint simulates a camera lens that is accidentally or deliberately occluded by a single small rectangle.

• The third constraint simulates the effect of multiple specks of dirt on the lens by permitting occlusion using multiple tiny black rectangles.

DeepXplore found thousands of erroneous behaviours in all the tested DNNs. Table 2 (below) summarizes the numbers of erroneous behaviours found by DeepXplore for each tested DNN while using 2,000 randomly selected seed inputs from the corresponding test sets.

The following table highlights the importance of including neuron coverage in the objective function as a way of increasing input diversity:

On average, DeepXplore achieves 34.4% more neuron coverage than random testing, and 33.2% more than adversarial testing.

For most models, DeepXplore can find the first difference-inducing inputs in seconds:

If we think of the set of DNNs like an ensemble and use majority voting, then we have an automated labelling system for the generated test cases. Using these as new labeled training examples improves accuracy by 1–3%.

Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing Matejka & Fitzmaurice, CHI’17

Today’s paper choice is inspired by the keynote that Prof. Miriah Meyer gave at the recent Velocity conference in London, ‘Why an interactive picture is worth a thousand numbers.’ She made a wonderful and thought-provoking case for the power of visualisations, and especially visualisations you can interact with, playing a central role in our process of understanding. Better ways of seeing and exploring data lead to better insights. Meyer opened her talk by showing us Cairo’s Datasaurus (‘Never trust summary statistics alone; always visualize your data’).

You can calculate all the statistic summaries you like over the datasaurus dataset, run regressions, perform clustering, and so on. But until you look at it with the right visualisation, you’re never going to reach the same understanding as you get from looking at the data this way:

Since the early ’70s, Anscombe’s Quartet has been frequently used to illustrate the importance of graphical representations when exploring data:

The effectiveness of Anscombe’s Quartet is not due to simply having four different data sets which generate the same statistical properties, it is that four clearly different and identifiably distinct datasets are producing the same statistical properties.

In ‘Same Stats, Different Graphs,’ Matejka & Fitzmaurice show a method for purposefully creating datasets which are identical over a range of statistical properties (of your choosing), yet produce dissimilar graphics. In my mind there’s a connection here to the idea of adversarial inputs to deep neural nets, which we might similarly express on some level as ‘Same Stats, Different Classes.’ Another thing I get from this paper is a very visual reminder of ‘Same Outcome (in terms of stats), Different Causes.’ There are lots of different hypotheses you could come up with that may produce the effect you’re seeing.

Their method doesn’t just produce datasets that retain the same statistical properties while looking different though, it also allows you to guide the way in which the visualisation looks different (a bit like crafting an adversarial input to produce a specific mis-classification). For example, in the figure below we have an initial data set (top left) and a set of target shapes for data set generation.

Here are the results produced by the technique when using these target shapes – every one of these has the same summary statistics as the initial dataset!

These examples are all in 2D, but there’s nothing about the technique that limits it to two dimensions. Here are some 1-D examples:

### How it works

The key insight behind our approach is that while generating a dataset from scratch to have particular statistical properties is relatively difficult, it is relatively easy to take an existing dataset, modify it slightly, and maintain (nearly) the same statistical properties. With repetition, this process creates a dataset with a different visual appearance from the original, while maintaining the same statistical properties. Further, if the modifications to the dataset are biased to move the points towards a particular goal, the resulting graph can be directed towards a particular visual appearance.

In pseudo-code, it looks like this:

Where:

• Initial_ds is the seed dataset defining the statistical properties which should be maintained
• Perturb modifies the current version of the dataset by moving one or more points by a small amount in a random direction. The small amount is chosen from a normal distribution and calibrated so that 95% or more of movements should result in the overall statistical properties remaining unchanged (to two decimal places). The temp parameter is a temperature used for simulated annealing.
• The Fit function checks if perturbing the points has improved the overall fitness, and accepts it if so. When coercing the dataset into a target shape, it uses the average distance of all points to the nearest point on the target shape.
• To avoid getting stuck in a locally-optimal solution, a perturbation may also be accepted if the current temperature is greater than a random number between 0 and 1. Temperature starts out at 0.4, and is gradually reduced to 0.01 using a quadratically-smoothed monotonic cooling schedule.
• The newly returned perturbation is then tested (isErrorOk) to ensure that overall stats have not changed (within 2 decimal places), and becomes the current dataset if so.
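The loop described above can be sketched as follows (all names and the exact cooling curve are my own stand-ins; the paper’s implementation differs in detail):

```python
import random

def same_stats_anneal(initial_ds, perturb, fit, is_error_ok,
                      iters=200_000, t_start=0.4, t_end=0.01):
    """Sketch of the perturb/test/accept loop: `perturb`, `fit` and
    `is_error_ok` play the roles of the paper's Perturb, Fit and
    isErrorOk functions (hypothetical signatures)."""
    current = initial_ds
    for i in range(iters):
        # Cool from 0.4 down to 0.01; a quadratic ease stands in for the
        # paper's quadratically-smoothed monotonic cooling schedule.
        temp = t_end + (t_start - t_end) * (1 - i / iters) ** 2
        candidate = perturb(current, temp)
        # Accept an improvement in fitness, or occasionally a regression
        # (to escape local optima) when the temperature beats a random
        # draw -- but only if the summary statistics are still unchanged.
        if (fit(candidate) < fit(current) or random.random() < temp) \
                and is_error_ok(candidate, initial_ds):
            current = candidate
    return current
```

Here `fit` is a distance to minimise (for shape coercion, the average distance of each point to the nearest point on the target shape), and `is_error_ok` checks that the chosen statistics still match the seed dataset to two decimal places.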

Here’s an example of dataset evolution across 200,000 iterations:

And of course you don’t have to start with a random cloud, you can evolve from any seed dataset. Here are some transformations from our friend the datasaurus:

Here’s a fun example where the target shape is used to produce a dataset exhibiting Simpson’s paradox. Start out with a strongly positively correlated dataset, for example:

Then give the algorithm a target shape that directs the dataset towards a series of negatively sloping lines:

A few iterations later, and we have a dataset with the same overall strongly positive correlation that we started with, but each subset of the data has an individually negative correlation.

I find that a very satisfying way of demonstrating the effect!

### Cloning datasets for anonymity

As discussed by Govindaraju and Haslett another use for datasets with the same statistical properties is the creation of “cloned” datasets to anonymise sensitive data. In this case, it is important that individual data points are changed while the overall structure of the data remains similar. This can be accomplished by performing a Kolmogorov-Smirnov test within the isErrorOk function…

This is clearly similar to the idea of adding noise to a dataset to enhance privacy. I guess I need to read Govindaraju and Haslett’s paper though, as it seems to me at first glance that if all you have maintained are the overall statistical properties you might as well provide those alone. Anything else inferred from the generated data must just be an artificial artefact? It must depend on how far you move from the original dataset…

The code and datasets presented in this work are available from www.autodeskresearch.com/publications/samestats.

If the thought of finding better insights through better visualisations inspires you, you might want to check out Miriah Meyer’s forthcoming book: ‘Making data visual: a practical guide to using visualisation for insight.’