We already knew that AlphaGo could beat the best human players in the world: *AlphaGo Fan* defeated the European champion Fan Hui in October 2015 (‘Mastering the game of Go with deep neural networks and tree search’), and *AlphaGo Lee* used a similar approach to defeat Lee Sedol, winner of 18 international titles, in March 2016. So what’s really surprising here is the simplicity of *AlphaGo Zero* (the subject of this paper). AlphaGo Zero achieves *superhuman performance*, and won **100-0** in a match against the previous best AlphaGo. And it does it without seeing a single human game, or being given any heuristics for gameplay. All AlphaGo Zero is given are the rules of the game, and then it learns by playing matches against itself. The blank slate, *tabula rasa*. And it learns fast!

Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2-hour time controls and match conditions as were used in the man-machine match in Seoul. AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs), while AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0.

That comes out as superior performance in roughly 1/30th of the time, using only 1/12th of the computing power.

It’s kind of a bittersweet feeling to know that we’ve made a breakthrough of this magnitude, and also that no matter what we do, as humans we’re never going to be able to reach the same level of Go mastery as AlphaGo.

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

In the following figure, we can see a timeline of AlphaGo Zero discovering *joseki* (corner sequences) all by itself. The top row (a) shows *joseki* used by human players, and when on the timeline they were discovered. The second row (b) shows the *joseki* AlphaGo Zero favoured at different stages of self-play training.

At 10 hours a weak corner move was preferred. At 47 hours the 3-3 invasion was most frequently played. This *joseki* is also common in human professional play; however, AlphaGo Zero later discovered and preferred a new variation.

Here’s an example self-play game after just three hours of training:

AlphaGo Zero focuses greedily on capturing stones, much like a human beginner. A game played after 19 hours of training exhibits the fundamentals of life-and-death, influence and territory:

Finally, here is a game played after 70 hours of training: “*the game is beautifully balanced, involving multiple battles and a complicated ko fight, eventually resolving into a half-point win for white*.”

On an Elo scale, AlphaGo Fan achieved a rating of 3,144, and AlphaGo Lee achieved 3,739. A fully trained AlphaGo Zero (40 days of training) achieved a rating of 5,185. To put that in context, a 200-point gap in Elo rating corresponds to a 75% probability of winning.
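As a sanity check on that context, the standard logistic Elo formula gives the expected score for a rating gap; a 200-point gap comes out at roughly 0.76, consistent with the 75% figure quoted (exact values depend on the Elo convention used). A minimal sketch:

```python
def elo_win_probability(delta: float) -> float:
    """Expected score for a player rated `delta` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(elo_win_probability(200), 3))          # ≈ 0.76
print(round(elo_win_probability(5185 - 3739), 4))  # Zero vs Lee rating gap
```

The second line plugs in the 5,185 − 3,739 gap between AlphaGo Zero and AlphaGo Lee, which the formula puts at a near-certain expected score.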

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.

There are four key differences between AlphaGo Zero and AlphaGo Fan/Lee, and we’ve already seen two of them:

- It is trained solely by self-play reinforcement learning, starting from random play.
- It uses only the black and white stones from the board as input features.
- It uses a *single neural network* rather than separate policy and value networks.
- It uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts.

To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search *inside* the training loop, resulting in rapid improvement and precise and stable learning.

From a human perspective, therefore, we might say that AlphaGo Zero achieves mastery of the game of Go through many hours of *deliberate practice*.

In AlphaGo Fan there is a *policy network* that takes as input a representation of the board and outputs a probability distribution over legal moves, and a separate *value network* that takes as input a representation of the board and outputs a scalar value predicting the expected outcome of the game if play continued from here.

AlphaGo Zero combines both of these roles into a single deep neural network f_θ that outputs both move probabilities and an outcome prediction value: (p, v) = f_θ(s). The input to the network, s_t, consists of 17 binary feature planes: s_t = [X_t, Y_t, X_{t−1}, Y_{t−1}, …, X_{t−7}, Y_{t−7}, C]. Each X_t and Y_t input is a 19×19 matrix, where X_t^i = 1 if intersection i contains a stone of the player’s own colour at time step t (and Y_t marks the opponent’s stones in the same way). The final input feature, C, is 1 if it is black’s turn to play, and 0 for white.

History features are necessary because Go is not fully observable solely from the current stones, as repetitions are forbidden; similarly, the colour feature C is necessary because the komi is not observable.

The input is connected to a ‘tower’ comprising one convolutional block followed by nineteen residual blocks. On top of the tower are two ‘heads’: a value head and a policy head. End to end it looks like this:

In each position s, a Monte-Carlo Tree Search (MCTS) is executed, which outputs search probabilities π for each move. These probabilities usually select much stronger moves than the raw move probabilities of the policy head on its own.

MCTS may be viewed as a powerful *policy improvement* operator. Self-play with search — using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value — may be viewed as a powerful *policy evaluation* operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network’s parameters are updated to make the move probabilities and value (p, v) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger.
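Putting those two operators together, the training loop has a simple shape. Here’s a runnable toy sketch of that policy-iteration structure on a degenerate one-move game — the `search` stub, the toy game, and the table-based ‘network’ are all illustrative stand-ins, and fitting the value prediction to the winner z is omitted for brevity:

```python
import random

MOVES = ["a", "b"]
WINNING_MOVE = "a"   # degenerate one-move toy game: "a" wins, "b" loses

def search(policy):
    """Stand-in for MCTS: look ahead and sharpen the raw policy."""
    return {m: (0.9 if m == WINNING_MOVE else 0.1) for m in MOVES}

def self_play_iteration(policy, num_games=100, lr=0.2):
    for _ in range(num_games):
        pi = search(policy)                        # improved search policy (pi)
        move = random.choices(MOVES, [pi[m] for m in MOVES])[0]
        z = 1.0 if move == WINNING_MOVE else -1.0  # game winner (z) — would also
                                                   # train the value head (omitted)
        for m in MOVES:                            # fit move probabilities to pi
            policy[m] += lr * (pi[m] - policy[m])
    total = sum(policy.values())
    return {m: p / total for m, p in policy.items()}

policy = {"a": 0.5, "b": 0.5}
for _ in range(3):                                 # repeated policy iteration
    policy = self_play_iteration(policy)
print(policy["a"])   # close to 0.9: search-guided self-play has sharpened the policy
```

The point of the toy is only the structure: search improves the policy, self-play generates outcomes, and the ‘network’ is pulled toward the search probabilities before the next round of self-play.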

The network is initialised to random weights, and then in each subsequent iteration self-play games are generated. At each time step an MCTS search is executed using the previous iteration of the network, and a move is played by sampling the resulting search probabilities π. A game terminates when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length. The game is then scored to give a final reward of −1 (loss) or +1 (win).

The neural network is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and cross-entropy losses, plus an L2 regularisation term: l = (z − v)² − πᵀ log p + c‖θ‖².
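Concretely, the combined objective can be sketched as follows — a minimal stand-alone version with illustrative numbers, where `c` is the regularisation constant (the real loss is of course computed over mini-batches of positions):

```python
import numpy as np

# Minimal sketch of the combined loss: mean-squared error between predicted
# value v and game winner z, cross-entropy between the network's move
# probabilities p and the search probabilities pi, plus L2 regularisation.
# The constant c and the example numbers below are illustrative.

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p))   # cross-entropy term
    l2 = c * np.sum(theta ** 2)             # regularisation term
    return value_loss + policy_loss + l2

pi = np.array([0.7, 0.2, 0.1])       # MCTS search probabilities
p  = np.array([0.5, 0.3, 0.2])       # raw network move probabilities
theta = np.zeros(10)                 # network parameters (stubbed out)
print(alphago_zero_loss(z=1.0, v=0.6, pi=pi, p=p, theta=theta))  # ≈ 1.047
```

Note how the cross-entropy term pulls p toward the (stronger) search probabilities π, while the squared-error term pulls v toward the actual game outcome z.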

I’ll leave you to reflect on this closing paragraph, thousands of years of collective human effort surpassed from scratch in just a few days:

Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs, and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.

Yesterday we looked at the information theory of deep learning, today in part II we’ll be diving into experiments using that information theory to try and understand what is going on inside of DNNs. The experiments are done on a network with 7 fully connected hidden layers, and widths 12-10-7-5-4-3-2 neurons.

The network is trained using SGD and the cross-entropy loss function, but no other explicit regularization. The activation functions are hyperbolic tangent in all layers but the final one, where a sigmoid function is used. The task is to classify 4096 distinct patterns of the input variable X into one of two output classes. The problem is constructed such that I(X;Y) ≈ 0.99 bits.

Output activations are discretised into 30 bins, and the Markov chain X → T_i → Y is used to calculate the joint distributions for every hidden layer T_i. Using the joint distributions, it is then possible to calculate the encoder and decoder mutual information, I(X;T_i) and I(T_i;Y), for each hidden layer in the network. The calculations were repeated 50 times over, with different random initialisations of the network weights and different random selections of training samples.
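The binned estimate is simple enough to sketch directly. Here is an illustrative version: discretise a layer’s activations into 30 bins, histogram the joint occurrences with the input, and compute the mutual information from the joint distribution. The toy ‘layer’ below (a noisy tanh of the input id) is purely made up for demonstration:

```python
import numpy as np

def mutual_information(x_ids, t_binned):
    # Empirical joint distribution of (input id, binned activation).
    joint = np.zeros((x_ids.max() + 1, t_binned.max() + 1))
    for x, t in zip(x_ids, t_binned):
        joint[x, t] += 1
    joint /= joint.sum()
    px, pt = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, pt)[nz]))

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=5000)                     # input pattern ids
activations = np.tanh(x - 1.5 + rng.normal(0, 0.05, size=5000))
t = np.digitize(activations, np.linspace(-1, 1, 30))  # 30 bins, as in the paper
print(round(mutual_information(x, t), 2))             # ≈ 2.0 bits = log2(4) patterns
```

Because the four toy patterns map to well-separated activation bins, the layer retains essentially all 2 bits of information about the input; a noisier or more compressive layer would score lower.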

We can plot the mutual information retained in each layer on a graph. The following chart shows the situation before any training has been done (i.e., random initial weights of each of the 50 generated networks).

The different colours in the chart represent the different hidden layers (and there are multiple points of each colour because we’re looking at 50 different runs all plotted together). On the x-axis is I(X;T), so as we move to the right on the x-axis, the amount of mutual information between the hidden layer and the input increases. On the y-axis is I(T;Y), so as we move up on the y-axis, the amount of mutual information between the hidden layer and the output increases.

I’m used to thinking of progressing through the network layers from left to right, so it took a few moments for it to sink in that it’s the lowest layer which appears in the top-right of this plot (it maintains the most mutual information), and the top-most layer which appears in the bottom-left (it has retained almost no mutual information before any training). So the *information path* being followed goes from the top-right corner to the bottom-left, travelling down the slope.

Part-way through training (here at 400 epochs), we can see that much more information is being retained through the layers.

After 9000 epochs we’re starting to see a pretty flat information path, which means that we’re retaining the mutual information needed to predict Y all the way through the network layers.

Something else happens during training though, and the best demonstration of it is to be found in watching this short video.

What you should hopefully have noticed is that early on the points shoot up and to the right, as the hidden layers learn to retain more mutual information both with the input and also as needed to predict the output. But after a while, a phase shift occurs, and points move more slowly up and to the left.

The following chart shows the average layer trajectories when training with 85% of the data, giving a static representation of the same phenomenon. The green line shows the phase change point, and the yellow points are the final positions.

The two optimization phases are clearly visible in all cases. During the fast — Empirical Error Minimization (ERM) — phase, which takes a few hundred epochs, the layers increase information on the labels (increase I(T;Y)) while preserving the DPI order (lower layers have higher information). In the second and much longer training phase the layers’ information on the input, I(X;T), decreases and the layers lose irrelevant information until convergence (the yellow points). We call this phase the representation compression phase.

You can also see the phase change clearly (the vertical grey line) when looking at the normalized means and standard deviations of the layers’ stochastic gradients during the optimization process:

We claim that these distinct SG phases (grey line in the figure above) correspond to, and explain, the ERM and compression phases we observe in the information plane.

The first, ERM, phase is a *drift* phase. Here the gradient means are much larger than their standard deviations, indicating small gradient stochasticity (high SNR). The increase in I(T;Y) is what we expect to see from cross-entropy loss minimisation.

The existence of the compression phase is more surprising. In this phase the gradient means are very small compared to the batch-to-batch fluctuations, and the gradients behave like Gaussian noise with very small means for each layer (low SNR). This is a *diffusion phase*.

…the diffusion phase mostly adds random noise to the weights, and they evolve like Wiener processes [think Brownian motion], under the training error or label information constraint.

This has the effect of maximising the entropy of the weights distribution under the training error constraint. This in turn minimises the mutual information I(X;T) – in other words, we’re discarding information in X that is irrelevant to the task at hand. The fancy name for this process of entropy maximisation by adding noise is *stochastic relaxation*.

Compression by diffusion is exponential in the number of time steps (optimisation epochs) it takes to achieve a given compression level – which is why you see the points move more slowly during this phase.

One interesting consequence of this phase is the *randomised nature of the final weights of the DNN*:

This indicates that there is a huge number of different networks with essentially optimal performance, and *attempts to interpret single weights or even single neurons in such networks can be meaningless*.

As can be clearly seen in the charts, different layers converge to different points in the information plane, and this is related to the critical slowing down of the stochastic relaxation process near the phase transitions on the Information Bottleneck curve.

Recall from yesterday that the *information curve* is a line of optimal representations separating the achievable and unachievable regions in the information plane. Testing the information values in each hidden layer and plotting them against the information curve shows that the layers do indeed approach this bound.

How exactly the DNN neurons capture the optimal IB representations is another interesting issue to be discussed elsewhere, but there are clearly many different layers that correspond to the same IB representation.

To understand the benefits of more layers the team trained 6 different architectures with 1 to 6 hidden layers and did 50 runs of each as before. The following plots show how the information paths evolved during training for each of the different network depths:

From this we can learn four very interesting things:

- *Adding hidden layers dramatically reduces the number of training epochs for good generalization.* One hidden layer was unable to achieve good values even after 10^4 iterations, but six layers can achieve full relevant information capture after just 400.
- *The compression phase of each layer is shorter when it starts from a previous compressed layer.* For example, the convergence is much slower with 4 layers than with 5 or 6.
- *The compression is faster for the deeper (narrower and closer to the output) layers.* In the diffusion phase the top layers compress first and “pull” the lower layers after them. Adding more layers seems to add intermediate representations which accelerate compression.
- *Even wide layers eventually compress in the diffusion phase. Adding extra width does not help.*

Training sample size seems to have the biggest impact on what happens during the diffusion phase. Here are three charts showing the information paths when training with 5% (left), 45% (middle), and 85% (right) of the data:

We are currently working on new learning algorithms that utilize the claimed IB optimality of the layers. We argue that SGD seems to be overkill during the diffusion phase, which consumes most of the training epochs, and that much simpler optimization algorithms, such as Monte-Carlo relaxations, can be more efficient.

Furthermore, the analytic connections between encoder and decoder distributions can be exploited during training: *combining the IB iterations with stochastic relaxation methods may significantly boost DNN training*.

To conclude, it seems fair to say, based on our experiments and analysis, that Deep Learning with DNN are in essence learning algorithms that effectively find efficient representations that are approximate minimal statistics in the IB sense. If our findings hold for general networks and tasks, the compression of the SGD and the convergence of the layers to the IB bound can explain the phenomenal success of Deep Learning.

In my view, this paper fully justifies all of the excitement surrounding it. We get three things here: (i) a theory we can use to reason about what happens during deep learning, (ii) a study of DNN learning during training based on that theory, which sheds a lot of light on what is happening inside, and (iii) some hints for how the results can be applied to improve the efficiency of deep learning – which might even end up displacing SGD in the later phases of training.

Despite their great success, there is still no comprehensive understanding of the optimization process or the internal organization of DNNs, and they are often criticized for being used as mysterious “black boxes”.

I was worried that the paper would be full of impenetrable math, but with full credit to the authors it’s actually highly readable. It’s worth taking the time to digest what the paper is telling us, so I’m going to split my coverage into two parts, looking at the theory today, and the DNN learning analysis & visualisations tomorrow.

Consider the supervised learning problem whereby we are given inputs X and we want to predict labels Y. Inside the network we are learning some representation T of the input patterns, that we hope enables good predictions. We also want good generalisation, not overfitting.

Think of a whole layer T as a single random variable. We can describe this layer by two distributions: the encoder P(T|X) and the decoder P(Y|T).

So long as these transformations preserve information, we don’t really care which individual neurons within the layers encode which features of the input. We can capture this idea by thinking about the *mutual information* of T with the input X and the desired output Y.

Given two random variables X and Y, their mutual information I(X;Y) is defined based on information theory as

I(X;Y) = H(X) − H(X|Y)

where H(X) is the entropy of X and H(X|Y) is the conditional entropy of X given Y.

The mutual information I(X;Y) quantifies the number of relevant bits that the input X contains about the label Y, on average.
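As a tiny worked example of that definition, assuming an illustrative joint distribution over two binary variables:

```python
import numpy as np

# Worked example of I(X;Y) = H(X) - H(X|Y) for a hand-picked joint
# distribution. The numbers are illustrative only.

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

joint = np.array([[0.4, 0.1],    # P(X=0, Y=0), P(X=0, Y=1)
                  [0.1, 0.4]])   # P(X=1, Y=0), P(X=1, Y=1)
px = joint.sum(axis=1)           # marginal P(X)
py = joint.sum(axis=0)           # marginal P(Y)

h_x = entropy(px)                              # H(X) = 1 bit here
# H(X|Y) = sum over y of P(y) * H(X | Y=y)
h_x_given_y = sum(py[y] * entropy(joint[:, y] / py[y]) for y in range(2))
print(round(h_x - h_x_given_y, 3))             # ≈ 0.278 bits
```

Observing Y shifts our belief about X from 50/50 to 80/20, which is worth about 0.28 of the 1 bit of uncertainty in X.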

If we put a hidden layer T between X and Y then T is mapped to a point in the *information plane* with coordinates (I(X;T), I(T;Y)). The *Data Processing Inequality* (DPI) result tells us that for any 3 variables forming a Markov chain X → Y → Z we have I(X;Y) ≥ I(X;Z).

So far we’ve just been considering a single hidden layer. To make a deep neural network we need lots of layers! We can think of a Markov chain of K layers, where T_k denotes the k-th hidden layer.

In such a network there is a unique information path which satisfies the DPI chains:

H(X) ≥ I(X;T_1) ≥ I(X;T_2) ≥ … ≥ I(X;T_K) ≥ I(X;Ŷ)

and

I(X;Y) ≥ I(T_1;Y) ≥ I(T_2;Y) ≥ … ≥ I(T_K;Y) ≥ I(Ŷ;Y)

Now we bring in another property of mutual information: it is invariant under invertible transformations:

I(X;Y) = I(ψ(X); φ(Y))

for any invertible functions ψ and φ.

And this reveals that the same information paths can be realised in many different ways:

Since layers related by invertible re-parameterization appear at the same point, each information path in the plane corresponds to many different DNNs, with possibly very different architectures.
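This invariance is easy to check numerically: relabelling X by any invertible map (here a permutation of its values) leaves the mutual information unchanged, which is why very different-looking networks can share a single point in the information plane. An illustrative sketch:

```python
import numpy as np

# Numerical check: an invertible relabelling of X (a permutation of its
# values) does not change I(X;Y). The joint distribution is illustrative.

def mutual_information(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2((joint / (px * py))[nz]))

joint = np.array([[0.30, 0.05, 0.05],
                  [0.05, 0.30, 0.05],
                  [0.05, 0.05, 0.10]])
perm = [2, 0, 1]                       # an invertible relabelling of X's values
permuted = joint[perm, :]
print(np.isclose(mutual_information(joint), mutual_information(permuted)))  # True
```

The same check would pass for any invertible transformation of either variable; only information-destroying (non-invertible) maps can move a representation in the plane.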

An *optimal encoder* of the mutual information would create a representation that is a *minimal sufficient statistic* of X with respect to Y. If we have a minimal sufficient statistic then we can *decode* the relevant information with the smallest number of binary questions. (That is, it creates the most compact encoding that still enables us to predict Y as accurately as possible).

The *Information Bottleneck* (IB) tradeoff (Tishby et al., 1999) provides a computational framework for finding approximate minimal sufficient statistics: that is, the optimal tradeoff between the compression of X and the prediction of Y.

The Information Bottleneck tradeoff is formulated by the following optimization problem, carried out independently for the distributions P(T|X), P(T), and P(Y|T), with the Markov chain Y → X → T:

min over P(T|X), P(T), P(Y|T) of: I(X;T) − βI(T;Y)

where the Lagrange multiplier β determines the level of relevant information captured by the representation T.

The solution to this problem defines an *information curve*: a monotonic concave line of optimal representations that separates the achievable and unachievable regions in the information plane.
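The optimization can be made concrete on a tiny discrete example. The sketch below scores one hand-picked stochastic encoder P(T|X) against the IB Lagrangian I(X;T) − βI(T;Y); a real IB solver iterates self-consistent updates of P(T|X), P(T) and P(Y|T), and all distributions here are illustrative:

```python
import numpy as np

# Score a candidate encoder against the IB objective I(X;T) - beta * I(T;Y).
# A full solver would optimise over encoders; here we only evaluate one.

def mi(joint):
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2((joint / (pa * pb))[nz]))

p_xy = np.array([[0.25, 0.00],   # P(X, Y): 4 input values, 2 labels;
                 [0.20, 0.05],   # the label mostly depends on which half
                 [0.05, 0.20],   # of the X values we are in
                 [0.00, 0.25]])
p_t_given_x = np.array([[0.9, 0.1],   # a 2-cluster encoder: T compresses X
                        [0.9, 0.1],   # by grouping its values into halves
                        [0.1, 0.9],
                        [0.1, 0.9]])
px = p_xy.sum(axis=1)
p_xt = px[:, None] * p_t_given_x      # joint P(X, T)
p_ty = p_t_given_x.T @ p_xy           # joint P(T, Y) via the Y - X - T chain
beta = 2.0
print(round(mi(p_xt) - beta * mi(p_ty), 3))   # ≈ -0.109 for this encoder and beta
```

Larger β rewards keeping label information I(T;Y); smaller β rewards compressing away input information I(X;T). Sweeping β traces out the information curve described next.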

Section 2.4 contains a discussion on the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:

The learning complexity is related to the number of relevant bits required from the input patterns X for a good enough prediction of the output label Y, or the minimal I(X;T) under a constraint on I(T;Y) given by the IB.

Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy H(X), independent of the actual function we’re trying to learn, and nothing in the structure of the points gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!

With all this theory under our belts, we can go on to study the *information paths* of DNNs in the *information plane*. This is possible when we know the underlying joint distribution P(X,Y), and the encoder and decoder distributions P(T|X) and P(Y|T) can be calculated directly.

Our two order parameters, I(T;X) and I(T;Y), allow us to visualize and compare different network architectures in terms of their efficiency in preserving the relevant information in I(X;Y).

We’ll be looking at the following issues:

- What is SGD actually doing in the information plane?
- What effect does training sample size have on layers?
- What is the benefit of hidden layers?
- What is the final location of hidden layers?
- Do hidden layers form optimal IB representations?

(Where we know anonymous to be some combination of Hinton et al.).

This is the second of two papers on Hinton’s capsule theory that has been causing recent excitement. We looked at ‘Dynamic routing between capsules’ yesterday, which provides some essential background so if you’ve not read it yet I suggest you start there.

Building on the work of Sabour et al., we have proposed a new capsule network architecture in which each capsule has a logistic unit to represent the presence of an entity and a 4×4 pose matrix to represent the relationship between that entity and the viewer. We also introduced a new iterative routing procedure between capsule layers, making use of the EM algorithm, which allows the output of each lower-level capsule to be routed to a capsule in the layer above so that each higher-level capsule receives a cluster of similar pose votes, if such a cluster exists.

This revised CapsNet architecture (let’s call it CapsNetEM) does very well on the smallNORB dataset, which contains gray-level stereo images of 5 classes of toy: airplanes, cars, trucks, humans, and animals. Every individual toy is pictured at 18 different azimuths, 9 elevations, and 6 lighting conditions.

By ‘very well’ I mean state-of-the-art performance on this dataset, achieving a 1.4% test error rate (the CapsNet architecture we looked at yesterday achieves 2.7%). The best prior reported result on smallNORB is 2.56%. A smaller CapsNetEM network with 9 times fewer parameters than the previous state-of-the-art also achieves 2.2% error rate.

A second very interesting property of CapsNetEM demonstrated in this paper is increased robustness to some forms of adversarial attack. For both general and targeted adversarial attacks using an incremental adjustment strategy (white box) such as that described in ‘Explaining and harnessing adversarial examples,’ CapsNetEM is significantly less vulnerable than a baseline CNN.

…the capsule model’s accuracy after the untargeted attack never drops below chance (20%) whereas the convolutional model’s accuracy is reduced to significantly below chance with an epsilon of as small as 0.2.

We’re not out of the woods yet though – with black box attacks created by generating adversarial examples with a CNN and then testing them on both CapsNetEM and a different CNN, CapsNetEM does *not* perform noticeably better. Given that black box attacks are the more likely in the wild, that’s a shame.

The core capsules idea remains the same as we saw yesterday, with network layers divided into capsules (note by the way the similarity here with dividing layers into columns), and capsules being connected across layers.

Viewpoint changes have complicated effects on pixel intensities but simple, linear effects on the pose matrix that represents the relationship between an object or object-part and the viewer. The aim of capsules is to make good use of this underlying linearity, both for dealing with viewpoint variation and improving segmentation decisions.

Each capsule has a logistic unit to represent the presence of an entity, and a 4×4 pose matrix which can learn to represent the relationship between that entity and the viewer. A familiar object can be detected by looking for agreement between votes for its pose matrix. The votes come from capsules in the preceding network layer, based on parts they have detected.

A part produces a vote by multiplying its own pose matrix by a transformation matrix that represents the viewpoint invariant relationship between the part and the whole. As the viewpoint changes, the pose matrices of the parts and the whole will change in a coordinated way so that any agreement between votes from different parts will persist.
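That voting mechanism can be illustrated numerically. The sketch below uses 2×2 pose matrices for brevity (the paper uses 4×4), made-up matrices, and a left-multiplication convention for viewpoint changes — each part’s vote is its pose matrix times a viewpoint-invariant part-whole transform, so a viewpoint change moves all votes in lockstep and agreement persists:

```python
import numpy as np

# Two parts vote for the pose of a whole. The transforms T1, T2 encode the
# viewpoint-invariant part-whole relationships; the matrices are illustrative.

whole_pose = np.array([[0.8, -0.6], [0.6, 0.8]])   # pose of the whole object
T1 = np.array([[1.0, 0.5], [0.0, 1.0]])            # part-1 -> whole transform
T2 = np.array([[2.0, 0.0], [1.0, 1.0]])            # part-2 -> whole transform
pose1 = whole_pose @ np.linalg.inv(T1)             # part poses consistent with the whole
pose2 = whole_pose @ np.linalg.inv(T2)

vote1, vote2 = pose1 @ T1, pose2 @ T2
print(np.allclose(vote1, vote2))                   # True: votes agree

view = np.array([[0.0, -1.0], [1.0, 0.0]])         # a 90-degree viewpoint rotation
v1, v2 = (view @ pose1) @ T1, (view @ pose2) @ T2
print(np.allclose(v1, v2))                         # True: agreement persists
```

This is the linearity the paper is exploiting: the viewpoint change acts on part and whole poses alike, so the *agreement* between votes is itself viewpoint invariant.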

To find clusters of votes that agree, a *routing-by-agreement* iterative protocol is used. The major difference between CapsNet and CapsNetEM is the way that routing-by-agreement is implemented.

In CapsNet, the length of the pose vector is used to represent the probability that an entity is present. A non-linear squashing function keeps the length less than 1. A consequence though is that this “*prevents there from being any sensible objective function that is minimized by the iterative routing procedure*”.

CapsNet also uses the cosine of the angle between two pose vectors to measure their agreement. But the cosine (unlike the log variance of a Gaussian cluster) is *not good at distinguishing between good agreement and very good agreement*.

Finally, CapsNet uses a vector of length n rather than a matrix with n elements to represent a pose, so its transformation matrices have n² parameters rather than just n.

CapsNetEM overcomes all of these limitations with a new routing-by-agreement protocol.

Let us suppose that we have already decided on the poses and activation probabilities of all the capsules in a layer and we now want to decide which capsules to activate in the layer above and how to assign each active lower-level capsule to one active higher-level capsule.

We start by looking at a simplified version of the problem in which the transformation matrices are just the identity matrix, essentially taking transformation out of the picture for the time being.

Think of each capsule in a higher layer as corresponding to a Gaussian, and the outputs of each lower level capsule as data points. Now the task essentially becomes figuring out the fit between the data points produced by the lower layer capsules, and the functions defined by the higher layer Gaussians.

This is a standard mixture of Gaussians problem, except that we have way too many Gaussians, so we need to add a penalty that prevents us from assigning every data point to a different Gaussian.

At this point in the paper we then get a very succinct description of the routing by agreement protocol, which I can only follow at a high level, but here goes…

The cost of explaining a whole data-point i by using capsule j that has an axis-aligned covariance matrix is simply the sum over all dimensions of the cost of explaining each dimension, h, of the vote V_ij. This is simply −ln(P^h_{i|j}), where P^h_{i|j} is the probability density of the h-th component of the vote from i under capsule j’s Gaussian model for dimension h, which has variance (σ^h_j)².

Let cost^h_j be this cost summed over *all* lower-level capsules for a single dimension h of one higher-level capsule j.

The activation function of capsule j is then given by:

The non-linearity implemented by a whole capsule layer is a form of cluster finding using the EM algorithm, so we call it EM Routing.
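A hedged sketch of that per-dimension explanation cost, with made-up votes and routing weights (the real algorithm recomputes means, variances and assignments iteratively inside the EM loop):

```python
import numpy as np

# Cost of explaining one dimension h of the votes assigned to a capsule:
# the assignment-weighted negative log density of the vote components under
# the capsule's Gaussian for that dimension. All numbers are illustrative.

def dimension_cost(votes_h, assignment, mu_h, var_h):
    # -ln N(vote | mu_h, var_h), weighted by routing assignment probabilities
    neg_log_p = 0.5 * np.log(2 * np.pi * var_h) + (votes_h - mu_h) ** 2 / (2 * var_h)
    return np.sum(assignment * neg_log_p)

votes_h = np.array([0.9, 1.1, 1.0, 3.0])      # one dimension of four votes
assignment = np.array([0.9, 0.9, 0.9, 0.1])   # routing weights r_ij
mu_h = np.average(votes_h, weights=assignment)
var_h = np.average((votes_h - mu_h) ** 2, weights=assignment)
tight = dimension_cost(votes_h, assignment, mu_h, var_h)

# A capsule whose assigned votes disagree has higher variance, hence higher cost:
spread = np.array([0.0, 2.0, -1.0, 3.0])
mu2 = np.average(spread, weights=assignment)
var2 = np.average((spread - mu2) ** 2, weights=assignment)
print(dimension_cost(spread, assignment, mu2, var2) > tight)   # True
```

This is the quantity that feeds the activation decision: a tight cluster of votes is cheap to explain, so the capsule switches on; scattered votes are expensive, so it stays off.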

Here’s the routing algorithm in its full glory:

The overall objective function reduces a free energy function:

The recomputation of the means and variances reduces an energy equal to the squared distances of the votes from the means, weighted by assignment probabilities. The recomputation of the capsule activations reduces a free energy equal to the sum of the (scaled) costs used in the logistic function minus the entropy of the activation values. The recomputation of the assignment probabilities reduces the sum of the two energies above minus the entropies of the assignment probabilities.

The general architecture of a CapsNetEM network looks like this:

Each capsule has a 4×4 pose matrix and one logistic activation unit. The 4×4 pose of each of the primary capsules is a linear transformation of the output of all the lower layer ReLUs centred at that location. When connecting the last capsule layer to the final layer the scaled coordinate (row, column) of the centre of the receptive field of each capsule is added to the first two elements of its vote. “*We refer to this technique as Coordinate Addition*.” The goal is to encourage the shared final transformations to produce values for those two elements that represent the fine position of the entity relative to the centre of the capsule’s receptive field.

The routing procedure is used between each adjacent pair of capsule layers, and for convolutional capsules each capsule in layer L+1 only sends feedback to capsules within its receptive field in layer L. Capsules closer to the borders of the image therefore receive less feedback.

The following figure shows how the routing algorithm refines vote routing over three iterations:

We saw some of the results from applying CapsNetEM to smallNORB at the start of this piece. The authors also tested the ability of CapsNetEM to generalize to novel viewpoints by training both the convolution baseline and a CapsNetEM model on one third of the training data using a limited range of azimuths, and then testing on the remaining two-thirds containing the full set of azimuths. Compared with the baseline CNN, capsules with matched performance on familiar viewpoints reduce the test error rate on novel viewpoints by about 30%.

SmallNORB is an ideal data-set for developing new shape-recognition models precisely because it lacks many of the additional features of images in the wild. Now that our capsules model works well on NORB, we plan to implement an efficient version so that we can test much larger models on much larger data-sets such as ImageNet.

The Morning Paper isn’t trying to be a ‘breaking news’ site (there are plenty of those already!) — we covered a paper from 1950 last month for example! That said, when exciting research news breaks, of course I’m interested to read up on it. So The Morning Paper tends to be a little late to the party, but in compensation I hope to cover the material in a little more depth than the popular press. Recently there’s been some big excitement around Geoff Hinton’s work on capsules (and let’s not forget the co-authors, Sabour & Frosst), AlphaGo Zero playing Go against itself, and Shwartz-Ziv & Tishby’s information bottleneck theory of deep neural networks. This week I’ll be doing my best to understand those papers, and share with you what I can.

There are two capsule-related papers from Hinton. The first of them is from NIPS’17, and it’s today’s paper choice: ‘Dynamic routing between capsules.’ Some of the ideas in this paper are superseded by the ICLR’18 submission ‘Matrix capsules with EM routing’ that we’ll look at tomorrow, nevertheless, there’s important background information and results here. Strictly, we’re not supposed to know that the ICLR’18 submission is by Hinton and team – it’s a double blind review process. Clearly not working as designed in this case!

For thirty years, the state-of-the-art in speech recognition used hidden Markov models with Gaussian mixtures, together with a one-of-n representation encoding.

The one-of-n representations that they use are exponentially inefficient compared with, say, a recurrent neural network that uses distributed representations.

Why *exponentially* inefficient? To *double* the amount of information that an HMM (hidden Markov model) can remember, we need to *square* the number of hidden nodes. For a recurrent net we only need to double the number of hidden neurons.

Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints.

CNNs can deal with *translation* out of the box, but for robust recognition in the face of all other kinds of transformation we have two choices:

- Replicate feature detectors on a grid that grows exponentially with the number of dimensions, or
- Increase the size of the labelled training set in a similarly exponential way.

The big idea behind *capsules* (a new kind of building block for neural networks) is to efficiently encode *viewpoint invariant knowledge* in a way that generalises to novel viewpoints. To do this, capsules contain explicit *transformation matrices*.

A network layer is divided into many small groups of neurons called “capsules.” Capsules are designed to support transformation-independent object recognition.

The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image.

(Albedo is a measure for reflectance or optical brightness of a surface – I had to look it up!).

The authors describe the role of the capsules as like ‘an inverted rendering process’ in a graphics pipeline. In a graphics pipeline for example we might start out with an abstract model of a 3D teapot. This will then be placed in the coordinate system of a 3D world, and the 3D world coordinate system will be translated to a 3D camera coordinate system with the camera at the origin. Then lighting and reflections are handled followed by a projection transformation into the 2D view of the camera. Capsules start with that 2D view as input and try to reverse the transformations to uncover the abstract model class (teapot) behind the image.

There are many possible ways to implement the general idea of capsules… In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity.

In 2D, the picture in my head looks something like this:

To turn the length of the vector into a probability we need to ensure that it cannot exceed one, which is done by applying a non-linear *squashing* function that leaves the orientation of the vector unchanged but scales down its magnitude.

For all but the first layer of capsules, the total input to a capsule is a weighted sum over all “prediction vectors” from the capsules in the layer below, where each prediction vector is produced by multiplying the output of a capsule in the layer below by a weight matrix. The weights in this sum are coupling coefficients, determined by an iterative dynamic routing process we’ll look at in the next section.

In convolutional capsule layers each unit in a capsule is a convolutional unit. Therefore, each capsule will output a grid of vectors rather than a single output vector.

A Capsule Network, or CapsNet, combines multiple capsule layers. For example, here’s a sample network for the MNIST handwritten digit recognition task:

In general, the idea is to form a ‘parse tree’ of the scene. Layers are divided into capsules, and capsules recognise parts of the scene. Active capsules in a layer form part of the parse tree, and each active capsule chooses a capsule in the layer above to be its parent in the tree.

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above.

- Initially the output is routed to all possible parents, but scaled down by coupling coefficients that sum to 1 (determined by a ‘routing softmax’).
- For each possible parent, the capsule computes a ‘prediction vector’ by multiplying its own output by a weight matrix.
- If the prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which has the effect of increasing the coupling coefficient for that parent, and decreasing it for other parents.

This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below.
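A minimal sketch of this routing-by-agreement loop in plain JavaScript (my own simplified rendering, operating directly on pre-computed prediction vectors rather than a full network):

```
// uHat[i][j] is the prediction vector from lower-level capsule i for
// parent capsule j. Returns the final coupling coefficients c[i][j].
function route(uHat, iterations = 3) {
  const dot = (a, b) => a.reduce((acc, x, k) => acc + x * b[k], 0);
  const squash = s => {
    const sq = dot(s, s);
    const scale = sq / (1 + sq) / (Math.sqrt(sq) || 1e-9);
    return s.map(x => x * scale);
  };
  const softmax = row => {
    const e = row.map(x => Math.exp(x));
    const z = e.reduce((acc, x) => acc + x, 0);
    return e.map(x => x / z);
  };
  const nParent = uHat[0].length;
  const dim = uHat[0][0].length;
  let logits = uHat.map(() => new Array(nParent).fill(0)); // routing logits b_ij
  let c = logits.map(softmax);
  for (let it = 0; it < iterations; it++) {
    c = logits.map(softmax); // coupling coefficients sum to 1 per lower capsule
    for (let j = 0; j < nParent; j++) {
      // the parent's total input: weighted sum of predictions, then squash
      const s = new Array(dim).fill(0);
      uHat.forEach((row, i) => row[j].forEach((x, k) => { s[k] += c[i][j] * x; }));
      const v = squash(s);
      // agreement (scalar product) feeds back into the routing logits
      uHat.forEach((row, i) => { logits[i][j] += dot(row[j], v); });
    }
  }
  return c;
}
```

With two lower-level capsules whose predictions agree for one parent but cancel out for another, the coupling coefficients shift towards the parent they agree on after a few iterations.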

To replicate knowledge across space, all but the last layer of capsules are convolutional.

In the final layer of the network (DigitCaps in the figure above) we want the top-level capsule for each digit class to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, a separate margin loss is used for each digit capsule. An additional reconstruction loss is used to encourage the digit capsules to encode the instantiation parameters of the digit: “*during training, we mask out all but the activity vector of the correct digit capsule, then we use this activity vector to reconstruct.*”
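The margin loss for digit capsule k has the form L_k = T_k · max(0, m⁺ − ‖v_k‖)² + λ(1 − T_k) · max(0, ‖v_k‖ − m⁻)², with m⁺ = 0.9, m⁻ = 0.1 and λ = 0.5 in the paper. A direct transcription (continuing in plain JavaScript):

```
// present: whether digit k is actually in the image (T_k)
// vLength: the length of digit capsule k's output vector, ||v_k||
function marginLoss(present, vLength, mPlus = 0.9, mMinus = 0.1, lambda = 0.5) {
  const T = present ? 1 : 0;
  return T * Math.max(0, mPlus - vLength) ** 2 +
         lambda * (1 - T) * Math.max(0, vLength - mMinus) ** 2;
}

// No loss when a present digit's capsule is long enough...
marginLoss(true, 0.95);  // 0
// ...but an absent digit with a long output vector is penalised.
marginLoss(false, 0.6);  // ≈ 0.125
```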

The CapsNet above is trained on MNIST, and then the capsules are probed by feeding perturbed versions of the activity vector into the decoder network to see how the perturbation affects the reconstruction.

Since we are passing the encoding of only one digit and zeroing out other digits, the dimensions of a digit capsule should learn to span the space of variations in the way digits of that class are instantiated. These variations include stroke thickness, skew and width. They also include digit-specific variations such as the length of the tail of a 2.

Here are some examples of the learned dimensions:

With a 3-layer network, CapsNet overall achieves a 0.25% test error – an accuracy previously only achieved by deeper networks.

To test the robustness of CapsNet to affine transformation, a CapsNet and traditional CNN were both trained on a padded and translated MNIST training set. The network was then tested on the affNIST data set, in which each example is an MNIST digit with a random small affine transformation. CapsNet and the traditional network achieved similar accuracy (99.23% vs 99.22%) on the expanded MNIST test set. CapsNet scored 79% on the affNIST test set though, handily beating the CNN’s score of 66%.

Dynamic routing can be viewed as a parallel attention mechanism that allows each capsule at one level to attend to some active capsules at the level below and to ignore others. This should allow the model to recognise multiple objects in the image even if objects overlap.

And indeed it does! A generated MultiMNIST training and test dataset is created by overlaying one digit on top of another with a shift of up to 4 pixels on each axis. CapsNet is correctly able to segment the image into the two original digits. The segmentation happens at a higher level than individual pixels, and so it can deal correctly with cases where a pixel is in both digits, while still accounting for all pixels.

Research on capsules is now at a similar stage to research on recurrent neural networks for speech recognition at the beginning of this century. There are fundamental representational reasons for believing it is a better approach but it probably requires a lot more insights before it can out-perform a highly developed technology. The fact that a simple capsules system already gives unparalleled performance at segmenting overlapping digits is an early indication that capsules are a direction worth exploring.

Xifeng Guo has a Keras implementation of CapsNet on GitHub which is helpful to study.

As an antidote to callback-hell, ECMAScript 6 introduces Promises. Promises represent the value of an asynchronous computation, and the functions `resolve` and `reject` are used to settle the promise. Promises can be chained using `then`.

However, the semantics of JavaScript promises are quite complex, and since the feature is implemented by way of ordinary function calls, there are no static checks to ensure correct usage. As a result, programmers often make mistakes in promise-based code that leads to pernicious errors, as is evident from many reported issues on forums such as StackOverflow.

This paper introduces the notion of a *promise graph*, which can be used to visualise the flow in a promises-based program and help to detect a range of bugs. In order to have a formal basis for such reasoning and program analysis, the paper also introduces a calculus capturing the behaviour of promises, expressed as an extension of a core calculus for JavaScript.

A promise is an object that represents the result of an asynchronous operation.

A promise can be in one of three different states. Initially a promise is *Pending*, indicating that the operation has not yet completed and the promise holds no value. A *Fulfilled* promise is one in which the operation has succeeded and the promise holds a result value. A *Rejected* promise is one in which the operation has failed and the promise holds an error value. A promise that is either fulfilled or rejected is said to be *settled*.

A promise is constructed by passing a single callback function with two parameters: a resolve function that can be called to fulfil the promise, and a reject function that can be called to reject the promise. These parameters are named `resolve` and `reject` by convention.

```
var p = new Promise((resolve, reject) => {
  resolve(42); // immediate resolution
});
```

The `then` function can be used to register resolve reactions and reject reactions with promises, and creates in turn a *dependent promise* whose result is computed by the resolve (reject) reaction.

Section 3 of the paper defines a formal calculus for all of this. The runtime state has five components:

- A heap that maps addresses to values
- A promise state map that maps addresses to promise values (Pending, Fulfilled(value), or Rejected(value)).
- A map that maps addresses to a list of fulfil reactions
- A map that maps addresses to a list of reject reactions
- And a queue that holds scheduled reactions – promises that have been settled and whose reactions are awaiting asynchronous execution by the event loop.

A *promisify* expression in this language turns an object into a promise. Then there are a bunch of reduction rules such as this one:

They can look a bit intimidating at first glance, but that’s mostly down to the syntax. For example, all this one says is that:

- given an address `a` in the heap that is not currently mapped to promise values, and an expression `promisify(a)`,
- the system reduces to a new state where the expression value becomes `undefined` (the return value of a promisify expression),
- and the promise state map is updated with a new entry for this address holding the value Pending, the fulfil reaction and reject reaction maps get an empty entry for the address, and the rest of the state of the system is unaltered.

The fun part of the paper for me though, is what happens when you create *promise graphs* off of the back of this formalism. So let’s get straight into that…

The promise graph captures control- and dataflow in a promise-based program to represent the flow of values through promises, the execution of fulfil and reject reactions, and the dependencies between reactions and promises.

Promise graphs contain three different types of nodes:

- Value nodes represent value allocation sites in the program,
- Promise nodes represent promise allocation sites in the program,
- Function nodes represent every lambda or named function in the program.

There are four different types of edges between nodes as well:

- A settlement edge (which can be labelled either ‘resolve’ or ‘reject’) from a value node to a promise node indicates that the value is used to resolve or reject the promise.
- A registration edge (with label ‘resolve’ or ‘reject’) from a promise node to a function node indicates that the function is registered as a fulfil or reject reaction on the promise.
- A link edge from a promise to a dependent promise represents the dependency that when the parent promise is resolved or rejected, the dependent promise will be resolved or rejected with the same value.
- A return edge from a function node to a value node represents the function returning the value allocated at that site.

In the examples that follow, the subscript numbers indicate the line number in the program on which the corresponding object is declared. At this point on first read I admit to thinking it would be easier just to read the code, but bear with me, it will all make more sense very soon.

Here’s a small example of a program and its corresponding promise graph:

You can read the graph forwards: the value on line 3 (42) is used to resolve the promise declared on line 1 (p1), and flows into the fulfil reaction function declared on line 2. Inside that function, the newly computed value is then used to resolve the second promise. You can also read the graph backwards, tracing the resolution of a promise to understand how and why it was resolved.

Here’s an example with branching:

Here’s some code from a StackOverflow question (Q42408234) and its corresponding promise graph. The poster had spent a full two hours trying to figure out why the promise from mongoose.find (on line 15) was returning `undefined`.

Immediately the promise graph shows us two disconnected chains. The value returned on line 25 is used to resolve the promise on line 21, but then doesn’t go anywhere else. The function declared on lines 16-31 does not explicitly return a value, so it will implicitly return undefined. The fix is to *explicitly* return the value of bcrypt.compare as computed on line 20.

Armed with the ability to create promise graphs, we can now detect a variety of promise-related bugs.

- A *Dead Promise* occurs when an object enters the Pending state and never transitions to either the fulfilled or rejected state. A promise node with no resolve, reject, or link edges is dead.
- A *Missing resolve or reject reaction* occurs when a promise is resolved with a non-`undefined` value, and the promise lacks a fulfil (or reject) reaction. This is a promise node with no outgoing edges.
- A *Missing exceptional reject reaction* occurs when a promise is implicitly rejected by throwing an exception. These are promise nodes with a reject edge, but no reject registration edge.
- A *Missing return* occurs when a fulfil or reject reaction unintentionally returns `undefined` and this value is used to resolve or reject a dependent promise.
- A *Double resolve or reject* occurs when a promise enters the fulfilled or rejected state, and later the promise is again resolved (or rejected). This can be detected in the graph by multiple resolve (or reject) edges leading to the same promise.
- An *Unnecessary promise* is a promise whose sole purpose is to act as an intermediary between two other promises.
- A *Broken promise chain* occurs when a programmer inadvertently creates a fork in a promise chain.

The authors looked at the most recent 600 StackOverflow questions tagged with both `promise` and `javascript` (out of over 5,000 total). They focused on those with code fragments of at most 100 lines, a genuine promises-related issue, and an answer. This boiled things down to quite a small list, shown below:

The fourth column in the table above shows the additional information — above and beyond the promise graph — that was needed to pinpoint the bug (if any).

- **HB** indicates that happens-before information is needed – i.e., the knowledge that one statement in the program always executes before another.
- **LU** indicates that loop unrolling is necessary to debug the root cause (i.e., what happens in one iteration affects a promise created in a subsequent one).
- **EV** indicates that event reasoning is needed to debug the root cause – e.g., the knowledge that an event handler for a button can fire multiple times.

Here’s an example analysis:

Q41268953. The programmer creates a promise, but never resolves or rejects it. As a result, the reactions associated with the promise are never executed. The programmer gets lost in the details and mistakenly believes the bug to be elsewhere in the program. He or she then proceeds to add additional logging and reject reactions in all the wrong places and these are never executed. In this scenario, the promise graph immediately identifies that the initial promise in the promise chain is never resolved or rejected, and hence that no reaction on the chain is ever executed.

There are plenty more examples in section 5.3, and in appendix A.

Today’s edition of The Morning Paper comes with a free tongue-twister: ‘type test scripts for TypeScript testing!’

One of the things that people really like about TypeScript is the DefinitelyTyped repository of type declarations for common (otherwise untyped) JavaScript libraries. There are over 3000 such declaration files now available. This is a great productivity boost, but it’s not perfect:

These declaration files are… written and maintained manually, which leads to many errors. The TypeScript type checker blindly trusts the declaration files, without any static or dynamic checking of the library code.

So it would be great if there were a way of running automated tests to verify that the TypeScript declaration files actually match the implementations in the JavaScript libraries they represent. The authors show that their runtime type checking approach can find more errors with fewer false positives than prior static analysis tools. But it requires a test suite. Where can we get a test suite from?

Our method is based on the idea of feedback-directed random testing as pioneered by the Randoop tool of Pacheco et al. (2007). With Randoop, a (Java) library is tested automatically by using the methods of the library itself to produce values, which are then fed back as parameters to other methods in the library.

The TStest tool (http://www.brics.dk/tstools) takes as input a JavaScript library and a corresponding TypeScript declaration file. It then dynamically creates a JavaScript program to exercise the library, called a *type test script*. These scripts exercise the library code by mimicking the behaviour of potential applications, and perform runtime type checking to ensure that values match those specified by the type declarations.

TStest is set loose on 54 real-world libraries, and finds type mismatches in 49 of them.

Here’s a simple example of a type mismatch between the TypeScript declaration for PathJS, and the actual implementation.

First, the TypeScript declaration:

```
declare var Path: {
  root(path: string): void;
  routes: {
    root: IPathRoute,
  }
};

interface IPathRoute {
  run(): void;
}
```

And an excerpt from the implementation:

```
var Path = {
  root: function (path) {
    Path.routes.root = path;
  },
  routes: {
    root: null
  }
};
```

Looking at the type declaration, we see that the parameter `path` of the `root` method is declared to be a string. But in the implementation the value of this parameter is assigned to `Path.routes.root` – which is declared to be of type `IPathRoute`. A manual examination of the implementation shows that the value really should be a string.

This error is not found by the existing TypeScript declaration file checker TScheck, since it is not able to relate the side effects of a method with a variable. Type systems such as TypeScript or Flow also cannot find the error, because the types only appear in the declaration file, not as annotations in the library implementation.

When the type test script generated by TStest is run, it will generate output like this, highlighting the error:

```
*** Type error
property access: Path.routes.root
expected: object
observed: string
```

See §2 in the paper for an additional and much more involved example using generic types and higher-order functions.

A type test script has a main loop which repeatedly runs random tests against a library until a timeout is reached. Each test calls a library function, and the value returned is checked to see that its type matches the type declaration. Arguments to library functions are either randomly generated, *or taken from those produced by the library in previous actions where those values match the function parameter type declarations*. Test script applications can also interact with library object properties, which are treated like special kinds of function calls.
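As a toy illustration of that loop (everything here — the ‘library’, its declarations, and the checker — is invented for the sketch; the real tool generates the script from a declaration file):

```
// Toy feedback-directed type testing: repeatedly call library functions with
// random or previously produced values, and check results against declarations.
const library = {
  inc: x => x + 1,
  greet: name => 'hi ' + name,
  broken: x => String(x) // declared to return a number, actually returns a string
};
const declarations = {
  inc:    { param: 'number', returns: 'number' },
  greet:  { param: 'string', returns: 'string' },
  broken: { param: 'number', returns: 'number' }
};

function randomValue(type) {
  return type === 'number' ? Math.floor(Math.random() * 100) : 's' + Math.random();
}

function runTypeTestScript(rounds) {
  const names = Object.keys(library);
  const pool = [];            // feedback pool of values produced by the library
  const mismatches = new Set();
  for (let i = 0; i < rounds; i++) {
    const name = names[i % names.length];
    const decl = declarations[name];
    // prefer a previously produced value of the right type, else generate one
    const candidates = pool.filter(v => typeof v === decl.param);
    const arg = candidates.length > 0 && Math.random() < 0.5
      ? candidates[Math.floor(Math.random() * candidates.length)]
      : randomValue(decl.param);
    const result = library[name](arg);
    if (typeof result === decl.returns) {
      pool.push(result);      // only well-typed values feed back into testing
    } else {
      mismatches.add(name);   // report a type mismatch for this declaration
    }
  }
  return mismatches;
}
```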

The strategy for choosing which tests to perform and which values to generate greatly affects the quality of the testing. For example, aggressively injecting random (but type correct) values may break internal library invariants and thereby cause false positives, while having too little variety in the random value construction may lead to poor testing coverage and false negatives.

Here’s a snippet from the declaration file for the Async library:

```
declare module async {
  function memoize(
    fn: Function,
    hasher?: Function
  ): Function;
  function unmemoize(fn: Function): Function;
}
```

And the type test script generated for it:

…adapting the ideas of feedback-directed testing to TypeScript involves many interesting design choices and requires novel solutions…

These challenges include generating values that match a given structural type, and type checking return values with structural types. Type mismatch errors are reported based on deep checking (checking all reachable objects), but only shallow type checking is required for a value to be stored for feedback testing (normally a type failure stops any further testing).

When functions are returned, their types cannot be checked immediately, but only when the functions are invoked.

The blame calculus (Wadler and Findler 2009) provides a powerful foundation for tracking function contracts (e.g. types) dynamically and deciding whether to blame the library or the application code if violations are detected.

Since the application code in this case is known to be well-formed (we generated it), any problems can be blamed on the library.

Recursion with generic types is broken by treating any type parameters involved in recursion as `any`.

TStest is *not sound*: when an error is reported by TStest, that does not necessarily mean there is a mismatch in practice. An example that can cause this is the breaking of recursion in generic types using `any`. “*This is mostly a theoretical issue, we have never encountered false positives in practice.*”

TStest is also *not complete*: there are some mismatches that it can never detect.

Neither of these two things means that it isn’t useful though, which is what we’ll look at next.

Evaluation of TStest was conducted using the following libraries:

With time budgets of 10 seconds, 1 minute, and 5 minutes, and multiple runs, here’s how many mismatches TStest reports across all 54 libraries:

Mismatches were found in 49 of the 54 benchmarks, independently of the timeout and the number of repeated runs… The numbers in table 1 are quite large, and there are likely not around 6000 unique errors among the 54 libraries tested. A lot of the detected mismatches are different manifestations of the same root cause. However, our manual study shows that some declaration files do contain dozens of actual errors.

124 of the reported mismatches were randomly sampled for further evaluation (spanning 41 libraries).

- 63 of the 124 are mismatches that a programmer would almost certainly want to fix.
- A further 47 were mismatches that a programmer would want to fix, but are *only valid when TypeScript’s non-nullable types feature is enabled*.
- That leaves 14 of the 124 that turned out to be benign. Some of these are due to limitations in the TypeScript type system, some are due to TStest constructing objects with private behaviour that are really only intended to be constructed by the library itself, and the majority (7/14) are *intentional mismatches* by the developer:

For various reasons, declaration file authors sometimes intentionally write incorrect declarations even when correct alternatives are easily expressible, as also observed in previous work. A typical reason for such intentional mismatches is to document internal classes.

The authors believe it would be possible to adapt TStest to perform testing of Flow’s library definitions as well.

In this paper we present the design and implementation of Flow, a fast and precise type checker for JavaScript that is used by thousands of developers on millions of lines of code at Facebook every day.

In a pretty dense 30 pages, ‘Fast and precise type checking for JavaScript’ takes you through exactly how Facebook’s Flow works (although even then, some details are deferred to the extended edition!). I can’t read a paper packed with concise judgements and syntax such as this one now without thinking of Guy Steele’s wonderful recent talk “It’s time for a new old language.” It makes me feel a little better when struggling to get to grips with what the authors are really saying! Rather than unpacking the formal definitions here (which feels like it might take a small book and many hours!!), I’m going to focus in this write-up on giving you a high-level feel for what Flow does under the covers, and how it does it.

Evolving and growing a JavaScript codebase is notoriously challenging. Developers spend a lot of time debugging silly mistakes — like mistyped property names, out-of-order arguments, references to missing values, checks that never fail due to implicit conversions, and so on — and worse, unraveling assumptions and guarantees in code written by others.

The main internal repository at Facebook contains around 13M LoC of JavaScript, spanning about 122K files. That’s a lot of JavaScript! All of that code is covered by Flow, a static type checker for JavaScript that Facebook have been using for the past three years. Flow therefore has to be practical at scale and usable on real-world projects:

- The type checker must be able to cover large parts of a codebase without requiring too many changes in the code.
- The type checker must give fast responses even on a large codebase. “*Developers do not want any noticeable ‘compile-time’ latency in their normal workflow, because that would defeat the whole purpose of using JavaScript.*”

Using Flow, developers (or developer tools) get precise answers to code intelligence queries, and Flow can catch a large number of common bugs with few false positives.

To meet Flow’s goals, the designers made three key decisions:

- Common JavaScript idioms such as `x = x || 0` are precisely modelled. To be able to handle cases like this requires support for type *refinements* (more on that soon).
- Reflection and other legacy patterns that appear in a relatively small fraction of codebases are *not* explicitly focused on. Flow analyses source code, rather than first translating it to e.g. ES5, or even ES3.
- The constraint-based analysis is modularised to support parallel computation so that queries can be answered in well under a second even on codebases with millions of lines of code.

The related work section of the paper (§11) provides a nice comparison of the differences in design choices between Flow and TypeScript from the perspective of the Flow team:

Unlike Flow, [TypeScript] focuses only on finding “likely errors” without caring about soundness. Type inference in TypeScript is mostly local and in some cases contextual; it doesn’t perform a global type inference like Flow, so in general more annotations are needed… Furthermore, even with fully annotated programs, TypeScript misses type errors because of unsound typing rules… Unsoundness is a deliberate choice in TypeScript and Dart, motivated by the desire to balance convenience with bug-finding. But we have enough anecdotal evidence from developers at Facebook that focusing on soundness is not only useful but also desirable, and does not necessarily imply inconvenience.

(We recently looked at ‘To Type or Not to Type’, where the authors found that Flow and TypeScript had roughly equivalent power in their ability to detect bugs.)

In Flow, the set of runtime values that a variable may contain is described by its type (nothing new there!). The secret sauce of Flow is using runtime tests in the code to *refine* types.

For example, consider the following code:

```
function pipe(x, f) {
  if (f != null) {
    f(x);
  }
}
```

Seeing the test `f != null`, Flow refines the type of `f` to filter out `null`, and therefore knows that the value `null` cannot reach the call.

Refinements are useful with algebraic data types too. In this idiom, records of different shapes have a common property that specifies the ‘constructor,’ with other properties depending on the constructor value. For example:

```
var nil = { kind: "nil" };

var cons = (head, tail) => {
  return { kind: "cons", head, tail };
};
```

Now when Flow sees a function such as this:

```
function sum(list) {
  if (list.kind === "cons") {
    return list.head + sum(list.tail);
  }
  return 0;
}
```

It refines the type of `list` following the test `list.kind === "cons"` so that it knows the only objects reaching the `head` and `tail` property accesses on the following line are guaranteed to have those properties.
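Putting the two snippets together (repeating the earlier definitions so the sketch is self-contained; no annotations are required, since Flow infers the types):

```
var nil = { kind: "nil" };
var cons = (head, tail) => ({ kind: "cons", head, tail });

function sum(list) {
  if (list.kind === "cons") {
    return list.head + sum(list.tail);
  }
  return 0;
}

console.log(sum(cons(1, cons(2, cons(3, nil))))); // 6
```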

When Flow sees an idiom such as `x = x || nil` (the var `nil` from our previous code snippet, not to be confused with `null` here!), it models the assignment by merging the refined type of `x` with the type of `nil` and updating the type of `x` with it.

Refinements can also be *invalidated* by assignments (for example, `x = null;`).

Section 2 of the paper provides a formal definition of inference supporting refinement in FlowCore, a minimal subset of JavaScript including functions, mutable variables, primitive values, and records. The core components of the FlowCore constraint system are types, effects, and constraints. Effects track variable updates – each language term is associated with an effect, which roughly describes the set of variables that are (re)assigned with the term.

When Flow sees the line of code `var nil = { kind: "nil" };` it records a type for `nil` built from a fresh type variable for the record, together with a constraint capturing the flow of the record literal into that variable.

For the function `cons`,

```
var cons = (head, tail) => {
  return { kind: "cons", head, tail };
};
```

we get a type for `cons` relating fresh type variables for `head` and `tail` to the type of the returned record, plus the corresponding constraints. And so the process continues, constructing a *flow network*.

Thinking of our system as a dataflow analysis framework, constraint generation amounts to setting up a flow network. The next step is to allow the system to stabilize under a set of appropriate flow functions.

For every new constraint that gets generated, all eligible constraint propagation rules are applied until a fixpoint is reached. Once the fixpoint is reached, Flow can either discover inconsistencies, or prove the absence thereof. Inconsistencies correspond to potential bugs in the use of various operations.

Section 4 in the paper briefly introduces a runtime semantics for FlowCore, and section 5 proves type safety via the introduction of a declarative type system closely matching the type inference system described above.

Type annotations follow a grammar similar to that of types, except that there are no type variables, types can appear anywhere type variables could appear, and there are no effects. We consider a type annotation to be just another kind of type use, that expects some type of values. In other words, like everything else, we can formulate type checking with flow constraints.

Flow analyses a code base module-by-module (file-by-file). Each file is analysed separately once all files it depends on have been analysed. This strategy supports incremental and parallel analysis of large code bases.

The key idea is to demand a “signature” for every module. We ensure that types inferred for the expressions a module exports do not contain type variables — wherever they do, we demand type annotations… Requiring annotations for module interfaces is much better than requiring per-function annotations… Independently, having a signature for every module turns out to be a desirable choice for software engineering. It is considered good practice for documentation (files that import the module can simply look up its signature instead of its implementation), as well as error localization (blames for errors do not cross module boundaries).

In the 13M lines of JavaScript code in the main internal Facebook repository, about 29% of all locations where annotations could *potentially* be used actually do have annotations. The value of Flow’s type refinement mechanism was shown by turning it off – this led to more than 145K spurious errors being reported!

*(With thanks to Prof. Richard Jones at Kent University, who first pointed this paper out to me.)*

Yesterday we saw the recommendations of Georges et al. for determining when a (Java) virtual machine has reached a steady state and benchmarks can be taken. Kalibera and Jones later provided a more accurate manual process. In ‘Virtual machine warmup blows hot and cold,’ Barrett et al. provide a fully-automated approach to determining when a steady state has been reached, and *also whether or not that steady state represents peak performance*. Their investigation applies to VMs across a range of languages: Java, JavaScript, Python, Lua, PHP, and Ruby.

Our results suggest that much real-world VM benchmarking, which nearly all relies on assuming that benchmarks do reach a steady state of peak performance, is likely to be partly or wholly misleading. Since microbenchmarks similar to those in this paper are often used in isolation to gauge the efficacy of VM optimisations, it is also likely that ineffective, or deleterious, optimisations may have been incorrectly judged as improving performance and included in VMs.

If you’re simply allowing a VM to run a small number (e.g., 10) of iterations and then expecting it to be warmed-up and in a steady state by then, you’re definitely doing it wrong!

To gather their data, the authors use a very carefully controlled experiment design. Because of the level of detail and isolation it took around 3 person years to design and implement the experiments. The repeatable experiment artefacts are available online at https://archive.org/download/softdev_warmup_experiment_artefacts/v0.8/.

The basis for benchmarking is the *binary trees*, *spectralnorm*, *n-body*, *fasta*, and *fannkuch redux* microbenchmarks from the Computer Language Benchmarks Game (CLBG). These small benchmarks are widely used by VM authors as optimisation targets. Versions of these for C, Java, JavaScript, Python, Lua, and Ruby are taken from Bolz and Tratt 2015.

On each process run, 2000 iterations of the microbenchmark are executed. There are 30 process executions overall, so we have a total of 30 x 2000 iterations. The authors go to great lengths to eliminate any other factors which may contribute to variance in the benchmarking results. For example, the machines are clean rebooted to bring them into a known state before each process run, networking is disabled, daemons are disabled, there is no file I/O, and so on. They even ensure that the machines run benchmarks within a safe temperature range (to avoid the effects of temperature-based CPU limiters). Full details are in section 3, and section 7 on ‘threats to validity’ outlines even more steps that were taken in an attempt to obtain results that are as accurate and reliable as possible. Suffice it to say, you can easily start to see where some of those three person-years went!

The main hypothesis under investigation is:

**H1** Small, deterministic programs reach a steady state of peak performance.

And as a secondary hypothesis:

**H2** Moderately different hardware and operating systems have little effect on warmup.

If the expected pattern of warmup followed by steady-state peak performance is not observed, then the third hypothesis is:

**H3** Non-warmup process executions are largely due to JIT compilation or GC events.

Benchmarks are run on GCC 4.9.3, Graal 0.18, HHVM 3.15.3 (PHP), JRuby + Truffle 9.1.2.0, HotSpot 8u112b15 (Java), LuaJIT 2.0.4, PyPy 5.6.0, and V8 5.4.500.43 (JavaScript). Three different benchmarking machines were used in order to test H2. Linux (1240v5) and Linux (4790) have the same OS (with the same packages and updates etc.) but different hardware. Linux (4790) and OpenBSD (4790) have the same hardware but different operating systems.

Each in-process run results in time series data of length 2000. A technique called *statistical changepoint analysis* is used to analyse the data and classify the results. Prior to this analysis the data is pre-processed to remove outliers, defined as any point *after the first 200* that is outside the median ±3x (90%ile – 10%ile). Overall, 0.3% of all data points are classified as outliers under this definition.
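Here's a small sketch of that outlier rule (the percentile helper is a simplification of my own; the paper's exact percentile method may differ in detail):

```javascript
// Return the p-th percentile of a sorted array (simple nearest-rank style).
const percentile = (sorted, p) =>
  sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];

// A point after the first 200 iterations is an outlier if it lies outside
// median ± 3 * (90th percentile - 10th percentile).
function outliers(times) {
  const rest = times.slice(200);
  const sorted = [...rest].sort((a, b) => a - b);
  const median = percentile(sorted, 0.5);
  const spread = 3 * (percentile(sorted, 0.9) - percentile(sorted, 0.1));
  return rest.filter(t => t < median - spread || t > median + spread);
}

// 200 warmup points, then 99 steady points and one spike:
console.log(outliers(Array(200).fill(1).concat(Array(99).fill(1), [10]))); // [ 10 ]
```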

We use changepoint analysis to determine if and when warmup has occurred. Formally, a *changepoint* is a point in time where the statistical properties of prior data are different to the statistical properties of subsequent data; the data between two changepoints is a *changepoint segment*. Changepoint analysis is a computationally challenging problem, requiring consideration of large numbers of possible changepoints.

The authors use the PELT algorithm, which reduces the complexity to O(n). Changepoint detection is based on both the mean and variance of in-process iterations. Changepoint segments that have means within a small threshold (0.001 seconds) are considered equivalent. In addition, a segment is considered equivalent to the final segment if its mean is within variance($s_f$) seconds of the final segment mean (treating the variance of the final segment, $s_f$, as if it were a measure of seconds, not seconds squared). This variance-based threshold is used to account for the cumulative effect of external events during a run.

If hypothesis H1 holds, then we will see warm-up segment(s) followed by one or more steady-state segments, with the final segment being the fastest. A benchmark is said to reach a steady state if all segments which cover the last 500 in-process iterations are considered equivalent to the final segment.

- When the last 500 in-process iterations are *not* equivalent to the final segment, we say that the benchmark had *no steady state*.
- If a steady state is reached and *all* segments are equivalent, the benchmark is classified as *flat*.
- If a steady state is reached and at least one segment is faster than the final segment, the benchmark is classified as a *slowdown*.
- If a steady state is reached and it is neither flat nor a slowdown, then we have the classic *warmup* pattern.

Flat and warmup benchmarks are considered ‘good,’ while slowdown and no steady state benchmarks are ‘bad.’
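The classification rules above can be sketched as follows (a simplification: segments are summarised by their means, and the paper's variance-based equivalence test is omitted):

```javascript
// Each segment: { start, end, mean } over the 2000 in-process iterations.
// `eq` treats means within the paper's small threshold as equivalent.
function classify(segments, totalIters = 2000,
                  eq = (a, b) => Math.abs(a - b) < 0.001) {
  const last = segments[segments.length - 1];
  // Segments overlapping the last 500 iterations must match the final one.
  const tail = segments.filter(s => s.end > totalIters - 500);
  if (!tail.every(s => eq(s.mean, last.mean))) return "no steady state";
  if (segments.every(s => eq(s.mean, last.mean))) return "flat";
  if (segments.some(s => s.mean < last.mean && !eq(s.mean, last.mean))) return "slowdown";
  return "warmup";
}

// A slow first segment followed by a faster steady state:
console.log(classify([
  { start: 0, end: 100, mean: 2.0 },
  { start: 100, end: 2000, mean: 1.0 },
])); // warmup
```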

That deals with a single process run of 2000 in-process iterations. If all process executions for a given (VM, benchmark) pair have the same classification, then the pair is classified the same way and said to be *consistent*, otherwise the pair is classified as *inconsistent*.

In the charts and diagrams in the paper, you’ll see these various categories represented by symbols like this:

Our results consist of data for 3660 process executions and 7,320,000 in-process iterations. Table 1 (below) summarises the (VM, benchmark) pairs and process executions for each benchmarking machine.

Note that for (VM, benchmark) pairs, at best 37% of pairs show flat or warmup patterns, and another 6.5% are good inconsistent. The biggest proportion by far is ‘bad inconsistent.’

This latter figure clearly shows a widespread lack of predictability: in almost half of cases, the same benchmark on the same VM on the same machine has more than one performance characteristic. It is tempting to pick one of these performance characteristics – VM benchmarking sometimes reports the fastest process execution, for example – but it is important to note that all of these performance characteristics are valid and may be experienced by real-world users.

Here’s a breakdown just for one machine, showing that only *n-body* and *spectralnorm* come close to ‘good’ warmup behaviour on all machines.


VMs seem to mostly either reach a steady state quickly (often in 10 or fewer in-process iterations) or take hundreds of in-process iterations. The latter are troubling because previous benchmarking methodologies will not have run the benchmarks long enough to see the steady state emerge.

Since there are so many cases that do not fit the expected warmup pattern, the authors investigated these to see if H3 holds: that these cases are mostly due to JIT compilation or GC.

The relatively few results we have with GC and JIT compilation events, and the lack of a clear message from them, means that we feel unable to validate or invalidate Hypothesis H3. Whilst some non-warmups are plausibly explained by GC or JIT compilation events, many are not, at least on HotSpot and PyPy. When there is no clear correlation, we have very little idea of a likely cause of the unexpected behaviour.

The results above undermine the VM benchmarking orthodoxy of benchmarks quickly and consistently reaching a steady state after a fixed number of iterations.

We believe that, in all practical cases, this means that one must use an automated approach to analysing each process execution individually. The open-source changepoint analysis approach presented in this paper is one such option.

- Benchmark results should present both warm-up times and steady-state performance. “There are cases in our results where, for a given benchmark, two or more VMs have steady state performance within 2x of each other, but warmup differs by 100-1000x.”
- In-process iterations should be run for around 0.5s, with a minimum acceptable time of 0.1s.
- It is hard to know exactly how many in-process iterations to run, but around 500 can be used most of the time, while occasionally using larger numbers (e.g. 1,500) to see if longer-term stability has been affected.
- Good results should be obtained with 10 process executions, occasionally running higher numbers to identify infrequent performance issues.

This paper won the 10-year most influential paper award at OOPSLA this year. Many of the papers we look at on this blog include some kind of performance evaluation. As Georges et al. show, without good experimental design and statistical rigour it can be hard to draw any firm conclusions – worse, you may reach misleading or incorrect conclusions! The paper is set in the context of Java performance evaluation, but the lessons apply much more broadly.

Benchmarking is at the heart of experimental computer science research and development… As such, it is absolutely crucial to have a rigorous benchmarking methodology. A non-rigorous methodology may skew the overall picture, and may even lead to incorrect conclusions. And this may drive research and development in a non-productive direction, or may lead to a non-optimal product brought to market.

A good benchmark needs a well chosen and well motivated experimental design. In addition, it needs a sound performance evaluation methodology.

… a performance evaluation methodology needs to adequately deal with the non-determinism in the experimental setup… Prevalent data analysis approaches for dealing with non-determinism are not statistically rigorous enough.

In the context of Java sources of non-determinism can include JIT compilation, thread scheduling, and garbage collection for example. For many benchmarks run today on cloud platforms, non-determinism in the underlying cloud platform can also be a significant factor.

Common at the time of publication (it would be interesting to do a similar assessment of more recent papers) was a method whereby a number of performance runs – e.g. 30 – would be done, and the best performance number (smallest execution time) reported. This was in accordance with SPECjvm98 reporting rules for example. Here’s an example of doing this for five different garbage collectors.

CopyMS and GenCopy seem to perform about the same, and SemiSpace clearly outperforms GenCopy.

Here are the same experiment results, but reported using a statistically rigorous method reporting 95% confidence intervals.

Now we see that GenCopy significantly outperforms CopyMS, and that SemiSpace and GenCopy have overlapping confidence intervals – the difference between them could be simply due to random performance variations in the system under measurement.

After surveying 50 Java performance papers the authors conclude that there is little consensus on what methodology to use. Table 1 below summarises some of the methodologies used in those papers:

Suppose you use a non-rigorous methodology and report a single number such as best, average, or worst out of a set of runs. In a pairwise comparison, you might say there was a meaningful performance difference if the delta between the two systems was greater than some threshold $\theta$. Alternatively, using a statistically rigorous methodology and reporting confidence intervals, it may be that you see:

- overlapping intervals
- non-overlapping intervals, in the same order as the non-rigorous methodology
- non-overlapping intervals, in a different order to the non-rigorous methodology

This leads to six different cases – in only one of which can you truly rely on the results from the non-rigorous approach:
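A toy sketch of this pairwise comparison: the same data judged by a non-rigorous mean-delta threshold and by a confidence interval on the difference of means. The threshold value and the 95% z value of 1.96 are illustrative, and the interval follows the normal approximation for the difference of two means (as in section 3.3 of the paper, assuming at least 30 samples per system):

```javascript
// Sample mean and (unbiased) sample variance.
function meanAndVar(xs) {
  const n = xs.length;
  const m = xs.reduce((a, b) => a + b, 0) / n;
  const v = xs.reduce((a, x) => a + (x - m) ** 2, 0) / (n - 1);
  return { n, m, v };
}

function compare(a, b, theta = 0.01) {
  const A = meanAndVar(a), B = meanAndVar(b);
  // Non-rigorous: any mean delta above the threshold counts as "different".
  const naive = Math.abs(A.m - B.m) > theta ? "different" : "same";
  // Rigorous: 95% confidence interval for the difference of the two means;
  // only a significant difference if the interval excludes zero.
  const half = 1.96 * Math.sqrt(A.v / A.n + B.v / B.n);
  const d = A.m - B.m;
  const rigorous = (d - half > 0 || d + half < 0) ? "significant" : "inconclusive";
  return { naive, rigorous };
}
```

With noisy samples whose means differ by less than the measurement noise, the naive comparison reports a difference while the interval still straddles zero.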

The authors run a series of tests and find that *all prevalent methods can be misleading in a substantial fraction of comparisons between alternatives* – up to 16%. Incorrect conclusions even occur in up to 3% of comparisons. (And if you really must report a single number, mean and median do better than best, second best, or worst).

There are many more examples in section 6 of the paper.

We advocate adding statistical rigor to performance evaluation studies of managed runtime systems, and in particular Java systems. The motivation for statistically rigorous data analysis is that statistics, and in particular confidence intervals, enable one to determine whether differences observed in measurements are due to random fluctuations in the measurements or due to actual differences in the alternatives compared against each other.

Section 3 in the paper is my favourite part, and it essentially consists of a mini-stats tutorial for people doing benchmarks.

If we could get an exact repeatable number out of every performance run, life would be much more straightforward. Unfortunately we can’t do that due to non-determinism (‘random errors’) in the process. So we need to control for perturbing events unrelated to what the experiment is trying to measure. As a first step, the authors recommend discarding extreme outliers. With that done, we want to compute a *confidence interval*.

In each experiment, a number of samples is taken from an underlying population. A confidence interval for the mean derived from these samples then quantifies the range of values that have a given probability of including the actual population mean.

A *confidence interval* $[c_1, c_2]$ is defined such that the probability of the mean $\mu$ being between $c_1$ and $c_2$ equals $1 - \alpha$, where $\alpha$ is the *significance level* and $(1 - \alpha)$ is the *confidence level*.

A 90% confidence interval means that there is a 90% probability that the actual distribution mean of the underlying population is within the confidence interval. For the same data, if we want to be more confident that the true mean lies within the interval, say a 95% confidence interval, then it follows that we would need to make the interval *wider*.

Ideally we would take at least 30 samples such that we can build upon the central limit theorem. With a target significance level $\alpha$ chosen in advance, we can then determine $c_1$ and $c_2$ so that the probability of the true mean being in the interval equals $1 - \alpha$. It looks like this:

$$c_{1,2} = \bar{x} \mp z_{1-\alpha/2} \frac{s}{\sqrt{n}}$$

where $s$ is the standard deviation of the sample, $\bar{x}$ the mean, and $z_{1-\alpha/2}$ is obtained from a pre-computed table.

A basic assumption made in the above derivation is that the sample variance $s^2$ provides a good estimate of the actual variance $\sigma^2$… This is generally the case for experiments with a large number of samples, e.g., $n \geq 30$. However, for a relatively small number of samples, which is typically assumed to mean $n < 30$, the sample variance can be significantly different from the actual variance $\sigma^2$.

In this case, we can use *Student’s t-distribution* instead and compute the interval as:

$$c_{1,2} = \bar{x} \mp t_{1-\alpha/2; n-1} \frac{s}{\sqrt{n}}$$

The value $t_{1-\alpha/2; n-1}$ is typically obtained from a pre-computed table. As the number of measurements $n$ increases, the Student’s t-distribution approaches the Gaussian distribution.
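A minimal sketch of the normal-approximation interval, using the usual table values 1.645 (90%) and 1.96 (95%) for $z_{1-\alpha/2}$; for fewer than 30 samples you would substitute the Student $t$ value:

```javascript
// Confidence interval for the mean via the normal approximation (n >= 30).
function confidenceInterval(samples, z = 1.96 /* 95% */) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const s = Math.sqrt(
    samples.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1));
  const half = z * s / Math.sqrt(n);
  return [mean - half, mean + half];
}
```

On the same data, asking for more confidence (95% rather than 90%) yields a wider interval, as described above.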

Thus far we know how to compute confidence intervals for the mean of a single system. If we compare two systems and their confidence intervals overlap, then *we cannot conclude that the differences seen in the mean values are not due to random fluctuations in the measurements*. If the confidence intervals do not overlap, we conclude that the observed difference is unlikely to be due to chance alone, i.e., it is statistically significant.

The paper shows the formula for computing a confidence interval for the difference of two means (see section 3.3). If this interval includes zero, we can conclude, at the confidence level chosen, that there is no statistically significant difference between the two alternatives.

If we want to compare more than two alternatives, then we can use a technique called *Analysis of Variance* (ANOVA).

ANOVA separates the total variation in a set of measurements into a component due to random fluctuations in the measurements and a component due to the actual differences among the alternatives… If the variation between the alternatives is larger than the variation within each alternative, then it can be concluded that there is a statistically significant difference between the alternatives.

The ANOVA test doesn’t tell us which of the alternatives the statistically significant difference is between, if there is one! The *Tukey HSD* (Honestly Significant Difference) test can be used for this.

With ANOVA we can vary one *input variable* within an experiment. *Multi-factor ANOVA* enables you to study the effect of multiple input variables and all their interactions. *Multivariate ANOVA* (MANOVA) enables you to draw conclusions across multiple benchmarks.

Using the more complex analyses, such as multi-factor ANOVA and MANOVA, raises two concerns. First, their output is often non-intuitive and in many cases hard to understand without deep background in statistics. Second, as mentioned before, doing all the measurements required as input to the analyses can be very time-consuming up to the point where it becomes intractable.

Section 4 of the paper therefore introduces a practical yet still statistically rigorous set of recommendations for Java performance evaluation.

To measure start-up performance:

- Measure the execution time of multiple VM invocations, each running a single benchmark iteration.
- Discard the first VM invocation and retain only the subsequent measurements. This ensures libraries are loaded when doing the measurements.
- Compute the confidence interval for a given confidence level. If there are more than 30 measurements, use the standard normal $z$-statistic, otherwise use the Student $t$-statistic.

To measure steady-state performance:

- Consider $p$ VM invocations, each invocation running at most $q$ benchmark iterations. Suppose we want to retain $k$ measurements per invocation.
- For each VM invocation $i$, determine the iteration $s_i$ where steady-state performance is reached, i.e., once the coefficient of variation (standard deviation divided by the mean) of the last $k$ iterations (iterations $s_i - k + 1$ to $s_i$) falls below a preset threshold, say 0.01 or 0.02.
- For each VM invocation, compute the mean of the $k$ benchmark iterations under steady-state: $\bar{x}_i = \sum_{j = s_i - k + 1}^{s_i} x_{ij} / k$.
- Compute the confidence interval for a given confidence level across the computed means from the different VM invocations. The overall mean is $\bar{x} = \sum_{i=1}^{p} \bar{x}_i / p$, and the confidence interval is computed over the $\bar{x}_i$ measurements.
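The per-invocation steady-state step above can be sketched as follows (the window size and threshold are the illustrative values from the recommendations):

```javascript
// Find the first iteration s where the coefficient of variation of the
// preceding k measurements drops below `threshold`, then return the mean
// of those k steady-state measurements; null if no steady state is found.
function steadyStateMean(times, k = 10, threshold = 0.02) {
  for (let s = k; s <= times.length; s++) {
    const win = times.slice(s - k, s);
    const mean = win.reduce((a, b) => a + b, 0) / k;
    const sd = Math.sqrt(
      win.reduce((a, x) => a + (x - mean) ** 2, 0) / (k - 1));
    if (sd / mean < threshold) return mean;
  }
  return null; // no steady state within this run
}

// Warmup iterations settle to a steady state of ~2s per iteration:
console.log(steadyStateMean([10, 5, 3, 2, 2, 2, 2, 2], 3, 0.02)); // 2
```

The per-invocation means produced this way are then fed into the confidence interval computation across invocations.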

We’ll be talking more about the notion of ‘steady-state’ tomorrow – especially with micro-benchmarks.

For more on critically reading evaluation sections in papers in general, see “The truth, the whole truth, and nothing but the truth: a pragmatic guide to assessing empirical evaluations.”
