Mastering the game of Go without human knowledge Silver et al., Nature 2017

We already knew that AlphaGo could beat the best human players in the world: AlphaGo Fan defeated the European champion Fan Hui in October 2015 (‘Mastering the game of Go with deep neural networks and tree search’), and AlphaGo Lee used a similar approach to defeat Lee Sedol, winner of 18 international titles, in March 2016. So what’s really surprising here is the simplicity of AlphaGo Zero (the subject of this paper). AlphaGo Zero achieves superhuman performance, winning 100-0 in a match against the previous best AlphaGo. And it does it without seeing a single human game, or being given any heuristics for gameplay. All AlphaGo Zero is given are the rules of the game, and then it learns by playing matches against itself. The blank slate, tabula rasa. And it learns fast!

Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2-hour time controls and match conditions as were used in the man-machine match in Seoul. AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs), while AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0.

That comes out as superior performance in roughly 1/30th of the time, using only 1/12th of the computing power.

It’s a bittersweet feeling: we’ve made a breakthrough of this magnitude, and yet no matter what we do, we humans are never going to reach the same level of Go mastery as AlphaGo.

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

In the following figure, we can see a timeline of AlphaGo Zero discovering joseki (corner sequences) all by itself. The top row (a) shows joseki used by human players, and when on the timeline they were discovered. The second row (b) shows the joseki AlphaGo Zero favoured at different stages of self-play training.

At 10 hours a weak corner move was preferred. At 47 hours the 3-3 invasion was most frequently played. This joseki is also common in human professional play; however, AlphaGo Zero later discovered and preferred a new variation.

Here’s an example self-play game after just three hours of training:

AlphaGo Zero focuses greedily on capturing stones, much like a human beginner. A game played after 19 hours of training exhibits the fundamentals of life-and-death, influence and territory:

Finally, here is a game played after 70 hours of training: “the game is beautifully balanced, involving multiple battles and a complicated ko fight, eventually resolving into a half-point win for white.”

On an Elo scale, AlphaGo Fan achieved a rating of 3,144, and AlphaGo Lee achieved 3,739. A fully trained AlphaGo Zero (40 days of training) achieved a rating of 5,185. To put that in context, a 200-point gap in Elo rating corresponds to a 75% probability of winning.
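As a sanity check on that claim, under the standard Elo logistic model a 200-point gap corresponds to roughly a 76% expected score (the paper's exact Elo calibration may differ slightly):

```python
def elo_win_probability(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 200-point Elo gap gives roughly a 3-in-4 chance of winning:
p = elo_win_probability(3939, 3739)  # ~0.76
```

By that reckoning, AlphaGo Zero's 1,446-point gap over AlphaGo Lee makes the 100-0 result unsurprising.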

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.

### How does AlphaGo Zero work?

There are four key differences between AlphaGo Zero and AlphaGo Fan/Lee, and we’ve already seen two of them:

1. It is trained solely by self-play reinforcement learning, starting from random play.
2. It uses only the black and white stones from the board as input features.
3. It uses a single neural network rather than separate policy and value networks.
4. It uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts.

To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.

From a human perspective therefore we might say that AlphaGo Zero achieves mastery of the game of Go through many hours of deliberate practice.

In AlphaGo Fan there is a policy network that takes as input a representation of the board and outputs a probability distribution over legal moves, and a separate value network that takes as input a representation of the board and outputs a scalar value predicting the expected outcome of the game if play continued from here.

AlphaGo Zero combines both of these roles into a single deep neural network $f_\theta$ that outputs both move probabilities and an outcome prediction value: $(\mathbf{p}, v) = f_\theta(s)$. The input to the network, $s$, consists of 17 binary feature planes: $[X_t, Y_t, X_{t-1}, Y_{t-1}, ... , X_{t-7}, Y_{t-7}, C]$. Each $X$ and $Y$ input is a 19×19 matrix: $X_t^i = 1$ if intersection $i$ contains a stone of the current player’s colour at time step $t$, and $Y_t^i = 1$ if it contains a stone of the opponent’s colour. The final input feature $C$ is 1 if it is black’s turn to play, and 0 for white.

History features $X_t, Y_t$ are necessary because Go is not fully observable solely from the current stones, as repetitions are forbidden; similarly, the colour feature C is necessary because the komi is not observable.
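To make the encoding concrete, here is a minimal numpy sketch of assembling the 17 planes. The function name and the history representation are my own; the paper doesn't specify implementation details:

```python
import numpy as np

def build_input_planes(history, to_play):
    """Stack 17 binary 19x19 planes: 8 time steps of (own, opponent) stones,
    plus a colour plane.

    history: list of (own_stones, opponent_stones) pairs, most recent first,
             each a 19x19 binary array from the current player's perspective.
    to_play: 1 if black is to play, 0 if white.
    """
    planes = []
    for t in range(8):
        if t < len(history):
            own, opp = history[t]
        else:  # pad with empty boards before the start of the game
            own = opp = np.zeros((19, 19), dtype=np.int8)
        planes.extend([own, opp])
    # Constant colour plane C: all ones for black to play, all zeros for white
    planes.append(np.full((19, 19), to_play, dtype=np.int8))
    return np.stack(planes)  # shape (17, 19, 19)
```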

The input is connected to a ‘tower’ consisting of one convolutional block followed by nineteen residual blocks. On top of the tower are two ‘heads’: a value head and a policy head. End to end it looks like this:

In each position $s$, a Monte-Carlo Tree Search (MCTS) is executed, which outputs probabilities $\pi$ for each move. These probabilities usually select much stronger moves than the raw move probabilities $\mathbf{p}$ of the policy head on its own.

MCTS may be viewed as a powerful policy improvement operator. Self-play with search — using the improved MCTS-based policy to select each move, then using the game winner $z$ as a sample of the value — may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network’s parameters are updated to make the move probabilities and value $(\mathbf{p},v) = f_{\theta}(s)$ more closely match the improved search probabilities and self-play winner $(\mathbf{\pi},z)$; these new parameters are used in the next iteration of self-play to make the search even stronger.

The network is initialised to random weights, and then in each subsequent iteration self-play games are generated. At each time step an MCTS search is executed using the previous iteration of the network, and a move is played by sampling the search probabilities $\mathbf{\pi}_t$. A game terminates when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length. The game is then scored to give a final reward.
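Schematically, generating one self-play game looks something like the sketch below, where `mcts_search`, `apply_move`, `is_terminal`, and `score` are hypothetical stand-ins for the components described above:

```python
import random

def self_play_game(initial_state, mcts_search, apply_move, is_terminal, score):
    """One self-play game. mcts_search(state) -> {move: probability pi}.
    Returns training examples (state, pi, z), where z is the final outcome
    from the perspective of the player to move in that state."""
    trajectory = []  # (state, pi, player_to_move)
    state, player = initial_state(), 0
    while not is_terminal(state):
        pi = mcts_search(state)                      # improved search policy
        trajectory.append((state, pi, player))
        moves, probs = zip(*pi.items())
        move = random.choices(moves, weights=probs)[0]  # sample from pi
        state = apply_move(state, move)
        player = 1 - player
    z = score(state)  # +1 if player 0 wins, -1 otherwise
    return [(s, pi, z if p == 0 else -z) for s, pi, p in trajectory]
```

The key point is that the *search* probabilities $\pi$, not the raw network output $\mathbf{p}$, are what get recorded as training targets.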

The neural network $(\mathbf{p},v) = f_{\theta_{i}}(s)$ is adjusted to minimise the error between the predicted value $v$ and the self-play winner $z$, and to maximise the similarity of the neural network move probabilities $\mathbf{p}$ to the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over mean-squared error and cross-entropy losses:

$\displaystyle l = (z-v)^2 - \mathbf{\pi}^{T}\log \mathbf{p} + c||\theta||^2$
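The loss is straightforward to write down directly; a numpy sketch (the regularisation constant $c$ and the flat parameter vector `theta` are placeholders for the real training setup):

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2"""
    value_loss = (z - v) ** 2          # mean-squared error on the game outcome
    policy_loss = -np.sum(pi * np.log(p))  # cross-entropy to search probabilities
    reg = c * np.sum(theta ** 2)       # L2 weight regularisation
    return value_loss + policy_loss + reg
```

When $\mathbf{p} = \mathbf{\pi}$ and $v = z$, the loss reduces to the entropy of $\pi$ plus the regularisation term, its minimum for those targets.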

I’ll leave you to reflect on this closing paragraph, thousands of years of collective human effort surpassed from scratch in just a few days:

Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs, and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.

Opening the black box of deep neural networks via information Shwartz-Ziv & Tishby, ICRI-CI 2017

Yesterday we looked at the information theory of deep learning; today, in part II, we’ll be diving into experiments that use that information theory to try to understand what is going on inside DNNs. The experiments are done on a network with 7 fully connected hidden layers, and widths 12-10-7-5-4-3-2 neurons.

The network is trained using SGD with a cross-entropy loss function, and no other explicit regularization. The activation functions are hyperbolic tangent in all layers but the final one, where a sigmoid function is used. The task is to classify 4096 distinct patterns of the input variable $X$ into one of two output classes. The problem is constructed such that $I(X;Y) \approx 0.99$ bits.

Output activations are discretised into 30 bins, and the Markov chain $Y \rightarrow X \rightarrow T_i$ is used to calculate the joint distributions for every hidden layer $T_i$. Using the joint distributions, it is then possible to calculate the encoder and decoder mutual information, $I(X;T)$ and $I(T;Y)$, for each hidden layer in the network. The calculations were repeated 50 times over, with different random initialisations of the network weights and different random selections of training samples.
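This estimation procedure is easy to reproduce in miniature. A sketch of the plug-in estimate of $I(X;T)$ from binned activations (helper names are mine; the paper's setup assumes a known, uniform distribution over input patterns):

```python
import math
from collections import Counter

def mutual_information(xs, ts):
    """Plug-in estimate of I(X;T) in bits from paired samples of discrete
    (hashable) values of X and T."""
    n = len(xs)
    px, pt, pxt = Counter(xs), Counter(ts), Counter(zip(xs, ts))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (pt[t] / n)))
               for (x, t), c in pxt.items())

def discretise(activations, bins=30, lo=-1.0, hi=1.0):
    """Bin tanh activations into `bins` equal-width intervals, as in the paper,
    returning a hashable tuple of bin indices for the whole layer."""
    width = (hi - lo) / bins
    return tuple(min(int((a - lo) / width), bins - 1) for a in activations)
```

With the layer's binned activation tuple standing in for $T$, the same function gives both $I(X;T)$ and $I(T;Y)$.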

### A phase shift during training

We can plot the mutual information retained in each layer on a graph. The following chart shows the situation before any training has been done (i.e., random initial weights of each of the 50 generated networks).

The different colours in the chart represent the different hidden layers (and there are multiple points of each colour because we’re looking at 50 different runs all plotted together). On the x-axis is $I(X;T)$, so as we move to the right on the x-axis, the amount of mutual information between the hidden layer and the input $X$ increases. On the y-axis is $I(T;Y)$, so as we move up on the y-axis, the amount of mutual information between the hidden layer and the output $Y$ increases.

I’m used to thinking of progressing through the network layers from left to right, so it took a few moments to sink in that it’s the lowest layer which appears in the top-right of this plot (maintaining the most mutual information), and the top-most layer which appears in the bottom-left (retaining almost no mutual information before any training). So the information path being followed goes from the top-right corner down the slope to the bottom-left.

Part-way through training (here at 400 epochs), we can see that much more $I(T;Y)$ information is being retained through the layers.

After 9000 epochs we’re starting to see a pretty flat information path, which means that we’re retaining mutual information needed to predict $Y$ all the way through the network layers.

Something else happens during training though, and the best demonstration of it is to be found in watching this short video.

What you should hopefully have noticed is that early on the points shoot up and to the right, as the hidden layers learn to retain more mutual information both with the input and also as needed to predict the output. But after a while, a phase shift occurs, and points move more slowly up and to the left.

The following chart shows the average layer trajectories when training with 85% of the data, giving a static representation of the same phenomenon. The green line shows the phase change point, and the yellow points are the final positions.

The two optimization phases are clearly visible in all cases. During the fast empirical error minimization (ERM) phase, which takes a few hundred epochs, the layers increase information on the labels (increase $I_Y$) while preserving the DPI order (lower layers have higher information). In the second and much longer training phase the layers’ information on the input, $I_X$, decreases and the layers lose irrelevant information until convergence (the yellow points). We call this phase the representation compression phase.

You can also see the phase change clearly (the vertical grey line) when looking at the normalized means and standard deviations of the layer’s stochastic gradients during the optimization process:

We claim that these distinct SG phases (grey line in the figure above) correspond to and explain the ERM and compression phases we observe in the information plane.

### The drift phase

The first, ERM, phase is a drift phase. Here the gradient means are much larger than their standard deviations, indicating small gradient stochasticity (high SNR). The increase in $I_Y$ is what we expect to see from cross-entropy loss minimisation.

### The diffusion phase

The existence of the compression phase is more surprising. In this phase the gradient means are very small compared to the batch-to-batch fluctuations, and the gradients behave like Gaussian noise with very small means for each layer (low SNR). This is a diffusion phase.

…the diffusion phase mostly adds random noise to the weights, and they evolve like Wiener processes [think Brownian motion], under the training error or label information constraint.

This has the effect of maximising entropy of the weights distribution under the training error constraint. This in turn minimises the mutual information $I(X;T_i)$ – in other words, we’re discarding information in $X$ that is irrelevant to the task at hand. The fancy name for this process of entropy maximisation by adding noise is stochastic relaxation.

Compression by diffusion is exponential in the number of time steps it takes (optimisation epochs) to achieve a certain compression level (and hence why you see the points move more slowly during this phase).

One interesting consequence of this phase is the randomised nature of the final weights of the DNN:

This indicates that there is a huge number of different networks with essentially optimal performance, and attempts to interpret single weights or even single neurons in such networks can be meaningless.

### Convergence to the Information Bottleneck bound

As can be clearly seen in the charts, different layers converge to different points in the information plane, and this is related to the critical slowing down of the stochastic relaxation process near the phase transitions on the Information Bottleneck curve.

Recall from yesterday that the information curve is a line of optimal representations separating the achievable and unachievable regions in the information plane. Testing the information values in each hidden layer and plotting them against the information curve shows that the layers do indeed approach this bound.

How exactly the DNN neurons capture the optimal IB representations is another interesting issue to be discussed elsewhere, but there are clearly many different layers that correspond to the same IB representation.

### Why does deep work so well?

To understand the benefits of more layers, the team trained 6 different architectures with 1 to 6 hidden layers and did 50 runs of each as before. The following plots show how the information paths evolved during training for each of the different network depths:

From this we can learn four very interesting things:

1. Adding hidden layers dramatically reduces the number of training epochs for good generalization. One hidden layer was unable to achieve good $I_Y$ values even after $10^4$ iterations, but six layers can achieve full relevant information capture after just 400.
2. The compression phase of each layer is shorter when it starts from a previous compressed layer. For example, the convergence is much slower with 4 layers than with 5 or 6.
3. The compression is faster for the deeper (narrower and closer to the output) layers. In the diffusion phase the top layers compress first and “pull” the lower layers after them. Adding more layers seems to add intermediate representations which accelerate compression.
4. Even wide layers eventually compress in the diffusion phase. Adding extra width does not help.

### Training sample size

Training sample size seems to have the biggest impact on what happens during the diffusion phase. Here are three charts showing the information paths when training with 5% (left), 45% (middle), and 85% (right) of the data:

### What does all this tell us about future directions for DL?

We are currently working on new learning algorithms that utilize the claimed IB optimality of the layers. We argue that SGD seems an overkill during the diffusion phase, which consumes most of the training epochs, and the much simpler optimization algorithms, such as Monte-Carlo relaxations, can be more efficient.

Furthermore, the analytic connections between encoder and decoder distributions can be exploited during training: combining the IB iterations with stochastic relaxation methods may significantly boost DNN training.

To conclude, it seems fair to say, based on our experiments and analysis, that Deep Learning with DNNs are in essence learning algorithms that effectively find efficient representations that are approximate minimal statistics in the IB sense. If our findings hold for general networks and tasks, the compression of the SGD and the convergence of the layers to the IB bound can explain the phenomenal success of Deep Learning.

Opening the black box of deep neural networks via information Shwartz-Ziv & Tishby, ICRI-CI 2017

In my view, this paper fully justifies all of the excitement surrounding it. We get three things here: (i) a theory we can use to reason about what happens during deep learning, (ii) a study of DNN learning during training based on that theory, which sheds a lot of light on what is happening inside, and (iii) some hints for how the results can be applied to improve the efficiency of deep learning – which might even end up displacing SGD in the later phases of training.

Despite their great success, there is still no comprehensive understanding of the optimization process or the internal organization of DNNs, and they are often criticized for being used as mysterious “black boxes”.

I was worried that the paper would be full of impenetrable math, but with full credit to the authors it’s actually highly readable. It’s worth taking the time to digest what the paper is telling us, so I’m going to split my coverage into two parts, looking at the theory today, and the DNN learning analysis & visualisations tomorrow.

### An information theory of deep learning

Consider the supervised learning problem whereby we are given inputs $x \in X$ and we want to predict labels $y \in Y$. Inside the network we are learning some representation $T(X)$ of the input patterns, that we hope enables good predictions. We also want good generalisation, not overfitting.

Think of a whole layer $T$ as a single random variable. We can describe this layer by two distributions: the encoder $P(T|X)$ and the decoder $P(Y|T)$.

So long as these transformations preserve information, we don’t really care which individual neurons within the layers encode which features of the input. We can capture this idea by thinking about the mutual information of $T$ with the input $X$ and the desired output $Y$.

Given two random variables $X$ and $Y$, their mutual information $I(X;Y)$ is defined as
$\displaystyle I(X;Y) = H(X) - H(X|Y)$

Where $H(X)$ is the entropy of $X$ and $H(X|Y)$ is the conditional entropy of $X$ given $Y$.
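As a quick worked example, the definition can be computed directly from a small joint distribution table. A sketch (function names are mine):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_info(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint distribution given as a nested
    list: joint[x][y] = P(X=x, Y=y)."""
    px = [sum(row) for row in joint]               # marginal P(X)
    py = [sum(col) for col in zip(*joint)]         # marginal P(Y)
    # H(X|Y) = sum_y P(y) * H(X | Y=y)
    h_x_given_y = sum(
        py[j] * entropy([joint[i][j] / py[j] for i in range(len(joint))])
        for j in range(len(py)) if py[j] > 0)
    return entropy(px) - h_x_given_y
```

For a perfectly correlated pair ($P(X=Y)=1$ over two symbols) this gives 1 bit; for independent variables it gives 0, matching the intuition that $I(X;Y)$ measures how much knowing $Y$ reduces our uncertainty about $X$.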

The mutual information quantifies the number of relevant bits that the input $X$ contains about the label $Y$, on average.

If we put a hidden layer $T$ between $X$ and $Y$ then $T$ is mapped to a point in the information plane with coordinates $(I(X;T), I(T;Y))$. The Data Processing Inequality (DPI) result tells us that for any 3 variables forming a Markov chain $X \rightarrow Y \rightarrow Z$ we have $I(X;Y) \geq I(X;Z)$.

So far we’ve just been considering a single hidden layer. To make a deep neural network we need lots of layers! We can think of a Markov chain of $K$ layers, where $T_i$ denotes the $i^{th}$ hidden layer.

In such a network there is a unique information path which satisfies the DPI chains:

$\displaystyle I(X;Y) \geq I(T_1;Y) \geq I(T_2;Y) \geq ... \geq I(T_k;Y) \geq I(\hat{Y};Y)$

and
$\displaystyle H(X) \geq I(X;T_1) \geq I(X;T_2) \geq ... \geq I(X;T_k) \geq I(X;\hat{Y})$

Now we bring in another property of mutual information; it is invariant in the face of invertible transformations:

$\displaystyle I(X;Y) = I(\psi (X); \phi(Y))$

for any invertible functions $\psi$ and $\phi$.

And this reveals that the same information paths can be realised in many different ways:

Since layers related by invertible re-parameterization appear in the same point, each information path in the plane corresponds to many different DNNs, with possibly very different architectures.

### Information bottlenecks and optimal representations

An optimal encoder of the mutual information $I(X;Y)$ would create a representation of a minimal sufficient statistic of $X$ with respect to $Y$. If we have a minimal sufficient statistic then we can decode the relevant information with the smallest number of binary questions. (That is, it creates the most compact encoding that still enables us to predict $Y$ as accurately as possible).

The Information Bottleneck (IB) tradeoff Tishby et al. (1999) provides a computational framework for finding approximate minimal sufficient statistics. That is, the optimal tradeoff between the compression of $X$ and the prediction of $Y$.

The Information Bottleneck tradeoff is formulated by the following optimization problem, carried out independently for the distributions $p(t|x), p(t), p(y|t)$, with the Markov chain $Y \rightarrow X \rightarrow T$:

$\displaystyle \min_{p(t|x),p(y|t),p(t)} \{ I(X;T) - \beta I(T;Y)\}$

where the Lagrange multiplier $\beta$ determines the level of relevant information captured by the representation $T$, namely $I(T;Y)$.

The solution to this problem defines an information curve: a monotonic concave line of optimal representations that separates the achievable and unachievable regions in the information plane.

### Noise makes it all work

Section 2.4 contains a discussion on the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:

The learning complexity is related to the number of relevant bits required from the input patterns $X$ for a good enough prediction of the output label $Y$, or the minimal $I(X; \hat{X})$ under a constraint on $I(\hat{X}; Y)$ given by the IB.

Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy $H(Y)$ independent of the actual function we’re trying to learn, and nothing in the structure of the points $p(y|x)$ gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!

### Setting the stage for Part II

With all this theory under our belts, we can go on to study the information paths of DNNs in the information plane. This is possible when we know the underlying distribution $P(X,Y)$ and the encoder and decoder distributions $P(T|X)$ and $P(Y|T)$ can be calculated directly.

Our two order parameters, $I(T;X)$ and $I(T;Y)$, allow us to visualize and compare different network architectures in terms of their efficiency in preserving the relevant information in $P(X,Y)$.

We’ll be looking at the following issues:

1. What is SGD actually doing in the information plane?
2. What effect does training sample size have on layers?
3. What is the benefit of hidden layers?
4. What is the final location of hidden layers?
5. Do hidden layers form optimal IB representations?

Matrix capsules with EM routing Anonymous 😉, Submitted to ICLR’18

(Where we know anonymous to be some combination of Hinton et al.).

This is the second of two papers on Hinton’s capsule theory that has been causing recent excitement. We looked at ‘Dynamic routing between capsules’ yesterday, which provides some essential background so if you’ve not read it yet I suggest you start there.

Building on the work of Sabour et al., we have proposed a new capsule network architecture in which each capsule has a logistic unit to represent the presence of an entity and a 4×4 pose matrix to represent the relationship between that entity and the viewer. We also introduced a new iterative routing procedure between capsule layers, making use of the EM algorithm, which allows the output of each lower-level capsule to be routed to a capsule in the layer above so that each higher-level capsule receives a cluster of similar pose votes, if such a cluster exists.

This revised CapsNet architecture (let’s call it CapsNetEM) does very well on the smallNORB dataset, which contains gray-level stereo images of 5 classes of toy: airplanes, cars, trucks, humans, and animals. Every individual toy is pictured at 18 different azimuths, 9 elevations, and 6 lighting conditions.

By ‘very well’ I mean state-of-the-art performance on this dataset, achieving a 1.4% test error rate (the CapsNet architecture we looked at yesterday achieves 2.7%). The best prior reported result on smallNORB is 2.56%. A smaller CapsNetEM network with 9 times fewer parameters than the previous state-of-the-art also achieves 2.2% error rate.

A second very interesting property of CapsNetEM demonstrated in this paper is increased robustness to some forms of adversarial attack. For both general and targeted adversarial attacks using an incremental adjustment strategy (white box) such as that described in ‘Explaining and harnessing adversarial examples,’ CapsNetEM is significantly less vulnerable than a baseline CNN.

…the capsule model’s accuracy after the untargeted attack never drops below chance (20%) whereas the convolutional model’s accuracy is reduced to significantly below chance with an epsilon of as small as 0.2.

We’re not out of the woods yet though – with black box attacks created by generating adversarial examples with a CNN and then testing them on both CapsNetEM and a different CNN, CapsNetEM does not perform noticeably better. Given that black box attacks are the more likely kind in the wild, that’s a shame.

### Capsules recap

The core capsules idea remains the same as we saw yesterday, with network layers divided into capsules (note by the way the similarity here with dividing layers into columns), and capsules being connected across layers.

Viewpoint changes have complicated effects on pixel intensities but simple, linear effects on the pose matrix that represents the relationship between an object or object-part and the viewer. The aim of capsules is to make good use of this underlying linearity, both for dealing with viewpoint variation and improving segmentation decisions.

Each capsule has a logistic unit to represent the presence of an entity, and a 4×4 pose matrix which can learn to represent the relationship between that entity and the viewer. A familiar object can be detected by looking for agreement between votes for its pose matrix. The votes come from capsules in the preceding network layer, based on parts they have detected.

A part produces a vote by multiplying its own pose matrix by a transformation matrix that represents the viewpoint invariant relationship between the part and the whole. As the viewpoint changes, the pose matrices of the parts and the whole will change in a coordinated way so that any agreement between votes from different parts will persist.
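This coordination is just the associativity of matrix multiplication: if a viewpoint change applies a transform $V$ to the part's pose, the vote becomes $V$ times the old vote. A tiny numpy illustration (names are mine):

```python
import numpy as np

def capsule_vote(pose, transform):
    """Vote of a part for a whole: the part's 4x4 pose matrix multiplied by
    the learned, viewpoint-invariant part-whole transformation matrix."""
    return pose @ transform

# A viewpoint change V applied to the part's pose shifts the vote by the
# same V, because (V @ pose) @ transform == V @ (pose @ transform).
# So votes from different parts that agreed before the viewpoint change
# still agree after it.
```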

To find clusters of votes that agree, a routing-by-agreement iterative protocol is used. The major difference between CapsNet and CapsNetEM is the way that routing-by-agreement is implemented.

### Limitations of routing-by-agreement using vector lengths

In CapsNet, the length of the pose vector is used to represent the probability that an entity is present. A non-linear squashing function keeps the length less than 1. A consequence though is that this “prevents there from being any sensible objective function that is minimized by the iterative routing procedure”.

CapsNet also uses the cosine of the angle between two pose vectors to measure their agreement. But the cosine (unlike the log variance of a Gaussian cluster) is not good at distinguishing between good agreement and very good agreement.

Finally, CapsNet uses a vector of length $n$ rather than a matrix with $n$ elements to represent a pose, so its transformation matrices have $n^2$ parameters rather than just $n$.

CapsNetEM overcomes all of these limitations with a new routing-by-agreement protocol.

### Using Expectation Maximisation (EM) for routing-by-agreement

Let us suppose that we have already decided on the poses and activation probabilities of all the capsules in a layer and we now want to decide which capsules to activate in the layer above and how to assign each active lower-level capsule to one active higher-level capsule.

We start by looking at a simplified version of the problem in which the transformation matrices are just the identity matrix, essentially taking transformation out of the picture for the time being.

Think of each capsule in a higher layer as corresponding to a Gaussian, and the outputs of each lower level capsule as data points. Now the task essentially becomes figuring out the fit between the data points produced by the lower layer capsules, and the functions defined by the higher layer Gaussians.

This is a standard mixture of Gaussians problem, except that we have way too many Gaussians, so we need to add a penalty that prevents us from assigning every data point to a different Gaussian.

At this point in the paper we then get a very succinct description of the routing by agreement protocol, which I can only follow at a high level, but here goes…

The cost of explaining a whole data-point $i$ by using capsule $c$ that has an axis-aligned covariance matrix is simply the sum over all dimensions of the cost of explaining each dimension $h$ of $i$. This is simply $-\ln (P_{ih})$ where $P_{ih}$ is the probability density of the $h^{th}$ component of the vote from $i$ under the capsule’s Gaussian model for dimension $h$, which has variance $\sigma^{2}_{h}$.

Let $cost_h$ be this cost summed over all lower-level capsules, for a single dimension $h$ of one higher-level capsule.

The activation function of capsule $c$ is then given by:
$\displaystyle a_c = logistic(\lambda(b - \sum_{h} cost_h))$

The non-linearity implemented by a whole capsule layer is a form of cluster finding using the EM algorithm, so we call it EM Routing.

Here’s the routing algorithm in its full glory:

Each step of the routing procedure reduces a free energy function:

The recomputation of the means and variances reduces an energy equal to the squared distances of the votes from the means, weighted by assignment probabilities. The recomputation of the capsule activations reduces a free energy equal to the sum of the (scaled) costs used in the logistic function minus the entropy of the activation values. The recomputation of the assignment probabilities reduces the sum of the two energies above minus the entropies of the assignment probabilities.
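To make the shape of the algorithm concrete, here is a drastically simplified sketch: scalar votes instead of 4×4 pose matrices, and placeholder constants `beta` and `lam` where the paper uses learned per-capsule parameters and a schedule. This illustrates the alternating E-step/M-step structure, not the paper's actual implementation:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def em_routing(votes, a_in, n_out, iters=3, beta=0.0, lam=1.0):
    """Simplified EM routing with 1-D votes.

    votes: (n_in, n_out) array, vote of each lower capsule for each higher one
    a_in:  (n_in,) activations of the lower capsules
    Returns (means, activations) of the n_out higher-level capsules.
    """
    n_in = votes.shape[0]
    r = np.full((n_in, n_out), 1.0 / n_out)   # assignment probabilities
    for _ in range(iters):
        # M-step: fit each higher capsule's Gaussian to its assigned votes,
        # then set its activation from the (negative log-likelihood) cost.
        w = r * a_in[:, None]
        w_sum = w.sum(0) + 1e-9
        mean = (w * votes).sum(0) / w_sum
        var = (w * (votes - mean) ** 2).sum(0) / w_sum + 1e-9
        cost = 0.5 * np.log(2 * np.pi * np.e * var) * w_sum
        a_out = logistic(lam * (beta - cost))
        # E-step: reassign each lower capsule in proportion to how well its
        # vote fits each active higher capsule's Gaussian.
        p = a_out * np.exp(-(votes - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = p / (p.sum(1, keepdims=True) + 1e-9)
    return mean, a_out
```

Tight clusters of votes yield low variance, hence low cost and high activation, which is exactly the "agreement" signal routing-by-agreement is after.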

### A CapsNetEM network

The general architecture of a CapsNetEM network looks like this:

Each capsule has a 4×4 pose matrix and one logistic activation unit. The 4×4 pose of each of the primary capsules is a linear transformation of the output of all the lower layer ReLUs centred at that location. When connecting the last capsule layer to the final layer the scaled coordinate (row, column) of the centre of the receptive field of each capsule is added to the first two elements of its vote. “We refer to this technique as Coordinate Addition.” The goal is to encourage the shared final transformations to produce values for those two elements that represent the fine position of the entity relative to the centre of the capsule’s receptive field.

The routing procedure is used between each adjacent pair of capsule layers, and for convolutional capsules each capsule in layer L+1 only sends feedback to capsules within its receptive field in layer L. Positions closer to the borders of the image therefore receive less feedback.

The following figure shows how the routing algorithm refines vote routing over three iterations:

### CapsNetEM for smallNORB

We saw some of the results from applying CapsNetEM to smallNORB at the start of this piece. The authors also tested the ability of CapsNetEM to generalize to novel viewpoints by training both the convolution baseline and a CapsNetEM model on one third of the training data using a limited range of azimuths, and then testing on the remaining two-thirds containing the full set of azimuths. Compared with the baseline CNN, capsules with matched performance on familiar viewpoints reduce the test error rate on novel viewpoints by about 30%.

SmallNORB is an ideal data-set for developing new shape-recognition models precisely because it lacks many of the additional features of images in the wild. Now that our capsules model works well on NORB, we plan to implement an efficient version so that we can test much larger models on much larger data-sets such as ImageNet.

Dynamic routing between capsules Sabour et al., NIPS’17

The Morning Paper isn’t trying to be a ‘breaking news’ site (there are plenty of those already!) — we covered a paper from 1950 last month for example! That said, when exciting research news breaks, of course I’m interested to read up on it. So The Morning Paper tends to be a little late to the party, but in compensation I hope to cover the material in a little more depth than the popular press. Recently there’s been some big excitement around Geoff Hinton’s work on capsules (and let’s not forget the co-authors, Sabour & Frosst), AlphaGo Zero playing Go against itself, and Shwartz-Ziv & Tishby’s information bottleneck theory of deep neural networks. This week I’ll be doing my best to understand those papers, and share with you what I can.

There are two capsule-related papers from Hinton. The first of them is from NIPS’17, and it’s today’s paper choice: ‘Dynamic routing between capsules.’ Some of the ideas in this paper are superseded by the ICLR’18 submission ‘Matrix capsules with EM routing’ that we’ll look at tomorrow; nevertheless, there’s important background information and results here. Strictly, we’re not supposed to know that the ICLR’18 submission is by Hinton and team – it’s a double blind review process. Clearly not working as designed in this case!

### Exponential inefficiencies

For thirty years, the state-of-the-art in speech recognition used hidden Markov models with Gaussian mixtures, together with a one-of-n representation encoding.

The one-of-n representations that they use are exponentially inefficient compared with, say, a recurrent neural network that uses distributed representations.

Why exponentially inefficient? To double the amount of information that an HMM (hidden Markov model) can remember we need to square the number of hidden nodes. For a recurrent net we only need to double the number of hidden neurons.
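A back-of-the-envelope way to see this (my own illustration, not from the paper):

```javascript
// A one-of-n hidden state distinguishes n alternatives, i.e. log2(n) bits
function hmmBits(nStates) { return Math.log2(nStates); }

hmmBits(1024);         // 10 bits
hmmBits(1024 * 1024);  // 20 bits: squaring the state count only doubles the bits

// A distributed representation with n binary units already carries ~n bits,
// so doubling the number of units doubles the capacity directly.
```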

Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints.

CNNs can deal with translation out of the box, but for robust recognition in the face of all other kinds of transformation we have two choices:

1. Replicate feature detectors on a grid that grows exponentially with the number of dimensions, or
2. Increase the size of the labelled training set in a similarly exponential way.

The big idea behind capsules (a new kind of building block for neural networks) is to efficiently encode viewpoint invariant knowledge in a way that generalises to novel viewpoints. To do this, capsules contain explicit transformation matrices.

### Introducing Capsules

A network layer is divided into many small groups of neurons called “capsules.” Capsules are designed to support transformation-independent object recognition.

The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image.

(Albedo is a measure of the reflectance or optical brightness of a surface – I had to look it up!).

The authors describe the role of the capsules as like ‘an inverted rendering process’ in a graphics pipeline. In a graphics pipeline for example we might start out with an abstract model of a 3D teapot. This will then be placed in the coordinate system of a 3D world, and the 3D world coordinate system will be translated to a 3D camera coordinate system with the camera at the origin. Then lighting and reflections are handled followed by a projection transformation into the 2D view of the camera. Capsules start with that 2D view as input and try to reverse the transformations to uncover the abstract model class (teapot) behind the image.

There are many possible ways to implement the general idea of capsules… In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity.

In 2D, the picture in my head looks something like this:

To turn the length of the vector into a probability we need to ensure that it cannot exceed one, which is done by applying a non-linear squashing function that leaves the orientation of the vector unchanged but scales down its magnitude.
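The squashing function from the paper, $\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2} \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$, can be sketched in a few lines of JavaScript (my own rendering of the formula):

```javascript
// Squashing non-linearity: short vectors shrink towards length 0,
// long vectors towards length 1, and direction is preserved.
function squash(s) {
  var sq = s.reduce(function (acc, x) { return acc + x * x; }, 0);
  var norm = Math.sqrt(sq);
  var scale = sq / (1 + sq) / (norm || 1);  // guard against the zero vector
  return s.map(function (x) { return x * scale; });
}

function length(v) {
  return Math.sqrt(v.reduce(function (acc, x) { return acc + x * x; }, 0));
}

length(squash([0.1, 0]));  // ≈ 0.0099 — short input squashes towards 0
length(squash([10, 0]));   // ≈ 0.9901 — long input squashes towards 1
```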

For all but the first layer of capsules, the total input to a capsule $\mathbf{s}_j$ is a weighted sum over all “prediction vectors” $\hat{\mathbf{u}}_{j|i}$ from the capsules in the layer below, each produced by multiplying the output $\mathbf{u}_i$ of a capsule in the layer below by a weight matrix $\mathbf{W}_{ij}$:
$\displaystyle \mathbf{s}_j = \sum_{i} c_{ij} \hat{\mathbf{u}}_{j|i},\ \ \ \ \hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\mathbf{u}_i$

where $c_{ij}$ are coupling coefficients determined by an iterative dynamic routing process we’ll look at in the next section.

In convolutional capsule layers each unit in a capsule is a convolutional unit. Therefore, each capsule will output a grid of vectors rather than a single output vector.

### Connecting capsules in layers

A Capsule Network, or CapsNet combines multiple capsule layers. For example, here’s a sample network for the MNIST handwritten digit recognition task:

In general, the idea is to form a ‘parse tree’ of the scene. Layers are divided into capsules and capsules recognise parts of the scene. Active capsules in a layer form part of the parse tree, and each active capsule chooses a capsule in the layer above to be its parent in the tree.

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above.

• Initially the output is routed to all possible parents, but scaled down by coupling coefficients that sum to 1 (determined by a ‘routing softmax’).
• For each possible parent, the capsule computes a ‘prediction vector’ by multiplying its own output by a weight matrix.
• If the prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which has the effect of increasing the coupling coefficient for that parent, and decreasing it for other parents.
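The three steps above can be sketched as code. This is a minimal illustration assuming the prediction vectors `uHat[i][j]` have already been computed, not the authors’ implementation:

```javascript
function squash(s) {
  var sq = s.reduce(function (a, x) { return a + x * x; }, 0);
  var scale = sq / (1 + sq) / (Math.sqrt(sq) || 1);
  return s.map(function (x) { return x * scale; });
}
function softmax(row) {
  var exps = row.map(Math.exp);
  var sum = exps.reduce(function (a, x) { return a + x; }, 0);
  return exps.map(function (x) { return x / sum; });
}
function dot(a, b) {
  return a.reduce(function (s, x, k) { return s + x * b[k]; }, 0);
}

// uHat[i][j]: prediction vector from lower capsule i for parent capsule j
function route(uHat, iterations) {
  var nIn = uHat.length, nOut = uHat[0].length, dim = uHat[0][0].length;
  var b = uHat.map(function () { return new Array(nOut).fill(0); });  // routing logits
  var v = [];
  for (var t = 0; t < iterations; t++) {
    var c = b.map(softmax);  // coupling coefficients: sum to 1 per lower capsule
    v = [];
    for (var j = 0; j < nOut; j++) {
      var s = new Array(dim).fill(0);
      for (var i = 0; i < nIn; i++)
        for (var d = 0; d < dim; d++) s[d] += c[i][j] * uHat[i][j][d];
      v.push(squash(s));
    }
    // agreement: a large scalar product strengthens the coupling to that parent
    for (var i2 = 0; i2 < nIn; i2++)
      for (var j2 = 0; j2 < nOut; j2++) b[i2][j2] += dot(uHat[i2][j2], v[j2]);
  }
  return v;  // output vectors of the higher-level capsules
}

// two lower capsules agree about parent 0 but disagree about parent 1:
var out = route([[[1, 0], [1, 0]], [[1, 0], [-1, 0]]], 3);
// out[0] grows long (agreement), out[1] stays at the zero vector
```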

This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below.

To replicate knowledge across space, all but the last layer of capsules are convolutional.

In the final layer of the network (DigitCaps in the figure above) we want the top-level capsule for digit class $k$ to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, a separate margin loss is used for each digit capsule. An additional reconstruction loss is used to encourage the digit capsules to encode the instantiation parameters of the digit: “during training, we mask out all but the activity vector of the correct digit capsule, then we use this activity vector to reconstruct.”
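Concretely, the margin loss for digit class $k$ is $L_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2$, where $T_k = 1$ iff digit $k$ is present, with $m^+ = 0.9$, $m^- = 0.1$ and $\lambda = 0.5$ in the paper. In code:

```javascript
// Margin loss for one digit capsule, given the length of its activity
// vector and whether that digit is actually present in the image.
function marginLoss(vLength, present) {
  var mPlus = 0.9, mMinus = 0.1, lambda = 0.5;  // constants from the paper
  var pos = Math.max(0, mPlus - vLength);
  var neg = Math.max(0, vLength - mMinus);
  // lambda down-weights the absent-digit term so early training
  // doesn't shrink all the activity vectors
  return present ? pos * pos : lambda * neg * neg;
}

marginLoss(0.95, true);   // 0 — present digit with a long vector: no loss
marginLoss(0.95, false);  // 0.36125 — absent digit with a long vector is penalised
```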

### Representations in MNIST

The CapsNet above is trained on MNIST, and then the capsules are probed by feeding perturbed versions of the activity vector into the decoder network to see how the perturbation affects the reconstruction.

Since we are passing the encoding of only one digit and zeroing out other digits, the dimensions of a digit capsule should learn to span the space of variations in the way digits of that class are instantiated. These variations include stroke thickness, skew and width. They also include digit-specific variations such as the length of the tail of a 2.

Here are some examples of the learned dimensions:

With a 3-layer network, CapsNet overall achieves a 0.25% test error – an accuracy previously only achieved by deeper networks.

To test the robustness of CapsNet to affine transformation, a CapsNet and traditional CNN were both trained on a padded and translated MNIST training set. The network was then tested on the affNIST data set, in which each example is an MNIST digit with a random small affine transformation. CapsNet and the traditional network achieved similar accuracy (99.23% vs 99.22%) on the expanded MNIST test set. CapsNet scored 79% on the affNIST test set though, handily beating the CNN’s score of 66%.

### Segmentation

Dynamic routing can be viewed as a parallel attention mechanism that allows each capsule at one level to attend to some active capsules at the level below and to ignore others. This should allow the model to recognise multiple objects in the image even if objects overlap.

And indeed it does! A generated MultiMNIST training and test dataset is created by overlaying one digit on top of another with a shift of up to 4 pixels on each axis. CapsNet is correctly able to segment the image into the two original digits. The segmentation happens at a higher level than individual pixels, and so it can deal correctly with cases where a pixel is in both digits, while still accounting for all pixels.

### A new beginning

Research on capsules is now at a similar stage to research on recurrent neural networks for speech recognition at the beginning of this century. There are fundamental representational reasons for believing it is a better approach but it probably requires a lot more insights before it can out-perform a highly developed technology. The fact that a simple capsules system already gives unparalleled performance at segmenting overlapping digits is an early indication that capsules are a direction worth exploring.

Xifeng Guo has a Keras implementation of CapsNet on GitHub which is helpful to study.

As an antidote to callback-hell, ECMAScript 6 introduces Promises. Promises represent the value of an asynchronous computation, and the functions resolve and reject are used to settle the promise. Promises can be chained using then.

However, the semantics of JavaScript promises are quite complex, and since the feature is implemented by way of ordinary function calls, there are no static checks to ensure correct usage. As a result, programmers often make mistakes in promise-based code that leads to pernicious errors, as is evident from many reported issues on forums such as StackOverflow.

This paper introduces the notion of a promise graph, which can be used to visualise the flow in a promises-based program and help to detect a range of bugs. In order to have a formal basis for such reasoning and program analysis, the paper also introduces $\lambda_p$, a calculus for the behaviour of promises, which is expressed as an extension to $\lambda_{JS}$.

### Promises essentials

A promise is an object that represents the result of an asynchronous operation.

A promise can be in one of three different states. Initially a promise is Pending, indicating that the operation has not yet completed and the promise holds no value. A Fulfilled promise is one in which the operation has succeeded and the promise holds a result value. A Rejected promise is one in which the operation has failed and the promise holds an error value. A promise that is either fulfilled or rejected is said to be settled.
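In plain JavaScript (my own examples), the three states look like this:

```javascript
var pending = new Promise(function () {});       // never settled: stays Pending
var fulfilled = Promise.resolve(42);             // Fulfilled, holding the value 42
var rejected = Promise.reject(new Error('no'));  // Rejected, holding the error

rejected.catch(function () {});  // register a handler to avoid an unhandled-rejection warning
// fulfilled and rejected are both "settled"; pending never will be
```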

A promise is constructed by passing a single callback function with two parameters: a resolve function that can be called to fulfil the promise, and a reject function that can be called to reject the promise. These parameters are named resolve and reject by convention.

	var p = new Promise((resolve, reject) => {
	  resolve(42);  // immediate resolution
	});


The then function can be used to register resolve reactions and reject reactions with promises, and creates in turn a dependent promise with the result computed by the resolve (reject) reaction.
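For example:

```javascript
var p = Promise.resolve(21);

// then() creates a dependent promise whose value comes from the reaction
var q = p.then(function (v) { return v * 2; });
q.then(function (v) { console.log(v); });  // prints 42

// the second argument to then() registers a reject reaction
Promise.reject(new Error('boom')).then(
  function (v) { return v; },
  function (err) { console.log('handled: ' + err.message); }  // prints handled: boom
);
```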

### Promises calculus

Section 3 of the paper defines a formal $\lambda_p$ calculus for all of this. The runtime state has five components:

• A heap $\sigma$ that maps addresses to values
• A promise state map $\psi$ that maps addresses to promise values (Pending, Fulfilled(value), or Rejected(value)).
• A map $f$ that maps addresses to a list of fulfil reactions
• A map $r$ that maps addresses to a list of reject reactions
• And a queue $\pi$ that holds scheduled reactions – promises that have been settled and whose reactions are awaiting asynchronous execution by the event loop.

A promisify expression in this language turns an object into a promise. Then there are a bunch of reduction rules such as this one:

They can look a bit intimidating at first glance, but that’s mostly down to the syntax. For example, all this one says is that:

• given $a$ is an address in the heap, that is not currently mapped to promise values, and an expression promisify(a)
• then the system reduces to a new state where the expression value becomes undefined (the return value of a promisify expression),
• and the promise state map is updated with a new entry for this address that holds the value Pending, the fulfil reaction and reject reaction maps get an empty entry for the address, and the rest of the state of the system is unaltered
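A direct (and very much simplified) transcription of this state and the promisify rule into JavaScript, with names of my own choosing:

```javascript
var state = {
  heap: new Map(),             // σ: address -> value
  promises: new Map(),         // ψ: address -> 'Pending' | {fulfilled: v} | {rejected: v}
  fulfilReactions: new Map(),  // f: address -> list of fulfil reactions
  rejectReactions: new Map(),  // r: address -> list of reject reactions
  queue: []                    // π: settled reactions awaiting the event loop
};

// promisify(a): only applies if address a has no promise state yet
function promisify(addr) {
  if (state.promises.has(addr)) throw new Error('already a promise');
  state.promises.set(addr, 'Pending');      // ψ gains a Pending entry
  state.fulfilReactions.set(addr, []);      // f gains an empty entry
  state.rejectReactions.set(addr, []);      // r gains an empty entry
  return undefined;  // the value of a promisify expression is undefined
}
```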

The fun part of the paper for me though, is what happens when you create promise graphs off the back of this formalism. So let’s get straight into that…

### Promise graphs

The promise graph captures control- and dataflow in a promise-based program to represent the flow of values through promises, the execution of fulfil and reject reactions, and the dependencies between reactions and promises.

Promise graphs contain three different types of nodes:

• Value nodes represent value allocation sites in the program
• Promise nodes represent promise allocation sites in the program,
• Function nodes represent every lambda or named function in the program.

There are four different types of edges between nodes as well:

• A settlement edge (which can be labelled either ‘resolve’ or ‘reject’) from a value node $v$ to a promise node $p$ indicates that the value $v$ is used to resolve or reject the promise $p$.
• A registration edge (with label ‘resolve’ or ‘reject’) from a promise node $p$ to a function node $f$ indicates that the function $f$ is registered as a fulfil or reject reaction on the promise $p$.
• A link edge from a promise $p_1$ to a dependent promise $p_2$ represents the dependency that when the parent promise $p_1$ is resolved or rejected, the dependent promise will be resolved or rejected with the same value.
• A return edge from a function node $f$ to a value node $v$ represents the function $f$ returning the value allocated at $v$.

In the examples that follow, the subscript numbers (e.g., $v_{21}$) indicate the line number in the program on which the corresponding object is declared. At this point on first read I admit to thinking it would be easier just to read the code, but bear with me, it will all make more sense very soon.

Here’s a small example of a program and its corresponding promise graph:

You can read the graph forward: the value on line 3 (42) is used to resolve the promise declared on line 1 (p1), and flows into the fulfil reaction function declared on line 2. Inside that function, the newly computed value ($v_2$) is then used to resolve the second promise $p_2$. You can also read it backwards to trace the resolution of $p_2$ to understand how and why it was resolved.

Here’s an example with branching:

### A real world example

Here’s some code from a StackOverflow question (Q42408234) and its corresponding promise graph. The poster had spent a full two hours trying to figure out why the promise from mongoose.find (on line 15) was returning undefined.

Immediately the promise graph shows us two disconnected chains. The value returned on line 25 is used to resolve the promise on line 21, but then doesn’t go anywhere else. The function declared on lines 16-31 does not explicitly return a value, so it will implicitly return undefined. The fix is to explicitly return the value of bcrypt.compare as computed on line 20.
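The bug pattern, in a minimal sketch (my own reconstruction, not the poster’s actual code):

```javascript
function findUser() {
  return Promise.resolve({ name: 'ada' });
}

// BUG: the reaction creates an inner promise but doesn't return it,
// so the dependent promise resolves with undefined
var broken = findUser().then(function (user) {
  Promise.resolve('hash:' + user.name);
});

// FIX: return the inner promise so its value flows down the chain
var fixed = findUser().then(function (user) {
  return Promise.resolve('hash:' + user.name);
});

broken.then(function (v) { console.log(v); });  // prints undefined
fixed.then(function (v) { console.log(v); });   // prints hash:ada
```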

### A catalog of Promise-related bugs

Armed with the ability to create promise graphs, we can now detect a variety of promise-related bugs.

• A Dead Promise occurs when an object enters the Pending state and never transitions to either the fulfilled or rejected state. A promise node with no resolve, reject, or link edges is dead.

• A Missing resolve or reject reaction occurs when a promise is resolved with a non-undefined value, and the promise lacks a fulfil (or reject) reaction. This is a promise node with no outgoing edges.

• A Missing exceptional reject reaction occurs when a promise is implicitly rejected by throwing an exception. These are promise nodes with a reject edge, but no reject registration edge.

• A Missing return occurs when a fulfil or reject reaction unintentionally returns undefined and this value is used to resolve or reject a dependent promise.

• A Double resolve or reject occurs when a promise enters the fulfilled or rejected state, and later the promise is again resolved (or rejected). This can be detected in the graph by multiple resolve (or reject) edges leading to the same promise.

• An Unnecessary promise is a promise whose sole purpose is to act as an intermediary between two other promises.

• A Broken promise chain occurs when a programmer inadvertently creates a fork in a promise chain.
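A double resolve, for example, is easy to reproduce (my example):

```javascript
// Only the first call settles the promise; the second is silently ignored,
// which is exactly why this class of bug is hard to spot without tooling.
var p = new Promise(function (resolve) {
  resolve('first');
  resolve('second');  // no effect: a promise settles at most once
});
p.then(function (v) { console.log(v); });  // prints first
```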

### Promise graphs and StackOverflow

The authors looked at the most recent 600 StackOverflow questions tagged with both promise and javascript (out of over 5,000 total). They focused on those with code fragments of at most 100 lines, a genuine promises-related issue, and an answer. This boiled things down to quite a small list, shown below:

The fourth column in the table above shows the additional information — above and beyond the promise graph — that was needed to pinpoint the bug (if any).

• HB indicates that happens-before information is needed – i.e., the knowledge that one statement in the program always executes before another.
• LU indicates that loop unrolling is necessary to debug the root cause (i.e., what happens in one iteration affects a promise created in a subsequent one).
• EV indicates that event reasoning is needed to debug the root cause – e.g., the knowledge that an event handler for a button can fire multiple times.

Here’s an example analysis:

Q41268953. The programmer creates a promise, but never resolves or rejects it. As a result, the reactions associated with the promise are never executed. The programmer gets lost in the details and mistakenly believes the bug to be elsewhere in the program. He or she then proceeds to add additional logging and reject reactions in all the wrong places and these are never executed. In this scenario, the promise graph immediately identifies that the initial promise in the promise chain is never resolved or rejected, and hence that no reaction on the chain is ever executed.

There are plenty more examples in section 5.3, and in appendix A.

Type test scripts for TypeScript testing Kristensen et al., OOPSLA’17

Today’s edition of The Morning Paper comes with a free tongue-twister; ‘type test scripts for TypeScript testing!’

One of the things that people really like about TypeScript is the DefinitelyTyped repository of type declarations for common (otherwise untyped) JavaScript libraries. There are over 3000 such declaration files now available. This is a great productivity boost, but it’s not perfect:

These declaration files are… written and maintained manually, which leads to many errors. The TypeScript type checker blindly trusts the declaration files, without any static or dynamic checking of the library code.

So it would be great if there was a way of running automated tests to verify that the TypeScript declaration files actually match the implementations in the JavaScript libraries they represent. The authors show that their runtime type checking approach can find more errors with fewer false positives than prior static analysis tools. But it requires a test suite. Where can we get a test suite from?

Our method is based on the idea of feedback-directed random testing as pioneered by the Randoop tool by Pacheco et al. 2007. With Randoop, a (Java) library is tested automatically by using the methods of the library itself to produce values, which are then fed back as parameters to other methods in the library.

The TSTest tool (http://www.brics.dk/tstools) takes as input a JavaScript library and a corresponding TypeScript declaration file. It then dynamically creates a JavaScript program to exercise the library, called a type test script. These scripts exercise the library code by mimicking the behaviour of potential applications, and perform runtime type checking to ensure that values match those specified by the type declarations.

TSTest is set loose on 54 real world libraries, and finds type mismatches in 49 of them.

### An example type mismatch

Here’s a simple example of a type mismatch between the TypeScript declaration for PathJS, and the actual implementation.

First, the TypeScript declaration:

	declare var Path: {
	  root(path: string): void;
	  routes: {
	    root: IPathRoute,
	  }
	};

	interface IPathRoute {
	  run(): void;
	}


And an excerpt from the implementation:

	var Path = {
	  root: function(path) {
	    Path.routes.root = path;
	  },
	  routes: {
	    root: null
	  }
	};


Looking at the type declaration, we see that the parameter path of the root method is declared to be a string. But in the implementation the value of this parameter is assigned to Path.routes.root – which is declared to be of type IPathRoute. A manual examination of the implementation shows that the value really should be a string.

This error is not found by the existing TypeScript declaration file checker TSCheck, since it is not able to relate the side effects of a method with a variable. Type systems such as TypeScript or Flow also cannot find the error, because the types only appear in the declaration file, not as annotations in the library implementation.

When the type test script generated by TStest is run, it will generate output like this, highlighting the error:

*** Type error
property access: Path.routes.root
expected: object
observed: string


See §2 in the paper for an additional and much more involved example using generic types and higher-order functions.

### Generating a type test script

A type test script has a main loop which repeatedly runs random tests against a library until a timeout is reached. Each test calls a library function, and the value returned is checked to see that its type matches the type declaration. Arguments to library functions are either randomly generated, or taken from those produced by the library in previous actions where those values match the function parameter type declarations. Test script applications can also interact with library object properties, which are treated like special kinds of function calls.
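A hedged sketch of what such a main loop might look like; the library, the (imagined) declaration it is checked against, and the helper names here are illustrative, not TStest’s actual generated code:

```javascript
// A toy library with a deliberate mismatch: imagine its declaration file
// says `root(path: string): void`, but the implementation returns the path.
var lib = { root: function (path) { return path; } };

var mismatches = [];
function checkType(value, expected, where) {
  if (typeof value !== expected) {
    mismatches.push('*** Type error\n' + where +
      '\nexpected: ' + expected + '\nobserved: ' + typeof value);
  }
}

var produced = [];  // feedback pool: values previously produced by the library
for (var i = 0; i < 10; i++) {  // stand-in for "until the time budget runs out"
  var arg = (produced.length > 0 && Math.random() < 0.5)
    ? produced[Math.floor(Math.random() * produced.length)]  // reuse a produced value
    : 'random-' + i;                                         // or generate a fresh one
  var result = lib.root(arg);
  checkType(result, 'undefined', 'return value of root()');  // declared void
  if (typeof result === 'string') produced.push(result);     // feed back for later tests
}
console.log(mismatches.length + ' mismatches detected');
```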

The strategy for choosing which tests to perform and which values to generate greatly affects the quality of the testing. For example, aggressively injecting random (but type-correct) values may break internal library invariants and thereby cause false positives, while having too little variety in the random value construction may lead to poor testing coverage and false negatives.

Here’s a snippet from the declaration file for the Async library:

	declare module async {
	  function memoize(
	    fn: Function,
	    hasher?: Function
	  ): Function;
	  function unmemoize(fn: Function): Function;
	}


And the type test script generated for it:

### Tricky things

…adapting the ideas of feedback-directed testing to TypeScript involves many interesting design choices and requires novel solutions…

These challenges include generating values that match a given structural type, and type checking return values with structural types. Type mismatch errors will be reported based on deep checking (checking all reachable objects), but only shallow type checking is required to pass in order for a value to be stored for feedback testing (normally a type failure stops any further testing).

When functions are returned, their types cannot be checked immediately, but only when the functions are invoked.

The blame calculus (Wadler and Findler 2009) provides a powerful foundation for tracking function contracts (e.g. types) dynamically and deciding whether to blame the library or the application code if violations are detected.

Since the application code in this case is known to be well-formed (we generated it), any problems can be blamed on the library.

Recursion with generic types is broken by treating any type parameters involved in recursion as any.

### Soundness and completeness

TStest is not sound: when an error is reported by TStest, that does not necessarily mean there is a mismatch in practice. An example that can cause this is the breaking of recursion in generic types using any. “This is mostly a theoretical issue, we have never encountered false positives in practice.”

TStest is also not complete: there are some mismatches that it can never detect.

Neither of these two things means that it isn’t useful though, which is what we’ll look at next.

### How well does it work?

Evaluation of TStest was conducted using the following libraries:

With time budgets of 10 seconds, 1 minute, and 5 minutes, and multiple runs, here’s how many mismatches TStest reports across all 54 libraries:

Mismatches were found in 49 of the 54 benchmarks, independently of the timeout and the number of repeated runs… The numbers in table 1 are quite large, and there are likely not around 6000 unique errors among the 54 libraries tested. A lot of the detected mismatches are different manifestations of the same root cause. However, our manual study shows that some declaration files do contain dozens of actual errors.

124 of the reported mismatches were randomly sampled for further evaluation (spanning 41 libraries).

• 63 of the 124 are mismatches that a programmer would almost certainly want to fix
• A further 47 were mismatches that a programmer would want to fix, but are only valid when TypeScript’s non-nullable types feature is enabled.
• That leaves 14 of the 124 that turned out to be benign. Some of these are due to limitations in the TypeScript type system, some are due to TStest constructing objects with private behaviour that are really only intended to be constructed by the library itself, and the majority (7/14) are intentional mismatches by the developer:

For various reasons, declaration file authors sometimes intentionally write incorrect declarations even when correct alternatives are easily expressible, as also observed in previous work. A typical reason for such intentional mismatches is to document internal classes.

The authors believe it would be possible to adapt TStest to perform testing of Flow’s library definitions as well.