To round out the week, I thought I’d take a selection of fun papers from the ‘More papers from 2016’ section of top 100 awesome deep learning papers list.

The texture networks paper we’ve covered before, so the link in the above list is to The Morning Paper write-up (but I felt like it belonged in this group nevertheless).

### Colorful image colorization

Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph.

How is this possible? Well, we’ve seen that networks can learn what various parts of the image represent. If you see enough images you can learn that grass is (usually) green, the sky is (sometimes!) blue, and ladybirds are red. The network doesn’t have to recover the actual ground truth colour, just a plausible colouring.

Therefore, our task becomes much more achievable: to model enough of the statistical dependencies between the semantics and the textures of grayscale images and their color versions in order to produce visually compelling results.

Results like this:

Training data for the colourisation task is plentiful – pretty much any colour photo will do. The tricky part is finding a good loss function – as we’ll see soon, many loss functions produce images that look desaturated, whereas we want vibrant realistic images.

The network is based on image data using the CIE Lab colourspace. Grayscale images have only the lightness, L, channel, and the goal is to predict the a (green-red) and b (blue-yellow) colour channels. The overall network architecture should look familiar by now, indeed so familiar that supplementary details are pushed to an accompanying website.

(That website page is well worth checking out by the way, it even includes a link to a demo site on Algorithmia where you can try the system out for yourself on your own images).

Colour prediction is inherently multi-modal, objects can take on several plausible colourings. Apples for example may be red, green, or yellow, but are unlikely to be blue or orange. To model this, the prediction is a distribution of possible colours for each pixel. A typical objective function might use e.g. Euclidean loss between predicted and ground truth colours.

However, this loss is not robust to the inherent ambiguity and multimodal nature of the colorization problem. If an object can take on a set of distinct ab values, the optimal solution to the Euclidean loss will be the mean of the set. In color prediction, this averaging effect favors grayish, desaturated results. Additionally, if the set of plausible colorizations is non-convex, the solution will in fact be out of the set, giving implausible results.

What can we do instead? The ab output space is divided into bins with grid size 10, and the top Q = 313 in-gamut (within the range of colours we want to use) are kept:

The network learns a mapping to a probability distribution over these Q colours (a Q-dimensional vector). The ground truth colouring is also translated into a Q-dimensional vector, and the two are compared using a multinomial cross entropy loss. Notably this includes a weighting term to rebalance loss based on colour-class rarity.

The distribution of ab values in natural images is strongly biased towards values with low ab values, due to the appearance of backgrounds such as clouds, pavement, dirt, and walls. Figure 3(b) [below] shows the empirical distribution of pixels in ab space, gathered from 1.3M training images in ImageNet. Observethat the number of pixels in natural images at desaturated values are orders of magnitude higher than for saturated values. Without accounting for this, the loss function is dominated by desaturated ab values.

The final predicted distribution then needs to be mapped to a point estimated in ab space. Taking the mode of the predicted distribution leads to vibrant but sometimes spatially inconsistent results (see RH column below). Taking the mean brings back another form of the desaturation problem (see LH column below).

To try to get the best of both worlds, we interpolate by re-adjusting the temperature T of the softmax distribution, and taking the mean of the result. We draw inspiration from the simulated annealing technique, and thus refer to the operation as taking the annealed-mean of the distribution.

Here are some more colourings from a network trained on ImageNet, which were rated by Amazon Mechanical Turk participants to see how lifelike they are.

And now for my very favourite part of the paper:

Since our model was trained using “fake” grayscale images generated by stripping ab channels from color photos, we also ran our method on real legacy black and white photographs, as shown in Figure 8 (additional results can be viewed on our project webpage). One can see that our model is still able to produce good colorizations, even though the low-level image statistics of the legacy photographs are quite different from those of the modern-day photos on which it was trained.

Aren’t they fabulous! Especially the Migrant Mother colouring.

The representations learned by the network also proved useful for object classification, detection, and segmentation tasks.

### Generative visual manipulation on the natural image manifold

So we’ve just seen that neural networks can help us with our colouring. But what about those of us that are more artistically challenged and have a few wee issues making realistic looking drawings (or alterations to existing drawings) in the first place? In turns out that generative adversarial neural networks can help.

It’s a grand sounding paper title, but you can think of it as “Fiddling about with images while ensuring they still look natural.” I guess that wouldn’t look quite so good in the conference proceedings ;).

Today, visual communication is sadly one-sided. We all perceive information in the visual form (through photographs, paintings, sculpture, etc), but only a chosen few are talented enough to effectively express themselves visually… One reason is the lack of “safety wheels” in image editing: any less-than-perfect edit immediately makes the image look completely unrealistic. To put another way, classic visual manipulation paradigm does not prevent the user from “falling off” the manifold of natural images.

As we know, GANs can be trained to learn effective representations of natural looking images (“the manifold of natural images“). So let’s do that, but then instead of using the trained GAN to generate images, use it as a constraint on the output of various image manipulation operations, to make sure the results lie on the learned manifold at all times. The result is an interactive tool that helps you make realistic looking alterations to existing images. It helps to see the tool in action, you can see a video here. The authors also demonstrate ‘generative transformation’ of one image to look more like another, and my favourite, creating a new image from scratch based on a user’s sketch.

The intuition for using GANs to learn manifold approximations is that they have been shown to produce high-quality samples, and that Euclidean distance in the latent space often corresponds to a perceptually meaningful visual similarity. This means we can also perform interpolation between points in the latent space.

Here’s what happens when the latent vector is updated based on user edits (top row, adding black colour and changing the shape):

In the interactive tool, each update step takes about 50-100ms, working only on the mapped representation of the original image. When the user is done, the generated image captures roughly the desired change, but the quality is degraded as compared to the original image.

To address this issue, we develop a dense correspondence algorithm to estimate both the geometric and color changes induced by the editing process.

This motion and colour flow algorithm is used to estimate the colour and shape changes in the generated image sequence (as user editing progressed), and then transfer them back on top of the original photo to generate photo-realistic images.

The user interface gives the user a colouring brush for changing the colour of regions, a sketching brush to outline shapes or add fine details, and a warping ‘brush’ for more explicit shape modifications.

Here are some results from user edits:

Transformations between two images also take place in the GAN-learned representation space and are mapped back in the same way:

It’s also possible to use the brush tools to create an image from scratch, and then add more scribbles to refine the result. How good is this! :

### WaveNet: a generative model for raw audio

Enough with the images already! What about generating sound? How about text-to-speech sound generation yielding state of the art performance? Hearing is believing, so check out these samples:

(You can find more on the DeepMind blog at https://deepmind.com/blog/wavenet-generative-model-raw-audio/).

We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.

The architecture of WaveNet is inspired by PixelRNN (See “RNN models for image generation” from a couple of weeks ago). The foundation is very simple – take a waveform $\mathbf{x}$ with with T values, ${x_1, ..., x_T}$. And let the probability of the $x_t$ be conditioned on all of the values that precede it: $p(x_t | x_1, ..., x_{t-1})$. Now the joint probability of the overall waveform is modelled by:

$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})$

This can be modelled by a stack of convolutional layers. “By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data…” (the prediction at timestep t cannot depend on any of the future timestamps). This can be implemented by shifting the output of a normal convolution by one or more timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Causal convolutions need lots of layers to increase their receptive field. WaveNet uses dilated convolutions to increase receptive fields by orders of magnitude, without greatly increasing computational cost.

A dilated convolution is a convolution where the filter is applied over an area large than its length by skipping input values with a certain step.

WaveNet uses dilation doubling in every layer up to a limit of 512, before repeating (1,2,4, …, 512, 1,2,4, …, 512, …).

A straight softmax output layer would need 65,356 probabilities per timestep to model all possible values for raw audio stored as a sequence of 16-bit integer values. The data is quantized to 256 possible values using a non-linear quantization scheme which was found to produce a significantly better reconstruction than a simple linear scheme:
$f(x_t) = sign(x_t) \frac{\ln(1 + \mu|x_t|)}{\ln(1 + \mu)}$

where $-1 < x_t < 1$ and $\mu = 255$. The network uses both residual and parameterised skip connections throughout to speed up convergence. By conditioning the model on additional inputs, WaveNet can be guided to produce audio with the required characteristics (e.g., a certain speaker’s voice). For TTS, information about the text is fed as an extra input. > For the first experiment we looked at free-form speech generation (not conditioned on text). We used the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK) (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers.

Since it wasn’t conditioned on text, the model generates made-up but human language-like words in a smooth way (see second audio clip at the top of this section). It can model speech from any of the speakers by conditioning it on the one-hot encoding – thus the model is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model.

The second experiment trained WaveNet on the same single-speaker speech databases that Google’s North American and Mandarin Chinese TTS systems are built on. WaveNet was conditioned on linguistic features derived from the input texts. In subjective paired comparison tests, WaveNet beat the best baselines:

WaveNet also achieved the highest ever score in a mean opinion score test where users had to rate naturalness on a scale of 1-5 (scoring over 4 on average). (See the first speech sample at the top of this section).

The third set of experiments trained WaveNet on two music datasets (see the third speech sample at the top of this section). “…the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.

From the WaveNet blog post on the DeepMind site:

WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.

### Google’s neural machine translation system: bridging the gap between human and machine translation

Google’s Neural Machine Translation (GNMT) system is an end-to-end learning system for automated translation. Previous NMT systems suffered in one or more of three key areas: training and inference speed, coping with rare words, and sometimes failing to translate all of the words in a source sentence. GNMT is now in production at Google, having handsomely beaten the Phrase-Based Machine Translation (PBMT) system used in production at Google beforehand.

Understanding how it all fits together will draw upon many of the papers we’ve looked at so far. At the core it’s a sequence-to-sequence learning network with an encoder network, a decoder network, and an attention network.

The encoder transforms a source sentence into a list of vectors, one vector per input symbol. Given this list of vectors, the decoder produces one symbol at a time, until the special end-of-sentence symbol (EOS) is produced. The encoder and decoder are connected through an attention module which allows the decoder to focus on different regions of the source sentence during the course of decoding.

The decoder is a combination of an RNN network and a softmax layer. Deeper models give better accuracy, but the team found that LSTM layers worked well up to 4 layers, barely with 6 layers, and very poorly beyond 8 layers. What to do? We learned the answer earlier this week, add residual connections:

Since in translation words in the source sentence may appear anywhere in the output sentence, the encoder network uses a bi-directional RNN for the encoder. Only the bottom layer is bi-direction – one LSTM layer processes the sentence left-to-right, while its twin processes the sentence right-to-left.

The encoder and decoder networks are placed on multiple GPUs, with each layer running on a different GPU. As well as using multiple GPUs, to get inference time down quantized inference involving reduce precision arithmetic is also used.

One of the main challenges in deploying our Neural Machine Translation model to our interactive production translation service is that it is computationally intensive at inference, making low latency translation difficult, and high volume deployment computationally expensive. Quantized inference using reduced precision arithmetic is one technique that can significantly reduce the cost of inference for these models, often providing efficiency improvements on the same computational devices.

The model is trained using full-precision floats, it is only for production inference that approximation is used. Here are the decoding times for 6003 English-French sentences across CPU, GPU, and Google’s Tensor Processing Unit (TPU) respectively:

Firstly, note that the TPU beats the CPU and GPU hands-down. The CPU beats the GPU because “our current decoder implementation is not fully utilizing the computation capacities that a GPU can theoretically offer during inference.”

#### Dealing with out of vocabulary words

Neural Machine Translation models often operate with fixed word vocabularies even though translation is fundamentally an open vocabulary problem (names, numbers, dates etc.)… Our most successful approach […] adopts the wordpiece model (WPM) implementation initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system. This approach is completely data-driven and guaranteed to generate a deterministic segmentation for any possible sequence of characters.

For example, “Jet makers feud over seat width with big orders at stake” turns into the word pieces: “_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.” The words ‘Jet’ and ‘feud’ are both broken into two word pieces.

Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model.

#### Overall model performance.

The following chart shows side-by-side scores for translations made by the previous production system (PBMT), the new GNMT system, and humans fluent in both languages.

Side-by-side scores range from 0 to 6, with a score of 0 meaning “completely nonsense translation”, and a score of 6 meaning “perfect translation: the meaning of the translation is completely consistent with the source, and the grammar is correct”. A translation is given a score of 4 if “the sentence retains most of the meaning of the source sentence, but may have some grammar mistakes”, and a translation is given a score of 2 if “the sentence preserves some of the meaning of the source sentence but misses significant parts”. These scores are generated by human raters who are fluent in both languages and hence often capture translation quality better than BLEU scores.

Today we’re pressing on with the top 100 awesome deep learning papers list, and the section on recurrent neural networks (RNNs). This contains only four papers (joy!), and even better we’ve covered two of them previously (Neural Turing Machines and Memory Networks, the links below are to the write-ups). That leaves up with only two papers to cover today, however the first paper does run to 43 pages and it’s a lot of fun so I’m glad to be able to devote a little more space to it.

These papers are easier to understand with some background in RNNs and LSTMs. Christopher Olah has a wonderful post on “Understanding LSTM networks” which I highly recommend.

### Generating sequences with recurrent neural networks

This paper explores the use of RNNs, in particular, LSTMs, for generating sequences. It looks at sequences over discrete domains (characters and words), generating synthetic wikipedia entries, and sequences over real-valued domains, generating handwriting samples. I especially like the moment where Graves demonstrates that the trained networks can be used to ‘clean up’ your handwriting, showing what a slightly neater / easier to read version of your handwriting could look like. We’ll get to that shortly…

RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming.

Using LSTMs effectively gives the network a longer memory, enabling it to look back further in history to formulate its predictions.

The basic RNN architecture used for all the models in the paper looks like this:

Note how each output vector $y_t$ is used to parameterise a predictive distribution $Pr(x_{t+1} | y_t)$ over the next possible inputs (the dashed lines in the above figure). Also note the use of ‘skip connections’ as we looked at in yesterday’s post.

The LSTM cells used in the network look like this:

They are trained with the full gradient using backpropagation. To prevent the derivatives becoming too large, the derivative of the loss with respect to the inputs to the LSTM layers are clipped to lie within a predefined range.

Onto the experiments…

#### Text prediction

For text prediction we can either use sequences of words, or sequences of characters. With one-hot encodings, the number of different classes for words makes for very large input vectors (e.g. a vocabulary with 10’s of thousands of words of more). In contrast, the number of characters is much more limited. Also,

… predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.

The Penn Treebank dataset is a selection of Wall Street Journal articles. It’s relatively small at just over a million words in total, but widely used as a language modelling benchmark. Both word and character level networks were trained on this corpus using a single hidden layer with 1000 LSTM units. Both networks are capable of overfitting the training data, so regularisation is applied. Two forms of regularisation were experimented with: weight noise applied at the start of each training sequence, and adaptive weight noise, where the variance of the noise is learned along with the weights.
The word-level RNN performed better than the character-based one, but the gap closes with regularisation (perplexity of 117 in the best word-based configuration, vs 122 for the best character-based configuration).

“Perplexity can be considered to be a measure of on average how many different equally most probable words can follow any given word. Lower perplexities represent better language models…” ([][] http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/node188_mn.html )

Much more interesting is a network that Graves trains on the first 96M bytes of the Wikipedia corpus (as of March 3rd 2006, captured for the Hutter prize competition). This has seven hidden layers of 700 LSTM cells each. This is an extract of the real Wikipedia data:

And here’s a sample generated by the network (for additional samples, see the full paper):

The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.

It can correctly open and close quotation marks and parentheses, indicating the models memory and these often span a distance that a short-range context cannot handle. Likewise, it can generate distinct large-scale regions such as XML headers, bullet-point lists, and article text.

Of course, the actual generated articles don’t make any sense to a human reader, it is just their structure that is mimicked. When we move onto handwriting though, the outputs do make a lot of sense to us…

#### Handwriting prediction

To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.

The dataset consists of handwritten lines on a smart whiteboard, with x,y co-ordinates and end-of-stroke markers (yes/no) captured at each time point. The main challenge was figuring out how to determine a predictive distribution for real-value inputs. The solution is to use mixture density neworks. Here the outputs of the network are used to parameterise a mixture distribution. Each output vector consists of the end of stroke probability e, along with a set of means, standard deviations, correlations, and mixture weights for the mixture components used to predict the x and y positions. See pages 20 and 21 for the detailed explanation.

Here are the mixture density outputs for predicted locations as the word under is written. The small blobs show accurate predictions while individual strokes are being written, and the large blobs show greater uncertainty at the end of strokes when the pen is lifted from the whiteboard.

The best samples were generated by a network with three hidden layers of 400 LSTM cells each, and 20 mixture components to model the offsets. Here are some samples created by the network.

The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character level language models, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures

#### Handwriting generation

Those samples do of course look like handwriting, but as with our Wikipedia example, the actual words are nonsense. Can we learn to generated handwriting for a given text? To meet this challenge a soft window is convolved with the text string and fed as an extra input to the prediction network.

The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.

The network learns how far to slide the text window at each step, rather than learning an absolute position. “Using offsets was essential to getting the network to align the text with the pen trace.

And here are samples generated by the resulting network:

Pretty good!

#### Biased and primed sampling to control generation

One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output more probable elements of Pr(x|c) if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results.

As we increase the bias towards higher probability predictions, the handwriting gets neater and neater…

As a final flourish, we can prime the network with a real sequence in the handwriting of a particular writer. The network then continues in this style, generating handwriting mimicking the author’s style.

Combine this with bias, and you also get neater versions of their handwriting!

### Conditional random fields as recurrent neural networks

Now we turn our attention to a new challenge problem that we haven’t looked at yet: semantic segmentation. This requires us to label the pixels in an image to indicate what kind of object they represent/are part of (land, building, sky, bicycle, chair, person, and so on…). By joining together regions with the same label, we segment the image based on the meaning of the pixels. Like this:

(The CRF-RNN column in the above figure shows the results from the network architecture described in this paper).

As we’ve seen, CNNs have been very successful in image classification and detection, but there are challenges applying them to pixel-labelling problems. Firstly, traditional CNNs don’t produce fine-grained enough outputs to label every pixel. But perhaps more significantly, even if we could overcome that hurdle, they don’t have any way of understanding that if pixel A is part of, say, a bicycle, then it’s likely that the adjacent pixel B is also part of a bicycle. Or in more fancy words:

CNNs lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output. Lack of such smoothness constraints can result in poor object delineation and small spurious regions in the segmentation output.

Conditional Random Fields (a variant of Markov Random Fields) are very good at smoothing. They’re basically models that take into account surrounding context when making predictions. So maybe we can combine Conditional Random Fields (CRF) and CNNs in some way to get the best of both worlds?

The key idea of CRF inference for semantic labelling is to formulate the label assignment problem as a probabilistic inference problem that incorporates assumptions such as the label agreement between similar pixels. CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks in utilizing CNNs for pixel-level labelling tasks.

Sounds good in theory, but it’s quite tricky in practice. The authors proceed in two stages: firstly showing that one iteration of the mean-field algorithm used in CRF can be modelled as a stack of common CNN layers; and secondly by showing that repeating the CRF-CNN stack with outputs from the previous iteration fed back into the next iteration you can end up with an RNN structure, dubbed CRF-RNN, that implements the full algorithm.

Our approach comprises a fully convolutional network stage, which predicts pixel-level labels without considering structure, followed by a CRF-RNN stage, which performs CRF-based probabilistic graphical modelling for structured prediction. The complete system, therefore, unifies the strengths of both CNNs and CRFs and is trainable end-to-end using the back-propagation algorithm and the Stochastic Gradient Descent (SGD) procedure.

There’s a lot of detail in the paper, some of which passes straight over my head, for example, the following sentence which warranted me bringing out the ‘hot pink’ highlighter:

In terms of permutohedral lattice operations, this can be accomplished by only reversing the order of the separable filters in the blur stage, while building the permutohedral lattice, splatting, and slicing in the same way as in the forward pass.

(What is a permutohedron you may ask? It’s actually not as scary as it sounds…)

Fortunately, we’re just trying to grok the big picture in this write-up, and for that the key is to understand how CNNs can model one mean-field iteration, and then how we stack the resulting structures in RNN formation.

#### Mean-field iteration as a stack of CNN layers

Consider a vector X with one element per pixel, representing the label assigned to that pixel drawn from some pre-defined set of labels. We construct a graph where the vertices are the elements in X, and edges between the elements hold pairwise ‘energy’ values. Minimising the overall energy of the configuration yields the most probable label assignments. Energy has two components: a unary component which depends only on the individual pixel and roughly speaking, predicts labels for pixels without considering smoothness and consistency; and pairwise energies that provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties. The energy calculations are based on feature vectors derived from image features. Mean-field iteration is used to find an approximate solution for the minimal energy configuration.

The steps involved in a single iteration are:

• message passing,
• re-weighting,
• compatibility transformation,
• normalisation

Message passing is made tractable by using approximation techniques (those permutohedral lattice thingies) and two Guassian kernels: a spatial kernel and a bilateral kernel. Re-weighting can be implemented as a 1×1 convolution. Each kernel is given independent weights:

The intuition is that the relative importance of the spatial kernel vs the bilateral kernel depends on the visual class. For example, bilateral kernels may have on the one hand a high importance in bicycle detection, because similarity of colours is determinant; on the other hand they may have low importance for TV detection, given that whatever is inside the TV screen may have different colours.

Compatibility transformation assigns penalties when different labels are assigned to pixels with similar properties. It is implemented with a convolutional filter with learned weights (equivalent to learning a label compatibility function).

The addition (copying) and normalisation (softmax) operations are easy.

#### CRF as a stack of CRF-CNN layers

Multiple mean-field iterations can be implemented by repeating the above stack of layers in such a way that each iteration takes Q value estimates from the previous iteration and the unary values in their original form. This is equivalent to treating the iterative mean-field inference as a Recurrent Neural Network (RNN)… We name this RNN structure CRF-RNN.

Recall that the overall network has a fully-convolution network stage predicting pixels labels in isolation, followed by a CRF-CNN for structured prediction. In one forward pass the computation goes through the initial CNN stage, and then it takes T iterations for data to leave the loop created by the RNN. One the data leaves this loop, a softmax loss layer directly follows and terminates the network.

The resulting network achieves the state-of-the-art on the Pascal VOC 2010-2012 test datasets.

Today we’re looking at the final four papers from the ‘convolutional neural networks’ section of the ‘top 100 awesome deep learning papers‘ list.

### Deep residual learning for image recognition

Another paper, another set of state-of-the-art results, this time with 1st place on the ILSVRC 2015 classification tasks (beating GoogLeNet from the year before), as well as 1st place on ImageNet detection, ImageNet localisation, COCO detection, and COCO segmentation competitions. For those of you old enough to remember it, there’s a bit of a Crocodile Dundee moment in this paper: “22 layers? That’s not a deep network, this is a deep network…” How deep? About 100 layers seems to work well, though the authors tested networks up to 1202 layers deep!! Which begs the question, how on earth do you effectively train a network that is hundreds or even a thousand layers deep?

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers?

No, it’s not. A degradation problem occurs as network depth increases, accuracy gets saturated, and then degrades rapidly as further layers are added. So after a point, the more layers you add, the worse the error rates. For example:

An interesting thought experiment leads the team to make a breakthrough and defeat the degradation problem. Imagine a deep network where after a certain (relatively shallow) number of layers, each additional layer is simply an identity mapping from the previous one: “the existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.”

Suppose the desired mapping through a number of layers for certain features really is close to the identity mapping. The degradation problem suggests that the network finds it hard to learn such mappings across multiple non-linear layers. Lets say the output of one layer is $\mathbf{x}$. Now skip forward three layers in the network and imagine that we’d like the ability to easily learn an identity mapping (or close to) for some elements of $\mathbf{x}$ – an easy way to do this is just to provide the elements of $\mathbf{x}$ as direct inputs…

But we don’t want to just double the width of the receiving layer, and for some other elements of $\mathbf{x}$ perhaps we do want to learn some non-linear function. So here comes the clever part: suppose the ideal function we could learn is $H(\mathbf{x})$. If the intervening hidden layers can learn that then they can also learn $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$. Maybe you can see where this is going: we can now simply combine the output of the previous layer with $\mathbf{x}$ using simple addition and we get $F(\mathbf{x}) + \mathbf{x} = H (\mathbf{x})$ ! But constructing things in this way though, we make it easy for the network to include both non-linear elements and also near identity elements (by learning weights close to zero).

Adding these layer jumping + addition short-circuits to a deep network creates a residual network. Here’s a 34-layer example:

Here are 18 and 34 layer networks trained on ImageNet without any residual layers. You can see the degradation effect with the 34 layer network showing higher error rates.

Take the same networks and pop in the residual layers, and now the 34-layer network is handsomely beating the 18-layer one.

Here are the results all the way up to 152 layer networks:

Now that’s deep!

### Identity mappings in deep residual networks

This paper analyses the skip-layers introduced in the residual networks (ResNets) that we just looked at to see whether the identity function is the best option for skipping. It turns out that it is, and that using an identity function for activation as well makes residual units even more effective.

Recall that we used $F(\mathbf{x}) + \mathbf{x}$ as the input when skipping layers. More generally, this is $F(\mathbf{x}) + h(\mathbf{x})$ where h is the identity function. But h could be also be another kind of function – for example constant scaling. The authors try a variety of different h functions, but identity works best. When we consider the activation function f as well (the original paper used ReLU) then we have $f(F(\mathbf{x}) + h(\mathbf{x}))$. Things work even better when f is also an identity mapping (instead of ReLU).

To construct an identity mapping f, we view the activation functions (ReLU and Batch Normalization) as “pre-activation” of the weight layers, in contrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in Fig 1(b).

Making both h and f identity mappings allows a signal to be directly propagated from one unit to any other units, in both forward and backward passes.

Here’s the difference the new residual unit design makes when training a 1001 layer ResNet:

Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet. We further report improved results on ImageNet using a 200-layer ResNet, for which the counterpart of [the original ResNet] starts to overfit. These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.

### Inception-v4, inception-resnet and the impact of residual connections or learning

While all this ResNet excitement was going on, the Inception team kept refining their architecture, up to Inception v3. The obvious question became: what happens if we take the Inception architecture and we add residual connections to it? Does that further improve training time or accuracy? This paper compares four models: Inception v3, a newly introduced in this paper Inception v4, and variations of Inception v3 and v4 that also have residual connections. It’s also fun to see ever more complex network building blocks being used as modules in higher level architectures. Inception v4 is too much to show in one diagram, but here’s the overall schematic:

And as a representative selection, here’s what you’ll find if you dig into the stem module:

And the ’17 x 17 Inception-B’ module:

Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin.

Since Inception 4 and Inception-ResNet-v2 (Inception 4 with residual connections) give overall very similar results, this seems to suggest that – at least at this depth – residual networks are not necessary for training deep networks.

However, the use of residual connections seems to improve the training speed greatly, which is alone a great argument for their use.

Both the Inception v4 and Inception ResNet-2 models outperform previous Inception networks, by virtue of the increased model size.

### Rethinking the Inception architecture for computer vision

This paper comes a little out of order in our series, as it covers the Inception v3 architecture. The bulk of the paper though is a collection of advice for designing image processing deep convolutional networks. Inception v3 just happens to be the result of applying that advice.

In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways.

• Avoid representational bottlenecks – representation size should gently decrease from the inputs to the outputs before reaching the final representation used for task at hand. Big jumps (downward) in representation size cause extreme compression of the representation and bottleneck the model.
• Higher dimensional representationsare easier to process locally in a network, more activations per tile allows for more disentangled features. The resulting networks train faster.
• Spatial aggregation of lower dimensional embeddings can be done without much or any loss in representational power. “For example, before performing a more spread out (e.g. 3×3 convolution), one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects.”
• Balance network width and depth, optimal performance is achieved by balancing the number of filters per stage and the depth of the network. I.e., if you want to go deeper you should also consider going wider.

Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.

• Factorize into smaller convolutions, a larger (e.g. 5×5) convolution is disproportionately more expensive than a smaller (e.g. 3×3) one – by a factor of 25/9 in this case. Replacing the 5×5 convolution with a two-layer network of 3×3 convolutions reusing activations between adjacent tiles achieves the same end but uses (9+9)/25 less computation.

The above results suggest that convolutions with filters larger 3 × 3 a might not be generally useful as they can always be reduced into a sequence of 3 × 3 convolutional layers. Still we can ask the question whether one should factorize them into smaller, for example 2×2 convolutions. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two layer network with the same receptive field as in a 3×3 convolution. The two-layer solution is 33% cheaper.

Taking this idea even further (e.g. replacing 7×7 with a 1×7 followed by a 7×1) works well on medium grid sizes (between 12 and 20), so long as it is not used in early layers.

The authors also revisit the question of the auxiliary classifiers used to aid training in the original Inception. “Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training… near the end of training the network with the auxiliary branches starts to overtake the accuracy of the network without, and reaches a slightly higher plateau.” Removing the lower of the two auxiliary classifiers also had no effect.

Together with the earlier observation in the previous paragraph, this means that original the hypothesis of [Inception] that these branches help evolving the low-level features is most likely misplaced. Instead, we argue that the auxiliary classifiers act as regularizer.

There are several other hints and tips in the paper that we don’t have space to cover. It’s well worth checking out if you’re building these kinds of models yourself.

Today it’s the second tranche of papers from the convolutional neural nets section of the ‘top 100 awesome deep learning papers‘ list:

### Return of the devil in the details: delving deep into convolutional nets

This is a very nice study. CNNs had been beating handcrafted features in image recognition tasks, but it was hard to pick apart what really accounted for the differences (and indeed, the differences between different CNN models too) since comparisons across all of them were not done on a shared common basis. For example, we’ve seen augmentation typically used with CNN training – how much of the difference between IVF (the Improved Fisher Vector hand-engineered feature representation) and CNNs is attributed to augmentation, and not the image representation used? So Chatfield et al. studied IFV shallow representation, three different CNN-based deep representations, and deep representations with pre-training and then fine-tuning on the target dataset. For all of the studies, the same task (PASCAL VOC classification) was used. The three different CNN representations are denoted CNN-F (Fast) – based on Krizhevksy’s architecture; CNN-M (Medium) – using a decreased stride and smaller receptive field of the first convolutional layer; and CNN-S (Slow) based on the ‘accurate’ network from OverFeat.

Key findings:

• Augmentation improves performance by ~3% for both IFV and CNNs. (Or to put it another way, the use of augmentation accounts for 3% of the advantage attributed to deep methods). Flipping on its own helps only marginally, but flipping combined with cropping works well.
• Both IFV and CNNs are affected by adding or subtracting colour information. Retraining CNNs after converting images to grayscale results in about a 3% performance drop.
• CNN based methods still outperform shallow encodings, even accounting for augmentation improvements etc., by a large approximately 10% margin.
• CNN-M and CNN-S both outperform CNN-Fast by 2-3%. CNN-M is about 25% faster than CNN-S.
• Retraining the CNNs so that the final layer was of lower dimensionality resulted in a marginal performance boost.
• Fine-tuning makes a significant difference, improving results by about 2.7%.

In this paper we presented a rigorous empirical evaluation of CNN-based methods for image classification, along with a comparison with more traditional shallow feature encoding methods. We have demonstrated that the performance of shallow representations can be significantly improved by adopting data augmentation, typically used in deep learning. In spite of this improvement, deep architectures still outperform the shallow methods by a large margin. We have shown that the performance of deep representations on the ILSVRC dataset is a good indicator of their performance on other datasets, and that fine-tuning can further improve on already very strong results achieved using the combination of deep representations and a linear SVM.

### Spatial pyramid pooling in deep convolutional networks for visual recognition

The CNN architectures we’ve looked at so far have a series of convolutional layers (5 is popular) followed by fully connected layers and an N-way softmax output. One consequence of this is that they can only work with images of fixed size (e.g 224 x 224). Why? The sliding windows used in the convolutional layers can actually cope with any image size, but the fully-connected layers have a fixed sized input by construction. It is the point at which we transition from the convolution layers to the fully-connected layers therefore that imposes the size restriction. As a result images are often cropped or warped to fit the size requirements of the network, which is far from ideal.

Spatial pyramid pooling (SPP) adds a new layer between the convolutional layers and the fully-connected layers. Its job is to map any size input down to a fixed size output. The ideal of spatial pyramid pooling, also know as spacial pyramid matching or just ‘multi-level pooling’ pre-existed in computer vision, but had not been applied in the context of CNNs.

SPP works by dividing the feature maps output by the last convolutional layer into a number of spatial bins with sizes proportional to the image size, so the number of bins is fixed regardless of the image size. Bins are captured at different levels of granularity – for example, one layer of 16 bins dividing the image into a 4×4 grid, another layer of 4 bins dividing the image into a 2×2 grid, and a final layer comprising the whole image. In each spatial bin, the responses of each filter are simply pooled using max pooling.

Since the number of bins is known, we can just concatenate the SPP outputs to give a fixed length representation (see the figure above).

This not only allows arbitrary aspect ratios, but also allows arbitrary scales… When the input image is at different scales, the network will extract features at different scales. Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation, which is also investigated in several concurrent works.

An SPP layer added to four different networks architects, including AlexNet (Krizhevsky et al.) and OverFeat improved the accuracy of all of them. “The gain of multi-level pooling is not simply due to more parameters, rather it is because the multi-level pooling is robust to the variance in object deformations and spatial layout.

The SPP technique can also be used for detection. The state-of-the-art (as of time of publication) R-CNN method runs feature extraction on each of 2000 candidate windows extracted from an input image. This is expensive and slow. An SPP-net used for object detection extracts feature maps only once (possible at multiple scales). Then just the spatial pyramid pooling piece is run once for each candidate window. This turns out to give comparable results, but with running times 38x-102x faster depending on the number of scales.

### Very deep convolutional networks for large-scale image recognition

We’ve now seen the ConvNet architecture and some variations exploring what happens with different window sizes and strides, and training and testing at multiple scales. In this paper Simonyan and Zisserman hold all those variables fixed, and explore what effect the depth of the network has on classification accuracy.

The basic setup is a fixed-size 224 x 224 RGB input image with mean pixel value (computed over the training set) subtracted from each pixel. A stack of convolutional layers (with varying depth in each of the experiments) uses filters with a 3×3 receptive field, and in one configuration a layer is added with a 1×1 field (which can be seen as a linear transformation of the input channels, followed by non-linearity). The stride is fixed at 1 pixel. Spatial pooling is carried out by five max-pooling layers interleaved with the convolutional layers. This stack is then feed into three fully-connected layers and a final soft-max layer. The hidden layers all use ReLU activation.

Given this basic structure, the actual networks evaluated are shown in the table below (note that only one of them uses local response normalisation – LRN):

Here’s what the authors find:

Firstly, local response normalisation did not improve accuracy (A-LRN vs A), but adds to training time, so it is not employed in the deeper architectures. Secondly, classification error decreases with increased ConvNet depth (up to 19 layers in configuration E).

The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.

The results above were achieved when training at a single scale. Even better results were achieved by adding scale jittering at training time (lightly rescaled versions of the original image).

### Going deeper with convolutions

So now things start to get really deep! This is the paper that introduced the ‘Inception’ network architecture, and a particular instantiation of it called ‘GoogLeNet’ which achieved a new state of the art in the 2014 ISLVRC (ImageNet) competition. GoogLeNet is 22 layers deep, and has a pretty daunting overall structure, which I thought I’d just include here in its full glory!

Despite the intimidating looking structure, GoogLeNet actually uses 12x fewer parameters than the winning Krizhevsky ConvNet of two years prior. At the same time, it is significantly more accurate. Efficiency in terms of power and memory use was an explicit design goal of the Inception architecture:

It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor [efficiency] rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world us, even on large datasets, at a reasonable cost.

You can always make your network ‘bigger’ (both in terms of number of layers – depth, as well as the number of units in each layer – width), and in principle this leads to higher quality models. However, bigger networks have many more parameters making them prone to overfitting. To avoid this you need much more training data. They also require much more computational resource to train. “For example, in a deep vision network if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase in computation.

One way to counteract this is to introduce sparsity. Arora et al., in “Provable bounds for learning some deep representations” showed that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network technology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs.

Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures.

Is there any hope for an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits current hardware by utilizing computations on dense matrices? The Inception architecture started out as an exploration of this goal.

The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components.

Given the layer-by-layer construction approach, this means we just have to find the optimal layer structure, and then stack it. The basic structure of an Inception layer looks like this:

The 1×1 convolutions detect correlated units in local regions, and the larger (3×3 and 5×5) convolutions detect the (smaller number) of more spatially spread out clusters. Since pooling has been shown to have a beneficial effect, a pooling path in each stage is added for good luck too.

The problem with the structure as shown above though, is that it is prohibitively expensive!

While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.

The solution is to reduce the network dimensions using 1×1 convolutions on all the 3×3 and 5×5 pathways. “Beside being used as reductions, they also include the use of rectified linear activation making them dual purpose.” This gives a final structure for an Inception layer that looks like this:

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.

You can see these layers stacked on top of each other in the GoogLeNet model. That network is 22 layers deep (27 if the pooling layers are also counted). The overall number of building blocks used for the construction of the network is about 100! Propagating gradients all the way back through so many layers is a challenge. We know that shallower networks still have strong discriminative performance, and this fact was exploited by adding auxiliary classifiers connected to intermediate layers (yellow boxes in the overall network diagram at the start of this section)…

… These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.

Having recovered somewhat from the last push on deep learning papers, it’s time this week to tackle the next batch of papers from the ‘top 100 awesome deep learning papers.’ Recall that the plan is to cover multiple papers per day, in a little less depth than usual per paper, to give you a broad sweep of what’s in them (See “An experiment with awesome deep learning papers“). You’ll find the first batch of papers in the archives starting from February 27th. For the next three days, we’ll be tackling the papers from the ‘convolutional neural network models’ section, starting with:

Ujjwal Karn’s excellent blog post “An intuitive explanation of convolutional neural networks” provides a some great background on how convolutional networks work if you need a refresher before diving into these papers.

### ImageNet classification with deep convolutional neural networks

This is a highly influential paper that kicked off a whole stream of work using deep convolutional neural networks for image processing. Two factors changed that made this possible: firstly, the availability of large enough datasets (specifically, the introduction of ImageNet with millions of images, whereas the previous largest datasets had ‘only’ tens of thousands; and secondly the development of powerful enough GPUs to efficiently train large networks.

What makes CNNs such a good fit for working with image data?

Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.

The network that Krizhevsky et al. constructed has eight learned layers – five of them convolutional and three fully-connected. The output of the last layer is a 1000-way softmax which produces a distribution over the 1000 class labels (the ImageNet challenge is to create a classifier that can determine which object is in the image). Because the network is too large to fit in the memory of one GPU, training is split across two GPUs and the kernels of the 2nd, 4th, and 5th convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU.

The authors single out four other aspects of the model architecture that they feel are particularly important:

1. The use of ReLU activation (instead of tanh). “Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units… Faster learning has a great influence on the performance of large models trained on large datasets
2. Using multiple GPUs (two!), and splitting the kernels between them with cross-GPU communication only in certain layers. The scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively compared to a net with half as many kernels in each layer and trained on just one GPU.
3. Using local response normalisation, which “implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels”.
4. Using overlapping pooling. Let pooling layers be of size z x z, and spaced s pixels apart. Traditionally pooling was used with s = z, so that there was no overlap between pools. Krizhevsky et al. used s = 2 and z = 3 to give overlapping pooling. This reduced the top-1 and top-5 error rates by 0.4% and 0.3% respectively.

To reduce overfitting dropout and data augmentation (translations, reflections, and principal component manipulation) is used during training.

The end result:

On ILSVRC-2010, our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%.

In ILSVRC-2012, the network achieved a top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. That huge winning margin sparked the beginning of a revolution.

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network’s performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.

For a more in-depth look at this paper, see my earlier write-up on The Morning Paper.

### Maxout networks

Maxout networks are designed to work hand-in-glove with dropout. As you may recall, training with dropout is like training an exponential number of models all sharing the same parameters. Maxout networks are otherwise standard multilayer perceptron or deep CNNs that use a special activation function called the maxout unit. The output of a maxout unit is simply the maximum of its inputs.

In a convolutional network, a maxout feature map can be constructed by taking the maximum across k affine feature maps
(i.e., pool across channels, in addition spatial locations). When training with dropout, we perform the elementwise multiplication with the dropout mask immediately prior to the multiplication by the weights in all cases–we do not drop inputs to the max operator.

Maxout units make a piecewise linear approximation to an arbitrary convex function, as illustrated below.

Under evaluation, the combination of maxout and dropout achieved state of the art classification performance on MNIST, CIFAR-10, CIFAR-100, and SVHN (Street View House Numbers).

Why does it work so well? Dropout does exact model averaging in deeper architectures provided that they are locally linear among the space of inputs to each layer that are visited by applying different dropout masks…

We argue that dropout training encourages maxout units to have large linear regions around inputs that appear in the training data… Networks of linear operations and maxout may learn to exploit dropout’s approximate model averaging technique well.

In addition, rectifier units that saturate at zero are much more common with dropout training. A zero value stops the gradient from flowing through the unit making it hard to change under training and become active again.

Maxout does not suffer from this problem because gradient always flows through every maxout unit–even when a maxout unit is 0, this 0 is a function of the parameters and may be adjusted. Units that take on negative activations may be steered to become positive again later.

### Network in Network

A traditional convolutional layer applies convolution by applying linear filters to the receptive field, followed by nonlinear activation. The outputs are called feature maps. In this paper Lin et al. point out that such a process cannot learn representations that distinguish well between non-linear concepts.

The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept.

Even maxout networks impose the constraint that instances of a latent concept lie within a convex set in the input space, to which they make a piecewise linear approximation. A key question therefore, is whether or not input features do indeed require non-linear functions in order to best represent the concepts contained within them. The authors assert that they do:

…the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear function of the input. In NIN, the GLM is replaced with a ”micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.

And that’s the big idea right there, replace the linear convolutional layer with a mini multilayer perceptron network (called an mlpconv layer). We know that such networks are good at learning functions, so let’s just allow the mlpconv network to learn what the best convolution function is. Since the mlpconv layers sit inside a larger network model, the overall approach is called “network in network.”

The second change that Lin et al. make to the traditional architecture comes at the last layer:

Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondance between feature maps and categories, which is made possible by a stronger local modeling using the micro network.

On CIFAR-10 (10 classes of natural images with 50,000 traning images in total, and 10,000 testing images), the authors beat the then state-of-the-art by more than one percent.

On CIFAR-100 they also beat the then best performance (without data augmentation, which was also not used to evaluate the N-in-N approach) by more than one percent. With the street view house numbers dataset, and with MNIST the authors get good results, but not quite state-of-the-art.

### OverFeat: Integrated recognition, localization and detection using convolutional networks

OverFeat shows how the features learned by a CNN-based classifier can also be used for localization and detection. On the ILSVRC 2013 dataset OverFeat ranked 4th in classification, 1st in localization, and 1st in detection. We know what the classification problem is (what object is in this image), but what about the localization and detection problems? Localization is like classification, but the network must also produce a bounding box showing the boundary of the detected object:

The detection problem involves images which may contain many small objects, and the network must detect each object and draw its bounding box:

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes.

There’s a lot of detail in the OverFeat paper that we don’t have space to cover, so if this work is of interest this paper is one where I definitely recommend going on to read the full thing.

Since objects of interest (especially in the later detection task) can vary significantly in size and position within the image, OverFeat applies a ConvNet at multiple locations within the image in a sliding window fashion, and at multiple scales. The system is then trained to produce a prediction of the location and size of the bounding box containing an object relative to the window. Evidence for each object category is accumulated at each location and size.

Starting with classification, the model is based on Krizhevsky et al. (our first paper today) in the first five layers, but no contrast normalisation is used, pooling regions are non-overlapping, and a smaller stride is used (2 instead of 4). Six different scales of input are used, resulting in unpooled layer 5 maps of varying resolution. These are then pooled and presented to the classifier.

The following diagram summarises the classifier approach built on top of the layer 5 feature maps:

At an intuitive level, the two halves of the network — i.e. feature extraction layers (1-5) and classifier layers (6-output) — are used in opposite ways. In the feature extraction portion, the filters are convolved across the entire image in one pass. From a computational perspective, this is far more efficient than sliding a fixed-size feature extractor over the image and then aggregating the results from different locations. However, these principles are reversed for the classifier portion of the network. Here, we want to hunt for a fixed-size representation in the layer 5 feature maps across different positions and scales. Thus the classifier has a fixed-size 5×5 input and is exhaustively applied to the layer 5 maps. The exhaustive pooling scheme (with single pixel shifts (∆x, ∆y)) ensures that we can obtain fine alignment between the classifier and the representation of the object in the feature map.

#### Localization

For localization the classifier layers are replaced by a regression networks trained to predict bounding boxes at each spatial location and scale. The regression predictions are then combined, along with the classification results at each location. Training with multiple scales ensure predictions match correctly across scales, and exponentially increases the confidence of the merged predictions.

Bounding boxes are combined based on the distance between their centres and the intersection of their areas, and the final prediction is made by taking the merged bounding boxes with maximum class scores. Figure 6 below gives a visual overview of the process:

#### Detection

Detection training is similar to classification training, but with the added necessity of predicting a background task when no object is present.

Traditionally, negative examples are initially taken at random for training, then the most offending negative errors are added to the training set in bootstrapping passes… we perform negative training on the fly, by selecting a few interesting negative examples per image such as random ones or most offending ones. This approach is more computationally expensive, but renders the procedure much simpler.

Omid, reloaded: scalable and highly-available transaction processing Shacham et al., FAST ’17

Omid is a transaction processing service powering web-scale production systems at Yahoo that digest billions of events per day and push them into a real-time index. It’s also been open-sourced and is currently incubating at Apache as the Apache Omid project. What’s interesting about Omid is that it adds transaction processing as a layer on top of an underlying distributed key-value store. In theory the underlying store is pluggable, but in practice it looks pretty tied to Apache HBase (and that’s the implementation you’ll find in Apache Omid). The idea is similar in spirit to the ‘bolt-on’ designs for strengthening consistency guarantees that we’ve looked at in previous editions of The Morning Paper.

Omid is designed for industrial use, leading to two high-level design principles:

1. Build on top of battle-tested key-value store technology (don’t reinvent this layer)
2. Strive for simplicity in operations: it should be easy to deploy, support, maintain, and monitor in production. This goal led the team to a design with a single centralized transaction manager (running in a primary-backup configuration).

That second design point sounds distinctly unfashionable of course, but Omid works hard to restore scalability for throughput-oriented workloads (you won’t be using Omid for any low latency requirements at the moment), and continued availability in the face of failure. Still, it’s interesting to compare the Omid design to the HopFS design that we looked at last week.

Omid scales to hundreds of thousands of transactions per second, over multi-petabyte shared storage. It supports snapshot isolation (not the full strength serializability). This means you can still see write skew under Omid – see “A Critique of ANSI SQL Isolation Levels” for all the gory details.

It’s helpful to understand how Omid is deployed at Yahoo for the Sieve content management system to orient ourselves. Apache HBase is used to store both application data and the commit table (CT, we’ll look at this soon…). HBase, the primary and backup transaction manager nodes, and ZooKeeper are all deployed on separate dedicated machines. HBase tables are sharded into multiple regions (10 for Sieve), managed by region servers (5 for Sieve). HBase is deployed on top of HDFS with 3-fold replication, and petabytes of data under management.

### High-level architecture

Omid stores all data (and it’s own transaction management data) in an HBase data plane. This provides persistence, scalability and high availability for data out of the box. Omid itself manages the transaction control plane via a centralized transaction manager (TM). The TM performs version (timestamp) allocation, conflict detection, and persistent logging of commits. Transaction metadata is stored in HBase, but conflict management is done entirely in RAM in the TM node (scale-up, rather than scale-out).

The transaction metadata consists of a commit table (CT) with information on committed transactions, and an additional commit field (cf) added to every key-value pair in the data store.

### Transaction processing

Transactions move through three states: initiated, committed, and completed. Transaction processing is accomplished via a curious dance between clients and the TM. Let’s walkthrough a sample transaction that reads one value, writes one value, and then commits.

• (A) The client tells the TM it wants to begin a new transaction. The TM increments its logical clock and returns the clock timestamp which is used as both the transaction id $tx_{id}$ and the transaction read timestamp $ts_r$.
• (B)The client wants to read the value of a key k. It issues a GET request to the k-v store, which must support multi-versioning. The $ts_r$ timestamp is passed with the GET request, and the data store returns all versions with a version smaller than $ts_r$. The client now has to find the most recent committed version from this list. With the list ordered by version, most recent first, the client can examine each in turn until it finds a committed value. When looking at a version, if its commit-field contains a non-nil value then we know that this value belongs to a committed transaction and we return it. If the commit field contains nil though then this is a tentative write from some other transaction. It may be though that that transaction has actually committed and just not updated the commit-field value yet, to catch this scenario, the client checks the commit table following a get tentative value procedure. This procedure will make more sense once we’ve seen how commit works, so I’ll come back to it then.
• (C) The client wants to PUT a value for some key. It adds the hash of the key to its write set and then PUTs the key-value pair with a nil commit-field value to indicate that this write is still tentative pending transaction commit.
• (D)The client now decides to commit the transaction. It sends the transaction id and the write set to the transaction manager. The transaction manager updates its clock and assigns a commit timestamp, $ts_c$. The transaction manager checks for conflicts to be sure that none of the keys in the write set have versions larger than $tx_{id}$. (Recall that the transaction id is also the timestamp that the transaction was started, so we’re checking that none of the values we want to write have been written by a subsequent committed transaction in the meantime). Assuming no conflicts, the TM writes the $tx_{id}$, $ts_c$ pair to the commit table. At this point the transaction is committed but not yet completed.
• (E) When the clients gets the OK back from the TM, it updates the conflict-field value to $ts_c$ for all keys in its write set. Once this is done, the client deletes the $tx_{id}$ row from the commit table, the transaction is completed.

A background cleanup process helps old (crashed or otherwise slow) transactions complete. Old versions of data items are also periodically tidied up by a background cleaner using a low water mark. If a transaction with an id smaller than the low water mark attempts to commit it will be aborted.

Let’s now return to the get tentative value procedure to be followed when the client reads a version with a nil commit field value. We need to take care with the window between the transaction being committed and completed. The client checks in the commit table for a row with the corresponding $tx_{id}$. If it finds one, it knows that the transaction is committed. In this scenario, it helps out by updating the commit-field value of the record using the commit timestamp in the commit table, and then returns this version to the client. Even this protocol is open to yet another race, so we have to re-read the record from the datastore one more time if a version is not found in the CT. See the GetTentativeValue procedure in the pseudo-code below:

### HA

HA is based on a primary-backup scheme with timeouts. The timeout is set relatively low for fast recovery. The tricky part is when the backup performs failover while the primary is still operational – we have two primaries at this point.

To this end, the primary TM actively checks if a backup has replaced it, and if so, “commits suicide”, i.e., halts. However, it is still possible to have a (short) window between the failover and the primary’s discovery of the existence of a new primary when two primary TMs are active.

Three properties are enforced to ensure that this cannot cause difficulties:

• All timestamps assigned by TM-2 exceed all those assigned by TM-1 (timestamp ranges are allocated in large chunks, called epochs, and managed as shared objects in ZooKeeper).
• After a transaction t begins, no other transaction that will end up with a commit timestamp earlier than t‘s read timestamp is allowed to update any additional data items.
• When a transaction reads a tentative update, it is able to determine whether this update will be committed with a timestamp smaller than its read timestamp or not.

To handle the third scenario, an invalidation mechanism is used.

… when a read encounters a tentative update whose txid is not present in the CT, it forces the transaction that wrote it to abort. We call this invalidation and extend the CT’s schema to include an invalid field to this end. Invalidation is performed via an atomic put-if-absent (support by HBase’s checkAndMutate API) to the CT, which adds a record marking that the transaction has “invalid” status. The use of an atomic put-if-absent achieves consensus regarding the state of the transaction.

A lease mechanism is used to ensure that a TM only needs to additionally check for invalidation if a lease expires while a commit record is in flight. See section 6 in the paper for full details of the scheme.

Recovery after failure takes around 4 seconds.

### Futures

Since becoming an Apache Incubator project, Omid is witnessing increased interest, in a variety of use cases. Together with Tephra, it is being consider for use by Apache Phoenix – an emerging OLTP SQL engine over HBase storage. In that context, latency has increased importance. We are therefore developing a low-latency version of Omid that has clients update the CT instead of the TM, which eliminates the need for batching and allows throughput scaling without sacrificing latency.

Why does Apache have both Omid and Tephra incubating? From the Omid proposal document: “Very recently, a new incubator proposal for a similar project called Tephra, has been submitted to the ASF. We think this is good for the Apache community, and we believe that there’s room for both proposals as the design of each of them is based on different principles (e.g. Omid does not require to maintain the state of ongoing transactions on the server-side component) and due to the fact that both -Tephra and Omid- have also gained certain traction in the open-source community.” Let all the flowers bloom :).

Deconstructing Xen Shi et al., NDSS 2017

Unfortunately, one of the most widely-used hypervisors, Xen, is highly susceptible to attack because it employs a monolithic design (a single point of failure) and comprises a complex set of growing functionality including VM management, scheduling, instruction emulation, IPC (event channels), and memory management.

As of v4.0, Xen runs to 270Kloc, a plenty large enough codebase to harbour multiple security vulnerabilities. Needless to say, once the hypervisor is compromised you’re in trouble. At best you’re exposed to a denial of service attack (shutdown or crash the machine and all the VMs running on it), at worst every single VM is compromised. SGX, which we looked at again earlier this week  can protect applications running in enclaves from a malicious hypervisor, but does nothing to protect against attacks against the hypervisor (and let’s face it, you’re not using SGX with your production workloads right now anyway).

The authors studied all 191 vulnerabilities published on the Xen Security Advisories list, of which 144 are directly related to the core hypervisor. From the 144:

• 62% lead to host denial-of-service attacks
• 15% lead to privilege escalation,
• 14% lead to information leaks, and
• 13% use the hypervisor to attack guest VMs.

As our security analysis demonstrates, the Xen core is fundamentally at risk. However, it is unclear how to effectively mitigate these threats.

Nexen is a deconstruction of Xen into multiple internal domains following the principle of least privilege. Enforcing separation between these domains is not easy since the hypervisor operates at the highest hardware privilege level – Nexen uses a same-privilege memory isolation technique we’ll look at shortly. A prototype implementation of Nexen mitigates 107 out of the 144 core hypervisor vulnerabilities, with very small performance overhead (on the order of 1.2%).

The overall design of Nexen looks like this:

There is a minimal fully privileged security monitor running directly on the hardware. Each virtual machine is supported by its own fully sandboxed Xen slice. Xen slices share code but each has its own data. Errors inside one slice are not considered dangerous to the whole system or other VMs.

Unfortunately, a simple decomposition of all functionality into slices is untenable because subsets of functionality interact across slice boundaries. High frequency privilege boundary crossing cause high performance degradation. So we create a single, slightly more privileged shared service domain—but still not as privileged as the security monitor. Deciding what to place in per-VM slices and the shared services domain is non-trivial and one of the key contributions of this work.

Xen Slices communicate with the shared service domain module through carefully controlled gates. All of this looks nice as a modular design of course, but the key point is that the domain boundaries need to be enforced – every single domain runs at the highest privilege of the system, ring 0 of the root mode. The monitor needs to be tamper-proof, but without creating high overheads or forcing large changes to the Xen code base. The secret to pulling this off is nested MMU virtualization.

Nested MMU virtualization works by monitoring all modifications of virtual-to-physical translations (mappings) and explicitly removing MMU modifying instructions from the unprivileged component – a technique called code deprivileging.

All mapping pages are marked as read-only. Thus any unexpected modifications to page table can be detected by traps. The removal of any modifying instructions from higher level domains is enforced through a binary scan (checking for both aligned and unaligned privileged instructions – attention to detail!). Restricting control of the core MMU to just the monitor greatly reduces the TCB (Trusted Compute Base) for memory management.

On top of the monitor, components such as the scheduler and domain management are placed in the shared service, while functions relating only to one VM live in the Xen slices. The security monitor does not have its own address space, it is shared by all internal domains – but as we’ve seen no-one is able to tamper with its data. The shared service and every Xen slice have their own address spaces. The overall memory access permissions are arranged like this:

If an internal domain wants to execute a privileged instruction, it has to do so by requesting the monitor to perform the instruction on its behalf (recall that no privileged instructions are allowed above the monitor). This allows the monitor to, well, monitor, such requests and prevent unauthorized or malicious use. For calls between internal domains, Nexen provides a secure call gate. It guarantees that:

• it is not possible to switch to another internal domain without calling the gate
• requests cannot be forged – each call gate is bound to both its return address and its domain type argument
• switches of control flow and address space are atomic

A gate keeper controls all exits from VMs to the hypervisor, and entries from the hypervisor to VMs. Switching between Xen slices is explicitly forbidden since there is no legitimate reason to ever want to do this.

The monitor and slicing services provide an isolation and code integrity foundation, the next challenge is to use these mechanisms to deconstruct Xen in such a way as to maximize the value of least-privilege while minimizing cross-domain interactions. Three principles are followed to guide this process.

1. Avoid inserting dangerous functionality into the shared service!
2. Avoid runtime communication – guests should not be able to actively invoke shared services
3. Separate mechanism from policy: “Complex calculation and decision making can be processed by untrusted code, but the final security sensitive operation must be performed by trusted code.

Here are the placement decisions and rationale for some of the key components (not an exhaustive list):

• Scheduler – fully inside the shared service, with interfaces for a guest to yield, sleep, or wake its VCPU.
• Event channel – delivers events between guest VMs, the hypervisor, and hardware. Each event channel bucket is placed in the owning VM’s Xen slice. Event sending is proxied through the shared service.
• Memory management – the allocator is only used during booting and domain building so it lives in the shared service. Xen slices are allowed to manage their bound VM’s memory mapping update.
• Code emulation and I/O – emulation code is run in each VM’s Xen slice.

### Protection afforded by Nexen

An attacker cannot escalate memory access privilege – either directly writing to a protected memory region or updating the page table in an attempt to gain access will result in a page fault. Nexen kill’s the attackers Xen slice and VM in this case.

An attacker cannot abuse privileged instructions – these have been removed from per-VM slices, so the attacker must reuse those in the security monitor. Sanity checking on calls prevents misuse. If a malicious context is forged causing a jump to the instruction the attacker will stop lose execution control for a while before executing the monitor and sanity checking kicks in again.

An attacker cannot trick the monitor into giving extra priviliges – requesting operation for which the attacker does not have privileges will be rejected, and pretending to be another internal domain is fooled by the unique id number mapped in the read-only region of each domain’s address space.

An attacker cannot fool Xen – Nexen reuses a lot of Xen code, but all memory and instruction invariants and policies are strictly enforced by the security monitor.

There is a clear boundary between those attacks Nexen can and can not prevent. Attacks with their key steps happening in Xen slices can mostly be prevented. This is because Xen slice is a sandbox that can be sacrificed. Exceptions are those trying to crash or leak information to a guest VM with Iago attacks. They do not try to harm the hypervisor or other VMs so sandboxing does not work for them. The gate keeper guards interactions between the hypervisor and VMs. Part of attacks that takes effect in a guest VM can be prevented. However, verifying data corrupted by an Iago attack requires recomputing, which the gate keeper is incapable of.

Of the 144 Xen hypervisor vulnerabilities in the report, Nexen can defend against 96 of them on the Intel x86 platform, and if an equivalent implementation was created for ARM and AMD processors, the architecture would also defend against a further 17 vulnerabilities specific to those processor types. In total, Nexen prevents 107 out of the 144 vulnerabilities.

Here’s an illustrative sample of reported vulnerabilities and how Nexen stops them.

(Click for larger view)

The current main achilles heel of Nexen is the shared service – if a logic error in this component is found and exploited then the hypervisor may still be compromised. Here are a selection of attacks that even Nexen cannot currently prevent: