Large-scale evolution of image classifiers Real et al., 2017
I’m sure you noticed the bewildering array of network architectures in use when we looked at some of the top convolution neural network papers of the last few years last week (Part 1, Part2, Part 3). With sufficient training data, these networks can achieve amazing feats, but how do you find the best network architecture for your problem in the first place?
Discovering neural network architectures… remains a laborious task. Even within the specific problem of image classification, the state of the the art was attained through many years of focused investigation by hundreds of researchers.
If you’re an AI researcher and you come up against a difficult problem where it’s hard to encode the best rules (or features) by hand, what do you do? Get the machine to learn them for you of course! If what you want to learn is a function, and you can optimise it through back propagation, then a network is a good solution (see e.g. ‘Network in Network‘, or ‘Learning to learn by gradient descent by gradient descent‘). If what you want to learn is a policy that plays out over some number of time steps then reinforcement learning would be a good bet. But what if you wanted to learn a good network architecture? For that the authors turned to a technique from the classical AI toolbox, evolutionary algorithms.
As you may recall, evolutionary algorithms start out with an initial population, assess the suitability of population members for the task in hand using some kind of fitness function, and then generate a new population for the next iteration by mutations and combinations of the fittest members. So we know that we’re going to start out with some population of initial model architectures (say, 1000 of them), define a fitness function based on how well they perform when trained, and come up with a set of appropriate mutation operations over model architectures. That’s the big picture, and there’s just one more trick from the deep learning toolbox that we need to bring to bear: brute force! If in doubt, overwhelm the problem with bigger models, more data, or in this case, more computation:
We used slightly-modified known evolutionary algorithms and scaled up the computation to unprecedented levels, as far as we know. This, together with a set of novel and intuitive mutation operators, allowed us to reach competitive accuracies on the CIFAR-10 dataset. This dataset was chosen because it requires large networks to reach high accuracies, thus presenting a computational challenge.
The initial population consists of very simple (and very poorly performing) linear models, and the end result is a fully trained neural network with no post-processing required. Here’s where the evolved models stand in the league table:
That’s a pretty amazing result when you think about it. Really top-notch AI researchers are in very short supply, but computation is much more readily available on the open market. ‘Evolution’ evolved an architecture, with no human guidance, that beats some of our best models from the last few years.
Let’s take a closer look at the details of the evolutionary algorithm, and then we’ll come back and dig deeper into the evaluation results.
We start with a population of 1000 very simple linear regression models, and then use tournament selection. During each evolutionary step, a worker process (250 of them running in parallel) chooses two individuals at random and compares their fitness. The worst of the pair is removed from the population, and the better model is chosen as a parent to help create the next generation. A mutation is applied to the parent to create a child. The worker then trains the child, evaluates it on the validation set, and puts it back into the population.
Using this strategy to search large spaces of complex image models requires considerable computation. To achieve scale, we developed a massively-parallel, lock-free infrastructure. Many workers operate asynchronously on different computers. They do not communicate directly with each other. Instead, they use a shared file-system, where the population is stored.
Training and validation takes place on the CIFAR-10 dataset consisting of 50,000 training examples and 10,000 test examples, all labeled with 1 of 10 common object classes. Each training runs for 25,600 steps – brief enough so that each individual can be trained somewhere between a few seconds and a few hours, depending on the model size. After training, a single evaluation on the validation set provides the accuracy to use as the model’s fitness.
We need architectures that are trained to completion within an evolutionary experiment… [but] 25,600 steps are not enough to fully train each individual. Training a large enough model to completion is prohibitively slow for evolution. To resolve this dilemma, we allow the children to inherit the parents’ weights whenever possible.
The final piece of the puzzle then, is the encoding of model architectures, and the mutation operations defined over them. A model architecture is encoded as a graph (its DNA). Vertices are tensors or activations (either batch normalisation with ReLUs, or simple linear units). Edges in the graph are identity connections (for skipping) or convolutions.
When multiple edges are incident on a vertex, their spatial scales or numbers of channels may not coincide. However, the vertex must have a single size and number of channels for its activations. The inconsistent inputs must be resolved. Resolution is done by choosing one of the incoming edges as the primary one. We pick this primary edge to be the one that is not a skip connection.
Activation functions are similarly reshaped (using some combination of interpolation, truncation, and padding), and the learning rate is also encoded in the DNA.
When creating a child, a worker picks a mutation at random from the following set:
- Alter learning rate
- Identity (effectively gives the individual further training time in the next generation)
- Reset weights
- Insert convolution (at a random location in the ‘convolutional backbone’). Convolutions are 3×3 with strides of 1 or 2.
- Remove convolution
- Alter stride (powers of 2 only)
- Alter number of channels (of a random convolution)
- Filter size (horizontal or vertical at random, on a random convolution, odd values only)
- Insert one-to-one (adds a one-to-one / identity connection)
- Add skip (identity between random layers)
- Remove skip (removes a random skip)
Here’s an example of an evolution experiment, with selected population members highlighted:
Five experiment runs are done, and although not all models reach the same accuracy, they get pretty close. It took 9 x 1019 FLOPs on average per experiment.
The following chart shows how accuracy improves over time during the experiments:
We observe that populations evolve until they plateau at some local optimum. The fitness (i.e. validation accuracy) value at this optimum varies between experiments (Above, inset). Since not all experiments reach the highest possible value, some populations are getting “trapped” at inferior local optima. This entrapment is affected by two important meta-parameters (i.e. parameters that are not optimized by the algorithm). These are the population size and the number of training steps per individual.
The larger the population size, the more thoroughly the space of models can be explored, which helps to reach better optima. More training time means that a model needs to undergo fewer identity mutations to reach a given level of training (remember that the end result of the evolution process is a fully trained model, not just a model architecture).
Two other approaches to escaping local optima are increasing the mutation rate, and resetting weights. When it looks like members of the population are trapped in poor local optima, the team tried applying 5 mutations instead of 1 for a few generations. During this period some population members escape the local optimum, and none get worse:
To avoid getting trapped by poorer architectures that just happened to have received more training (e.g. through the identity mutation), the team also tried experiments in which the weights are simultaneously reset across all population members when a plateau is reached. The populations suffer a temporary degradation (as to be expected), but ultimately reach a higher optima.
The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems. Ethically Aligned Design: A Vision For Prioritizing Wellbeing With Artificial Intelligence And Autonomous Systems, Version 1. IEEE, 2016. http://standards.ieee.org/develop/indconn/ec/autonomous_systems.html.
Something a little different for today… the IEEE recently put out a first version of their “Ethically Aligned Design” report for public discussion. It runs to 136 pages (!) but touches on a number of very relevant issues.
This document represents the collective input of over one hundred global thought leaders in the fields of Artificial Intelligence, law and ethics, philosophy, and policy from the realms of academia, science, and the government and corporate sectors.
The report itself is divided into eight sections, each of which seems to be the result of the deliberations of a different sub-committee. The eight areas are:
- General Principles
- Embedding values into autonomous intelligent systems
- Methodologies to guide ethical research and design
- Safety and beneficence of Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI)
- Personal data and individual access control
- Reframing autonomous weapons systems
- Economics/humanitarian issues
I’m going to focus on the first five of these areas today, and of necessity in reducing 136 pages to one blog post, I’ll be skipping over a lot of details and just choosing the parts that stand out to me on this initial reading.
Future AI systems may have the capacity to impact the world on the scale of the agricultural or industrial revolutions.
This section opens with a broad question, “How can we ensure that AI/AS do not infringe human rights?” (where AI/AS stands for Artificial Intelligence / Autonomous Systems throughout the report). The first component of the answer connects back to documents such as the Universal Declaration of Human Rights and makes a statement that I’m sure very few would disagree with, although it offers little help in the way of implementation:
AI/AS should be designed and operated in a way that respects human rights, freedoms, human dignity, and cultural diversity.
The other two components of the answer though raise immediately interesting technical considerations:
- AI/AS must be verifiably safe and secure throughout their operational lifetime.
- If an AI/AS causes harm it must always be possible to discover the root cause (traceability) for said harm.
The second of these in particular is very reminiscent of the GDPR ‘right to an explanation,’ and we looked at some of the challenges with provenance and explanation in previous editions of The Morning Paper.
A key concern over autonomous systems is that their operation must be transparent to a wide range of stakeholders for different reasons (noting that the level of transparency will necessarily be different for each stakeholder). Stated simply, a transparent AI/AS is one in which it is possible to discover how and why the system made a particular decision, or in the case of a robot, acted the way it did.
The report calls for new standards describing “measurable, testable levels of transparency, so that systems can be objectively assessed and levels of compliance determined.
This is an interesting section. The overall argument / intention seems to be that we want to build systems that make decisions which align with the way the impacted communities would like decisions to be made. The actual wording raises a few questions though…
Society does not have universal standards or guidelines to help embed human norms or moral values into autonomous intelligent systems (AIS) today. But as these systems grow to have increasing autonomy to make decisions and manipulate their environment, it is essential they be designed to adopt, learn, and follow the norms and values of the community they serve, and to communicate and explain their actions in as transparent and trustworthy manner possible, given the scenarios in which they function and the humans who use them.
What if the norms and values of the community they serve aren’t desirable? For example, based on all the horrific stories that are increasingly being shared, the ‘norm’ of how women are treated in IT is not something we would ever want to propagate into an AIS. There are many examples in history of things that were once accepted norms which we now find very unacceptable. Could we not embed norms and values (e.g., non-discrimination) of a better, more noble version of ourselves and our communities? Presuming of course we can all agree on what ‘better’ looks like…
Values to be embedded in AIS are not universal, but rather largely specific to user communities and tasks.
This opens the door to ‘moral overload’, in which an AIS is subject to many possibly conflicting norms and values. What should we do in these situations? The recommended best practice seems guaranteed to produce discrimination against minorities (but then again, so does democracy when viewed through the same lens, this stuff is tricky!):
Our recommended best practice is to prioritize the values that reflect the shared set of values of the larger stakeholder groups. For example, a self-driving vehicle’s prioritization of one factor over another in its decision making will need to reflect the priority order of values of its target user population, even if this order is in conflict with that of an individual designer, manufacturer, or client.
In the same section though, we also get:
Moreover, while deciding which values and norms to prioritize, we call for special attention to the interests of vulnerable and under-represented populations, such that these user groups are not exploited or disadvantaged by (possibly unintended) unethical design.
The book “Moral Machines: Teaching robots right from wrong” is recommended as further reading in this area.
Understanding whether / ensuring that systems actually implement the intended norms requires transparency. Two levels of transparency are envisaged: firstly around the information conveyed to the user while an autonomous system interacts, and secondly enabling the system to be evaluated as a whole by a third-party.
A system with the highest level of traceability would contain a black-box like module such as those used in the airline industry, that logs and helps diagnose all changes and behaviors of the system.
Methodologies to guide ethical research and design
The report highlights two key issues relating to business practices involving AI:
- a lack of value-based ethical culture and practices, and
- a lack of values-aware leadership
Businesses are eager to develop and monetize AI/AS but there is little supportive structure in place for creating ethical systems and practices around its development or use… Engineers and design teams are neither socialized nor empowered to raise ethical concerns regarding their designs, or design specifications, within their organizations. Considering the widespread use of AI/AS and the unique ethical questions it raises, these need to be identified and addressed from their inception
Companies should implement ‘ethically aligned design’ programs (from which the entire report derives its title). Professional codes of conduct can support this (there’s a great example from the British Computer Society in this section of the report).
The lack of transparency about the AI/AS manufacturing process presents a challenge to ethical implementation and oversight. Regulators and policymakers have an important role to play here the report argues. For example:
…when a companion robot like Jibo promises to watch your children, there is no organization that can issue an independent seal of approval or limitation on these devices. We need a ratings and approval system ready to serve social/automation technologies that will come online as soon as possible.
CloudPets anyone? What a disgrace.
There’s a well made point by Frank Pasquale, Professor of Law at the University of Maryland about the importance (and understandability) of the training data vs the algorithm too:
…even if machine learning processes are highly complex, we may still want to know what data was fed into the computational process. Presume as complex a credit scoring system as you want. I still want to know the data sets fed into it, and I don’t want health data in that set…
Safety and beneficence of AGI and ASI
This section stresses the importance of a ‘safety mindset’ at all stages.
As AI systems become more capable, unanticipated or unintended behavior becomes increasingly dangerous, and retrofitting safety into these more generally capable and autonomous AI systems may be difficult. Small defects in AI architecture, training, or implementation, as well as mistaken assumptions, could have a very large impact when such systems are sufficiently capable.
The paper “Concrete problems in AI safety” (on The Morning Paper backlog) describes a range of possible failure modes.
Any AI system that is intended to ultimately have capabilities with the potential to do harm should be design to avoid these issues pre-emptively. Retrofitting safety into future more generally capable AI systems may be difficult:
As an example, consider the case of natural selection, which developed an intelligent “artifact” (brains) by simple hill-climbing search. Brains are quite difficult to understand, and “refactoring” a brain to be trustworthy when given large amounts of resources and unchecked power would be quite an engineering feat. Similarly, AI systems developed by pure brute force might be quite difficult to align.
Personal data and individual access control
This is the section most closely aligned with the GDPR, and at its heart is the problem of the asymmetry of data:
Our personal information fundamentally informs the systems driving modern society but our data is more of an asset to others than it is to us. The artificial intelligence and autonomous systems (AI/AS) driving the algorithmic economy have widespread access to our data, yet we remain isolated from gains we could obtain from the insights derived from our lives.
The call is for tools allowing every individual citizen control over their own data and how it is shared. There’s also this very interesting reminder about Western cultural norms here too:
We realize the first version of The IEEE Global Initiative’s insights reflect largely Western views regarding personal data where prioritizing an individual may seem to overshadow the use of information as a communal resource. This issue is complex, as identity and personal information may pertain to single individuals, groups, or large societal data sets.
What is personal data? Any data that can be reasonably linked to an individual based on their unique physical, digital, or virtual identity. That includes device identifiers, MAC addresses, IP addresses, and cookies. Guidance on determining what constitutes personal data can be found in the U.K. Information Commissioner’s Office paper, “Determining what is personal data.”
As a tool for any organization regarding these issues, a good starting point is to apply the who, what, why and when test to the collection and storage of personal information:
- Who requires access and for what duration?
- What is the purpose of the access? Is it read, use and discard, or collect, use and store?
- Why is the data required? To fulfil compliance? Lower risk? Because it is monetized? In order to provide a better service/experience?
- When will it be collected, for how long will it be kept, and when will it be discarded, updated, re-authenticated…
The report also points out how difficult informed consent can be. For example, “Data that appears trivial to share can be used to make inferences that an individual would not wish to share…”
I’ve barely scratched the surface, but this post is getting too long already. One of the key takeaways is that this is a very complex area! I personally hold a fairly pessimistic view when it comes to hoping that the unrestrained forces of capitalism will lead to outcomes we desire. Therefore even though it may seem painful, some kind of stick (aka laws and regulations) does ultimately seem to be required. Will our children (or our children’s children) one day look back in horror at the wild west of personal data exploitation when everything that could be mined about a person was mined, exploited, packaged and sold with barely any restriction?
Let’s finish on a positive note though. It’s popular to worry about AI and Autonomous Systems without also remembering that they can be a force for tremendous good. In addition, as well as introducing unintended bias and discrimination, they can also be used to eliminate it in a way we could never achieve with human decision makers. An example I’ve been talking about here draws inspiration from the Adversarial Neural Cryptography paper of all things. There we get a strong hint that the adversarial network structure introduced with GANs can also be applied in other ways. Consider a network that learns an encoding of information about a person (but explicitly excluding, say, information about race and gender). Train it in conjunction with two other networks, one that learns to make the desired business predictions based on the learned representation, and one (the adversarial net) that attempts to predict race and gender based on that same representation. When the adversarial net cannot do better than random chance, we have a pretty good idea that we’ve eliminated unintended bias from the system…
To round out the week, I thought I’d take a selection of fun papers from the ‘More papers from 2016’ section of top 100 awesome deep learning papers list.
- Colorful image colorization, Zhang et al., 2016
- Texture networks: feed-forward synthesis of textures and stylized images
- Generative visual manipulation on the natural image manifold, Zhu et al., 2016
- WaveNet: a generative model for raw audio, van den Oord et al., 2016
- Google’s neural machine translation system: bridging the gap between human and machine translation, Wu et al., 2016
The texture networks paper we’ve covered before, so the link in the above list is to The Morning Paper write-up (but I felt like it belonged in this group nevertheless).
Colorful image colorization
Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph.
How is this possible? Well, we’ve seen that networks can learn what various parts of the image represent. If you see enough images you can learn that grass is (usually) green, the sky is (sometimes!) blue, and ladybirds are red. The network doesn’t have to recover the actual ground truth colour, just a plausible colouring.
Therefore, our task becomes much more achievable: to model enough of the statistical dependencies between the semantics and the textures of grayscale images and their color versions in order to produce visually compelling results.
Results like this:
Training data for the colourisation task is plentiful – pretty much any colour photo will do. The tricky part is finding a good loss function – as we’ll see soon, many loss functions produce images that look desaturated, whereas we want vibrant realistic images.
The network is based on image data using the CIE Lab colourspace. Grayscale images have only the lightness, L, channel, and the goal is to predict the a (green-red) and b (blue-yellow) colour channels. The overall network architecture should look familiar by now, indeed so familiar that supplementary details are pushed to an accompanying website.
(That website page is well worth checking out by the way, it even includes a link to a demo site on Algorithmia where you can try the system out for yourself on your own images).
Colour prediction is inherently multi-modal, objects can take on several plausible colourings. Apples for example may be red, green, or yellow, but are unlikely to be blue or orange. To model this, the prediction is a distribution of possible colours for each pixel. A typical objective function might use e.g. Euclidean loss between predicted and ground truth colours.
However, this loss is not robust to the inherent ambiguity and multimodal nature of the colorization problem. If an object can take on a set of distinct ab values, the optimal solution to the Euclidean loss will be the mean of the set. In color prediction, this averaging effect favors grayish, desaturated results. Additionally, if the set of plausible colorizations is non-convex, the solution will in fact be out of the set, giving implausible results.
What can we do instead? The ab output space is divided into bins with grid size 10, and the top Q = 313 in-gamut (within the range of colours we want to use) are kept:
The network learns a mapping to a probability distribution over these Q colours (a Q-dimensional vector). The ground truth colouring is also translated into a Q-dimensional vector, and the two are compared using a multinomial cross entropy loss. Notably this includes a weighting term to rebalance loss based on colour-class rarity.
The distribution of ab values in natural images is strongly biased towards values with low ab values, due to the appearance of backgrounds such as clouds, pavement, dirt, and walls. Figure 3(b) [below] shows the empirical distribution of pixels in ab space, gathered from 1.3M training images in ImageNet. Observethat the number of pixels in natural images at desaturated values are orders of magnitude higher than for saturated values. Without accounting for this, the loss function is dominated by desaturated ab values.
The final predicted distribution then needs to be mapped to a point estimated in ab space. Taking the mode of the predicted distribution leads to vibrant but sometimes spatially inconsistent results (see RH column below). Taking the mean brings back another form of the desaturation problem (see LH column below).
To try to get the best of both worlds, we interpolate by re-adjusting the temperature T of the softmax distribution, and taking the mean of the result. We draw inspiration from the simulated annealing technique, and thus refer to the operation as taking the annealed-mean of the distribution.
Here are some more colourings from a network trained on ImageNet, which were rated by Amazon Mechanical Turk participants to see how lifelike they are.
And now for my very favourite part of the paper:
Since our model was trained using “fake” grayscale images generated by stripping ab channels from color photos, we also ran our method on real legacy black and white photographs, as shown in Figure 8 (additional results can be viewed on our project webpage). One can see that our model is still able to produce good colorizations, even though the low-level image statistics of the legacy photographs are quite different from those of the modern-day photos on which it was trained.
Aren’t they fabulous! Especially the Migrant Mother colouring.
The representations learned by the network also proved useful for object classification, detection, and segmentation tasks.
Generative visual manipulation on the natural image manifold
So we’ve just seen that neural networks can help us with our colouring. But what about those of us that are more artistically challenged and have a few wee issues making realistic looking drawings (or alterations to existing drawings) in the first place? In turns out that generative adversarial neural networks can help.
It’s a grand sounding paper title, but you can think of it as “Fiddling about with images while ensuring they still look natural.” I guess that wouldn’t look quite so good in the conference proceedings ;).
Today, visual communication is sadly one-sided. We all perceive information in the visual form (through photographs, paintings, sculpture, etc), but only a chosen few are talented enough to effectively express themselves visually… One reason is the lack of “safety wheels” in image editing: any less-than-perfect edit immediately makes the image look completely unrealistic. To put another way, classic visual manipulation paradigm does not prevent the user from “falling off” the manifold of natural images.
As we know, GANs can be trained to learn effective representations of natural looking images (“the manifold of natural images“). So let’s do that, but then instead of using the trained GAN to generate images, use it as a constraint on the output of various image manipulation operations, to make sure the results lie on the learned manifold at all times. The result is an interactive tool that helps you make realistic looking alterations to existing images. It helps to see the tool in action, you can see a video here. The authors also demonstrate ‘generative transformation’ of one image to look more like another, and my favourite, creating a new image from scratch based on a user’s sketch.
The intuition for using GANs to learn manifold approximations is that they have been shown to produce high-quality samples, and that Euclidean distance in the latent space often corresponds to a perceptually meaningful visual similarity. This means we can also perform interpolation between points in the latent space.
Here’s what happens when the latent vector is updated based on user edits (top row, adding black colour and changing the shape):
In the interactive tool, each update step takes about 50-100ms, working only on the mapped representation of the original image. When the user is done, the generated image captures roughly the desired change, but the quality is degraded as compared to the original image.
To address this issue, we develop a dense correspondence algorithm to estimate both the geometric and color changes induced by the editing process.
This motion and colour flow algorithm is used to estimate the colour and shape changes in the generated image sequence (as user editing progressed), and then transfer them back on top of the original photo to generate photo-realistic images.
The user interface gives the user a colouring brush for changing the colour of regions, a sketching brush to outline shapes or add fine details, and a warping ‘brush’ for more explicit shape modifications.
Here are some results from user edits:
Transformations between two images also take place in the GAN-learned representation space and are mapped back in the same way:
It’s also possible to use the brush tools to create an image from scratch, and then add more scribbles to refine the result. How good is this! :
WaveNet: a generative model for raw audio
Enough with the images already! What about generating sound? How about text-to-speech sound generation yielding state of the art performance? Hearing is believing, so check out these samples:
- WaveNet speaking in US English
- WaveNet trained without the accompanying text sequence, so that it has to make up what to say:
- WaveNet trained on classical piano music (just the audio, no score!) and left to its own devices to generate whatever it wants to:
(You can find more on the DeepMind blog at https://deepmind.com/blog/wavenet-generative-model-raw-audio/).
We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
The architecture of WaveNet is inspired by PixelRNN (See “RNN models for image generation” from a couple of weeks ago). The foundation is very simple – take a waveform with with T values, . And let the probability of the be conditioned on all of the values that precede it: . Now the joint probability of the overall waveform is modelled by:
This can be modelled by a stack of convolutional layers. “By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data…” (the prediction at timestep t cannot depend on any of the future timestamps). This can be implemented by shifting the output of a normal convolution by one or more timesteps.
At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.
Causal convolutions need lots of layers to increase their receptive field. WaveNet uses dilated convolutions to increase receptive fields by orders of magnitude, without greatly increasing computational cost.
A dilated convolution is a convolution where the filter is applied over an area large than its length by skipping input values with a certain step.
WaveNet uses dilation doubling in every layer up to a limit of 512, before repeating (1,2,4, …, 512, 1,2,4, …, 512, …).
A straight softmax output layer would need 65,356 probabilities per timestep to model all possible values for raw audio stored as a sequence of 16-bit integer values. The data is quantized to 256 possible values using a non-linear quantization scheme which was found to produce a significantly better reconstruction than a simple linear scheme:
where and . The network uses both residual and parameterised skip connections throughout to speed up convergence. By conditioning the model on additional inputs, WaveNet can be guided to produce audio with the required characteristics (e.g., a certain speaker’s voice). For TTS, information about the text is fed as an extra input. > For the first experiment we looked at free-form speech generation (not conditioned on text). We used the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK) (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers.
Since it wasn’t conditioned on text, the model generates made-up but human language-like words in a smooth way (see second audio clip at the top of this section). It can model speech from any of the speakers by conditioning it on the one-hot encoding – thus the model is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model.
The second experiment trained WaveNet on the same single-speaker speech databases that Google’s North American and Mandarin Chinese TTS systems are built on. WaveNet was conditioned on linguistic features derived from the input texts. In subjective paired comparison tests, WaveNet beat the best baselines:
WaveNet also achieved the highest ever score in a mean opinion score test where users had to rate naturalness on a scale of 1-5 (scoring over 4 on average). (See the first speech sample at the top of this section).
The third set of experiments trained WaveNet on two music datasets (see the third speech sample at the top of this section). “…the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.”
From the WaveNet blog post on the DeepMind site:
WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.
Google’s neural machine translation system: bridging the gap between human and machine translation
Google’s Neural Machine Translation (GNMT) system is an end-to-end learning system for automated translation. Previous NMT systems suffered in one or more of three key areas: training and inference speed, coping with rare words, and sometimes failing to translate all of the words in a source sentence. GNMT is now in production at Google, having handsomely beaten the Phrase-Based Machine Translation (PBMT) system used in production at Google beforehand.
Understanding how it all fits together will draw upon many of the papers we’ve looked at so far. At the core it’s a sequence-to-sequence learning network with an encoder network, a decoder network, and an attention network.
The encoder transforms a source sentence into a list of vectors, one vector per input symbol. Given this list of vectors, the decoder produces one symbol at a time, until the special end-of-sentence symbol (EOS) is produced. The encoder and decoder are connected through an attention module which allows the decoder to focus on different regions of the source sentence during the course of decoding.
The decoder is a combination of an RNN network and a softmax layer. Deeper models give better accuracy, but the team found that LSTM layers worked well up to 4 layers, barely with 6 layers, and very poorly beyond 8 layers. What to do? We learned the answer earlier this week, add residual connections:
Since in translation words in the source sentence may appear anywhere in the output sentence, the encoder network uses a bi-directional RNN for the encoder. Only the bottom layer is bi-direction – one LSTM layer processes the sentence left-to-right, while its twin processes the sentence right-to-left.
The encoder and decoder networks are placed on multiple GPUs, with each layer running on a different GPU. As well as using multiple GPUs, to get inference time down quantized inference involving reduce precision arithmetic is also used.
One of the main challenges in deploying our Neural Machine Translation model to our interactive production translation service is that it is computationally intensive at inference, making low latency translation difficult, and high volume deployment computationally expensive. Quantized inference using reduced precision arithmetic is one technique that can significantly reduce the cost of inference for these models, often providing efficiency improvements on the same computational devices.
The model is trained using full-precision floats, it is only for production inference that approximation is used. Here are the decoding times for 6003 English-French sentences across CPU, GPU, and Google’s Tensor Processing Unit (TPU) respectively:
Firstly, note that the TPU beats the CPU and GPU hands-down. The CPU beats the GPU because “our current decoder implementation is not fully utilizing the computation capacities that a GPU can theoretically offer during inference.”
Dealing with out of vocabulary words
Neural Machine Translation models often operate with fixed word vocabularies even though translation is fundamentally an open vocabulary problem (names, numbers, dates etc.)… Our most successful approach […] adopts the wordpiece model (WPM) implementation initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system. This approach is completely data-driven and guaranteed to generate a deterministic segmentation for any possible sequence of characters.
For example, “Jet makers feud over seat width with big orders at stake” turns into the word pieces: “_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.” The words ‘Jet’ and ‘feud’ are both broken into two word pieces.
Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model.
Overall model performance.
The following chart shows side-by-side scores for translations made by the previous production system (PBMT), the new GNMT system, and humans fluent in both languages.
Side-by-side scores range from 0 to 6, with a score of 0 meaning “completely nonsense translation”, and a score of 6 meaning “perfect translation: the meaning of the translation is completely consistent with the source, and the grammar is correct”. A translation is given a score of 4 if “the sentence retains most of the meaning of the source sentence, but may have some grammar mistakes”, and a translation is given a score of 2 if “the sentence preserves some of the meaning of the source sentence but misses significant parts”. These scores are generated by human raters who are fluent in both languages and hence often capture translation quality better than BLEU scores.
Today we’re pressing on with the top 100 awesome deep learning papers list, and the section on recurrent neural networks (RNNs). This contains only four papers (joy!), and even better we’ve covered two of them previously (Neural Turing Machines and Memory Networks, the links below are to the write-ups). That leaves up with only two papers to cover today, however the first paper does run to 43 pages and it’s a lot of fun so I’m glad to be able to devote a little more space to it.
- Generating sequences with recurrent neural networks, Graves 2013
- Conditional random fields as recurrent neural networks, Zheng and Jayasumana, 2015
- Neural Turing machines
- Memory networks
These papers are easier to understand with some background in RNNs and LSTMs. Christopher Olah has a wonderful post on “Understanding LSTM networks” which I highly recommend.
Generating sequences with recurrent neural networks
This paper explores the use of RNNs, in particular, LSTMs, for generating sequences. It looks at sequences over discrete domains (characters and words), generating synthetic wikipedia entries, and sequences over real-valued domains, generating handwriting samples. I especially like the moment where Graves demonstrates that the trained networks can be used to ‘clean up’ your handwriting, showing what a slightly neater / easier to read version of your handwriting could look like. We’ll get to that shortly…
RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming.
Using LSTMs effectively gives the network a longer memory, enabling it to look back further in history to formulate its predictions.
The basic RNN architecture used for all the models in the paper looks like this:
Note how each output vector is used to parameterise a predictive distribution over the next possible inputs (the dashed lines in the above figure). Also note the use of ‘skip connections’ as we looked at in yesterday’s post.
The LSTM cells used in the network look like this:
They are trained with the full gradient using backpropagation. To prevent the derivatives becoming too large, the derivative of the loss with respect to the inputs to the LSTM layers are clipped to lie within a predefined range.
Onto the experiments…
For text prediction we can either use sequences of words, or sequences of characters. With one-hot encodings, the number of different classes for words makes for very large input vectors (e.g. a vocabulary with 10’s of thousands of words of more). In contrast, the number of characters is much more limited. Also,
… predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.
The Penn Treebank dataset is a selection of Wall Street Journal articles. It’s relatively small at just over a million words in total, but widely used as a language modelling benchmark. Both word and character level networks were trained on this corpus using a single hidden layer with 1000 LSTM units. Both networks are capable of overfitting the training data, so regularisation is applied. Two forms of regularisation were experimented with: weight noise applied at the start of each training sequence, and adaptive weight noise, where the variance of the noise is learned along with the weights.
The word-level RNN performed better than the character-based one, but the gap closes with regularisation (perplexity of 117 in the best word-based configuration, vs 122 for the best character-based configuration).
“Perplexity can be considered to be a measure of on average how many different equally most probable words can follow any given word. Lower perplexities represent better language models…” ( http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/node188_mn.html )
Much more interesting is a network that Graves trains on the first 96M bytes of the Wikipedia corpus (as of March 3rd 2006, captured for the Hutter prize competition). This has seven hidden layers of 700 LSTM cells each. This is an extract of the real Wikipedia data:
And here’s a sample generated by the network (for additional samples, see the full paper):
The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.
It can correctly open and close quotation marks and parentheses, indicating the models memory and these often span a distance that a short-range context cannot handle. Likewise, it can generate distinct large-scale regions such as XML headers, bullet-point lists, and article text.
Of course, the actual generated articles don’t make any sense to a human reader, it is just their structure that is mimicked. When we move onto handwriting though, the outputs do make a lot of sense to us…
To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.
The dataset consists of handwritten lines on a smart whiteboard, with x,y co-ordinates and end-of-stroke markers (yes/no) captured at each time point. The main challenge was figuring out how to determine a predictive distribution for real-value inputs. The solution is to use mixture density neworks. Here the outputs of the network are used to parameterise a mixture distribution. Each output vector consists of the end of stroke probability e, along with a set of means, standard deviations, correlations, and mixture weights for the mixture components used to predict the x and y positions. See pages 20 and 21 for the detailed explanation.
Here are the mixture density outputs for predicted locations as the word under is written. The small blobs show accurate predictions while individual strokes are being written, and the large blobs show greater uncertainty at the end of strokes when the pen is lifted from the whiteboard.
The best samples were generated by a network with three hidden layers of 400 LSTM cells each, and 20 mixture components to model the offsets. Here are some samples created by the network.
The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character level language models, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures
Those samples do of course look like handwriting, but as with our Wikipedia example, the actual words are nonsense. Can we learn to generated handwriting for a given text? To meet this challenge a soft window is convolved with the text string and fed as an extra input to the prediction network.
The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.
The network learns how far to slide the text window at each step, rather than learning an absolute position. “Using offsets was essential to getting the network to align the text with the pen trace.”
And here are samples generated by the resulting network:
Biased and primed sampling to control generation
One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output more probable elements of Pr(x|c) if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results.
As we increase the bias towards higher probability predictions, the handwriting gets neater and neater…
As a final flourish, we can prime the network with a real sequence in the handwriting of a particular writer. The network then continues in this style, generating handwriting mimicking the author’s style.
Combine this with bias, and you also get neater versions of their handwriting!
Conditional random fields as recurrent neural networks
Now we turn our attention to a new challenge problem that we haven’t looked at yet: semantic segmentation. This requires us to label the pixels in an image to indicate what kind of object they represent/are part of (land, building, sky, bicycle, chair, person, and so on…). By joining together regions with the same label, we segment the image based on the meaning of the pixels. Like this:
(The CRF-RNN column in the above figure shows the results from the network architecture described in this paper).
As we’ve seen, CNNs have been very successful in image classification and detection, but there are challenges applying them to pixel-labelling problems. Firstly, traditional CNNs don’t produce fine-grained enough outputs to label every pixel. But perhaps more significantly, even if we could overcome that hurdle, they don’t have any way of understanding that if pixel A is part of, say, a bicycle, then it’s likely that the adjacent pixel B is also part of a bicycle. Or in more fancy words:
CNNs lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output. Lack of such smoothness constraints can result in poor object delineation and small spurious regions in the segmentation output.
Conditional Random Fields (a variant of Markov Random Fields) are very good at smoothing. They’re basically models that take into account surrounding context when making predictions. So maybe we can combine Conditional Random Fields (CRF) and CNNs in some way to get the best of both worlds?
The key idea of CRF inference for semantic labelling is to formulate the label assignment problem as a probabilistic inference problem that incorporates assumptions such as the label agreement between similar pixels. CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks in utilizing CNNs for pixel-level labelling tasks.
Sounds good in theory, but it’s quite tricky in practice. The authors proceed in two stages: firstly showing that one iteration of the mean-field algorithm used in CRF can be modelled as a stack of common CNN layers; and secondly by showing that repeating the CRF-CNN stack with outputs from the previous iteration fed back into the next iteration you can end up with an RNN structure, dubbed CRF-RNN, that implements the full algorithm.
Our approach comprises a fully convolutional network stage, which predicts pixel-level labels without considering structure, followed by a CRF-RNN stage, which performs CRF-based probabilistic graphical modelling for structured prediction. The complete system, therefore, unifies the strengths of both CNNs and CRFs and is trainable end-to-end using the back-propagation algorithm and the Stochastic Gradient Descent (SGD) procedure.
There’s a lot of detail in the paper, some of which passes straight over my head, for example, the following sentence which warranted me bringing out the ‘hot pink’ highlighter:
In terms of permutohedral lattice operations, this can be accomplished by only reversing the order of the separable filters in the blur stage, while building the permutohedral lattice, splatting, and slicing in the same way as in the forward pass.
(What is a permutohedron you may ask? It’s actually not as scary as it sounds…)
Fortunately, we’re just trying to grok the big picture in this write-up, and for that the key is to understand how CNNs can model one mean-field iteration, and then how we stack the resulting structures in RNN formation.
Mean-field iteration as a stack of CNN layers
Consider a vector X with one element per pixel, representing the label assigned to that pixel drawn from some pre-defined set of labels. We construct a graph where the vertices are the elements in X, and edges between the elements hold pairwise ‘energy’ values. Minimising the overall energy of the configuration yields the most probable label assignments. Energy has two components: a unary component which depends only on the individual pixel and roughly speaking, predicts labels for pixels without considering smoothness and consistency; and pairwise energies that provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties. The energy calculations are based on feature vectors derived from image features. Mean-field iteration is used to find an approximate solution for the minimal energy configuration.
The steps involved in a single iteration are:
- message passing,
- compatibility transformation,
- unary addition, and
Message passing is made tractable by using approximation techniques (those permutohedral lattice thingies) and two Guassian kernels: a spatial kernel and a bilateral kernel. Re-weighting can be implemented as a 1×1 convolution. Each kernel is given independent weights:
The intuition is that the relative importance of the spatial kernel vs the bilateral kernel depends on the visual class. For example, bilateral kernels may have on the one hand a high importance in bicycle detection, because similarity of colours is determinant; on the other hand they may have low importance for TV detection, given that whatever is inside the TV screen may have different colours.
Compatibility transformation assigns penalties when different labels are assigned to pixels with similar properties. It is implemented with a convolutional filter with learned weights (equivalent to learning a label compatibility function).
The addition (copying) and normalisation (softmax) operations are easy.
CRF as a stack of CRF-CNN layers
Multiple mean-field iterations can be implemented by repeating the above stack of layers in such a way that each iteration takes Q value estimates from the previous iteration and the unary values in their original form. This is equivalent to treating the iterative mean-field inference as a Recurrent Neural Network (RNN)… We name this RNN structure CRF-RNN.
Recall that the overall network has a fully-convolution network stage predicting pixels labels in isolation, followed by a CRF-CNN for structured prediction. In one forward pass the computation goes through the initial CNN stage, and then it takes T iterations for data to leave the loop created by the RNN. One the data leaves this loop, a softmax loss layer directly follows and terminates the network.
The resulting network achieves the state-of-the-art on the Pascal VOC 2010-2012 test datasets.
Today we’re looking at the final four papers from the ‘convolutional neural networks’ section of the ‘top 100 awesome deep learning papers‘ list.
- Deep residual learning for image recognition, He et al., 2016
- Identity mappings in deep residual networks, He et al., 2016
- Inception-v4, inception-resnet and the impact of residual connections or learning, Szegedy et al., 2016
- Rethinking the inception architecture for computer vision, Szegedy et al., 2016
Deep residual learning for image recognition
Another paper, another set of state-of-the-art results, this time with 1st place on the ILSVRC 2015 classification tasks (beating GoogLeNet from the year before), as well as 1st place on ImageNet detection, ImageNet localisation, COCO detection, and COCO segmentation competitions. For those of you old enough to remember it, there’s a bit of a Crocodile Dundee moment in this paper: “22 layers? That’s not a deep network, this is a deep network…” How deep? About 100 layers seems to work well, though the authors tested networks up to 1202 layers deep!! Which begs the question, how on earth do you effectively train a network that is hundreds or even a thousand layers deep?
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers?
No, it’s not. A degradation problem occurs as network depth increases, accuracy gets saturated, and then degrades rapidly as further layers are added. So after a point, the more layers you add, the worse the error rates. For example:
An interesting thought experiment leads the team to make a breakthrough and defeat the degradation problem. Imagine a deep network where after a certain (relatively shallow) number of layers, each additional layer is simply an identity mapping from the previous one: “the existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.”
Suppose the desired mapping through a number of layers for certain features really is close to the identity mapping. The degradation problem suggests that the network finds it hard to learn such mappings across multiple non-linear layers. Lets say the output of one layer is . Now skip forward three layers in the network and imagine that we’d like the ability to easily learn an identity mapping (or close to) for some elements of – an easy way to do this is just to provide the elements of as direct inputs…
But we don’t want to just double the width of the receiving layer, and for some other elements of perhaps we do want to learn some non-linear function. So here comes the clever part: suppose the ideal function we could learn is . If the intervening hidden layers can learn that then they can also learn . Maybe you can see where this is going: we can now simply combine the output of the previous layer with using simple addition and we get ! But constructing things in this way though, we make it easy for the network to include both non-linear elements and also near identity elements (by learning weights close to zero).
Adding these layer jumping + addition short-circuits to a deep network creates a residual network. Here’s a 34-layer example:
Here are 18 and 34 layer networks trained on ImageNet without any residual layers. You can see the degradation effect with the 34 layer network showing higher error rates.
Take the same networks and pop in the residual layers, and now the 34-layer network is handsomely beating the 18-layer one.
Here are the results all the way up to 152 layer networks:
Now that’s deep!
Identity mappings in deep residual networks
This paper analyses the skip-layers introduced in the residual networks (ResNets) that we just looked at to see whether the identity function is the best option for skipping. It turns out that it is, and that using an identity function for activation as well makes residual units even more effective.
Recall that we used as the input when skipping layers. More generally, this is where h is the identity function. But h could be also be another kind of function – for example constant scaling. The authors try a variety of different h functions, but identity works best. When we consider the activation function f as well (the original paper used ReLU) then we have . Things work even better when f is also an identity mapping (instead of ReLU).
To construct an identity mapping f, we view the activation functions (ReLU and Batch Normalization) as “pre-activation” of the weight layers, in contrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in Fig 1(b).
Making both h and f identity mappings allows a signal to be directly propagated from one unit to any other units, in both forward and backward passes.
Here’s the difference the new residual unit design makes when training a 1001 layer ResNet:
Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet. We further report improved results on ImageNet using a 200-layer ResNet, for which the counterpart of [the original ResNet] starts to overfit. These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.
Inception-v4, inception-resnet and the impact of residual connections or learning
While all this ResNet excitement was going on, the Inception team kept refining their architecture, up to Inception v3. The obvious question became: what happens if we take the Inception architecture and we add residual connections to it? Does that further improve training time or accuracy? This paper compares four models: Inception v3, a newly introduced in this paper Inception v4, and variations of Inception v3 and v4 that also have residual connections. It’s also fun to see ever more complex network building blocks being used as modules in higher level architectures. Inception v4 is too much to show in one diagram, but here’s the overall schematic:
And as a representative selection, here’s what you’ll find if you dig into the stem module:
And the ’17 x 17 Inception-B’ module:
Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin.
Since Inception 4 and Inception-ResNet-v2 (Inception 4 with residual connections) give overall very similar results, this seems to suggest that – at least at this depth – residual networks are not necessary for training deep networks.
However, the use of residual connections seems to improve the training speed greatly, which is alone a great argument for their use.
Both the Inception v4 and Inception ResNet-2 models outperform previous Inception networks, by virtue of the increased model size.
Rethinking the Inception architecture for computer vision
This paper comes a little out of order in our series, as it covers the Inception v3 architecture. The bulk of the paper though is a collection of advice for designing image processing deep convolutional networks. Inception v3 just happens to be the result of applying that advice.
In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways.
- Avoid representational bottlenecks – representation size should gently decrease from the inputs to the outputs before reaching the final representation used for task at hand. Big jumps (downward) in representation size cause extreme compression of the representation and bottleneck the model.
- Higher dimensional representationsare easier to process locally in a network, more activations per tile allows for more disentangled features. The resulting networks train faster.
- Spatial aggregation of lower dimensional embeddings can be done without much or any loss in representational power. “For example, before performing a more spread out (e.g. 3×3 convolution), one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects.”
- Balance network width and depth, optimal performance is achieved by balancing the number of filters per stage and the depth of the network. I.e., if you want to go deeper you should also consider going wider.
Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.
- Factorize into smaller convolutions, a larger (e.g. 5×5) convolution is disproportionately more expensive than a smaller (e.g. 3×3) one – by a factor of 25/9 in this case. Replacing the 5×5 convolution with a two-layer network of 3×3 convolutions reusing activations between adjacent tiles achieves the same end but uses (9+9)/25 less computation.
The above results suggest that convolutions with filters larger 3 × 3 a might not be generally useful as they can always be reduced into a sequence of 3 × 3 convolutional layers. Still we can ask the question whether one should factorize them into smaller, for example 2×2 convolutions. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two layer network with the same receptive field as in a 3×3 convolution. The two-layer solution is 33% cheaper.
Taking this idea even further (e.g. replacing 7×7 with a 1×7 followed by a 7×1) works well on medium grid sizes (between 12 and 20), so long as it is not used in early layers.
The authors also revisit the question of the auxiliary classifiers used to aid training in the original Inception. “Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training… near the end of training the network with the auxiliary branches starts to overtake the accuracy of the network without, and reaches a slightly higher plateau.” Removing the lower of the two auxiliary classifiers also had no effect.
Together with the earlier observation in the previous paragraph, this means that original the hypothesis of [Inception] that these branches help evolving the low-level features is most likely misplaced. Instead, we argue that the auxiliary classifiers act as regularizer.
There are several other hints and tips in the paper that we don’t have space to cover. It’s well worth checking out if you’re building these kinds of models yourself.
Today it’s the second tranche of papers from the convolutional neural nets section of the ‘top 100 awesome deep learning papers‘ list:
- Return of the devil in the details: delving deep into convolutional nets, Chatfield et al., 2014
- Spatial pyramid pooling in deep convolutional networks for visual recognition, He et al., 2014
- Very deep convolutional networks for large-scale image recognition, Simonyan & Zisserman, 2014
- Going deeper with convolutions, Szegedy, 2015
Return of the devil in the details: delving deep into convolutional nets
This is a very nice study. CNNs had been beating handcrafted features in image recognition tasks, but it was hard to pick apart what really accounted for the differences (and indeed, the differences between different CNN models too) since comparisons across all of them were not done on a shared common basis. For example, we’ve seen augmentation typically used with CNN training – how much of the difference between IVF (the Improved Fisher Vector hand-engineered feature representation) and CNNs is attributed to augmentation, and not the image representation used? So Chatfield et al. studied IFV shallow representation, three different CNN-based deep representations, and deep representations with pre-training and then fine-tuning on the target dataset. For all of the studies, the same task (PASCAL VOC classification) was used. The three different CNN representations are denoted CNN-F (Fast) – based on Krizhevksy’s architecture; CNN-M (Medium) – using a decreased stride and smaller receptive field of the first convolutional layer; and CNN-S (Slow) based on the ‘accurate’ network from OverFeat.
- Augmentation improves performance by ~3% for both IFV and CNNs. (Or to put it another way, the use of augmentation accounts for 3% of the advantage attributed to deep methods). Flipping on its own helps only marginally, but flipping combined with cropping works well.
- Both IFV and CNNs are affected by adding or subtracting colour information. Retraining CNNs after converting images to grayscale results in about a 3% performance drop.
- CNN based methods still outperform shallow encodings, even accounting for augmentation improvements etc., by a large approximately 10% margin.
- CNN-M and CNN-S both outperform CNN-Fast by 2-3%. CNN-M is about 25% faster than CNN-S.
- Retraining the CNNs so that the final layer was of lower dimensionality resulted in a marginal performance boost.
- Fine-tuning makes a significant difference, improving results by about 2.7%.
In this paper we presented a rigorous empirical evaluation of CNN-based methods for image classification, along with a comparison with more traditional shallow feature encoding methods. We have demonstrated that the performance of shallow representations can be significantly improved by adopting data augmentation, typically used in deep learning. In spite of this improvement, deep architectures still outperform the shallow methods by a large margin. We have shown that the performance of deep representations on the ILSVRC dataset is a good indicator of their performance on other datasets, and that fine-tuning can further improve on already very strong results achieved using the combination of deep representations and a linear SVM.
Spatial pyramid pooling in deep convolutional networks for visual recognition
The CNN architectures we’ve looked at so far have a series of convolutional layers (5 is popular) followed by fully connected layers and an N-way softmax output. One consequence of this is that they can only work with images of fixed size (e.g 224 x 224). Why? The sliding windows used in the convolutional layers can actually cope with any image size, but the fully-connected layers have a fixed sized input by construction. It is the point at which we transition from the convolution layers to the fully-connected layers therefore that imposes the size restriction. As a result images are often cropped or warped to fit the size requirements of the network, which is far from ideal.
Spatial pyramid pooling (SPP) adds a new layer between the convolutional layers and the fully-connected layers. Its job is to map any size input down to a fixed size output. The ideal of spatial pyramid pooling, also know as spacial pyramid matching or just ‘multi-level pooling’ pre-existed in computer vision, but had not been applied in the context of CNNs.
SPP works by dividing the feature maps output by the last convolutional layer into a number of spatial bins with sizes proportional to the image size, so the number of bins is fixed regardless of the image size. Bins are captured at different levels of granularity – for example, one layer of 16 bins dividing the image into a 4×4 grid, another layer of 4 bins dividing the image into a 2×2 grid, and a final layer comprising the whole image. In each spatial bin, the responses of each filter are simply pooled using max pooling.
Since the number of bins is known, we can just concatenate the SPP outputs to give a fixed length representation (see the figure above).
This not only allows arbitrary aspect ratios, but also allows arbitrary scales… When the input image is at different scales, the network will extract features at different scales. Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation, which is also investigated in several concurrent works.
An SPP layer added to four different networks architects, including AlexNet (Krizhevsky et al.) and OverFeat improved the accuracy of all of them. “The gain of multi-level pooling is not simply due to more parameters, rather it is because the multi-level pooling is robust to the variance in object deformations and spatial layout.”
The SPP technique can also be used for detection. The state-of-the-art (as of time of publication) R-CNN method runs feature extraction on each of 2000 candidate windows extracted from an input image. This is expensive and slow. An SPP-net used for object detection extracts feature maps only once (possible at multiple scales). Then just the spatial pyramid pooling piece is run once for each candidate window. This turns out to give comparable results, but with running times 38x-102x faster depending on the number of scales.
Very deep convolutional networks for large-scale image recognition
We’ve now seen the ConvNet architecture and some variations exploring what happens with different window sizes and strides, and training and testing at multiple scales. In this paper Simonyan and Zisserman hold all those variables fixed, and explore what effect the depth of the network has on classification accuracy.
The basic setup is a fixed-size 224 x 224 RGB input image with mean pixel value (computed over the training set) subtracted from each pixel. A stack of convolutional layers (with varying depth in each of the experiments) uses filters with a 3×3 receptive field, and in one configuration a layer is added with a 1×1 field (which can be seen as a linear transformation of the input channels, followed by non-linearity). The stride is fixed at 1 pixel. Spatial pooling is carried out by five max-pooling layers interleaved with the convolutional layers. This stack is then feed into three fully-connected layers and a final soft-max layer. The hidden layers all use ReLU activation.
Given this basic structure, the actual networks evaluated are shown in the table below (note that only one of them uses local response normalisation – LRN):
Here’s what the authors find:
Firstly, local response normalisation did not improve accuracy (A-LRN vs A), but adds to training time, so it is not employed in the deeper architectures. Secondly, classification error decreases with increased ConvNet depth (up to 19 layers in configuration E).
The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
The results above were achieved when training at a single scale. Even better results were achieved by adding scale jittering at training time (lightly rescaled versions of the original image).
Going deeper with convolutions
So now things start to get really deep! This is the paper that introduced the ‘Inception’ network architecture, and a particular instantiation of it called ‘GoogLeNet’ which achieved a new state of the art in the 2014 ISLVRC (ImageNet) competition. GoogLeNet is 22 layers deep, and has a pretty daunting overall structure, which I thought I’d just include here in its full glory!
Despite the intimidating looking structure, GoogLeNet actually uses 12x fewer parameters than the winning Krizhevsky ConvNet of two years prior. At the same time, it is significantly more accurate. Efficiency in terms of power and memory use was an explicit design goal of the Inception architecture:
It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor [efficiency] rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world us, even on large datasets, at a reasonable cost.
You can always make your network ‘bigger’ (both in terms of number of layers – depth, as well as the number of units in each layer – width), and in principle this leads to higher quality models. However, bigger networks have many more parameters making them prone to overfitting. To avoid this you need much more training data. They also require much more computational resource to train. “For example, in a deep vision network if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase in computation.”
One way to counteract this is to introduce sparsity. Arora et al., in “Provable bounds for learning some deep representations” showed that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network technology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs.
Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures.
Is there any hope for an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits current hardware by utilizing computations on dense matrices? The Inception architecture started out as an exploration of this goal.
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components.
Given the layer-by-layer construction approach, this means we just have to find the optimal layer structure, and then stack it. The basic structure of an Inception layer looks like this:
The 1×1 convolutions detect correlated units in local regions, and the larger (3×3 and 5×5) convolutions detect the (smaller number) of more spatially spread out clusters. Since pooling has been shown to have a beneficial effect, a pooling path in each stage is added for good luck too.
The problem with the structure as shown above though, is that it is prohibitively expensive!
While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
The solution is to reduce the network dimensions using 1×1 convolutions on all the 3×3 and 5×5 pathways. “Beside being used as reductions, they also include the use of rectified linear activation making them dual purpose.” This gives a final structure for an Inception layer that looks like this:
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.
You can see these layers stacked on top of each other in the GoogLeNet model. That network is 22 layers deep (27 if the pooling layers are also counted). The overall number of building blocks used for the construction of the network is about 100! Propagating gradients all the way back through so many layers is a challenge. We know that shallower networks still have strong discriminative performance, and this fact was exploited by adding auxiliary classifiers connected to intermediate layers (yellow boxes in the overall network diagram at the start of this section)…
… These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
Having recovered somewhat from the last push on deep learning papers, it’s time this week to tackle the next batch of papers from the ‘top 100 awesome deep learning papers.’ Recall that the plan is to cover multiple papers per day, in a little less depth than usual per paper, to give you a broad sweep of what’s in them (See “An experiment with awesome deep learning papers“). You’ll find the first batch of papers in the archives starting from February 27th. For the next three days, we’ll be tackling the papers from the ‘convolutional neural network models’ section, starting with:
- ImageNet classification with deep convolutional neural networks, Krizhevsky et al., 2012
- Maxout networks, Goodfellow et al., 2013
- Network in network, Lin et al., 2013
- OverFeat: Integration recognition, localization and detection using convolutional networks, Sermanent et al., 2013
Ujjwal Karn’s excellent blog post “An intuitive explanation of convolutional neural networks” provides a some great background on how convolutional networks work if you need a refresher before diving into these papers.
ImageNet classification with deep convolutional neural networks
This is a highly influential paper that kicked off a whole stream of work using deep convolutional neural networks for image processing. Two factors changed that made this possible: firstly, the availability of large enough datasets (specifically, the introduction of ImageNet with millions of images, whereas the previous largest datasets had ‘only’ tens of thousands; and secondly the development of powerful enough GPUs to efficiently train large networks.
What makes CNNs such a good fit for working with image data?
Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
The network that Krizhevsky et al. constructed has eight learned layers – five of them convolutional and three fully-connected. The output of the last layer is a 1000-way softmax which produces a distribution over the 1000 class labels (the ImageNet challenge is to create a classifier that can determine which object is in the image). Because the network is too large to fit in the memory of one GPU, training is split across two GPUs and the kernels of the 2nd, 4th, and 5th convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU.
The authors single out four other aspects of the model architecture that they feel are particularly important:
- The use of ReLU activation (instead of
tanh). “Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units… Faster learning has a great influence on the performance of large models trained on large datasets“
- Using multiple GPUs (two!), and splitting the kernels between them with cross-GPU communication only in certain layers. The scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively compared to a net with half as many kernels in each layer and trained on just one GPU.
- Using local response normalisation, which “implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels”.
- Using overlapping pooling. Let pooling layers be of size z x z, and spaced s pixels apart. Traditionally pooling was used with s = z, so that there was no overlap between pools. Krizhevsky et al. used s = 2 and z = 3 to give overlapping pooling. This reduced the top-1 and top-5 error rates by 0.4% and 0.3% respectively.
To reduce overfitting dropout and data augmentation (translations, reflections, and principal component manipulation) is used during training.
The end result:
On ILSVRC-2010, our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%.
In ILSVRC-2012, the network achieved a top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. That huge winning margin sparked the beginning of a revolution.
Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network’s performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.
For a more in-depth look at this paper, see my earlier write-up on The Morning Paper.
Maxout networks are designed to work hand-in-glove with dropout. As you may recall, training with dropout is like training an exponential number of models all sharing the same parameters. Maxout networks are otherwise standard multilayer perceptron or deep CNNs that use a special activation function called the maxout unit. The output of a maxout unit is simply the maximum of its inputs.
In a convolutional network, a maxout feature map can be constructed by taking the maximum across k affine feature maps
(i.e., pool across channels, in addition spatial locations). When training with dropout, we perform the elementwise multiplication with the dropout mask immediately prior to the multiplication by the weights in all cases–we do not drop inputs to the max operator.
Maxout units make a piecewise linear approximation to an arbitrary convex function, as illustrated below.
Under evaluation, the combination of maxout and dropout achieved state of the art classification performance on MNIST, CIFAR-10, CIFAR-100, and SVHN (Street View House Numbers).
Why does it work so well? Dropout does exact model averaging in deeper architectures provided that they are locally linear among the space of inputs to each layer that are visited by applying different dropout masks…
We argue that dropout training encourages maxout units to have large linear regions around inputs that appear in the training data… Networks of linear operations and maxout may learn to exploit dropout’s approximate model averaging technique well.
In addition, rectifier units that saturate at zero are much more common with dropout training. A zero value stops the gradient from flowing through the unit making it hard to change under training and become active again.
Maxout does not suffer from this problem because gradient always flows through every maxout unit–even when a maxout unit is 0, this 0 is a function of the parameters and may be adjusted. Units that take on negative activations may be steered to become positive again later.
Network in Network
A traditional convolutional layer applies convolution by applying linear filters to the receptive field, followed by nonlinear activation. The outputs are called feature maps. In this paper Lin et al. point out that such a process cannot learn representations that distinguish well between non-linear concepts.
The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept.
Even maxout networks impose the constraint that instances of a latent concept lie within a convex set in the input space, to which they make a piecewise linear approximation. A key question therefore, is whether or not input features do indeed require non-linear functions in order to best represent the concepts contained within them. The authors assert that they do:
…the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear function of the input. In NIN, the GLM is replaced with a ”micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.
And that’s the big idea right there, replace the linear convolutional layer with a mini multilayer perceptron network (called an mlpconv layer). We know that such networks are good at learning functions, so let’s just allow the mlpconv network to learn what the best convolution function is. Since the mlpconv layers sit inside a larger network model, the overall approach is called “network in network.”
The second change that Lin et al. make to the traditional architecture comes at the last layer:
Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondance between feature maps and categories, which is made possible by a stronger local modeling using the micro network.
On CIFAR-10 (10 classes of natural images with 50,000 traning images in total, and 10,000 testing images), the authors beat the then state-of-the-art by more than one percent.
On CIFAR-100 they also beat the then best performance (without data augmentation, which was also not used to evaluate the N-in-N approach) by more than one percent. With the street view house numbers dataset, and with MNIST the authors get good results, but not quite state-of-the-art.
OverFeat: Integrated recognition, localization and detection using convolutional networks
OverFeat shows how the features learned by a CNN-based classifier can also be used for localization and detection. On the ILSVRC 2013 dataset OverFeat ranked 4th in classification, 1st in localization, and 1st in detection. We know what the classification problem is (what object is in this image), but what about the localization and detection problems? Localization is like classification, but the network must also produce a bounding box showing the boundary of the detected object:
The detection problem involves images which may contain many small objects, and the network must detect each object and draw its bounding box:
The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes.
There’s a lot of detail in the OverFeat paper that we don’t have space to cover, so if this work is of interest this paper is one where I definitely recommend going on to read the full thing.
Since objects of interest (especially in the later detection task) can vary significantly in size and position within the image, OverFeat applies a ConvNet at multiple locations within the image in a sliding window fashion, and at multiple scales. The system is then trained to produce a prediction of the location and size of the bounding box containing an object relative to the window. Evidence for each object category is accumulated at each location and size.
Starting with classification, the model is based on Krizhevsky et al. (our first paper today) in the first five layers, but no contrast normalisation is used, pooling regions are non-overlapping, and a smaller stride is used (2 instead of 4). Six different scales of input are used, resulting in unpooled layer 5 maps of varying resolution. These are then pooled and presented to the classifier.
The following diagram summarises the classifier approach built on top of the layer 5 feature maps:
At an intuitive level, the two halves of the network — i.e. feature extraction layers (1-5) and classifier layers (6-output) — are used in opposite ways. In the feature extraction portion, the filters are convolved across the entire image in one pass. From a computational perspective, this is far more efficient than sliding a fixed-size feature extractor over the image and then aggregating the results from different locations. However, these principles are reversed for the classifier portion of the network. Here, we want to hunt for a fixed-size representation in the layer 5 feature maps across different positions and scales. Thus the classifier has a fixed-size 5×5 input and is exhaustively applied to the layer 5 maps. The exhaustive pooling scheme (with single pixel shifts (∆x, ∆y)) ensures that we can obtain fine alignment between the classifier and the representation of the object in the feature map.
For localization the classifier layers are replaced by a regression networks trained to predict bounding boxes at each spatial location and scale. The regression predictions are then combined, along with the classification results at each location. Training with multiple scales ensure predictions match correctly across scales, and exponentially increases the confidence of the merged predictions.
Bounding boxes are combined based on the distance between their centres and the intersection of their areas, and the final prediction is made by taking the merged bounding boxes with maximum class scores. Figure 6 below gives a visual overview of the process:
Detection training is similar to classification training, but with the added necessity of predicting a background task when no object is present.
Traditionally, negative examples are initially taken at random for training, then the most offending negative errors are added to the training set in bootstrapping passes… we perform negative training on the fly, by selecting a few interesting negative examples per image such as random ones or most offending ones. This approach is more computationally expensive, but renders the procedure much simpler.