Stochastic program optimization Schkufza et al., CACM 2016

Yesterday we saw that DeepCoder can find solutions to simple programming problems using a guided search. DeepCoder needs a custom DSL, and a maximum program length of 5 functions. In ‘Stochastic program optimization’ Schkufza et al. also use a search strategy to generate code that meets a given specification – however, their input is a loop-free fixed-point x86_64 assembly code sequence, and the output is optimised assembler that does the same thing, but faster. There are almost 400 x86_64 opcodes, and the programs can be much longer than five instructions! The optimiser is called STOKE, and it can create provably correct sequences that match or outperform the code produced by gcc -O3 or icc -O3, and in some cases even expert handwritten assembly. (-O3 is the highest level of optimisation for gcc/icc)

Whereas a traditional optimisation phase for a compiler factors the optimisation problem into small subproblems that can be solved independently, STOKE can consider the whole program and find solutions that can only be obtained through, for example, “the simultaneous consideration of mutually dependent issues such as instruction selection, register allocation, and target-dependent optimisation.”

Here’s an illustration of STOKE at work, the output of gcc -O3 is shown on the left, and the STOKE optimised assembly on the right. In addition to being considerably shorter, it runs 1.6x faster. The original input to the optimisers (not shown) was a 116-line program produced by llvm -O0.

### Optimisation approach

Because we’re not just trying to generate a correct program, but also a fast one, the problem can be framed as a cost minimisation problem with two weighted terms: one term accounting for correctness, and one for performance.

The simplest measure of correctness is functional equality. Remember that we’re given the assembly output of llvm -O0 as input, and this defines the target behaviour. So we can do black-box equality tests: consider both the target program and a candidate rewrite as functions of registers and memory contents, if they produce the same live outputs for all live inputs defined by the target then they are equal. When two programs are equal the cost of the equality term is zero (remember we’re minimising).

An optimisation is any rewrite with zero equality cost and lower performance cost (i.e., better performance) than the target.

Discovering these optimizations requires the use of a cost minimization procedure. However, in general we expect cost functions of this form to be highly irregular and not amenable to exact optimization techniques. The solution to this problem is to employ the common strategy of using an MCMC sampler.

Markov Chain Monte Carlo (MCMC) sampling draws elements from a probability density function such that regions of higher probability are sampled more often than regions of lower probability. When used for cost minimisation then in the limit most samples should be taken from the minimum (i.e., optimal) values of a function. The eagle-eyed reader may have spotted a small problem here – we don’t have a probability density function! But the Metropolis-Hastings algorithm will let us obtain samples from an approximate cost probability density function given that we can measure cost.

Given an overall computation budget, the algorithm maintains a current rewrite (initially equal to the target) and repeatedly draws a new sample, replacing the current rewrite with this sample if it is accepted. The acceptance criteria come from the Metropolis-Hastings algorithm: if the new sample has a lower cost (higher probability) it is always accepted, but if the new sample has a higher cost it may still be accepted, with a probability that decreases according to how much worse it is. Thus we hope to avoid the search becoming trapped in local minima.

### Application to x86_64 assembly

When we come to apply this general approach in the context of x86_64 we have to make three key decisions: (i) how to efficiently test equality, (ii) how to efficiently measure performance, and (iii) how to implement MCMC sampling.

For equality, we could use a symbolic validator, but his turns out to be too slow (less than 1000 evaluations a second on modestly sized code sequences). Instead, an approximation to full equality is used. A number of test cases are evaluated, and the cost term is actually defined based on how closely the output matches the target for each test input, based on the number of bits that differ between the outputs. This is much faster than using a theorem prover, and also gives a much smoother cost function than the stepped 1/0 (equal/unequal) symbolic equality test. Test case evaluations can be done at a rate of between 1 and 10 million per second.

Performance is also approximated, based on a static approximation of the cost of each instruction. (This is faster than compiling and then executing a program multiple times in order to eliminate transient effects).

For sampling, rewrites are represented as loop-free sequences of instructions of length l, with a special ‘unused’ token used to represent unused instruction slots. Starting with the target program, one of four possible moves is made:

• An instruction is randomly selected, and its opcode is replaced by a random opcode.
• An instruction is randomly selected, and one of its operands is replaced by a random operand.
• Two lines of code are randomly selected and interchanged.
• An instruction is randomly selected and replaced either by a random instruction or the UNUSED token. (Proposing an UNUSED token amounts to deleting an instruction, and replacing an UNUSED token is like inserting one).

The approach is also therefore a little reminiscent of evolutionary algorithms (but with a population of one!).

### STOKE

STOKE implements the above ideas. It runs a binary generated by a standard compiler under instrumentation to generate test cases (using Intel’s PinTool) for a loop-free target of interest, and then runs a set of synthesis threads. From the validated rewrites, the top 20% by performance are then re-ranked based on actual runtime (not the approximation used during generation), and the best is returned to the user.

As well as measuring the actual runtime of the top results, we also want to know that the rewrite is truly equivalent to the target:

STOKE uses a sound procedure to validate the equality of loop-free code sequences. Both target and rewrite are converted into SMT formulae in the quantifier free theory of bit-vector arithmetic used by Z3, producing a query that asks whether both sequences produce the same side effects on live outputs when executed from the same initial machine state. Depending on type, registers are modeled as between 8- and 256-bit vectors, and memory is modeled as two vectors: a 64-bit address and an 8-bit value (x86_64 is byte addressable).

STOKE can generate code that lies in an entirely different region of the search space to the original target code (and it differs from standard optimisers in this respect). This enables it to generate expert-level solutions, but in the original formulation it only does so with low probability.

Dividing the cost minimisation process into two distinct phases helps to discover solutions in very different parts of the search space. In the first synthesis phase, code sequences are evaluated solely based on correctness, and then a subsequent optimization phase searches for the fastest sequence within each of the correct sequences.

### Results

STOKE was evaluated on benchmarks from the literature and from high-performance code, the results are summarised in the table below. Benchmarks p0 – p25 are from “Hacker’s Delight.”

For example:

The results shown in this CACM update improve on those in the original paper (3 years prior) by an order of magnitude or more.

### Synthesising controllers…

Before we close, I’d like to give a quick mention to ‘Sound and automated synthesis of digital stabilizing controllers for continuous plants,’ which looks at the generation of stable controllers for digital control systems. It turns out these are far from easy to write by hand given the challenges of time discretization, the noise introduced by A/D and D/A conversion, and the use of finite word length arithmetic. DSSynth can efficiently synthesize stable controllers for a set of intricate benchmarks taken from the literature, often in under a minute. To give you an idea of the kind of thing we’re talking about, one of the examples in the paper is a cruise control system.

… we leverage a very recent step-change in the automation and scalability of program synthesis. Program synthesis engines use a specification as the starting point, and subsequently generate a sequence of candidate programs from a given template. The candidate programs are iteratively refined to eventually satisfy the specification. Program synthesizers implementing Counter-Example Guided Inductive Synthesis (CEGIS) are now able to generate programs for highly non-trivial specifications with a very high degree of automation. Modern synthesis engines combine automated testing, genetic algorithms, and SMT-based automated reasoning.§

DeepCoder: Learning to write programs Balog et al., ICLR 2017

I’m mostly trying to wait until the ICLR conference itself before diving into the papers to be presented there, but this particular paper follows nicely on from yesterday, so I’ve decided to bring it forward. In ‘Large scale evolution of image classifiers‘ we saw how it’s possible to learn a model for an image classification problem instead of trying to hand-engineer it. (Which incidentally paves the way for bootstrapping sophisticated ML systems the same way that we might bootstrap a compiler). But that’s a fairly constrained domain, surely the average programmer’s job is safe – we can’t learn to write any old program. Or can we? Well ok, no we can’t, but DeepCoder can learn how to solve simple programming tasks of the kind found in the most basic problems on programming competition websites. And it’s only going to get better!

Here’s an example:

This is the domain of Inductive Program Synthesis (IPS). DeepCoder is shown sets of inputs and outputs, and must learn a program that will produce the given outputs when presented with the given inputs (no memorization allowed!). Fundamentally this is a guided search across the space of all possible programs. To make that tractable we need to choose a language (a DSL) in which we’re going to express programs – using a general purpose language such as C++ yields far too big a search space. DeepCoder’s secret sauce is a neural network that is trained to predict the kinds of functions that might be useful when trying to recreate the outputs for a given set of inputs. Knowing the most likely functions the program will ultimately need to include guides the search and helps it find solutions much faster. The approach of integrating a learning module into an IPS system is called LIPS (Learning Inductive Program Synthesis).

### The DeepCoding DSL

DeepCoding programs are simple sequences of function calls, the result of each call initialising a fresh variable that holds either a single integer or an integer array. The output of the program is the return value of the last function call.

The DSL contains the first-order functions head, last, take, drop, access, minimum, maximum, reverse, sort, and sum, and the higher-order functions map, filter, count, zipwith, and scanl. The pre-defined lambda functions which can be passed to these higher-order functions are:

• for map: (+1), (-1), (*2), (/2)2 (*(-1))2 (**2), (*3), (/3), (*4), (/4)
• for filter and count: (>0), (<0), (%2 == 0), (%2 == 1)
• for zipwith and scanl: (+), (-), (*), min, max

There’s is no explicit looping, but of course many of the provided functions do provide branching and looping internally.

### The Learning Module

The learning module learns to predict the probability that each of the above functions is used in a given program, based on seeing an input-output pair. To take a simple example, given the input [3, 8, 1, 7] and the output [4, 12, 28, 32] the network should predict high probability for sort and (*4). The overall network structure looks like this:

With only three hidden layers, it’s actually a fairly simple structure compared to some of the models we looked at last week.

First, we represent the input and output types (singleton or array) by a one-hot-encoding, and we pad the inputs and outputs to a maximum length L with a special NULL value. Second, each integer in the inputs and in the output is mapped to a learned embedding vector of size E = 20. (The range of integers is restricted to a finite range and each embedding is parametrized individually.) Third, for each input-output example separately, we concatenate the embeddings of the input types, the inputs, the output type, and the output into a single (fixed-length) vector, and pass this vector through H = 3 hidden layers containing K = 256 sigmoid units each. The third hidden layer thus provides an encoding of each individual input-output example. Finally, for input-output examples in a set generated from the same program, we pool these representations together by simple arithmetic averaging.

To network outputs a log-unnormalised array of probabilities of each function appearing in the source code:

To generate training examples, DSL programs are enumerated, pruned to remove obvious issues such as redundant variables, or the existence of shorter equivalent programs (approximated by identical behaviour on a set of inputs). Valid inputs for a program are generated by putting a constraint on the output values and propagating this constraint backward through the program. Input-output pairs are then generated by picking inputs from the pre-computed valid ranges and executing the program to obtain the output values.

### Searching for Solutions

At this point we have a set of inputs and outputs, and some clues as to which functions are most likely to be included in a program that generated those outputs. We know use this information to guide a search. The team evaluate a few different search strategies:

• Depth-first search (DFS) over programs of some maximum length T. When the search extends a partial program, it considers functions in probability order (highest probability first of course!).
• A sort and add scheme which maintains a set of active functions and performs DFS with the active function set only. When the search fails, the next most probable function (or set of functions) is added to the active set and the search restarts.
• Use of the Sketch SMT-based program synthesis tool, with a similar sort-and-add scheme used to guide its search
• Use of the $\lambda^2$ program synthesis tool also using a sort-and-add scheme

### Let’s generate some programs!

The main experiment looked at programs of length T=5 ( a search space on the order of 1010, supported by a neural network trained on programs of length T=4. The table below shows the results when trying to find 100 programs (Sketch is dropped from this part of the evaluation as it is significantly slower than the other methods).

The thing to remember when looking at the results is that there is no surprise that the search strategies can actually find successful programs. So what we’re really interested in is how long does it take to do so, and how much if any the learning component actually helps.

In the table above, the percentage columns tell you how long it took each variation to solve 20%, 40%, and 60% of the program generation challenges out of the 100 presented. The baseline row shows how long the search takes when given a function probability distribution that simply mirrors the global incidence of each function in the 500 program test set corpus. Thus the DeepCoder row is truly telling us what difference the neural network prediction make to search time.

We hypothesize that the substantially larger performance gains on Sort and add schemes as compared to gains on DFS can be explained by the fact that the choice of attribute function (predicting presence of functions anywhere in the program) and learning objective of the neural network are better matched to the Sort and add schemes.

Further experiments varying the length of the test programs (from 1 to 4 functions), and the length of the generated programs (from 1 to 5 functions) showed that the neural network can generalize to programs of different lengths than the programs in the training set.

Our empirical results show that for many programs, this technique [neural network guided search] improves the runtime of a wide range of IPS baselines by 1-3 orders. We have found several problems in real online programming challenges that can be solved with a program in our language, which validates the relevance of the class of problems that we have studied in this work. In sum, this suggests that we have made significant progress towards being able to solve programming competition problems, and the machine learning component plays an important role in making it tractable.

Of course, as the authors themselves point out, “there remain some limitations…“.

Large-scale evolution of image classifiers Real et al., 2017

I’m sure you noticed the bewildering array of network architectures in use when we looked at some of the top convolution neural network papers of the last few years last week (Part 1, Part2, Part 3). With sufficient training data, these networks can achieve amazing feats, but how do you find the best network architecture for your problem in the first place?

Discovering neural network architectures… remains a laborious task. Even within the specific problem of image classification, the state of the the art was attained through many years of focused investigation by hundreds of researchers.

If you’re an AI researcher and you come up against a difficult problem where it’s hard to encode the best rules (or features) by hand, what do you do? Get the machine to learn them for you of course! If what you want to learn is a function, and you can optimise it through back propagation, then a network is a good solution (see e.g. ‘Network in Network‘, or ‘Learning to learn by gradient descent by gradient descent‘). If what you want to learn is a policy that plays out over some number of time steps then reinforcement learning would be a good bet. But what if you wanted to learn a good network architecture? For that the authors turned to a technique from the classical AI toolbox, evolutionary algorithms.

As you may recall, evolutionary algorithms start out with an initial population, assess the suitability of population members for the task in hand using some kind of fitness function, and then generate a new population for the next iteration by mutations and combinations of the fittest members. So we know that we’re going to start out with some population of initial model architectures (say, 1000 of them), define a fitness function based on how well they perform when trained, and come up with a set of appropriate mutation operations over model architectures. That’s the big picture, and there’s just one more trick from the deep learning toolbox that we need to bring to bear: brute force! If in doubt, overwhelm the problem with bigger models, more data, or in this case, more computation:

We used slightly-modified known evolutionary algorithms and scaled up the computation to unprecedented levels, as far as we know. This, together with a set of novel and intuitive mutation operators, allowed us to reach competitive accuracies on the CIFAR-10 dataset. This dataset was chosen because it requires large networks to reach high accuracies, thus presenting a computational challenge.

The initial population consists of very simple (and very poorly performing) linear models, and the end result is a fully trained neural network with no post-processing required. Here’s where the evolved models stand in the league table:

That’s a pretty amazing result when you think about it. Really top-notch AI researchers are in very short supply, but computation is much more readily available on the open market. ‘Evolution’ evolved an architecture, with no human guidance, that beats some of our best models from the last few years.

Let’s take a closer look at the details of the evolutionary algorithm, and then we’ll come back and dig deeper into the evaluation results.

### Evolving models

We start with a population of 1000 very simple linear regression models, and then use tournament selection. During each evolutionary step, a worker process (250 of them running in parallel) chooses two individuals at random and compares their fitness. The worst of the pair is removed from the population, and the better model is chosen as a parent to help create the next generation. A mutation is applied to the parent to create a child. The worker then trains the child, evaluates it on the validation set, and puts it back into the population.

Using this strategy to search large spaces of complex image models requires considerable computation. To achieve scale, we developed a massively-parallel, lock-free infrastructure. Many workers operate asynchronously on different computers. They do not communicate directly with each other. Instead, they use a shared file-system, where the population is stored.

Training and validation takes place on the CIFAR-10 dataset consisting of 50,000 training examples and 10,000 test examples, all labeled with 1 of 10 common object classes. Each training runs for 25,600 steps – brief enough so that each individual can be trained somewhere between a few seconds and a few hours, depending on the model size. After training, a single evaluation on the validation set provides the accuracy to use as the model’s fitness.

We need architectures that are trained to completion within an evolutionary experiment… [but] 25,600 steps are not enough to fully train each individual. Training a large enough model to completion is prohibitively slow for evolution. To resolve this dilemma, we allow the children to inherit the parents’ weights whenever possible.

The final piece of the puzzle then, is the encoding of model architectures, and the mutation operations defined over them. A model architecture is encoded as a graph (its DNA). Vertices are tensors or activations (either batch normalisation with ReLUs, or simple linear units). Edges in the graph are identity connections (for skipping) or convolutions.

When multiple edges are incident on a vertex, their spatial scales or numbers of channels may not coincide. However, the vertex must have a single size and number of channels for its activations. The inconsistent inputs must be resolved. Resolution is done by choosing one of the incoming edges as the primary one. We pick this primary edge to be the one that is not a skip connection.

Activation functions are similarly reshaped (using some combination of interpolation, truncation, and padding), and the learning rate is also encoded in the DNA.

When creating a child, a worker picks a mutation at random from the following set:

• Alter learning rate
• Identity (effectively gives the individual further training time in the next generation)
• Reset weights
• Insert convolution (at a random location in the ‘convolutional backbone’). Convolutions are 3×3 with strides of 1 or 2.
• Remove convolution
• Alter stride (powers of 2 only)
• Alter number of channels (of a random convolution)
• Filter size (horizontal or vertical at random, on a random convolution, odd values only)
• Insert one-to-one (adds a one-to-one / identity connection)
• Add skip (identity between random layers)
• Remove skip (removes a random skip)

### Evaluation results

Here’s an example of an evolution experiment, with selected population members highlighted:

Five experiment runs are done, and although not all models reach the same accuracy, they get pretty close. It took 9 x 1019 FLOPs on average per experiment.

The following chart shows how accuracy improves over time during the experiments:

We observe that populations evolve until they plateau at some local optimum. The fitness (i.e. validation accuracy) value at this optimum varies between experiments (Above, inset). Since not all experiments reach the highest possible value, some populations are getting “trapped” at inferior local optima. This entrapment is affected by two important meta-parameters (i.e. parameters that are not optimized by the algorithm). These are the population size and the number of training steps per individual.

The larger the population size, the more thoroughly the space of models can be explored, which helps to reach better optima. More training time means that a model needs to undergo fewer identity mutations to reach a given level of training (remember that the end result of the evolution process is a fully trained model, not just a model architecture).

Two other approaches to escaping local optima are increasing the mutation rate, and resetting weights. When it looks like members of the population are trapped in poor local optima, the team tried applying 5 mutations instead of 1 for a few generations. During this period some population members escape the local optimum, and none get worse:

To avoid getting trapped by poorer architectures that just happened to have received more training (e.g. through the identity mutation), the team also tried experiments in which the weights are simultaneously reset across all population members when a plateau is reached. The populations suffer a temporary degradation (as to be expected), but ultimately reach a higher optima.

The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems. Ethically Aligned Design: A Vision For Prioritizing Wellbeing With Artificial Intelligence And Autonomous Systems, Version 1. IEEE, 2016. http://standards.ieee.org/develop/indconn/ec/autonomous_systems.html.

Something a little different for today… the IEEE recently put out a first version of their “Ethically Aligned Design” report for public discussion. It runs to 136 pages (!) but touches on a number of very relevant issues.

This document represents the collective input of over one hundred global thought leaders in the fields of Artificial Intelligence, law and ethics, philosophy, and policy from the realms of academia, science, and the government and corporate sectors.

The report itself is divided into eight sections, each of which seems to be the result of the deliberations of a different sub-committee. The eight areas are:

1. General Principles
2. Embedding values into autonomous intelligent systems
3. Methodologies to guide ethical research and design
4. Safety and beneficence of Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI)
5. Personal data and individual access control
6. Reframing autonomous weapons systems
7. Economics/humanitarian issues
8. Law

I’m going to focus on the first five of these areas today, and of necessity in reducing 136 pages to one blog post, I’ll be skipping over a lot of details and just choosing the parts that stand out to me on this initial reading.

### General Principles

Future AI systems may have the capacity to impact the world on the scale of the agricultural or industrial revolutions.

This section opens with a broad question, “How can we ensure that AI/AS do not infringe human rights?” (where AI/AS stands for Artificial Intelligence / Autonomous Systems throughout the report). The first component of the answer connects back to documents such as the Universal Declaration of Human Rights and makes a statement that I’m sure very few would disagree with, although it offers little help in the way of implementation:

AI/AS should be designed and operated in a way that respects human rights, freedoms, human dignity, and cultural diversity.

The other two components of the answer though raise immediately interesting technical considerations:

• AI/AS must be verifiably safe and secure throughout their operational lifetime.
• If an AI/AS causes harm it must always be possible to discover the root cause (traceability) for said harm.

The second of these in particular is very reminiscent of the GDPR ‘right to an explanation,’ and we looked at some of the challenges with provenance and explanation in previous editions of The Morning Paper.

A key concern over autonomous systems is that their operation must be transparent to a wide range of stakeholders for different reasons (noting that the level of transparency will necessarily be different for each stakeholder). Stated simply, a transparent AI/AS is one in which it is possible to discover how and why the system made a particular decision, or in the case of a robot, acted the way it did.

The report calls for new standards describing “measurable, testable levels of transparency, so that systems can be objectively assessed and levels of compliance determined.

### Embedding values

This is an interesting section. The overall argument / intention seems to be that we want to build systems that make decisions which align with the way the impacted communities would like decisions to be made. The actual wording raises a few questions though…

Society does not have universal standards or guidelines to help embed human norms or moral values into autonomous intelligent systems (AIS) today. But as these systems grow to have increasing autonomy to make decisions and manipulate their environment, it is essential they be designed to adopt, learn, and follow the norms and values of the community they serve, and to communicate and explain their actions in as transparent and trustworthy manner possible, given the scenarios in which they function and the humans who use them.

What if the norms and values of the community they serve aren’t desirable? For example, based on all the horrific stories that are increasingly being shared, the ‘norm’ of how women are treated in IT is not something we would ever want to propagate into an AIS. There are many examples in history of things that were once accepted norms which we now find very unacceptable. Could we not embed norms and values (e.g., non-discrimination) of a better, more noble version of ourselves and our communities? Presuming of course we can all agree on what ‘better’ looks like…

Values to be embedded in AIS are not universal, but rather largely specific to user communities and tasks.

This opens the door to ‘moral overload’, in which an AIS is subject to many possibly conflicting norms and values. What should we do in these situations? The recommended best practice seems guaranteed to produce discrimination against minorities (but then again, so does democracy when viewed through the same lens, this stuff is tricky!):

Our recommended best practice is to prioritize the values that reflect the shared set of values of the larger stakeholder groups. For example, a self-driving vehicle’s prioritization of one factor over another in its decision making will need to reflect the priority order of values of its target user population, even if this order is in conflict with that of an individual designer, manufacturer, or client.

In the same section though, we also get:

Moreover, while deciding which values and norms to prioritize, we call for special attention to the interests of vulnerable and under-represented populations, such that these user groups are not exploited or disadvantaged by (possibly unintended) unethical design.

The book “Moral Machines: Teaching robots right from wrong” is recommended as further reading in this area.

Understanding whether / ensuring that systems actually implement the intended norms requires transparency. Two levels of transparency are envisaged: firstly around the information conveyed to the user while an autonomous system interacts, and secondly enabling the system to be evaluated as a whole by a third-party.

A system with the highest level of traceability would contain a black-box like module such as those used in the airline industry, that logs and helps diagnose all changes and behaviors of the system.

### Methodologies to guide ethical research and design

The report highlights two key issues relating to business practices involving AI:

• a lack of value-based ethical culture and practices, and
• a lack of values-aware leadership

Businesses are eager to develop and monetize AI/AS but there is little supportive structure in place for creating ethical systems and practices around its development or use… Engineers and design teams are neither socialized nor empowered to raise ethical concerns regarding their designs, or design specifications, within their organizations. Considering the widespread use of AI/AS and the unique ethical questions it raises, these need to be identified and addressed from their inception

Companies should implement ‘ethically aligned design’ programs (from which the entire report derives its title). Professional codes of conduct can support this (there’s a great example from the British Computer Society in this section of the report).

The lack of transparency about the AI/AS manufacturing process presents a challenge to ethical implementation and oversight. Regulators and policymakers have an important role to play here the report argues. For example:

…when a companion robot like Jibo promises to watch your children, there is no organization that can issue an independent seal of approval or limitation on these devices. We need a ratings and approval system ready to serve social/automation technologies that will come online as soon as possible.

CloudPets anyone? What a disgrace.

For further reading, “An FDA for Algorithms,” and “The Black Box Society” are recommended.

There’s a well made point by Frank Pasquale, Professor of Law at the University of Maryland about the importance (and understandability) of the training data vs the algorithm too:

…even if machine learning processes are highly complex, we may still want to know what data was fed into the computational process. Presume as complex a credit scoring system as you want. I still want to know the data sets fed into it, and I don’t want health data in that set…

### Safety and beneficence of AGI and ASI

This section stresses the importance of a ‘safety mindset’ at all stages.

As AI systems become more capable, unanticipated or unintended behavior becomes increasingly dangerous, and retrofitting safety into these more generally capable and autonomous AI systems may be difficult. Small defects in AI architecture, training, or implementation, as well as mistaken assumptions, could have a very large impact when such systems are sufficiently capable.

The paper “Concrete problems in AI safety” (on The Morning Paper backlog) describes a range of possible failure modes.

Any AI system that is intended to ultimately have capabilities with the potential to do harm should be design to avoid these issues pre-emptively. Retrofitting safety into future more generally capable AI systems may be difficult:

As an example, consider the case of natural selection, which developed an intelligent “artifact” (brains) by simple hill-climbing search. Brains are quite difficult to understand, and “refactoring” a brain to be trustworthy when given large amounts of resources and unchecked power would be quite an engineering feat. Similarly, AI systems developed by pure brute force might be quite difficult to align.

### Personal data and individual access control

This is the section most closely aligned with the GDPR, and at its heart is the problem of the asymmetry of data:

Our personal information fundamentally informs the systems driving modern society but our data is more of an asset to others than it is to us. The artificial intelligence and autonomous systems (AI/AS) driving the algorithmic economy have widespread access to our data, yet we remain isolated from gains we could obtain from the insights derived from our lives.

The call is for tools allowing every individual citizen control over their own data and how it is shared. There’s also this very interesting reminder about Western cultural norms here too:

We realize the first version of The IEEE Global Initiative’s insights reflect largely Western views regarding personal data where prioritizing an individual may seem to overshadow the use of information as a communal resource. This issue is complex, as identity and personal information may pertain to single individuals, groups, or large societal data sets.

What is personal data? Any data that can be reasonably linked to an individual based on their unique physical, digital, or virtual identity. That includes device identifiers, MAC addresses, IP addresses, and cookies. Guidance on determining what constitutes personal data can be found in the U.K. Information Commissioner’s Office paper, “Determining what is personal data.”

As a tool for any organization regarding these issues, a good starting point is to apply the who, what, why and when test to the collection and storage of personal information:

• Who requires access and for what duration?
• What is the purpose of the access? Is it read, use and discard, or collect, use and store?
• Why is the data required? To fulfil compliance? Lower risk? Because it is monetized? In order to provide a better service/experience?
• When will it be collected, for how long will it be kept, and when will it be discarded, updated, re-authenticated…

The report also points out how difficult informed consent can be. For example, “Data that appears trivial to share can be used to make inferences that an individual would not wish to share…

### Afterword

I’ve barely scratched the surface, but this post is getting too long already. One of the key takeaways is that this is a very complex area! I personally hold a fairly pessimistic view when it comes to hoping that the unrestrained forces of capitalism will lead to outcomes we desire. Therefore even though it may seem painful, some kind of stick (aka laws and regulations) does ultimately seem to be required. Will our children (or our children’s children) one day look back in horror at the wild west of personal data exploitation when everything that could be mined about a person was mined, exploited, packaged and sold with barely any restriction?

Let’s finish on a positive note though. It’s popular to worry about AI and Autonomous Systems without also remembering that they can be a force for tremendous good. In addition, as well as introducing unintended bias and discrimination, they can also be used to eliminate it in a way we could never achieve with human decision makers. An example I’ve been talking about here draws inspiration from the Adversarial Neural Cryptography paper of all things. There we get a strong hint that the adversarial network structure introduced with GANs can also be applied in other ways. Consider a network that learns an encoding of information about a person (but explicitly excluding, say, information about race and gender). Train it in conjunction with two other networks, one that learns to make the desired business predictions based on the learned representation, and one (the adversarial net) that attempts to predict race and gender based on that same representation. When the adversarial net cannot do better than random chance, we have a pretty good idea that we’ve eliminated unintended bias from the system…

To round out the week, I thought I’d take a selection of fun papers from the ‘More papers from 2016’ section of top 100 awesome deep learning papers list.

The texture networks paper we’ve covered before, so the link in the above list is to The Morning Paper write-up (but I felt like it belonged in this group nevertheless).

### Colorful image colorization

Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph.

How is this possible? Well, we’ve seen that networks can learn what various parts of the image represent. If you see enough images you can learn that grass is (usually) green, the sky is (sometimes!) blue, and ladybirds are red. The network doesn’t have to recover the actual ground truth colour, just a plausible colouring.

Therefore, our task becomes much more achievable: to model enough of the statistical dependencies between the semantics and the textures of grayscale images and their color versions in order to produce visually compelling results.

Results like this:

Training data for the colourisation task is plentiful – pretty much any colour photo will do. The tricky part is finding a good loss function – as we’ll see soon, many loss functions produce images that look desaturated, whereas we want vibrant realistic images.

The network is based on image data using the CIE Lab colourspace. Grayscale images have only the lightness, L, channel, and the goal is to predict the a (green-red) and b (blue-yellow) colour channels. The overall network architecture should look familiar by now, indeed so familiar that supplementary details are pushed to an accompanying website.

(That website page is well worth checking out by the way, it even includes a link to a demo site on Algorithmia where you can try the system out for yourself on your own images).

Colour prediction is inherently multi-modal, objects can take on several plausible colourings. Apples for example may be red, green, or yellow, but are unlikely to be blue or orange. To model this, the prediction is a distribution of possible colours for each pixel. A typical objective function might use e.g. Euclidean loss between predicted and ground truth colours.

However, this loss is not robust to the inherent ambiguity and multimodal nature of the colorization problem. If an object can take on a set of distinct ab values, the optimal solution to the Euclidean loss will be the mean of the set. In color prediction, this averaging effect favors grayish, desaturated results. Additionally, if the set of plausible colorizations is non-convex, the solution will in fact be out of the set, giving implausible results.

What can we do instead? The ab output space is divided into bins with grid size 10, and the top Q = 313 in-gamut (within the range of colours we want to use) are kept:

The network learns a mapping to a probability distribution over these Q colours (a Q-dimensional vector). The ground truth colouring is also translated into a Q-dimensional vector, and the two are compared using a multinomial cross entropy loss. Notably this includes a weighting term to rebalance loss based on colour-class rarity.

The distribution of ab values in natural images is strongly biased towards values with low ab values, due to the appearance of backgrounds such as clouds, pavement, dirt, and walls. Figure 3(b) [below] shows the empirical distribution of pixels in ab space, gathered from 1.3M training images in ImageNet. Observethat the number of pixels in natural images at desaturated values are orders of magnitude higher than for saturated values. Without accounting for this, the loss function is dominated by desaturated ab values.

The final predicted distribution then needs to be mapped to a point estimated in ab space. Taking the mode of the predicted distribution leads to vibrant but sometimes spatially inconsistent results (see RH column below). Taking the mean brings back another form of the desaturation problem (see LH column below).

To try to get the best of both worlds, we interpolate by re-adjusting the temperature T of the softmax distribution, and taking the mean of the result. We draw inspiration from the simulated annealing technique, and thus refer to the operation as taking the annealed-mean of the distribution.

Here are some more colourings from a network trained on ImageNet, which were rated by Amazon Mechanical Turk participants to see how lifelike they are.

And now for my very favourite part of the paper:

Since our model was trained using “fake” grayscale images generated by stripping ab channels from color photos, we also ran our method on real legacy black and white photographs, as shown in Figure 8 (additional results can be viewed on our project webpage). One can see that our model is still able to produce good colorizations, even though the low-level image statistics of the legacy photographs are quite different from those of the modern-day photos on which it was trained.

Aren’t they fabulous! Especially the Migrant Mother colouring.

The representations learned by the network also proved useful for object classification, detection, and segmentation tasks.

### Generative visual manipulation on the natural image manifold

So we’ve just seen that neural networks can help us with our colouring. But what about those of us that are more artistically challenged and have a few wee issues making realistic looking drawings (or alterations to existing drawings) in the first place? In turns out that generative adversarial neural networks can help.

It’s a grand sounding paper title, but you can think of it as “Fiddling about with images while ensuring they still look natural.” I guess that wouldn’t look quite so good in the conference proceedings ;).

Today, visual communication is sadly one-sided. We all perceive information in the visual form (through photographs, paintings, sculpture, etc), but only a chosen few are talented enough to effectively express themselves visually… One reason is the lack of “safety wheels” in image editing: any less-than-perfect edit immediately makes the image look completely unrealistic. To put another way, classic visual manipulation paradigm does not prevent the user from “falling off” the manifold of natural images.

As we know, GANs can be trained to learn effective representations of natural looking images (“the manifold of natural images“). So let’s do that, but then instead of using the trained GAN to generate images, use it as a constraint on the output of various image manipulation operations, to make sure the results lie on the learned manifold at all times. The result is an interactive tool that helps you make realistic looking alterations to existing images. It helps to see the tool in action, you can see a video here. The authors also demonstrate ‘generative transformation’ of one image to look more like another, and my favourite, creating a new image from scratch based on a user’s sketch.

The intuition for using GANs to learn manifold approximations is that they have been shown to produce high-quality samples, and that Euclidean distance in the latent space often corresponds to a perceptually meaningful visual similarity. This means we can also perform interpolation between points in the latent space.

Here’s what happens when the latent vector is updated based on user edits (top row, adding black colour and changing the shape):

In the interactive tool, each update step takes about 50-100ms, working only on the mapped representation of the original image. When the user is done, the generated image captures roughly the desired change, but the quality is degraded as compared to the original image.

To address this issue, we develop a dense correspondence algorithm to estimate both the geometric and color changes induced by the editing process.

This motion and colour flow algorithm is used to estimate the colour and shape changes in the generated image sequence (as user editing progressed), and then transfer them back on top of the original photo to generate photo-realistic images.

The user interface gives the user a colouring brush for changing the colour of regions, a sketching brush to outline shapes or add fine details, and a warping ‘brush’ for more explicit shape modifications.

Here are some results from user edits:

Transformations between two images also take place in the GAN-learned representation space and are mapped back in the same way:

It’s also possible to use the brush tools to create an image from scratch, and then add more scribbles to refine the result. How good is this! :

### WaveNet: a generative model for raw audio

Enough with the images already! What about generating sound? How about text-to-speech sound generation yielding state of the art performance? Hearing is believing, so check out these samples:

(You can find more on the DeepMind blog at https://deepmind.com/blog/wavenet-generative-model-raw-audio/).

We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.

The architecture of WaveNet is inspired by PixelRNN (See “RNN models for image generation” from a couple of weeks ago). The foundation is very simple – take a waveform $\mathbf{x}$ with with T values, ${x_1, ..., x_T}$. And let the probability of the $x_t$ be conditioned on all of the values that precede it: $p(x_t | x_1, ..., x_{t-1})$. Now the joint probability of the overall waveform is modelled by:

$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})$

This can be modelled by a stack of convolutional layers. “By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data…” (the prediction at timestep t cannot depend on any of the future timestamps). This can be implemented by shifting the output of a normal convolution by one or more timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Causal convolutions need lots of layers to increase their receptive field. WaveNet uses dilated convolutions to increase receptive fields by orders of magnitude, without greatly increasing computational cost.

A dilated convolution is a convolution where the filter is applied over an area large than its length by skipping input values with a certain step.

WaveNet uses dilation doubling in every layer up to a limit of 512, before repeating (1,2,4, …, 512, 1,2,4, …, 512, …).

A straight softmax output layer would need 65,356 probabilities per timestep to model all possible values for raw audio stored as a sequence of 16-bit integer values. The data is quantized to 256 possible values using a non-linear quantization scheme which was found to produce a significantly better reconstruction than a simple linear scheme:
$f(x_t) = sign(x_t) \frac{\ln(1 + \mu|x_t|)}{\ln(1 + \mu)}$

where $-1 < x_t < 1$ and $\mu = 255$. The network uses both residual and parameterised skip connections throughout to speed up convergence. By conditioning the model on additional inputs, WaveNet can be guided to produce audio with the required characteristics (e.g., a certain speaker’s voice). For TTS, information about the text is fed as an extra input. > For the first experiment we looked at free-form speech generation (not conditioned on text). We used the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK) (Yamagishi, 2012) and conditioned WaveNet only on the speaker. The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers.

Since it wasn’t conditioned on text, the model generates made-up but human language-like words in a smooth way (see second audio clip at the top of this section). It can model speech from any of the speakers by conditioning it on the one-hot encoding – thus the model is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model.

The second experiment trained WaveNet on the same single-speaker speech databases that Google’s North American and Mandarin Chinese TTS systems are built on. WaveNet was conditioned on linguistic features derived from the input texts. In subjective paired comparison tests, WaveNet beat the best baselines:

WaveNet also achieved the highest ever score in a mean opinion score test where users had to rate naturalness on a scale of 1-5 (scoring over 4 on average). (See the first speech sample at the top of this section).

The third set of experiments trained WaveNet on two music datasets (see the third speech sample at the top of this section). “…the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.

From the WaveNet blog post on the DeepMind site:

WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.

### Google’s neural machine translation system: bridging the gap between human and machine translation

Google’s Neural Machine Translation (GNMT) system is an end-to-end learning system for automated translation. Previous NMT systems suffered in one or more of three key areas: training and inference speed, coping with rare words, and sometimes failing to translate all of the words in a source sentence. GNMT is now in production at Google, having handsomely beaten the Phrase-Based Machine Translation (PBMT) system used in production at Google beforehand.

Understanding how it all fits together will draw upon many of the papers we’ve looked at so far. At the core it’s a sequence-to-sequence learning network with an encoder network, a decoder network, and an attention network.

The encoder transforms a source sentence into a list of vectors, one vector per input symbol. Given this list of vectors, the decoder produces one symbol at a time, until the special end-of-sentence symbol (EOS) is produced. The encoder and decoder are connected through an attention module which allows the decoder to focus on different regions of the source sentence during the course of decoding.

The decoder is a combination of an RNN network and a softmax layer. Deeper models give better accuracy, but the team found that LSTM layers worked well up to 4 layers, barely with 6 layers, and very poorly beyond 8 layers. What to do? We learned the answer earlier this week, add residual connections:

Since in translation words in the source sentence may appear anywhere in the output sentence, the encoder network uses a bi-directional RNN for the encoder. Only the bottom layer is bi-direction – one LSTM layer processes the sentence left-to-right, while its twin processes the sentence right-to-left.

The encoder and decoder networks are placed on multiple GPUs, with each layer running on a different GPU. As well as using multiple GPUs, to get inference time down quantized inference involving reduce precision arithmetic is also used.

One of the main challenges in deploying our Neural Machine Translation model to our interactive production translation service is that it is computationally intensive at inference, making low latency translation difficult, and high volume deployment computationally expensive. Quantized inference using reduced precision arithmetic is one technique that can significantly reduce the cost of inference for these models, often providing efficiency improvements on the same computational devices.

The model is trained using full-precision floats, it is only for production inference that approximation is used. Here are the decoding times for 6003 English-French sentences across CPU, GPU, and Google’s Tensor Processing Unit (TPU) respectively:

Firstly, note that the TPU beats the CPU and GPU hands-down. The CPU beats the GPU because “our current decoder implementation is not fully utilizing the computation capacities that a GPU can theoretically offer during inference.”

#### Dealing with out of vocabulary words

Neural Machine Translation models often operate with fixed word vocabularies even though translation is fundamentally an open vocabulary problem (names, numbers, dates etc.)… Our most successful approach […] adopts the wordpiece model (WPM) implementation initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system. This approach is completely data-driven and guaranteed to generate a deterministic segmentation for any possible sequence of characters.

For example, “Jet makers feud over seat width with big orders at stake” turns into the word pieces: “_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.” The words ‘Jet’ and ‘feud’ are both broken into two word pieces.

Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model.

#### Overall model performance.

The following chart shows side-by-side scores for translations made by the previous production system (PBMT), the new GNMT system, and humans fluent in both languages.

Side-by-side scores range from 0 to 6, with a score of 0 meaning “completely nonsense translation”, and a score of 6 meaning “perfect translation: the meaning of the translation is completely consistent with the source, and the grammar is correct”. A translation is given a score of 4 if “the sentence retains most of the meaning of the source sentence, but may have some grammar mistakes”, and a translation is given a score of 2 if “the sentence preserves some of the meaning of the source sentence but misses significant parts”. These scores are generated by human raters who are fluent in both languages and hence often capture translation quality better than BLEU scores.

Today we’re pressing on with the top 100 awesome deep learning papers list, and the section on recurrent neural networks (RNNs). This contains only four papers (joy!), and even better we’ve covered two of them previously (Neural Turing Machines and Memory Networks, the links below are to the write-ups). That leaves up with only two papers to cover today, however the first paper does run to 43 pages and it’s a lot of fun so I’m glad to be able to devote a little more space to it.

These papers are easier to understand with some background in RNNs and LSTMs. Christopher Olah has a wonderful post on “Understanding LSTM networks” which I highly recommend.

### Generating sequences with recurrent neural networks

This paper explores the use of RNNs, in particular, LSTMs, for generating sequences. It looks at sequences over discrete domains (characters and words), generating synthetic wikipedia entries, and sequences over real-valued domains, generating handwriting samples. I especially like the moment where Graves demonstrates that the trained networks can be used to ‘clean up’ your handwriting, showing what a slightly neater / easier to read version of your handwriting could look like. We’ll get to that shortly…

RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming.

Using LSTMs effectively gives the network a longer memory, enabling it to look back further in history to formulate its predictions.

The basic RNN architecture used for all the models in the paper looks like this:

Note how each output vector $y_t$ is used to parameterise a predictive distribution $Pr(x_{t+1} | y_t)$ over the next possible inputs (the dashed lines in the above figure). Also note the use of ‘skip connections’ as we looked at in yesterday’s post.

The LSTM cells used in the network look like this:

They are trained with the full gradient using backpropagation. To prevent the derivatives becoming too large, the derivative of the loss with respect to the inputs to the LSTM layers are clipped to lie within a predefined range.

Onto the experiments…

#### Text prediction

For text prediction we can either use sequences of words, or sequences of characters. With one-hot encodings, the number of different classes for words makes for very large input vectors (e.g. a vocabulary with 10’s of thousands of words of more). In contrast, the number of characters is much more limited. Also,

… predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.

The Penn Treebank dataset is a selection of Wall Street Journal articles. It’s relatively small at just over a million words in total, but widely used as a language modelling benchmark. Both word and character level networks were trained on this corpus using a single hidden layer with 1000 LSTM units. Both networks are capable of overfitting the training data, so regularisation is applied. Two forms of regularisation were experimented with: weight noise applied at the start of each training sequence, and adaptive weight noise, where the variance of the noise is learned along with the weights.
The word-level RNN performed better than the character-based one, but the gap closes with regularisation (perplexity of 117 in the best word-based configuration, vs 122 for the best character-based configuration).

“Perplexity can be considered to be a measure of on average how many different equally most probable words can follow any given word. Lower perplexities represent better language models…” ([][] http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/node188_mn.html )

Much more interesting is a network that Graves trains on the first 96M bytes of the Wikipedia corpus (as of March 3rd 2006, captured for the Hutter prize competition). This has seven hidden layers of 700 LSTM cells each. This is an extract of the real Wikipedia data:

And here’s a sample generated by the network (for additional samples, see the full paper):

The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.

It can correctly open and close quotation marks and parentheses, indicating the models memory and these often span a distance that a short-range context cannot handle. Likewise, it can generate distinct large-scale regions such as XML headers, bullet-point lists, and article text.

Of course, the actual generated articles don’t make any sense to a human reader, it is just their structure that is mimicked. When we move onto handwriting though, the outputs do make a lot of sense to us…

#### Handwriting prediction

To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.

The dataset consists of handwritten lines on a smart whiteboard, with x,y co-ordinates and end-of-stroke markers (yes/no) captured at each time point. The main challenge was figuring out how to determine a predictive distribution for real-value inputs. The solution is to use mixture density neworks. Here the outputs of the network are used to parameterise a mixture distribution. Each output vector consists of the end of stroke probability e, along with a set of means, standard deviations, correlations, and mixture weights for the mixture components used to predict the x and y positions. See pages 20 and 21 for the detailed explanation.

Here are the mixture density outputs for predicted locations as the word under is written. The small blobs show accurate predictions while individual strokes are being written, and the large blobs show greater uncertainty at the end of strokes when the pen is lifted from the whiteboard.

The best samples were generated by a network with three hidden layers of 400 LSTM cells each, and 20 mixture components to model the offsets. Here are some samples created by the network.

The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character level language models, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures

#### Handwriting generation

Those samples do of course look like handwriting, but as with our Wikipedia example, the actual words are nonsense. Can we learn to generated handwriting for a given text? To meet this challenge a soft window is convolved with the text string and fed as an extra input to the prediction network.

The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.

The network learns how far to slide the text window at each step, rather than learning an absolute position. “Using offsets was essential to getting the network to align the text with the pen trace.

And here are samples generated by the resulting network:

Pretty good!

#### Biased and primed sampling to control generation

One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output more probable elements of Pr(x|c) if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results.

As we increase the bias towards higher probability predictions, the handwriting gets neater and neater…

As a final flourish, we can prime the network with a real sequence in the handwriting of a particular writer. The network then continues in this style, generating handwriting mimicking the author’s style.

Combine this with bias, and you also get neater versions of their handwriting!

### Conditional random fields as recurrent neural networks

Now we turn our attention to a new challenge problem that we haven’t looked at yet: semantic segmentation. This requires us to label the pixels in an image to indicate what kind of object they represent/are part of (land, building, sky, bicycle, chair, person, and so on…). By joining together regions with the same label, we segment the image based on the meaning of the pixels. Like this:

(The CRF-RNN column in the above figure shows the results from the network architecture described in this paper).

As we’ve seen, CNNs have been very successful in image classification and detection, but there are challenges applying them to pixel-labelling problems. Firstly, traditional CNNs don’t produce fine-grained enough outputs to label every pixel. But perhaps more significantly, even if we could overcome that hurdle, they don’t have any way of understanding that if pixel A is part of, say, a bicycle, then it’s likely that the adjacent pixel B is also part of a bicycle. Or in more fancy words:

CNNs lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output. Lack of such smoothness constraints can result in poor object delineation and small spurious regions in the segmentation output.

Conditional Random Fields (a variant of Markov Random Fields) are very good at smoothing. They’re basically models that take into account surrounding context when making predictions. So maybe we can combine Conditional Random Fields (CRF) and CNNs in some way to get the best of both worlds?

The key idea of CRF inference for semantic labelling is to formulate the label assignment problem as a probabilistic inference problem that incorporates assumptions such as the label agreement between similar pixels. CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks in utilizing CNNs for pixel-level labelling tasks.

Sounds good in theory, but it’s quite tricky in practice. The authors proceed in two stages: firstly showing that one iteration of the mean-field algorithm used in CRF can be modelled as a stack of common CNN layers; and secondly by showing that repeating the CRF-CNN stack with outputs from the previous iteration fed back into the next iteration you can end up with an RNN structure, dubbed CRF-RNN, that implements the full algorithm.

Our approach comprises a fully convolutional network stage, which predicts pixel-level labels without considering structure, followed by a CRF-RNN stage, which performs CRF-based probabilistic graphical modelling for structured prediction. The complete system, therefore, unifies the strengths of both CNNs and CRFs and is trainable end-to-end using the back-propagation algorithm and the Stochastic Gradient Descent (SGD) procedure.

There’s a lot of detail in the paper, some of which passes straight over my head, for example, the following sentence which warranted me bringing out the ‘hot pink’ highlighter:

In terms of permutohedral lattice operations, this can be accomplished by only reversing the order of the separable filters in the blur stage, while building the permutohedral lattice, splatting, and slicing in the same way as in the forward pass.

(What is a permutohedron you may ask? It’s actually not as scary as it sounds…)

Fortunately, we’re just trying to grok the big picture in this write-up, and for that the key is to understand how CNNs can model one mean-field iteration, and then how we stack the resulting structures in RNN formation.

#### Mean-field iteration as a stack of CNN layers

Consider a vector X with one element per pixel, representing the label assigned to that pixel drawn from some pre-defined set of labels. We construct a graph where the vertices are the elements in X, and edges between the elements hold pairwise ‘energy’ values. Minimising the overall energy of the configuration yields the most probable label assignments. Energy has two components: a unary component which depends only on the individual pixel and roughly speaking, predicts labels for pixels without considering smoothness and consistency; and pairwise energies that provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties. The energy calculations are based on feature vectors derived from image features. Mean-field iteration is used to find an approximate solution for the minimal energy configuration.

The steps involved in a single iteration are:

• message passing,
• re-weighting,
• compatibility transformation,
• normalisation

Message passing is made tractable by using approximation techniques (those permutohedral lattice thingies) and two Guassian kernels: a spatial kernel and a bilateral kernel. Re-weighting can be implemented as a 1×1 convolution. Each kernel is given independent weights:

The intuition is that the relative importance of the spatial kernel vs the bilateral kernel depends on the visual class. For example, bilateral kernels may have on the one hand a high importance in bicycle detection, because similarity of colours is determinant; on the other hand they may have low importance for TV detection, given that whatever is inside the TV screen may have different colours.

Compatibility transformation assigns penalties when different labels are assigned to pixels with similar properties. It is implemented with a convolutional filter with learned weights (equivalent to learning a label compatibility function).

The addition (copying) and normalisation (softmax) operations are easy.

#### CRF as a stack of CRF-CNN layers

Multiple mean-field iterations can be implemented by repeating the above stack of layers in such a way that each iteration takes Q value estimates from the previous iteration and the unary values in their original form. This is equivalent to treating the iterative mean-field inference as a Recurrent Neural Network (RNN)… We name this RNN structure CRF-RNN.

Recall that the overall network has a fully-convolution network stage predicting pixels labels in isolation, followed by a CRF-CNN for structured prediction. In one forward pass the computation goes through the initial CNN stage, and then it takes T iterations for data to leave the loop created by the RNN. One the data leaves this loop, a softmax loss layer directly follows and terminates the network.

The resulting network achieves the state-of-the-art on the Pascal VOC 2010-2012 test datasets.

Today we’re looking at the final four papers from the ‘convolutional neural networks’ section of the ‘top 100 awesome deep learning papers‘ list.

### Deep residual learning for image recognition

Another paper, another set of state-of-the-art results, this time with 1st place on the ILSVRC 2015 classification tasks (beating GoogLeNet from the year before), as well as 1st place on ImageNet detection, ImageNet localisation, COCO detection, and COCO segmentation competitions. For those of you old enough to remember it, there’s a bit of a Crocodile Dundee moment in this paper: “22 layers? That’s not a deep network, this is a deep network…” How deep? About 100 layers seems to work well, though the authors tested networks up to 1202 layers deep!! Which begs the question, how on earth do you effectively train a network that is hundreds or even a thousand layers deep?

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers?

No, it’s not. A degradation problem occurs as network depth increases, accuracy gets saturated, and then degrades rapidly as further layers are added. So after a point, the more layers you add, the worse the error rates. For example:

An interesting thought experiment leads the team to make a breakthrough and defeat the degradation problem. Imagine a deep network where after a certain (relatively shallow) number of layers, each additional layer is simply an identity mapping from the previous one: “the existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.”

Suppose the desired mapping through a number of layers for certain features really is close to the identity mapping. The degradation problem suggests that the network finds it hard to learn such mappings across multiple non-linear layers. Lets say the output of one layer is $\mathbf{x}$. Now skip forward three layers in the network and imagine that we’d like the ability to easily learn an identity mapping (or close to) for some elements of $\mathbf{x}$ – an easy way to do this is just to provide the elements of $\mathbf{x}$ as direct inputs…

But we don’t want to just double the width of the receiving layer, and for some other elements of $\mathbf{x}$ perhaps we do want to learn some non-linear function. So here comes the clever part: suppose the ideal function we could learn is $H(\mathbf{x})$. If the intervening hidden layers can learn that then they can also learn $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$. Maybe you can see where this is going: we can now simply combine the output of the previous layer with $\mathbf{x}$ using simple addition and we get $F(\mathbf{x}) + \mathbf{x} = H (\mathbf{x})$ ! But constructing things in this way though, we make it easy for the network to include both non-linear elements and also near identity elements (by learning weights close to zero).

Adding these layer jumping + addition short-circuits to a deep network creates a residual network. Here’s a 34-layer example:

Here are 18 and 34 layer networks trained on ImageNet without any residual layers. You can see the degradation effect with the 34 layer network showing higher error rates.

Take the same networks and pop in the residual layers, and now the 34-layer network is handsomely beating the 18-layer one.

Here are the results all the way up to 152 layer networks:

Now that’s deep!

### Identity mappings in deep residual networks

This paper analyses the skip-layers introduced in the residual networks (ResNets) that we just looked at to see whether the identity function is the best option for skipping. It turns out that it is, and that using an identity function for activation as well makes residual units even more effective.

Recall that we used $F(\mathbf{x}) + \mathbf{x}$ as the input when skipping layers. More generally, this is $F(\mathbf{x}) + h(\mathbf{x})$ where h is the identity function. But h could be also be another kind of function – for example constant scaling. The authors try a variety of different h functions, but identity works best. When we consider the activation function f as well (the original paper used ReLU) then we have $f(F(\mathbf{x}) + h(\mathbf{x}))$. Things work even better when f is also an identity mapping (instead of ReLU).

To construct an identity mapping f, we view the activation functions (ReLU and Batch Normalization) as “pre-activation” of the weight layers, in contrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in Fig 1(b).

Making both h and f identity mappings allows a signal to be directly propagated from one unit to any other units, in both forward and backward passes.

Here’s the difference the new residual unit design makes when training a 1001 layer ResNet:

Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet. We further report improved results on ImageNet using a 200-layer ResNet, for which the counterpart of [the original ResNet] starts to overfit. These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.

### Inception-v4, inception-resnet and the impact of residual connections or learning

While all this ResNet excitement was going on, the Inception team kept refining their architecture, up to Inception v3. The obvious question became: what happens if we take the Inception architecture and we add residual connections to it? Does that further improve training time or accuracy? This paper compares four models: Inception v3, a newly introduced in this paper Inception v4, and variations of Inception v3 and v4 that also have residual connections. It’s also fun to see ever more complex network building blocks being used as modules in higher level architectures. Inception v4 is too much to show in one diagram, but here’s the overall schematic:

And as a representative selection, here’s what you’ll find if you dig into the stem module:

And the ’17 x 17 Inception-B’ module:

Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin.

Since Inception 4 and Inception-ResNet-v2 (Inception 4 with residual connections) give overall very similar results, this seems to suggest that – at least at this depth – residual networks are not necessary for training deep networks.

However, the use of residual connections seems to improve the training speed greatly, which is alone a great argument for their use.

Both the Inception v4 and Inception ResNet-2 models outperform previous Inception networks, by virtue of the increased model size.

### Rethinking the Inception architecture for computer vision

This paper comes a little out of order in our series, as it covers the Inception v3 architecture. The bulk of the paper though is a collection of advice for designing image processing deep convolutional networks. Inception v3 just happens to be the result of applying that advice.

In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways.

• Avoid representational bottlenecks – representation size should gently decrease from the inputs to the outputs before reaching the final representation used for task at hand. Big jumps (downward) in representation size cause extreme compression of the representation and bottleneck the model.
• Higher dimensional representationsare easier to process locally in a network, more activations per tile allows for more disentangled features. The resulting networks train faster.
• Spatial aggregation of lower dimensional embeddings can be done without much or any loss in representational power. “For example, before performing a more spread out (e.g. 3×3 convolution), one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects.”
• Balance network width and depth, optimal performance is achieved by balancing the number of filters per stage and the depth of the network. I.e., if you want to go deeper you should also consider going wider.

Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.

• Factorize into smaller convolutions, a larger (e.g. 5×5) convolution is disproportionately more expensive than a smaller (e.g. 3×3) one – by a factor of 25/9 in this case. Replacing the 5×5 convolution with a two-layer network of 3×3 convolutions reusing activations between adjacent tiles achieves the same end but uses (9+9)/25 less computation.

The above results suggest that convolutions with filters larger 3 × 3 a might not be generally useful as they can always be reduced into a sequence of 3 × 3 convolutional layers. Still we can ask the question whether one should factorize them into smaller, for example 2×2 convolutions. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two layer network with the same receptive field as in a 3×3 convolution. The two-layer solution is 33% cheaper.

Taking this idea even further (e.g. replacing 7×7 with a 1×7 followed by a 7×1) works well on medium grid sizes (between 12 and 20), so long as it is not used in early layers.

The authors also revisit the question of the auxiliary classifiers used to aid training in the original Inception. “Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training… near the end of training the network with the auxiliary branches starts to overtake the accuracy of the network without, and reaches a slightly higher plateau.” Removing the lower of the two auxiliary classifiers also had no effect.

Together with the earlier observation in the previous paragraph, this means that original the hypothesis of [Inception] that these branches help evolving the low-level features is most likely misplaced. Instead, we argue that the auxiliary classifiers act as regularizer.

There are several other hints and tips in the paper that we don’t have space to cover. It’s well worth checking out if you’re building these kinds of models yourself.