Skip to content

Information distribution aspects of design methodology

October 17, 2016

Information distribution aspects of design methodology Parnas, 1971

We’re continuing with Liskov’s listthis week, and today’s paper is another classic from David Parnas in which you can see some of the same thinking as in ‘On the criteria….’ Parnas talks about the modules of a system (for contemporary feel, we could call them ‘microservices’ once more), how we end up with inadvertent tangling / coupling between microservices, and what we can do to prevent that. One of his key recommendations is that information about a microservice be carefully controlled – releasing information about how it works internally outside of the team that are working on it is not just unhelpful, it is positively harmful. Writing in 1971, this information communication was primarily through documentation, and so this is what Parnas discusses. I wonder if in 2016 he would make the claim “making the source code for a microservice available outside of the team working on it is harmful” ??? That would certainly be a statement likely to cause robust debate!


Update: David Parnas himself posted a comment answering my question. Wow! Here’s his reply:

You wrote, “I wonder if in 2016 he would make the claim “making the source code for a microservice available outside of the team working on it is harmful” ??? That would certainly be a statement likely to cause robust debate!” The answer is YES. Papers that I wrote subsequently talk about the need to design and then document interfaces and then give other developers the interface documentation instead of the code. In fact, you can find that in the 1973 papers. The difference is that I have much better techniques for documentation today than I did forty years ago.

David Lorge Parnas


 

A system has structure: the set of modules (microservices) in the system, and the connections between them.

Many assume that the “connections” are control transfer points, passed parameters, and shared data… Such a definition of “connection” is a highly dangerous oversimplification which results in misleading structure descriptions. The connections between microservices are the assumptions which the microservices make about each other.

Consider a change that needs to be made. “What changes can be made to one microservice without involving change to other services?”

We may make only those changes which do not violate the assumptions made by other microservices about the service being changed.

Making design decisions

During the design and development of a system we make decisions which eliminate some possibilities for system structure. This evolves over time. Three common considerations during the decision process are:

  1. Making sure the system delivers a great experience for its users
  2. Ensuring the system can be delivered as quickly as possible
  3. Making sure the system can easily support future changes

Each of these considerations suggests an optimal partial order for decision making, but those orderings are usually inconsistent!

  • Delivering a great experience suggests using an outside-in approach whereby the external characteristics are decided first (rather than letting them be unnoticed implications of decisions about other aspects of system structure).
  • Delivering the system as quickly as possible requires dividing it early into separate microservices:

Competitive pressures may require the use of large groups to produce a system in a sharply limited period of time. Additional developers speed up a project significantly only after the project has been divided into sub-projects in such a way that separate groups can work with little interaction (i.e. spending significantly less time in inter-group decisions than in intra-group decisions).

The desire to make the division early in the project lifecycle so that the team can ‘get on with it’ encourages a splitting along familiar lines and in agreement with the organisational structure (Conway).

Time pressures encourage groups to make the split before the externals are defined. Consequently, we find some adverse effect on the usability of the product. Haste also makes poor internal structure more likely.

  • When it comes to changeability, Parnas makes an interesting observation: the earlier a decision was made, the more difficult it is likely to be to change it because other parts of the system grow assumptions about it.

These considerations suggest that the early decisions should be those which are the least likely to change; i.e. those based on “universal” truths or reasoning which takes into account little about a particular environment… the possibility of change suggests using the most general information first.

Since the thing that often changes the most is the external characteristics, starting there may make the system harder to change.

Good programmers fight against the system

The crux of Parnas’ argument hinges on this assertion:

A good programmer makes use of the available information given him or her.

Sometimes those uses are obvious: calling a subroutine in another module, or reusing a reference table. Sometimes they are less obvious, for example exploiting knowledge that a list is searched or sorted in a certain order.

Such uses of information have been so costly that we observe a strange reaction. The industry has started to encourage bad programming…. Derogatory names such as “kludger,” “hacker,” and “bit twiddler” are used for the sort of fellow who writes terribly clever programs which cause trouble later on. They are subtly but effectively discouraged by being assigned to work on small independent projects such as application routines (the Siberia of the software world) or hardware diagnostic routines (the coal mines). In both situations the programmer has little opportunity to make use of information about other modules.

(I wonder what the modern Siberia and coal mines would be?)…

Those that remain (the non-bit-twiddlers) are usually poor programmers. While a few refrain from using information because they know it will cause trouble, most refrain because they are not clever enough to notice that the information can be used. Such people also miss opportunities to use facts which should be used. Poor programs result.

A programmer can disastrously increase the connectivity of the system structure by using information he or she possesses about other services.

We must deliberately control information distribution

If you buy Parnas’ argument that if you make information available, good programmers can’t help but make use of it, whether that is in the overall good of the system or not, then an obvious solution presents itself: don’t make information available!

We can avoid many of the problems discussed here by rejecting the notion that design information (or source code?) should be accessible to everyone. Instead we should allow the designers, those who specify the structure, to control the distribution of design information as it is developed.

Say your system depends on an external service, it’s likely that all you know about that service is the documentation for its REST API, or the client SDK built on top. But when you depend on an internal service, you typically know much more…

We should not expect a programmer to decide not to use a piece of information, rather he should not possess information that he should not use.

The last word

I consider the internal restriction of information within development groups to be of far more importance than its restriction from users or competitors. Much of the information in a system document would only harm a competitor if they had it. (They might use it!).

And the modern ‘system document’ is generally the source code, written in a suitably high-level language.

Program development by stepwise refinement

October 14, 2016

Program development by stepwise refinement Wirth, CACM 1971

This is the second of Barbara Liskov’s 7 ‘must-read’ CS papers. Wirth’s main point is that we have a tendency to focus far too much on mastering the syntax and style associated with a particular programming language, and nowhere near enough time on the process by which new programs are designed in the first place.

… active programming consists of the design of new programs, rather than contemplation of old programs… Clearly, programming courses should teach methods of design and construction, and the selected examples should be such that a gradual development can be nicely illustrated.

By far the bulk of the paper though, is dedicated to one worked example of this principle, introducing the idea of stepwise refinement.  The goal is to write a program to solve the eight-queens problem.

Picture credit: Wikipedia

Wirth starts out with a very high level sketch of a solution, and gradually fills in the details to make both a more concrete and a more efficient implementation over time.

In each step, one or several instruction of the given program are decomposed into more detailed instructions. This successive decomposition or refinement of specifications terminates when all instructions are expressed in terms of an underlying computer or programming language, and must therefore be guided by the facilities available on that computer or language.

It’s not just about refinement of the tasks involved in the program; crucially the data and data representations may also need to be refined. “It is natural to refine program and data specifications in parallel.” It’s interesting to think about the task-based decomposition that naturally arises from this method, and contrast it with Parnas’ advice in “On the criteria to be used in decomposing systems into modules“. The stepwise refinement example in this paper though is ‘in the small,’ and really about the stepwise refinement of an algorithm as much as anything.

Reading through the worked example, I’m glad we have higher-level programming languages to work with these days! (The paper uses Algol 60). There are a few too many global variables and pointers for modern tastes.  The thinking process is still valuable though.

A guideline in the process of stepwise refinement should be the principle to decompose decisions as much as possible, to untangle aspects which are only seemingly interdependent, and to defer those decisions which concern details of representation as long as possible.

Here are the steps in Wirth’s refinement:

Up-front thinking…

  1. Start out with a very simple brute force algorithm that tries all possible placements until it finds one that works.
  2. Realise that this would be highly inefficient (about 7 hours on the technology of the day), and thus we need to find a shortcut…
  3. Reduce the number of candidate solutions that must be generated and tried, using a little domain knowledge: in this case, that there must be one queen in every column. This process would find an answer in about 100 seconds on the technology of the day.
  4. Realise that we can do better by being systematic about the way we generate trial solutions via stepwise construction of trial solutions.  That is, we start out by placing a queen in the first column, and then adding a queen in successive columns at each next step. If at any point in this process one or more queens conflict with each other we know there is no point going any further, so instead we can backtrack and try again from there.  This version generates 876 partial solutions before finding a complete one, and takes about 0.09 seconds on the technology of the day.

Development of the program…

Only at this point does Wirth think about moving into code. The first version is expressed in terms of a number of high-level procedures, whose details have yet to be filled in:

The next step is to begin to fill-in the details of those procedures, at which point we can no longer defer any decision about data representation.

A common difficulty in program design lies in the unfortunate fact that at the stage where decisions about data representations have to be made, it often is still difficult to foresee the details of the necessary instructions operating on the data, and often quite impossible to estimate the advantages of one possible representation over another. In general, it is therefore advisable to delay decisions about data representation as long as possible…

(Or as Parnas would advise us, to hide information about data representation from the rest of the program).

With a little thought, Wirth rejects the straightforward 8×8 boolean matrix in favour of an array of columns.  Analysing the efficiency of the resulting structure, Wirth then introduces auxiliary variables that make it more computationally efficient to determine whether or not a given square is safe to place a queen on.

Upon every introduction of auxiliary data, care has to be taken of their correct initialization…

The final version of the program still displays the structure of the version designed in the first step… “other, equally valid solutions can be suggested and be developed by the same process of stepwise refinement.

Wirth concludes with five remarks about the process:

  1. Program construction consists of a sequence of refinement steps
  2. The degree of modularity obtained during the refinement will determine the ease or difficulty with which a program can be adapted to changes or extensions
  3. During stepwise refinement, a notation natural to the problem in hand should be used as long as possible… “the direction in which the notation develops during the process of refinement is determined by the language in which the program must ultimately be specified.”
  4. Each refinement implies a number of design decisions based on a set of design criteria.
  5. Even the small example program in the paper requires quite a lengthy explanation, indicating that careful programming is not a trivial subject!

Among these criteria [for the design decisions] are efficiency, storage economy, clarity, and regularity of structure. Students must be taught to be conscious of the involved decisions and to critically examine and to reject solutions, sometimes even if they are correct as far as the result is concerned; they must learn to weigh the various aspects of design alternatives in the light of these criteria. In particular, they must be taught to revoke earlier decisions and to back up, if necessary, even to the top.

Go to statement considered harmful

October 13, 2016

Go to statement considered harmful Dijkstra, CACM 1968

It sounds like the Heidelberg Laureate Forum this summer was a great event. Johanna Pirker was there and took notes on Barbara Liskov’s talk, including 7 papers that Liskov highlighted as ‘must reads’ for computer scientists.  I’m sure you’ve figured out where this is going… for the next seven days we’ll be covering those papers, starting with the classic ‘Go to statement considered harmful.’

I don’t think we really need to have a serious debate about the merits of goto anymore, but that doesn’t make Dijkstra’s short note any less relevant today. Dijkstra’s argument hinges on the realisation that when the gap between the program as laid out in text and the runtime behaviour of the program becomes too large, we’re heading for trouble. It could perhaps be summarized as “It should be obvious what the program does from looking at it.”

My first remark is that, although the programmer’s activity ends when he has constructed a correct program, the process taking place under control of his program is the true subject  matter of his activity, for it is this process that has to accomplish the desired effect; it is this  process that in its dynamic behavior has to satisfy the desired specifications. Yet, once the  program has been made, the “making’ of the corresponding process is delegated to the machine.

We’re better at understanding static relationships (such as the order of statements in a printout) says Dijkstra, than we are at understanding processes that evolve over time:

For that reason we should do (as wise programmers aware of our limitations) our utmost to shorten the conceptual gap between the static program and the dynamic process, to make the correspondence between the program (spread out in text space) and the process (spread out in time) as trivial as possible.

If you just had a sequence of assignment statements one after the other, then the relationship would be very straightforward indeed – we could say where we ‘are’ in the program at any point during runtime by pointing at the corresponding point in the printout.

When we add procedures and loops things get more complicated. It’s no longer enough to point just at a place in the text, we also need the equivalent of a stack trace to show the nesting of procedures, and a ‘dynamic index’ showing how many iterations have been performed of each loop.

The main point is that the values of these indices are outside programmer’s control; they are  generated (either by the write-up of his program or by the dynamic evolution of the process)  whether he wishes or not. They provide independent coordinates in which to describe the  progress of the process. Why do we need such independent coordinates? The reason is – and this seems to be inherent  to sequential processes – that we can interpret the value of a variable only with respect to the  progress of the process.

The reason go to is harmful therefore, is that it makes it terribly hard to find a meaningful set of coordinates with which to describe the progress of a process.

The go to statement as it stands is just too primitive; it is too much an invitation to make a mess of one’s program.

A salient reminder that sometimes we can be too clever for our own good. If we can’t easily understand what a program does by looking at it, we could be heading for trouble…

Towards deep symbolic reinforcement learning

October 12, 2016

Towards deep symbolic reinforcement learning Garnelo et al, 2016

Every now and then I read a paper that makes a really strong connection with me, one where I can’t stop thinking about the implications and I can’t wait to share it with all of you. For me, this is one such paper.

In the great see-saw of popularity for artificial intelligence techniques, symbolic reasoning and neural networks have taken turns, each having their dominant decade(s). The popular wisdom is that data-driven learning techniques (machine learning) won. Symbolic reasoning systems were just too hard and fragile to be successful at scale. But what if we’re throwing the baby out with the bath water? What if instead of having to choose between the two approaches, we could combine them: a system that can learn representations, and then perform higher-order reasoning about those representations? Such combinations could potentially bring to bear the fullness of AI research over the last decades. I love this idea because:

(a) it feels intuitively right (we as humans learn to recognise types of things, and then form probability-based rules about their behaviour for example);

(b) it is very data efficient, to appropriate a phrase:  “a rule can be worth 1000(+) data points!”

(c) it opens up the possibility to incorporate decades of research into modern learning systems, where you can’t help but think there would be some quick wins.

… we show that the resulting system – though just a prototype – learns effectively, and, by acquiring a set of symbolic rules that are easily comprehensible to humans, dramatically outperforms a conventional, fully neural Deep Reinforcement Learning system on a stochastic variant of the game.

In short, this feels to me like something that could represent a real breakthrough and a step-change in the power of learning systems. You should take that with a pinch of salt, because I’m just an interested outsider following along. I hope that I’m right though!

Why combine symbolic reasoning and deep learning?

Contemporary Deep Reinforcement Learning (DRL) systems achieve impressive results but still suffer from a number of drawbacks:

  1. They inherit from deep learning the need for very large training sets, so that they learn very slowly
  2. They are brittle in the sense that a trained network that performs well on one task often performs poorly on a new task, even if it is very similar to the original one.
  3. They do not use high-level processes such as planning, causal reasoning, or analogical reasoning to fully exploit the statistical regularities present in the training data.
  4. They are opaque – it is typically difficult to extract a human-comprehensible chain of reasons for the action choice the system makes .

In contrast, classical AI uses language-like propositional representations to encode knowledge.

  Thanks to their compositional structure, such representations are amenable to endless extension and recombination, an essential feature for the acquisition and deployment of high-level abstract concepts, which are key to general intelligence. Moreover, knowledge expressed in propositional form can be exploited by multiple high-level reasoning processes and has general-purpose application across multiple tasks and domains. Features such as these, derived from the benefits of human language, motivated several decades of research in symbolic AI. But as an approach to general intelligence, classical symbolic AI has been disappointing. A major obstacle here is the symbol grounding problem…

The symbol grounding problem is this: where do the symbols come from? They are typically hand-crafted rather than grounded in data from the real world. This brings a number of problems:

  • They cannot support support ongoing adaptation to a new environment
  • They cannot capture the rich statistics of real-world perceptual data
  • They create a barrier to full autonomy

Machine learning has none of these problems!

So what if we could take the good bits from machine learning, and combine them with the good bits from classical AI? Use machine learning to learn symbolic representations, and then use symbolic reasoning on top of those learned symbols for action selection (in the case of DRL).

You end with a system architecture that looks like this:

Four principles for building Deep Symbolic Reinforcement Learning (DSRL) systems

  1. Support conceptual abstraction by mapping high-dimensional raw input into a lower dimensional conceptual state space, and then using symbolic methods that operate at a higher level of abstraction.  “This facilitates both data efficient learning and transfer learning as well as providing a foundation for other high-level cognitive processes such as planning, innovative problem solving, and communication with other agents (including humans).
  2. Enable compositional structure that supports combining and recombining elements in an open-ended way.  “To handle uncertainty, we propose probabilistic first-order logic for the semantic underpinnings of the low-dimensional conceptual state space representation onto which the neural front end must map the system’s high-dimensional raw input.
  3. Build on top of common sense priors – it is unrealistic to expect an end-to-end reinforcement learning system to succeed with no prior assumptions about the domain. For example: objects frequently move, and typically do so in continuous trajectories, there are stereotypical events such as beginning to move, stopping, coming into contact with other objects, and so on.
  4. Support causal reasoning through discovery of the causal structure of the domain and symbolic rules expressed in terms of both the domain and the common sense priors.

To carry out analogical inference at a more abstract level, and thereby facilitate the transfer of expertise from one domain to another, the narrative structure of the ongoing situation needs to be mapped to the causal structure of a set of previously encountered situations. As well as maximising the benefit of past experience, this enables high-level causal reasoning processes to be deployed in action selection, such as planning, lookahead, and off-line exploration (imagination).

A worked example

Well it all sounds good, but does it actually work? The prototype system built by the authors learns to play four variations on a simple game. An agent moves around a square space populated by circles and crosses. It receives a positive reward for every cross it ‘collects’, and a negative reward for every circle. Four variations are used: the first has only circles in a fixed grid, the second both circles and crosses in a fixed grid, the third only circles but in a random grid, and the fourth both circles and crosses in a random grid.

The system has three stages: low-level symbol generation, representation building, and reinforcement learning.

The first stage use a convolutional neural network (autoencoder) trained on 5000 randomly generated images on game objects scattered across the screen. The activations in the middle layer of the CNN are used directly for detection of objects in the scene. The salient areas in an image result in higher activation throughout the layers of the convolutional network. Given the geometric simplicity of the games, this is enough to extract the individual objects from any given frame.

The objects identified this way are then assigned a symbolic type according to the geometric properties computed by the autoencoder. This is done by comparing the activation spectra of the salient pixels across features.

At this point we know the type and position of the objects.

The second stage learns to track objects across frames in order to learn from their dynamics. The system is supported in this task by a common sense prior, “object persistence over time”. This is broken down into three measures ultimately combined into a single value: how close an object in one frame is to an object in another frame (spatial proximity is an indicator it may be the same object); how likely it is that an object transformed from one type into another (by learning a transition probability matrix); and what change there is in the neighbourhood  of an object.

The representation learned at the end of the second stage differs from that of the first stage in two key ways:

  1. It is extended to understand changes over time, and
  2. Positions of objects are represented by relative coordinates to other objects rather than using absolute coordinates. “This approach is justified by the common sense prior that local relations between multiple objects are more relevant than the global properties of single objects.”

We  now have a concise spatio-temporal representation of the game situation, one that captures not only what objects are in the scene along with their locations and types, but also what they are doing. In particular, it represents frame-to-frame interactions between objects, and the changes in type and relative position that result. This is the input to reinforcement learning, the third and final stage of the pipeline.

Finally we enter the reinforcement learning stage, where the relative location representation of objects greatly reduces the state space, on the assumption that things that are far apart (both in the game and in the real world) tend to have little influence on each other.

In order to implement this independence we train a separate Q function for each interaction between two object types. The main idea is to learn several Q functions for the different interactions and query those that are relevant for the current situation. Given the simplicity of the game and the reduced state space that results from the sparse symbolic representation we can approximate the optimal policy using tabular Q-learning.

Evaluation results

You can use precision and recall metrics to evaluate agent performance – where precision is interpreted as the percentage of objects you collect that are positive (crosses), and recall is interpreted as the percentage of available positive objects that you actually do collect.

The first results show that the agent improves with training to a precision of 70%. It’s interesting to compare the performance of DQN and the DSRL system. DQN does better in the the grid scenario, but when objects are position at random the DQN agent struggles to learn an effective policy within 1000 epochs. The DSRL system performs much better on this variant:

To evaluate transfer learning, both DQN and DSRL were trained on the grid variation, and then tested on the random variant.  DQN does no better than chance, whereas the DSRL system is able to approach 70% precision.

  We conjecture that DQN struggles with this game because it has to form a statistical picture of all possible object placements, which would require a much larger number of games. In contrast, thanks to the conceptual abstraction made possible by its symbolic front end, our system very quickly “gets” the game and forms a set of general rules that covers every possible initial configuration. This demonstration merely hints at the potential for a symbolic front end to promote data efficient learning, potential that we aim to exploit more fully in future work. Our proof-of-concept system also illustrates one aspect of the architecture’s inherent capacity for transfer learning…

Moreover, the DSRL system can explain its actions: every action choice can be analysed in terms of the Q functions involved in the decision. These Q functions describe the types of objects involved in the interaction as well as their relations, so we can track back to the reasons that led to a certain decision.

And this is just the beginning…

  • We could wire in a more sophisticated deep network capable of unsupervised learning to disentangle representations in the earlier stages.
  • We can exploit many more of the achievements in classical AI, notably the incorporation of inductive logic programming, formal techniques for analogical reasoning, and building in a planning component that exploits the knowledge of the cause structure of the domain acquired during the learning process.

  In domains with sparse reward, it’s often possible for an agent to discover a sequence of actions leading to a reward state through off-line search rather than on-line exploration. Contemporary logic-based planning methods are capable of efficiently finding large plans in complex domains (eg:[37]), and it would be rash not to exploit the potential of these techniques.

  • Finally, one day the symbolic components of the proposed architecture could well be one day using neurally-based implementations of symbolic reasoning functions.

In the mean time, an architecture that combines deep neural networks with directly implemented symbolic reasoning seems like a promising research direction.

Progressive neural networks

October 11, 2016

Progressive neural networks Rusu et al, 2016

If you’ve seen one Atari game you’ve seen them all, or at least once you’ve seen enough of them anyway. When we (humans) learn, we don’t start from scratch with every new task or experience, instead we’re able to build on what we already know. And not just for one new task, but the accumulated knowledge across a whole series of experiences is applied to each new task. Nor do we suddenly forget everything we knew before – just because you learn to drive (for example), that doesn’t mean you suddenly become worse at playing chess. But neural networks don’t work like we do. There seem to be three basic scenarios:

  • Training starts with a blank slate
  • Training starts from a model that has been pre-trained in a similar domain, and the model is then specialised for the target domain (this can be a good tactic when there is lots of data in the pre-training source domain, and not so much in the target domain). In this scenario, the resulting model becomes specialised for the new target domain, but in the process may forget much of what it knew about the source domain (“catastrophic forgetting”).  This scenario is called ‘fine tuning’ by the authors.
  • Use pre-trained feature representations (e.g. word vectors) as richer features in some model.

The last case gets closest to knowledge transfer across domains, but can have limited applicability.

This paper introduces progressive networks, a novel model architecture with explicit support for transfer across sequences of tasks. While fine tuning incorporates prior knowledge only at initialization, progressive networks retain a pool of pretrained models throughout training, and learn lateral connections from these to extract useful features for the new task.

The progressive networks idea is actually very easy to understand (somewhat of a relief for someone like myself who is just following along as an interested outsider observing developments in the field!). Some of the key benefits include:

  • The ability to incorporate prior knowledge at each layer of the feature hierarchy
  • The ability to reuse old computations and learn new ones
  • Immunity to catastrophic forgetting

Thus they are a stepping stone towards continual / life-long learning systems.

Here’s how progressive networks work. Start out by training a neural network with some number L of layers to perform the initial task. Call this neural network the initial column of our progressive network:

When it comes time to learn the second task, we add an additional column and freeze the weights in the first column (thus catastrophic forgetting, or indeed any kind of forgetting is impossible by design). The outputs of layer l in the original network becomes additional inputs to layer l+1 in the new column.

The new column is initialized with random weights.

We make no assumptions about the relationship between tasks, which may in practice be orthogonal or even adversarial. While the fine tuning stage could potentially unlearn these features, this may prove difficult. Progressive networks side-step this issue by allocating a new column for each task, whose weights are initialized randomly.

Suppose we now want to learn a third task. Just add a third column, and connect the outputs of layer l in all previous columns to the inputs of layer l+1 in the new column:

This input connection is made through an adapter which helps to improve initial conditioning and also deals with the dimensionality explosion that would happen as more and more columns are added:

…we replace the linear lateral connection with a single hidden layer MLP (multi-layer perceptron). Before feeding the lateral activations into the MLP, we multiply them by a learned scalar, initialized by a random small value. Its role is to adjust for the different scales of the different inputs. The hidden layer on the non-linear adapter is a projection onto an n<sub>l</sub> dimensional subspace (n<sub>l</sub> is the number of units at layer _l).

As more tasks are added, this ensures that the number of parameters coming from the lateral connections remains in the same order.

Progressive networks in practice

The evaluation uses the A3C framework that we looked at yesterday. It’s superior convergence speed and ability to train on CPUs made it a natural fit for the large number of sequential experiments required for the evaluation.  To see how well progressive networks performed, the authors compared both two and three-column progressive networks against four different baselines:

  • (i) A single column trained on the target task (traditional network learning from scratch)
  • (ii) A single column, using a model pre-trained on a source task, and then allowing just the final layer to be fine tuned to fit the target task
  • (iii) A single column, using a model pre-trained on a source task, and then allowing the whole model to be fine tuned to fit the target task
  • (iv) A two-column progressive network, but where the first column is simply initialized with random weights and then frozen.

The experiments include:

  • Learning to play the Atari pong game as the initial task, and then trying to learn to play a variety of synthetic variants (extra noise added to the inputs, change the background colour, scale and translate the input, flip horizontally or vertically).
  • Learning three source games (three columns, one each for Pong, RiverRaid, and Seaquest) and then seeing how easy it is to learn a new game – for a variety of randomly selected target games.
  • Playing the Labyrinth 3D maze game – each column is a level (track) in the game, and we see how the network learns new mazes using information from prior mazes.

For the Pong challenge, baseline 3 (fine tuning a network pre-trained on Pong prior to the synthetic change) performed the best of the baselines, with high positive transfer. The progressive network outperformed even this baseline though, with better mean and median scores.

As the mean is more sensitive to outliers, this suggests that progressive networks are better able to exploit transfer when transfer is possible (i.e. when source and target domains are compatible).

For the game transfer challenge the target games experimented with include Alien, Asterix, Boxing, Centipede, Gopher, Hero, James Bond, Krull, Robotank, Road Runner, Star Gunner, and Wizard of Wor.

Across all games, we observe that progressive nets result in positive transfer in 8 out of 12 target tasks, with only two cases of negative transfer. This compares favourably to baseline 3, which yields positive transfer in only 5 out of 12 games.

The more columns (the more prior games the progressive network has seen), the more progressive networks outperform baseline 3.

Seaquest -> Gopher (two quite different games) is an example of negative transfer:

Sequest -> RiverRaid -> Pong -> Boxing is an example where the progressive networks yield significant transfer increase.

With the Labyrinth tests, the progressive networks once again yield more positive transfer than any of the baselines.

Limitations and future directions

Progressive networks are a stepping stone towards a full continual learning agent: they contain the necessary ingredients to learn multiple tasks, in sequence, while enabling transfer and being immune to catastrophic forgetting. A downside of the approach is the growth in number of parameters with the number of tasks. The analysis of Appendix 2 reveals that only a fraction of the new capacity is actually utilized, and that this trend increases with more columns. This suggests that growth can be addressed, e.g. by adding fewer layers or less capacity, by pruning [9], or by online compression [17] during learning. Furthermore, while progressive networks retain the ability to solve all K tasks at test time, choosing which column to use for inference requires knowledge of the task label. These issues are left as future work.

The other observation I would make is that the freezing prior columns certainly prevents catastrophic forgetting, but also prevents any ‘skills’ a network learns on subsequent tasks being used to improve performance on previous tasks. It would be interesting to see backwards transfer as well, and what could be done there without catastrophic forgetting.

Asynchronous methods for deep reinforcement learning

October 10, 2016

Asynchronous methods for deep reinforcement learning Mnih et al. ICML 2016

You know something interesting is going on when you see a scalability plot that looks like this:

That’s a superlinear speedup as we increase the number of threads, giving a 24x performance improvement with 16 threads as compared to a single thread. The result comes from the Google DeepMind team’s research on asynchronous methods for deep reinforcement learning. In fact, of the four asynchronous algorithms that Mnih et al experimented with, the “asynchronous 1-step Q-learning” algorithm whose scalability results are plotted above is not the best overall. That honour goes to “A3C”, the Asynchronous Advantage Actor-Critic, which exhibits regular slightly sub-linear scaling as you add threads. How come it’s the best then? Because its absolute performance, as measured by how long it takes to achieve a given reference score when learning to play Atari games, is the best.

DeepMind’s DQN sytem is a Deep-Q-Network reinforcement learning system that learned to play Atari games. DQN relied heavily on GPUs. A3C beats DQN easily, using just CPUs:

When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches.

Here you can see a comparison of the learning speed of the asynchronous algorithms vs DQN, with DQN trained on a single GPU, and the asynchronous algorithms trained using 16 CPU cores on a single machine.

(Click for larger view)

And when it comes to overall performance levels achieved, look how well A3C does compared to many other state of the art systems, despite significantly reduced training times.

Let’s take a step back and explore what’s going on here.

Asynchronous learning

We’re talking about reinforcement learning systems, and in particular for the experiments conducted in this paper, reinforcement learning systems used to learn how to play Atari games (57 of them), drive a car in the TORCS car racing simulator:

Solve manipulation and locomotion problems in rigid-body physics domains:

And explore previously unseen 3D maze environments in the game Labyrinth:

We’ve looked at a few papers introducing asynchronous support to machine learning, including those that rely on convergence properties of particular algorithms. Last week we looked at Cyclades that parallelises large classes of machine learning algorithms across multiple CPUs.

Whereas those methods are variations on a theme of ‘do the same thing (or a very close approximation to the same thing), but in parallel’, the asynchronous methods here exploit the parallel nature of multiple threads to enable a different approach altogether. DQN and other deep reinforcement learning algorithms use experience replay, capturing an agent’s data which can subsequently be batched and/or sampled over different time-steps.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.

Instead of experience replay, one of the key insights in this paper is that you can achieve many of the same objectives of experience replay by playing many instances of the game in parallel.

… we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

This explains the superlinear speed-up in training time required to reach a given level of skill: the more games are being explored in parallel, the better the training input to the network.

I really like this idea that the very nature of doing things in parallel opens up the possibility to use a fundamentally different approach. I don’t think that insight would naturally occur to me, and it makes me wonder if there are other scenarios where it might also apply. A

The algorithms

In reinforcement learning an agent interacts with an environment by taking actions and receiving a reward. At each time step the agent receives the state of the world and a reward score from the previous time step, and selects an action from some universe of possible actions. An action value function, typically represented as Q determines the expected reward for choosing a given action in a given state when following some policy π. There are two broad approaches to learning: value-based and policy-based.

In value-based model-free reinforcement learning methods the action value function is represented using a function approximation, such as a neural network…. In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s;θ) and update the parameters θ by performing, typically approximate, gradient descent.

(a represents an action, s the state).

Because the parallel approach no longer relies on experience replay, it becomes possible to use ‘on-policy’ reinforcement learning methods such as Sarsa and actor-critic. The authors create asynchronous variants of one-step Q-learning, one-step Sarsa, n-step Q-learning, and advantage actor-critic. Since the asynchronous advantage actor-critic (A3C) algorithm appears to dominate all the others, I’ll just concentrate on that one.

A3C uses a ‘forward-view’ and n-step updates. Forward view means that the algorithm selects actions using its exploration policy for up to tmax steps in the future. The agent will then receive up to tmax rewards from the environment since its last update. The policy and value functions are then updated for each state-action pair and associated reward over the tmax steps. For each update, the algorithm use “the longest possible n-step return.” In other words, the update includes all steps up to and including the step we are currently performing the update for: a 2-step update for the second state-action, reward pair, a 3-step update for the third , and so on.

Here’s the pseudo-code for the algorithm, taken from the supplementary materials:

(V is a function that determines the value of some state s under policy π.)

We typically use a convolutional neural network that has one softmax output for the policy π(at|st;θ) and one linear output for the value function _V(stv), with all non-output layers shared.

Experiments

If you watch some of the videos I linked earlier you can see how well A3C learns to perform a variety of tasks. Playing Atari games has been well covered before. The TORCS car racing simulator is more challenging:

TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling…. A3C reached between roughly 75% and 90% of the score obtained by a human tester on all four game configurations in about 12 hours of training.

The Mujoco physics engine simulations required a reinforcement learning approach adapted to continuous actions, which A3C was able to do. It was tested on a number of manipulation and locomotion tasks, and found good solutions in less than 24 hours, and often just a few hours.

The final experiments used A3C on a new 3D maze environment called Labyrinth:

This task is much more challenging than the TORCS driving domain because the agent is faced with a new maze in each episode and must learn a general strategy for exploring random mazes… The final average score indicates that the agent learned a reasonable strategy for exploring random 3D mazes using only a visual input.

Closing thoughts

We’ve seen a number of papers showing how various machine learning tasks can be made more efficient in terms of elapsed training time by exploiting asynchronous parallel workers, as well as more efficient algorithms. There’s another kind of efficiency that’s equally important though: data efficiency, a concept that was much discussed at the recent London Deep Learning Summit. Data efficiency refers to the amount of data that an algorithm needs to achieve a given level of performance. Breakthroughs in data efficiency could have an even bigger impact than breakthroughs in computational efficiency.

And on the topic of computers learning to play games, since Go has now fallen, when will we see a reinforcement learning system beat the (human) champions in esports games too? That would make a great theatre for a battle🙂.

Incremental knowledge base construction using DeepDive

October 7, 2016

Incremental knowledge base construction using DeepDive Shin et al., VLDB 2015

When I think about the most important CS foundations for the computer systems we build today and will build over the next decade, I think about

  • Distributed systems
  • Database systems / data stores (dealing with data at rest)
  • Stream processing (dealing with data in motion)
  • Machine learning (extracting information from data)

(What did I miss? Anything you’d add to the list?)

Regular readers will no doubt have noticed that these are the subject areas I most often cover on The Morning Paper. I’ve chosen today’s paper as representative of a large body of work at Stanford on a system called DeepDive. DeepDive sits at a very interesting intersection of the above topics, and its goal is to build a knowledge base – stored in a relational database – from information in large volumes of semi-structured and unstructured data. Such data is sometimes called dark data, and creating a knowledge base from it is the task of knowledge base construction (KBC).

The ultimate goal of KBC is to obtain high-quality structured data from unstructured information. These databases are richly structured with tens of different entity types in complex relationships… they can ingest massive numbers of documents – far outstripping the document counts of even well-funded human curation efforts.

The most important measures of KBC systems are precision (how often a claimed tuple is correct), and recall (of all the possible tuples to extract, how many are actually extracted). DeepDive itself has been used to build dozens of high-quality KBC systems (see descriptions of some of them at http://deepdive.stanford.edu/showcase/apps), and won KBC competitions.

In all cases, we have seen the process of developing KBC systems is iterative; quality requirements change, new data sources arrive, and new concepts are needed in the application. This led us to develop techniques to make the entire pipeline incremental in the face of changes both to the data and to the DeepDive program. Our primary technical contributions are to make the grounding and inference phases more incremental.

The incremental techniques improved system performance by two-orders of magnitude in five real KBC scenarios, while keeping the quality high enough to aid in the development process. Before we can understand those techniques, we need to cover some background on the overall KBC process itself.

KBC, the big picture

The process begins with a corpus of documents, and ends with a relational database containing facts extracted from those documents. KBC systems are typically developed incrementally with developers reviewing the quality of fact extraction (precision and recall) and adding new rules and data as needed. The resulting database will contain entities, relations, mentions, and relation mentions.

  • An entity is a real-world person, place, or thing
  • A relation associates two or more entities
  • A mention is a span of text that refers to an entity or relationship
  • A relation mention is a phrase that connects two mentions participating in a relation (e.g. London is the capital of England).

Ingestion and extraction of candidate mappings and features

As documents are ingested they are split into sentences. Each sentence is marked up using standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing, and then stored as a row in the database.

Once the document sentences are loaded and pre-processed, DeepDive looks for candidate mappings and features.

Candidate mappings are extracted using SQL queries. These extract possible mentions, relations, and entities. At this stage we go for high recall (get anything that might be interesting) and don’t worry too much about precision. For example, if we’re building a knowledge base including information about married partners, than we might extract a MarriedCandidate mapping every time two people are mentioned in the same sentence.

Features are supplementary information that the subsequent learning phases can use to learn which of the candidate mappings should indeed be considered true relations with high probability. In the marriage example, we might extract as a feature the phrase that appears between the two mentions. The system will learn a weight indicating how much confidence to place in a given phrase as an indication of a true relation.

In general, phrase could be an arbitrary UDF that operates in a per-tuple fashion. This allows DeepDive to support common examples of features such as “bag-of-words” to context-aware NLP features, to highly domain-specific dictionaries and ontologies. In addition to specifying sets of classifiers, DeepDive inherits Markov Logic’s ability to specify rich correlations between entities via weighted rules.

Supervised learning phase

There are two techniques for providing labelled training data. One is simply to provide true or false labels for evidence relations. The other is a technique called distant supervision. Distant supervision uses a supplementary dataset of facts to create larger sets of labelled data. Continuing the marriage example, suppose we have an external database of trusted facts indicating spouse relationships, and we choose a particular relationship P1 is married to P2. We then label as true every candidate marriage mapping the system has extracted that mentions P1 and P2. This will of course label as true indications of marriage some phrases that really aren’t, for example, “Crypto Bob (P1) exchanged messages with Crypto Alice (P2)”.

At first blush, this rule seems incorrect. However, it generates noisy, imperfect examples of sentences that indicate two people are married. Machine learning techniques are able to exploit redundancy to cope with the noise and learn the relevant phrases.

Inference and learning using a factor graph

Now the fun begins…

A DeepDive program defines a standard structure called a factor graph through inference rules specified by the developer. From the DeepDive website:

A rule expresses concepts like “If John smokes then he is likely to have cancer” and, in other words, describes the factor function of a factor and which variables are connected to this factor. Each rule has a weight (either computed by DeepDive or assigned by the user), which represents the confidence in the correctness of the rule. If a rule has a high positive weight, then the variables appearing in the rule are likely to take on values that would make the rule evaluate to true.

The factor graph contains two types of nodes: variables and factors. Variables may be either evidence variables (we know their true value, as a result of the supervision phase) or query variables whose value we want to predict. Factors define the relationships between the variables in the graph, “Each factor can be connected to many variables and comes with a factor function to define the relationship between these variables.”

Inference now takes place over the factor graph, using Gibbs sampling. Again, the website explains this well:

Exact inference is an intractable problem on factor graphs, but a commonly used method in this domain is Gibbs sampling. The process starts from a random possible world and iterates over each variable v, updating its value by taking into account the factor functions of the factors that v is connected to and the values of the variables connected to those factors (this is known as the Markov blanket of v). After enough iterations over the random variables, we can compute the number of iterations during which each variable had a specific value and use the ratio between this quantity and the total number of iterations as an estimate of the probability of the variable taking that value.

Thus DeepDive uses joint inference, determining the values of all events at the same time. This allows events to influence each other if they are directly or indirectly connected through influence rules.

Closing the loop

At the end of the learning and inference phrase, DeepDive has computed a marginal probability p for each candidate fact. Typically a user then selects facts with high confidence (e.g. p > 0.95) to produce the final knowledge base. It’s unlikely the system will be perfect first time around:

Typically, the user needs to inspect errors and repeat, a process that we call error analysis. Error analysis is the process of understanding the most common mistakes (incorrect extractions, too-specific features, candidate mistakes, etc.) and deciding how to correct them. To facilitate error analysis, users write standard SQL queries.

Incremental development

I only have space to cover this very briefly, see §3 in the paper for full details. DeepDive supports incremental grounding and_incremental inference_. Incremental grounding is the process of updating an existing factor graph based on new information, and incremental inference is the process of updating the learned probabilities given the change in the graph.

For incremental grounding, DeepDive uses standard incremental view maintenance techniques from the database community. Specifically, it uses the DRED algorithm of Gupta et al. (1993).

For incremental inference two different approaches are implemented – neither dominates the other and so the system chooses which to use at runtime using a simple set of rules based on the kinds of changes made to the graph.

  • The sampling approach stores a set of possible worlds sampled from the original distribution, rather than storing all possible worlds, and then performs inference over these samples. “We use a (standard) Metropolis-Hastings scheme to ensure convergence to the updated distribution.”
  • The variational approach stores a factor graph with fewer factors that approximates the original distribution. On the smaller graph, running inference and learning is often faster.