Skip to content

REX: A development platform and online learning approach for runtime emergent software systems

December 5, 2016

REX: A development platform and online learning approach for runtime emergent software systems Porter et al. OSDI 2016

If you can get beyond the (for my taste, ymmv) somewhat grand claims and odd turns of phrase (e.g., “how the software ‘feels’ at a given point in time” => metrics) then there’s something quite interesting at the core of this paper. Given a system in which there are multiple different implementations of system components (interface implementations) – for example, each using differing algorithms with differing properties – how do you select the best-performing combination of components at any given point in time?

It almost sounds like a non-problem (who has lots of different implementations of each component of their system sitting around just in case…?), but the motivating web server example persuaded me not to dismiss it out of hand, and you can imagine the same approach being used for selecting amongst configuration options that have major impacts on runtime strategy as well.

The authors study a web server with two major internal component interfaces the RequestHandler interface, and the HTTPHandler interface. There are thread-per-client and thread-pool strategy implementations of the request handler interface, and four implementations of the http handler interface covering all combinations of caching/non-caching and compressing/non-compressing. Caching can be achieved using a number of different cache implementations, and likewise for compression:

All told in fact, the web server has over 30 different components (not all shown in the above figure), leading to 42 different possible ways of assembling a web server.

The evaluation shows that the component combinations which give the best performance at any point in time depend very much on the workload characteristics. For example, the type of resources most commonly requested (text or image), the size of the returned documents, and how diverse a set of documents are requested (‘entropy’ in the table below):

What the team set out to do is build a system that learns the best combinations under different conditions and adapts the runtime configuration accordingly as the workload changes.

These results confirm there are different optimal configurations of components that can form our target system in different environments. Low entropy and high text conditions favor configurations with caching and compression; low entropy and low text conditions favor configurations with caching only; and high entropy conditions favor a mixture of configurations. The subtleties within these results, and the fact that issues such as disk/memory latency will vary across machines, further motivate a real-time, machine-learning-based solution to building software.

If you thought integration testing was bad before, wait till you try to debug a system that deliberately explores all possible combinations of components at runtime!!!

REX is the end result. Here it is learning the optimal configuration when given a workload of small text files. There is a big reduction in ‘regret’ after only a few iterations (each iteration lasting 10 seconds).

Regret is defined as ( 1/response_timechosen action – 1/response_timeoptimal action).

By including a categorization of workload as input to the learning algorithm, the system is able to adapt to changing workloads over time. This makes a big difference as shown below:

When you put it all together, you get a system that performs well with challenging real-world traces:

In Fig. 9 (below) we show results from a real-world web server workload of highly mixed resource requests, taken from the publicly available NASA server trace [2]. This is a challenging request pattern due to its high variance over time, in which different kinds of resource are requested in very different volumes. As a result our learning approach finds it more difficult to compare like-for-like results as different configurations are tested. Initially regret here is generally high, but decreases steadily up to the 40th iteration mark.

How it works

Building a system like REX requires a component programming model that facilitates runtime swapping of components. For that, the authors use their own component programming language called Dana. Dana proxies all objects flowing into and out of components and will keep mementos (‘transfer fields’) on behalf of component instances that can be used to hold state across different component implementations when the runtime decides to swap one implementation out for another one using a hot-swap mechanism.

When you start a Dana application, the first component is loaded and introspected to determine what components it needs (interfaces it depends on). Components implementing those interfaces are discovered in the project (this really isn’t new, sorry – see e.g., the way that Spring Boot does classpath scanning to discover components and configure itself at runtime). What is different though is that instead of selecting a single configuration, REX will build a list of all possible combinations, which will be systematically explored at runtime. To guide the exploration REX captures events arriving from the outside world, and metrics from the implementation. In the web server example the events are simply the arriving http requests and the sizes of the corresponding responses, and the chosen metric is the average response time. (No, mean latency is not generally a useful metric, but let’s roll with it for now…).

The interesting bit is how REX then learns the optimal configurations…

Online learning must balance the exploration of under-tested configurations with exploiting configurations already known to perform well. The canonical form of this is the multi-armed bandit problem, devised for clinical trials, and recently a dominant paradigm for optimization on the web. A multi-armed bandit has a set of available actions called arms. Each time an arm is chosen, a random reward is received, which depends (only) on the selected arm. The objective is to maximize the total reward obtained. While short-term reward is maximized by playing arms currently believed to have high reward, long-term benefit is maximized by exploring to ensure that we don’t fail to find the best arm.

Each ‘arm’ is a system configuration, and the reward given by ‘playing an arm’ (deploying a configuration) is determined by the metrics. Thompson sampling is used for the multi-armed bandit problem, in which each arm is played proportionally to the probability that it is the best arm given the information to date. Bayesian inference is used to produce ‘posterior distributions’ encoding beliefs about the expected values of configurations, which are modeled as bell curves.

The center of the bell curve represents the average reward seen on that arm to date, and the spread of the curve represents the level of uncertainty. When a Thompson sample is taken from such a curve, for the corresponding arm to be played it must either have a high average (representing the case where we know a configuration to be good) or a wide spread giving us the chance of drawing a high value (representing the case where we still have high uncertainty).

The effect is that the arms most likely to be played are those that experience suggests are likely to perform well, and those that may perform well but we have insufficient information about. Arms for which we have good information that they will perform badly are played with very low probability. As more information is gained, and beliefs concentrate on the truth, no arms will remain for which there is insufficient information. Thus, in the long term, optimal arms are played with very high probability.

Forming beliefs about all possible combinations of components could quickly give rise to very large numbers of combinations to explore.

We therefore follow and use a regression framework based on classical experimental design to share information across the different arms available to us… This is formalized by modelling the expected reward for a given configuration as a function of the components deployed within that configuration. In detail, we code each interface as a factor [categorical] variable, with number of levels equal to the number of available components for that interface.

Each possible component has a corresponding coefficient in the expected reward equation. Make an m x k ‘action’ matrix where each row m represents a configuration, and each column k represents a regression coefficient. After sampling a vector of coefficients Β, the matrix can be multiplied by this vector simultaneously solving for all configurations. The row with the highest resulting value is chosen and deployed. After ten seconds the resulting reward is observed (that doesn’t seem very long for e.g. caches to get warm???) and the result is stored. The posterior distribution is then updated before repeating the process.

To help the system ‘unlearn’ what it already knows when the workload characteristics change, additional features are added which represent the current workload. In the current system, these features were manually defined by the authors as the workload entropy (# of different resources requested in a given time frame) and text volume (% of content requested in a given timeframe that was textual). Terms are added to the expected reward equation capturing these variables.

A final thought : it would be interesting to compare the results to the behaviour of a DQN trained to explore the same state space and maximise its reward… Actions taken by the agent would simply correspond to deploying a given configuration, and the Q reward function would be determined by the resulting runtime metrics.

Slicer: Auto-sharding for datacenter applications

December 2, 2016

Slicer: Auto-sharding for datacenter applications Adya et al. (Google)  OSDI 2016

Another piece of Google’s back-end infrastructure is revealed in this paper, ready to spawn some new open source implementations of the same ideas no doubt. Slicer is a general purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general purpose infrastructure service. What exactly is Slicer then? It has two key components: a data plane that acts as an affinity-aware load balancer, with affinity managed based on application-specified keys; and a control plane that monitors load and instructs applications processes as to which keys they should be serving at any one point in time.  In this way, the decisions regarding how to balance keys across application instances can be outsourced to the Slicer service rather than building this logic over and over again for each individual back-end service. Slicer is focused exclusively on the problem of balancing load across a given set of  backend tasks, other systems are responsible for adding and removing tasks.

Experience taught us that sharding is hard to get right: the plumbing is tedious, and it can take years to tune and cover corner cases. Rebuilding a sharder for every application wastes engineering effort and often produces brittle results.

Slicer is used by over 20 different services at Google, where it balances 2-7M requests per second from over 100,000 connected application client processes. Slicer has two modes of key assignment, offering both eventually consistent and strongly consistent models. In eventually consistent mode, Slicer may allow overlapping eventually consistent key assignments when adapting to load shifts. In strong consistency mode no task can ever believe a key is assigned to it if Slicer does not agree. All of the production use to date uses the eventually consistent model.

Slicer’s high-level model

Slicer is a general-purpose sharding service that splits an application’s work across a set of tasks that form a job within a datacenter, balancing load across the tasks. A “task” is an application process running on a multitenant host machine alongside tasks from other applications. The unit of sharding in Slicer is a key, chosen by the application.

Application clients use a client-side library called Clerk, and application server tasks integrated with Slicer’s Slicelet library. Slicelet enables a task to learn when a slice is assigned to it, or when a slice is removed from it.  The Slicer Service itself monitors load and task availability to generate new key-task assignments and thus manage availability of all keys. It’s up to the application to decide what to use for its keys – they could be fine-grained such as user ids, or coarse-grained as a particular language model to be supported by a speech recogniser.

Slicer hashes each application key into a 63-bit slice key; each slice in an assignment is a range in this hashed keyspace. Manipulating key ranges makes Slicer’s workload independent of whether an application has ten keys or a billion and means that an application can create new keys without Slicer on the critical path. As a result, there is no limit on the number of keys nor must they be enumerated. Hashing keys simplifies the load balancing algorithm because clusters of hot keys in the application’s keyspace are likely uniformly distributed in the hashed keyspace.

Slicer will honour a minimum level of redundancy per-key to protect availability, and automatically increases replication for hot slices.

The Clerk interface provides a single function for finding the addresses of assigned tasks given a key, but in Google most applications don’t use the Clerk library directly and instead benefit from transparent integration with Google’s RPC system Stubby, or Google’s Front End (GFE) http proxy.

The weighted-move sharding algorithm

We balance load because we do not know the future: unexpected surges of traffic arrive at arbitrary tasks. Maintaining the system in a balanced state maximizes the buffer between current load and capacity for each task, buying the system time to observe and react.

Slicer monitors key load (request rate and/or application reported custom metrics) to determine when load balancing changes are required. The overall objective is to minimize load imbalance, the ratio of the maximum task load to the mean task load.  When making key assignments Slicer must also consider the minimum and maximum number of tasks per key specified in configuration options, and should attempt to limit key churn – the fraction of keys impacted by reassignment. Key churn itself is a source of additional load and overhead.  In order to scale to billions of keys, Slicer represents assignments using key ranges, aka slices. Thus sometimes it is necessary to split a key range to cope with a hot slice, and sometimes existing slices are merged.  The sharding algorithm proceeds in five phases:

  1. Reassign keys away from tasks that are no longer part of the job (e.g., due to hardware failure)
  2. Increase / decrease key redundancy as required to conform to configured constrained on minimum and maximum number of tasks (the configuration could now be violated due to actions in phase 1, or the configuration itself may have changed)
  3. Find two adjacent cold slices and merge them into one, so long as the receiving task’s load does not exceed the maximum task load as a result, and the merged slice still has less than mean slice load. Repeat this step so long as: (i) there are more than 50 slices per task in aggregate, and (ii) no more than 1% of the keyspace has moved.
  4. Pick a sequence of moves with the highest weight (described below) and apply them. Repeat until an empirically determined  key churn budget of 9% is exhausted.
  5. Split hot slices without changing their task assignments. This will open new move options in the next round. Repeat splitting so long as (i) the split slice is at least twice as hot as the mean slice, and (ii) there are fewer than 150 slices per task in aggregate.

During step 4, the weight of a move under consideration is defined as the reduction in load imbalance for the tasks affected by the move, divided by the key churn cost.

The constants in the algorithm (50-150 slices per task, 1% and 9% key movement per adjustment) were chosen by observing existing applications. Experience suggests the system is not very sensitive to these values, but we have not measured sensitivity rigourously.

When balancing based on CPU utilization (one of the available metric options), if the maximum task load is less than 25% then Slicer suppresses rebalancing as no task is at risk of overload.

Also of interest to followers of hashing is that the Google team tried consistent hashing schemes and found they didn’t work as well:

A variant of consistent hashing with load balancing support yielded both unsatisfactory load balancing, and large, fragmented assignments.

See §4.4.4 for more details on this.

Slicer’s implementation

Slicer aims to combine the high-quality, strongly consistent sharding decisions of a centralized system with the scalability, low latency, and fault tolerance associated with local decisions.

Slicer is conceptually a centralized service, but its implementation is highly distributed. Assigner components run in several Google datacenters around the world and generate assignments using the weighted-move sharding algorithm we just looked at. Decisions are written into optimistically-consistent storage.

A two-tier distributor tree distributes decisions following an assignment change. This happens following a pull-model: subscribers ask a distributor for a job’s assignment, and if a distributor doesn’t have it, it asks the Assigner. Slicer is designed to maintain request routing even in the face of infrastructure failures or failures of Slicer itself. The control-plane / data-plane separation means that most failures hinder timely re-optimization of assignments, but do not delay request routing.

The following part of the fault-tolerating design caught my eye:

The Distributors share a nontrivial code base and thus risk a correlated failure due to a code or configuration error. We have yet to experience such a correlated failure, but our paranoia and institutional wisdom motivated us to guard against it.

To mitigate this, a Backup Distributor service was built on a different code base which is deliberately kept very simple and slowly evolving – it satisifies application requests simply by reading directly from the backing store.

Slicer in production

In a one-week period, Slicer perform 260 billion requests for a subset of its Stubby clients: 99.98% of these succeeded, which establishes a lower bound on Slicer’s availability (requests may have failed for other reasons too). In another week, 272 billion requests arrived for a given backend service, of which only 11.6 million (0.004%) had been misrouted. (Many applications can tolerate misdirected requests with only an impact on latency or overhead, not availability).

There are charts and graphs aplenty in §5 of the paper if that’s your thing!

Morpheus: Towards automated SLOs for enterprise clusters

December 1, 2016

Morpheus: Towards automated SLOs for enterprise clusters Jyothi et al. OSDI 2016

I’m really impressed with this paper – it covers all the bases from user studies to find out what’s really important to end users, to data-driven engineering, a sprinkling of algorithms, a pragmatic implementation being made available in open source, and of course, impactful results. If I was running a big-data cluster, I’d definitely be paying close attention!  Morpheus is a cluster scheduler that optimises for user satisfaction (we don’t see that often enough as a goal!), and along the way makes up to an order-of-magnitude reduction in the number of jobs that miss their deadlines while maintaining cluster utilisation and lowering cluster footprint by 14-28%.

In this paper, we present Morpheus, a system designed to resolve the tension between predictability and utilization—that we discovered thorough analysis of cluster workloads and operator/user dynamics. Morpheus builds on three key ideas: automatically deriving SLOs and job resource models from historical data, relying on recurrent reservations and packing algorithms to enforce SLOs, and dynamic reprovisioning to mitigate inherent execution variance.

What do users of big data clusters care about?

The team gathered some very interesting insights into what goes on in (Microsoft’s ?) big data clusters by a process of:

  • Analyzing execution logs from millions of jobs running on clusters with over 50K nodes
  • Looking at infrastructure deployment and upgrade logs
  • End user interviews, analysis of discussion threads and escalation tickets from end-users, operators, and decision makers, and
  • A selection of targeted micro-benchmarks

The findings are fascinating in their own right.

Based on interviews with cluster operators and users, we isolate one observable metric which users care about: job completion by a deadline. Specifically, from analysing the escalation tickets, some users seem to form expectations such as: “95% of job X runs should complete by 5pm”. Other users are not able to specify a concrete deadline;  but do state that other teams rely on the output of their job, and may need it in a timely manner.

  • Over 75% of the workload is production jobs
  • Users are 120x more likely to complain about performance (un)predictability than about fairness
  • Over 60% of the jobs in the largest clusters are recurrent. Most of these recurring jobs are production jobs operating on continuously arriving data, hence are periodic in nature. The periods tend to be natural values (once an hour, once a day etc.).
  • Manual tuning of jobs is hard: 75% of jobs are over-provisioned even at their peak. 20% of jobs are more than 10x over-provisioned!
  • Users don’t change the provisioning settings for their periodic jobs beyond the initial set-up (80% of periodic jobs saw no change in their resource provisioning)
  • Runtimes are unpredictable – noisy neighbours (the amount of sharing) have some impact, but even when removing common sources of inherent variability (data availability, failures, network congestion), runtime remain unpredictable (e.g. due to stragglers). The chart below shows the wide variability in runtimes of five different TPC-H queries on a 500 container system:

  • Cluster hardware keeps evolving. For example over a one year period the ratio between machines of type SKU1 and machines of type SKU2 changed from 80/20 to 55/45, as well as the total number of nodes also changing.

This is notable, because even seemingly minor hardware differences can impact job runtime significantly – e.g., 40% difference in runtime on SKU1 vs SKU2 for a Spark production job.

  • Jobs don’t stand still – within a one-month trace of data, 15-20% of periodic jobs had at least one large code delta (more than 10% code difference), and over 50% had at least one small delta.

Even an optimal static tuning is likely going to drift out of optimality over time.

Based on their findings, the team concluded that:

  1. History-based approaches can model the “normal” behaviour of a job
  2. Handling outliers without wasting resources requires a dynamic component that performs reprovisioning online, and
  3. While each source of variance can be addressed with an ad-hoc solution, providing a general-purpose line of defence is paramount.

Morpheus overview

Given a job that is periodically submitted by a user (with manually provisioned resources), Morpheus quietly monitors the job over time and captures:

  • a provenance graph with data dependencies and ingress/egress operations
  • telemetry history in the form of resource skylines capturing resource utilisation

Following a number of successful runs of the job, an SLO inference component performs an offline analysis. Using the provenance graph it determines a deadline for completion, the SLO (Service Level Objective), and using the skylines it determines a model of the job resource demand over time.

At this point, the system is ready to take over resource provisioning for the job, but before it does so:

The user signs-off (or optionally overrides) the automatically-generated SLO and job resource model.

(Wouldn’t that be great as a user, to know that the system is going to monitor and try to honour an SLO for your jobs!).

Morpheus then uses recurring reservations for each managed job. These set aside resources over time for running the job, based on the job resource model, and each new instance of the job runs using these dedicated resources.

To address variability, a dynamic reprovisioning component monitors job progress online and adjusts the reservation in real-time to mitigate the inherent execution variability.

Job runs continue to be monitored to learn and refine the SLO and job resource model over time.

Let’s look a little more closely at a couple of the key steps

Agreeing on SLOs…

Petabytes of logs gathered daily across the production environments are processed to create a “semantically rich and compact (few TBs!) graph representation of the raw logs.”

This representation gives us a unique vantage point with nearly perfect close-world knowledge of the meaningful events in the cluster.

Periodic jobs are uncovered by looking for (templatized) job name matches, an approximate match on source-code signatures (what is an approximate match for a signature??? I’m guessing this is not a hash-based signature…), and submissions with near-constant inter-arrival time.

We say that a job has an actionable deadline if its output is consumed at an approximately fixed time, relative to the start of the period (e.g., everyday at 4pm), and if there is a non-trivial amount of slack between the job end and the deadline.

Morpheus then collects usage patterns for these jobs and determines resource skylines. It then becomes a linear programming optimisation problem to determine the best resource allocations. The cost function uses one term which penalises for over-allocation, and another term which penalises for under-allocation.

The LP has (O(N ×K)) number of variables and constraints. Our sampling granularity is typically one minute, and we keep roughly one-month worth of data. This generates less than 100K variables and constraints. A state-of-the-art solver (e.g., Gurobi, CPlex) can solve an LP of millions of variables and constraints in up to few minutes. Since we are way below the computational limit of top solvers, we obtain a solution within a few seconds for all periodic jobs in our clusters.

Make it so

With all the requirements collected, a packing algorithm called LowCost is used to match jobs to resources.  LowCost allocates containers to all periodic jobs, such that their requirements are met by the deadline, and at the same time it tries to minimize waiting time for non-periodic ad-hoc jobs. The overall objective of LowCost is to minimise the maximum total allocation over time. It does so following a greedy procedure which places containers iteratively at cost-efficient positions.

While reservations can eliminate sharing-induced unpredictability, they provide little protection against inherent unpredictability arising from hard-to-control exogenous causes, such as infrastructure issues (e.g., hardware replacements (see §2), lack of isolation among tasks of multiple jobs, and framework code updates) and job-centric issues (changes in the size, skew, availability of input data, changes in code/functionalities, etc.).

These issues are dealt with by the dynamic reprovisioning algorithm (DRA). DRA continuously monitors the resource consumption of a job, compares it with the resources allocated in the reservation and ‘stretches’ the skyline of resources to accommodate a slower-than-expected job execution.

Key results

In a simulation run from the largest dataset available to the team, Morpheus was able to automatically derive SLOs for over 70% of the millions of instances of periodic jobs in the trace. (The remainder don’t have enough data – eg, they are weekly jobs and so only appear four times in the one-month trace). Compared to the current manual provisioning, Morpheus reduces worst-case SLO misses by 13x! The packing algorithm does a good job of driving utilisation at the same time, lowering the overall cluster cost.

SLO extraction and packing contribute to lower the baseline cluster size by 6%.  Job resource modeling (using the extracted skyline rather than user supplied provisioning) lowers cluster size by a further 16%.  Combined they give a 19% reduction.

By turning on/off our dynamic reprovisioning, we can either (1) match the user utilization level and deliver 13x lower violations, or (2) match the current SLO attainment and _reduce cluster size by over 60% (since we allow more aggressive tuning of the LP, and repair underallocations dynamically).

Firmament: Fast, centralized cluster scheduling at scale

November 30, 2016

Firmament: Fast, centralized cluster scheduling at scale Gog et al. OSDI’ 16

Updated link to point to official usenix hosted version

As this paper demonstrates very well, cluster scheduling is a tricky thing to get right at scale. It sounds so simple on the surface: “here are some new jobs/tasks – where should I run them?” Of course the slightly more nuanced question, and where the troubles begin, is “where should I run them in order to optimize for this objective, given these constraints…?” Typically you have a trade-off between distributed schedulers that can operate at scale and make fast decisions, and a centralized scheduler that can make higher quality decisions (e.g, improve utilisation, load balance, or whatever else your policy dictates) but struggles to make those decisions quickly enough as workload scales. What we have here is Firmament, a new centralised scheduler that combines high-quality placements on a par with advanced centralized schedulers, and the speed and scale of a distributed scheduler. It comes from a strong team of researchers across Cambridge, MIT, and Google, and is available in open source at A Firmament scheduler plugin for Kubernetes is also under development.

Taking a production workload trace from a 12,500 machine Google cluster, and replaying it at 300x speed, Firmament is able to keep pace and place 75% of tasks with sub-second latency. In a 40-machine test cluster the team compared Firmament against the schedulers from Sparrow, Mesos, Kubernetes, and Docker SwarmKit. At this scale Firmament’s task placement latency is around 5ms. As you can see below, Firmament comes closest to the idle baseline (each task in isolation) above the 80th percentile for a workload consisting entirely of short batch analytics tasks.

When using a more realistic mix of short tasks with long running interactive services and batch jobs, we see that Firmament greatly improves the tail of the task response time distribution for the short tasks:

… Firmament’s 99th percentile response time is 3.4x better than the SwarmKit and Kubernetes ones, and 6.2x better than Sparrow’s. The tail matters, since the last task’s response time often determines a batch job’s overall response time (the ‘straggler’ problem).

What do we want from a scheduler?

  • Good quality task placements
  • Sub-second task placement latency even at scale
  • The ability to cope with tricky situations such as a flood of incoming jobs or cluster oversubscription.

Better task placements by the cluster scheduler lead to higher machine utilization, shorter batch job runtime, improved load balancing, more predictable application performance, and increased fault tolerance.

Achieving all of these goals at once of course is hard: task placement requires solving an algorithmically complex optimization problem in multiple dimensions. Low placement latency requires that you make decisions quickly.

Go with the flow

The bird’s-eye view of Firmament looks like this:

Information about the cluster topology, current system health (monitoring data) and the jobs and tasks to be placed is fed into the Firmament scheduler. Something called a flow network (we’ll get into that shortly) is constructed based on these inputs and the scheduling policy. The flow network is passed to a min-cost, max-flow solver which in turn ouputs the optimal flow. Workload placements are then extracted from this flow graph.

One key feature at this level that helps Firmament produce high-quality decisions is that it addresses task placement by considering the whole workload (new and existing) in one batch as opposed to the more common approach of pulling placement tasks off of a queue one at a time:

… processing one task at at time has two crucial downsides: first, the scheduler commits to a placement early and restricts its choices for further awaiting tasks, and second, there is limited opportunity to amortize work.

The core idea of flow-based scheduling comes from Quincy, and guarantees optimal task placements for a given policy. But Quincy itself is too slow to meet the placement latency targets at scale. In flow-based scheduling, you map the task placement problem into a flow network, use standard min-cost, max-flow optimizations on the network, and then map the results back to task placements. It’s hard to understand without an example, so let’s take a look at a simplified one:

On the left you see the tasks to be placed. They come from two jobs, one with three tasks and one with two tasks. These tasks are the sources of ‘flow’ in the network. On the right hand side you see the single sink node. Everything must flow from the sources (tasks) to the sink through the network in order for placement to be successful. To get to the sink, flow must pass through one of the machine nodes (which implies placement of the task on that machine when the flow graph is mapped backed into the problem domain). Alternatively, a task may remain unscheduled in this particular round, represented by it flowing through an unscheduled node (one per job). The placement preferences are represented as costs on the arcs.

The solver finds the best solution if every task has an arc to each machine scored according to the scheduling policy, but this requires thousands of arcs per task on a large cluster. Policy-defined aggregator nodes, similar to the unscheduled aggregators, reduce the number of arcs required to express a scheduling policy.

Here’s an example of a mapping for a network aware policy with intermediate request aggregator nodes with arcs representing available bandwidth.

There are a variety of min-cost max-flow (MCMF) algorithms including cycle canceling, succesive shortest path, cost scaling, and relaxation. Successive shortest path for example repeatedly selects a source node and then sends flow from it to the sink along the shortest path. The worst-case complexities of these algorithms are as follows:

So in theory, successive shortest path should work best for this problem.

Theory, meet practice

…since MCMF algorithms are known to have variable runtimes depending on the input graph, we decided to directly measure performance.

That exercise resulted in the following “Big-Oh” moment:

The relaxation algorithm, which has the highest worst-case time complexity, actually performs the best in practice. It outperforms cost scaling (used in Quincy) by two orders of magnitude: on average, relaxation completes in under 200ms even on a cluster of 12,500 machines.

Sometimes there’s a big difference between worst case and expected case! The relaxation algorithm turns out to be particularly efficient when most scheduling choices are straightforward. It’s not always the best choice though. Look what happens when we drive up utilisation:

Further investigation (I’m omitting a ton of detail here!) showed that an incremental cost scaling algorithm could improve on the base cost scaling approach by up to 50%, and an arc-prioritisation heuristic reduced relaxation costs by 45%.

In the final implementation, instead of implementing some fancy predictive algorithm to decide when to use cost scaling and when to use relaxation, the Firmament implementation makes the pragmatic choice of running both of them (which is cheap) and uses the solution from whichever one finishes first!

A technique called price refine (originally developed for use within cost scaling) helps when the next run of the incremental cost scaling algorithm must execute based on the state from a previous relaxation algorithm run. It speeds up incremental cost scaling by up to 4x in 90% of cases in this situation.

There is plenty more detail in the paper itself.

The Last Word

Firmament demonstrates that centralized cluster schedulers can scale to large clusters at low placement latencies. It chooses the same high-quality placements as an advanced centralized scheduler, at the speed and scale typically associated with distributed schedulers.

Early detection of configuration errors to reduce failure damage

November 29, 2016

Early detection of configuration errors to reduce failure damage Xu et al, OSDI ’16

Here’s one of those wonderful papers that you can read in the morning, and be taking advantage of the results the same afternoon! Remember the ‘Simple testing can prevent most critical failures‘ paper from OSDI’14 that we looked at last month? In that paper we learned that trivial mistakes in error handling, which are easy to test for, accounted for a vast majority of catastrophic production incidents. Well, as soon as you’ve got your error / exception handlers sorted out, you might want to read today’s paper to discover another class of easy-to-test for bugs that are also disproportionately responsible for nasty production failures.

Facebook’s ‘Holistic configuration management‘ paper stresses the importance of version control and testing for configuration, as configuration errors are a major source of site errors. Xu et al. study configuration parameters in the wild, focusing especially on those associated with reliability, availability, and serviceability (RAS) features. What they find is that very often configuration values are not tested as part of system initialization. The program runs along happily until reaching a point (say for example, it needs to failover) where it needs to read some configuration for the first time, and then it blows up – typically when you most need it.

So here’s the short takeaway test all of your configuration settings as part of system initialization and fail-fast if there’s a problem. Do that, and you’ll cut out another big source of major production errors. The paper itself is in two parts: the first part (where I’ll focus most of my attention in this short write-up) is an analysis of these latent configuration errors in real code bases; the second part introduces a tool called PCheck, which if your program is written in C or Java can even find latent configuration usage and automatically write tests for you!

Latent configuration errors can result in severe failures, as they are often associated with configurations used to control critical situations such as fail-over, error handling, backup, load balancing, mirroring, etc… Their detection or exposure is often too late to limit the failure damage.

In a study of real-world configuration issues in the the products of COMP-A, “major storage company in the US,” with footnote “we are required to keep the company and its products anonymous,” it turns out that 75% of all high severity  configuration-related errors are caused by latent configuration errors. It may well be that the authors are required to keep COMP-A anonymous, but I couldn’t help noticing the author affiliations printed in big type on the front page. A more than fair chance that company is NetApp I would say!

The authors also studied a number of real-world open-source systems (see table below), and inspected usage of all of their RAS-related configuration parameters.

They looked at how many of those parameters were explicity checked vs simply being used when first required, yielding the results below:

Many of the studied RAS parameters do not have any special code for checking the correctness of their settings. Instead, the correctness is verified (implicitly) when the parameters’ values are actually used in operations such as a file open call.

Here’s an example of a real-world latent configuration error in MapReduce:

And here are some other bugs found during the study, in the most recent versions of the software under inspection in (a) HDFS:

and (b), Apache httpd:

So we know that many configuration parameters aren’t checked before usage. It’s also the case that many of these parameters aren’t used during system startup (and so are not verified even implicitily):

Many (12-38.6%) of the studied RAS configuration parameters are not used at all during the system’s initialization phase.

Put these two finding together, and what you have is a collection of ticking time bombs! Remember that since these are configuration settings they may be changed on deployment – i.e. these are not bugs that unit testing can catch.

4.7-38.6% of the studied RAS parameters do not have any early checks and and thereby subject to latent configuration errors which can cause severe impact on the system’s dependability.

Here’s the summary of how many of these ticking time bombs can exist in the systems studied:

The threats are prevalent: Latent configuration errors can reside in 10+% of the RAS parameters in five out of six systems. As all theses latent configuration erros are discovered in the latest versions, any of them could appear in a real deployment and would impair the system’s dependability in a latent fashion.

The authors wrote a tool called PCheck which uses static code analysis to find instructions that load configuration parameters into program variables, looks for all instructions that use the parameter value, figures out the execution context of those instructions and composes checkers that can be run at initialization to verify the configurarion is well-formed in the target environment. I feel bad skipping over all of the details of the authors hard work here, but I’m going to refer you to the paper for full details if you’re interested.

With the PCheck tests in place, the authors harvested 830 configuration files for the studied systems (from mailing lists and technical forums) and validated them. With the checks in place, 70+% of latent configuration errors were detected.

PCheck reports 282 true configuration errors (from 830 configs!) and three false alarms. Many (37.5-87.8%) of the reported configuration erros can only be detected by considering the system’s native execution environment. These configuration settings are valid in terms of format and syntax (in fact, they are likely to be correct in the original hosts). However, they are erroneous when used on the current system because the values violate environment constraints such as undefined environment variables, non-existing file-paths, unreachable IP addresses etc..

(Which leaves me a little unsure as to what the authors count as a ‘true configuration error’
– a configuration file which may have been perfectly valid on the system from which it was cut-and-pasted onto a mailing list may of course give problems in another context, but this doesn’t imply it was truly a configuration error on the original system).

Nevertheless, I think the overall message of this paper is clear: Ladies and Gentlemen, please eagerly check all of your configuration variables on system startup.

This paper advocates early detection of configuration errors to minimize failure davage, especially in cloud and data-center systems. Despite all the efforts of validation, review, and testing, configuration errros (even those obvious errors) still cause many high-impact incidents of today’s Internet and cloud systems.

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

November 28, 2016

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services Veeraraghavan et al. (Facebook) OSDI 2016

How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the condititions experienced with live production traffic? You could try load modeling (simulation), but it’s not really feasible to model systems undergoing constantly evolving workloads, frequent software release cycles, and complex dependencies. That leaves us with load testing. When load testing we have two further challenges:

  1. Ensuring that the load test workload is representative of real traffic. This can be addressed using shadow traffic – replaying logs in a test environment.
  2. Dealing with side-effects of the load testing that may propagate deep into the system. This can be addressed by stubbing, but it’s hard to keep up with constantly changing code-bases, and reduces the fidelity of end-to-end tests by not stressing dependencies that otherwise would have been affected.

Facebook use a very chaos-monkey-esque approach to this problem, they simply run load tests using live traffic using a system called Kraken, and have been doing so for about three years now. The core idea is really simple, just update the routing layer to direct more traffic at the systems you want to test. Here are four things to like about the approach:

  1. Live traffic does a very good job of simulating the behaviour of… live traffic.
  2. You don’t need any special test set-ups.
  3. Live traffic tests expose bottlenecks that arise due to complex system dependencies
  4. It forces teams to harden their systems to handle traffic bursts, overloads etc., thus increasing the system’s resilience to faults.

On that last point, it’s easy to see how a busy service team might ignore (deprioritise) a performance report from a specialist performance team stressing their system. It’s a whole different thing when you know your system is going to be pushed to its limits in production at some point. It’s just like when you know a chaos monkey is likely to crash a component at some point.

And of course, just as with Chaos Monkey, it’s an idea that can sound scary-as-hell when you first hear about it.

Safety is a key constraint when working with live traffic on a production system.

Why is continuous load testing important?

Frequent load testing of services is a best practice that few follow today, but that can bring great benefits.

The workload of a web service is constantly changing as its user base grows and new products are launched. Further, individual software systems might be updated several times a day or even continually… an evolving workload can quickly render models obsolete.

Not only that, but the infrastructure systems supporting the a given service constantly change too, and a data centre may be running hundreds of software systems with complex interactions.

Running frequent performance tests on live systems drove a number of good behaviours and outcomes:

  • It highlighted areas with non-linear responses to traffic increases, and where there was insufficient information to diagnose performance problems
  • It encouraged subsystem developers to identify system-specific counters for performance, error rate, and latency that could be monitored during a test.

We focused on these three metrics (performance, error rates, and latency) because they represent the contracts that clients of a service rely on – we have found that nearly every production system wishes to maintain or decrease its latency and error rate while maintaining or improving performance.

  • It changed the culture (with deliberate effort), shifting from load tests as a painful event for a system to survive, to something that developers looked forward to as an opportunity to better understand their systems.
  • It improved monitoring, and the understanding of which metrics are the most critical bellwethers of non-linear behaviours
  • Performance testing as a regular discipline dramatically improved the capacity of Facebook’s systems over time:

Kraken has allowed us to identify and remediate regressions, and address load imbalance and resource exhaustion across Facebook’s fleet. Our initial tests stopped at about 70% of theoretical capacity, but now routinely exceed 90%, providing a 20% increase in request serving capacity.

A 20% increase is a big deal when you think of all the costs involved in operating systems at Facebook scale!

Getting to that 20% increases does of course require work to address the findings resulting from the load tests. As a short aside, one of the companies I’m involved with, Skipjaq, is also a big believer in the benefits of frequent performance testing. They look at individual services and automate the exploration of the configuration space for those services (OS, JVM, server process settings etc.) to optimise performance on or across infrastructure with no code changes required. Performance tuning at this level is something of a black art, which most teams are either not equipped to deal with, or don’t have time to deal with given the constantly changing workloads and code bases as previously described. Early results with real customer applications and workloads suggest there are big performance gains to be had here for many applications too.

How do you load test safely using live production systems?

Load-testing with live production systems (and traffic, so every request matters) requires careful monitoring and live traffic adjustment.

Our insight was that Kraken running on a data center is equivalent to an operational issue affecting the site – in both cases our goal is to provide a good user experience. We use two metrics, the web servers’ 99th percentile responses time and HTTP fatal error rate (50x’s), as proxies for the user experience, and determined in most cases this was adequate to avoid bad outcomes. Over time, we have added other metrics to improve safety such as the median queueing delay on web servers, the 99th percentile CPU utilization on cache machines, etc.. Each metric has an explicit threshold demarcating the vitality of the system’s health. Kraken stops the test when any metric reaches its limit, before the system becomes unhealthy.

The essence of Kraken

Kraken shifts traffic in the Facebook routing infrastructure by adjusting the weights that control load balancing. The edge weights control how traffic is routed from a POP to a region, the cluster weights control routing to clusters within regions, and server weights balance across servers within a cluster. Kraken sits as part of a carefully designed feedback loop (red lines in the figure below) that evaluates the capacity and behaviour of the system under test to adjust the stress it is putting on systems. The traffic shifting module queries Gorilla for system health before determining the next traffic shift to the system under test. Health metric definitions themselves are stored in a distributed configuration management system.

At the start of a test, Kraken aggressively increases load and maintains the step size while the system is healthy. We have observed a trade-off between the rate of load increase and system health. For systems that employ caching, rapid shifts in load can lead to large cache miss rates and lower system health than slow increases in load. In practice, we find that initial load increase increments of around 15% strike a good balance between load test speed and system health.

As system health metrics approach their thresholds, the load increases are dramatically reduced, down to increments of 1%. This allows the capture of more precise capacity information at high load.

Here’s an example cluster load test. Every 5 minutes, Kraken inspects cluster health and decides how to shift traffic. It takes about 2 minutes for a load shift to occur and the results to be seen. The test stopped when the p99 latency (red line) exceeded its threshold level for too long.

In the example above, the system only hit a peak utilization of 75%, well below Facebook’s target of 93%. The charts below show how Kraken can start to explain performance gaps. Chart (a) shows the widening gap between cluster load test capacity as measured, and the theoretical capacity, and (b) shows the increasing time spent waiting for cache responses as cluster load increases.

Section 5 in the paper is packed full of examples of issues found by Kraken and how they were resolved.

Lessons learned

We have learned that running live traffic load tests without compromising system health is difficult. Succeeding at this approach has required us to invest heavily in instrumenting our software systems, using and building new debugging tools, and encouraging engineers to collaborate on investigating and resolving issues.

  • Simplicity is key to Kraken’s success – the stability of simple sytems is needed to debug complex issues
  • Identifying the right metrics that capture a system’s performance, error rate, and latency is difficult.

We have found it useful to identify several candidate metrics and then observe their behaviour over tens to hundreds of tests to determine whech ones provide the highest signal. However, once we identify stable metrics, their thresholds are easy to configure and almost never change once set.

  • Specialized error handling mechanisms such as automatic failover and fallbacks make systems harder to debug. “We find that such mitigations need to be well instrumented to be effective in the long run, and prefer more direct methods such as graceful degradation.”
  • There are quick fixes (allocating capacity, changing configuration or load balancing strategies) that have been essential for rapidly solving production issues. Profiling, performance tuning, and system redesign are only undertaken when the benefit justifies the cost.

I’ll leave you with this quote:

…Kraken allowed us to surface many bottlenecks that were hidden until the systems were under load. We identified problems, experimented with remedies, and iterated on our solutions over successive tests. Further, this process of continually testing and fixing allowed us to develop a library of solutions and verify health without permitting regressions.

Building machines that learn and think like people

November 25, 2016

Building machines that learn and think like people Lake et al., arXiv 2016

Pro-tip: if you’re going to try and read and write up a paper every weekday, it’s best not to pick papers that run to over 50 pages. When the paper is as interesting as “Building machines that learn and think like people” though, it’s a pleasure to take a little extra time. You may recall from Monday’s write-up of “AI and Life in 2030” that alongside steady progress, I harbour a secret hope that we have one or more big breakthroughs over the next decade.  Where might those breakthroughs come from? I’m sure we’ll be able to take on problems of increasing scale, but I suspect progress there will feel more incremental. The places where we could still make order-of-magnitude style improvements seem to be:

  • Data efficiency – training a model to a certain level of proficiency using much less training data than today’s models require.
  • Training time – closely correlated to data efficiency, achieving a certain level of proficiency with greatly reduced training time.
  • Adaptability – being able to more effectively take advantage of prior ‘knowledge’ (trained models) when learning a new task (which also implies needing less data, and shorter training times of course). (See e.g. “Progressive neural networks” ).

Plus I hope, a few wonderful surprises coming out of research teams and industrial labs.  ‘Building machines that learn and think like people’ investigates some of these questions by asking how humans seem to learn, where we still outperform state-of-the-art machine learning systems, and why that might be. It’s in a similar vein to “Towards deep symbolic reinforcement learning“, one of my favourite papers from the last couple of months.

For those of you short on time, here’s the TL;DR version from the abstract:

We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations.

You’ll be missing out on a whole lot if you just stop there though.

Pattern recognition vs model building

Like Garnelo et al., Lake et al. see an important difference between learning systems that are fundamentally based on statistical pattern recognition, and learning systems that build some model of the world they can reason over.

The pattern recognition approach discovers features that have something in common – classification labels for example – across a large diverse set of training data. The model building approach creates models to understand and explain the world, to imagine consequences of actions, and make plans.

The difference between pattern recognition and model-building, between prediction and explanation, is central to our view of human intelligence. Just as scientists seek to explain nature, not simply predict it, we see human thought as fundamentally a model-building activity.

Two challenges that reveal current limitations

In cognitive science, the mind is not perceived as starting out as a collection of general purpose neural networks with few initial constraints. Instead, (most) cognitive scientists believe we start out with a number of early inductive biases that include core concepts such as number, space, agency, and objects, as well as learning algorithms that rely on prior knowledge to extract knowledge from small amounts of training data.  Lake et al. present two simple challenge problems that highlight some of these differences.

Character recognition

If the field of machine learning has a pet store / pet shop equivalent, then recognising the digits 0-9 from the MNIST data set might be it. Machines can now achieve human-level performance on this task, so what’s the problem?  Compared to a machine learning system, people:

  • Learn from fewer examples (we can learn to recognise a new handwritten character from a single example)
  • Learn richer representations…

People learn more than how to do pattern recognition, they learn a concept – that is, a model of the class that allows their acquired knowledge to be flexibly applied in new ways. In addition to recognising new example, people can also generate new examples, parse a character into its most important parts and relations, and generate new characters given a small set of related characters. These additional abilities come for free along with the acquisition of the underlying concept. Even for these simple visual concepts, people are still better and more sophisticated learners than the best algorithms for character recognition. People learn a lot more from a lot less, and capturing these human-level learning abilities in machines is the Characters Challenge.

Playing Frostbite

Frostbite is one of the 49 Atari games that the DeepMind team trained a DQN to play. Human-level performance was achieved on 29 of the games, but Frostbite was one the DQN had particular trouble with as it requires longer-range planning strategies. “Frostbite Bailey” must construct an igloo within a time limit while jumping on ice floes, gathering fish, and avoiding hazards.

Although it is interesting that the DQN learns to play games at human-level performance while assuming very little prior knowledge, the DQN may be learning to play Frostbite and other games in a very different way than people do.

  • It needs much more training time – the DQN was compared to professional gamers who each had 2 hours of training time; DQN had 38 days and achieved less than 10% of human-level performance during a controlled test session.
  • Humans can grasp the basics of the game after just a few minutes of play. “We speculate that people do this by inferring a general schema to describe the goals of the game and the object types and their interactions, using the kinds of intuitive theories, model-building abilities and model-based planning mechanisms we describe below.”
  • Humans can quickly adapt what they have learned to new goals. For example: get the lowest possible score; get closest to some score without ever going over; pass each level at the last possible minute, right before the time hits zero; get as many fish as you can; and so on.

This range of goals highlights an essential component of human intelligence: people can learn models and use them for arbitrary new tasks and goals.

Of course, one of the reasons that humans can learn and adapt so quickly is that we can approach new problems armed with extensive prior experience, whereas the DQN is starting completely from scratch. How can we build machine learning systems that don’t always need to start from scratch?

How do we bring to bear rich prior knowledge to learn new tasks and solve new problems so quickly? What form does that prior knowledge take, and how is it constructed, from some combination of inbuilt capacities and previous experience?

The next three sections highlight some of the core ingredients en-route to meeting this challenge.

… future generations of neural networks will look very different from the current state-of-the-art. They may be endowed with intuitive physics, theory of mind, causal reasoning, and other capacities…

Intuitive physics

What do you get if you cross Deep Learning and Wolfram Alpha++? Humans have an understanding of several core domains very early in their development cycle, including numbers, space, physics, and psychology.

At the age of 2 months, and possibly earlier, human infants expect inanimate objects to follow principles of persistence, continuity, cohesion, and solidity. Young infants believe objects should move along smooth paths, not wink in and out of existing, not inter-penetrate and not act at a distance….

At 6 months further expectations are developed around rigid bodies, soft bodies, and liquids. At 12 months concepts such as inertia, support, containment, and collisions.

What are the prospects for embedding or acquiring this kind of intuitive physics in deep learning systems?

A promising recent paper from the Facebook AI team on PhysNet may be a step in this direction – it can learn to do simple ‘Jenga-style’ calculations on the stability of block towers with two, three, or four cubical blocks. It matches human performance on real images, and exceeds human performance on synthetic ones. PhysNet does require extensive training though, whereas people require much less and can generalize better.

Could deep learning systems such as PhysNet capture this flexibility, without explicitly simulating the causal interactions between objects in three dimensions? We are not sure, but we hope this is a challenge they will take on.

Intuitive psychology

Pre-verbal infants can distinguish animate agents from inanimate objects….

… infants expect agents to act contingently and reciprocally, to have goals, and to take efficient actions towards those goals subject to constraints. These goals can be socially directed; at around three months of age, infants begin to discriminate anti-social agents that hurt or hinder others from neutral agents and they later distinguish between anti-social, neutral, and pro-social agents. It is generally agreed that infants expect agents to act in a goal-directed, efficient, and socially sensitive fashion.

While we don’t know exactly how this works, one explanation is the use of generative models of action choice (“Bayesian theory-of-mind” models). These models formalise concepts such as ‘goal’, ‘agent’, ‘planning’, ‘cost’, ‘efficiency’, and ‘belief.’ By simulating the planning process of agents, people can predict what they might do next, or use inverse reasoning from a series of actions to infer agent beliefs and utilities.

As with objects and forces, it is unclear whether a complete representation of these concepts (agents, goals, etc.) could emerge from deep neural networks trained in a purely predictive capacity…

Consider the Frostbite challenge – watching an expert play, intuitive psychology lets us infer the beliefs, desires and intentions of the player. “For instance, we can learn that birds are to be avoided from seeing how the experienced player appears to avoid them. We do not need to experience a single example of encountering a bird – and watching Frostbite Bailey die because of the bird – in order to infer that birds are probably dangerous.”

There are several ways that intuitive psychology could be incorporated into contemporary deep learning systems…. a simple inductive bias, for example the tendency to notice things that move other things, can bootstrap reasoning about more abstract concepts of agency. Similarly, a great deal of goal-directed and socially-directed can also be boiled down to simple utility-calculus in a way that could be shared with other cognitive abilities.

Learning as model building

Children (and adults) have a great capacity for ‘one-shot’ learning – a few examples of a hairbrush, pineapple, or light-sabre and a child understands the category, “grasping the boundary of the infinite set that defines each concept from the infinite set of all possible objects.”

Contrasting with the efficiency of human learning, neural networks – by virtue of their generality as highly flexible function approximations – are notoriously data hungry.

Even with just a few examples, people can learn rich conceptual models. For example, after seeing an example of a novel two-wheeled vehicle, a person can sketch new instances, parse the concept into its most important components, or even create new complex concepts though the combination of familiar concepts.

This richness and flexibility suggests that learning as model building is a better metaphor than learning as pattern recognition. Furthermore, the human capacity for one-shot learning suggests that these models are built upon rich domain knowledge rather than starting from a blank slate.

The authors of this paper developed an algorithm using Bayesian Program Learning (BPL) that represents concepts as simple stochastic programs – structured procedures that generate new example of a concept when executed.

These programs allow the model to express causal knowledge about how the raw data are formed, and the probabilistic semantics allow the model to handle noise and perform creative tasks. Structure sharing across concepts is accomplished by the compositional reuse of stochastic primitives that can combine in new ways to create new concepts.

BPL can perform a challenging one-shot classification task at human-level performance. One for a future edition of The Morning Paper perhaps.

Another interesting kind of model is a causal model. In the interests of space I won’t discuss it here, but see §4.2.2 in the paper for details.

A final area the authors discuss in this section is “learning to learn”:

While transfer learning and multi-task learning are already important themes across AI, and in deep learning in particular, they have not led to systems that learn new tasks as rapidly and flexibly as humans do… To gain the full benefit that humans get from learning-to-learn, AI systems might first need to adopt the more compositional (or more language-like) and causal forms of representations that we have argued for above.

A system for example that learned compositionally structured causal models of a game – built on a foundation of intuitive physics and psychology – could transfer knowledge more efficiently and thus learn new games much more quickly.

Thinking fast

Hierarchical Bayesian models operating over probabilistic programs are equipped to deal with theory-like structures and rich causal representations of the world, yet there are formidable challenges for efficient inference… For domains where programs or theory learning happens quickly, it is possible that people employ inductive biases not only to evaluate hypotheses, but also to guide hypotheses selection.

For example , “20 inches” cannot possibly be the answer to the question “What year was Lincoln born?” Recent work has attempted to tackle this challenge using feed-forward mappings to amortize probabilistic inference computations. See §4.3.1 for references.

Outside of the ML mainstream?

This is already about 50% longer than my target write-up length (but to be fair, the paper is >> 50% longer than the average paper length!) so I shall stop here and encourage you to dive into the full paper if this piques your interest. I’ll leave you with this closing thought: if we are going to see such breakthroughs in machine learning, it’s highly likely they’ll be developed either by those who remember earlier eras of AI, or those working a little bit outside of the mainstream. Keep your eye on the left-side of the field🙂.