We’ve seen systems that help to select the best data structure from a pre-defined set of choices (e.g. ‘Darwinian data structure selection’), systems that synthesise data structure implementations given an abstract specification (‘Generalized data structure synthesis’), systems that use learning to tune configurations given a pre-defined set of configuration options (e.g, ‘BOAT: Building auto-tuners with structured Bayesian optimization’), systems that use learned models as key inputs to algorithms (e.g. ‘SageDB: a learned database system’), and systems that use reinforcement learning to discover fit-for-workload policies (‘Towards a hands-free query optimizer through deep learning’). Today’s paper choice jumps right into the middle of this mix, showing how families of data structures can be organised into design continuums with well-understood design parameters and cost models informing selection of a given data-structure design from within the space. As well as helping us to understand and reason about data structures within the space, the constrained feature space with quick-to-evaluate cost models opens the way to practical machine learning guided data structure synthesis and selection. The notion of design continuums is explored in the context of one specific continuum: data structures for key-value stores.

…these properties allow us to envision a new class of self-designing key-value stores with a substantially improved ability to adapt to workload and hardware changes by transitioning between drastically different data structure designs to assume a diverse set of performance properties at will.

The paper jumps back and forward between the details of the design continuum for KV-stores, and the idea of a design continuum itself. I shall put more emphasis on the latter in this write-up.

### What is a design continuum?

Selecting a data structure involves making trade-offs: what’s right for one workload and set of objectives may not be right for another. When we’re building systems by hand, we tend to create either (a) general purpose systems, or (b) systems positioned towards extremes in the design space, specialised for a particular task. There’s a lot of ground between the two. The picture that emerges for me is that we’re heading towards a world of ‘personalisation’ (mass-customisation) for system software. In consumer facing software, personalisation is the idea that the system can learn the interests and preferences of its individual users, and tailor the user experience for each. Whereas we all used to get a general purpose user-experience, now personalisation is everywhere. In system software, the ‘user’ is a workload, and by learning the ‘preferences’ (characteristics) of that workload we can deliver an optimised ‘experience’ when executing it. Instead of general purpose we have fit-for-(this exact)-purpose.

Mass customisation requires (a) that we be able to enumerate points in the data structure space such that we can explore it efficiently, and (b) a way of efficiently evaluating the performance of a given data structure design. In general this is a really hard problem because the design space is vast.

Our insight is that there exist “design continuums” embedded in the design space of data structures. An intuitive way to think of design continuums is as a performance hyperplane that connects a specific set of data structure designs. Design continuums are effectively a projection of the design space, a “pocket” of designs where we can identify unifying properties among its members.

A properly constructed design continuum has a small set of design parameters forming a continuous performance tradeoff for fundamental performance metrics.

### Constructing design continuums

A design continuum has five main elements.

Firstly there are a set of environment parameters which give us critical information about the workload and the environment (e.g. hardware) it runs on. For the key-value store design continuum, the authors use the following environment parameters:

Secondly, a super-structure shows how design primitives can be combined in structured ways to create families of data structures. In the key-value store example, the super-structure is based on the founding idea of a number of layers or levels, with upper levels being hot, and lower levels progressively colder.

Third, a set of design parameters form the externalised ‘knobs’ that can be tuned to influence the design. The ideal is to have the smallest set of movable design abstractions that permit differentiating among target designs in the continuum.

Specifically, fewer design parameters (for the same target designs) lead to a cleaner abstraction which in turn makes it easier to come up with algorithms that automatically find the optimal design. We minimize the number of design parameters in two ways: 1) by adding deterministic design rules which encapsulate expert knowledge about what is a good design, and 2) by collapsing more than one interconnected design decisions to a single design parameter.

For the key-value store design continuum, we have the following design parameters:

The design parameters inform, but do not themselves directly specify, the choices to be made within the super-structure. For this we have a set of design rules, parameterized by the environment and design parameters. “These rules enable instantiating specify designs by deterministically deriving key design aspects.” For example, in the key-value continuum one of the rules is that the capacity of levels grows exponentially at each level by a factor of T (a design parameter). And a memory budget will first be used for fence pointers, with the remainder assigned to Bloom filters. Here are the design rules for the KV continuum:

Finally, a cost model provides a closed-form equation for each one of the core performance metrics. The cost model provides an understanding of how changes in the design parameters (configuration tuning) impact the various performance metrics. The cost model for the KV continuum looks like this:

Using the cost model, a theory-of-constraints style approach can be used to find an optimal design: iteratively finding the current limiting factor and the knobs that can be tweaked to alleviate it.

The overall process of putting together a design continuum looks like this, where we proceed from left to right:

An ideal design continuum has six key properties:

1. It is functionally intact; all possible designs should be able to support all operation types (i.e., are substitutable).
2. It is pareto-optimal. There should be no two designs such that one of them is better than the other on one or more of the performance metrics (and vice versa) while being equal on all the others.
3. It has a one-to-one mapping from the domain of design knobs to the co-domain of performance and memory trade-offs (bijective).
4. It enables a diverse set of performance properties.
5. The time complexity for navigating the continuum to converge on an optimal or near-optimal design should be tractable. (I.e., it is navigable).
6. It is likely to be layered in order to provide a trade-off between diversity and navigability.

#### Growing a design continuum

One way to evolve a design continuum is to start with a small coherent core and then incrementally extend it. This is a three-step process: bridging, patching, and then costing. Bridging is the initial model extension, which in order of preference can take place by (1) introducing new design rules, (2) expanding the domains of existing design parameters, or (3) adding new design parameters. The bridging process may introduce many new intermediate designs, some of which may not be functionally intact (workable). Patching adds new design rules and/or constraints on permissible design parameter settings to avoid these broken designs. Care should be take that patching only constrains new parts of the continuum opened up by bridging. The final step is to generalize the cost model to account for all the new designs.

### Self-designing systems

Knowing which design is the best for a workload opens the opportunity for systems that can adapt on-the-fly. While adaptivity has been studied in several forms including adapting storage to queries, the new opportunity is morphing among what is typically considered as fundamentally different designs, e.g. from an LSM-tree to a B+ tree, which can allow systems to gracefully adapt to a larger array of diverse workload patterns.

The authors identify three challenges on the way to such self-designing systems. Firstly, if we’re going to adapt on-the-fly, then we need way to physically transition among any two designs at runtime (and take into account the cost of those transitions). Secondly, we’ll need to generate tailored code for each design point. Given that we a continuum, it should be possible to write a single generalized algorithm for each operation that can instantiate the concrete algorithm used for that operation in each possible design. Finally, there will be limits to what we can do with rule-based tweaking of design parameters.

The path forward is to combine machine learning with the design continuum. Machine learning is increasingly used to tune exposed tuning knobs in systems. The new opportunity here is the native combination of such techniques with the system design itself… The net result in that design continuums can be blended with ML approaches to co-design a tailored system that both knows how to navigate a vast space of the design space and learns when needed to navigate design options that are hard to deterministically formulate how they will interact with the rest of the design.

### The Key-Value store design continuum

I’ve hit my target word limit for a #themorningpaper post, and we haven’t really had a chance to dive into the KV continuum in any detail at all. In brief, the authors show how B+ trees, Log-Structured-Merge trees, and Log-Structured Hash-tables can be combined into one layered design continuum, with different instances falling out from different design parameter choices.

(Enlarge)

There’s a whole bunch of good detail here that’s interesting in its own right, so if the big ideas in the paper have caught your imagination, I definitely encourage you to go and check it out.

Towards a hands-free query optimizer through deep learning Marcus & Papaemmanouil, CIDR’19

Where the SageDB paper stopped— at the exploration of learned models to assist in query optimisation— today’s paper choice picks up, looking exclusively at the potential to apply learning (in this case deep reinforcement learning) to build a better optimiser.

### Why reinforcement learning?

Query optimisers are traditionally composed of carefully tuned and complex heuristics based on years of experience. Feedback from the actual execution of query plans can be used to update cardinality estimates. Database cracking, adaptive indexing, and adaptive query processing all incorporate elements of feedback as well.

In this vision paper, we argue that recent advances in deep reinforcement learning (DRL) can be applied to query optimization, resulting in a “hands-free” optimizer that (1) can tune itself for a particular database automatically without requiring intervention from expert DBAs, and (2) tightly incorporates feedback from past query optimizations and executions in order to improve the performance of query execution plans generated in the future.

If we view query optimisation as a DRL problem, then in reinforcement learning terminology the optimiser is the agent, the current query plan is the state, and each available action represents an individual change to the query plan. The agent learns a policy which informs the actions it chooses under differing circumstances. Once the agent decides to take no further actions the episode is complete and the agent’s reward is (ideally) a measure of how well the generated plan actually performed.

There are a number of challenges, explored in this paper, with making this conceptual mapping work well in practice. Not least of which is that evaluating the reward function (executing a query plan to see how well it performs) is very expensive compared to e.g. computing the score in an Atari game.

### The ReJOIN join order enumerator

ReJOIN explores some of these ideas on a subset of the overall query optimisation problem: learning a join order enumerator.

Each query sent to ReJOIN is an episode, the state represents subtrees of a binary join tree together with information about the query join and selection predicates. Actions combine two subtrees into a single tree. Once all input relations are joined the episode ends, and ReJOIN assigns a reward based on the optimiser’s cost model. It’s policy network is updated on the basis of this score. The final join ordering is passed to the optimiser to complete the physical plan.

Using the optimiser’s cost model as a proxy for the ultimate performance of a generated query plan enables join orderings to be evaluated much more quickly.

The following chart shows how ReJOIN learns to produce good join orders during training. It takes nearly 9000 episodes (queries) to become competitive with PostgreSQL.

Once ReJOIN has caught up with PostgreSQL, it goes on to surpass it, producing orderings with lower cost.

Also of note here is that after training, ReJOIN produces its query plans faster than PostgreSQL’s built-in join enumerator in many cases. The bottom-up nature of ReJOIN’s algorithm is $O(n)$, whereas PostgreSQL’s greedy bottom-up algorithm is $O(n^2)$.

In addition to being limited to just join ordering, ReJOIN’s use of the query optimiser’s cost model to generate reward signals means that it is still dependent on a well-tuned cost model, which is a big part of the problem we wanted to solve in the first place. Ideally we’d like to extend the approach to handle full physical plan generation, and also remove the dependency on having an existing cost model.

### Challenges in extending the approach

Once we go from just join ordering to the full search space including operator and access path selection etc., the approach from ReJOIN is unable to learn effective polities in reasonable time. An initial model failed to out-perform random choice even after 72 hours of training.

If we use actual query execution time to generate the reward, then initial policies which will often generate very inefficient plans will take a long time to obtain a reward signal. Thus we learn the slowest at exactly the point when we’d like to learn the fastest and the system takes a prohibitive amount of time to converge to good results. (An experiment with ReJOIN using real query latency instead of optimiser cost confirmed this).

Finally, query latency as a reward signal doesn’t meet the expectations of many DRL algorithms that the reward signal is dense and linear. A dense reward signal is one that provides incremental feedback with every action the agent takes (such as the score updating in an Atari game), not just at the end of the episode. The linear assumption means that an algorithm may attempt to maximise the sum of many small rewards within an episode.

We have identified how the large search space, delayed reward signal, and costly performance indicators provide substantial hurdles to naive applications of DRL to query optimization.

Should we just give up on the idea then? All is not lost yet! The last section of the paper details a number of alternative approaches that could overcome some of these hurdles.

### Research directions

Three techniques that may help to make DRL-based query optimisation practical again are learning from demonstration, cost-model bootstrapping, and incremental learning. We’ll look at each of these briefly in turn next (there are no results from applying or evaluating these ideas as yet).

#### Learning from demonstration

In learning from demonstration, a model is first trained to mimic the behaviour of an existing expert, and then goes on to learn directly from its actions. In the context of query optimisation, we would first train a model to mimic the actions taken by an existing optimiser (indexes, join orderings, pruning of bad plans etc.), and then switch to optimising queries directly bypassing the ‘mentor’ optimiser. In this second phase the agent fine-tunes its own policy. The advantage of this strategy is that mimicking the existing optimiser in the early stages helps the optimiser agent to avoid the ‘obviously bad’ parts of the search space.

Since the behavior of the model in the second phase should not initially stray too far from the behavior of the expert system, we do not have to worry about executing any exceptionally poor query plans. Additionally, since the second training phase only needs to fine-tune an already-performant model, the delayed reward signal is of far less consequence.

#### Cost model bootstrapping

Cost model bootstrapping uses a very similar in spirit two-phase approach. In the first phase, instead of learning to mimic the actions of an existing expert optimiser, the judgements of the existing expert optimiser (i.e., it’s cost model) are used to bring the agent to an initial level of competence. The optimiser’s cost model is used as the reward signal during initial training, and once the agent has learned to produce good policies according to the cost model, it is then switched to learning from a reward based on actual query latency.

One complication is doing this is that the reward units (scale) need to be consistent across the costs produced by the query optimiser cost model, and the latency measurements of actual query executions. We could apply some kind of normalisation across the two, or alternatively transfer the weights from the first network to a new network (transfer learning).

#### Incremental learning

Incremental learning attempts to mitigate the issues stemming from poor query plans early in the learning cycle by beginning learning on simpler problems:

… incrementally learning query optimization by first training a model to handle simple cases and slowly introducing more complexity. This approach makes the extremely large search space more manageable by dividing it into smaller pieces.

In the context of query optimisation, we can make problems easier by reducing the number of relations, or by reducing the number of dimensions to be considered. That leads to a problem space that looks like this:

We could therefore try starting with a small number of pipeline phases and gradually introducing more, as shown in the following figure:

Or we could try starting with small examples and gradually focus on larger and larger queries.

Maybe a hybrid strategy will be best, starting with join order and small queries, and gradually increasing sophistication in both dimensions.

SageDB: a learned database system Kraska et al., CIDR’19

About this time last year, a paper entitled ‘The case for learned index structures’ (part I, part II) generated a lot of excitement and debate. Today’s paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.

The core idea behind SageDB is to build one or more models about the data and workload distribution and based on them automatically build the best data structures and algorithms for all components of the database system. This approach, which we call “database synthesis” will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

### For the want of a model

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.

Consider an extreme case: we want to store and query ranges of fixed-length records with continuous integer keys. Using a conventional index here makes no sense as the key itself can be used as an offset. A C program loading 100M integers into an array and summing over a range runs in about 300ms. Doing the same operation in Postgres takes about 150 seconds: a 500x overhead for the general purpose design.

…we can optimize almost any algorithm or data structure used by the database with knowledge of the exact data distribution. These optimizations can sometimes even change the complexity class of well-known data processing algorithms.

Knowledge of the data distribution comes in the form of a (learned) model. Armed with such a model, the authors argue that we can automatically synthesise index structures, sorting and join algorithms, and even entire query optimisers, leveraging the data distribution patterns for performance gains.

### Overfitting is good

What kind of a model makes sense? A histogram for example is a very simple model, but for the use cases discussed here either too coarse-grained or too big to be useful. At the other end of the spectrum, deep and wide neural nets come with high costs (though these are expected to decrease with advances in hardware). Combine this with the fact that for this use case, ‘overfitting’ is good! We want to capture the precise nuances of our exact data as precisely as possible. (The research program to date is largely focused on analytic workloads, some degree of generalisation is clearly beneficial once we start to consider updates).

As of today, we found that we often need to generate special models to see significant benefits.

As an example, consider the RMI model from ‘The case for learned index structures’ :

1. Fit a simple model (linear regression, simple neural net etc.) over the data
2. Use the prediction of the model to pick another model, an expert, which more accurately models the subset of the data
3. Repeat the process until the leaf model is making a final prediction

RMI is just a starting point. For example, it is possible to make the top model or bottom model more complex, replace parts of the models at a particular level stage with other types of models, use quantization, vary the feature representation, combine models with other data structures, and so on. We therefore believe we will see an explosion of new ideas on how to most efficiently generate models for database components to achieve the right balance between precision, low latency, space, and execution time for a given workload.

### Data access

Last year’s paper on ‘The case for learned index structures’ showed that an RMI-based index can outperform state of the art B-Tree implementations by a factor of two while being orders of magnitude smaller (“note that the updated arXiv version contains new results“). Subsequent work has extended this to data stored on disk, compression inserts, and multi-dimensional data.

For multi-dimensional data, the baseline is an R-Tree (as opposed to a B-Tree). R-Trees map rectangles to a list of index ranges such that the index of every point lying in the rectangle is contained in the union of these ranges. We can replace an R-Tree with a learned model, just as we could the B-Tree. One of the tricks that makes the RMI B-Tree replacement work is that it is sufficient for the model to get us ‘in the right locality’ and then we can do a local search around the prediction to finish the job. For R-Trees, we also need a layout that enables efficient localised search.

While many possible projection strategies exist, we found that successively sorting and partitioning points along a sequence of dimensions into equally-sized cells produces a layout that is efficient to compute, learnable (e.g., in contrast to z-order, which is very hard to learn), and tight (i.e., almost all points in the union of the index ranges satisfy the query).

The authors implemented such a learned index over an in-memory column store with compression, and compared it to a full column scan, a clustered index (sorting by the column providing the best overall performance), and an R-Tree. The benchmarks used 60 million records from the lineitem table of the TPC-H benchmark, with query selectivity of 0.25%.

The learned index beats the next best performing implementation by 34x (note the log scales on the charts) and has only a tiny space overhead compared to the clustered solution.

Further analysis revealed that the learned index beats the clustered index on almost every type of query – the exception is when the clustered dimension in the clustered index is the only dimension in the query.

### Query execution

This is one of my favourite parts of the paper, because it demonstrates how learned models can even help in the humble and age-old case of sorting. The approach to sorting is to use a learned model to put the records into roughly the right order, and then correct the nearly perfected sorted data as a final step. For this an efficient local-sort such as insertion sort can be used, which is very fast with almost-sorted arrays.

The figure below shows results of a learned approach to sorting for increasingly large data sizes of 64-bit doubles randomly sampled from a normal distribution. In the comparison, Timsort is the default sort for Java and Python, std::sort is from the C++ library. The learned variant is 18% faster than the next best (Radix sort in this case) on average.

(This doesn’t include the time taken to learn the model).

Learned models can also be used to improve joins. For example, consider a merge-join with two stored join columns and a model-per-column. We can use the model to skip data that will not join (the authors don’t detail how the equivalent of ‘local patching’ is supposed to work in this scenario, it’s not immediately obvious to me).

The authors also experimented with workload aware schedulers, implementing a reinforcement-learning based scheduling system using a graph neural network:

Our system represents a scheduling algorithm as a neural network that takes as input information about the data (e.g., using a CDF model) and the query workload (e.g, using a model trained on previous executions of queries) to make scheduling decisions.

On a sample of 10 TPC-H queries, the learned scheduler improved average job completion time by 45% over Spark’s default FIFO scheduler.

The strategy that the scheduler learned to get this improvement was to combine completing short jobs quickly with maximising cluster efficiency, learning to run jobs near their parallelism ‘sweet spot.’

### Query optimiser

Traditional query optimizers are extremely hard to build, maintain, and often yield sub-optimal query plans. The brittleness and complexity of the optimizer makes it a good candidate to be learned…

Initial experiments starting with a traditional cost model and refining it over time through learning showed that the model quality can be improved, but that to make big gains would require making significant improvements to cardinality estimation. The research direction now (no reported results as yet) is to explore hybrid-model based approaches to cardinality estimation. These hybrid models combine a learned model of the underlying data patterns and correlations, with exception/outlier listens that capture extreme (and hard to learn) anomalies of the particular instance of the data.

### Other areas

Other suggested areas where learned models may prove beneficial in the future include approximate query processing, predictive modelling, and workloads including inserts and updates.

### The last word

SageDB presents a radical new approach to build database systems, by using ML models combined with program synthesis to generate system components. If successful, we believe this approach will result in a new generation of big data processing tools, which can better take advantage of GPUs and TPUs, provide significant benefits in regard to storage consumption and space, and, in some cases, even change the complexity class of certain data operations.

Serverless computing: one step forward, two steps back Hellerstein et al., CIDR’19

The biennial Conference on Innovative Data Systems Research has come round again. Today’s paper choice is sure to generate some healthy debate, and it’s a good set of questions to spend some time thinking over as we head into 2019: Where do you think serverless is heading? What is it good for today? What’s the end-goal here?

The authors see ‘critical gaps’ in current first-generation serverless offerings from the major cloud vendors (AWS, Azure, GCP). I’m sure some will read the paper as serverless-bashing, but I read into it more of an appeal from the heart to not stop where we are today, but to continue to pursue infrastructure and programming models truly designed for cloud platforms. Platforms that offer ‘unlimited’ data storage, ‘unlimited’ distributed processing power, and the ability to harness these only as needed.

We hope this paper shifts the discussion from ‘What it serverless?’ Or ‘Will serverless win?’ to a rethinking of how we design infrastructure and programming models to spark real innovation in data-rich, cloud-scale systems. We see the future of cloud programming as far, far brighter than the promise of today’s serverless FaaS offerings. Getting to that future requires revisiting the designs and limitations of what is being called ‘serverless computing’ today.

### What is serverless and what is it good for today?

Serverless as a term has been broadly co-opted of late (‘the notion is vague enough to allow optimists to project any number of possible broad interpretations on what it might mean’). In this paper, serverless means FaaS (Functions-as-a-Service), in the spirit of the offerings that the major cloud vendors are currently promoting.

• Developers can register functions to run in the cloud, and declare when those functions should be triggered
• A FaaS infrastructure monitors triggering events, allocates a runtime for the function, executes it, and persists the results
• The platform is not simply elastic, in the sense that humans or scripts can add and remove resources as needed, it is autoscaling: the workload automatically drives the allocation and deallocation of resources
• The user is only billed for the computing resources used during function invocation

Many people have reported great results from adopting serverless. See for example ‘Serverless computing: economic and architectural impact’ that we looked at previously on The Morning Paper. The authors analysed use cases documented by Amazon and broke them down into three major buckets:

• Embarrassingly parallel tasks, often invoked on-demand and intermittently. For example, re-sizing images, performing object recognition, and running integer-programming-based optimisations.
• Orchestration functions, used to coordinate calls to proprietary auto-scaling services, where the back-end services themselves do the real heavy lifting.
• Applications that compose chains of functions – for example workflows connected via data dependencies. These use cases showed high end-to-end latencies though.

### Limitations: putting the ‘less’ in serverless

Moving beyond the use cases outlined above we run into a number of limitations of the current embodiments of serverless:

• Functions have limited lifetimes (15 minutes)
• Functions are critically reliant on network bandwidth to communicate, but current platforms from AWS, Google, and Azure provide limited bandwidth (an order of magnitude lower than a single modern SSD) that is shared between all functions packed on the same VM.
• Since functions are not directly network accessible, they must communicate via an intermediary service – which typically results in writing state out to slow storage and reading it back in again on subsequent calls.
• There is no mechanism to access specialized hardware through functions, yet hardware specialisation will only accelerate in the coming years.

One consequence of these limitations is that FaaS is a data-shipping architecture: we move data to the code, not the other way round.

This is a recurring architectural anti-pattern amongst system designers, which database aficionados seem to need to point out each generation. Memory hierarchy realities— across various storage layers and network delays — make this a bad design decision for reasons of latency, bandwidth, and cost.

A second consequence is that the high cost of function interactions ‘stymies basic distributed computing.’

… with all communication transiting through storage, there is no real way for thousands (much less millions) of cores in the cloud to work together efficiently using current FaaS platforms, other than via largely uncoordinated (embarrassing) parallelism.

(Enlarge)

Three small case studies in the paper illustrate the pain of moving outside of the serverless sweetspot:

• Training machine learning models. A single Lambda function was able to perform 294 iterations before hitting the time limit, so 31 sequential executions were required to complete training. The total cost was $0.29, and the total training latency was 465 minutes. Using an m4.large instance with 8GB RAM instead, the training process takes around 21 minutes and costs$0.04.
• Low-latency prediction serving. The best result using Lambda functions achieved an average latency of 447ms per batch. An alternative using two m5.large instances connected via ZeroMQ showed an average latency of 13ms per batch – 27x faster. For one million messages per second, the Lamdba + SQS solution would costs $1,584 per hour. Using EC2 instances the same throughput costs around$27.84 per hour.
• Trying to implement a simple leader election protocol through functions using a blackboard architecture with DynamoDB, each round of leader election took 16.7 seconds. (And using DynamoDB as a communication mechanism in this way would cost \$450 per hour to coordinate a cluster of 1000 nodes).

These results fall into the “we tried to fix some screws with a hammer and it didn’t work out too well” bucket. But highlighting where current serverless designs don’t work well can help us think about ways to improve them.

Many of the constraints of current FaaS offerings can also be overcome, we believe, maintaining autoscaling while unlocking the performance potential of data and distributed computing.

### Moving forward

A ‘truly programmable environment for the cloud’ should have the following characteristics:

• Fluid code and data placement: logical separation of code and data with the ability of the infrastructure to physically colocate when it makes sense, including function shipping.
• Heterogenous hardware support. “Ideally, application developers could specify applications using high-level DSLs, and the cloud providers would compile those applications to the most cost-effective hardware that meets user specified SLOs
• Long-running, addressable virtual agents: “If the platform pays a cost to create an affinity (e.g. moving data), it should recoup that cost across multiple requests. This motivates the ability for programmers to establish software agents— call them functions, actors, services, etc.— that persist over time in the cloud, with known identities.
• Disorderly programming: “the sequential metaphor of procedural programming will not scale to the cloud. Developers need languages that encourage code that works correctly in small, granular units— of both data and computation— that can be easily moved around across time and space.” For example, Functional Reactive Programming.
• A common intermediate representation (IR), that can serve as a compilation target from many languages (because a wide variety of programming languages and DSLs will be explored). Web assembly anyone?
• An API for specifying service-level objectives. “Achieving predictable SLOs requires a smooth ‘cost surface’ in optimization— non-linearities are the bane of SLOs. This reinforces our discussion above regarding small granules of code and data with fluid placement and disorderly execution.
• A ‘cloud-programming aware’ security model.

Taken together, these challenges seem both interesting and surmountable… We are optimistic that research can open the cloud’s full potential to programmers. Whether we call the new results ‘serverless computing’ or something else, the future is fluid.

I’m sure you’ve noticed the trend for many things in IT to become finer-grained over time (releases, team structure, service sizes, deployment units, …). I really need to write a longer blog post about this, but my thesis is that in many cases the ideal we’re seeking is a continuous one (e.g., in the case of serverless, continuous and seamless autoscaling of cloud resources and the associated billing). The more fine-grained we make things, the better our approximation to that ideal (it’s calculus all over again. We’re held back from making things finer-grained beyond a certain point by the transaction costs (borrowing from lean) associated with each unit. E.g. “we can’t possible release once a day, it takes us a full week to go through the release process!” Once we find a way to lower the transaction costs, we can improve our approximation and become finer-grained. If you haven’t seen it already for example, check out the work that CloudFlare have been doing with workers: V8 isolates (v low transaction costs), a web assembly runtime (the IR called for above), and JavaScript functions. Skate to where the puck is heading!

Unsupervised learning of artistic styles with archetypal style analysis Wynen et al., NeurIPS’18

I’ve always enjoyed following work on artistic style transfer. The visual nature makes it easy to gain an appreciation for what is going on and the results are very impressive. It also something that’s been unfolding within the timespan of The Morning Paper, if we peg the beginning to the work of Gatys et al. in 2015. See for example the posts on ‘Texture Networks’ and ‘Deep photo style transfer.’

Beyond direct style transfer, the objective of the work described in today’s paper choice is to uncover representations of styles (archetypes) themselves. Given a large collection of paintings,…

… Our objective is to automatically discover, summarize, and manipulate artistic styles present in the collection.

This is achieved using an unsupervised learning technique called archetypal analysis. We can recover archetypes from a collection of paintings, and we can also go the other way; taking a painting and decomposing it into a combination of archetypes. And of course if we then manipulate the composition of archetypes, we can manipulate the style of an image.

To visualise what an archetype ‘looks like’ the authors synthesise a texture from an image filled with random noise, using the style representation of the archetype. The following image shows some examples, with the synthesised archetypal textures in the left-most column and the three images next to them on each row showing the individual images that made the strongest contribution to the archetype.

The strongest contributions usually exhibit a common characteristic like stroke style or choice of colors. Smaller contributions are often more difficult to interpret. Figure 2a also highlights correlation between content and style: the archetype on the third row is only composed of portraits.

### From a collection of paintings to archetypes

Given a pre-trained VGG-19 network, a concise representation of the style of an individual painting can be obtained in the following manner:

• Take the feature maps from five layers of the trained VGG-19 (the network is trained as an auto-encoder to encode the salient features of the input image)
• Compute the first and second-order statistics of each feature map. Given a layer $l$ with feature map $\mathbf{F}_l$, $p_l$ channels, and $m_l$ pixel positions we have:

• Normalise the statistics by the number of parameters at each layer (divide the statistics for layer l by $p_{l}(p_{l} + 1)$). This was found empirically to be useful for preventing over-representation of layers with more parameters.
• From a style descriptor by concatenating the normalised statistics from all five layers: $\{\mathbf{\mu}_1, \mathbf{\Sigma}_1, ..., \mathbf{\mu}_l, \mathbf{\Sigma}_l\}$.

After creating style descriptors in this way for all the paintings in a collection, singular value decomposition is applied to reduce the dimensions to 4096 while keeping more than 99% of the variance.

Given the resulting set of vectors, archetypal analysis is then used to learn a set of archetypes $\mathbf{Z}$ such that each sample can be well approximated by a convex combination of archetypes. In addition, each archetype is a combination of samples. This is an optimisation problem that can be given to a dedicated solver.

… we use archetypal analysis on the 4096-dimensional style vectors previously described, and typically learn between k=32 to k=256 archetypes. Each painting’s style can then be represented by a sparse low-dimensional code [combination of styles], and each archetype itself associated to a few input paintings.

The evaluation uses two datasets to extract archetypes: the GanGogh collection of 95,997 artworks from WikiArt, and a collection of 1154 paintings and drawings by Vincent van Gogh. 256 archetypes are extracted for GanGogh, and 32 for Vincent van Gogh.

The following figure shows examples of image styles broken down into their contributing archetypes.

### Style manipulation using archetypes

Given a method of decoding a style vector to an image, we can manipulate style vectors in the archetype plane and then transform the results into images. Given a “content” (input) feature map from an original image, whitening and colouring operations can be used to match the mean and covariance structure of a given style feature map. This technique was first described in ‘Universal style transfer via feature transforms.’

Several methods for identifying a target style are of interest:

• Using a single archetype, to produce an image in the style of that archetype
• Using a combination of archetypes, for archetypal style interpolation
• Adjusting the weighting of archetypes already present in the image (e.g., significantly strengthening one component)

For example, the following figure shows the results of amplifying archetypal styles within a source image. The middle panel on each row is the original image. Moving left, the strongest component is emphasised, and moving right the second strongest component.

Ignoring any archetypal styles already present, the following figure shows the results of free exploration of the archetypal space for a given source image:

### A visual summary

Neural ordinary differential equations Chen et al., NeurIPS’18

‘Neural Ordinary Differential Equations’ won a best paper award at NeurIPS last month. It’s not an easy piece (at least not for me!), but in the spirit of ‘deliberate practice’ that doesn’t mean there isn’t something to be gained from trying to understand as much as possible.

### Neural networks as differential equations

Consider a multi-layered neural network. We have an input layer and an output layer, and inbetween them, some number of hidden layers. As an input feeds forward through the network, it is progressively transformed, one layer at a time, from the input to the ultimate output. Each network layer is a step on that journey. If we take a small number of big steps, we end up with a rough approximation to the true transformation function we’d like to learn. If we take a much larger number of steps (deeper networks), with each step being individually smaller, we have a more accurate approximation to the true function. What happens in the limit as we take an infinite number of infinitely small steps? Calculus!

So one way of thinking about those hidden layers is as steps in Euler’s method for solving differential equations. Consider the following illustration from wikipedia:

We want to recover the blue curve, but all we have is an initial point $A_0$ (think inputs to the network) and a differential equation. From the differential equation, we can calculate the tangent line. If we take a small step along the tangent line, we arrive at $A_1$, which will be close to the desired blue line if the step is small enough. Repeat this process to uncover a polygonal curve $A_{0}A_{1}A_{2}...A_{n}$.

Many neural networks have a composition that looks exactly like the steps of Euler’s method. We start with an initial state $\mathbf{z}_0$, and apply successive transformations over time (layers):

• $\mathbf{z}_1 = \mathbf{z}_0 + f(\mathbf{z}_0, \theta_{0})$
• $\mathbf{z}_2 = \mathbf{z}_1 + f(\mathbf{z}_1, \theta_{1})$
• $\mathbf{z}_3 = \mathbf{z}_2 + f(\mathbf{z}_2, \theta_{2})$
• $\mathbf{z}_{t+1} = \mathbf{z}_t + f(\mathbf{z}_t, \theta_{t})$

In the limit, we parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

$\displaystyle \frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)$

The equivalent of having T layers in the network, is finding the solution to this ODE at time T.

Something really neat happens once we formulate the problem in this way though. Just like we’ve seen a number of papers that express a problem in a form suitable for solving by a SAT solver, and then throw a state of the art SAT-solver at it, we can now use any ODE solver of our choice.

Euler’s method is perhaps the simplest method for solving ODEs. There since been more than 120 years of development of efficient and accurate ODE solvers. Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity.

### How to train a continuous-depth network

We’ve seen how to feed-forward, but how do you efficiently train a network defined as a differential equation? The answer lies in the adjoint method (which dates back to 1962). Think of the adjoint as the instantaneous analog of the chain rule.

This approach computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly with problem size, has low memory cost, and explicitly controls numerical error.

The adjoint captures how the loss function L changes with respect to the hidden state ( $- \partial L / \partial \mathbf{z}(t)$). Starting from the output of the network, we can recompute $\mathbf{z}(t)$ backwards in time together with the adjoint.

A third integral then tells us how the loss changes with the parameters $\theta$ ( $dL/d\theta$).

All three of these integrals can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once.

(Don’t ask me to explain that further!)

### Applied Neural ODEs

#### Residual Networks

Section three tackles the good old MNIST problem, comparing an ODE-net to a ResNet with 6 residual blocks. The ODE-net replaces the residual blocks with an ODE-Solve module.

Concentrating on the 2nd and 4th lines in the table below, ODE-Nets are able to achieve roughly the same performance as a ResNet, but using only about 1/3 of the parameters. Also note that the ODE-Net solution using constant memory, whereas ResNets use memory proportional to the number of layers.

It’s not clear how to define the ‘depth’ of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. The figure below shows the number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model.

In other words, the ODE-net is kind of doing the equivalent of deepening its network over time as it needs to add sophistication.

#### Normalising flows

Normalizing flows allow more complex probability distribution functions (pdfs) to be learned. The same trick of shifting from a discrete set of layers to a continuous transformation works in this situation too. The following figure shows normalising flows vs continuous normalising flows (CNF) when trying to learn a pdf. The CNF is trained for 10,000 iterations and generally achieves lower loss than the NF trained for 500,000 iterations.

#### Time-series

This is the application which most caught my attention.

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-defined latent variables… We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state $\mathbf{z}_{t_0}$, and a global set of latent dynamics shared across all time series.

The model can be trained as a variational autoencoder, and it looks like this:

The evaluation here is based on a dataset of 1000 2-dimensional spirals, each starting at a different point. Half of the spirals are clockwise, and half counter-clockwise. Points are sampled from these trajectories at irregular timestamps. The figure below shows a latent neural ODE is better able to recover the spirals than a traditional RNN:

A PyTorch implementation of ODE solvers is available.

The tradeoffs of large scale learning Bottou & Bousquet, NIPS’07

Welcome to another year of The Morning Paper. As usual we’ll be looking at a broad cross-section of computer science research (I have over 40 conferences/workshops on my list to keep an eye on as a start!). I’ve no idea yet what papers we’ll stumble across, but if previous years are anything to go by I’m sure there’ll be plenty of great material to keep interest levels high.

To start us off, today’s paper choice is “The tradeoffs of large scale learning,” which won the ‘test of time’ award at NeurIPS last month.

this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. [Google AI blog: The NeurIPS 2018 Test of Time Award].

For a given time/computation budget, are we better off performing a computationally cheaper (e.g., approximate) computation over lots of data, or a more accurate computation over less data? In order to get a handle on this question, we need some model to help us think about how good a given solution is.

### Evaluating learning systems

Consider an input domain $\mathcal{X}$ and an output domain $\mathcal{Y}$. We have input-output pairs $(x,y) \in \mathcal{X} \times \mathcal{Y}$, and some probability distribution $P(x,y)$. We want to learn the conditional distribution $P(y|x)$— in other words, to make a prediction of the output, $\hat{y}$, given the input $x$. A loss function $\mathcal{L}(\hat{y},y)$ measures the distance between our prediction and the true value. We will approximate the probability distribution with a function $f$ such that $\hat{y} = f(x)$.

The ‘gold standard’ function $f^{*}$ is the one that minimises the expected risk (loss), $E(f)$ over all output classes. We don’t know the true distribution $P(y|x)$ though. Instead, we are given a sample of n independently drawn training examples. The best we can do is find $f^{n}$, the function that minimises the empirical risk $E_{n}(f)$ across these n data points.

When we build a given learning model, we constrain the set of prediction functions that can be learned to come from some family $\mathcal{F}$. Let $f^{*}_{\mathcal{F}}$ be the best possible function within that family in terms of minimising expected risk.

The optimal function $f^{*}$ may well not belong to the family $\mathcal{F}$. So by constraining learned functions to come from $\mathcal{F}$ we introduce an approximation error $\mathcal{E}_{app}$ that measures the difference in expected risk between $f^{*}_{\mathcal{F}}$ and $f^{*}$.

We also have an estimation error $\mathcal{E}_{est}$ that measures the effect of minimising the empirical risk based on training examples rather than the expected risk across the whole distribution (the expected value of $E(f_n) - E(f^*_{\mathcal{F}})$ ).

The estimation error is determined by the number of training examples and by the capacity of the family of functions. Large families of functions have smaller approximation errors but lead to higher estimation errors.

Now, given that we’re working from training samples, it should be clear that the the empirical risk $E_{n}(f)$ is already an approximation of the expected risk $E(f)$. Finding the $f_{n}$ that minimises the empirical risk is often computationally expensive. Since we’re already approximating, maybe a cheaper to compute approximation to $f_n$ would be good enough? If we allow the minimisation algorithm to return an approximate solution $\tilde{f_n}$ that is within $\rho$ of the true minimum, then we introduce a third error term $\mathcal{E}_{opt}$, measuring the gap between $\tilde{f_n}$ and $f_n$.

So now we have a trade-off involving three variables and two constraints:

The constraints are the maximal number of available training examples and the maximal computation time. The variables are the size of the family of functions $\mathcal{F}$, the optimization accuracy $\rho$, and the number of examples $n$.

Typically the error components change with respect to the variables as shown in the following table:

When we are constrained by the number of available training examples (and not by computation time) we have a small-scale learning problem. When the constraint is the available computation time, we have a large-scale learning problem. Here approximate optimisation can possibly achieve better generalisation because more training examples can be processed during the allowed time.

### Exploring the trade-offs in large-scale learning

Armed with this model of learning systems, we can now explore trade-offs in the case of large-scale learning. I.e., if we are compute bound, should we use a coarser-grained approximation over more data, or vice-versa? Let’s assume that we’re considering a fixed family of functions $\mathcal{F}$ and hence ignore the $\mathcal{E}_{app}$ error component. Thus the paper focuses on the choice of optimisation algorithm.

Four gradient descent algorithms are compared:

• Gradient descent, which has linear convergence and requires $\mathcal{O}(\kappa \log(1/\rho))$ iterations to reach accuracy $\rho$. Each iteration considers all training examples.
• Second order gradient descent (2GD), which requires $\mathcal{O}(\log \log(1/\rho))$ iterations to reach accuracy $\rho$. Each iteration considers all training examples.
• Stochastic gradient descent, in which a random training example is chosen at each iteration, and the parameters are updated on the basis of that example only. It reaches accuracy $\rho$ after less than $v \kappa^{2}/\rho + o(1/\rho)$ iterations on average. Neither the initial value of the parameter vector nor the total number of examples appear in the dominant term of the bound.
• Second order stochastic gradient descent, which doesn’t change the influence of $\rho$ on the convergence rate, but does improve the constants.

The results are summarised in the following table. The third column, time to reach a given accuracy, is simply the cost per iteration (column one) times the number of iterations needed (column two). The fourth column bounds the time needed to reduce the excess error below some chosen threshold.

Setting the fourth column expression to $T_{max}$ and solving for $\epsilon$ yields the best excess error achieved by each algorithm within the limited time $T_{max}$. This provides the asymptotic solution of the Estimation-Optimization tradeoff for large-scale problems satisfying our assumptions…

• SGD and 2SGD results do not depend on the estimation rate exponent, $\alpha$. When the estimate rate is poor, there is less need to optimise accurately and so we can process more examples.
• Second order algorithms bring little asymptotic improvements, but do bring improvements in the constant terms which can matter in practice.
• Stochastic algorithms yield the best generalization performance despite being the worst optimization algorithms.

The perspective proposed by Léon and Olivier in their collaboration 10 years ago provided a significant boost to the development of the algorithm that is nowadays the workhorse of ML systems that benefit our lives daily, and we offer our sincere congratulations to both authors on this well-deserved award. [Google AI blog: The NeurIPS 2018 Test of Time Award].