SageDB: a learned database system Kraska et al., CIDR’19

About this time last year, a paper entitled ‘The case for learned index structures’ (part I, part II) generated a lot of excitement and debate. Today’s paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.

The core idea behind SageDB is to build one or more models about the data and workload distribution and based on them automatically build the best data structures and algorithms for all components of the database system. This approach, which we call “database synthesis” will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

### For the want of a model

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.

Consider an extreme case: we want to store and query ranges of fixed-length records with continuous integer keys. Using a conventional index here makes no sense as the key itself can be used as an offset. A C program loading 100M integers into an array and summing over a range runs in about 300ms. Doing the same operation in Postgres takes about 150 seconds: a 500x overhead for the general purpose design.

…we can optimize almost any algorithm or data structure used by the database with knowledge of the exact data distribution. These optimizations can sometimes even change the complexity class of well-known data processing algorithms.

Knowledge of the data distribution comes in the form of a (learned) model. Armed with such a model, the authors argue that we can automatically synthesise index structures, sorting and join algorithms, and even entire query optimisers, leveraging the data distribution patterns for performance gains.

### Overfitting is good

What kind of a model makes sense? A histogram for example is a very simple model, but for the use cases discussed here either too coarse-grained or too big to be useful. At the other end of the spectrum, deep and wide neural nets come with high costs (though these are expected to decrease with advances in hardware). Combine this with the fact that for this use case, ‘overfitting’ is good! We want to capture the precise nuances of our exact data as precisely as possible. (The research program to date is largely focused on analytic workloads, some degree of generalisation is clearly beneficial once we start to consider updates).

As of today, we found that we often need to generate special models to see significant benefits.

As an example, consider the RMI model from ‘The case for learned index structures’ :

1. Fit a simple model (linear regression, simple neural net etc.) over the data
2. Use the prediction of the model to pick another model, an expert, which more accurately models the subset of the data
3. Repeat the process until the leaf model is making a final prediction

RMI is just a starting point. For example, it is possible to make the top model or bottom model more complex, replace parts of the models at a particular level stage with other types of models, use quantization, vary the feature representation, combine models with other data structures, and so on. We therefore believe we will see an explosion of new ideas on how to most efficiently generate models for database components to achieve the right balance between precision, low latency, space, and execution time for a given workload.

### Data access

Last year’s paper on ‘The case for learned index structures’ showed that an RMI-based index can outperform state of the art B-Tree implementations by a factor of two while being orders of magnitude smaller (“note that the updated arXiv version contains new results“). Subsequent work has extended this to data stored on disk, compression inserts, and multi-dimensional data.

For multi-dimensional data, the baseline is an R-Tree (as opposed to a B-Tree). R-Trees map rectangles to a list of index ranges such that the index of every point lying in the rectangle is contained in the union of these ranges. We can replace an R-Tree with a learned model, just as we could the B-Tree. One of the tricks that makes the RMI B-Tree replacement work is that it is sufficient for the model to get us ‘in the right locality’ and then we can do a local search around the prediction to finish the job. For R-Trees, we also need a layout that enables efficient localised search.

While many possible projection strategies exist, we found that successively sorting and partitioning points along a sequence of dimensions into equally-sized cells produces a layout that is efficient to compute, learnable (e.g., in contrast to z-order, which is very hard to learn), and tight (i.e., almost all points in the union of the index ranges satisfy the query).

The authors implemented such a learned index over an in-memory column store with compression, and compared it to a full column scan, a clustered index (sorting by the column providing the best overall performance), and an R-Tree. The benchmarks used 60 million records from the lineitem table of the TPC-H benchmark, with query selectivity of 0.25%.

The learned index beats the next best performing implementation by 34x (note the log scales on the charts) and has only a tiny space overhead compared to the clustered solution.

Further analysis revealed that the learned index beats the clustered index on almost every type of query – the exception is when the clustered dimension in the clustered index is the only dimension in the query.

### Query execution

This is one of my favourite parts of the paper, because it demonstrates how learned models can even help in the humble and age-old case of sorting. The approach to sorting is to use a learned model to put the records into roughly the right order, and then correct the nearly perfected sorted data as a final step. For this an efficient local-sort such as insertion sort can be used, which is very fast with almost-sorted arrays.

The figure below shows results of a learned approach to sorting for increasingly large data sizes of 64-bit doubles randomly sampled from a normal distribution. In the comparison, Timsort is the default sort for Java and Python, std::sort is from the C++ library. The learned variant is 18% faster than the next best (Radix sort in this case) on average.

(This doesn’t include the time taken to learn the model).

Learned models can also be used to improve joins. For example, consider a merge-join with two stored join columns and a model-per-column. We can use the model to skip data that will not join (the authors don’t detail how the equivalent of ‘local patching’ is supposed to work in this scenario, it’s not immediately obvious to me).

The authors also experimented with workload aware schedulers, implementing a reinforcement-learning based scheduling system using a graph neural network:

Our system represents a scheduling algorithm as a neural network that takes as input information about the data (e.g., using a CDF model) and the query workload (e.g, using a model trained on previous executions of queries) to make scheduling decisions.

On a sample of 10 TPC-H queries, the learned scheduler improved average job completion time by 45% over Spark’s default FIFO scheduler.

The strategy that the scheduler learned to get this improvement was to combine completing short jobs quickly with maximising cluster efficiency, learning to run jobs near their parallelism ‘sweet spot.’

### Query optimiser

Traditional query optimizers are extremely hard to build, maintain, and often yield sub-optimal query plans. The brittleness and complexity of the optimizer makes it a good candidate to be learned…

Initial experiments starting with a traditional cost model and refining it over time through learning showed that the model quality can be improved, but that to make big gains would require making significant improvements to cardinality estimation. The research direction now (no reported results as yet) is to explore hybrid-model based approaches to cardinality estimation. These hybrid models combine a learned model of the underlying data patterns and correlations, with exception/outlier listens that capture extreme (and hard to learn) anomalies of the particular instance of the data.

### Other areas

Other suggested areas where learned models may prove beneficial in the future include approximate query processing, predictive modelling, and workloads including inserts and updates.

### The last word

SageDB presents a radical new approach to build database systems, by using ML models combined with program synthesis to generate system components. If successful, we believe this approach will result in a new generation of big data processing tools, which can better take advantage of GPUs and TPUs, provide significant benefits in regard to storage consumption and space, and, in some cases, even change the complexity class of certain data operations.

Serverless computing: one step forward, two steps back Hellerstein et al., CIDR’19

The biennial Conference on Innovative Data Systems Research has come round again. Today’s paper choice is sure to generate some healthy debate, and it’s a good set of questions to spend some time thinking over as we head into 2019: Where do you think serverless is heading? What is it good for today? What’s the end-goal here?

The authors see ‘critical gaps’ in current first-generation serverless offerings from the major cloud vendors (AWS, Azure, GCP). I’m sure some will read the paper as serverless-bashing, but I read into it more of an appeal from the heart to not stop where we are today, but to continue to pursue infrastructure and programming models truly designed for cloud platforms. Platforms that offer ‘unlimited’ data storage, ‘unlimited’ distributed processing power, and the ability to harness these only as needed.

We hope this paper shifts the discussion from ‘What it serverless?’ Or ‘Will serverless win?’ to a rethinking of how we design infrastructure and programming models to spark real innovation in data-rich, cloud-scale systems. We see the future of cloud programming as far, far brighter than the promise of today’s serverless FaaS offerings. Getting to that future requires revisiting the designs and limitations of what is being called ‘serverless computing’ today.

### What is serverless and what is it good for today?

Serverless as a term has been broadly co-opted of late (‘the notion is vague enough to allow optimists to project any number of possible broad interpretations on what it might mean’). In this paper, serverless means FaaS (Functions-as-a-Service), in the spirit of the offerings that the major cloud vendors are currently promoting.

• Developers can register functions to run in the cloud, and declare when those functions should be triggered
• A FaaS infrastructure monitors triggering events, allocates a runtime for the function, executes it, and persists the results
• The platform is not simply elastic, in the sense that humans or scripts can add and remove resources as needed, it is autoscaling: the workload automatically drives the allocation and deallocation of resources
• The user is only billed for the computing resources used during function invocation

Many people have reported great results from adopting serverless. See for example ‘Serverless computing: economic and architectural impact’ that we looked at previously on The Morning Paper. The authors analysed use cases documented by Amazon and broke them down into three major buckets:

• Embarrassingly parallel tasks, often invoked on-demand and intermittently. For example, re-sizing images, performing object recognition, and running integer-programming-based optimisations.
• Orchestration functions, used to coordinate calls to proprietary auto-scaling services, where the back-end services themselves do the real heavy lifting.
• Applications that compose chains of functions – for example workflows connected via data dependencies. These use cases showed high end-to-end latencies though.

### Limitations: putting the ‘less’ in serverless

Moving beyond the use cases outlined above we run into a number of limitations of the current embodiments of serverless:

• Functions have limited lifetimes (15 minutes)
• Functions are critically reliant on network bandwidth to communicate, but current platforms from AWS, Google, and Azure provide limited bandwidth (an order of magnitude lower than a single modern SSD) that is shared between all functions packed on the same VM.
• Since functions are not directly network accessible, they must communicate via an intermediary service – which typically results in writing state out to slow storage and reading it back in again on subsequent calls.
• There is no mechanism to access specialized hardware through functions, yet hardware specialisation will only accelerate in the coming years.

One consequence of these limitations is that FaaS is a data-shipping architecture: we move data to the code, not the other way round.

This is a recurring architectural anti-pattern amongst system designers, which database aficionados seem to need to point out each generation. Memory hierarchy realities— across various storage layers and network delays — make this a bad design decision for reasons of latency, bandwidth, and cost.

A second consequence is that the high cost of function interactions ‘stymies basic distributed computing.’

… with all communication transiting through storage, there is no real way for thousands (much less millions) of cores in the cloud to work together efficiently using current FaaS platforms, other than via largely uncoordinated (embarrassing) parallelism.

(Enlarge)

Three small case studies in the paper illustrate the pain of moving outside of the serverless sweetspot:

• Training machine learning models. A single Lambda function was able to perform 294 iterations before hitting the time limit, so 31 sequential executions were required to complete training. The total cost was $0.29, and the total training latency was 465 minutes. Using an m4.large instance with 8GB RAM instead, the training process takes around 21 minutes and costs$0.04.
• Low-latency prediction serving. The best result using Lambda functions achieved an average latency of 447ms per batch. An alternative using two m5.large instances connected via ZeroMQ showed an average latency of 13ms per batch – 27x faster. For one million messages per second, the Lamdba + SQS solution would costs $1,584 per hour. Using EC2 instances the same throughput costs around$27.84 per hour.
• Trying to implement a simple leader election protocol through functions using a blackboard architecture with DynamoDB, each round of leader election took 16.7 seconds. (And using DynamoDB as a communication mechanism in this way would cost \$450 per hour to coordinate a cluster of 1000 nodes).

These results fall into the “we tried to fix some screws with a hammer and it didn’t work out too well” bucket. But highlighting where current serverless designs don’t work well can help us think about ways to improve them.

Many of the constraints of current FaaS offerings can also be overcome, we believe, maintaining autoscaling while unlocking the performance potential of data and distributed computing.

### Moving forward

A ‘truly programmable environment for the cloud’ should have the following characteristics:

• Fluid code and data placement: logical separation of code and data with the ability of the infrastructure to physically colocate when it makes sense, including function shipping.
• Heterogenous hardware support. “Ideally, application developers could specify applications using high-level DSLs, and the cloud providers would compile those applications to the most cost-effective hardware that meets user specified SLOs
• Long-running, addressable virtual agents: “If the platform pays a cost to create an affinity (e.g. moving data), it should recoup that cost across multiple requests. This motivates the ability for programmers to establish software agents— call them functions, actors, services, etc.— that persist over time in the cloud, with known identities.
• Disorderly programming: “the sequential metaphor of procedural programming will not scale to the cloud. Developers need languages that encourage code that works correctly in small, granular units— of both data and computation— that can be easily moved around across time and space.” For example, Functional Reactive Programming.
• A common intermediate representation (IR), that can serve as a compilation target from many languages (because a wide variety of programming languages and DSLs will be explored). Web assembly anyone?
• An API for specifying service-level objectives. “Achieving predictable SLOs requires a smooth ‘cost surface’ in optimization— non-linearities are the bane of SLOs. This reinforces our discussion above regarding small granules of code and data with fluid placement and disorderly execution.
• A ‘cloud-programming aware’ security model.

Taken together, these challenges seem both interesting and surmountable… We are optimistic that research can open the cloud’s full potential to programmers. Whether we call the new results ‘serverless computing’ or something else, the future is fluid.

I’m sure you’ve noticed the trend for many things in IT to become finer-grained over time (releases, team structure, service sizes, deployment units, …). I really need to write a longer blog post about this, but my thesis is that in many cases the ideal we’re seeking is a continuous one (e.g., in the case of serverless, continuous and seamless autoscaling of cloud resources and the associated billing). The more fine-grained we make things, the better our approximation to that ideal (it’s calculus all over again. We’re held back from making things finer-grained beyond a certain point by the transaction costs (borrowing from lean) associated with each unit. E.g. “we can’t possible release once a day, it takes us a full week to go through the release process!” Once we find a way to lower the transaction costs, we can improve our approximation and become finer-grained. If you haven’t seen it already for example, check out the work that CloudFlare have been doing with workers: V8 isolates (v low transaction costs), a web assembly runtime (the IR called for above), and JavaScript functions. Skate to where the puck is heading!

Unsupervised learning of artistic styles with archetypal style analysis Wynen et al., NeurIPS’18

I’ve always enjoyed following work on artistic style transfer. The visual nature makes it easy to gain an appreciation for what is going on and the results are very impressive. It also something that’s been unfolding within the timespan of The Morning Paper, if we peg the beginning to the work of Gatys et al. in 2015. See for example the posts on ‘Texture Networks’ and ‘Deep photo style transfer.’

Beyond direct style transfer, the objective of the work described in today’s paper choice is to uncover representations of styles (archetypes) themselves. Given a large collection of paintings,…

… Our objective is to automatically discover, summarize, and manipulate artistic styles present in the collection.

This is achieved using an unsupervised learning technique called archetypal analysis. We can recover archetypes from a collection of paintings, and we can also go the other way; taking a painting and decomposing it into a combination of archetypes. And of course if we then manipulate the composition of archetypes, we can manipulate the style of an image.

To visualise what an archetype ‘looks like’ the authors synthesise a texture from an image filled with random noise, using the style representation of the archetype. The following image shows some examples, with the synthesised archetypal textures in the left-most column and the three images next to them on each row showing the individual images that made the strongest contribution to the archetype.

The strongest contributions usually exhibit a common characteristic like stroke style or choice of colors. Smaller contributions are often more difficult to interpret. Figure 2a also highlights correlation between content and style: the archetype on the third row is only composed of portraits.

### From a collection of paintings to archetypes

Given a pre-trained VGG-19 network, a concise representation of the style of an individual painting can be obtained in the following manner:

• Take the feature maps from five layers of the trained VGG-19 (the network is trained as an auto-encoder to encode the salient features of the input image)
• Compute the first and second-order statistics of each feature map. Given a layer $l$ with feature map $\mathbf{F}_l$, $p_l$ channels, and $m_l$ pixel positions we have:

• Normalise the statistics by the number of parameters at each layer (divide the statistics for layer l by $p_{l}(p_{l} + 1)$). This was found empirically to be useful for preventing over-representation of layers with more parameters.
• From a style descriptor by concatenating the normalised statistics from all five layers: $\{\mathbf{\mu}_1, \mathbf{\Sigma}_1, ..., \mathbf{\mu}_l, \mathbf{\Sigma}_l\}$.

After creating style descriptors in this way for all the paintings in a collection, singular value decomposition is applied to reduce the dimensions to 4096 while keeping more than 99% of the variance.

Given the resulting set of vectors, archetypal analysis is then used to learn a set of archetypes $\mathbf{Z}$ such that each sample can be well approximated by a convex combination of archetypes. In addition, each archetype is a combination of samples. This is an optimisation problem that can be given to a dedicated solver.

… we use archetypal analysis on the 4096-dimensional style vectors previously described, and typically learn between k=32 to k=256 archetypes. Each painting’s style can then be represented by a sparse low-dimensional code [combination of styles], and each archetype itself associated to a few input paintings.

The evaluation uses two datasets to extract archetypes: the GanGogh collection of 95,997 artworks from WikiArt, and a collection of 1154 paintings and drawings by Vincent van Gogh. 256 archetypes are extracted for GanGogh, and 32 for Vincent van Gogh.

The following figure shows examples of image styles broken down into their contributing archetypes.

### Style manipulation using archetypes

Given a method of decoding a style vector to an image, we can manipulate style vectors in the archetype plane and then transform the results into images. Given a “content” (input) feature map from an original image, whitening and colouring operations can be used to match the mean and covariance structure of a given style feature map. This technique was first described in ‘Universal style transfer via feature transforms.’

Several methods for identifying a target style are of interest:

• Using a single archetype, to produce an image in the style of that archetype
• Using a combination of archetypes, for archetypal style interpolation
• Adjusting the weighting of archetypes already present in the image (e.g., significantly strengthening one component)

For example, the following figure shows the results of amplifying archetypal styles within a source image. The middle panel on each row is the original image. Moving left, the strongest component is emphasised, and moving right the second strongest component.

Ignoring any archetypal styles already present, the following figure shows the results of free exploration of the archetypal space for a given source image:

### A visual summary

Neural ordinary differential equations Chen et al., NeurIPS’18

‘Neural Ordinary Differential Equations’ won a best paper award at NeurIPS last month. It’s not an easy piece (at least not for me!), but in the spirit of ‘deliberate practice’ that doesn’t mean there isn’t something to be gained from trying to understand as much as possible.

### Neural networks as differential equations

Consider a multi-layered neural network. We have an input layer and an output layer, and inbetween them, some number of hidden layers. As an input feeds forward through the network, it is progressively transformed, one layer at a time, from the input to the ultimate output. Each network layer is a step on that journey. If we take a small number of big steps, we end up with a rough approximation to the true transformation function we’d like to learn. If we take a much larger number of steps (deeper networks), with each step being individually smaller, we have a more accurate approximation to the true function. What happens in the limit as we take an infinite number of infinitely small steps? Calculus!

So one way of thinking about those hidden layers is as steps in Euler’s method for solving differential equations. Consider the following illustration from wikipedia:

We want to recover the blue curve, but all we have is an initial point $A_0$ (think inputs to the network) and a differential equation. From the differential equation, we can calculate the tangent line. If we take a small step along the tangent line, we arrive at $A_1$, which will be close to the desired blue line if the step is small enough. Repeat this process to uncover a polygonal curve $A_{0}A_{1}A_{2}...A_{n}$.

Many neural networks have a composition that looks exactly like the steps of Euler’s method. We start with an initial state $\mathbf{z}_0$, and apply successive transformations over time (layers):

• $\mathbf{z}_1 = \mathbf{z}_0 + f(\mathbf{z}_0, \theta_{0})$
• $\mathbf{z}_2 = \mathbf{z}_1 + f(\mathbf{z}_1, \theta_{1})$
• $\mathbf{z}_3 = \mathbf{z}_2 + f(\mathbf{z}_2, \theta_{2})$
• $\mathbf{z}_{t+1} = \mathbf{z}_t + f(\mathbf{z}_t, \theta_{t})$

In the limit, we parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

$\displaystyle \frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)$

The equivalent of having T layers in the network, is finding the solution to this ODE at time T.

Something really neat happens once we formulate the problem in this way though. Just like we’ve seen a number of papers that express a problem in a form suitable for solving by a SAT solver, and then throw a state of the art SAT-solver at it, we can now use any ODE solver of our choice.

Euler’s method is perhaps the simplest method for solving ODEs. There since been more than 120 years of development of efficient and accurate ODE solvers. Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity.

### How to train a continuous-depth network

We’ve seen how to feed-forward, but how do you efficiently train a network defined as a differential equation? The answer lies in the adjoint method (which dates back to 1962). Think of the adjoint as the instantaneous analog of the chain rule.

This approach computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly with problem size, has low memory cost, and explicitly controls numerical error.

The adjoint captures how the loss function L changes with respect to the hidden state ( $- \partial L / \partial \mathbf{z}(t)$). Starting from the output of the network, we can recompute $\mathbf{z}(t)$ backwards in time together with the adjoint.

A third integral then tells us how the loss changes with the parameters $\theta$ ( $dL/d\theta$).

All three of these integrals can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once.

(Don’t ask me to explain that further!)

### Applied Neural ODEs

#### Residual Networks

Section three tackles the good old MNIST problem, comparing an ODE-net to a ResNet with 6 residual blocks. The ODE-net replaces the residual blocks with an ODE-Solve module.

Concentrating on the 2nd and 4th lines in the table below, ODE-Nets are able to achieve roughly the same performance as a ResNet, but using only about 1/3 of the parameters. Also note that the ODE-Net solution using constant memory, whereas ResNets use memory proportional to the number of layers.

It’s not clear how to define the ‘depth’ of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. The figure below shows the number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model.

In other words, the ODE-net is kind of doing the equivalent of deepening its network over time as it needs to add sophistication.

#### Normalising flows

Normalizing flows allow more complex probability distribution functions (pdfs) to be learned. The same trick of shifting from a discrete set of layers to a continuous transformation works in this situation too. The following figure shows normalising flows vs continuous normalising flows (CNF) when trying to learn a pdf. The CNF is trained for 10,000 iterations and generally achieves lower loss than the NF trained for 500,000 iterations.

#### Time-series

This is the application which most caught my attention.

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-defined latent variables… We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state $\mathbf{z}_{t_0}$, and a global set of latent dynamics shared across all time series.

The model can be trained as a variational autoencoder, and it looks like this:

The evaluation here is based on a dataset of 1000 2-dimensional spirals, each starting at a different point. Half of the spirals are clockwise, and half counter-clockwise. Points are sampled from these trajectories at irregular timestamps. The figure below shows a latent neural ODE is better able to recover the spirals than a traditional RNN:

A PyTorch implementation of ODE solvers is available.

The tradeoffs of large scale learning Bottou & Bousquet, NIPS’07

Welcome to another year of The Morning Paper. As usual we’ll be looking at a broad cross-section of computer science research (I have over 40 conferences/workshops on my list to keep an eye on as a start!). I’ve no idea yet what papers we’ll stumble across, but if previous years are anything to go by I’m sure there’ll be plenty of great material to keep interest levels high.

To start us off, today’s paper choice is “The tradeoffs of large scale learning,” which won the ‘test of time’ award at NeurIPS last month.

this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. [Google AI blog: The NeurIPS 2018 Test of Time Award].

For a given time/computation budget, are we better off performing a computationally cheaper (e.g., approximate) computation over lots of data, or a more accurate computation over less data? In order to get a handle on this question, we need some model to help us think about how good a given solution is.

### Evaluating learning systems

Consider an input domain $\mathcal{X}$ and an output domain $\mathcal{Y}$. We have input-output pairs $(x,y) \in \mathcal{X} \times \mathcal{Y}$, and some probability distribution $P(x,y)$. We want to learn the conditional distribution $P(y|x)$— in other words, to make a prediction of the output, $\hat{y}$, given the input $x$. A loss function $\mathcal{L}(\hat{y},y)$ measures the distance between our prediction and the true value. We will approximate the probability distribution with a function $f$ such that $\hat{y} = f(x)$.

The ‘gold standard’ function $f^{*}$ is the one that minimises the expected risk (loss), $E(f)$ over all output classes. We don’t know the true distribution $P(y|x)$ though. Instead, we are given a sample of n independently drawn training examples. The best we can do is find $f^{n}$, the function that minimises the empirical risk $E_{n}(f)$ across these n data points.

When we build a given learning model, we constrain the set of prediction functions that can be learned to come from some family $\mathcal{F}$. Let $f^{*}_{\mathcal{F}}$ be the best possible function within that family in terms of minimising expected risk.

The optimal function $f^{*}$ may well not belong to the family $\mathcal{F}$. So by constraining learned functions to come from $\mathcal{F}$ we introduce an approximation error $\mathcal{E}_{app}$ that measures the difference in expected risk between $f^{*}_{\mathcal{F}}$ and $f^{*}$.

We also have an estimation error $\mathcal{E}_{est}$ that measures the effect of minimising the empirical risk based on training examples rather than the expected risk across the whole distribution (the expected value of $E(f_n) - E(f^*_{\mathcal{F}})$ ).

The estimation error is determined by the number of training examples and by the capacity of the family of functions. Large families of functions have smaller approximation errors but lead to higher estimation errors.

Now, given that we’re working from training samples, it should be clear that the the empirical risk $E_{n}(f)$ is already an approximation of the expected risk $E(f)$. Finding the $f_{n}$ that minimises the empirical risk is often computationally expensive. Since we’re already approximating, maybe a cheaper to compute approximation to $f_n$ would be good enough? If we allow the minimisation algorithm to return an approximate solution $\tilde{f_n}$ that is within $\rho$ of the true minimum, then we introduce a third error term $\mathcal{E}_{opt}$, measuring the gap between $\tilde{f_n}$ and $f_n$.

So now we have a trade-off involving three variables and two constraints:

The constraints are the maximal number of available training examples and the maximal computation time. The variables are the size of the family of functions $\mathcal{F}$, the optimization accuracy $\rho$, and the number of examples $n$.

Typically the error components change with respect to the variables as shown in the following table:

When we are constrained by the number of available training examples (and not by computation time) we have a small-scale learning problem. When the constraint is the available computation time, we have a large-scale learning problem. Here approximate optimisation can possibly achieve better generalisation because more training examples can be processed during the allowed time.

### Exploring the trade-offs in large-scale learning

Armed with this model of learning systems, we can now explore trade-offs in the case of large-scale learning. I.e., if we are compute bound, should we use a coarser-grained approximation over more data, or vice-versa? Let’s assume that we’re considering a fixed family of functions $\mathcal{F}$ and hence ignore the $\mathcal{E}_{app}$ error component. Thus the paper focuses on the choice of optimisation algorithm.

Four gradient descent algorithms are compared:

• Gradient descent, which has linear convergence and requires $\mathcal{O}(\kappa \log(1/\rho))$ iterations to reach accuracy $\rho$. Each iteration considers all training examples.
• Second order gradient descent (2GD), which requires $\mathcal{O}(\log \log(1/\rho))$ iterations to reach accuracy $\rho$. Each iteration considers all training examples.
• Stochastic gradient descent, in which a random training example is chosen at each iteration, and the parameters are updated on the basis of that example only. It reaches accuracy $\rho$ after less than $v \kappa^{2}/\rho + o(1/\rho)$ iterations on average. Neither the initial value of the parameter vector nor the total number of examples appear in the dominant term of the bound.
• Second order stochastic gradient descent, which doesn’t change the influence of $\rho$ on the convergence rate, but does improve the constants.

The results are summarised in the following table. The third column, time to reach a given accuracy, is simply the cost per iteration (column one) times the number of iterations needed (column two). The fourth column bounds the time needed to reduce the excess error below some chosen threshold.

Setting the fourth column expression to $T_{max}$ and solving for $\epsilon$ yields the best excess error achieved by each algorithm within the limited time $T_{max}$. This provides the asymptotic solution of the Estimation-Optimization tradeoff for large-scale problems satisfying our assumptions…

• SGD and 2SGD results do not depend on the estimation rate exponent, $\alpha$. When the estimate rate is poor, there is less need to optimise accurately and so we can process more examples.
• Second order algorithms bring little asymptotic improvements, but do bring improvements in the constant terms which can matter in practice.
• Stochastic algorithms yield the best generalization performance despite being the worst optimization algorithms.

The perspective proposed by Léon and Olivier in their collaboration 10 years ago provided a significant boost to the development of the algorithm that is nowadays the workhorse of ML systems that benefit our lives daily, and we offer our sincere congratulations to both authors on this well-deserved award. [Google AI blog: The NeurIPS 2018 Test of Time Award].

Towards a theory of software development expertise Baltes et al., ESEC/FSE’18

This is the last paper we’ll be looking at this year, so I’ve chosen something a little more reflective to leave you with (The Morning Paper will return on Monday 7th January, 2019). The question Baltes and Diehl tackle is this: “How do you get better as a software developer?” What does expert performance look like?

We present a first conceptual theory of software development expertise that is grounded in data from a mixed-methods survey with 335 software developers and in literature on expertise and expert performance…. [the theory] describes central properties of software development expertise and important factors influencing its formation.

In essence, ask a bunch of practitioners what they think, use a disciplined coding scheme to interpret the answers (a “grounded theory”), and then layer in what we know about expertise and expert performance in general. The end result is a “conceptual theory” that shows the various contributors to expert performance and the relationships between them. “Software Development” in the current work is synonymous with “programming.”

To make the paper come alive you need to engage with it a little: Does the theory developed by the authors make sense to you? What’s missing? How would you weight the various factors? How could you apply this on a personal level in 2019? How could this be applied in your team or organisation to raise the collective level of expertise next year?

Software developers can use our results to see which properties are distinctive for experts in their field, and which behaviors may lead to becoming a better software developer…. Employers can learn what typical reasons for demotivation among their employees are, and how they can build a work environment supporting the self-improvement of their staff.

### A grounded theory

The first phase involved sending a questionnaire to users active on both GitHub and StackOverflow between Jan 2014 and October 2015. The questionnaire was sent to 1,000 individuals, and received 122 responses.

(Enlarge)

The grounded theory (GT) coding exercise was then used to generate a theory from the qualitative data:

… the process of coding assigns “summative, salient, essence-capturing” words or phrases to portions of the unstructured data. Those codes are iteratively and continuously compared, aggregrated, and structured into higher levels of abstractions, the categories and the concepts. This iterative process is called constant comparison.

(Aside: it strikes me that the body of work on grounded theory development might be very interesting to study from the perspective of domain-driven design and the building of a ubiquitous language.)

After much distillation, the model comes out looking like this:

The grounded theory describes software development expertise as a combination of a certain quantity and quality of knowledge and experience, both general and for a particular language. The work context, behavior, character traits, and skills influence the formation of expertise, which can be observed when experts write well-structured, readable, and maintainable source code.

You’ll know an expert programmer by the quality of the code that they write. Experts have good communication skills, both sharing their own knowledge and soliciting input from others. They are self-aware, understanding the kinds of mistakes they can make, and reflective. They are also fast (but not at the expense of quality).

Experience should be measured not just on its quantity (i.e., number of years in the role), but on its quality. For example, working on a variety of different code bases, shipping significant amounts of code to production, and working on shared code bases. The knowledge of an expert is T-shaped with depth in the programming language and domain at hand, and a broad knowledge of algorithms, data structures, and programming paradigms.

### A preliminary conceptual theory

The next phase was to take the grounded theory and embed it within the existing literature on expertise and expert performance, for which the main resource used was ‘The Cambridge Handbook of Expertise and Expert Performance’.

This handbook is the first, and to the best of our knowledge most comprehensive, book summarizing scientific knowledge on expertise and expert performance.

The result of this process is a preliminary conceptual theory that looks like this:

Acquiring expertise is not exclusively a cognitive matter, personality and motivation influence behaviours that may or may not lead to improvements of expertise. The work context, including team members, managers, and customers, can also influence the behaviour of a developer, and this can also vary according to the type of task being undertaken.

Reaching true experts levels requires deliberate practice combined with monitoring, feedback, and self-reflection.

#### Deliberate practice

Having more experience with a task does not automatically lead to better performance. Research has shown that once an acceptable level of performance has been attained, additional “common” experience has only a negligible effect, in many domains the performance even decreases over time. The length of experience has been found to be only a weak correlate of job performance after the first two years.

Deliberate practice is required to become an expert: prolonged efforts to improve performance while continuously increasing the difficulty and centrality of development tasks.

…studies have shown that deliberate practice is necessary but not sufficient to achieve high levels of expert performance— individual differences also play an important role.

#### Monitoring, feedback, and self-reflection

Deliberate practice requires a way of monitoring performance, which could be e.g. from a teacher, coach, mentor, or peer: “the more channels of accurate and helpful feedback we have access to, the better we are likely to perform.“. Monitoring and self-reflection also influence motivation and consequently behaviour.

### The full conceptual theory

For the third and final phase the authors sampled two additional programmer populations, active Java developers, and very experienced developers, with the goal of further elaborating and refining the categories and relationships in the theory.

The final resulting model looks like this:

(Enlarge)

The most frequently cited tasks that an expert should be good at were designing software architecture, writing source code, and analysing and understanding requirements. Within the software architecture task, understanding modularisation and decomposition were frequently mentioned.

In terms of personality traits, experts should be open minded and curious, be team players, and pay attention to detail. Patience and self-reflection were also cited. In terms of general skills, “problem solving” came top of the list under which analytical thinking, logical thinking, and abstraction/decomposition all feature. Another important skill is being to assess trade-offs.

Mentors should be guiding, patient, and open-minded. Participants were most motivated by mentors that posed challenging tasks.

To facilitate continuous development of their employee’s software development skills, (employees suggested that) employers should:

1. Encourage learning (e.g. training courses, conference attendance, and access to a good analog or digital library)
2. Encourage experimentation (e.g. through side projects and by building a work environment that is open to new ideas and technologies)
3. Improve information exchange between development teams, departments, and even companies. E.g. lunch and learn sessions, rotation between teams, pairing, mentoring, and code reviews.
4. Grant freedom (primarily in the form of less time pressure) to allow developers to invest in learning new technologies or skills.

In contrast, non-challenging or routine tasks result in demotivation. Other causes of performance decline over time are lack of a clear vision or direction, absence of reward for quality work, stress in the work environment, and bad management or team structure.

How will you ensure that in 2019 you grow your expertise, and not simply add another year of (the same or similar) ‘experience’ ?

See you in January! Thanks, Adrian.

Identifying impactful service system problems via log analysis He et al., ESEC/FSE’18

If something is going wrong in your system, chances are you’ve got two main sources to help you detect and resolve the issue: logs and metrics. You’re unlikely to be able to get to the bottom of a problem using metrics alone (though you might well detect one that way), so that leaves logs as the primary diagnosis tool. The online service at Microsoft used as the main case study in the paper produces dozens of Terabytes of logs every day.

Logs play a crucial role in the diagnosis of modern cloud-based online service systems. Clearly, manual problem diagnosis is very time-consuming and error-prone due to the increasing scale and complexity of large-scale systems.

Log3C analyses logs to look for indications of impactful problems, using correlated KPIs as a guide. It finds these needles in the haystack with an average precision of 0.877 and an average recall of 0.883. A distributed version of Log3C has been deployed and used in production at Microsoft for several years, both to support a massive online service (we are not told which one), and integrated into “Product B” where it is used as a log analysis engine to analyse tens of billions of log messages every day.

Log3C greatly reduces engineer’s efforts on manually inspecting the logs and pinpointing root causes of failures. Furthermore, fault patterns are also extracted and maintained for analyzing similar problems in the future.

From the experiences thus gained in log analysis the authors draw the following lessons:

1. Simple anomaly / outlier detection doesn’t cut it. It’s tempting to believe that systems are regular most of the time, and therefore any outliers are problems, but in practice there’s can be a long tail of infrequent (but genuine) user behaviours. “Our experiences with the production system reveal that there are indeed many rare user behaviors, which are not real problems. A lot of effort could be wasted by examining these false positives.
2. It’s not just a single incidence of a problem that’s important, but also the underlying trend in the number of incidences of that problem. Are the number of incidences steadily increasing over a period of time? That’s of interest. Have totally new problems appeared? That could be an indication of a buggy deployment. Do the number of incidences of problem decrease after a fix, but then settle down to a steady but non-zero level? This can indicate an incomplete fix or partial solution.

To work its magic, Log3C combines both metrics and logs. KPIs are used to guide log analysis to focus on periods in time when we have external evidence of a system problem (e.g., the error rate is up). This helps to separate the long tail of genuine user behaviours from true problems. Then Log3C uses a novel clustering algorithm called Cascading Clusters in order to be able to effectively cluster the massive amounts of log data generated by the system. Problems are identified by looking for rare clusters correlated with KPI degradation. Clustering is made more difficult because logs are highly imbalanced – the vast majority of log entries are for regular non-problem behaviours.

### High level approach

Log3C consists of four steps: log parsing, sequence vectorization, cascading clustering, and correlation analysis.

(Enlarge)

…at each time interval, logs are parsed into log events and vectorized into sequence vectors, which are then grouped into multiple clusters through cascading clustering. However, we still cannot extrapolate whether a cluster is an impactful problem, which necessitates the use of KPIs. Consequently, in step four, we correlate clusters and KPIs over different time intervals to find impactful problems.

### From log messages to sequence vectors

The first step is to turn log messages into log events (on the assumption you’re not logging as events already). For each log message such as “HTTP Request URL: http://…&#8221; the goal is to recover the log message template, e.g. “HTTP Request URL: {url}” and turn each template into an event type.

To do this first some common parameter fields are removed using regular expressions (URLs, ip addresses, and so on), and then log messages are clustered into coarse-grained groups based on weighted edit distance. Each cluster is further broken down into finer-grained groups and log events are extracted by looking for the longest common substrings within each group.

We can now turn a sequence of log messages into a sequence of events:

To form a log sequence, log messages that share the same task ID are linked together and parsed into log events. Moreover, we remove the duplicate events in the log sequence.

For ‘task ID’ think e.g. request ID. Duplicate removal eliminates logs from retrying operations and loops, and makes it easier to match sequences that are different occurrences of the same event).

Each log sequence is then turned into a vector. If we have n different event types then then resulting sequence vectors will have n elements, each element representing the ‘weight’ of the corresponding event in the sequence. Event weights are produced as the sum of two terms, with the relative priority determined by a configurable parameter $\alpha$:
$\displaystyle w(e) = \alpha . IDF(e) + (1 - \alpha).IMP(e)$

Where IDF(e) is an Inverse Document Frequency term, which gives frequent events low weights, and rare events high weights. IDF weights are normalised into the range [0, 1] using the sigmoid function.

IMP(e) is a measure of the importance of the event with regards to KPIs. A regression model is built to measure the correlation between log events and KPI values to find the importance weights. Assume that KPIs are computed once for every time interval. Count the number of occurrences of each event type within that interval across all sequences, and then apply the weights from the regression model. (This is visually represented in step 2 of the figure above).

Since the importance weight is directly associated with KPIs and is thereby more effective in problem identification, we value the importance weight more, i.e., $\alpha$ < 0.5. In our experiments we empirically set $\alpha$ to 0.2.

### From sequence vectors to clusters

Once all log sequences have been vectorised, sequence vectors are grouped into clusters separately for each time interval.

However, the conventional clustering methods are incredibly time-consuming when the data size is large…

Hence Loc3C using a novel clustering technique called cascading clustering. It’s an iterative process, with each iteration going through three steps: sampling, clustering, matching.

First we randomly sample a portion of sequence vectors, then we run clustering on the sampled vectors. For each resulting cluster, a pattern is extracted (the mean of all sequence vectors in the cluster). Then take all of the original unsampled sequence vectors and assign them to the closest discovered cluster assuming it is within some distance threshold. At the end of this process we will be left with a set of sequence vectors that were not close enough to a cluster to have been assigned. These remaining vectors then ‘cascade’ into the next iteration as the new input.

By iterating these processes, all sequence vectors can be clustered rapidly and accurately. The reason behind this is that large clusters are separated from the remaining data at the first several iterations.

Neat eh?

### From clusters and KPIs to problem identification

The clusters represent different types of log sequences (system behaviours), but these may not necessarily be problems. So the final step is to correlate clusters with changes to KPIs.

Another regression model captures the correlation between cluster sizes and KPI values over multiple time intervals. “More specifically, we utilize a multivariate linear regression model, which correlates independent variables (cluster sizes) with the dependent variable (KPI).”

Among all independent variables, those that have statistical significance make notable contributions to the dependent variable. Moreover, the corresponding clusters indicate impactful problems, whose occurrences contribute to the change of KPI.

### Log3C in action

Log data is collected from Microsoft’s service ‘X’ and the data from a specific period (we are not told the time or duration of this period) on each of three days is manually labelled. This enables comparison of the effectiveness of Log3C against a ground truth.

Here’s how well Log3C’s automated log analysis does on these datasets:

And compared to a standard Hierarchical Agglomerative Clustering algorithm, Cascading Clustering is much faster:

As stated earlier in this write-up, beyond these results we also have the fact that Log3C is used in production settings at Microsoft to assist in the maintenance of online service systems.

Log3C is available at https://github.com/logpai/Log3C.