
Deep photo style transfer

August 18, 2017

Deep photo style transfer Luan et al., arXiv 2017

Here’s something a little fun for Friday: a collaboration between researchers at Cornell and Adobe, on photographic style transfer. Will we see something like this in a Photoshop of the future?

In 2015 in the Neural Style Transfer paper (‘A neural algorithm of artistic style‘), Gatys et al. showed how deep networks could be used to generate stylised images from a texture example. And last year we looked at Texture networks, which showed how to achieve the same end with marginally lower quality, but much faster (it’s the algorithm behind the Prisma app). For example:

Neural Style Transfer has generated a lot of interest since the original Gatys paper, and while researching this post I found a nice review paper from Jing et al. summarising developments: ‘Neural Style Transfer: A Review.’ Here you can find comparative results from a number of algorithms (see below):


The resulting images look like artwork (paintings), but they don’t look like photographs. Instead of painterly style transfer, Luan et al. give us photographic style transfer.

Photographic style transfer is a long-standing problem that seeks to transfer the style of a reference style photo onto another input picture. For instance, by appropriately choosing the reference style photo, one can make the input picture look like it has been taken under a different illumination, time of day, or weather, or that it has been artistically retouched with a different intent.

The following example illustrates the difference nicely. Here we see an input photo image, a reference style photo image, and the results of combining them with (a) Gatys et al., (b) CNNMRF, which tries to match patches between the input and the style image to minimise the chances of inaccurate transfer, and (c) Deep photo style transfer (this paper):

(Enlarge so that you can explore the differences more closely).

To obtain photorealistic results, Luan et al. had to overcome two key challenges: structure preservation and semantic accuracy.

For structure preservation the challenge is to transfer style but without distorting edges and regular patterns – windows stay aligned on a grid, for example. “Formally, we seek a transformation that can strongly affect image colors while having no geometric effect, i.e., nothing moves or distorts.”

Semantic accuracy and transfer faithfulness refers to mapping appropriate parts of the style and input images:

…the transfer should respect the semantics of the scene. For instance, in a cityscape, the appearance of buildings should be matched to buildings, and sky to sky; it is not acceptable to make the sky look like a building.

How it works (high level overview)

Deep photo style transfer takes as input an ordinary photograph (the input image) and a stylized and possibly retouched reference image (the reference style image). The goal is to transfer the style of the reference to the input while keeping the result photorealistic. The core of the approach follows Gatys et al., but with two additions: a photorealism regularisation term is added to the objective function during optimisation, and the style transfer is guided by a semantic segmentation of the inputs.

How do you find a term to penalise images which are not photorealistic? After all, …

Characterising the space of photorealistic images is an unsolved problem. Our insight is that we do not need to solve it if we exploit the fact that the input is already photorealistic. Our strategy is to ensure that we do not lose this property during the transfer by adding a term that penalizes image distortions.

If the image transformation is locally affine in colour space then the photorealism property should be preserved. Locally affine means that for each output patch there is an affine function that maps the input RGB values onto their output counterparts.

Formally, we build upon the Matting Laplacian of Levin et al., who have shown how to express a grayscale matte as a locally affine combination of the input RGB channels.
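The locally affine property is easy to see with a toy example. Here is a minimal numpy sketch (my own illustration of the constraint, not the authors’ Matting Laplacian regulariser): fit the best least-squares affine map from a patch’s input RGB values to its output RGB values and measure the residual. A pure colour or tone change fits almost perfectly; anything that moves or warps content does not.

```python
import numpy as np

def fit_affine_colour_map(inp, out):
    """Fit the least-squares affine map A from input RGB to output RGB
    for one patch: out ~ [inp, 1] @ A, with A of shape (4, 3)."""
    X = np.hstack([inp, np.ones((inp.shape[0], 1))])  # homogeneous coords
    A, *_ = np.linalg.lstsq(X, out, rcond=None)
    return A

def affine_residual(inp, out):
    """Mean squared deviation of a patch from its best affine colour map.
    Near zero for pure colour/tone shifts; larger when geometry changes."""
    A = fit_affine_colour_map(inp, out)
    X = np.hstack([inp, np.ones((inp.shape[0], 1))])
    return float(np.mean((X @ A - out) ** 2))
```

The paper turns this intuition into a penalty term on the output image, built from Levin et al.’s Matting Laplacian, rather than fitting maps patch-by-patch as above.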

The semantic segmentation approach builds on ‘Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs’ (DilatedNet). Image segmentation masks are generated for the inputs for a set of common labels (sky, buildings, water, and so on), and the masks are added to the input image as additional channels. The style loss term is updated to penalise mappings across different segment types.

To avoid “orphan semantic labels” that are only present in the input image, we constrain the input semantic labels to be chosen among the labels of the reference style image. While this may cause erroneous labels from a semantic standpoint, the selected labels are in general equivalent in our context, e.g. “lake” and “sea”.
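To make the segment-guided style loss concrete, here is a small numpy sketch (an illustration of the idea, not the authors’ implementation): Gram-matrix style statistics are computed separately per segmentation label, so sky statistics are only ever matched against sky statistics, and labels missing from either image are skipped.

```python
import numpy as np

def gram(F):
    """Gram matrix of features F (channels x positions): style statistics."""
    return F @ F.T / F.shape[1]

def segmented_style_loss(F_in, F_style, mask_in, mask_style, labels):
    """Sum of per-label Gram-matrix differences, so style statistics are
    only matched within the same semantic class (sky to sky, and so on)."""
    loss = 0.0
    for l in labels:
        sel_in, sel_style = (mask_in == l), (mask_style == l)
        if not sel_in.any() or not sel_style.any():
            continue  # skip labels missing from either image
        diff = gram(F_in[:, sel_in]) - gram(F_style[:, sel_style])
        loss += float(np.sum(diff ** 2))
    return loss
```

In the paper this per-label loss is computed over the feature maps of several VGG layers; the flattened (channels × positions) arrays here stand in for one such layer.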

Code is available at

Results and comparisons

Here are a series of examples of images generated by Deep Photo Style Transfer, as compared to those produced by the base Neural Style algorithm and CNNMRF on the same input images:


The following example shows a comparison against related work with non-distorting global style transfer methods.


And finally, here’s a comparison against the time-of-day hallucination method of Shih et al. “The two results look drastically different because our algorithm directly reproduces the style of the reference style image, whereas Shih’s is an analogy-based technique that transfers the color change observed in a time-lapse video.” (The Shih approach needs a video as input).


In user studies, 8 out of 10 users said that Deep Photo Style Transfer produced the most faithful style transfer results while retaining photorealism:

On the design of distributed programming models

August 17, 2017

On the design of distributed programming models Meiklejohn, arXiv 2017.

Today’s choice is a lovely thought piece by Christopher Meiklejohn, making the case for distributed programming models. We’ve witnessed a progression in data structures from sequential (non-thread safe) to concurrent, to distributed (think CRDTs). Will the same thing happen with our programming models? And if so, what might they look like?

Distributed programming, while superficially believed to be an extension of concurrent programming, has its own fundamental challenges that must be overcome.

From sequential, to concurrent, to distributed

The core of most programming models in widespread use today is a von Neumann style model where computation revolves around mutable storage locations (variables) and programs progress in a set order with a single thread of execution.

Concurrent programming extends this to accommodate multiple sequential threads of execution. The threads may be executing in parallel on multiple processors, or a single processor may be alternating control.

Concurrent programming is difficult.

A lot of the difficulty comes in reasoning about whether or not the program is correct under all possible schedules…

Given multiple threads of execution, and a nondeterministic scheduler, there exists an exponential number of possible schedules, or executions, the program may take, where programmers desire that each of these executions result in the same value, regardless of a schedule.

The desirable property of confluence holds that the evaluation order of a program does not impact its outcome.

Many modern languages embed concurrency primitives directly into the language itself; it’s not something left solely to the libraries.

Isn’t distributed programming just concurrent programming where some of the threads happen to execute on different machines? Tempting, but no. Distribution is a true extension of concurrency in the same way that concurrency extends sequential.

On the surface, it appears that distribution is just an extension of concurrent programming: we have taken applications that relied on multiple threads of execution to work in concert to achieve a goal and only relocated the thread of execution to gain more performance and more computational resources. However, this point of view is fatally flawed.

Whereas the challenges of concurrent programming revolve around non-determinism, distributed brings with it a fundamentally different set of challenges (which brings to mind Peter Deutsch’s fallacies of distributed computing).

Distribution is fundamentally different from concurrent programming: machines that communicate on a network may be, at times, unreachable, completely failed, unable to answer a request, or, in the worst case, permanently destroyed. Therefore, it should follow that our existing tools are insufficient to solve the problems of distributed programming. We refer to these classes of failures in distributed programming as “partial failure”…

Distribution challenges

The core problem in distributed systems is the problem of agreement (aka consensus). It shows up in two forms:

  • Leader election: the process of selecting an active leader amongst a group of nodes, and
  • Failure detection: the process of detecting that a node has failed and can no longer act as leader.

Two impossibility results demonstrate why these problems are hard and set the boundaries for what is possible: the CAP theorem and the FLP result.

FLP tells us that in a truly asynchronous system, agreement is impossible in the presence of even a single failed process. We can’t distinguish between a very delayed response and a failure.

FLP is solved in practice via randomized timeouts that introduce non-determinism into the leader election process to prevent infinite elections.
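A toy simulation shows how randomized timeouts break the symmetry (this is in the spirit of Raft’s randomized election timeouts; the structure and names here are my own, not from the paper): each node draws a random timeout, the first timer to fire stands for election, and an exact tie simply triggers a retry with fresh timeouts.

```python
import random

def elect(node_ids, rng, lo=150, hi=300):
    """One election round: every node draws a random timeout (in ms) and
    the first timer to fire stands for election. Returns the winner, or
    None in the (vanishingly rare) case of an exact tie."""
    timeouts = {n: rng.uniform(lo, hi) for n in node_ids}
    winner = min(timeouts, key=timeouts.get)
    if sum(1 for t in timeouts.values() if t == timeouts[winner]) > 1:
        return None  # tied timers: nobody wins this round
    return winner

def elect_with_retries(node_ids, rng):
    """Retry with fresh random timeouts until some round has a winner."""
    while True:
        winner = elect(node_ids, rng)
        if winner is not None:
            return winner
```

Because the timeouts are drawn from a continuous range, any single round almost surely has a unique winner, so elections terminate quickly in expectation even though no deterministic bound exists.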

The CAP theorem says that when some of the processes in a system cannot communicate with each other (are partitioned), we have to sacrifice availability if we want to maintain linearizability (consistency).

Both CAP and FLP incentivize developers to avoid using replicated, shared state, if that state needs to be synchronized to ensure consistent access to it. Applications that rely on shared state are bound to have reduced availability, because they either need to wait out timeouts related to failure detection or pay the cost of coordinating changes across multiple replicas of shared state.

What might distributed programming models look like?

With distributed programming becoming the new normal, “it is paramount that we have tools for building correct distributed systems.” Just as concurrency primitives made their way into the programming models themselves, will distribution primitives do the same?

… if an application must operate offline, a language should have primitives for managing consistency and conflict resolution; similarly, if a language is to adhere to a strict latency bound for each distributed operation, the language should have primitives for expressing these bounds.

Any such distributed programming model must sit within the bounds specified by the CAP theorem. So at one end of the spectrum we have AP models, and at the other end we have CP models.

Lasp is an example of an AP model that sacrifices consistency for availability. In Lasp, all data structures are CRDTs, and therefore all operations must commute and all data structures must be expressible as bounded join-semilattices. This places restrictions on the kinds of data structures Lasp can support.
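A grow-only counter is perhaps the simplest CRDT and shows what Lasp’s restrictions buy (a minimal sketch of the classic G-Counter, illustrative rather than Lasp code): state is a vector with one slot per replica, and merge is the element-wise max, i.e. the join of the lattice, so merges commute, associate, and are idempotent, and replicas converge regardless of delivery order.

```python
class GCounter:
    """Grow-only counter CRDT. State is one slot per replica; merge takes
    the element-wise max, the join of a bounded join-semilattice, so merges
    are commutative, associative, and idempotent."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1  # a replica only ever grows its own slot

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]
```

The price is exactly the restriction noted above: decrement has no place in this lattice (it would not be monotone), which is why Lasp-style models can only support data structures expressible this way.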

Austere (I couldn’t find a reference, sorry) is an example of a CP model that sacrifices availability for consistency. It uses two-phase locking before every operation, and two-phase commit to commit all changes. If a replica can’t be contacted, the system will fail to make progress.

Despite these differences, both of these models involve a datastore that tracks local replicas of shared state, and both are small extensions to the λ-calculus providing named registers pointing to locations in the datastore.

The “next 700” distributed programming models will sit between the bounds presented by Austere and Lasp: extreme availability where consistency is sacrificed vs. extreme consistency where availability is sacrificed.

(“next 700” is a lovely nod to Peter Landin’s classic paper ‘The next 700 programming languages‘)

What does it mean in practice to sit between these two extremes?

More practically, we believe that the most useful languages in this space will allow the developer to specifically trade-off between availability and consistency at the application level. This is because the unreliability of the network, dynamicity of real networks, and production workloads require applications to remain flexible.

Another alternative would be multiple languages, each making a different trade-off. This is closer to what happened with NoSQL datastores, but the industry has much less appetite for large numbers of programming languages than it does for large numbers of datastores. The main argument I see in favour of languages that choose a fixed point in the design space is that it should be easier to reason about the semantics of the resulting programs. Evidence shows that programmers find it very hard to reason correctly about availability and consistency under a single model, let alone when the system can be using multiple models at the same time. That said, on balance I agree with Meiklejohn’s conclusion that being able to make trade-offs within a single language is probably the best option. The convincing argument for me is that different parts of the same system clearly do have differing consistency requirements.

These trade-offs should be specified declaratively, close to the application code. Application developers should not have to reason about transaction isolation levels or consistency models when writing application code.

I’m not convinced developers can get away from reasoning about isolation and consistency even when those concerns are expressed declaratively (see e.g. ACIDRain for a recent counterpoint). But perhaps the point here is that developers shouldn’t have to write their own custom logic to do so.


Spry is an example of a design point in the space between CP and AP. (Though there doesn’t seem to be a whole lot in that GitHub repo…).

Spry is a programming model for building applications that want to tradeoff availability and consistency at varying points in application code to support application requirements.

The example given is of a CDN system where CA tradeoffs are made in both directions:

  • content older than a given threshold will never be returned (consistency over availability)
  • cached content will be returned if we can’t get the latest within a certain period of time (availability – and latency – over consistency)

Which puts me in mind of the PACELC formulation (in the case of a Partition, we can trade between Availability and Consistency, Else we can trade between Latency and Consistency).

In Spry, developers can add bounded staleness and freshness requirements via annotations. (We don’t get any source examples sadly, and the repo content is a little thin on the ground).
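Since we don’t get Spry source examples, here is a hypothetical sketch of what the CDN policy above might look like if written out directly (everything here, including the `fetch(key, timeout=...)` signature, is my invention, not Spry’s API): never serve content older than a staleness bound, but fall back to bounded-stale cached content when a fresh fetch blows the latency budget.

```python
import time

def fetch_with_bounds(cache, key, fetch, max_staleness, latency_budget,
                      now=time.monotonic):
    """Serve `key` under both a freshness bound and a latency bound:
    - never return content older than `max_staleness` seconds
      (consistency over availability);
    - if a fresh fetch exceeds `latency_budget`, fall back to a cached
      copy still within bounds (availability and latency over consistency)."""
    entry = cache.get(key)  # (value, fetched_at) or None
    try:
        value = fetch(key, timeout=latency_budget)
        cache[key] = (value, now())
        return value
    except TimeoutError:
        if entry is not None and now() - entry[1] <= max_staleness:
            return entry[0]  # stale but within the declared bound
        raise  # too stale to serve: fail rather than violate the bound
```

The point of a Spry-like model is that the two bounds would be declared as annotations rather than threaded through application code by hand as they are here.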

Interactions of individual and pair programmers with an intelligent tutoring system for computer science

August 16, 2017

Interactions of individual and pair programmers with an intelligent tutoring system for computer science Harsley et al., SIGCSE’17

A short and easy paper for today, which examines the difference between pair programming and working individually in an educational context. The study involves 116 students participating in a series of seven linked-list programming tasks designed to teach the fundamentals of linked-lists. Students use the Collab ChiQat-Tutor interface which combines tutor feedback, a coding interface, a graphical problem statement, and problem description.

The whole environment is heavily instrumented so that the researchers can gather data on problem solving time, reliance on system examples, coding efficiency (error rate, redo/undo behaviour, copy-and-pastes,…), signs of engagement, and so on. Students also wear headsets and their spoken dialogue is recorded as well.

The following table summarises the key data points collected:

To the best knowledge of the authors, this is the first study of its kind to combine such a relatively large scale and such a fine level of detail.

75 students completed the tasks individually, and 41 (Shouldn’t this be an even number!?) completed the tasks in pairs. Every student, whether working individually or in a pair, completed an individual short test both before and after completing the learning tasks. This allows determination of how effectively students learned from the exercise.

The key findings?

We discovered that while both groups exhibited significant learning gains, pair programmers completed problems more quickly, relied less heavily on system-provided examples, coded more efficiently, and showed higher signs of engagement. Moreover, students in both conditions found the system equally helpful.

Let’s take a closer look to see what lies behind the above statement.

Does pairing improve learning?

In this instance, no. The improvement between pre- and post- test scores was the same for those who worked individually and for those who worked in pairs. That’s somewhat surprising to me – I would have expected engaging with the material in an additional way (through spoken dialogue between pair members) to aid understanding and retention. Perhaps the learning environment is sufficiently good that there’s no margin left for a difference to show? In pairing on tasks in general where the pair have differing skill sets, transference of knowledge is certainly one of the benefits I would expect to see.

Do pairs complete tasks quicker?

The headline answer was ‘yes’, but when we look more closely it’s not quite so clear cut.

Pair programmers successfully complete the task in less time than individuals in five out of the seven tasks, but the difference is only statistically significant for two of them. By far the largest difference is seen on the first task. Rather than this being something about the task itself, the authors speculate that this is due to pairs being able to ramp-up faster (i.e., they’d be quicker on the first task, regardless of what the first task was):

… the pair is able to drastically decrease ramp up time. They accomplish both system acclimation and problem solving. Yet, if this difference solely alludes to the pairs’ ability to problem-solve through an unknown user interface more quickly, the result is nonetheless important.

Somewhat contradicting this viewpoint is the analysis of tasks 5 and 7, where individuals were quicker than pairs. “Both of these problems offer a shift in the problem type and skills required to solve.” That is, they seem to involve some kind of ramp-up to the new situation. The differences here are not statistically significant though (still, the authors can’t resist providing an explanation for them!).

It is also only on the first task that pairs are more than twice as fast as individuals. So if we were interested in tasks completed per person-hour or some such metric (not an objective for this study) then it is only here that pairs win.

Do pairs code more efficiently?

On the surface this seems like an odd question. Surely efficiency is best determined by how long it takes to complete the task? When we dig a little deeper though, the researchers find something that to me hints at the potential quality differences between the work done by pairs and individuals. Pairs require on average 13 operations per solution, whereas individuals need 23. Individuals also use four times the number of undo and redo operations than pairs. Individuals also had more compilation errors, and submitted more incorrect solutions.

This suggests that individuals were more uncertain of their solution than pairs. In contrast, pairs had the ability of the navigator to plan and strategize over the direction of the solutions, causing less undos, redos, and overall operations as a result.

Individuals also submitted more problem restarts than pairs, “signifying the lack of certainty in direction and solution strategy for individuals.”

Are pairs more engaged?

Pairs spent longer in the tutorial and switched problems less frequently than individuals. The authors interpret this as a sign of greater engagement.

Once faced with difficulty in solving a problem, pairs continued to work through the problem while individuals switched problems.

Do pairs and individuals use resources differently?

Within the learning environment, students can request system-provided examples. Pairs requested such examples significantly less often than individuals.

As students in both conditions solve the same number of problems, the lack of example usage among pairs shows that they are able to reason about the problem on their own.

How does pairing impact learner enjoyment / satisfaction?

There was no difference in self-reported satisfaction with using the system between those who worked in pairs and those who worked individually.

Related studies of pair programming

Within industry, an earlier study of pair programming paired programmers randomly and measured programmer productivity in terms of code produced over time and code error rate. A development team consisting of 10 individuals (five pairs) experienced a 127 percent gain in code production per month and a decline of three orders of magnitude in coding errors.

Another study in an educational context found that pair programmers had higher program quality, course enjoyment levels, and significantly higher confidence in their programming solutions.

Other reports examining the benefits and drawbacks of pair programming cited by the authors include:

A quick Google Scholar search turns up plenty of other references if you’re interested in exploring further…

Writing parsers like it is 2017

August 15, 2017

Writing parsers like it is 2017 Chifflier & Couprie, SPW’17

With thanks to Glyn Normington for pointing this paper out to me.

Earlier this year we looked at ‘System programming in Rust: beyond safety‘ which made the case for switching from C to Rust as the default language of choice for system-level programming. Today’s paper is in the same vein, but focuses on pragmatic incremental updates to existing C-based systems by looking at their use of parsing code. Why parsers? Because they are often manipulating untrusted data that can be controlled by a hacker…

Unfortunately, data parsing is often done in a very unsafe way, especially for network protocols and file formats… Today, many applications embed a lot of parsers that could be targeted: web browsers like Firefox or Chrome, media players (VLC), document viewers, etc.

There are many examples of parser related bugs and vulnerabilities (see e.g. Cloudbleed for a recent one), and they keep on coming.

After years of code auditing, publishing and teaching good practice and still finding the same problems, we must admit that current methods do not work.

If current methods aren’t working, what are we to do? Chifflier & Couprie have a two part solution for us:

  1. Change the programming language
  2. Use parser combinators as the sweet spot between hand-rolled parsers and the rigid output of parser generators

The goal is to take an existing C (or C++) program and replace its existing parser(s) with one(s) offering better security, without disturbing the rest of the application. And if you haven’t guessed yet, that better parser implementation is going to be written in Rust.

Finding a better programming language

The programming language you choose to use has a big impact on the security of the resulting software. “Mind your language(s): A discussion about languages and security” makes a strong case for this. That paper also opens with a quote that I can’t resist repeating here:

An unreliable programming language generating unreliable programs constitutes a far greater risk to our environment and to our society than unsafe cars, toxic pesticides, or accidents at nuclear power stations. Be vigilant to reduce that risk, not to increase it. – C.A.R. Hoare

By using a safe programming language, responsibility for whole classes of security related issues moves from the developer to the compiler. In their search for a better language, the authors looked for:

  • Strong typing, “because type safety is an essential quality for security.”
  • The absence of garbage collection (due to concerns over performance and multi-threading)
  • A memory model close to the low-level hardware, and ideally supporting zero-copy compatibility with C.

In the end, they settled on Rust for the following reasons:

  • A managed memory model that prevents use-after-free and double-free flaws
  • No garbage collection
  • Thread-safety guaranteed by the compiler
  • Strong typing
  • Runtime efficiency
  • Zero-copy data buffers
  • Easy integration with C with minimal data conversions
  • Clear marking of safe/unsafe parts of the code
  • A good/large community: “it’s important to use a good language, but it is better to have active and helpful users.”

Embracing parser combinators

Changing the language is important, but it is not enough. Another component is required to help fix logical bugs, and make sure parsers are both easier to write and do not contain errors.

Many low-level parsers ‘in the wild’ are written by hand, often because they need to deal with binary formats, or in a quest for better performance. Often these end up with hard to maintain code. At the other end of the spectrum there are parser generators, which tend to produce parsers of good quality but still present challenges in writing the grammar and integrating context specific libraries.

In between these two choices lie parser combinators.

Parser combinators propose a middle ground between hand-written parsers and generated parsers. They are made of small functions that handle basic parts of a format, like recognizing a word in ASCII characters or a null byte. Those functions are then composed in more useful building blocks like the pair combinator that applies two parsers in sequence, or the choice combinator that tries different parsers until one of them succeeds.

The functions are completely deterministic and hold no mutable state. The deterministic behaviour of parser combinators simplifies writing the state machine that manages data accumulation.
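The building blocks described in the quote above are easy to sketch. Here is a toy version in Python (illustrative only; not the API of any particular library): a `tag` parser for literals, plus `pair` (sequence) and `alt` (choice) combinators that compose it.

```python
def tag(literal):
    """Parser matching a literal prefix; returns (value, rest) or None."""
    def parse(inp):
        if inp.startswith(literal):
            return literal, inp[len(literal):]
        return None
    return parse

def pair(p1, p2):
    """Sequence combinator: apply p1, then p2 on the remaining input."""
    def parse(inp):
        r1 = p1(inp)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        v2, rest = r2
        return (v1, v2), rest
    return parse

def alt(*parsers):
    """Choice combinator: try each parser in order until one succeeds."""
    def parse(inp):
        for p in parsers:
            r = p(inp)
            if r is not None:
                return r
        return None
    return parse
```

Note that each parser is a pure function over its input: it mutates nothing and only returns a value plus the unconsumed remainder, which is what makes composition and streaming state machines straightforward.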

The authors’ parser combinator library of choice for Rust is nom, written by the second author (see ‘Nom, a byte oriented, streaming, zero copy, parser combinators library in Rust’).

The underlying idea of nom stems from Rust’s memory handling. Since the compiler always knows which part of the code owns which part of the memory, and Rust has a slice type (a wrapper over a pointer and a size) that represents data borrowed from somewhere else, it is possible to write a parser that will never copy input data. It will instead only return generated values or slices of its input data. This approach has the potential to make efficient parsers.

With stateless parsers that don’t need ownership of the input data, it becomes relatively easy to write the parser in nom, add unit tests in Rust, and then call into Rust from C.

Incremental adoption in existing systems

… it seems more reasonable and pragmatic to try to change only a small part of the initial software, without affecting the rest. For example, replacing a parser (ideally, only one function in the initial code) by its safe version, recoded in the new language.

For incremental adoption, integration into the host system needs to be as painless as possible. Fortunately, parser modules offer three benefits here: a small interface to the rest of the system, little or no direct modification of the application’s state, and high value as targets for security upgrades. The recommended approach is as follows:

  • Create and test the parser in a separate Rust project (so that it can be reused across multiple applications)
  • In the Rust interface code, transform unsafe C types (e.g., pointers to void) into richer Rust types, and the reverse when returning values to C.
  • Ensure the Rust interface appears as a set of deterministic reentrant functions from the C side. If it is necessary to keep state around, it can be wrapped in a Box passed to the C code as an opaque pointer, with accessor functions provided.
  • The simplest option is to make sure that the Rust part never owns the data to parse, by only working on and returning slices. In this way, the only important values moving across the interface are pointers and lengths.

That just leaves the small matter of the build system:

Integrating the Rust code in the build system is a large part of the work needed. While it is not especially hard, it requires fixing a lot of details, like passing the platform triplets between compilers, setting up the right paths for intermediary or final objects, ordering the build of components and linking them.

Case studies

The paper contains two case studies: replacing the FLV parser in the VLC media player, and writing a safe TLS parser for the Suricata IDS. This led to the Rusticata project, containing implementations of fast and secure parsers in Rust, together with a simple abstraction layer to use them in Suricata. Rusticata is compiled as a shared library exposing C functions.

To avoid duplicating support functions such as logging, alerting, and so on, pointers to the requisite C functions are wrapped and passed to the Rust code.

For an interesting comparison to the Rusticata TLS parser effort, see ‘Not quite so broken TLS’, which we looked at on The Morning Paper last year.

The appendix contains an interesting analysis of the code generated by the Rust compiler. Most things look good, but of note is the fact that the Rust code has no stack protection enabled:

Admittedly, the generated code is supposed not to contain any overflow thanks to the compiler checks, but it is a bit sad that a well-established function integrated to LLVM is not used, even if the integration causes some difficulties (tracked in Rust issue #15179). That also means that a Rust library loaded from a C executable will provide ROP gadgets. Hopefully, the library supports ASLR, but this is not satisfactory, and could be improved in the future.

Cardiologist-level arrhythmia detection with convolutional neural networks

August 14, 2017

Cardiologist-level arrhythmia detection with convolutional neural networks Rajpurkar, Hannun, et al., arXiv 2017

See also

This is a story very much of our times: development and deployment of better devices/sensors (in this case an iRhythm Zio) leads to collection of much larger data sets than have been available previously. Apply state of the art deep learning techniques trained on those data sets, and you get a system that outperforms human experts.

We develop a model which can diagnose irregular heart rhythms, also known as arrhythmias, from single-lead ECG signals better than a cardiologist.

Diagnosing arrhythmias

Detecting arrhythmias from ECG records has traditionally been challenging for computer systems, with accuracy rates ranging from around 50% down to just 1 in 7 correct diagnoses. Thus arrhythmia detection is usually performed by expert technicians and cardiologists.

To automatically detect heart arrhythmias in an ECG, an algorithm must discern the complex relationships between them over time. This is difficult due to the variability in wave morphology between patients as well as the presence of noise.

The model developed by the authors can identify 12 different heart arrhythmias, plus sinus rhythm and noise, for a total of 14 output classes. The morphology of the ECG during a single heart-beat, as well as the pattern of the activity of the heart over time, determine the underlying rhythm. There are subtle differences in rhythms that can be hard to detect yet are critical for treatment.

Collecting the data

Machine learning models based on deep neural networks have consistently been able to approach and often exceed human agreement rates when large annotated datasets are available. These approaches have also proven to be effective in healthcare applications…

The iRhythm Zio device is much smaller and simpler to wear than previous general Holter monitors. It continuously captures beat-to-beat heart rhythms for up to 14 days (traditional monitoring systems are usually only used for one or two days). The data is kept on-device and downloaded at the end of the assessment period, yielding up to 57% more data than traditional monitoring (source: iRhythm ZioXT web page). Off the back of the Zio device, iRhythm Technologies have built a company valued just shy of $1B (IRTC).

We construct a dataset 500 times larger than other datasets of its kind. One of the most popular previous datasets, the MIT-BIH corpus contains ECG recordings from 47 unique patients. In contrast, we collect and annotate a dataset of about 30,000 unique patients from a pool of nearly 300,000 patients who have used the Zio patch monitor.

Every record in the training set is 30 seconds long, and can contain more than one rhythm type. Each record is annotated by a clinical ECG expert, who marks segments of the signal as corresponding to one of 14 rhythm classes.

Label annotations were done by a group of Certified Cardiographic Technicians who have completed extensive training in arrhythmia detection and a cardiographic certification examination by Cardiovascular Credentialing International.

For testing, 336 records from 328 unique patients are used (the test and validation sets are separate from the training set). For these records, ground truth annotations were obtained by a committee of three board-certified cardiologists. For each record in the group, six individual cardiologists also provided annotations (without any collaboration) – this data is used to assess performance of the trained model as compared to an individual cardiologist.

Building and training a model

We draw on work in automatic speech recognition for processing time-series with deep convolutional neural networks and recurrent neural networks, and techniques in deep learning to make the optimization of these models tractable.

ECG arrhythmia detection is a sequence-to-sequence task. The input is an ECG signal, and the output is a sequence of rhythm class labels. Each label corresponds to one segment of the input signal.

The model has 33 layers of convolution followed by a fully-connected layer and softmax, all trained using a log-likelihood objective function. To make optimisation tractable the network also contains shortcut connections and uses batch normalisation. Dropout is also used, and Adam as the optimiser.

The network consists of 16 residual blocks with 2 convolutional layers per block. The convolutional layers all have a filter length of 16 and have 64k filters, where k starts out as 1 and is incremented every 4th residual block. Every alternate residual block subsamples its inputs by a factor of 2, thus the original input is ultimately subsampled by a factor of 2^8.
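The block arithmetic above can be checked with a short sketch (pure Python; the block layout and the choice of which alternate blocks subsample are my own illustration, not the authors' code):

```python
def zio_net_plan(num_blocks=16, base_filters=64):
    """Sketch of the convolutional stack: 16 residual blocks, 64*k filters
    with k incremented every 4th block, and every alternate block
    subsampling its input by 2. Illustrative only; not the paper's code."""
    plan, total_subsample = [], 1
    for b in range(num_blocks):
        k = 1 + b // 4                    # k = 1,1,1,1,2,2,2,2,3,...
        stride = 2 if b % 2 == 1 else 1   # every alternate block subsamples
        total_subsample *= stride
        plan.append((b, base_filters * k, stride))
    return plan, total_subsample
```

Eight of the sixteen blocks subsample, giving the 2^8 = 256x overall reduction, and the final blocks use 64 × 4 = 256 filters.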

Beating the experts

The model is evaluated using two metrics: sequence level accuracy and set level accuracy. Sequence level accuracy compares the predicted label for each segment against the corresponding ground truth annotation. Set level accuracy just takes the set of arrhythmias present in each 30-second record as the ground truth (i.e., it does not penalise for time misalignment).
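As a rough sketch (my own formulation, not the paper's exact scoring code), the two metrics differ only in whether time alignment matters:

```python
def sequence_level_accuracy(pred, truth):
    """Fraction of segments whose predicted label matches the annotation."""
    assert len(pred) == len(truth)
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def set_level_accuracy(pred, truth):
    """Ignores time alignment: only the set of rhythms in the record counts."""
    return float(set(pred) == set(truth))
```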

Here we can see how well the model performs on each class, compared to an expert cardiologist:

Note that when the scores are taken in aggregate, the model outperforms the cardiologists both in precision and recall.

The model outperforms the average cardiologist performance on most rhythms, noticeably outperforming the cardiologists in the AV Block set of arrhythmias… This is especially useful given the severity of Mobitz II and complete heart block and the importance of distinguishing these two from Wenckebach which is usually considered benign.

Where the model does make mistakes, these ‘make sense’ given that the confused classes often have very similar ECG morphologies.

Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.

Automatic database management system tuning through large-scale machine learning

August 11, 2017

Automatic database management system tuning through large-scale machine learning Aken et al., SIGMOD '17

Achieving good performance in DBMSs is non-trivial as they are complex systems with many tunable options that control nearly all aspects of their runtime operation.

OtterTune uses machine learning informed by data gathered from previous tuning sessions to tune new DBMS deployments. In experiments with OLTP workloads (on MySQL and Postgres) and OLAP workloads (on Vector), OtterTune produces a DBMS configuration that achieves 58-94% lower latency compared to the default settings or configurations generated by other tuning advisors.

We also show that OtterTune generates configurations in under 60 minutes that are comparable to ones created by expert DBAs.

Tuning challenges

The optimal configuration is different for every application / workload. (If this wasn’t true, presumably by now the default settings would be close to the optimal, which they aren’t!). To demonstrate this the authors take three different workloads and find an optimal configuration for each of them. Configuration 1 is the best for workload 1, and so on. Now see what happens when you run all three workloads under each configuration:

What’s best for one workload is very clearly not the best for the others!

Furthermore, for any given workload, there are a lot of configuration parameters to play with. The following chart shows the growth in the number of configuration parameters (knobs) in MySQL and Postgres over time:

Each individual configuration parameter may have many possible settings, and the differences in performance from one setting to the next may not be regular. Here’s an illustration showing the relationship between MySQL’s buffer pool size and latency for one workload:

And if all that wasn’t complex enough, many of the parameters are inter-dependent such that changing one affects the others.

The different combinations of knob settings mean that finding the optimal configuration is NP-hard.

Here’s an example showing the inter-dependencies between log file size and buffer pool size in MySQL for a workload:

How OtterTune works

OtterTune is a tuning service that works with any DBMS. It maintains a repository of data collected from previous tuning sessions, and uses this data to build models of how the DBMS responds to different knob configurations. For a new application, it uses these models to guide experimentation and recommend optimal settings. Each recommendation provides OtterTune with more information in a feedback loop that allows it to refine its models and improve their accuracy.

At the start of a tuning session the DBA specifies the metric to optimise (latency, throughput, …). OtterTune then connects to the DBMS and determines the hardware profile (just an identifier from a pre-defined list at the moment, e.g., an EC2 instance type) and current configuration. Then the target workload is run (either to completion, or for a fixed duration) while OtterTune observes the chosen optimisation metric. At the end of the run, OtterTune grabs all of the DBMS internal metrics (statistics and counters etc.). These are stored in the repository and OtterTune selects the next configuration to test. This happens in two phases:

  1. OtterTune tries to map the run to one it has seen (and tuned) in the past (workload characterization), giving a strong indication of where to look in the search space.
  2. Using this information OtterTune identifies and ranks the most important configuration parameters and settings to try.

This process of observing a run under a given configuration and selecting the next configuration to try based on the results continues until the DBA is satisfied with the improvement over the initial configuration. To help the DBA determine when to stop, OtterTune provides an estimate of how much better the recommended configuration is than the best seen so far.

OtterTune assumes that the physical design of the database (indices, materialized views etc.) is reasonable. Future work may investigate how to apply similar techniques to optimise the database’s physical design. (See also Peloton, which we looked at earlier this year).

OtterTunes’s machine learning pipeline

Let’s take a deeper look at OtterTune’s machine learning pipeline, which is where all the smarts take place.

Before OtterTune can characterise a workload, it first tries to discover a model that best represents its distinguishing aspects. This process uncovers the most important DBMS metrics and the most important configuration parameters for the workload. All of the internal metrics are collected, and then a combination of factor analysis and k-means clustering is used to find the most important ones. Factor analysis aims to find a smaller set of latent factors that explain (underlie) the observed variables (it’s related to PCA, but not the same). Each factor is a linear combination of the original variables, and factors can be ordered by the amount of variability in the original data they explain.

We found that only the initial factors are significant for our DBMS metric data, which means that most of the variability is captured by the first few factors.

The results of the factor analysis yield coefficients for metrics in each of the selected factors. From these coefficients closely correlated metrics can be identified and pruned. The remaining metrics are then clustered using the factor coefficients as coordinates. This results in clusters of metrics as shown in the following figure:

Inspecting the clusters shows that each cluster corresponds to a distinct aspect of a DBMS’s performance.

From each cluster, OtterTune keeps a single metric, the one closest to the cluster center.

From the original set of 131 metrics for MySQL and 57 for Postgres, we are able to reduce the number of metrics by 93% and 82%, respectively.
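This metric-pruning step can be sketched with scikit-learn's FactorAnalysis and KMeans (the shapes and parameter choices here are illustrative, not OtterTune's actual implementation):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

def prune_metrics(X, n_factors=3, n_clusters=5, seed=0):
    """X is a (runs x metrics) matrix of observed DBMS metrics.
    Returns the index of one representative metric per cluster."""
    fa = FactorAnalysis(n_components=n_factors, random_state=seed).fit(X)
    coords = fa.components_.T            # each metric's factor coefficients
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(coords)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(coords[members] - km.cluster_centers_[c], axis=1)
        keep.append(int(members[np.argmin(dists)]))  # metric nearest the centre
    return sorted(keep)
```

Using the factor coefficients (rather than the raw metric values) as clustering coordinates is what lets redundant, closely correlated metrics land in the same cluster.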

Having reduced the metric space, OtterTune next seeks to find the subset of configuration parameters which have the most impact on the target objective function (optimisation goal).

OtterTune uses a popular feature selection technique for linear regression, called Lasso, to expose the knobs that have the strongest correlation to the system’s overall performance. In order to detect nonlinear correlations and dependencies between knobs, we also include polynomial features in our regression.

The number of features that Lasso keeps depends on the strength of a penalty parameter that penalises models with large weights. By starting with a large penalty and gradually reducing it, it is possible to observe the order in which configuration parameters are added back into the regression, and this ordering is used to determine how much of an impact they have on the target metric.

The Lasso computations are performed in the background as new data arrives from different tuning sessions. Each invocation takes around 20 minutes and consumes about 10GB of memory for a repository with 100K trials and millions of data points.

From the ranked list, OtterTune dynamically increases the number of knobs used in a tuning session over time. “As we show in our evaluation, this always produces better configurations than any static knob count.”

To find the closest matching workload from those it has seen before, OtterTune computes the Euclidean distance between the vector of (non-redundant) metrics for the target workload and the corresponding vector for those in its repository. This matching is repeated after each experiment while OtterTune gathers more data since the matched workload can vary for the first few experiments before converging on a single workload.
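The matching step itself is just a nearest-neighbour lookup in metric space (a sketch; the workload names are made up):

```python
import numpy as np

def match_workload(target_metrics, repository):
    """repository maps a previously-tuned workload name to its
    (non-redundant) metric vector; returns the closest match."""
    names = list(repository)
    dists = [np.linalg.norm(np.asarray(target_metrics) - np.asarray(repository[n]))
             for n in names]
    return names[int(np.argmin(dists))]
```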

To recommend a configuration to try next, OtterTune uses Gaussian Process (GP) regression. (We looked at Bayesian optimisation previously with CherryPick and BOAT).

OtterTune starts the recommendation step by reusing the data from the workload that it selected previously to train a GP model. It updates the model by adding in the metrics from the target workload that it has observed so far. But since the mapped workload is not exactly identical to the unknown one, the system does not fully trust the model’s predictions. We handle this by increasing the variance of the noise parameter for all points in the GP model that OtterTune has not tried yet for this tuning session.

Whether OtterTune chooses exploration (investigating areas in the model where its confidence is low) or exploitation (searching around configurations close to its best so far) depends on the variance of data points in its GP model. It will always choose the configuration with the greatest expected improvement.

OtterTune uses gradient descent to find the local optimum on the surface predicted by the GP model using a set of configurations, called the initialization set, as starting points.

The initialization set contains a mix of top performing configurations from the current tuning session, and configurations with randomly selected values. It takes OtterTune about 20 seconds to complete the gradient descent search per observation period.
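The GP-plus-expected-improvement loop can be sketched as follows (a minimal version that scores a candidate set directly rather than running OtterTune's gradient descent; assume the metric being minimised is latency):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, candidates, y_best):
    """EI for a minimisation target (e.g. latency)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def recommend_next(tried_configs, observed_latency, candidate_configs):
    """Fit a GP to the configurations tried so far, then pick the
    candidate with the greatest expected improvement."""
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(tried_configs, observed_latency)
    ei = expected_improvement(gp, candidate_configs, observed_latency.min())
    return candidate_configs[int(np.argmax(ei))]
```

High EI in an unexplored region corresponds to exploration; high EI near the best-seen configuration corresponds to exploitation, so the variance term naturally drives the trade-off the paper describes.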


OtterTune is evaluated using workloads from YCSB, TPC-C, a Wikipedia-based OLTP workload, and TPC-H, running against MySQL v5.6, Postgres v9.3, and Actian Vector v4.2.

OtterTune is able to find good configurations within 15-60 minutes, as evidenced by the charts below (green line is OtterTune, the blue line is another tuning tool called iTuned).

The key charts for me though are the following, which show how effective the configurations found by OtterTune are when compared to (a) the DBMS default settings, (b) tuning scripts from open-source tuning advisor tools, (c) an experienced DBA, and (d) the configuration used by Amazon RDS.

Our results show that OtterTune produces configurations that achieve up to 94% lower latency compared to [the DBMS] default settings or configurations generated by other tuning advisors. We also show that OtterTune generates configurations in under 60 minutes that are comparable to ones created by human experts.

Enabling signal processing over data streams

August 10, 2017

Enabling signal processing over data streams Nikolic et al., SIGMOD ’17

If you’re processing data coming from networks of sensors and devices, then it’s not uncommon to use a mix of relational and signal processing operations.

Data analysts use relational operators, for example, to group signals by different data sources or join signals with historical and reference data. They also use domain-specific algorithms such as Fast Fourier Transform (FFT) to do spectral analysis, interpolation to handle missing values, or digital filters to recover noisy signals.

This can be accomplished in environments such as SQL Server, Greenplum, and Spark through integration with R. The trouble is that there’s an impedance mismatch between the world of relational query processing and the world of R / signal processing. General purpose streaming engines support relational and streaming queries over relational or tempo-relational data, whereas numerical tools typically support offline operations on arrays. This leads to a lot of inefficiency every time you cross between the two worlds. For real-time processing of streaming data, the overheads can easily be unacceptable.

TrillDSP, which builds on top of the Trill streaming analytics engine from Microsoft, deeply integrates digital signal processing (DSP) operations with the Trill general purpose query processor. It provides a nice example of the benefits and elegance of such an integration; let us hope other stream processing engines follow suit.

We have built a query engine, named TRILLDSP, that deeply integrates relational and signal processing… (1) it provides a unified query language for processing tempo-relational and signal data; (2) its performance is comparable to numerical tools like MATLAB and R and orders of magnitude better than existing loosely-coupled systems; (3) it provides mechanisms for defining custom DSP operators and their integration with the query language; (4) it supports incremental computation in both offline and online analysis.

An example of mixed relational and signal processing

Consider an IoT application receiving a stream of temperature readings from different sensors in time order. We want to run a five-stage processing pipeline on the signal coming from each source:

  1. Grouping by source, while retaining time order inside each group
  2. Interpolation to provide a uniformly sampled signal suitable for feeding to a DSP algorithm
  3. Windowing to create 512-sample windows of data with a 256 sample hop size
  4. Spectral analysis of each window using FFT: a frequency representation is computed at each hopping point, and then a user-defined function (e.g., retaining only the top-k spectrum values) operates on the frequency representation. Finally, an inverse FFT operation is performed to transform the samples back again.
  5. Unwindowing, to restore the original signal form – projecting output arrays back to the time axis and summing up the overlapping values.
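Steps 3-5 above are the classic window / transform / overlap-add pattern. Here is a minimal numpy sketch (rectangular windows, with top-k magnitude filtering as the user-defined function; an illustration, not TrillDSP's API):

```python
import numpy as np

def spectral_topk(signal, n=512, hop=256, k=10):
    """Window -> FFT -> keep top-k spectrum values -> inverse FFT -> overlap-add."""
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - n + 1, hop):
        spec = np.fft.rfft(signal[start:start + n])
        mags = np.abs(spec)
        spec[mags < np.sort(mags)[-k]] = 0       # retain only top-k components
        out[start:start + n] += np.fft.irfft(spec, n)  # unwindow: sum overlaps
    return out
```

With a hop of n/2, every interior sample is covered by exactly two windows, so for a rectangular window the overlap-add output is twice the input in the fully-covered region.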

Here’s how we can express the computation in TrillDSP:

And for comparison, this is what it would look like in native R:

And in SparkR:

For pure DSP tasks, TrillDSP is comparable with or better than best-of-breed signal processing systems such as MATLAB, Octave, and R. It also outperforms WaveScope, another streaming engine with DSP support.

For queries mixing relational and signal processing, TrillDSP shows up to 146x better performance than loosely-coupled systems such as SparkR and SciDB-R.

TrillDSP: From streams to signals

We know that we need data in array format for DSP operations, but TrillDSP chose not to make arrays first-class citizens in the data model. Instead, arrays are exposed only through designated DSP operators that hide the complexity of on-the-fly translation between relational and array models. This preserves the performance and query language for the existing relational API, while still giving DSP experts what they need. An initial loosely-coupled design (shelling out to R) spent more than 91% of the total execution time just in translation between relational and array formats. Supporting DSP operations in-situ is required for good performance:

IoT applications run the same query logic over many sensor devices. For that reason, we design signal operations to natively support grouped processing and implement them as stateful operators that internally maintain the state of each group. Coupling these operators with Trill’s streaming temporal MapReduce operator is key to efficient grouped signal processing, as seen in experiments.

Signals are treated as a special kind of stream in which stream events have no overlapping lifetimes. If a stream does have events which overlap in time, one simple way to convert the stream to a signal is to average overlapping stream events on the value field. In TrillDSP:

var s0 = stream.Average(e => e.Value)

We’ll see how to create uniformly sampled signals in the next section.

Signal payloads can be of arbitrary structure, including complex and real-valued signals, so long as the payload type implements the following interface:

Thus customized signal payloads can be used for processing in different application domains. As well as discrete-time signals, users may also want to work with more general continuous signals whose values can be expressed as a function of time.

For example, in amplitude modulation, users multiply a signal with a continuous carrier signal, which is usually a sine wave of a given frequency and amplitude… To capture continuous signals efficiently, TrillDSP introduces functional signal payloads that carry a lambda function for computing signal values at any point in time.

You can either specify a function per event, or the more efficient function per signal.

TrillDSP supports a wide range of operations on signals as shown in the following table:

Uniformly-sampled signals

Most DSP algorithms expect signals of equally spaced samples. Uniform signals in TrillDSP are defined by a sampling period and a sampling offset. The offset defines the initial time shift of samples and is useful for correlating signals. The TrillDSP Sample operator produces a uniform signal by sampling a non-uniform one. To handle irregular events, the Sample operator can be passed an interpolation policy that specifies the rules for computing missing values. A policy consists of an interpolation function and an interpolation window size determining the maximum time distance among reference samples used for interpolation.

TrillDSP provides a set of common interpolation functions: constant, last-valued, step (zero-order), linear (first-order), and second-degree polynomial (second-order), and also supports user-defined interpolation functions.
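A first-order (linear) policy with an interpolation window can be sketched like this (my own numpy formulation; samples whose reference points are too far apart are left missing, here as NaN):

```python
import numpy as np

def sample_uniform(times, values, period, offset=0.0, max_gap=np.inf):
    """Linearly interpolate (times, values) onto a uniform grid defined by
    a sampling period and offset; grid points whose bracketing reference
    samples are further apart than max_gap become NaN (policy violation)."""
    grid = np.arange(offset, times[-1] + 1e-12, period)
    out = np.interp(grid, times, values)
    # enforce the interpolation-window policy
    idx = np.searchsorted(times, grid)
    lo = times[np.clip(idx - 1, 0, len(times) - 1)]
    hi = times[np.clip(idx, 0, len(times) - 1)]
    out[(hi - lo) > max_gap] = np.nan
    return grid, out
```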

Signal windows are implemented as fixed-size circular arrays to efficiently support offline and online processing with overlapping windows.

For binary operation on uniformly sampled signals the operands need to share the same sampling period and offset. TrillDSP provides resampling operations (upsample, downsample, and resample) that allow users to change these properties on the fly.

Digital filters

Custom digital filters can be written by specifying the sizes of the feed-forward and feed-backward windows, how to interpret missing values, and the filtering function itself. Such filters implement the following interface:

As an example, a finite impulse response (FIR) filter produces outputs which are the weighted sums of the most recent inputs. It uses only the feed-forward loop, and can be implemented in TrillDSP as follows:
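The paper gives the TrillDSP listing as a figure; as a stand-in, here is the same feed-forward computation in plain numpy (my own sketch, not TrillDSP's interface, treating missing history as zero):

```python
import numpy as np

def fir_filter(x, weights):
    """FIR filter: each output is a weighted sum of the most recent inputs
    (feed-forward window only, no feedback)."""
    w = np.asarray(weights, dtype=float)
    n = len(w)
    padded = np.concatenate([np.zeros(n - 1), np.asarray(x, dtype=float)])
    # output[i] = sum_j w[j] * x[i - j]
    return np.array([padded[i:i + n] @ w[::-1] for i in range(len(x))])
```

For example, `weights = [0.5, 0.5]` gives a two-tap moving average.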


We briefly saw the windowing support when we looked at an example pipeline earlier. The Window operator supports the fundamental windowing workflow when analyzing uniform signals.

The workflow consists of three steps: (1) the first step splits the signal into smaller, possibly overlapping components, (2) the second step transforms each component using a sequence of operations, and (3) the third step recombines the processed components into the final signal.

The transformations in step 2 are expressed as a pipeline of operators. Each pipeline operator implements an interface providing support for incremental computation and re-evaluation. For many signal operations, re-evaluation is the only option.


TrillDSP’s pure DSP performance is evaluated against MATLAB, Octave, and R. For workloads combining relational and signal processing (i.e., the target workloads), TrillDSP is evaluated against Spark with R integration, SciDB with R integration, and WaveScope (the system closest in spirit to TrillDSP).

Here’s a pure DSP example based on FIR filtering (note the log axis):

TrillDSP’s SIMD implementation of dot product outperforms the naïve loop implementation in WaveScope and R by a wide margin. TrillDSP stays competitive with Octave while being up to 1.8x slower than MATLAB. In contrast to the offline tools, TrillDSP incurs additional overheads from copying stream events into arrays and checking for missing values. These experiments demonstrate that TrillDSP can easily integrate with 3rd-party libraries to deliver performance competitive with or even better than the state-of-the-art tools used by practitioners in offline analysis.

Of course, beyond just performance, TrillDSP also offers true online support and tight integration with relational operators. Consider an example of grouped signal interpolation and filtering. The task is to take two uniformly sampled signals with different sampling periods, group them by sensor id, and in each group upsample the first signal to match the sampling period of the second before reporting events at which both group signals have simultaneously extreme values.

The following chart shows the running time of various systems normalized by the running time of TrillDSP with 16 cores over varying numbers of groups. (So for example, a value of 4 on the y-axis means that configuration is 4x slower than TrillDSP with 16 cores on the same task). Note again the log scale on the y-axis.

We note that even the single-threaded TRILLDSP program outperforms all the other systems in all but one case due to its group-aware interpolation operator simultaneously processing multiple subsignals. Running TRILLDSP using 16 cores accelerates the single-threaded performance by up to 9.4 times, which leads to much better performance overall, up to 146x faster than SPARKR and up to 67x faster than SCIDB-R. Both of these systems behave poorly for large numbers of groups, eventually crashing when processing one million groups. Such bad performance was primarily caused by expensive reshuffling operations, which in SPARKR also involve writing to disk.