
Automated localization for unreproducible builds

June 22, 2018

Automated localization for unreproducible builds Ren et al., ICSE’18

Reproducible builds are an important component of integrity in the software supply chain. Attacks against package repositories and build environments may compromise binaries and produce packages with backdoors (see this report for a recent prominent example of compromised packages on DockerHub). If the same source files always lead to the same binary packages, then an infected binary can be much more easily detected. Unfortunately, reproducible builds have not traditionally been the norm. Non-determinism creeping into build processes means that rebuilding an application from the exact same source, even within a secure build environment, can often lead to a different binary.

Due to the significant benefits, many open-source software repositories have initiated their validation processes. These repositories include GNU/Linux distributions such as Debian and Guix, as well as software systems like Bitcoin.

If you have a non-reproducible build, finding out why can be non-trivial. It takes time and a lot of effort to hunt down and eradicate the causes. For example, Debian unstable for AMD64 still had 2,342 packages with non-reproducible builds as of August 2017. (The number today as I’m writing this is 2,826). You can see a stubbornly persistent group of unreproducible builds in this screen grab from


RepLoc is a tool for highlighting the files within a package which are the likely root cause of non-reproducible builds. Tested against 671 unreproducible Debian packages, it achieves a Top-1 accuracy rate of 47.09%, and a Top-10 accuracy rate of 79.28%. That’s a significant aid to developers looking at cryptic messages from the builds of packages that may include many hundreds of files. With the help of RepLoc, the authors also identified and fixed non-reproducibility issues in 6 packages from Debian and Guix.

Detecting and diagnosing unreproducible builds

The Debian workflow for testing build reproducibility looks like this:

The source is built in two different environments, deliberately constructed with different environment variables and software configurations. If the two resulting binaries are not identical, the package is flagged for human inspection, a process called localization which seeks to localise the causes of unreproducible builds. One of the major inputs to this process is the diff logs, as generated by diffoscope.

Those logs produce output that looks like this:

Here we can see diffoscope highlighting a difference in libcompat.a. In this case, the root cause is in the Makefile:

Can you spot it?

The root issue is the unstable ordering of files passed to the ar utility to generate libcompat.a (wildcard libcompat/*.c).

Here’s a patch that fixes the issue.
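To see why input ordering matters, here is a minimal Python sketch (not the package's build code). It treats an archive as an order-sensitive concatenation of its members, so two directory listings of the same files give different digests unless the inputs are sorted first. The file names, contents, and the `archive_digest` helper are hypothetical.

```python
import hashlib

def archive_digest(members):
    """Hash member names and contents in the order given,
    as an order-sensitive archiver would store them."""
    h = hashlib.sha256()
    for name, data in members:
        h.update(name.encode())
        h.update(data)
    return h.hexdigest()

# Hypothetical object files; a filesystem may list them in any order.
files = {"strlcpy.o": b"\x01", "err.o": b"\x02", "fgetln.o": b"\x03"}
listing_a = ["strlcpy.o", "err.o", "fgetln.o"]
listing_b = ["fgetln.o", "strlcpy.o", "err.o"]

def build(listing, sort_inputs):
    """Build the 'archive' from a directory listing, optionally sorting first."""
    names = sorted(listing) if sort_inputs else listing
    return archive_digest([(n, files[n]) for n in names])

# Same inputs, different listing order: the unsorted builds disagree.
unsorted_differs = build(listing_a, False) != build(listing_b, False)
# Sorting the inputs first makes the two builds identical.
sorted_matches = build(listing_a, True) == build(listing_b, True)
```

In a Makefile the analogous move is to sort the result of the wildcard expansion before handing it to the archiver.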

In general there are many possible causes of unreproducible builds, including timezones (making, e.g., anything that uses the __DATE__ macro unreproducible) and locale settings (making, e.g., capture of output text unreproducible).

Introducing RepLoc

RepLoc begins with the build logs and diff log created by diffoscope, and seeks to automatically highlight the most likely files in which the root cause can be found. RepLoc’s Query Augmentation (QA) component uses information from the logs to refine queries that help to pinpoint likely causes. The Heuristic Filtering (HF) component embodies 14 hand-coded heuristic rules that further help to highlight possible causes. The combined outputs are passed to a ranking component to produce the list of ranked likely-culprit source files as the final output.

Query Augmentation

The file names highlighted in the diff log are used as the initial seed query set. The build logs contain additional information relating to these files, for example:

The QA component splits the build logs into directory sections based on the ‘Entering / Leaving directory’ messages in the logs. Each directory section thus contains a set of commands, which is denoted a ‘command file’ in the paper. TF/IDF is used to assign a weight value to each command file to assess its relevance to the initial seed queries. The commands from the most relevant command files are then added to the query set, to produce an augmented query set. (In the example above, the ar cru bin-x86_64/libcompat.a... command causes us to add the content of this command file).
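The splitting and weighting steps can be sketched roughly as follows. This is an illustrative Python approximation, not RepLoc's implementation. The log excerpt, the tokenizer, and the particular TF/IDF variant (term frequency times log((1 + N) / df)) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def split_command_files(build_log):
    """Group log lines by the directory named in make's Entering/Leaving messages."""
    sections, stack = {}, []
    for line in build_log.splitlines():
        entered = re.search(r"Entering directory '([^']+)'", line)
        if entered:
            stack.append(entered.group(1))
            sections.setdefault(stack[-1], [])
        elif "Leaving directory" in line:
            if stack:
                stack.pop()
        elif stack and line.strip():
            sections[stack[-1]].append(line.strip())
    return sections

def tokens(text):
    """Very rough tokenizer: split on anything that is not a name-like character."""
    return re.findall(r"[A-Za-z0-9_.\-]+", text)

def rank_command_files(sections, seed_query):
    """Order command files by the summed TF/IDF weight of the seed query terms."""
    docs = {d: Counter(t for cmd in cmds for t in tokens(cmd))
            for d, cmds in sections.items()}
    n = len(docs)
    scores = {}
    for d, tf in docs.items():
        total = sum(tf.values()) or 1
        score = 0.0
        for term in set(tokens(seed_query)):
            df = sum(1 for other in docs.values() if term in other)
            if df:
                score += (tf[term] / total) * math.log((1 + n) / df)
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical build log excerpt.
build_log = """\
make[1]: Entering directory '/build/pkg/src'
gcc -c main.c -o main.o
make[1]: Leaving directory '/build/pkg/src'
make[1]: Entering directory '/build/pkg/compat'
ar cru bin-x86_64/libcompat.a err.o strlcpy.o
make[1]: Leaving directory '/build/pkg/compat'
"""
sections = split_command_files(build_log)
ranked = rank_command_files(sections, "libcompat.a")
```

Here the command file for the compat directory ranks first, so its commands would be the ones added to the query set.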

Heuristic filtering

The authors extract fourteen heuristic rules from Debian’s documentation. These rules are encoded as Perl regular expressions(!), as summarised in the table below.

  • The time and date rules look for uses of the __TIME__ and __DATE__ macros.
  • The gzip rule (3) looks for uses of gzip without the -n argument (in which case gzip embeds timestamps in the resulting archive).
  • The date cmd rule (4) looks for capture of the current date using the date shell command
  • PL_LOCALTIME looks for Perl scripts capturing date and time
  • DATE_IN_TEX looks for date embedding in tex files
  • SORT_IN_PIPE captures cases where sort is used without a locale set
  • TAR_GZIP_PIPE looks for tar and gzip executed in a pipeline
  • PL_UNSORTED_KEY catches traversal of unsorted hash keys in Perl
  • LS_WITHOUT_LOCALE captures cases where ls is used without a locale
  • UNSORTED_WILDCARD looks for the use of wildcard in Makefiles without sorting

With the rules in hand, it’s a simple matter of running grep over the source tree. Heuristic filtering has good recall, but poor precision (i.e., it can produce a lot of false positives).
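A rule-based filter of this kind might look like the sketch below. The paper's actual Perl expressions are not reproduced here; the four patterns are simplified stand-ins written for illustration, and the rule names and sample files are hypothetical.

```python
import re

# Illustrative patterns only, not the paper's rules.
RULES = {
    "DATE_MACRO": re.compile(r"__DATE__"),
    "TIME_MACRO": re.compile(r"__TIME__"),
    # gzip with no short-option cluster containing 'n' later on the line
    "GZIP_WITHOUT_N": re.compile(r"\bgzip\b(?!.*\s-\w*n)"),
    # $(wildcard ...) that is not directly wrapped in $(sort ...)
    "UNSORTED_WILDCARD": re.compile(r"(?<!\$\(sort )\$\(wildcard\b"),
}

def flag_files(sources):
    """Return {path: [rule names]} for every file with at least one matching line."""
    flagged = {}
    for path, text in sources.items():
        hits = sorted({name
                       for line in text.splitlines()
                       for name, rx in RULES.items()
                       if rx.search(line)})
        if hits:
            flagged[path] = hits
    return flagged

# Hypothetical source tree, held in memory for the example.
sources = {
    "Makefile": "OBJS = $(wildcard libcompat/*.c)\n",
    "version.c": "const char *built = __DATE__;\n",
    "fixed.mk": "OBJS = $(sort $(wildcard libcompat/*.c))\n\tgzip -9n out.tar\n",
}
flagged = flag_files(sources)
```

Patterns this loose will also match comments and documentation, which is one way the false positives mentioned above arise.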


We can compute the cosine similarity (using TF/IDF) between each package file and the augmented queries to produce a ranked list of candidate files from the QA module. These are then combined with the files highlighted by the HF module to give a simple weighted score:
$$\mathrm{score}(f) = (1-\alpha)\,QA(f) + \alpha\,HF(f)$$

Where QA(f) is the similarity score produced by the QA module, and HF(f) is 1 if the HF module flagged the file, and 0 otherwise. α is a configurable parameter to tune the balance between the two terms.
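As a small worked example of the weighted score, the sketch below combines made-up QA similarities with a made-up HF flag set. The value α = 0.3 and the file names are assumptions chosen for illustration, not values from the paper.

```python
def combined_scores(qa_scores, hf_flagged, alpha):
    """score(f) = (1 - alpha) * QA(f) + alpha * HF(f), with HF(f) in {0, 1}."""
    files = set(qa_scores) | set(hf_flagged)
    return {
        f: (1 - alpha) * qa_scores.get(f, 0.0)
           + alpha * (1.0 if f in hf_flagged else 0.0)
        for f in files
    }

# Hypothetical inputs: QA similarity per file, and the files HF flagged.
qa = {"Makefile": 0.40, "src/main.c": 0.55, "README": 0.10}
hf = {"Makefile"}

scores = combined_scores(qa, hf, alpha=0.3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these numbers the Makefile scores 0.7 × 0.40 + 0.3 = 0.58 and overtakes src/main.c (0.7 × 0.55 = 0.385), even though its QA similarity alone was lower.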

RepLoc in action

At the time of the study, Debian had 761 packages with accompanying patches that turned unreproducible builds into reproducible ones. This constitutes the ground truth dataset for RepLoc. The dataset is further divided into four subsets based on the major causes of non-reproducibility. The table below summarises how well RepLoc gets on. Concentrate on the bottom line (RepLoc) in each row (the other lines show how RepLoc behaves with different subsets of its modules).

  • A@N is a top-N accuracy rate score, defined as the percentage of top-N ranked file lists produced by RepLoc that contain at least one problematic file (from the patches).
  • P@N is a top-N precision score, defined as the percentage of files reported in a top-N list that are genuinely problematic.
  • R@N is a top-N recall score, defined as the percentage of all problematic files that are successfully identified in a top-N list.
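The three metrics can be computed as in the following sketch. It assumes precision and recall are averaged per package, which is one reasonable reading of the definitions above. The package names and file lists are invented for the example.

```python
def top_n_metrics(ranked_lists, truth, n):
    """Return (A@N, P@N, R@N).

    ranked_lists: {package: [files, best first]}
    truth:        {package: set of genuinely problematic files}
    Precision and recall are averaged over packages (an assumption).
    """
    hits = 0
    precisions, recalls = [], []
    for pkg, ranked in ranked_lists.items():
        top = ranked[:n]
        found = [f for f in top if f in truth[pkg]]
        hits += 1 if found else 0
        precisions.append(len(found) / len(top) if top else 0.0)
        recalls.append(len(found) / len(truth[pkg]))
    k = len(ranked_lists)
    return hits / k, sum(precisions) / k, sum(recalls) / k

# Hypothetical results for two packages.
ranked_lists = {
    "pkg-a": ["Makefile", "a.c", "b.c"],
    "pkg-b": ["x.c", "y.c", "z.c"],
}
truth = {
    "pkg-a": {"Makefile"},
    "pkg-b": {"build.sh", "z.c"},
}
a_at_2, p_at_2, r_at_2 = top_n_metrics(ranked_lists, truth, n=2)
```

For N = 2, only pkg-a has a problematic file in its top two, so A@2 is 0.5; precision averages to 0.25 and recall to 0.5.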

Overall, RepLoc achieves an average accuracy score of 79% for Top-10 files. I.e., if you examine the first ten files RepLoc highlights, you have a 79% chance of finding an issue causing an unreproducible build in at least one of them. You will also find on average 75% of all the files with reproducibility problems by the time you have worked through that top-10 list.

The authors then used RepLoc to see if they could find the root causes of unreproducible builds for packages where no ground truth was available (i.e., there was no known reproducible build process for them). Three packages from Debian are fixed (regina-rexx, fonts-uralic, and manpages-tr). The problematic files are right at the top of the list produced by RepLoc. Three packages from Guix are also fixed (libjpeg-turbo, djvulibre, and skalibs). Once more the problematic files are right at the top of the list produced by RepLoc.

Future work

For the future work, we are interested in the localization of problematic files for tool-chain related issues. Also, inspired by record-and-play techniques from crash reproduction based debugging research, it would be interesting to leverage these techniques to detect more accurate correspondence between the build commands executed and the built binaries.

Generalized data structure synthesis

June 21, 2018

Generalized data structure synthesis Loncaric et al., ICSE’18

Many systems have a few key data structures at their heart. Finding correct and efficient implementations for these data structures is not always easy. Today’s paper introduces Cozy, which can handle this task for you given a high-level specification of the state, queries, and update operations that need to be supported.

Cozy has three goals: to reduce programmer effort, to produce bug-free code, and to match the performance of handwritten code. We found that using Cozy requires an order of magnitude fewer lines of code than manual implementation, makes no mistakes even when human programmers do, and often matches the performance of handwritten code.

Let’s start out by looking at four case studies from the evaluation, to get a feel for where Cozy applies.

  • ZTopo is a topological map viewer implemented in C++. A core data structure is the map tile cache for map tiles that are asynchronously loaded over the network and cached on disk or in memory.
  • Sat4j is a Boolean SAT solver implemented in Java. A core data structure is the variable store.
  • Openfire is a large scalable IRC server implemented in Java. Its in-memory contact manager is extremely complex (and has been a frequent source of bugs).
  • Lucene is a search engine back-end implemented in Java. It uses a custom data structure that consumes a stream of words and aggregates key statistics about them.

Starting from a succinct high-level specification, Cozy can generate implementations for all of these data structures, requiring far fewer lines of code for the Cozy specs than for the handwritten implementations:

Because the specifications are shorter, simpler, and more abstract, they are much easier to understand. Programmers writing specifications are therefore less likely to make mistakes, and mistakes will be easier to discover, diagnose, and correct. The specifications also serve as concise, unambiguous documentation.

All the tests from the projects still pass with Cozy’s implementations. Looking at previous reported and fixed bugs, we can see that had Cozy been used in the first place, many bugs would have been prevented:

Cozy’s code is also competitive with handwritten code when it comes to performance:

The only case where it doesn’t match or exceed the handwritten implementation is Sat4j. Here the handwritten code exploits some facts about the data that Cozy doesn’t know – variable IDs can be used as indexes into an array since they always fall between zero and a known maximum bound. Cozy must insert safety checks because it doesn’t know this.

Hopefully that has whetted your appetite to learn more about how Cozy works. Let’s take a look under the hood!

Cozy’s high-level algorithm

Cozy starts with a high-level specification of the data structure, from which it constructs an initial implementation. Then it uses guided exploration to iteratively search for better implementations. Each iteration consists of two steps: query synthesis followed by “incrementalisation”. For the case studies above, Cozy was left to explore for three hours. You can stop the exploration at any point, because the implementations Cozy generates are always correct.

We’ll use a simplified version of the Openfire data structure as a running example. In Openfire, a user’s contact list is computed based on the groups that each user belongs to. The data structure needs to be able to efficiently answer the question “should user u1 appear in the contacts of user u2?”. The query method visible defines this relationship. A Cozy specification declares the abstract state maintained by the data structure, the queries that must be supported, and the update operations that modify data. Here’s an example for Openfire:

As specified above, visible runs in O(|groups| × |members|) time. After a few rounds, Cozy will find an implementation that runs in O(g) time, where g is the maximum number of groups that any one user belongs to.

Cozy generates an initial data representation using three fields, one for each component of the abstract state. For each field in a representation, Cozy also generates a concretization function to compute the field’s representation from the abstract state. With the initial implementation, these functions are simple selectors:

  v1(users, groups, members) = users 
  v2(users, groups, members) = groups
  v3(users, groups, members) = members

Now every query and update operation can be rewritten in terms of these fields by simple substitution. So visible becomes:

This gives us a baseline implementation from which we can start to look for improvements.

In this example, after a few steps Cozy ends up with a Map-based representation that looks like this:

To go along with this new representation, Cozy also generates new concretization functions:

The new representation and implementation generated by the query synthesis step above doesn’t keep the new state variables s1, s2, and s3 up to date when join is called. The incrementalization step restores correct functioning by adding code to join to update the new state variables.

A simple but inefficient solution would be to recompute the value of each concrete state variable from scratch. Because an update usually makes a small change to the abstract state, Cozy produces an incremental update that makes small changes to the concrete state in response to a small change to the abstract state.
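The contrast between recomputing and updating incrementally can be illustrated with a hand-written Python sketch. This is not code generated by Cozy. The class, the field names, and the choice of a user-to-groups map as the concrete state are assumptions for the example.

```python
from collections import defaultdict

class Membership:
    """Abstract state: a set of (user, group) pairs.
    Concrete state: a derived map from user to the groups they belong to."""

    def __init__(self):
        # Abstract state, kept here only so the two strategies can be compared.
        self.members = set()
        # Concrete state: user -> set of groups.
        self.groups_of = defaultdict(set)

    def recompute(self):
        """Simple but slow: rebuild the derived map from the whole abstract state."""
        rebuilt = defaultdict(set)
        for user, group in self.members:
            rebuilt[user].add(group)
        return rebuilt

    def join(self, user, group):
        """Incremental: touch only the one map entry whose value changes."""
        self.members.add((user, group))
        self.groups_of[user].add(group)

m = Membership()
m.join("u1", "g1")
m.join("u1", "g2")
m.join("u2", "g1")
# The incrementally maintained map agrees with a from-scratch rebuild.
consistent = m.groups_of == m.recompute()
```

The incremental join does constant work per call, while recompute walks every membership pair each time.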

The update procedure is rephrased as a set of queries that compute the changes that should take place, together with a simple hardcoded snippet that applies the computed changes.

Here’s the code Cozy produces to update the concrete state when a user joins a group:

This code has introduced two new query operations: altered_keys_s1 and new_value_for_key_s1.

These new queries will be added to the query set for the data structure and optimised by the query synthesiser in future optimisation rounds.

At the end of each improvement step, a dead code elimination pass gets rid of unused variables (v1, v2, and v3 can now be eliminated in this example). All code that keeps those variables up to date can also be eliminated. Cozy uses a mark-and-sweep algorithm to do this housekeeping.

The details

Cozy’s core specification language looks like this:

The Cozy specs that users write contain sugared forms (e.g., len(X)) that are desugared into this core language. When I first started reading about Cozy, I expected it to have a full armoury of internal data structures at its disposal, but actually Cozy does everything today using maps!

We plan to extend Cozy with additional primitives for heaps, trees, and other efficient data structures in the future. For the case studies we examined, maps alone are sufficient to discover efficient implementations.

When searching for better query implementations, Cozy uses multiple threads, each searching in parallel. (Which reminds me, what guarantees does Cozy give regarding concurrent access to the data structures it generates? I couldn’t see anything in the paper about this. That seems pretty important given the use cases…). There are two key parts to the search strategy: how to generate candidates, and how to evaluate candidates.

To bias the search towards useful expressions, Cozy adds a small number of handwritten diversity rules into the mix. Whenever Cozy considers a new candidate expression, it applies these rules and also considers the resulting expressions.

In practice, Cozy’s enumerative search machinery does not function well without the diversity rules, and vice-versa.

Cozy uses equivalence class deduplication to keep the search space manageable. Given a set of example inputs (produced by the verifier), if two expressions produce the same output on every example, Cozy uses its cost model to decide which version to keep. The cost model looks like this:
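The deduplication idea can be sketched in a few lines of Python. The candidate expressions are stand-in lambdas and the costs are made-up numbers, not Cozy's cost model.

```python
def deduplicate(candidates, examples, cost):
    """Keep the cheapest candidate from each group of candidates that
    produce identical outputs on every example input."""
    best = {}
    for name, fn in candidates.items():
        fingerprint = tuple(fn(x) for x in examples)
        if fingerprint not in best or cost[name] < cost[best[fingerprint]]:
            best[fingerprint] = name
    return sorted(best.values())

# Hypothetical candidate expressions over a tuple of integers.
candidates = {
    "count_via_list": lambda xs: len([x for x in xs if x > 0]),
    "count_via_sum": lambda xs: sum(1 for x in xs if x > 0),
    "length": lambda xs: len(xs),
}
cost = {"count_via_list": 5, "count_via_sum": 3, "length": 1}  # assumed costs
examples = [(1, -2, 3), (), (-1,)]

kept = deduplicate(candidates, examples, cost)
```

The two counting expressions agree on all three examples, so only the cheaper one survives alongside length. Agreement on examples is evidence rather than proof of equivalence, which is why a verifier that can supply new distinguishing examples matters.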

With new queries synthesised, Cozy uses pre-defined update sketches for different types. For example, with maps the update sketch finds keys whose values have changed and updates each one in the map.

The last word

Cozy is effective because incrementalization allows it to implement both pure and imperative operations using only a query synthesizer. A high-quality cost function and diversity injection make the query synthesizer powerful and practical… Our case studies demonstrate that data structure synthesis can improve software development time, correctness, and efficiency.

ConflictJS: finding and understanding conflicts between JavaScript libraries

June 20, 2018

ConflictJS: finding and understanding conflicts between JavaScript libraries Patra et al., ICSE’18

The JavaScript ecosystem is fertile ground for dependency hell. With so many libraries being made available and the potential for global namespace clashes, it’s easy for libraries to break each other. Sometimes in an obvious to spot way (that’s a good day!), and sometimes in subtle ways that are harder to detect.

ConflictJS is a tool for finding conflicting JavaScript libraries. It’s available as open source and nicely documented, so you can try it for yourself from

We use ConflictJS to analyze and study conflicts among 951 real-world libraries. The results show that one out of four libraries is potentially conflicting and that 166 libraries are involved in at least one certain conflict.

Why do conflicts happen?

At a language level, until ES6 modules at least, there was no built-in namespacing mechanism (though we do have a number of conventions and module libraries). In principle developers can follow a ‘single API object’ pattern where the entire API of a library is encapsulated behind a single object. In practice, many of them don’t (71% of libraries did not do this, from 951 studied for this paper). There are also third-party module systems such as AMD and CommonJS, but they’re not universally used and not fully compatible.

…since widely used libraries cannot rely on recently added language features, they typically ensure backward compatibility by relying on other ways to export their APIs. In summary, the lack of namespace and modules in currently deployed versions of JavaScript creates a non-trivial problem for library developers.

Different types of conflicts

As an example of a conflict, consider a client that uses both Strophe.js (an XMPP library) and JSEncrypt.js (encryption). Both libraries write to the global variable Base64. Say you build an application using JSEncrypt, test it out, and everything is working fine. Later on you add Strophe to the project, loaded after JSEncrypt is loaded. Calls to get encrypted data from JSEncrypt will suddenly start returning the value false.

Conflicts occur when two different libraries both write to the same path in the global namespace. The authors define four different types of conflict that may occur:

  1. Inclusion conflicts: when the mere act of including multiple libraries is enough to cause an exception to be thrown.
  2. Type conflicts: when multiple libraries write type-incompatible values to the same globally reachable location.
  3. Value conflicts: when multiple libraries write different values (but with compatible non-function types) to the same globally reachable location.
  4. Behaviour conflicts: when multiple libraries write different functions to the same globally reachable location.

Examples of the four cases are given in the table below:


How ConflictJS works

ConflictJS works in two stages. First it dynamically analyses the loading of each library to determine the global access paths it writes to. Then it takes pairs of potentially conflicting libraries (i.e., they both write to at least one shared global access path) and generates tests to explore whether or not there really is a problem in practice. ConflictJS is precise, since any problem it reports is genuinely a problem, but it is not sound – that is, it may miss some conflicts.
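The first stage's output feeds a simple pairing step, sketched here in Python for illustration (ConflictJS itself analyses JavaScript in a browser). The Base64 and Array.prototype.remove paths come from the examples in this post; the other access paths are assumed.

```python
from itertools import combinations

def potential_conflicts(writes):
    """writes: {library: set of global access paths written while loading}.
    Returns {(lib_a, lib_b): shared paths} for every pair that overlaps."""
    pairs = {}
    for a, b in combinations(sorted(writes), 2):
        shared = writes[a] & writes[b]
        if shared:
            pairs[(a, b)] = shared
    return pairs

# Recorded writes per library (partly hypothetical).
writes = {
    "strophe": {"Base64", "Strophe"},
    "jsencrypt": {"Base64", "JSEncrypt"},
    "jslite": {"Array.prototype.remove"},
    "ext-core": {"Array.prototype.remove", "Ext"},
}
pairs = potential_conflicts(writes)
```

Only the pairs returned here need generated tests in the second stage, which keeps the number of library combinations to validate manageable.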

Distinguishing between potential and actual conflicts is important to avoid false positives. For example, both JSLite.js and ext-core.js write to Array.prototype.remove. The functions are syntactically different, but semantically equivalent.

To check for inclusion conflicts, it is sufficient to generate a client that loads both libraries. To check for type conflicts, ConflictJS generates a client that reads the value at the conflicting access path, and checks its type. If different configurations (l1 alone, l2 alone, l1 then l2, l2 then l1) cause the client to see different types, then a type conflict is reported. The check for value conflicts is similar, but instead of comparing types, ConflictJS does a deep comparison of the objects written to the global access path.
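The four-configuration comparison can be mimicked with a small Python simulation, in which a dictionary plays the role of the global namespace and each "library" is a function that writes into it. This only mirrors the comparison logic; the two libraries and the `util` path are invented.

```python
def observed_types(load_orders, path):
    """Load each configuration into a fresh namespace and record
    the type found at `path` afterwards."""
    seen = set()
    for order in load_orders:
        namespace = {}
        for lib in order:
            lib(namespace)
        seen.add(type(namespace.get(path)).__name__)
    return seen

def lib1(ns):
    ns["util"] = {"version": 1}   # writes an object-like value

def lib2(ns):
    ns["util"] = lambda: "hi"     # writes a function to the same path

# l1 alone, l2 alone, l1 then l2, l2 then l1
configs = [(lib1,), (lib2,), (lib1, lib2), (lib2, lib1)]
types = observed_types(configs, "util")
type_conflict = len(types) > 1
```

A value-conflict check would follow the same shape, comparing the stored values deeply instead of their type names.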

That just leaves behaviour conflicts. ConflictJS generates a function call to test functions written to the same global access path. First the number of parameters, n, is estimated from the length property of the function object, then the generator decides on a random number ranging between 0 and n arguments to pass. It randomly chooses the argument types choosing between boolean, string, number, array, object, undefined, and null. To create objects, the generator creates up to 10 properties and assigns randomly generated values to them.

Once the arguments are generated, the function is called using the generated arguments. If and only if the call succeeds, without raising an exception, for at least one library configuration, the generator synthesizes a client that contains this call.

What the analysis reveals…

ConflictJS is tested with 951 selected JavaScript libraries – these are the subset of libraries from the CDNjs content delivery network that can be used in isolation on a standard desktop browser (they have no further dependencies).

Between them, these 951 libraries write to a total of 130,714 different access paths. 4,121 of these, across 268 libraries, cause a potential conflict. Of the potentially conflicting libraries, ConflictJS finds an actual conflict in 62% of them (that’s 17% of the 951 total libraries). All four conflict types are represented.

…all four kinds of conflict are prevalent in practice, which confirms our decisions to consider all four kinds in ConflictJS… the majority of conflicts are non-inclusion conflicts, i.e., they do not cause an exception just after loading the conflicting libraries. Finding such conflicts and reasoning about them is challenging for both library developers and users alike.

The authors then went on to study a random sample of 25 of the conflicting libraries. They uncovered seven patterns of root cause for conflicts, as shown in the table below.


Our work not only provides a practical tool for library developers to detect conflicts and for library users to avoid conflicting libraries, but also highlights the importance of language features for encapsulating independently developed code.

Debugging with intelligence via probabilistic inference

June 19, 2018

Debugging with intelligence via probabilistic inference Xu et al., ICSE’18

Xu et al. have built an automated debugger that can take a single failing test execution, and with minimal interaction from a human, pinpoint the root cause of the failure. What I find really exciting about it is that instead of brute force there’s a certain encoded intelligence in the way the analysis is undertaken which feels very natural. The first IDE / editor to integrate a tool like this wins!

The authors don’t give a name to their tool in the paper, which is going to make it awkward to refer to during this write-up. So I shall henceforth refer to it as the PI Debugger. PI here stands for probabilistic inference.

We model debugging as a probabilistic inference problem, in which the likelihood of each executed statement instance and variable being correct/faulty is modeled by a random variable. Human knowledge, human-like reasoning rules and program semantics are modeled as conditional probability distributions, also called probabilistic constraints. Solving these constraints identifies the most likely faulty statements.

In the evaluation, when debugging problems in large projects, it took on average just 3 interactions with a developer to find the root cause. Analysis times are within a few seconds. One of the neat things to fall out from using probabilistic inference is that the developer interacting with the system doesn’t have to give perfect inputs – the system copes naturally with uncertainty.


To see how PI Debugger uses probabilistic inference to pinpoint likely root causes, let’s work through a concrete example. Here’s a Python program defining a function path_url. The function takes an input url and safely encodes the path portion. For example, given as input def?x=5 the output should be

This program has a bug. When given an already encoded input, it encodes it again (replacing % with ‘%25’). For example, the input results in the output, whereas in fact the output should be the same as the input in this case.

Let’s put our probabilistic thinking caps on and try to debug the program. We ‘know’ that the url printed on line 19 is wrong, so we can assign low probability (0.05) to this value being correct. Likewise we ‘know’ that the input url on line 1 is correct, so we can assign high probability (0.95). (In probabilistic inference, it is standard not to use 0.0 or 1.0, but values close to them instead). Initially we’ll set the probability of every other program variable being correct to 0.5, since we don’t know any better yet. If we can find a point in the program where the inputs are correct with relatively high probability, and the outputs are incorrect with relatively high probability, then that’s an interesting place!

Since url on line 19 has a low probability of being correct, this suggests that url on line 18, and purl_str at line 12 are also likely to be faulty. PI Debugger actually assigns these variables probabilities of being correct of 0.0441 and 0.0832 respectively. Line 18 is a simple assignment statement, so the chances of a bug here are fairly low. Now we trace the data flow. If purl_str at line 12 is likely to be faulty then s at line 16 is also likely to be faulty (probability 0.1176).

Line 16 gets executed three times, two of which produce correct outputs, and one of which produces incorrect outputs. This reduces the suspicion on line 16 itself (now probability 0.6902), and suggests the first instance of i at line 16 is faulty.

Using variable name similarity inference rules, PI Debugger assigns higher correlation probability between variables with similar names (e.g. path_url and url) than those with dissimilar names (e.g. url and make_str). If the first instance of i is faulty, then purl at line 7 is also likely faulty. This could be due to a faulty value of path or a bug in append. PI Debugger knows to assign lower probabilities to bugs in library code.

If path is faulty on line 7, then either path itself is faulty, or the output of line 3 is faulty. We know that the input url on line 1 is likely correct, so the odds are in favour of the rhs of the assignment on line 3 being the culprit. And indeed it is – we should have a test to make sure that path is not already encoded before we call encode.

Although we describe the procedure step-by-step, our tool encodes everything as prior probabilities and probabilistic constraints that are automatically resolved. In other words, the entire aforementioned procedure is conducted internally by our tool and invisible to the developer.

PI Debugger overview

PI Debugger encodes the process just described, and consists of four main components: the tracing component, the probabilistic constraint encoding component, the probabilistic inference engine, and the feedback component.

The inference procedure can be intuitively considered as a process of fusing hints from various observations (i.e., High/Low prior probabilities) through the propagation channels dictated by the constraints.

Before beginning the analysis all the runtime variables in the program trace are converted to static single assignment (SSA) form. We then proceed in two main phases:

  1. Inferring variable correctness probabilities. This step uses dynamic slicing and propagation constraint probabilities. For example, given an assignment a = b + c, then if any two of the three components are correct, we can infer that the other is also likely to be correct. The encodings are sent to the inference engine which builds a factor graph and computes posterior probabilities using the Sum-Product algorithm.
  2. Inferring statement instance correctness. This stage combines the probabilities from the first stage with domain knowledge rules to determine the probability of each executed statement being correct. There are three types of generated constraints: (i) constraints correlating variable probabilities and statement instance probabilities, (ii) constraints from program structure, for example a function with a large number of statements is more likely to include the faulty statement, (iii) constraints from naming convention, on the assumption that function and variable names suggest functionality to some extent. These constraints and prior probabilities are once more sent to the inference engine.
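To get a feel for how priors and constraints combine, here is a toy Python sketch for the assignment a = b + c from the first phase. It computes exact posteriors by brute-force enumeration over every correct/faulty assignment, rather than building a factor graph and running Sum-Product as the tool does. The priors, the 0.95 constraint strength, and the helper names are assumptions for illustration.

```python
from itertools import product

def marginals(priors, factors):
    """Exact P(variable is correct) by enumerating all True/False worlds."""
    names = list(priors)
    totals = {n: 0.0 for n in names}
    z = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for n in names:
            weight *= priors[n] if world[n] else 1 - priors[n]
        for factor in factors:
            weight *= factor(world)
        z += weight
        for n in names:
            if world[n]:
                totals[n] += weight
    return {n: totals[n] / z for n in names}

def two_correct_imply_third(x, y, target, strength=0.95):
    """Soft constraint: if x and y are correct, target is correct with
    probability `strength`; otherwise the constraint says nothing (0.5)."""
    def factor(world):
        if world[x] and world[y]:
            return strength if world[target] else 1 - strength
        return 0.5
    return factor

# a = b + c: b is a trusted input, a was observed to be wrong, c is unknown.
priors = {"a": 0.05, "b": 0.95, "c": 0.5}
factors = [
    two_correct_imply_third("b", "c", "a"),
    two_correct_imply_third("a", "b", "c"),
    two_correct_imply_third("a", "c", "b"),
]
posterior = marginals(priors, factors)
```

With these inputs the posterior for c falls below its 0.5 prior while b stays high: a wrong result plus a trusted operand shifts suspicion onto the other operand. Enumeration is exponential in the number of variables, which is why a real trace needs message passing instead.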

We compute marginal probabilities based on the Sum-Product algorithm. The algorithm is essentially a procedure of message passing (also called belief propagation), in which probabilities are only propagated between adjacent nodes. In a message passing round, each node first updates its probability by integrating all the messages it receives and then sends the updated probability to its downstream receivers. The algorithm is iterative and terminates when the probabilities of all nodes converge. Our implementation is based upon libDAI, a widely used open-source probabilistic graphical model library.

At the end of the process, the most probable root cause is reported to the developer. If the developer disagrees, they can provide their assessment of the likelihood the operands in the reported statement are faulty. This then becomes a new observation and we can run inference again.

Constraint generation

Section 5 in the paper contains detailed information on the constraint generation process. For variables there are three encoding rules, covering the forward causality of a statement (e.g. the likelihood of the lhs of an assignment being correct given the rhs), the backward causality (e.g. the likelihood of the rhs of an assignment being correct given the lhs), and causality for control predicates (the likelihood of the correct branch being taken). The computation rules propagate probability differently depending on the type of statement involved (e.g., assignment, mod, attribute read, read of a collection element, equivalence,…).

Three different kinds of constraints are generated for statement instances:

  • value-to-value statement constraints model the causality between variable probabilities and statement instance probabilities
  • program structure constraints modelling hints from program structure, and
  • naming convention constraints modelling hints from names.

Evaluation results

PI Debugger is applied to a set of real world bugs taken from some of the largest Python projects on GitHub. The monster table below summarises the results. The ‘PD’ column reports the number of interactions, including the final root cause step. (The ‘FD’ column is the number of steps taken by the Microbat debugging tool). PI Debugger completes inferences in around a second for most cases.


Four bugs are further selected for a user study:

16 students are randomly partitioned into two groups and asked to find the root cause of the bugs, one group using PI Debugger, and the other using pdb. On average, the PI Debugger group were 34% faster.

The results show that our technique can identify root causes of a set of real-world bugs in a few steps, much faster than a recent proposal that does not encode human intelligence. It also substantially improves human productivity.

DeepTest: automated testing of deep-neural-network-driven autonomous cars

June 18, 2018

DeepTest: automated testing of deep-neural-network-driven autonomous cars Tian et al., ICSE’18

How do you test a DNN? We’ve seen plenty of examples of adversarial attacks in previous editions of The Morning Paper, but you couldn’t really say that generating adversarial images is enough to give you confidence in the overall behaviour of a model under all operating conditions. Adversarial images approach things from a ‘think like an attacker’ mindset. We want to ‘think like a tester.’ One example is the work on DeepXplore, which uses model ensembles to find differences in outputs that suggest bugs. The importance of testing DNNs is especially obvious when it comes to applications such as autonomous driving. Several of the ideas from DeepXplore are used in DeepTest, which looks specifically at the testing of autonomous driving systems. I think you could apply the DeepTest techniques to test other kinds of DNNs as well.

…despite the tremendous progress, just like traditional software, DNN-based software, including the ones used for autonomous driving, often demonstrate incorrect/unexpected corner-case behaviours that lead to dangerous consequences like a fatal collision.

DeepTest is a system designed to aid in the testing of autonomous driving models. When used to test three of the top performing DNNs from the Udacity self-driving car challenge, it unearthed thousands of erroneous behaviours, many of which could lead to potentially fatal crashes.

Testing challenges in autonomous driving

We’re interested in testing the Deep Neural Network(s) at the core of an autonomous driving system. These take inputs from a variety of sensors, and actuate car systems such as steering, braking, and accelerating.

This paper focuses on the camera input and the steering angle output. We can’t apply traditional measures such as statement coverage to understand how well tested a DNN is, so we need to find an alternate metric. Borrowing from DeepXplore, DeepTest uses a notion of neuron coverage. Another interesting challenge is how you know whether the output of the model is correct in any given scenario. DeepXplore introduced the notion of using an ensemble of models to detect models making unusual predictions for a given input. DeepTest has a neat twist on this, using an ensemble of inputs which should all lead to the same output (e.g. the same road in different weather and visibility conditions) to detect erroneous outputs.

How do you measure test coverage?

The input-output space (i.e., all possible combinations of inputs and outputs) of a complex system like an autonomous vehicle is too large for exhaustive exploration. Therefore, we must devise a systematic way of partitioning the space into different equivalence classes by picking one sample from each of them. In this paper, we leverage neuron coverage as a mechanism for partitioning the input space based on the assumption that all inputs that have similar neuron coverage are part of the same equivalence class (i.e., the target DNN behaves similarly for these inputs).

Neuron coverage is simply the fraction of neurons that are activated across all test inputs. Since all neurons eventually contribute to the output, if we maximise neuron coverage we should also be maximising output diversity. For RNN/LSTM models that incorporate loops, intermediate neurons are unrolled to produce a sequence of outputs, with each neuron in an unrolled layer treated as a separate individual neuron for the purpose of coverage computation.
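To make the definition concrete, here is a minimal sketch of the coverage computation. It assumes we already have each neuron's output for each test input as a plain list, and that a neuron counts as "activated" when its output exceeds a threshold. The function name, the threshold parameter, and the toy data are all illustrative rather than taken from the paper.

```python
def neuron_coverage(activations_per_input, threshold=0.0):
    """Fraction of neurons whose output exceeds `threshold` for at least one input.

    activations_per_input: list of equal-length lists, one per test input,
    holding the output value of every neuron for that input.
    """
    if not activations_per_input or not activations_per_input[0]:
        return 0.0
    num_neurons = len(activations_per_input[0])
    covered = set()
    for activations in activations_per_input:
        for idx, value in enumerate(activations):
            if value > threshold:
                covered.add(idx)
    return len(covered) / num_neurons


# Three hypothetical inputs over a four-neuron network.
toy_activations = [
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0],
]
coverage = neuron_coverage(toy_activations, threshold=0.5)
```

Here neurons 0 and 1 fire for at least one input and neurons 2 and 3 never do, so half of the network is covered.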

To see whether neuron coverage really is a useful metric in practice, experiments are done with three different DNNs:

The models are fed a variety of inputs, and the neuron coverage, steering angle, and steering direction (left or right) outputs are recorded. Spearman rank correlation shows a statistically significant correlation between neuron coverage and steering angle outputs. Different neurons get activated for different outputs, indicating neuron coverage is a good approximation for testing input-output diversity. Steering direction also correlates with neuron coverage.

How can you systematically improve test coverage?

As testers, our goal is to improve the neuron coverage. But how? Synthetic inputs are of limited use, so DeepTest borrows a trick often used to increase diversity in training sets: it applies image transformations to seed images to generate new inputs.

… we investigate nine different realistic image transformations (changing brightness, changing contrast, translation, scaling, horizontal shearing, rotation, blurring, fog effect, and rain effect). These transformations can be classified into three groups: linear, affine, and convolutional.

Starting with 1000 input images, and applying seventy transformations to each (taken from the core set of transformations, but with varying parameters), a set of 70,000 synthetic images is generated. The results show that transforming an image does indeed improve neuron coverage:

And here’s the drill-down by transformation type:

If one transformation is good, what about applying multiple transformations at the same time to generate a synthetic image? The following chart shows the effect on neuron coverage as we successively apply a sequence of transformations to a given seed image. The results indicate that different image transformations tend to activate different sets of neurons.

Our results demonstrate that different image transformations can be stacked together to further increase neuron coverage. However, the state space of all possible combinations of different transformations is too large to explore exhaustively. We provide a neuron-coverage-guided greedy search technique for efficiently finding combinations of image transformation that result in higher coverage.

The algorithm keeps track of transformations that successfully increase neuron coverage for a given image, and prioritises those transformations while generating more synthetic images.
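The flavour of the coverage-guided search can be sketched as follows. This is a simplified, deterministic stand-in rather than the paper's algorithm: it assumes transformations are plain functions, that a hypothetical `neurons_activated` callback returns the set of neurons an image activates, and it uses integers in place of images purely so the example runs on its own.

```python
def guided_search(seed, transformations, neurons_activated, max_steps=20):
    """Greedily apply transformations that activate previously uncovered neurons.

    transformations: dict of name -> function(image) -> image
    neurons_activated: function(image) -> set of activated neuron ids
    Returns (generated_images, covered_neurons).
    """
    covered = set(neurons_activated(seed))
    generated = []
    order = list(transformations)       # successful transformations move to the front
    current = seed
    for _ in range(max_steps):
        progressed = False
        for name in list(order):
            candidate = transformations[name](current)
            newly_covered = neurons_activated(candidate) - covered
            if newly_covered:
                covered |= newly_covered
                generated.append(candidate)
                current = candidate
                order.remove(name)
                order.insert(0, name)   # prioritise what worked
                progressed = True
                break
        if not progressed:
            break                       # no transformation adds coverage any more
    return generated, covered


# Toy stand-ins: "images" are integers and there are five "neurons".
toy_transformations = {
    "plus_one": lambda image: image + 1,
    "double": lambda image: image * 2,
}
generated, covered = guided_search(1, toy_transformations, lambda image: {image % 5})
```

The search stops as soon as no transformation of the current image reaches a new neuron, which is the greedy part of the idea.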

These guided transformations increase coverage across all models, as shown in the following table. Rambo S-3 doesn’t improve very much, but note that it was on 98% coverage to start with!

How do you know whether a model output is correct?

We know how to generate inputs that will increase neuron coverage. But how do we know whether or not those inputs reveal a flaw in the network?

The key insight is that even though it is hard to specify the correct behavior of a self-driving car for every transformed image, one can define relationships between the car’s behaviors across certain types of transformations.

If we change weather or lighting conditions, blur the image, or apply affine transformations with small parameter values (metamorphic relations), we don’t expect the steering angle to change significantly. There is a configurable parameter λ that specifies the acceptable deviation from the original image set results, based on mean-squared-error versus labelled images in the training set.
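One way to read that check, as a rough sketch: compute the mean squared error of the model on the untransformed images against their labels, then flag any transformed image whose squared error against the same label exceeds λ times that baseline. The exact form of the relation is my assumption from the description above, and the function names and steering angles below are made up for illustration.

```python
def mean_squared_error(predicted, expected):
    """Average squared difference between two equal-length sequences."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(predicted)


def find_violations(labels, transformed_preds, baseline_mse, lam):
    """Indices where the transformed prediction's squared error exceeds lam * baseline_mse."""
    limit = lam * baseline_mse
    return [
        i
        for i, (label, pred) in enumerate(zip(labels, transformed_preds))
        if (label - pred) ** 2 > limit
    ]


# Hypothetical steering angles for three frames.
labels = [0.0, 0.1, -0.2]               # human-labelled angles
original_preds = [0.1, 0.1, -0.1]       # model output on the seed images
transformed_preds = [0.05, 0.5, -0.25]  # model output on, say, foggy versions

baseline = mean_squared_error(original_preds, labels)
violations = find_violations(labels, transformed_preds, baseline, lam=3)
```

With these numbers only the second frame is flagged. A larger λ tolerates more drift and so reports fewer (but more severe) erroneous behaviours.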

As we can see in the figure below, the transformed images produce higher error rates – i.e., they are diverging from the non-transformed output behaviours.

Results from testing autonomous driving DNNs

Using this metamorphic-relation-based test, we can look for differences in model outputs caused by the transformations. DeepTest is able to find quite a lot of them!

Here are some sample images from the failing tests. You can see more at


Manual checking reveals two false positives where DeepTest reports erroneous behaviours but the outputs (as assessed by the authors) actually are safe.

Fixing test failures

Retraining the DNNs with some of the synthetic images generated by DeepTest makes them more robust, as shown in the table below.

What about your DNN?

We use domain-specific metamorphic relations to find erroneous behaviors of the DNN without detailed specification. DeepTest can be easily adapted to test other DNN-based systems by customizing the transformations and metamorphic relations. We believe DeepTest is an important first step towards building robust DNN-based systems.

Popular is cheaper: curtailing memory costs in interactive analytics engines

June 15, 2018

Popular is cheaper: curtailing memory costs in interactive analytics engines Ghosh et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

We’re sticking with the optimisation of data analytics today, but at the other end of the spectrum to the work on smart arrays that we looked at yesterday. Getafix (extra points for the Asterix-inspired name, especially as it works with Yahoo!’s Druid cluster) is aimed at reducing the memory costs for large-scale in-memory data analytics, without degrading performance of course. It does this through an intelligent placement strategy that decides on replication level and data placement for data segments based on the changing popularity of those segments over time. Experiments with workloads from Yahoo!’s production Druid cluster show that Getafix can reduce memory footprint by 1.45-2.15x while maintaining comparable average and tail latencies. If you translate that into a public cloud setting, and assume a 100TB hot dataset size (a conservative estimate in the Yahoo! case), we’re looking at savings on the order of $10M per year.

Real-time analytics is projected to grow annually at a rate of 31%. Apart from stream processing engines, which have received much attention, real time analytics now includes the burgeoning area of interactive data analytics engines such as Druid, Redshift, Mesa, Presto, and Pinot.

Such systems typically require sub-second query response times. Yahoo!’s (Oath’s) Druid deployment has over 2000 hosts, stores petabytes of data, and serves millions of queries per day with sub-second latency. To get these response times, queries are served from memory.

Interactive data analytics

Data flows into an interactive data analytics engine via both batch and streaming pipelines. Data from streaming pipelines is collected by real-time nodes which chunk events by time interval and push them into deep storage. Such a chunk is called a segment. Segments are immutable units of data that can be placed on compute nodes (also known as historical nodes). One segment may be replicated across a number of nodes. A coordinator handles data management, creating and placing segment replicas. Clients send queries to a frontend router (broker), which maintains a map of which nodes are currently storing which segments. Queries often access multiple segments.

This paper proposes new intelligent schemes for placement of data segments in interactive analytics engines. The key idea is to exploit the strong evidence that at any given point in time, some data segments are more popular than others.

Let’s look at some of that evidence, based on Yahoo!’s production Druid cluster workloads. Segment accesses show a clear skew, with more recent segments noticeably more popular than older ones. The top 1% of segments are accessed an order of magnitude more than the bottom 40% combined.

However, we can’t rely exclusively on recency. Some older segments continue to stay popular. The following figure shows the level of overlap between segments accessed during an hour of the Yahoo! trace and a reference hour (B1, A3). Traces labeled “A” are from October 2016, “B” traces are from January 2017, and “C” traces from February 2017. Segments from B1 have a 50% chance of co-occurring with segments from A1 that are five months older.

The average query latency for segments depends on the overall cluster size, the query rate, and the replication factor. In the following plot, the replication factor (on the x-axis) is the same for all segments, and the configurations are given as a (cluster size / query rate) pair. For example, the 15 / 2500 line is for a cluster with 15 historical nodes serving 2500 queries per second.

The thing to notice is that each line has a knee after which adding more replicas ceases to help (at 9 replicas for the 15/2500 configuration, and 6 replicas for the other two).

Our goal is to achieve the knee of the curve for individual segments (which is a function of their respective query loads), in an adaptive way.

Segments, balls, and bins

Let’s start with a static problem. Given n historical nodes, and k queries that access a subset of segments, how can we find a segment allocation that minimises both the total runtime and the total number of segment replicas? Assume for the moment that all queries take unit time per segment accessed, historical nodes have no segments loaded initially, and all historical nodes have equal compute power.

For each segment, count the number of distinct queries that access it. The example in the paper has 6 queries accessing four segments between them, with counts { S1:6, S2:3, S3:2, S4:1 } (i.e., segment one is accessed by all six queries, segment two by three of them, and so on). We can think of each query-segment pair as a coloured ball, with the number of balls of that colour being the query count. A picture probably helps here:

Let the historical nodes be bins, and we have a classic coloured-ball bin-packing problem.

The problem is then to place the balls in the bins in a load-balanced way that minimizes the number of “splits” for all colors, i.e., the number of bins each color is present in, summed up across all colors. The number of splits is the same as the total number of segment replicas. Unlike traditional bin packing which is NP-hard, this version of the problem is solvable in polynomial time.

Each coloured ball is a query access, so if each node has the same number of balls, each node will receive the same number of queries. So historical nodes will finish serving queries at the same time, giving the minimum total makespan. There are several ways we could put four balls in each node. To minimise the amount of memory required, we need to minimise the number of different colours at each node (i.e., the number of segments that node needs to keep in memory). This is what is achieved by minimising the number of splits.

The solution to the problem is given by algorithm 1 below:

The algorithm maintains a priority queue of segments, sorted in decreasing order of popularity (i.e., number of queries accessing the segment). The algorithm works iteratively: in each iteration it extracts the next segment from the head of the queue, and allocates the query-segment pairs corresponding to that segment to a historical node, selected based on a heuristic called ChooseHistoricalNode…

ChooseHistoricalNode works on a ‘best fit’ basis: it picks the node that has the least number of ball slots remaining after allocating all of the queries from the chosen segment to it. If no nodes have sufficient capacity, then the node with the largest available capacity is chosen. The leftovers, i.e., the unassigned queries (balls) are then put back into the sorted queue.
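Putting the queue and the best-fit heuristic together, a minimal sketch of the static allocation might look like this. It is my own reading of the description rather than the paper's Algorithm 1 verbatim: the function name is made up, each node's capacity is assumed to be the total number of balls divided evenly (rounded up), and ties are broken by node index.

```python
import heapq
import math


def allocate_segments(query_counts, num_nodes):
    """Best-fit placement of query-segment pairs (balls) onto historical nodes (bins).

    query_counts: dict of segment -> number of queries accessing it
    Returns one dict per node mapping segment -> balls placed on that node.
    """
    total = sum(query_counts.values())
    capacity = math.ceil(total / num_nodes)   # assumed per-node ball capacity
    free = [capacity] * num_nodes
    placement = [dict() for _ in range(num_nodes)]
    # Max-heap on popularity, emulated by negating the counts.
    heap = [(-count, seg) for seg, count in query_counts.items() if count > 0]
    heapq.heapify(heap)
    while heap:
        neg_count, seg = heapq.heappop(heap)
        count = -neg_count
        fitting = [n for n in range(num_nodes) if free[n] >= count]
        if fitting:
            # Best fit: the node left with the fewest free slots afterwards.
            node = min(fitting, key=lambda n: free[n] - count)
            placed = count
        else:
            # Nothing fits: fill the emptiest node and requeue the leftovers.
            node = max(range(num_nodes), key=lambda n: free[n])
            placed = free[node]
        placement[node][seg] = placement[node].get(seg, 0) + placed
        free[node] -= placed
        if count > placed:
            heapq.heappush(heap, (-(count - placed), seg))
    return placement


placement = allocate_segments({"S1": 6, "S2": 3, "S3": 2, "S4": 1}, num_nodes=3)
total_replicas = sum(len(node) for node in placement)
```

For the running example with three nodes, every node ends up with four balls, and S1 is the only segment that gets split (across two nodes), for five replicas in total.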

Extensions to the basic model

After each run of the algorithm above, a further balancing step takes place which looks to balance the number of segments (not balls) assigned to each node. This helps to minimise the maximum memory used by any historical node in the system. Let the segment load of a historical node be the number of segments assigned to it. Start with all the historical nodes that have a higher than average segment load. For those nodes, find the k least popular replicas on that node (where k is the difference between the given node’s segment load and the average). Add those replicas to a global re-assign list.

Now work through the list, and assign each replica to a historical node H such that:

  1. H does not already have a replica of the segment
  2. The query load imbalance after the re-assignment will be less than or equal to a configurable parameter γ (γ = 20% works well in practice). Query load imbalance is defined as 1 – (min query load across nodes / max query load across nodes).
  3. H has the least segment load of all historical nodes meeting criteria one and two.
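The three criteria can be sketched as a small selection function. This assumes the imbalance is measured across all historical nodes, and that the node state passed in already reflects the replica having been removed from its overloaded source node. The data layout and names are hypothetical.

```python
def query_load_imbalance(query_loads):
    """1 - (min load / max load) across nodes; 0.0 means perfectly balanced."""
    heaviest = max(query_loads)
    if heaviest == 0:
        return 0.0
    return 1 - min(query_loads) / heaviest


def pick_target(segment, replica_load, nodes, gamma=0.2):
    """Choose the node to receive a replica, or None if no node qualifies.

    nodes: dict of name -> {"segments": set of segments, "query_load": number}
    """
    eligible = []
    for name, node in nodes.items():
        if segment in node["segments"]:
            continue                                  # criterion 1: no duplicate replica
        loads = [
            other["query_load"] + (replica_load if other_name == name else 0)
            for other_name, other in nodes.items()
        ]
        if query_load_imbalance(loads) <= gamma:      # criterion 2: stay within gamma
            eligible.append(name)
    if not eligible:
        return None
    # Criterion 3: least segment load among the eligible nodes.
    return min(eligible, key=lambda name: len(nodes[name]["segments"]))


toy_nodes = {
    "HN1": {"segments": {"S1", "S2"}, "query_load": 10},
    "HN2": {"segments": {"S3"}, "query_load": 9},
    "HN3": {"segments": {"S4", "S5", "S6"}, "query_load": 9},
}
target = pick_target("S2", replica_load=1, nodes=toy_nodes)
```

Both HN2 and HN3 could take the replica without breaching γ, and HN2 wins because it holds fewer segments.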

If the cluster is not homogeneous, the algorithm can be further modified to distribute query load proportionally among nodes based on their estimated compute capacities. Node capacity is estimated by calculating the CPU time spent in processing queries. This approach has the nice side effect of automatically dealing with stragglers, which will tend to report lower CPU time in the sampling window (due to slow disk, flaky NIC, and other causes of waiting). The lesser reported capacity will then ensure popular segments are not assigned to these nodes.

Load balancing

Query routing decides which historical nodes hosting a replica of a segment a query should be routed to. Load based query routing was found to work the best. In this scheme, each broker keeps an estimate of every historical node’s current load (using the number of open connections it has to it as a proxy). An incoming query is routed to the historical node with the lowest load, out of those that host the desired segment.
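In code, the broker-side decision is little more than a minimum over the candidate hosts. A minimal sketch, with the routing table and connection counts represented as plain dictionaries (both hypothetical structures, not Druid's actual API):

```python
def route_query(segment, segment_hosts, open_connections):
    """Pick the least-loaded historical node among those hosting `segment`.

    segment_hosts: dict of segment -> list of node names holding a replica
    open_connections: dict of node name -> open connections (the load proxy)
    """
    hosts = segment_hosts.get(segment, [])
    if not hosts:
        raise LookupError(f"no historical node hosts segment {segment!r}")
    return min(hosts, key=lambda node: open_connections.get(node, 0))


segment_hosts = {"S1": ["HN1", "HN2", "HN3"], "S2": ["HN3"]}
open_connections = {"HN1": 7, "HN2": 3, "HN3": 5}
chosen_for_s1 = route_query("S1", segment_hosts, open_connections)
chosen_for_s2 = route_query("S2", segment_hosts, open_connections)
```

S1 goes to HN2, the least busy of its three hosts, while S2 has only one replica and so must go to HN3 regardless of load.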

Making it dynamic

Getafix handles dynamically arriving segments as well as queries. The overall system looks like this:

The static solution given above is run in periodic rounds. At the end of each round the query load statistics are gathered, the algorithm is run, and a segment placement plan produced. Then we can look at the delta between the new plan and the current configuration, and start moving things around. The coordinator calculates segment popularity via an exponentially weighted moving average, based on the total access time for each segment in each round. The current implementation sets the round duration to 5 seconds (seems very short to me!), “which allows us to catch popularity changes early, but not react too aggressively.”
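The popularity update at the end of each round can be sketched as a standard exponentially weighted moving average. The smoothing factor `alpha` below is an assumed parameter (the post does not give the value Getafix uses), and the function name is made up.

```python
def update_popularity(previous, access_time_this_round, alpha=0.5):
    """Blend this round's per-segment access time into the running popularity.

    previous: dict of segment -> popularity after the last round
    access_time_this_round: dict of segment -> total access time in this round
    alpha: weight given to the newest round (assumed value)
    """
    segments = set(previous) | set(access_time_this_round)
    return {
        seg: alpha * access_time_this_round.get(seg, 0.0)
        + (1 - alpha) * previous.get(seg, 0.0)
        for seg in segments
    }


popularity = update_popularity(
    previous={"S1": 8.0, "S2": 2.0},
    access_time_this_round={"S1": 0.0, "S3": 4.0},
)
```

A segment that goes quiet (S1 here) decays gradually rather than dropping to zero, which is what lets the planner notice popularity changes early without reacting too aggressively.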

Whereas today sysadmins typically manually configure clusters into tiers based on their hardware characteristics, Getafix’s capacity-aware solution essentially automatically discovers tiers, moving popular replicas to powerful historical nodes. This auto-tiering is unaware of network bandwidth constraints though.

There’s one more trick up the sleeve to try and minimise data movement. Consider the following scenario, with current HN assignments in the top row, and the assignments produced by the planner for the next round in the bottom row.

If we do the straightforward thing, and make HN1 host the segments in E1, HN2 host the segments in E2, and HN3 host the segments in E3, we have to move 3 segments in total. But if instead we juggle the mapping of expected (E) packings to nodes such that E1 is mapped to HN3, E2 is mapped to HN2, and E3 is mapped to HN1 then we only have to move two segments. The Hungarian Algorithm is used to find the minimum matching. It has complexity O(n^3), but that’s ok because n never gets above a few hundred nodes. What remains unclear to me is how this transfer minimisation step meshes with the capacity aware placement when nodes are not homogeneous. I.e., the algorithm has figured out HN1 is powerful, and wants to put hot segments there, but the network transfer step shuffles things so that they end up on a (possibly less powerful) HN3.
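To illustrate the matching step, here is a small sketch that finds the cheapest mapping of planned packings to nodes. It brute-forces every permutation instead of running the Hungarian Algorithm, which is only sensible for a handful of nodes, and the segment sets are invented so that the numbers line up with the scenario described (3 moves for the straightforward mapping, 2 for the best one).

```python
from itertools import permutations


def segments_to_move(current, planned, mapping):
    """Segments that must be fetched when node i hosts planned[mapping[i]]."""
    return sum(len(planned[mapping[i]] - current[i]) for i in range(len(current)))


def best_mapping(current, planned):
    """Exhaustive search for the mapping with the fewest transfers.

    A stand-in for the Hungarian Algorithm; fine for tiny inputs only.
    """
    return min(
        permutations(range(len(planned))),
        key=lambda mapping: segments_to_move(current, planned, mapping),
    )


# Hypothetical current contents of HN1, HN2, HN3 ...
current = [{"S1", "S2"}, {"S3", "S4"}, {"S5", "S6"}]
# ... and hypothetical planned packings E1, E2, E3.
planned = [{"S1", "S5"}, {"S3", "S7"}, {"S2"}]

naive_cost = segments_to_move(current, planned, (0, 1, 2))
mapping = best_mapping(current, planned)
best_cost = segments_to_move(current, planned, mapping)
```

The winning mapping `(2, 1, 0)` reads as: HN1 hosts E3, HN2 hosts E2, and HN3 hosts E1.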


I’m way over target length today trying to explain the core ideas of Getafix, so I’ll give just the briefest of highlights from the evaluation:

  • Compared to the best existing strategy (from a system called Scarlett) Getafix uses 1.45-2.15x less memory, while minimally affecting makespan and query latency.
  • Compared to the commonly used uniform replication strategy (as used by Druid today), Getafix improves average query latency by 44-55%, while using 4-10x less memory.
  • The capacity-aware version of best fit improves tail query latency by 54% when 10% of nodes are slow, and by 17-22% when there is a mix of nodes in the cluster. In this case total memory is also reduced by 17-27%. A heterogeneous cluster is automatically tiered with an accuracy of 75%.

See section 5 in the paper for the full details!

Analytics with smart arrays: adaptive and efficient language-independent data

June 14, 2018

Analytics with smart arrays: adaptive and efficient language-independent data Psaroudakis et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

We’re going lower-level today, with a look at some work on adaptive data structures by Oracle. It’s motivated by a desire to speed up big data analytic workloads that are “increasingly limited by simple bottlenecks within the machine.” The initial focus is on array processing, but the ambition is to extend the work to more data types in time.

Modern servers have multiple interconnected sockets of multi-core processors. Each socket has local memory, accessible via a cache-coherent non-uniform memory access (ccNUMA) architecture. In the NUMA world the following hold true:

  • remote memory accesses are slower than local accesses
  • bandwidth to a socket’s memory and interconnect can be separately saturated
  • the bandwidth of an interconnect is often much lower than a socket’s local memory bandwidth

If we want to crunch through an array as fast as possible in a NUMA world, the optimum way of doing it depends on the details of the machine, and on the application access patterns. Consider a 2-socket NUMA machine. The figure below shows four possible arrangements:

In (a) we place the array on a single socket, and access it from threads on both sockets. The bottleneck here will be the socket’s memory bandwidth. In (b) the array is interleaved across both sockets, and the bottleneck becomes the interconnect. In (c) the array is replicated, using more memory space but removing the interconnect as a bottleneck. In (d) the array’s contents are also compressed to reduce the memory usage resulting from replication. For this particular application, combination (d) is the fastest, but that won’t hold for all applications and all machines.

Smart arrays provide multiple implementations of the same array interface, so that trade-offs can be made between the use of different resources. In particular, they support multiple data placement options, in combination with bit compression. The selection of the best configuration can be automated. The selection algorithm takes as input:

  1. A specification of the target machine, including the size of the memory, the maximum bandwidth between components, and the maximum compute available on each core.
  2. The performance characteristics of the arrays, such as the costs of accessing a compressed item. This information is obtained from performance counters. It is specific to the array and the machine, but not the workload.
  3. Information from hardware performance counters describing the memory, bandwidth, and processor utilisation of the workload.

Smart arrays can significantly decrease the memory space requirements of analytics workloads, and improve their performance by up to 4x. Smart arrays are the first step towards general smart collections with various smart functionalities that enable the consumption of hardware resources to be traded-off against one another.

Configuration options (aka ‘smart functionalities’)

There are four data placement options to choose from:

  • OS default just uses the OS default placement policy. On Linux, this means physically allocating a virtual memory page on the socket on which the thread that first touches it is running.
  • Single socket allocates the array’s memory pages on a specified socket
  • Interleaved allocates the array’s memory pages across sockets in a round-robin fashion
  • Replicated places one replica of the array on each socket

In addition, we may opt to use bit compression for the array.

Bit compression is a light-weight compression technique that is popular for many analytics workloads such as column-store database systems. Bit compression uses less than 64 bits for storing integers that require fewer bits.

Bit compression increases the number of values per second that can be loaded through a given bandwidth. On the flip side, it increases the CPU instruction footprint. This can hurt performance compared to using uncompressed elements. However, if we’re hitting a memory bandwidth bottleneck, the compressed array can still be faster.
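As a rough illustration of the trade-off, the sketch below packs integers using only as many bits as the largest value needs. It uses a single Python integer as the backing store for simplicity, whereas a real implementation would pack into 64-bit words; the function names are made up. The shift and mask in `get` are the extra instructions that compression adds to every access.

```python
def bits_needed(values):
    """Smallest width that can hold every value (values must be non-empty and >= 0)."""
    return max(1, max(values).bit_length())


def pack(values, bits):
    """Pack non-negative integers side by side, `bits` bits per element."""
    packed = 0
    for index, value in enumerate(values):
        if value < 0 or value >= (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        packed |= value << (index * bits)
    return packed


def get(packed, index, bits):
    """Read element `index`: shift it down, then mask off its neighbours."""
    return (packed >> (index * bits)) & ((1 << bits) - 1)


values = [5, 300, 17, 1023]
width = bits_needed(values)
packed = pack(values, width)
unpacked = [get(packed, i, width) for i in range(len(values))]
```

These four values fit in 10 bits each, so the array occupies 40 bits instead of 4 × 64, at the cost of a shift and a mask per read.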

Assessing the impact of different configurations on a variety of workloads

We’ll briefly touch on implementation details later, but for now it suffices to know that Smart Arrays are implemented in C++ with efficient access from Java. Experiments are run with an aggregation workload, and a variety of graph analytics workloads.

The aggregation workload involves a parallel element-wise summation of two 4GB arrays of 64-bit integers. The experiment is conducted both from C++ and from Java, on an 8-core machine and on an 18-core machine. The results are shown in the following chart.

On the 8-core machine the interleaved placement is worst, since the limited bandwidth of the interconnect is less than a socket’s bandwidth. The single socket placement does better, exploiting the socket’s memory bandwidth, but the replicated placement is best of all, reducing the time by 2x.

The 18-core machine has much higher interconnect bandwidth. This is enough to make interleaving perform better than the single socket version, and replication only offers a slight advantage.

In this case, bit compression reduces the overall time by up to 4x compared to the OS default placement, or by up to 2x for the other data placements.

For graph analytics, a degree centrality computation and PageRank are used for the evaluation. With degree centrality on the 8-core machine, replication again proves to be the best choice. The results are similar to the aggregation experiment for the 18-core case as well.

With PageRank, replication on the 8-core machine gives up to a 2x improvement, but is not especially effective with 18 cores. Bit compression does not improve performance, but compressing both vertices and edges (V+E) can reduce memory space requirements by around 21%.

Automating the selection of the best configuration

As we observe in the experimental evaluation, depending on the machine, the algorithm, and the input data, the cost, benefit, and availability of the optimisations can vary.

The following table summarises the trade-offs:

Rather than having to guess the best trade-offs by hand, the authors built an automated system for determining the best configuration, using the inputs we saw earlier in this post.

There are two decision flow charts, one to use when selecting candidates without compression, and one to use when also including compression.

Following these decision charts, we will end up with two candidates: one with compression and one without. Now we have to decide which one of these two we should ultimately go with. We add back in to the compression profile estimate the additional compute required to perform compression based on the number of accesses per second and the cost per access. For each placement we compute two ratios per socket:

  1. The ratio of the maximum compute rate relative to the current rate
  2. The ratio of the maximum memory bandwidth for each candidate placement relative to the current bandwidth – this gives the socket speedup assuming the workload is not compute-bound.

The minimum of these two ratios is then taken as the estimated speedup of each socket, and the average across all sockets is taken as the estimated speedup given by the configuration. Pick the candidate configuration with the best estimated speedup!
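The final comparison can be sketched in a few lines. The field names and the numbers are hypothetical; the compressed candidate's current compute rate is assumed to already include the added cost of compression.

```python
def estimated_speedup(sockets):
    """Average over sockets of min(compute headroom, bandwidth headroom).

    sockets: list of dicts with current and maximum compute rate and bandwidth.
    """
    per_socket = [
        min(s["max_compute"] / s["compute"], s["max_bandwidth"] / s["bandwidth"])
        for s in sockets
    ]
    return sum(per_socket) / len(per_socket)


# Hypothetical profiles for a 2-socket machine.
candidates = {
    "replicated": [
        {"compute": 50, "max_compute": 100, "bandwidth": 20, "max_bandwidth": 80},
        {"compute": 50, "max_compute": 100, "bandwidth": 20, "max_bandwidth": 80},
    ],
    "replicated+compressed": [
        {"compute": 80, "max_compute": 100, "bandwidth": 10, "max_bandwidth": 80},
        {"compute": 80, "max_compute": 100, "bandwidth": 10, "max_bandwidth": 80},
    ],
}
speedups = {name: estimated_speedup(sockets) for name, sockets in candidates.items()}
best = max(speedups, key=speedups.get)
```

In this made-up case compression halves the bandwidth demand but leaves so little compute headroom that the uncompressed candidate is the better pick.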

The selection algorithm was evaluated using the aggregation and degree centrality workloads where the correct (best overall) placement was chosen 30 times out of 32.

The average performance of the selected configuration for each benchmark and hardware configuration pairing was 0.2% worse than the optimal configuration for that pairing, and 11.7% better than the best static configuration.

Implementation details

There’s quite a bit of detail in the paper, which I’ve skipped over entirely, concerning the C++ implementation and the way it is made efficiently accessible from Java. Here are the highlights:

  • The implementation of the C++ library uses the Callisto runtime system, which supports parallel loops with dynamic distribution of loop iteration and includes a basic Java library to express loops.
  • The GraalVM is used to create language-independent (Java and C++) smart arrays and use them efficiently from Java. The GraalVM (see ‘One VM to rule them all’) is a modified version of the Java HotSpot VM which adds Graal, a dynamic compiler implemented in Java, and Truffle, a framework for building high-performance language implementations.

The pieces all come together like this:

Future directions

  • More data types including sets, bags, and maps. Smart arrays can be used to implement data layouts for these by encoding binary trees into arrays. It’s also possible to trade size against performance by using hashing instead of trees to index smart arrays.
  • Additional smart functions including randomisation (fine-grained index-remapping of elements) to ensure that “hot” nearby data items are mapped to different locations. Also up for discussion are additional data placement techniques using domain specific knowledge, and alternative compression techniques which may achieve higher compression rates on some kinds of data.
  • Runtime adaption, supporting runtime selection and re-selection of the best configuration based on the current workload.