A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments

September 25, 2017

A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments Dmitriev et al., KDD 2017

Pure Gold! Here we have twelve wonderful lessons in how to avoid expensive mistakes in companies that are trying their best to be data-driven. A huge thank you to the team from Microsoft for sharing their hard-won experiences with us.

In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment’s outcome, which if deployed could hurt the business by millions of dollars.

Warning: I definitely exceeded my usual length target with this post – there was just too much good stuff that I didn’t want to leave out. I hope you find the extra time it takes to read through worth it!

Before we dive into the details, there are a couple of interesting data points about experimentation at Microsoft I wanted to draw your attention to:

  • Microsoft have an experimentation system used across Bing, MSN, Cortana, Skype, Office, Xbox, Edge, Visual Studio, and so on, that runs thousands of experiments a year.
  • This isn’t just fiddling on the margins – “At Microsoft, it is not uncommon to see experiments that impact annual revenue by millions of dollars, sometimes tens of millions of dollars.”

If we’re going to talk about metric interpretation then first we’d better have some metrics. Section 4 in the paper has some excellent guidelines concerning the types of metrics you should be collecting. You’re going to need a collection of metrics such as these in order to diagnose and avoid the pitfalls.

Let’s talk about metrics

Most teams at Microsoft compute hundreds of metrics to analyze the results of an experiment. While only a handful of these metrics are used to make a ship decision, we need the rest of the metrics to make sure we are making the correct decision.

There are four primary classes of metrics that Microsoft collects:

  1. Data Quality Metrics. These help to determine whether an experiment was correctly configured and run, such that the results can be trusted. An important example is the ratio of the number of users in the treatment to the number of users in the control. You need this metric to avoid the first pitfall…
  2. Overall Evaluation Criteria (OEC) Metrics. The leading metrics that determine whether the experiment was successful. These are metrics that can both be measured during the short duration of an experiment, and are also indicative of long term business value and user satisfaction. In the ideal case, a product has just one OEC metric.
  3. Guardrail Metrics. “In addition to OEC metrics, we have found that there is a set of metrics which are not clearly indicative of success of the feature being tested, but which we do not want to significantly harm when making a ship decision.” These are the guardrail metrics – on MSN and Bing for example, page load time is a guardrail metric.
  4. Local Feature and Diagnostic Metrics. These measure the usage and functionality of individual features of a product. For example, click-through rates for individual elements on a web page. These metrics often help to diagnose / explain where OEC movement (or lack of it) is coming from. They can also reveal situations where improvements in one area harm other areas.

Pitfall #1: Metric sample ratio mismatch

Here’s a puzzler for you:

We ran an experiment on the homepage where, when users clicked on a link, in treatment the destination page opened in a new browser tab while in control it opened in the same tab. The results showed 8.32% increase in Page Load Time (PLT) of the homepage. How could this one line JavaScript change cause such a large performance degradation?

The PLT metric is simply the average load time for all pages in a variant (control or treatment), i.e., sum of all page load times for a variant, divided by number of page loads in the variant. (I know you know what average means, there’s a reason I’m spelling it out!). This gives us a clue that the number of page loads matters. With results opening in the same tab, users often use the back button to go back to the results page – this results in more page loads (versus keeping the results tab open in the treatment), and those loads are faster due to the browser cache.

Clearly, the set of page loads over which the metric was computed were different between the treatment and the control. In a situation like this the metric value cannot be trusted.

This is a sample ratio mismatch, and if it happens all bets are off. Without being aware of this you can make misinformed decisions.

…it is critical that the experimentation system automatically detects sample ratio mismatches and warns the user about them… A good start is to look separately at the numerator and denominator of the metric.
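
Here is a minimal sketch of the "look separately at the numerator and denominator" advice. The totals are hypothetical numbers made up for illustration (not from the paper), the experiment is assumed to split traffic 50/50, and the one-degree-of-freedom chi-squared test on the page-load counts is a common choice of mismatch check rather than necessarily what the Microsoft system uses.

```python
import math

def ratio_mismatch_p_value(count_treatment, count_control, expected_share=0.5):
    """Chi-squared goodness-of-fit test (1 degree of freedom) on two counts.

    expected_share is the fraction of the total we expect to see in treatment.
    """
    total = count_treatment + count_control
    expected_t = total * expected_share
    expected_c = total - expected_t
    chi2 = ((count_treatment - expected_t) ** 2 / expected_t
            + (count_control - expected_c) ** 2 / expected_c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical totals per variant: numerator and denominator of the PLT metric.
control = {"load_time_sum": 1_500_000.0, "page_loads": 1_000_000}
treatment = {"load_time_sum": 1_430_000.0, "page_loads": 880_000}

plt_control = control["load_time_sum"] / control["page_loads"]
plt_treatment = treatment["load_time_sum"] / treatment["page_loads"]
p = ratio_mismatch_p_value(treatment["page_loads"], control["page_loads"])

print(f"PLT control={plt_control:.3f}s treatment={plt_treatment:.3f}s")
print(f"page-load count mismatch p-value={p:.3g}")
if p < 0.001:
    print("Denominator differs between variants: do not trust the PLT delta.")
```

In this made-up data the total load time actually went down in treatment. The average only went up because the denominator lost a lot of fast, cached page loads.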

Pitfall #2: Misinterpretation of ratio metrics

On the MSN homepage, a module located close to the bottom of the main page was moved to a higher position on the page so that less scrolling was needed to reach it. The click-through rate (CTR) of the module decreased by 40%! That doesn’t sound right!

Take a look at the metric:

If you think about it for a moment, you’ll realise that a lot more users are seeing the module. In fact, both the number of times the module was seen, and the number of clicks on the module went up. It’s just that the former went up more than the latter. Overall page CTR stayed flat, but revenue actually increased because the promoted module was more monetizable than those that were pushed down to make room for it. So a 40% decrease in CTR turned out to be a good thing!

Ratio metrics can be misleading or potentially invalid if the denominator of the metric changes.

The authors’ preferred way to compute ratio metrics is as the average of ratios (as in the example metric in this section). This tends to give higher sensitivity and greater resilience to outliers.

To detect denominator mismatch in ratio metrics, we recommend to always define count metrics for the numerator and the denominator, and provide those in the result alongside the ratio metric.
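
To make that recommendation concrete, here is a small sketch that reports the numerator and denominator counts next to the ratio metric. The per-user (clicks, impressions) pairs are invented for illustration, and `ctr_metrics` is a hypothetical helper name, not something from the paper.

```python
def ctr_metrics(users):
    """users: list of (clicks, impressions) pairs, one pair per user."""
    seen = [(c, i) for c, i in users if i > 0]  # only users who saw the module
    total_clicks = sum(c for c, _ in seen)
    total_impressions = sum(i for _, i in seen)
    return {
        "clicks": total_clicks,            # numerator count metric
        "impressions": total_impressions,  # denominator count metric
        "ctr_avg_of_ratios": sum(c / i for c, i in seen) / len(seen),
        "ctr_ratio_of_sums": total_clicks / total_impressions,
    }

# Control: the module sits low on the page, so few users ever see it.
control = ctr_metrics([(1, 2), (0, 0), (0, 0), (1, 2)])
# Treatment: the module moved up, so everyone sees it, and sees it more often.
treatment = ctr_metrics([(1, 4), (1, 4), (1, 2), (0, 2)])

for name, metrics in (("control", control), ("treatment", treatment)):
    print(name, metrics)
```

With these numbers both clicks and impressions go up in treatment while CTR halves, which is the shape of the MSN example: the ratio alone looks like a regression, but the counts beside it tell a different story.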

Pitfall #3: Telemetry loss bias

Skype experimented with a protocol change for delivering push notifications. The result showed strong statistically significant changes in some call-related metrics such as fraction of successfully connected attempted calls. But the notification mechanism has absolutely nothing to do with calls!

The push notifications were waking up the device, allowing it enough time to check whether it was on wifi, and if so prepare a telemetry batch and send it over. Server-side metrics were unchanged (total capture), but more client-side metrics were being delivered.

Since the new events that make it (or events that are lost) are typically a highly biased set, telemetry loss bias practically invalidates any metrics that are based on the affected events… Therefore, for every client event used in metrics, a Data Quality metric measuring the rate of the event loss should be created.

The simplest way is to compare to a corresponding server event if you have one. In the absence of that you can assign sequence numbers client side and look for gaps in the sequence.
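
The sequence-number idea can be sketched in a few lines. This assumes each client numbers its events 1, 2, 3, … with no resets; `event_loss_rate` and the sample data are hypothetical, and events lost after the highest number that did arrive are invisible to this estimate.

```python
def event_loss_rate(received_sequence_numbers):
    """Estimate the fraction of client events that never arrived.

    Assumes the client numbers its events 1, 2, 3, ... without resets.
    """
    received = set(received_sequence_numbers)
    if not received:
        return 0.0
    expected = max(received)  # everything up to the highest number was sent
    return (expected - len(received)) / expected

# Hypothetical sequence numbers that reached the server, per variant.
control_loss = event_loss_rate([1, 2, 4, 5, 8])
treatment_loss = event_loss_rate([1, 2, 3, 4, 5, 6, 8])

print(f"control loss={control_loss:.3f} treatment loss={treatment_loss:.3f}")
```

If the loss rate differs between variants, as it does here, any metric built on those client events is comparing two differently biased samples.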

Pitfall #4: Assuming underpowered metrics had no change

In one experiment, page views went up by 0.5% (that’s a meaningful impact on the business), but the associated p-value was not statistically significant. Should we assume no true impact on page views? An investigation revealed that the experiment did not have enough power for the metric.

Power is the probability of rejecting the null hypothesis given that the alternative hypothesis is true. Having a higher power implies that an experiment is more capable of detecting a statistically significant change when there is indeed a change.

Before an experiment you should conduct an a priori power analysis to estimate sufficient sample sizes – at the very least for OEC and Guardrail metrics. Often the sample size will be the same for typical experiments on a given product – e.g., for Bing experiments in the US the rule of thumb is at least 10% of users for one week.

We recommend to have at least 80% power to detect small enough changes in success and guardrail metrics to properly evaluate the outcome of an experiment.
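
For a feel of what an a priori power analysis involves, here is a sketch using the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal variance and equal group sizes. That formula is a textbook choice of mine, not something the paper prescribes, and the metric mean and standard deviation below are made up.

```python
import math
from statistics import NormalDist

def users_per_variant(std_dev, min_detectable_delta, alpha=0.05, power=0.8):
    """Approximate users needed in each variant to detect an absolute delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / min_detectable_delta) ** 2
    return math.ceil(n)

# Hypothetical metric: 20 page views per user on average, std dev of 30.
mean_page_views = 20.0
delta = 0.005 * mean_page_views  # a 0.5% relative change
print(users_per_variant(std_dev=30.0, min_detectable_delta=delta))
```

The required sample size grows with the square of the ratio between the metric's standard deviation and the change you want to detect, which is why small relative changes on noisy metrics need so much traffic.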

Pitfall #5: Claiming success with a borderline p-value

An experiment gave a statistically significant positive increase in one of the OEC metrics, with a p-value of 0.029. Shall we celebrate and ship it?

Whenever key metrics move in a positive direction we always run a certification flight which tries to replicate the results of the experiment by performing an independent run of the same experiment.

Even in A/A experiments the behaviour of metrics can be seen flipping between statistically significant and not. (Clearly they aren’t, by construction!) Always re-run experiments which have borderline p-values. If the effects continue to be borderline and the traffic can’t be increased, Fisher’s method can be used to obtain a more reliable conclusion.
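
Fisher’s method combines the p-values from independent runs into a single one. A minimal standard-library sketch follows; it relies on the closed form of the chi-squared survival function for an even number of degrees of freedom, and it assumes the runs are independent and every p-value is strictly positive. The second p-value in the example is invented.

```python
import math

def fisher_combined_p_value(p_values):
    """Combine independent p-values with Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # Under the null, statistic ~ chi-squared with 2k degrees of freedom.
    # For even degrees of freedom the survival function has a closed form:
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# The original borderline result plus a hypothetical replication run.
print(fisher_combined_p_value([0.029, 0.06]))
```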

Pitfall #6: Continuous monitoring and early stopping

You’ve probably heard about this one… an experiment is showing statistically significant results already, before the scheduled finishing time, can we stop it? Likewise, an experiment isn’t showing the effect we hoped for, can we keep running it for longer just in case? No, and no.

In our experience, stopping early or extending the experiment are both very common mistakes experiment owners make. These mistakes are subtle enough to even make it into recommended practices in some A/B testing books. The issue is exacerbated by the fact that in practice continuous experiment monitoring is essentially a requirement to be able to shut down the experiment quickly in case a strong degradation in user experience is observed.

You can either (a) train experiment owners not to do this, (b) adjust p-values to account for extra checking, or (c) use a better test than p-values. For example, a Bayesian framework naturally allows for continuous monitoring and early stopping.
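
If you want to convince yourself (or an experiment owner) that peeking is a problem, a small A/A simulation helps. The sketch below is my own illustration, not from the paper: both arms draw from the same unit-variance normal distribution, and we compare checking significance only at the scheduled end against stopping at the first look that appears significant.

```python
import math
import random

def z_test_significant(sum_a, sum_b, n, z_crit=1.96):
    """Two-sample z-test on unit-variance data with n observations per arm."""
    z = (sum_a - sum_b) / math.sqrt(2 * n)
    return abs(z) > z_crit

def false_positive_rates(experiments=400, looks=10, batch=50, seed=1):
    """Return (rate when testing once at the end, rate when peeking at every look)."""
    rng = random.Random(seed)
    flagged_final = flagged_any_look = 0
    for _ in range(experiments):
        sum_a = sum_b = 0.0
        n = 0
        significant_at_some_look = False
        for _ in range(looks):
            for _ in range(batch):
                sum_a += rng.gauss(0, 1)  # A/A test: both arms are identical
                sum_b += rng.gauss(0, 1)
            n += batch
            significant = z_test_significant(sum_a, sum_b, n)
            significant_at_some_look = significant_at_some_look or significant
        flagged_final += significant  # the look at the scheduled end
        flagged_any_look += significant_at_some_look
    return flagged_final / experiments, flagged_any_look / experiments

final_only, with_peeking = false_positive_rates()
print(f"test once at the end: {final_only:.3f}")
print(f"stop at first significant look: {with_peeking:.3f}")
```

The end-of-experiment rate should hover around the nominal 5%, while the peeking rate should come out noticeably higher, even though there is no real effect in any of these experiments.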

Pitfall #7: Assuming the metric movement is homogeneous

Sometimes treatments have differing effects on different subsets of the treatment groups (e.g., users in different countries, or the feature not working correctly in certain browsers). If you’re not aware of such effects you may end up shipping a feature that improves user experience overall, but substantially hurts experience for a group of users.

Due to a large number of metrics and segments that need to be examined in order to check whether there is a heterogeneous treatment effect, it’s pretty much impossible to reliably detect such effects via a manual analysis.

The Microsoft system automatically analyses every experiment scorecard and warns if heterogeneous treatment effects are found.

Pitfall #8: Segment (mis)interpretation

Segmenting users is pretty common, but interpreting metric movements on segments needs to be done with extra care. You may well have heard of Simpson’s paradox, which shows, for example, how a metric can go up in every individual segment but still go down overall. The condition used for defining a segment must not be impacted by the treatment.

If a statistically significant difference is observed (in sample ratios), the results for that segment group, and often for all other groups in the segment, are invalid and should be ignored. A common way to run into Simpson’s paradox is to keep recursively segmenting the users until a statistically significant difference is found.

Recursive segmentation is also an instance of the multiple comparisons problem.
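
Simpson’s paradox is easier to believe with numbers in front of you. The counts below are invented for illustration: the success rate goes up in both segments under treatment, yet goes down overall, because the treatment changed which segment users land in.

```python
# (successes, users) per segment; hypothetical numbers chosen for illustration.
control = {"A": (90, 100), "B": (100, 1000)}
treatment = {"A": (19, 20), "B": (130, 1080)}

def rate(successes, users):
    return successes / users

def overall(variant):
    """Success rate over all segments of one variant pooled together."""
    return rate(sum(s for s, _ in variant.values()),
                sum(u for _, u in variant.values()))

for segment in control:
    print(segment,
          f"control={rate(*control[segment]):.3f}",
          f"treatment={rate(*treatment[segment]):.3f}")
print("overall",
      f"control={overall(control):.3f}",
      f"treatment={overall(treatment):.3f}")
```

Note that segment A has 100 users in control but only 20 in treatment. That per-segment sample ratio mismatch is the warning sign that the segment condition was affected by the treatment.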

Pitfall #9: Impact of outliers

On the homepage, the number of images in the slideshow at the top of the page was increased. The result showed a significant decrease in engagement, the opposite of what was expected. It turned out that the experiment increased engagement so much that real users were being labelled as bots!

Outliers can skew metric values and increase the variance of the metric making it more difficult to obtain statistically significant results. Because of this, it is common to apply some kind of filtering logic to reduce the outlier impact. As the example above shows, however, one needs to ensure that all variants in the experiment are impacted by the outlier filtering logic in the same way.

The authors recommend always having a metric that counts the number of values affected by outlier handling logic in the Data Quality Metrics section.
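
A sketch of what such a data quality metric might look like, assuming the outlier handling is a simple cap on per-user values. The cap of 30 and the click counts are made up, and `cap_and_count` is a hypothetical helper.

```python
def cap_and_count(values, cap):
    """Cap each value at `cap`; return the capped list and how many changed."""
    capped = [min(v, cap) for v in values]
    affected = sum(1 for v in values if v > cap)
    return capped, affected

# Hypothetical clicks per user in each variant.
control_clicks = [3, 5, 2, 40, 4]
treatment_clicks = [6, 9, 4, 55, 48, 7]

_, control_affected = cap_and_count(control_clicks, cap=30)
_, treatment_affected = cap_and_count(treatment_clicks, cap=30)

print(f"capped: control {control_affected}/{len(control_clicks)}, "
      f"treatment {treatment_affected}/{len(treatment_clicks)}")
```

If the affected share differs much between variants, the filtering itself is interacting with the treatment, as in the slideshow example.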

Pitfall #10: Novelty and primacy effects

We added a shiny thing, and engagement went up, yay!

When we looked at the experiment results for each day segment, we found that the percent delta in clicks on top sites between treatment and control declined quickly, suggesting a novelty effect.

What you really want of course, is to understand the long-term impact on the business and users. Ignoring novelty effects can lead to an investment in features that aren’t worth it in the long run. On the flip side of the coin, primacy effects are those where initial treatment effects appear small in the beginning, but increase over time (e.g., as a machine learning system better adapts). When a feature has suspected primacy effects you can do a ‘warm start’ so that the marginal improvement in performance over time is smaller.

We also recommend segmenting treatment effect by different days of the experiment, or different user visits, to see if the treatment effect changes over time.
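
As a tiny illustration of segmenting by day, the sketch below computes the percent delta per day from hypothetical daily click averages. The monotone-decline check is a deliberately crude stand-in for a proper trend test.

```python
def daily_percent_delta(control_by_day, treatment_by_day):
    """Percent difference between treatment and control for each day."""
    return [100.0 * (t - c) / c
            for c, t in zip(control_by_day, treatment_by_day)]

# Hypothetical average clicks per user on days 1 to 4 of the experiment.
control_by_day = [100, 100, 100, 100]
treatment_by_day = [120, 110, 104, 101]

deltas = daily_percent_delta(control_by_day, treatment_by_day)
is_shrinking = all(a > b for a, b in zip(deltas, deltas[1:]))

print(deltas)
print("possible novelty effect" if is_shrinking else "no steady decline")
```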

Pitfall #11: Incomplete funnel metrics

For funnel metrics, you need to measure all parts of the process. And with final success rates often as low as 1%, you need to take care to ensure that the metrics at every step are sufficiently powered to detect any significant changes.

At every step of the funnel we need to ensure that the success rates are compared and not just the raw clicks or user counts. In addition, we should measure both conditional and unconditional success rate metrics. Conditional success rates are defined as the proportion of users that complete the given step among users that attempted to start the step. Unconditional success rates compute the success rate taking into consideration all users who started at the top of the funnel.
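
The two kinds of success rate can be sketched as follows. To keep it short I assume a simplified funnel where completing a step means reaching the next one, so a single list of user counts describes everything; the counts themselves are invented.

```python
def funnel_rates(step_counts):
    """step_counts: number of users reaching each step, top of funnel first."""
    top = step_counts[0]
    rates = []
    for previous, current in zip(step_counts, step_counts[1:]):
        rates.append({
            "conditional": current / previous,  # among users who started the step
            "unconditional": current / top,     # among all users entering the funnel
        })
    return rates

# Hypothetical funnel: landing page -> cart -> checkout -> purchase.
rates = funnel_rates([10_000, 2_000, 500, 100])
for step, r in enumerate(rates, start=1):
    print(step, f"conditional={r['conditional']:.2f}",
          f"unconditional={r['unconditional']:.2f}")
```

Here every conditional rate sits around 20 to 25%, but the unconditional rate for the last step is 1%, which is the kind of low final success rate the authors warn needs enough power.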

Pitfall #12: Failure to apply Twyman’s Law.

If you see an unexpected metric movement, positive or negative, it normally means there is an issue. For example, a too-good-to-be-true jump in number of clicks turned out to be because users were confused and clicking around trying to figure things out!

Twyman’s law says that any figure that looks interesting or different is usually wrong.

Even surprising results that appear to be positive should be treated with skepticism.

At Microsoft, we have configured automated alerts and auto-shutdown of the experiments if we detect unexpected large metrics movements in both the positive and negative directions.

Further reading…

If you enjoyed today’s paper, you might also like:

I may cover some of these in future editions of The Morning Paper too.

Distributed deep neural networks over the cloud, the edge, and end devices

September 22, 2017

Distributed deep neural networks over the cloud, the edge, and end devices Teerapittayanon et al., ICDCS 17

Earlier this year we looked at Neurosurgeon, in which the authors do a brilliant job of exploring the trade-offs when splitting a DNN such that some layers are processed on an edge device (e.g., mobile phone), and some layers are processed in the cloud. They find that partitioning can bring substantial benefits, and that the optimal partitioning varies dynamically according to network conditions, mobile battery life, and so on. Today’s paper choice introduces Distributed Deep Neural Networks or DDNNs. Similarly to Neurosurgeon, DDNNs partition networks between mobile/embedded devices, cloud (and edge), although the partitioning is static. What’s new and very interesting here though is the ability to aggregate inputs from multiple devices (e.g., with local sensors) in a single model, and the ability to short-circuit classification at lower levels in the model (closer to the end devices) if confidence in the classification has already passed a certain threshold. It looks like both teams worked independently and in parallel on their solutions.

Overall, DDNNs are shown to give lower latency decisions with higher accuracy than either cloud or devices working in isolation, as well as fault tolerance in the sense that classification accuracy remains high even if individual devices fail.

The big idea, distributed deep neural networks

Let’s quickly rehash the main arguments for partitioning a network between an end device and the cloud: often the end device isn’t powerful enough to run the full model, yet transmitting all of the raw sensor data to the cloud incurs significant overhead. If we can run the lower layers of the network on an end device then we only need to send the outputs of those layers up to the cloud, normally a significantly smaller data transmission. (See Neurosurgeon for a great analysis). Over and above this baseline, DDNNs can incorporate geographically distributed end devices (the example in the paper is a collection of devices with cameras attached, collectively covering a local area). DDNNs can scale vertically across multiple tiers (e.g., devices, edge / gateway, and cloud) and horizontally across devices.

… we propose distributed deep neural networks (DDNNs) over distributed computing hierarchies, consisting of the cloud, the edge (fog) and geographically distributed end devices. In implementing a DDNN, we map sections of a single DNN onto a distributed computing hierarchy.

Training happens centrally. Let’s explore some of the deployment topologies DDNNs enable:

As a starter for 10, here’s the classic “upload everything to the cloud and process it there” topology:

As in Neurosurgeon, we could also do some local processing on device, and then transfer a compressed representation to the cloud for further processing:

Note here the additional possibility of a fast local exit (we’ll discuss how this is possible in the next section). If we’re confident enough in the classification on the device itself, we can stop right there.

The next scenario shows how DDNNs can scale horizontally to include input from multiple distributed devices. The device inputs are aggregated in the lower layers of the network.

As well as scaling horizontally (the above scenario), we can also scale vertically, for example by introducing an additional tier (with another early exit opportunity) at the edge:

And of course, we can combine horizontal and vertical scaling to create interesting topologies such as these:

Building blocks: BNNs and BranchyNets

For efficiency on devices, DDNN uses Binarized neural networks:

Binarized neural networks (BNNs) are a recent type of neural network, where the weights in linear and convolutional layers are constrained to {-1, 1} (stored as 0 and 1 respectively)… Embedded binarized neural networks (eBNNs) extend BNNs to allow the network to fit on embedded devices by reducing floating point temporaries through re-ordering the operations in inference.

DDNN uses BNNs and eBNNs (now there’s a mouthful!) for end devices, so that they can be jointly trained with the network layers in the edge and the cloud.

To support early exit points, DDNN builds on prior work in BranchyNet. BranchyNet introduced entropy-based confidence criteria based on computed probability vectors. If confidence exceeds a given threshold, then the input is classified and no further computation is performed by higher network layers. DDNN places exit points at physical boundaries.

Aggregating inputs from multiple devices

In the examples above that included multiple component inputs to a layer (e.g., multiple end devices), we need a way to aggregate the input from each device in order to perform classification. There are three different techniques the authors experimented with for this: simple max pooling (MP) that takes the max of each component, average pooling (AP), and concatenation (CC), which simply concatenates all the input vectors together.
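
The three schemes are simple enough to sketch with plain lists. This assumes every device emits a vector of the same length (for example, one value per class); the function names and the sample vectors are mine.

```python
def max_pool(vectors):
    """MP: take the maximum of each component across devices."""
    return [max(components) for components in zip(*vectors)]

def average_pool(vectors):
    """AP: take the mean of each component across devices."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def concatenate(vectors):
    """CC: join all device vectors end to end."""
    return [x for vector in vectors for x in vector]

# Hypothetical outputs from two end devices.
device_outputs = [[0.25, 0.5, 0.25], [0.5, 0.25, 0.25]]

print(max_pool(device_outputs))
print(average_pool(device_outputs))
print(concatenate(device_outputs))
```

MP and AP keep the output the same size as a single device's vector, whereas CC grows with the number of devices but preserves every device's individual contribution.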


While DDNN inference is distributed over the distributed computing hierarchy, the DDNN system can be trained on a single powerful server or in the cloud.

Training needs to take into account the possibility of early exits. These are accommodated by minimising a weighted sum of the loss functions at each exit point.
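
In code, the joint objective for a single training sample might look like the sketch below. Cross-entropy as the per-exit loss, the equal weights, and the probability vectors are all assumptions for illustration; in a real setup the sum would be minimised with backpropagation over the whole network.

```python
import math

def cross_entropy(probabilities, true_class):
    """Loss at one exit: negative log-probability of the correct class."""
    return -math.log(probabilities[true_class])

def joint_loss(exit_probabilities, true_class, weights):
    """Weighted sum of the losses at every exit point for one sample."""
    return sum(w * cross_entropy(p, true_class)
               for w, p in zip(weights, exit_probabilities))

# Hypothetical softmax outputs at a local exit and a cloud exit.
exit_probabilities = [[0.5, 0.5], [0.25, 0.75]]
loss = joint_loss(exit_probabilities, true_class=1, weights=[1.0, 1.0])
print(loss)
```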


Inference in DDNN is performed in several stages using multiple preconfigured exit thresholds T (one element T at each exit point) as a measure of confidence in the prediction of the sample.

The authors used a normalised entropy threshold as the confidence criterion, yielding an uncertainty level between 0 and 1, where close to 0 means low uncertainty (high confidence).
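
A sketch of that measure as I understand it: the entropy of the predicted probability vector divided by the log of the number of classes, so that a uniform prediction scores 1 and a one-hot prediction scores 0. The example vectors are made up.

```python
import math

def normalized_entropy(probabilities):
    """Entropy of a probability vector, scaled to [0, 1] by log(#classes)."""
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(len(probabilities))

print(normalized_entropy([1 / 3, 1 / 3, 1 / 3]))  # maximally uncertain
print(normalized_entropy([0.9, 0.05, 0.05]))      # fairly confident
print(normalized_entropy([1.0, 0.0, 0.0]))        # fully confident
```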

Considering a topology such as this:

  1. Each end device first sends summary information to the local aggregator.
  2. The local aggregator determines if the combined summary information is sufficient for accurate classification.
  3. If so, the sample is classified (exited). (I’m assuming that if we wanted to communicate the classification result to the cloud, we could do that at this point).
  4. If not, each device sends more detailed information to the edge in order to perform further processing for classification.
  5. If the edge believes it can correctly classify the sample it does so and no information is sent to the cloud.
  6. Otherwise, the edge forwards intermediate computation to the cloud which makes the final classification.
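
The six steps above can be sketched as a cascade of exits. Everything here is hypothetical scaffolding: each tier is represented by a function returning a class-probability vector, which stands in for the real network sections, and the last tier always classifies. A tier is only evaluated when the tiers below it were not confident enough.

```python
import math

def normalized_entropy(probabilities):
    """Uncertainty in [0, 1]: 0 is fully confident, 1 is a uniform guess."""
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(len(probabilities))

def classify_with_early_exit(exits):
    """exits: list of (name, compute_probabilities, threshold), lowest tier first.

    Returns (name of the tier that exited, index of the predicted class).
    """
    for name, compute_probabilities, threshold in exits[:-1]:
        probabilities = compute_probabilities()
        if normalized_entropy(probabilities) <= threshold:
            return name, probabilities.index(max(probabilities))
    # No early exit was confident enough: the final tier decides.
    name, compute_probabilities, _ = exits[-1]
    probabilities = compute_probabilities()
    return name, probabilities.index(max(probabilities))

# Hypothetical outputs for one sample at each tier.
exits = [
    ("local", lambda: [0.40, 0.35, 0.25], 0.8),
    ("edge", lambda: [0.90, 0.05, 0.05], 0.8),
    ("cloud", lambda: [0.98, 0.01, 0.01], None),
]
print(classify_with_early_exit(exits))
```

In this example the local aggregator is too uncertain to exit, the edge is confident enough, and nothing is sent to the cloud.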

Evaluation: multi-view, multi-camera scenario

An example really helps here, and there’s a lot of good additional information contained in the evaluation. Consider a surveillance system (I’m not sure what else to call it really!) with six cameras attached to six different networked devices, and each placed at a different location in the same general area. We get image feeds from each camera, and the classifier the authors build simply plays a game of “car, bus, or person.”

The DDNN architecture used for the evaluation looks like this (there’s some good information in the figure caption, which I won’t repeat here).

There are 680 training samples and 171 testing samples. Of note, the classes are not evenly represented and each device sees a different mix of samples – meaning that the individual accuracy of each device differs wildly.

The best aggregation combination turns out to be max pooling at the local aggregator, and concatenation at the cloud aggregator:

Using CC at the cloud aggregator means that during backpropagation gradients are passed through all devices. Using MP at the local aggregator level forces a relationship between outputs for the same class on multiple devices and improves local accuracy.

The following chart shows what happens when exploring different thresholds for local exit.

… setting the threshold to 0.8 results in the best overall accuracy with significantly reduced communication, i.e., 97% accuracy while exiting 60.82% of samples locally as shown in the table below.

The next experiment looks at what happens to accuracy as you add more end devices. Devices are added in order from least accurate to most accurate.

The more devices we add the better it gets, but the cloud exit point always outperforms the local exit point regardless of the number of devices.

Generally these results show that by combining multiple viewpoints we can increase the classification accuracy at both the local and cloud level by a substantial margin when compared to the individual accuracy of any device. The resulting accuracy of the DDNN system is superior to any individual device accuracy by over 20%. Moreover, we note that the 60.82% of samples which exit locally enjoy lowered latency in response time.

One last thing… the authors also test the fault tolerance of the system by simulating end device failures and looking at the resulting accuracy of the system.

Regardless of the device that is missing, the system still achieves over a 95% overall classification accuracy. Specifically, even when the device with the highest individual accuracy has failed, which is Device 6, the overall accuracy is reduced by only 3%.

At present all layers in the DDNN are binary. Future work includes looking at larger datasets, more end devices, and alternative layer types in the cloud.

CLKSCREW: Exposing the perils of security-oblivious energy management

September 21, 2017

CLKSCREW: Exposing the perils of security-oblivious energy management Tang et al., USENIX Security ’17

This is brilliant and terrifying in equal measure. CLKSCREW demonstrably takes the Trust out of ARM’s TrustZone, and it wouldn’t be at all surprising if it took the Secure out of SGX too (though the researchers didn’t investigate that). It’s the deepest, widest impact, hardest to fix security issue I’ve seen in a long time.

Designing secure systems is really hard. One side channel, control over one single bit, and you can be compromised. Fault attacks try to induce bit corruptions at key moments. Differential fault attacks (DFA) compare execution under normal and faulted conditions, and can be used for example to infer AES keys based on pairs of correct and faulty ciphertexts. For example:

Assuming a fault can be injected during the seventh AES round to cause a single-byte random corruption to the intermediate state in that round, with a corrupted input to the eighth round, this DFA can reduce the number of AES-128 key hypotheses from the original 2^128 to 2^12, in which case the key can be brute-forced in a trivial exhaustive search.

Physical fault attacks require access to the device, opening it up, and using e.g., lasers, heat or radiation. But what if you could conduct remote fault attacks via software? It turns out that all of the well-intentioned mechanisms we’ve been adding for power and energy management let you do exactly that.

In this work, we present the CLKSCREW attack, a new class of fault attacks that exploit the security-obliviousness of energy management systems to break security. A novel benefit for the attackers is that these fault attacks become more accessible since they can now be conducted without the need for physical access to the devices or fault injection equipment.

Demonstrating the potency of the attack on commodity ARM devices (a Nexus 6 phone), the authors show how it can be used to extract secret keys from an ARM TrustZone, and can escalate privileges to load self-signed code into Trustzone.

Oh and by the way, energy management technology is everywhere, and there doesn’t seem to be any quick fix for CLKSCREW. It’s not a software bug, or a hardware bug, it’s a fundamental part of the energy management design. SoC and device vendors are apparently “working towards mitigations.”

To understand how CLKSCREW works, we first need a little bit of background on DVFS, Dynamic Voltage and Frequency Scaling.

Dynamic Voltage and Frequency Scaling

DVFS made its debut in 1994, and has become ubiquitous in almost all commodity devices since. It works by regulating frequency and voltage: power, an important determinant of energy consumption, is directly proportional to the product of operating frequency and voltage.

DVFS regulates frequency and voltage according to runtime task demands. As these demands can vary drastically and quickly, DVFS needs to be able to track these demands and effect the frequency and voltage adjustments in a timely manner. To achieve this, DVFS requires components across layers in the system stack.

There are voltage/frequency regulators in hardware, a vendor-specific regulator driver, and an OS-level CPUfreq power governor. Because accurate layer-specific feedback is needed to do a good job of power management, software level access to the frequency and voltage regulators is freely available.

The frequency regulator contains a Phase Lock Loop (PLL) circuit that generates a synchronous clock signal for digital components. The frequency of the clock is adjustable, and typically the operating frequency of each core can be individually controlled. In the case of the Nexus 6 for example, each core can be set to one of three frequencies. Power to the cores is controlled by the Subsystem Power Manager with memory-mapped control registers to direct voltage changes.

Pushing the frequency too high (overclocking) or under-supplying voltage (undervolting) can cause unintended behaviour in digital circuits. Memory flip-flops change their output to the value of the input upon the receipt of the rising edge of the clock signal. The input has to be held stable for a time window while this happens. Overclocking reduces the clock cycle time below this stable period, and undervolting increases the overall circuit propagation time meaning that the period the input needs to be stable increases. The following figure shows an example leading to an erroneous value of 0 due to overclocking.
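
The timing condition in that paragraph can be written down as a simple inequality: the clock period must be at least the flip-flop output delay plus the worst-case propagation delay plus the setup time. The sketch below is a textbook-style simplification with invented delay values; real circuits have further margins, and the model of undervolting as a longer propagation delay is only qualitative.

```python
def timing_violated(frequency_hz, t_flip_flop, t_max_path, t_setup):
    """True if the clock period is too short for the input to stabilise."""
    clock_period = 1.0 / frequency_hz
    return clock_period < t_flip_flop + t_max_path + t_setup

# Hypothetical delays in seconds at the nominal voltage.
nominal = dict(t_flip_flop=0.05e-9, t_max_path=0.25e-9, t_setup=0.05e-9)
# Undervolting slows the circuit: model it as a longer propagation delay.
undervolted = dict(nominal, t_max_path=0.30e-9)

print(timing_violated(2.6e9, **nominal))      # within limits
print(timing_violated(3.0e9, **nominal))      # overclocked
print(timing_violated(2.6e9, **undervolted))  # same clock, lower voltage
```

The third case is the interesting one for what follows: lowering the voltage makes a frequency that was previously safe start to violate the timing constraint.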

Challenges in constructing a CLKSCREW attack

We need to be able to:

  • push the operating limits of regulators to a point where such attacks can take place.
  • conduct the attack in a manner that does not affect the execution of the attacking code
  • inject a fault into the target code without causing too much perturbation to non-targeted code
  • be relatively precise in when the fault is injected
  • target a specific range of code execution that may take orders of magnitude fewer clock cycles within an entire operation

The authors demonstrate how to achieve all of these.

On the Nexus 6 as an example, there are 15 possible official Operating Performance Points. By probing the device by stepping through voltage and frequency ranges until it either reboots or freezes, the authors demonstrate large areas beyond the official operating performance points where the regulators can be configured:

Attack enabler #1: There are no safeguard limits in the hardware regulators to restrict the range of frequencies and voltages that can be configured.

In the figure above we can see that reducing the operating voltage simultaneously lowers the minimum required frequency needed to induce a fault in an attack (push the system above the blue line). Thus if the frequency settings don’t let us set the clock fast enough, we can always reduce the voltage.

Attack enabler #2: reducing the operating voltage lowers the minimum required frequency to induce faults.

To attack target code without disrupting the attacker, we can simply pin the attack code and the victim code to different cores, since this allows each of them to operate in different frequency domains.

Attack enabler #3: the deployment of cores in different voltage/frequency domains isolates the effects of cross-core fault attacks.

To attack trusted code running in ARM Trustzone (Intel SGX works the same way), we can take advantage of the fact that ARM can execute both trusted and untrusted code on the same physical core. “On such architectures, the voltage and frequency regulators typically operate on domains that apply to cores as a whole.” Thus any frequency or voltage change initiated by untrusted code inadvertently affects the trusted code execution.

Attack enabler #4: hardware regulators operate across security boundaries with no physical isolation.

For timing of the attack (to flip a bit at the moment we want to), we can combine profiling to find out how long to wait, with a spin-loop in the attacking code to delay this amount of time before triggering the fault. A couple of Trustzone specific features help with the profiling part of the puzzle:

Attack enabler #5: execution timing of code running in Trustzone can be profiled with hardware counters that are accessible outside Trustzone


Attack enabler #6: memory accesses from the non-secure world can evict cache lines used by Trustzone code, thereby enabling Prime+Probe style execution profiling of Trustzone code.

Putting it all together, the key steps in a CLKSCREW attack are as follows:

  • Invoke both the victim and attack threads a few times in quick succession to clear away any microarchitectural residual states remaining from prior executions of other code.
  • Profile for a timing anchor to determine when to deliver the fault injection
  • For high-precision delivery, configure the attack thread to spin-loop a predetermined number of times before inducing the fault
  • Given a base operating voltage, raise the frequency of the victim core, keep it high for long enough to induce the fault, then restore it to its original value.

Example attack: inferring AES keys

In section 4 of the paper the authors show how AES keys stored within Trustzone can be inferred by lower-privileged code from outside Trustzone. Using a hardware cycle counter to track the execution duration (in cycles) of the AES decryption operation allows an attacker to determine the execution time. Here’s a plot over 13K executions:

A grid search finds the faulting frequency and duration that induce erroneous AES decryption results. Then by varying the delay before inducing the fault they find that about 60% of faults are precise enough to affect exactly one AES round, and more than half of these cause random corruptions of exactly one byte.

As we saw in the introduction…

… Being able to induce a one-byte random corruption to the intermediate state of an AES round is often used as a fault model in several physical fault injection works.

With all the parameters worked out, it took on average 20 attempts to induce a one-byte fault to the input of the eighth AES round. Given the faulty plaintext produced by this, and the expected one, it took about 12 minutes using Tunstall et al.’s DFA algorithm to generate 3650 key hypotheses – one of which is the key stored within Trustzone.

Example attack: loading self-signed apps

In section 5 of the paper the authors show how CLKSCREW can subvert RSA signature chain verification used in loading firmware images into Trustzone. The details are ingenious, and I don’t have space to cover them all here (do check out the full paper if you’re interested, you’re missing a lot otherwise). The aha moment for me was as follows: the RSA cryptosystem depends on the computational infeasibility of factorizing a modulus N into its prime factors p and q. If we can corrupt one or more bits in N, then it’s likely we’ll end up with a composite number with more than two prime factors – some of which are small – which we can factorize.
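
Here’s a toy sketch of that intuition (my own illustration, not the paper’s attack code). It builds a small RSA-style modulus from two well-known Mersenne primes, flips one bit at a time, and uses trial division to see which corrupted values pick up small prime factors. The helper names, the 1000 bound, and the choice of primes are all assumptions for illustration; a real modulus is far larger and needs a real factoring algorithm.

```python
def small_prime_factors(n, bound=1000):
    """Prime factors of n below `bound`, with multiplicity, by trial division."""
    factors = []
    d = 2
    while d < bound and n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    return factors


def flip_bit(n, bit):
    """Return n with a single bit inverted, modelling a one-bit fault."""
    return n ^ (1 << bit)


# Toy modulus: the product of two Mersenne primes (2^31 - 1 and 2^61 - 1).
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q

# The intact modulus has no small factors at all.
print("small factors of N:", small_prime_factors(N))

# Which single-bit corruptions yield a value with at least one small factor?
corrupted_with_small_factor = [
    bit for bit in range(N.bit_length())
    if small_prime_factors(flip_bit(N, bit))
]
print(len(corrupted_with_small_factor), "of", N.bit_length(),
      "single-bit flips give a value with a small prime factor")
```

Flipping the lowest bit alone turns the odd modulus into an even number, so the corrupted value is immediately divisible by 2. Finding small factors is only the first step; the attack needs the corrupted modulus to be fully factorizable.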

About 20% of faulting attempts resulted in a successful fault within the target buffer, yielding 805 faulted values, of which 38 were factorizable. Selecting one of the factorizable Ns, the authors embed an attack signature into the widevine trustlet and conduct CLKSCREW faulting attempts while invoking the self-signed app. On average, one instance of the desired fault occurred with every 65 attempts.


Section 6 discusses possible hardware and software defenses. The short version is that none of them seem particularly compelling. We’re probably looking at a deep and invasive redesign.

Our analysis suggests that there is unlikely to be a single, simple fix, or even a piecemeal fix, that can entirely prevent CLKSCREW style attacks. Many of the design decisions that contribute to the success of the attack are supported by practical engineering concerns. In other words, the root cause is not a specific hardware or software bug but rather a series of well thought-out, nevertheless security-oblivious design decisions.

A new class of attacks

I’ll leave you with this thought: CLKSCREW isn’t just the latest in a known exploit genre; it opens the door to a whole new class of energy-management based attacks.

As researchers and practitioners embark upon increasingly aggressive cooperative hardware-software mechanisms with the aim of improving energy efficiency, this work shows, for the first time, that doing so may create serious security vulnerabilities… Furthermore, CLKSCREW is the tip of the iceberg: more security vulnerabilities are likely to surface in emerging energy optimization techniques, such as finer-grained controls, distributed control of voltage and frequency islands, and near/sub-threshold optimisations.

A longitudinal, end-to-end view of the DNSSEC ecosystem

September 20, 2017

A longitudinal, end-to-end view of the DNSSEC ecosystem Chung et al., USENIX Security 2017

DNS, the Domain Name System, provides a vital function on the Internet, mapping names to values. Unprotected, it’s also an attractive target for hackers with attack vectors such as DNS spoofing and cache poisoning. Thus about two decades ago a set of security extensions, DNSSEC, was introduced. DNSSEC allows clients and resolvers to verify that DNS responses have not been forged or modified in flight.

You’d think that after two decades, everyone would be using DNSSEC everywhere. That’s not the case, but we’re making some progress:

Largely in response to powerful attacks such as the Kaminsky Attack, DNSSEC adoption has increased recently. As of early 2017, more than 90% of top-level domains (TLDs) and 47% of country-code TLDs (ccTLDs) are DNSSEC-enabled. Widely used DNS resolvers now attempt DNSSEC validation by default, e.g., as of January 2012 Comcast (one of the largest ISPs in the US) requests and validates DNSSEC records for all queries, and Google (which operates the largest public DNS resolver) did the same in March 2013.

This still means that, for example, over half of country-code TLDs are not DNSSEC enabled. But it gets much worse. In order for DNSSEC to provide the protections it claims, it needs to be used properly on both the server side (key management, signature publication etc.) and on the client side (validating responses).

This paper performs the first large-scale, longitudinal measurement study into how well DNSSEC’s PKI is managed… Our investigation reveals pervasive mismanagement of the DNSSEC infrastructure. For example, we found that 31% of domains that support DNSSEC fail to publish all relevant records required for validation; 39% of the domains use insufficiently strong key-signing keys; and although 82% of resolvers in our study request DNSSEC records, only 12% of them actually attempt to validate them.

DNSSEC in practice, it seems, is pretty much broken.

How DNSSEC works

Let’s begin by looking at how DNSSEC is supposed to work. DNS uses records to map domain names to values. For example, an A record maps a domain name to an IP address, and an NS record maps a domain name to the authoritative name server for the domain. DNSSEC adds its own record types:

  • DNSKEY records are public keys. Typically each zone uses two DNSKEY records to sign DNS records. The two DNSKEYs for a zone are the Key Signing Key (KSK) and the Zone Signing Key (ZSK). The KSK is only used to produce RRSIGs for DNSKEY records, the ZSK is used for all other record types. ZSKs are intended to be rolled over daily or weekly, and KSKs monthly or yearly.
  • RRSIG (Resource Record Signature) records are cryptographic signatures of other records. Each RRSIG is a signature over all records of a certain type, for a certain name. Such a set of records is called an RRSet. RRSIGs are created using the private key corresponding to a public DNSKEY.
  • DS (Delegation Signer) records are essentially hashes of DNSKEYs. These are uploaded to the parent zone, ultimately establishing a chain of trust reaching up to the root zone. DS records in the parent zone are authenticated using RRSIGs just like any other record type.

When a resolver wants to look up a name it iteratively determines the authoritative name server for the domain and obtains the record (absent caching). If the resolver supports DNSSEC it also needs to fetch (by setting the DO, or “DNSSEC OK”, bit in its request) and validate all of the associated DNSSEC records.

The DNSSEC PKI is rooted at the KSK of the DNS root zone. This KSK is well-known by DNSSEC-aware resolvers. Validating a DNS response starts at the root and continues down the DNS hierarchy: A resolver begins by using the KSK to validate the root DNSKEY RRSIG, which validates the root zone’s ZSK. The resolver can then validate the child zone’s DS record (and thereby the child zone’s KSK) using the RRSIG for the DS records in the root zone, as this is signed with the root zone’s ZSK. This process continues until the record in question is authenticated.
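
To see the shape of that chain, here’s a deliberately simplified sketch of just the DS links. It is my own toy model, not anything from the paper: keys are opaque byte strings, the zone names and the `ZONES` layout are made up, and RRSIG signature checking is left out entirely. A real DS digest also covers the owner name and the full DNSKEY record data, not just the key bytes.

```python
import hashlib


def ds_digest(dnskey):
    """Toy DS value: SHA-256 over the key bytes only (a simplification)."""
    return hashlib.sha256(dnskey).hexdigest()


# Hypothetical zones. Each has a KSK and the DS records it holds for its children.
ZONES = {
    ".": {"ksk": b"root-ksk", "ds": {}},
    "org.": {"ksk": b"org-ksk", "ds": {}},
    "example.org.": {"ksk": b"example-ksk", "ds": {}},
}


def publish_ds(parent, child):
    """Upload a hash of the child's KSK into the parent zone."""
    ZONES[parent]["ds"][child] = ds_digest(ZONES[child]["ksk"])


def chain_is_linked(path, trust_anchor):
    """Walk from the root down, checking each child's KSK against its parent's DS."""
    if ZONES[path[0]]["ksk"] != trust_anchor:
        return False  # the root key is not the one the resolver trusts
    for parent, child in zip(path, path[1:]):
        ds = ZONES[parent]["ds"].get(child)
        if ds is None or ds != ds_digest(ZONES[child]["ksk"]):
            return False  # missing or mismatched DS record breaks the chain
    return True


publish_ds(".", "org.")
publish_ds("org.", "example.org.")
print(chain_is_linked([".", "org.", "example.org."], b"root-ksk"))
```

A single missing DS record anywhere on the path makes `chain_is_linked` return `False`, however carefully the zone below it is signed.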

DNSSEC in practice – domains

To study a large number of domains, the authors looked at domains in the .com, .net, and .org TLDs. These comprise about 150M domains, covering 64% of the Alexa top-1M and 75% of the Alexa top-1K websites. Daily scans are done over a period of 21 months. To get a finer-grained view, hourly scans are also done for a subset of these domains (all second-level domains in .com and .org, about 708K domains in total) over a period of three months.

One key observation is that DNSSEC deployment is rare: between 0.6% (.com) and 1.0% (.org) of domains have DNSKEY records published in our latest snapshot. The fraction of domains that have DNSKEYs is, however, steadily growing. For example, for .org the fraction rose from 0.75% in March 2015 to over 1.0% in December 2016, even though the number of second-level domains in these TLDs is growing as well.

The spikes in growth turn out to be due to actions by a small number of authoritative name servers (e.g. enabling DNSSEC for 11,026 domains in one go).

Popular websites are more likely to sign their domains, but the overall deployment remains low even among the most popular domains (the Top-10K sites have a DNSSEC deployment rate of only 1.85%!).

Even with DNSSEC enabled, the proper DS records may not be installed in the parent domain. Without this, domains cannot be validated even if they provide correct RRSIGs for all their records.

We observe that 28% – 32% of signed domains do not have a DS record, meaning they cannot be validated.

(So that knocks out about a third of our already meagre 1.85% or less).

When you drill into this, you find that not all name servers are equal.

Table 1 (below) shows the results for the top 15 authoritative name servers, which cover 83% of the signed domains we study. We find a highly skewed distribution, with most of the name servers publishing DS records for almost all signed domains, but with four failing to upload a DS record for nearly all of their domains.

You had one job!!

Some signed domains also fail to publish RRSIGs (1.7%).

Let’s suppose we have published all of the required records. We still need the signatures and timestamps in those records to be correct and not expired. About 0.5% of RRSIG records are invalid under these criteria.

Even the best record management practices can result in an insecure system if the cryptographic keys that they rely on are mismanaged…

In principle each domain’s KSK and ZSK should be unique, but are name servers tempted to save administrative overhead and share keys? You bet!

Key sharing is mostly explained by policies at a small number of name servers:

*, we’re looking at you again!

Even if you did have unique keys, are they strong enough? Adopting NIST standards for defining weak keys (e.g., RSA keys with a length less than or equal to 1024 bits, DSA keys less than 2048 bits, elliptic curve keys less than 160 bits), the authors found that the vast majority of ZSKs are weak (91.7%)!
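
Those thresholds are easy to turn into a check. This is a minimal sketch using only the cut-offs quoted above; the function name and the `"RSA"` / `"DSA"` / `"EC"` labels are my own, not identifiers from the paper or from any DNS library.

```python
def is_weak_key(algorithm, bits):
    """Classify a key as weak using the thresholds quoted in the text."""
    if algorithm == "RSA":
        return bits <= 1024   # RSA: 1024 bits or fewer is weak
    if algorithm == "DSA":
        return bits < 2048    # DSA: fewer than 2048 bits is weak
    if algorithm == "EC":
        return bits < 160     # elliptic curve: fewer than 160 bits is weak
    raise ValueError(f"unknown algorithm: {algorithm}")


for algorithm, bits in [("RSA", 1024), ("RSA", 2048), ("DSA", 1024), ("EC", 256)]:
    print(algorithm, bits, "weak" if is_weak_key(algorithm, bits) else "ok")
```

Note the asymmetry: a 1024-bit RSA key sits exactly on the boundary and counts as weak, whereas a 2048-bit DSA key does not.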

For the final hammer blow, the authors looked at whether these keys were rotated as per the recommendations. (Have a guess 😉 ).

Over 70% of domains do zero KSK rollovers during our 21 month study period.

(And of those that do, about 7% use abrupt rollovers that may break clients). About 45% of domains had no ZSK rollovers either.

DNSSEC in practice – resolver validation

Suppose a name server actually did get all of the above right. A client is still not protected unless its resolver requests and validates DNSSEC records properly. The authors measured 403,355 exit nodes from 177 countries and 8,842 ASes over a period of 13 days. 83% of resolvers sent requests with the DO bit set, indicating that they support DNSSEC. However…

We found that 3,635 resolvers (82.1% of the DNSSEC-aware resolvers) from 146 ASes fail to validate the DNSSEC responses, even though they issue the DNS requests with the DO bit set.

Here are the top 15 culprits:

And here’s the story when looking just at public DNS services:

(Thank you Google).

A bleak picture

Our study paints a bleak picture of the security provided by the DNSSEC ecosystem, one that has not improved substantially over time. Our findings highlight the need for continuous auditing of DNSSEC deployments and automated processes for correctly and securely managing DNSSEC material.

To type or not to type: quantifying detectable bugs in JavaScript

September 19, 2017

To type or not to type: quantifying detectable bugs in JavaScript Gao et al., ICSE 2017

This is a terrific piece of work with immediate practical applications for many project teams. Is it worth the extra effort to add static type annotations to a JavaScript project? Should I use Facebook’s Flow or Microsoft’s TypeScript if so? Will they really catch bugs that would otherwise have made it to master?

TL;DR: both Flow and TypeScript are pretty good, and conservatively either of them can prevent about 15% of the bugs that end up in committed code.

“That’s shocking. If you could make a change to the way we do development that would reduce the number of bugs being checked in by 10% or more overnight, that’s a no-brainer. Unless it doubles development time or something, we’d do it.” – engineering manager at Microsoft.

(It’s amusing to me that this quote comes from a manager at the company where they actually develop TypeScript! You’d think if anyone would know about the benefits…. Big companies eh).

Let’s dig in.

The flame wars

Static vs dynamic typing is always one of those topics that attracts passionately held positions. In the static typing camp we have the benefits of catching bugs earlier in the development cycle, eliminating altogether some classes of bugs in deployed systems, improved things like code-assist and other tool support, and enabling compiler optimisations. In the dynamic typing camp we have cleaner looking code, and greater flexibility and code malleability.

JavaScript is dynamically typed.

Three companies have viewed static typing as important enough to invest in static type systems for JavaScript: first Google released Closure, then Microsoft published TypeScript, and most recently Facebook announced Flow.

Shedding some light through empirical data

Inspired by previous studies in other areas, the authors study historical bugs in real world JavaScript projects in GitHub.

The fact that long-running JavaScript projects have extensive version histories, coupled with the existence of static type systems that support gradual typing and can be applied to JavaScript programs with few modifications, enables us to under-approximately quantify the beneficial impact of static type systems on code quality.

In other words, if the developers had been taking advantage of TypeScript or Flow at the time, would the bug have made it past the type checker? If not, it’s reasonable to assume it would never have been committed into the repository in the first place.

Here’s an example of a trivial bug that type annotations can detect:

Through this process, we end up with an under-estimation of the total benefits that might be available through static type checking.

In the Venn diagram below, we see on the left the universe of bugs that can potentially be detected by a statically checked type system. Type checking may help catch some of these faster during development. On the right we see bugs that have made it into public repositories. Only a subset of these have clearly linked fixes / patches. This study looks at the intersection of type-system detectable bugs and those that have public fixes.

We consider public bugs because they are observable in software repository histories. Public bugs are more likely to be errors understanding the specification because they are usually tested and reviewed, and, in the case of field bugs, deployed. Thus, this experiment under-approximates static type systems’ positive impact on software quality, especially when one considers all their other potential benefits on documentation, program performance, code completion, and code navigation.

Finding bugs to study

The goal is to find a corpus of bugs to study that is representative of the overall class, and large enough to support statistical significance. Finding representative bugs is addressed by uniform sampling. The authors sample commits that are linked to a GitHub issue from a snapshot of all publicly available JavaScript projects on GitHub. Each is then manually assessed to determine whether or not it really is an attempt to fix a bug (as opposed to a feature enhancement, refactoring, etc.). For the commits that pass this filter, the parent provides the code containing the bug.

To report results that generalize to the population of public bugs, we used the standard sample size computation to determine the number of bugs needed to achieve a specified confidence interval. On 19/08/2015, there were 3,910,969 closed bug reports in JavaScript projects on GitHub. We use this number to approximate the population. We set the confidence level and confidence interval to be 95% and 5%, respectively. The result shows that a sample of 384 bugs is sufficient for the experiment, which we rounded to 400 for convenience.
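
The paper only says “the standard sample size computation”, so the following is my assumption about what that means: Cochran’s formula with the most conservative proportion (p = 0.5) and a finite population correction. Under that assumption the numbers line up with the 384 quoted above.

```python
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Cochran's sample size with a finite population correction."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Shrink slightly to account for the finite population.
    return n0 / (1 + (n0 - 1) / population)


n = sample_size(3_910_969)
print(round(n))  # rounds to 384, the figure quoted in the paper
```

With a population of almost four million the correction barely matters; the answer is driven almost entirely by the confidence level and the margin.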

At the end of the sampling process, the bug pool contained bugs from 398 different projects (two projects happened to have 2 bugs each in the corpus). Most of these bug fixing commits ended up being quite small: about 48% of them touched only 5 or fewer lines of code, with a median of 6.

Bug assessment process

To figure out how many of these bugs could have been detected by TypeScript and Flow, we need some rules for how far we’re prepared to go in adding type annotations, and how long we’re prepared to spend on it. A preliminary study on a smaller sample of 78 bugs showed that for almost 90% a conclusion could be reached within 10 minutes, so the maximum time an author was allowed to spend annotating a bug was set at 10 minutes.

Each bug is assessed both with TypeScript (2.0) and with Flow (0.30). To reduce learning effects (knowledge gained from annotating with one system speeding annotation with the other), the type system to try first is chosen at random. The process is then to read the bug report and the fix and spend up to the allotted ten minutes adding type annotations. Sometimes the tools can detect the bug without needing any annotations to be added at all. Other times the report will make it clear that the bug is not type related – for example a misunderstanding of the intended application functionality. In this case the bug is marked as type-system undetectable.

We are not experts in type systems nor any project in our corpus. To combat this, we have striven to be conservative: we annotate variables whose types are difficult to infer with any. Then we type check the resulting program. We ignore type errors that we consider unrelated to this goal. We repeat this process until we confirm that b is ts-detectable because ts throws an error within the patched region and the added annotations are consistent (Section II), or we deem b is not ts-detectable, or we exceed the time budget M.

Details of the approach used to gradually add type annotations are provided in section III.D.

Does typing help to detect public bugs?

Of the 400 public bugs we assessed, Flow successfully detects 59 of them, and TypeScript 58. We, however, could not always decide whether a bug is ts-detectable within 10 minutes, leaving 18 unknown. The main obstacles we encountered during the assessment include complicated module dependencies, the lack of annotated interfaces for some modules, tangled fixes that prevented us from isolating the region of interest, and the general difficulty of program comprehension.

The 18 unknown bugs are then investigated to conclusion, at which point the score is 60 each for TypeScript and Flow.

Running the binomial test on the results shows that, at the confidence level of 95%, the true percentage of detectable bugs for Flow and TypeScript falls into [11.5%, 18.5%] with mean 15%.
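
As a sanity check, here’s a short sketch that recomputes the interval from 60 detected bugs out of 400. I’m assuming a normal-approximation (Wald) interval; the paper describes its procedure only as a binomial test, so this may not be exactly what the authors ran, but it does reproduce the same bounds.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, trials, confidence=0.95):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / trials
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width


low, high = wald_interval(60, 400)
print(f"{low:.1%} to {high:.1%}")
```

The half-width is about 3.5 percentage points either side of 15%, which is where the [11.5%, 18.5%] range comes from.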

Flow and TypeScript largely detect the same bugs:

Which is better: Flow or TypeScript?

The bugs that the two systems can detect largely overlap, with just 3 bugs that are only detected by TypeScript, and 3 that are only detectable by Flow.

All three Flow-detectable bugs that TypeScript fails to catch are related to concatenating possibly undefined or null values to a value of type string. Two of the three TypeScript-detectable bugs that Flow fails to detect are due to Flow’s incomplete support for using a string literal as an index. The remaining bug slips through the net due to Flow’s permissive handling of the window object.

Flow has builtin support for popular modules, like Node.js, so when a project used only those modules, Flow worked smoothly. Many projects, however, use unsupported modules. In these cases, we learned to greatly appreciate TypeScript community’s DefinitelyTyped project. Flow would benefit from being able to use DefinitelyTyped; TypeScript would benefit from automatically importing popular DefinitelyTyped definitions.

Yes, but what’s the cost of adding all those annotations?

Another consideration, both when comparing TypeScript and Flow, and when deciding whether to use either, is the cost of adding the annotations in terms of time and ‘token pollution’ in the source code.

… on average Flow requires 1.7 tokens to detect a bug, and TypeScript 2.4. Two factors contribute to this discrepancy; first, Flow implements stronger type inference, mitigating its reliance on type annotations; second, Flow’s syntax for nullable types is more compact.

The authors also measured the time taken to add the required annotations, with Flow coming out on top:

Thanks to Flow’s type inference, in many cases, we do not need to read the bug report and the fix in order to devise and add a consistent type annotation, which leads to the noticeable difference in annotation time.

Over to you

Is a 15% reduction in bugs making it all the way through your development pipeline worth it to you? And if so, will you go with the Flow or settle for TypeScript? Flow’s lighter-weight approach may suit dynamic-typing advocates better, whereas TypeScript’s DefinitelyTyped library and approach may perhaps be more appealing to strong-typing advocates coming from other languages.

Whatever you decide, at least you now have some data points to help make a more informed decision.

Bringing the web up to speed with WebAssembly

September 18, 2017

Bringing the web up to speed with WebAssembly Haas et al., PLDI 2017

This is a joint paper from authors at Google, Mozilla, Microsoft and Apple describing the motivations for WebAssembly together with a very concise exposition of its core semantics. If you’re following along with the paper, I found it helpful to dip into the full WebAssembly Specification at points for extra clarification. I’m going to spend most of my time in this write-up covering the parts of the paper focusing on the design rationale, properties, and implementation experiences since diving into the detailed semantics is really better left to the full spec.

Why do we need WebAssembly?

There’s a great observation in the introduction section: the Web has become the most ubiquitous application platform ever, and yet by historical accident the only natively supported programming language for that platform is JavaScript! That’s made for a number of attempts to compile down to JavaScript (e.g., asm.js), but JavaScript isn’t really a good compilation target.

WebAssembly addresses the problem of safe, fast, portable low-level code on the Web. Previous attempts at solving it, from ActiveX to Native Client to asm.js, have fallen short of the properties that a low-level compilation target should have…

Those properties fall into two groups: safe, fast, and portable semantics (safe and fast to execute, language, hardware, and platform independent, deterministic and easy to reason about, simple interoperability with the Web platform); and safe and efficient representation (compact and easy to decode, easy to validate and compile, easy to generate for producers, streamable and parallelizable).

WebAssembly is the first solution for low-level code on the Web that delivers on all of the above design goals. It is the result of an unprecedented collaboration across major browser vendors and an online community group to build a common solution for high-performance applications.

As we begin to understand all of the opportunities that WebAssembly opens up for the Web over the next few years, I believe it has the potential to be one of the most impactful changes to the Web platform since the introduction of JavaScript itself. Because of the way WebAssembly is also designed for easy embedding and modular safety, I believe we’ll also see it popping up on the server-side in all sorts of interesting places too (see e.g. ‘JavaScript for extending low latency in-memory key-value stores’).

The design of WebAssembly

To our knowledge, WebAssembly is the first industrial-strength language or VM that has been designed with a formal semantics from the start. This not only demonstrates the “real world” feasibility of such an approach, but also that it leads to a notably clean design… nothing in its design depends on the Web or a JavaScript environment. It is an open standard specifically designed for embedding in multiple contexts, and we expect that stand-alone implementations will become available in the future.

Ultimately WebAssembly is a binary code format, but it is presented as a language with syntax and structure, “an intentional design choice which makes it easier to explain and understand.” The abstract syntax is quite compact and looks like this:

A binary takes the form of a module, and the runtime instantiation of a module is an instance. Computation is based on a stack machine, since a stack organisation has been shown to achieve a more compact program representation than a register machine.

As the distribution format is binary, the speed and simplicity of validation is key to good performance and high assurance. Here the team learned from experiences with the JVM and CIL stack machines and their validation algorithms:

By designing WebAssembly in lock-step with a formalization, we managed to make its semantics drastically simpler. For example, JVM bytecode verification takes more than 150 pages to describe in the current JVM specification, while for WebAssembly it fits on one page.

WebAssembly: the less boring parts

As the authors point out, the treatment of things such as local variables and value types is all fairly standard. But there are a few areas where WebAssembly makes more interesting decisions. The first of these is its linear memory model which is central to the memory safety guarantees.

Each module has its own large area of bytes referred to as a linear memory. This is created with an initial size but may be grown later.

Linear memory is disjoint from code space, the execution stack, and the engine’s data structures; therefore compiled programs cannot corrupt their execution environment, jump to arbitrary locations, or perform other undefined behavior.

The worst that can happen with a buggy or exploited WebAssembly module is that its own memory gets messed up. Thus untrusted modules can be safely executed in the same address space as other code. This gives fast in-process isolation, and also allows a WebAssembly engine to be embedded into any other managed language runtime without violating memory safety.
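
Here’s a minimal sketch of the idea, as a toy model rather than how any real engine implements it. The `LinearMemory` and `Trap` names and the `max_pages` parameter are mine. Every access is bounds-checked against the module’s own byte array, so the worst a bad address can do is trap, and growing returns the previous size in 64 KiB pages or -1 on failure.

```python
PAGE_SIZE = 65536  # WebAssembly pages are 64 KiB


class Trap(Exception):
    """Raised when a module performs an out-of-bounds memory access."""


class LinearMemory:
    def __init__(self, initial_pages, max_pages):
        self.data = bytearray(initial_pages * PAGE_SIZE)
        self.max_pages = max_pages

    def _check(self, addr, size):
        # Accesses can only ever touch this module's own bytes.
        if addr < 0 or addr + size > len(self.data):
            raise Trap(f"out-of-bounds access at {addr}")

    def load_u8(self, addr):
        self._check(addr, 1)
        return self.data[addr]

    def store_u8(self, addr, value):
        self._check(addr, 1)
        self.data[addr] = value & 0xFF

    def grow(self, delta_pages):
        """Grow by delta_pages; return the old size in pages, or -1 on failure."""
        old_pages = len(self.data) // PAGE_SIZE
        if old_pages + delta_pages > self.max_pages:
            return -1
        self.data.extend(bytes(delta_pages * PAGE_SIZE))
        return old_pages


mem = LinearMemory(initial_pages=1, max_pages=2)
mem.store_u8(10, 42)
print(mem.load_u8(10))
```

Real engines typically avoid a per-access software check like `_check`, but the observable behaviour is the same: an access outside the linear memory traps instead of touching anything else.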

The second key design choice is not to support jumps, but instead to offer a foundational set of structured control flow constructs.

This ensures by construction that control flow cannot form irreducible loops, contain branches to blocks with misaligned stack heights, or branch into the middle of a multi-byte instruction. These properties allow WebAssembly code to be validated in a single pass, compiled in a single pass, or even transformed to an SSA-form intermediate form in a single pass.

Modules may import and export functions, and the import mechanism also serves as a safe foreign function interface (FFI) through which a WebAssembly program can communicate with its embedding environment. Values crossing the language boundary are automatically converted according to JavaScript rules.

WebAssembly seeks to give deterministic semantics in cases where hardware behaviour differs (corner cases such as out-of-range shifts, divide-by-zero, and so on). Three possible sources of non-determinism remain:

  1. The representation of NaN values for floating point: CPUs differ, and normalising after every numeric operation is too expensive. The pragmatic choice is for instructions to output a canonical NaN representation, unless an input is a non-canonical NaN form, in which case the output NaN is non-deterministic.
  2. The amount of available resource may differ wildly across devices. Not much you can do about that! A grow-memory instruction may therefore non-deterministically return -1 when out of memory.
  3. If you call host functions, then WebAssembly makes no guarantees about how they behave.

WebAssembly does not (yet) have threads, and therefore no non-determinism arising from concurrent memory access. Adding threads and a memory model is the subject of ongoing work beyond the scope of this paper.



The execution model is specified based on the state for a global store, which records all of the module instances, tables, and memories that have been allocated in it. (Tables are vectors of opaque values of a particular element type, at most one table may be defined for a module in the current version of WebAssembly). A set of small-step reduction relations specify the execution rules, with reductions defined over configurations that consist of a global store, local variable values, and the instruction sequence to execute. Here’s a simple example, for the full set of rules see Figure 2 in the paper.

(t.\mathbf{const}\ c_1)\ t.unop \hookrightarrow t.\mathbf{const}\ unop_t(c_1)

We can interpret this as follows:

  1. Due to validation, we know that a value of type t is on top of the stack
  2. Pop the value t.\mathbf{const} c_1 from the stack
  3. Let c be the value produced by executing the unary operator unop with argument c_1.
  4. Push t.\mathbf{const} c onto the stack
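
Those four steps translate almost directly into code. This is a minimal sketch of one reduction step on a toy operand stack; the `(type, value)` tuple encoding, the `UNOPS` table, and `step_unop` are my own choices for illustration, though `neg`, `abs`, and `sqrt` are genuine WebAssembly floating-point unary operators.

```python
import math

# A few floating-point unary operators, keyed by name.
UNOPS = {
    "neg": lambda c: -c,
    "abs": abs,
    "sqrt": math.sqrt,
}


def step_unop(stack, t, op):
    """One reduction step: (t.const c1) t.unop  ->  t.const unop_t(c1)."""
    value_type, c1 = stack.pop()       # step 2: pop t.const c1
    # Step 1: validation already guarantees the type, so this is a sanity check.
    assert value_type == t
    c = UNOPS[op](c1)                  # step 3: apply the unary operator
    stack.append((t, c))               # step 4: push t.const c


stack = [("f64", 9.0)]
step_unop(stack, "f64", "sqrt")
print(stack)
```

The type check inside `step_unop` is redundant by design: the point of validating up front is that the interpreter or compiler never has to check types at run time.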


On the Web, code is fetched from untrusted sources. Before it can be executed safely, it must be validated. Validation rules for WebAssembly are defined succinctly as a type system. This type system is, by design, embarrassingly simple. It is designed to be efficiently checkable in a single linear pass, interleaved with binary decoding and compilation.

The typing rules fit on a single page (see Figure 3 in the paper). Again, I’ll take just a single example for exposition.

\dfrac{C_{local}(i) = t}{C\ \vdash\  \mathbf{get\_local}\ i : \epsilon \rightarrow t}

We can read this as ‘given a context in which the local variable at index i is of type t (the part above the line), then under the assumptions encoded in the context, the instruction get_local i consumes nothing from the stack (ε) and produces a result of type t’.
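
To show how a rule like this becomes a single-pass check, here’s a toy validator for straight-line code. It is a sketch under my own assumptions: instructions are encoded as tuples, only `get_local` and a generic unary operator are handled, and the names are made up. It tracks operand stack types and nothing else.

```python
def validate(instrs, local_types):
    """Return the operand stack types after `instrs`, or raise TypeError."""
    stack = []
    for instr in instrs:
        op = instr[0]
        if op == "get_local":
            i = instr[1]
            if not 0 <= i < len(local_types):
                raise TypeError(f"no local at index {i}")
            # Rule: C_local(i) = t  gives  get_local i : epsilon -> t
            stack.append(local_types[i])
        elif op == "unop":
            t = instr[1]
            # t.unop : t -> t, so the top of the stack must already be a t.
            if not stack or stack[-1] != t:
                raise TypeError(f"unop expects {t} on top of the stack")
        else:
            raise TypeError(f"unknown instruction {op}")
    return stack


# Locals: index 0 is an i32, index 1 is an f64.
print(validate([("get_local", 1), ("unop", "f64")], ["i32", "f64"]))
```

Each instruction is inspected exactly once and only the stack of types is carried forward, which is what lets validation be interleaved with decoding.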

The WebAssembly type system is sound: the reduction rules cover all execution states that can arise for valid programs, and hence guarantee the absence of undefined behaviour in the execution semantics. Lots of good things follow:

In particular, this implies the absence of type safety violations such as invalid calls or illegal accesses to locals, it guarantees memory safety, and it ensures the inaccessibility of code addresses or the call stack. It also implies that the use of the operand stack is structured and its layout determined statically at all program points, which is crucial for efficient compilation on a register machine. Furthermore, it establishes memory and state encapsulation – i.e., abstraction properties on the module and function boundaries, which cannot leak information unless explicitly exported/returned – necessary conditions for user-defined security measures.

Binary format

WebAssembly is transmitted over the wire in a binary format. The translation from the abstract syntax is straightforward. Each binary represents a single module and is divided into sections with function types in their own section ahead of the code for their bodies. This means that loading latency can be minimised by starting streaming compilation as soon as function bodies arrive over the wire. Furthermore, consecutive function bodies can be compiled in parallel. V8 and SpiderMonkey both use parallel compilation and achieve a 5-6x improvement in compilation time with 8 threads.


The embedder defines how modules are loaded, how imports and exports between modules are resolved, provides foreign functions to accomplish I/O and timers, and specifies how WebAssembly traps are handled. In our work the primary use case has been the Web and JavaScript embedding, so these mechanisms are implemented in terms of JavaScript and Web APIs.

Inside the browser, you can load, compile and invoke WebAssembly modules through a JavaScript API.

Implementation experiences

Independent implementations of WebAssembly have been developed for all major browsers. V8, SpiderMonkey, and JavaScriptCore (WebKit) reuse their optimising JIT compilers to compile modules ahead of time before instantiation. Chakra (Microsoft Edge) lazily translates individual functions to an interpreted internal bytecode format on first execution, and then later JIT compiles hot functions.

Our experience reusing the advanced JITs from 4 different JavaScript engines has been a resounding success, allowing all engines to achieve high performance in a short time.

There is also a reference interpreter for the WebAssembly language, implemented in OCaml, “due to the ability to write in a high-level stylized way that closely matches the formalization, approximating an ‘executable specification’.” This is used to test production implementations and the spec, and to prototype new features.

Performance tests with V8 and SpiderMonkey using the PolyBenchC benchmark suite show that WebAssembly has good performance:

Overall, the results show that WebAssembly is very competitive with native code, with 7 benchmarks within 10% of native and nearly all of them within 2x of native.

WebAssembly is on average 33.7% faster than asm.js, with validation taking only 3% of the time it does for asm.js.

WebAssembly binaries are also compact, being on average 62.5% the size of asm.js, and 85.3% of native x86-64 code.

Future directions

The first goal is to provide fully comprehensive support for low-level code (i.e., compiled from C/C++), and future versions will include support for zero-cost exceptions, threads, and SIMD instructions.

Beyond that, we intend to evolve WebAssembly further into an attractive target for high-level languages by including relevant primitives like tail calls, stack switching, or coroutines. A highly important goal is to provide access to the advanced and highly tuned garbage collectors that are built into all Web browsers, thus eliminating the main shortcoming relative to JavaScript when compiling to the Web.

Struc2vec: learning node representations from structural identity

September 15, 2017

struc2vec: learning node representations from structural identity Ribeiro et al., KDD’17

This is a paper about identifying nodes in graphs that play a similar role based solely on the structure of the graph, for example computing the structural identity of individuals in social networks. That’s nice and all that, but what I personally find most interesting about the paper is the meta-process by which the authors go about learning the latent distributed vectors that capture the thing they’re interested in (structural similarity in this case). Once you’ve got those vectors, you can do vector arithmetic, distance calculations, use them in classifiers and other higher-level learning tasks and so on. As word2vec places semantically similar words close together in space, so we want structurally similar nodes to be close together in space.

Recall that for word embeddings we need the context of a word (the other words that appear close to it in sentences) in order to learn a word vector. If we could come up with some similar notion of context for a node in a graph, then we could use the same approach. Since we’re interested in similarity based on graph structure, regardless of node and edge labels etc., we’ll make sure that the information we want to ignore doesn’t inform the context in any way. The technique is to create a meta-graph encoding structural information about the graph of interest, and then use random walks starting from each node in the meta-graph to determine the context set.

Problem and high level approach

Structural identity is a concept of symmetry in networks in which nodes are identified based on the network structure. The concept is strongly related to functions or roles played by nodes in the network, an important problem in social and hard sciences.

We want to learn representations that capture the structural identity of nodes in the network, such that:

  • the distance between the latent representation of nodes is strongly correlated to their structural similarity.
  • the latent representation does not depend on any non-structural node or edge attribute (e.g., node labels).

Struc2vec has four main steps:

  1. Determine the structural similarity between each vertex pair in the graph, for different neighbourhood sizes.
  2. Construct a weighted multi-layer graph, in which each layer corresponds to a level in a hierarchy measuring structural similarity (think: ‘at this level of zoom, these things look kind of similar’).
  3. Use the multi-layer graph to generate context for each node based on biased random walking.
  4. Apply standard techniques to learn a latent representation from the context given by the sequence of nodes in the random walks.

Let’s now take a closer look at each of those four steps.

A measure of structural similarity

Here’s a graph G with diameter k^*.

For a node u in the graph, let R_{k}(u) be the set of nodes at radius k (k hops) from u.
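The sets R_k(u) fall straight out of a breadth-first search. Here is a minimal sketch, assuming the graph is given as an adjacency dictionary (the function name `rings` is mine, not the paper's):

```python
from collections import deque


def rings(adj, u):
    """Return a list whose entry k is the set of nodes exactly k hops from u.

    `adj` maps each node to an iterable of its neighbours (undirected graph).
    Only nodes reachable from u are included.
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:          # first visit gives the hop distance
                dist[y] = dist[x] + 1
                queue.append(y)
    result = [set() for _ in range(max(dist.values()) + 1)]
    for node, d in dist.items():
        result[d].add(node)
    return result
```

On the path graph 0 – 1 – 2 – 3, node 1 has itself at radius 0, nodes 0 and 2 at radius 1, and node 3 at radius 2.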

Now, for two nodes u and v, we want to know how similar their structures are for a given value of k. In other words, we want to compute a measure of structural distance for the k-neighbourhoods. Call this distance function f_{k}(u,v). The function is only defined when both nodes have at least one k-neighbour.

The structural distance f_{k}(u,v) has two components: the distance at the (k-1)-neighbourhood, and the degree sequence distance g(s(R_k(u)),s(R_k(v))) between the ordered degree sequence s of the nodes in k-neighbourhoods of u and v.

f_{k}(u,v) = f_{k-1}(u,v) + g(s(R_k(u)),s(R_k(v)))

The two degree sequences may be of different sizes, and each element in the sequence will have some value in the range [0, n-1] for an n-node graph. The authors treat these sequences like a timeseries and use DTW (dynamic time warping) to assess their similarity. DTW in turn needs yet another distance function, which computes a distance between matched elements in the first and second sequences. For this, we can use:

d(a,b) = \frac{\max(a,b)}{\min(a,b)} - 1

This has a few desirable properties: two identical ordered degree sequences will have zero distance; and the distance between e.g., degrees 1 and 2 (d = 1) is much larger than between e.g., degrees 101 and 102 (d ≈ 0.01).
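Here is a rough sketch of how these pieces might fit together: the element cost d(a,b), a textbook dynamic-programming DTW, and the recursive sum that gives f_k. It is not the authors' implementation. The function names are mine, degrees are assumed to be strictly positive (so the division is safe), and the recursion is assumed to start from f_{-1} = 0.

```python
def degree_cost(a, b):
    """d(a, b) = max(a, b) / min(a, b) - 1, assuming positive degrees."""
    return max(a, b) / min(a, b) - 1


def dtw(seq_a, seq_b, cost):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            table[i][j] = step + min(table[i - 1][j],
                                     table[i][j - 1],
                                     table[i - 1][j - 1])
    return table[n][m]


def structural_distance(seqs_u, seqs_v, k):
    """f_k(u, v) as a running sum of DTW distances over rings 0..k.

    seqs_u[j] is the ordered degree sequence of the nodes in R_j(u);
    both nodes are assumed to have non-empty rings up to radius k.
    """
    total = 0.0                      # assumed base case: f_{-1} = 0
    for j in range(k + 1):
        total += dtw(seqs_u[j], seqs_v[j], degree_cost)
    return total
```

Note how warping lets sequences of different lengths match at zero cost when one simply repeats a degree of the other, e.g. `[1, 2]` against `[1, 2, 2]`.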

At this point, we have ourselves a structural similarity distance measure!

A multi-layered context graph

For a graph with diameter k^{*}, we’re going to construct a multi-layer graph with k^{*} levels. Each layer k contains one node for each node in the original graph, and weighted undirected edges are created between the nodes in the layer according to their structural distance in their k-neighbourhood:

w_{k}(u,v) = e^{-f_{k}(u,v)}

Edges are defined only if f_k(u,v) is defined. Here’s what one layer of the graph might look like.

Now we need to make connections across the layers. Each vertex is connected to its corresponding vertex in the layer above and the layer below. Moving down the layers, these edges simply have weight 1. Moving up the layers, the weight of the edge is based on the similarity \Gamma_{k}(u) of a node u to other nodes in layer k.

w(u_k,u_{k+1}) = \log(\Gamma_k(u) + e)

\Gamma_k(u) is the number of edges incident to u that have weight larger than the average edge weight of the complete graph in layer k.
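As a small illustration of the weights in one layer, here is a sketch that turns a table of structural distances into edge weights, counts \Gamma_k(u), and computes the weight of the upward edge. The dictionary layout and function names are assumptions for the example, and the average is taken over whichever pairs are present in the table.

```python
import math


def layer_weights(distances):
    """Map each pair's distance f_k(u, v) to the weight exp(-f_k(u, v))."""
    return {pair: math.exp(-f) for pair, f in distances.items()}


def gamma(weights, u):
    """Count edges incident to u that are heavier than the layer average."""
    average = sum(weights.values()) / len(weights)
    return sum(1 for (a, b), w in weights.items()
               if u in (a, b) and w > average)


def up_weight(weights, u):
    """Weight of the edge from u in this layer to u in the layer above."""
    return math.log(gamma(weights, u) + math.e)
```

A node with no above-average edges gets an upward weight of log(e) = 1, the same as the downward weight; a node with many similar nodes in the current layer gets a heavier upward edge.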

Creating context through random walks

The context for a node u is based on random walks in the multi-layer graph M, starting at the corresponding vertex in layer 0. Repeating the walk a number of times yields multiple independent contexts for u.

At each step in the random walk, we either move to another node in the current layer, or follow an edge to switch layers. The walk stays in the current layer with probability q, choosing a node v to walk to with probability

p_{k}(u,v) = \frac{e^{-f_k(u,v)}}{Z_k(u)}

where Z_k(u) is just the normalization factor for the vertex u in layer k.

If we’re changing layers, then we go up a layer with probability proportional to the edge weight:

p_k(u_k, u_{k+1}) = \frac{w(u_k, u_{k+1})}{w(u_k, u_{k+1}) + w(u_k, u_{k-1})}

(And hence down a layer, if a lower layer exists, with probability 1 - p_k(u_k, u_{k+1})).
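A single step of the walk might look something like the sketch below. The function names and the way distances and edge weights are passed in are my own choices for illustration, and boundary layers (where there is no layer below or above) are not handled.

```python
import math
import random


def same_layer_probs(distances_from_u):
    """p_k(u, v) = exp(-f_k(u, v)) / Z_k(u) for each candidate v."""
    weights = {v: math.exp(-f) for v, f in distances_from_u.items()}
    z = sum(weights.values())                 # normalization factor Z_k(u)
    return {v: w / z for v, w in weights.items()}


def prob_up(w_up, w_down):
    """Probability of moving up, given the two inter-layer edge weights."""
    return w_up / (w_up + w_down)


def walk_step(rng, q, distances_from_u, w_up, w_down):
    """Return ("stay", v), ("up", None) or ("down", None)."""
    if rng.random() < q:                      # stay in the current layer
        probs = same_layer_probs(distances_from_u)
        nodes = list(probs)
        v = rng.choices(nodes, weights=[probs[n] for n in nodes])[0]
        return ("stay", v)
    if rng.random() < prob_up(w_up, w_down):  # otherwise change layer
        return ("up", None)
    return ("down", None)
```

Passing in a seeded `random.Random` keeps the walks reproducible, which is handy when comparing learned vectors across runs.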

Learning the vectors

Let the node sequences generated by the random walks be the context, and use your favourite technique to learn a representation that predicts the context nodes for a given node u. The authors used Skip-Gram for this.
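To see what ‘context’ means operationally, here is a sketch of how a walk could be turned into (node, context node) training pairs for a Skip-Gram style model. The window size is a hypothetical parameter of the example, and the training itself is left to whatever embedding library you prefer.

```python
def skipgram_pairs(walk, window):
    """List (center, context) pairs for every node in a walk.

    A node's context is every other position within `window` steps of it.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```

For the walk `["a", "b", "c"]` with a window of 1, node `b` is paired with both `a` and `c`, while the end nodes each get a single context node.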


In larger graphs, we can end up computing weights for a lot of edges ( O(k^{*}n^3) ). To make the technique more practical the authors discuss three optimisations: reducing the length of degree sequences considered, reducing the number of pairwise similarity calculations, and reducing the number of layers. Details are in §3.5. The evaluation results show that this pragmatic pruning has only a small impact on the learned vectors’ effectiveness.


Barbell graphs

A Barbell graph is constructed with two complete graphs connected by a path graph, like so:

If we look at the latent node structure vectors learned by struc2vec, we see that all of the complete-graph nodes from both bells are clustered together (the blue dots), as are symmetrical points on the connecting path.

Other approaches, such as DeepWalk, don’t cope so well with the structural similarity and e.g., create two different clusters of vectors for the two bells:

Karate Club network

The Karate Club network has 34 nodes and 78 edges, with each node representing a club member and an edge between members denoting that they have interacted outside of the club. The authors construct a network consisting of two identical copies of the Club network joined by one connecting edge:

Struc2vec is able to closely pair the corresponding nodes in each half of the graph (which obviously have the same structure).

As an example, note that nodes 1, 34 and their correspondent mirrors (37 and 42) are in a separate cluster in the latent space. Interestingly, these are exactly the nodes that represent the club instructor Mr. Hi and his administrator John A. The network was gathered after a conflict between the two split the members of the club into two groups – centered on either Mr. Hi or John A. Therefore, nodes 1 and 34 have the specific and similar role of leaders in the network. Note that struc2vec captures their function even though there is no edge between them.

Robustness to noise

Starting with an egonet graph extracted from Facebook, edge sampling is used to create two new graphs in which each edge from the original graph is present with some probability s. For varying values of s, the following chart shows the distance in latent space between corresponding node pairs and all node pairs. Even when s = 0.3, so that the probability that an original edge appears in both sampled graphs is 0.09 (0.3^2), the framework still places corresponding nodes closer in the latent space.

This experiment indicates the robustness of the framework in uncovering the structural identity of nodes even in the presence of structural noise, modeled here through edge removals.
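The 0.3^2 = 0.09 figure is easy to check with a quick simulation of the edge-sampling step. This sketch uses a made-up edge list rather than the Facebook egonet, and the function names are mine.

```python
import random


def sample_edges(edges, s, rng):
    """Keep each edge independently with probability s."""
    return {e for e in edges if rng.random() < s}


def overlap_fraction(edges, s, seed=0):
    """Fraction of original edges that survive in both sampled graphs."""
    rng = random.Random(seed)
    first = sample_edges(edges, s, rng)
    second = sample_edges(edges, s, rng)
    return len(first & second) / len(edges)
```

With a few thousand edges and s = 0.3, the fraction returned should sit close to 0.09, so the two sampled graphs share very few edges.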

Using the learned vectors for classification

In a final experiment, the authors take air traffic network graphs for Brazil, America, and Europe where nodes correspond to airports and edges correspond to commercial flights. The struc2vec latent vectors are used to train a supervised classifier which predicts how active the airport is (which quartile the airport appears in when ranked by busyness). The struc2vec vectors are compared against alternative features for this task: simple node degree, and vectors from node2vec, a prior approach. Struc2vec demonstrates superior classification accuracy on this task.