Skip to content

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers

April 23, 2018

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers Lee et al., ASPLOS’18

(The link above is to the ACM Digital Library, if you don’t have membership you should still be able to access the paper pdf by following the link from The Morning Paper blog post directly.)

How do you know how well your large kubernetes cluster / data centre / warehouse-scale computer (WSC) is performing? Is a particular change worth deploying? Can you quantify the ROI? To do that, you’re going to need some WSC-wide metric of performance. Not so easy! The WSC may be running thousands of distinct jobs all sharing the same underlying resources. Developing a load-testing benchmark workload to accurately model this is ‘practically impossible.’ Therefore, we need a method that lets us evaluate performance in a live production environment. Google’s answer is the Warehouse Scale performance Meter (WSMeter), “a methodology to efficiently and accurately evaluate a WSC’s performance using a live production environment.” At WSC scale, even small improvements can translate into considerable cost reductions. WSMeter’s low-risk, low-cost approach encourages more aggressive evaluation of potential new features.

Challenges in defining a WSC performance metric

Consider a change to rollout a new DVFS (power management) policy. Is the impact positive or negative? The answer is both, depending on your perspective!

We observe that half of the jobs experience performance boost while the other half suffer from degradations. It is difficult to finalize the decision because the jobs have conflicting performance behaviors and there is no metric to summarize the overall performance impact.

Even when you pick one job with overall improvement performance, and one with overall degraded performance, and then look at the distribution across the instances of those jobs, it’s hard to draw clear conclusions given the input data variance and resource interference.

… the jobs and machines in WSCs show diverse behaviors at scale, making it difficult to accurately quantify the performance from few observations.

Quota-weighted performance

The goal is to boil the performance of the whole WSC down to a single number (we’ll talk about the limitations of that shortly, but for now let’s just roll with it). We can measure the performance of individual jobs, and intuitively the metric should be some aggregation of the per-job performance. Suppose the performance of an individual job i is given by p_i, then the simplest measure would be to take the mean value of p_i across all jobs. WSMeter starts there, but makes three tweaks:

  1. It uses a weighted average, where the performance of a given job is weighted by a factor indicating its importance. For Google, the weighting factor is based on a job’s quota, specifically the CPU quota, indicating the maximum amount of CPU the job is allowed to use. The weighting factor for a given job is the CPU quota for that job, divided by the total quota across all jobs.
  2. For the job performance metric, there are two choices. If all jobs in the scope of interest share the same user-level performance metric, then that can be used. Otherwise Google use Instructions per Cycle (IPC) as the performance metric. For Google jobs, this correlates well with user-level performance metrics.
  3. Instead of using the performance of all jobs in the metric calculation, it uses only a subset of the jobs. An analysis of Google workloads shows that much of the CPU quota is consumed by a relatively smaller number of heavyweight jobs that are common across their WSCs. (See the charts below).

We find that many of these heavyweight jobs are basic infrastructure jobs (e.g., databases, caches, distributed locks) that the other higher-level jobs utilize; they are therefore common across WSCs…

WSMeter samples from the top-N common jobs, and find through empirical investigation that considering 10%-20% of the common jobs reduces the deviation from the results observed when considering all common jobs to just a few percent.

We emphasize that the heavyweight jobs are still much more diverse compared to the jobs in typical load-testing benchmark suites (e.g., about a dozen), highlighting the need for an evaluation using live environments.

The thing that jumps out to me here is that this is a rich-get-richer scheme. By boiling WSC performance down to a single number, and deploying upgrades based on their ability to improve that number, we end up improving the performance of the largest jobs by weight in the WSC. Of course, in aggregate that’s going to give the best overall improvement. But the metric allows all sorts of carnage to happen in the tail. (Think of the usefulness of e.g. mean latency vs 99th-%ile latency, or even seeing the full distribution). There are particular characteristics of Google’s workloads in their WSCs that make this trade-off work for them. It would be worth thinking about whether or not your own workloads share similar characteristics before rushing to use this single metric.

A statistical performance model

We have a metric, so now we can turn our attention to the best way to capture the underlying measurements in a production setting. There is natural performance variation across job instances (input parameters, neighbours, and so on), so we can’t rely on just a single measurement. Instead, we can treat the performance of a job as a random variable following a certain probability distribution.

The Central Limit Theorem then suggests that if we can make sufficient observations, we can accurately deduce the average statistic.

The following figure shows distribution of a sample’s mean for 1000 sampling trials using each of two different sample sizes. When N=100, you can see the estimations heavily clustered around the true value, while the sample with N=1 does not share this characteristic.

WSMeter estimates the performance of each selected job in this way, and using those job performance estimations as inputs to the quota-weighted averaging process we saw earlier. There’s a little bit of trickiness involved in summing the individual t-distribution random variables though:

As the summation of t-distribution random variables does not follow a t-distribution (or other well-known distributions), we approximate [the individual job performance metrics] to follow a normal distribution.

Under this assumption the weighted sum of per-job performance follows a normal distribution with mean and variance:

Empirical investigation shows that it is safe to ignore the covariance term (see section 4.3).

Therefore, we approximate the WSC performance as a normal random variable whose mean is the weighted sum of the per-job performance estimations and the variance is the weight-squared sum of the per-job variances (i.e., ignore the covariance term).

To control the accuracy of the result (i.e., the variance), we can adjust the number of observations N_i for each job. Note that N_i can be different for each job we want to sample. WSMeter gives this problem to a sat solver, asking it to find an optimal combination of N_is under the follow constraints:

The minimum value of N_i is set to 4, based on an analysis of the top four heavyweight jobs and the number of samples needed to ensure a normal distribution. This value is a trade-off balancing coverage, fidelity, and cost.

WSMeter in action

End-to-end, WSMeter looks like this:


Assessing a potential performance improvement is a five-step process:

  1. The current WSC performance is baselined, using all the job instances in the WSC to calculate population performance statistics.
  2. The jobs to be included in the target evaluation are selected to achieve the target weight-coverage. At this point it is also necessary to decide upon the desired evaluation fidelity.
  3. The sat solver solves the optimisation problem to determine the number of job instances to observe for each job.
  4. A testing cluster is built with the new feature. This can either be done by replacing existing machines in the WSC, or by adding new machines. The new job performance characteristics as well as the overall performance characteristics are then measured.
  5. In the case that the new feature increases the performance variance of the jobs, the initial N_is may become insufficient. If so, the variances in the optimisation problem are updated and one more round of measurement is done until the requirement is satisfied.

For the case of a CPU upgrade, and a new DVFS policy…

… WSMeter accurately discerns 7% and 1% performance improvements from WSC upgrades using only 0.9% and 6.6% of the machines in the WSCs, respectively. We emphasize that naive statistical comparisons incur much higher evaluation costs (> 4x) and sometimes even fail to distinguish subtle differences.

The third case study is especially interesting, because it shows how the methodology can be applied to a WSC customer. A customer operates one compute-intensive service and one network intensive service of equal importance and cost, with 1,500+ instances of each job deployed on the cloud. The upgrade to be tested is a collection of compiler optimisations and software upgrades. When testing these changes in isolation, an 11% improvement is observed, but using WSMeter a picture of the real in-production gains, at around 6.6% emerges.

This discrepancy strongly encourages the WSC customers, service developers, and researchers, to evaluate new features on live WSCs to measure the realistic performance impacts.

The architectural implications of autonomous driving: constraints and acceleration

April 20, 2018

The architectural implications of autonomous driving: constraints and acceleration Lin et al., ASPLOS’18

Today’s paper is another example of complementing CPUs with GPUs, FPGAs, and ASICs in order to build a system with the desired performance. In this instance, the challenge is to build an autonomous self-driving car!

Architecting autonomous driving systems is particularly challenging for a number of reasons…

  1. The system has to make “correct” operational decisions at all times to avoid accidents, and advanced machine learning, computer vision, and robotic processing algorithms are used to deliver the required high precision. These algorithms are compute intensive.
  2. The system must be able to react to traffic conditions in real-time, which means processing must always finish under strict deadlines (about 100ms in this work).
  3. The system must operate within a power budget to avoid negatively impacting driving range and fuel efficiency.

So how do you build a self-driving car?

There are several defined levels of automation, with level 2 being ‘partial automation’ in which the automated system controls steering and acceleration/deceleration under limited driving conditions. At level 3 the automated system handles all driving tasks under limited conditions (with a human driver taking over outside of that). By level 5 the system is fully automated.

The following table shows the automation levels targeted by leading industry participants, together with the platforms and sensors used to achieve this:

In this work, the authors target highly autonomous vehicles (HAVs), operating at levels 3-5. Moreover, they using vision-based systems using cameras and radar for sensing surroundings, rather than the much more expensive LIDAR.

A typical end-to-end autonomous driving system looks something like this:


Captured sensor data is fed to an object detector and a localizer (for identifying vehicle location at decimeter-level) in parallel. The object detector identifies objects of interest and passes these to an object tracker to associate the objects with their movement trajectory over time. The object movement information from the object tracker and the vehicle location information from the localizer is combined and projected onto the same 3D coordinate space by a sensor fusion engine.

The fused information is used by the motion planning engine to assign path trajectories (e.g. lane change, or setting vehicle velocity). The mission planning engine calculates the operating motions needed to realise planned paths and determine a routing path from source to destination.

Due to the lack of a publicly available experimental framework, we build an end-to-end workload to investigate the architecture implications of autonomous driving vehicles.

For each of the six main algorithmic components (object detection, object tracking, localization, fusion, motion planning, and mission planning), the authors identify and select state of the art algorithms.

Object detection

For object detection, the YOLO DNN-based detection algorithm is used.

Object detection is focused on the four most important categories for autonomous driving: vehicles, bicycles, traffic signs, and pedestrians.

Object tracking

GOTURN (also DNN-based) is used for object tracking. GOTURN tracks a single object, so a pool of trackers are used.


For localization ORB-SLAM is used, which gives high accuracy and can also localize the vehicle regardles of viewpoint.


No reference is given for the fusion engine component, but it is comparatively simple compared to the other components, and perhaps was custom built for the project. It combines the coordinates of tracked objects from the GOTURN trackers with the vehicle position from ORB-SLAM and maps them onto the same 3D-coordinate space to be sent to the motion planning engine.

Motion & Mission planning

Both motion and mission planning components come from the Autoware open-source autonomous driving framework. Motion planning is done using graph-based search to find minimum cost paths when the vehicle is in large open spaces like parking lots or rural areas, in more structured areas ‘conformal lattices with spatial and temporal information’ are used to adapt the motion plan to the environment. Mission planning uses a rule-based approach combining traffic rules and the driving area condition, following routes generated by navigation systems such as Google Maps. Unlike the other components which execute continuously, mission planning is only executed once unless the vehicle deviates from planned routes.


How quickly an autonomous system can react to traffic conditions is determined by the frame rate (how fast we can feed real-time sensor data into the process engine), and the processing latency (how long it takes us to make operational decisions based on the captured sensor data). The fastest possible action by a human driver takes 100-150ms (600-850ms is more typical). The target for an autonomous driving system is set at 100ms.

An autonomous driving system should be able to process current traffic conditions within a latency of 100ms at a frequency of at least once every 100ms.

The system also needs extremely predictable performance, which means that long latencies in the tail are unacceptable. Thus 99th, or 99.99th, percentile latency should be used to evaluate performance.

All those processors kick out a lot of heat. We need to keep the temperature within the operating range of the system, and we also need to avoid overheating the cabin. This necessitates additional cooling infrastructure.

Finally, we need to keep an eye on the overall power consumption. A power-hungry system can degrade vehicle fuel efficiency by us much as 11.5%. The processors themselves contribute about half of the additional power consumption, the rest is consumed by the cooling overhead, and the storage power costs of storing tens of terabytes of map information.

With the CPU-only baseline system (16-core Intel Xeon at 3.2GHz), we can clearly see that three of the key components: object detection, tracking, and localisation, do not meet the 100ms individually, let alone when combined.

Beyond CPUs

Looking further into these three troublesome components, we can see that DNN execution is the culprit in detecting and tracking, and feature extraction takes most of the time in localisation.

…conventional multicore CPU systems are not suitable to meet all the design constraints, particularly the real-time processing requirement. Therefore, we port the critical algorithmic components to alternative hardware acceleration platforms, and investigate the viability of accelerator-based designs.

The bottleneck algorithmic components are ported to GPUs using existing machine learning software libraries. YOLO is implemented using the cuDNN library from Nvidia, GOTURN is ported using Caffe (which in turn using cuDNN), and ORB-SLAM is porte to GPUs using the OpenCV library.

FPGA optimised versions of DNN and feature extraction are built using the Altera Stratix V platform.

Previously published ASIC implementations are used for DNNs, and a feature extraction ASIC is designed using Verilog and synthesized using the ARM Artisam IBM SOI 45nm library.

Here’s how the CPU, GPU, FPGA, and ASIC implementations compare for latency and power across the three components:


This analysis immediately rules out CPUs and FPGAs for the detection and tracking components, because they don’t meet the tail latency target. Note also that specialized hardware platforms such as FPGAs and ASCIs offer significantly higher energy efficiency.

The perfect blend?

You can use different combinations of CPU, GPU, FPGA, and ASIC for the three different components, in an attempt to balance latency and power consumption. Focusing just on latency for the moment, this chart shows the 99.99th percentile latency for various combinations:


The lowest tail latency of all comes when using GPU-based object detection, and ASICs for tracking and localisation. If we look at power consumption though, and target a maximum of 5% driving range reduction, then we can see that this particular combination is slightly over our power budget.


The all-ASIC or ASIC + FPGA-based localisation are the only combinations that fit within the 5% range-reduction budget, and also (just!) meet the latency budget.

Higher resolution cameras can significantly boost the accuracy of autonomous driving systems. The authors modified their benchmarks to look at end-to-end latency as a function of input camera resolution. Some of the ASCI and GPU accelerated systems can still meet the real-time performance constraints at Full HD resolution (1080p), but none of them can sustain Quad HD: computational capability still remains the bottleneck preventing us from benefiting from higher resolution cameras.

The last word

We show that GPU- FPGA- and ASCI-accelerated systems can reduce the tail latency of [localization, object detection, and object tracking] algorithms by 169x, 10x, and 93x respectively… while power-hungry accelerators like GPUs can predictably deliver the computation at low latency, their high power consumption, further magnified by the cooling load to meet the thermal constraints, can significantly degrade the driving range and fuel efficiency of the vehicle.

Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly

April 19, 2018

Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly Turakhia et al., ASPLOS’18

With the slow demise of Moore’s law, hardware accelerators are needed to meet the rapidly growing computational requirements of X.

For this paper, X = genomics, and genomic data is certainly growing fast: doubling every 7 months and on track to surpass YouTube and Twitter by 2025. Rack-size machines can sequence 50 genomes a day, portable sequencers require several days per genome. Third-generation sequencing technologies are now available which produce much longer reads of contiguous DNA – on the order of tens of kilobases compared to only a few hundred bases with the previous generations of technology.

For personalized medicine, long reads are superior in identifying structural variants i.e. large insertions, deletions and re-arrangements in the genome spanning kilobases or more which are sometimes associated with diseases; for haplotype phasing, to distinguish mutations on maternal vs paternal chromosomes; and for resolving highly repetitive regions in the genome.

The long read technology comes with a drawback though – high error rates in sequencing of between 15%-40%. The errors are corrected using computational methods ‘that can be orders of magnitude slower than those used for the second generation technologies.

This paper describes “Darwin”— a co-processor for genomic sequence alignment that , without sacrificing sensitivity, provides up to 15,000x speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design…

The algorithms are D-SOFT and GACT. D-SOFT (Diagonal-band Seed Overlapping based Filtration Technique) implements the filtering (seeding) stage of alignment. GACT (Genome Alignment using Constant memory Traceback) uses dynamic programming for aligning arbitrarily long sequences using constant memory for the compute intensive step. Both algorithms have accompanying hardware accelerators, which are prototyped using FPGAs and also embodied in an ASIC design.

Genome sequence alignment and assembly

The sequence alignment problem takes a query sequence Q of letters in \{A,C,G,T\} (corresponding to the four nucleotide bases) and tries to find the best match (alignment) with a reference sequence R, assigning each letter in R and Q to either a single letter in the opposite sequence, or to a gap. The scoring scheme rewards matching bases, and penalises mismatches and gaps. For example:

The seed-and-extend approach starts with substrings (the seeds) of fixed size k drawn from Q, and finds their exact matches in R (called seed hits). Once seed hits are found, the cells surrounding the hits are explored (using a dynamic programming approach). This avoids the high cost of exploring the full space. A drawback of the seed-and-extend approach is that alignments with no exactly matching substrings of length greater than or equal to k will not be discovered, reducing the algorithm’s sensitivity.

Sequencing is the process of shearing DNA molecules into a large number N of random fragments and reading the bases in each fragment. N is chosen such that each base-pair is expected to be covered multiple times, enabling errors to be corrected via a consensus mechanism.

Genome assembly is the process of reconstructing a genome G from a set of reads S. The reconstruction can be reference guided, using reference sequence R to guide re-assembly, or it can be de novo, reconstructing G from S without the benefit (and bias!) of R.

D-SOFT filtering

For long reads with high error rates, simple seeding algorithms like BLAST give a high false positive rate that results in excessive computation in the extend or alignment stage. To address this issue, Myers14 uses multiple seeds from Q, counting the number of matching bases in a band of diagonals and then filtering on a fixed threshold.

D-SOFT uses a similar filtering strategy, but with a different implementation.


D-SOFT filters by counting the number of unique bases in Q covered by seed hits in diagonal bands. If the count in a given band exceeds h bases, then the band is selected for alignment.

In the figure above, the X axis corresponds to positions in R, divided into N_B bins, each B bases wide. The Y axis corresponds to positions in Q. Ten candidate seeds from Q with length k=4 are used from consecutive positions, starting at position 1. Red dots mark the position of each hit in R, with the ‘tail’ of the dot covering k=4 bases in R and Q showing the extent of the match.

The count for a diagonal is then simply the number of bases in Q covered by seed hits in that band. In the example above, both Bin1 and Bin 3 have three seed hits, but in Bin 1 only 6 bases are covered, whereas in Bin 3 there are 9 covered bases. With the threshold h set to 8, only bin 3 will be chosen as a candidate.

Counting unique bases in a diagonal band allows D-SOFT to be more precise at high sensitivity than strategies that count seed hits, because overlapping bases are not counted multiple times.

Seed lookup is implemented using a seed position table: for each of the 4^k possible seeds, a seed pointer table points to the beginning of a list of hits in a position table.

Accelerator design

The D-SOFT accelerators maps the sequentially accessed seed tables to off-chip DRAM and places the randomly-accessed bin arrays in banks of on-chip SRAM. Incoming seeds are distributed over four DRAM channels, where a seed-position lookup block looks up the hits in the position table and calculates the bin of each hit. Bin-offset pairs are then sent via a network-on-chip to the on-chip SRAM bank associated with the bin.

With 64MB SRAMs divided into 16 banks, up to 32M bins can accommodate at 4Gbp reference sequence with B=128.

GACT alignment

GACT performs the alignment extension part of a seed-and-extend system. It finds an arbitrarily long alignment extension in linear time by following the optimal path within overlapping tiles. GACT uses two parameters, the tile size T and an overlap threshold O (less than T). Empirically, we have found that GACT gives an optimal result (identical to Smith-Waterman) with reasonable values of T and Q.


The algorithm for extending sequences to the left is show below, reverse R and Q for extending to the right:

Accelerator design

The GACT array accelerators handle the compute-intensive align routine in GACT, with the rest of the algorithm implemented in software.

A linear array of N_{pe} processing elements (PEs) exploits wavefront parallelism to compute up to N_{pe} cells of the DP-matrix for the Smith-Waterman algorithm with affine gap penalties. Each clock cycle, each PE computes three scores and one trackback pointer for one cell of the DP-matrix.

The three scores are the affine gap penalties for insertion and deletion, and the overall cell score.

Darwin: putting it all together

The Darwin co-processor combines the hardware-accelerated GACT and D-SOFT algorithms. It uses four 32GB LPDDR4 DRAM memory channels, storing identical copies of the seed position table, the reference sequence, and the query sequences. This helps to keep all channels load-balanced and fully utilised at all times.

For both reference guided and de novo assembly the forward and reverse complement of P reads are used as queries into a reference sequence R. For reference guided assembly an actual reference sequence is used. For de novo assembly the reads are concatenated to form the reference sequence, with each read padded to take up a whole number of bins. Candidate bins from D-SOFT are passed to the GACT array which extends and scores the alignment.

The assembly algorithms are partitioned between hardware and software as follows. Software initializes D-SOFT and GACT parameters and receives candidate alignment positions from D-SOFT, which is accelerated in hardware. Align – the compute extensive step in GACT – is hardware accelerated, the remainder of the alignment step is performed in software. Software also constructs the seed position table in de novo assembly.


The following table compares Darwin’s performance against baseline algorithms for reference-guided assembly of the human genome using 50,000 randomly sampled reads for each of three read sets, as well as de novo assembly of a C. elegans genome and a human genome. The D-SOFT parameters in Darwin are adjusted to match or exceed the sensitivity of the baseline technique. Darwin achieves speedups of up to 15,000x!

The following chart (log scale) shows the breakdown of where all those performance gains are coming from.

ASIC synthesis was carried out using the Synopsys Design Compiler in a TSMC 40nm CMOS process. The area and power breakdown of the different components of the ASIC are shown in this table:

Darwin can operate at 847MHz (3.4x the FPGA frequency) with relatively small power consumption (about 15W). In a modern 14nm process Darwin would be much smaller (about 50mm squared vs 412mm squared) and consume even less power (about 6.4W).

Darwin’s filtering and alignment operations are general and can be applied to rapidly growing sequence alignment applications beyond read assembly, such as whole genome alignments, metagenomics and multiple sequence alignments.

Google workloads for consumer devices: mitigating data movement bottlenecks

April 18, 2018

Google workloads for consumer devices: mitigating data movement bottlenecks Boroumand et al., ASPLOS’18

What if your mobile device could be twice as fast on common tasks, greatly improving the user experience, while at the same time significantly extending your battery life? This is the feat that the authors of today’s paper pull-off, using a technique known as processing-in-memory (PIM). PIM moves some processing into the memory itself, avoiding the need to transfer data from memory to the CPU for those operations. It turns out that such data movement is a major contributor to the total system energy usage, so eliminating it can lead to big gains.

Our evaluation shows that offloading simple functions from these consumer workloads to PIM logic, consisting of either simple cores or specialized accelerators, reduces system energy consumption by 55.4% and execution time by 54.2%, on average across all of our workloads.

Energy as a limiting factor

While the performance requirements of consumer devices increase year on year, and devices pack in power-hungry CPUs, GPUs, special-purpose accelerators, sensors and high-resolution screens to keep pace, lithium ion battery capacity has only doubled in the last 20 years. Moreover, the thermal power dissipation in consumer devices is now a severe performance constraint.

…among the many sources of energy consumption in consumer devices…, data movement between the main memory system and the computation units (e.g., CPUs, GPU, special-purpose accelerators) is a major contributor to the total system energy.

For example, when scrolling through a Google Docs web page, simply moving data accounts for 77% of the system energy consumption.

…the energy cost of moving data is orders of magnitude higher than the energy cost of computation. We find that across all of the applications we study, 62.7% of the total system energy, on average, is spent on data movement between main memory and the compute units.

PIM to the rescue?

It follows therefore that the total system energy requirements can be greatly reduced (which is the main goal of the work, the performance speed-ups are just a nice bonus!) if we can avoid moving data. If we don’t want to move the data, the alternative is to move the compute to where the data already is. Here we’re talking about data in memory (DRAM), and compute on a CPU. How can we offload compute from CPUs to DRAM???

The emergence of 3D-stacked DRAM architectures offers a promising solution. These architectures stack multiple layers of DRAM arrays within a single chip, and use vertical through-silicon vias (TSVs) to provide much greater bandwidth between layers than the off-chip bandwidth available between DRAM and the CPUs. Several 3D-stacked DRAM architectures provide a dedicated logic layer within the stack that can have low-complexity (due to thermal constraints) logic.

Using this logic layer, it is possible to perform processing in memory (PIM), also known as near-data processing. Offloading processing using PIM avoids the energy hungry data movement, and also enables us to take advantage of the high-bandwidth, low-latency data access inside the DRAM. Given the area and energy constraints of consumer devices though, there’s not a lot of room to play with. So the offloaded processing logic needs to be kept relatively simple.

The paper explores two different ways of utilising the available PIM capacity:

  1. By creating a PIM core: a small low-power general-purpose embedded core,
  2. By embedding a small collection of fixed-function accelerators (PIM accelerators).

PIM accelerators can provide larger benefits, but require custom logic to be implemented for each workload.

To study the feasibility and benefits of PIM for consumer devices, the authors study several common workload types using when browsing the web, running TensorFlow models (used by e.g. Google Translate, Google Now, and Google Photos), playing videos (e.g. YouTube), and video capture (e.g. Google Hangouts)….

These workloads are among the most commonly used applications by consumer device users, and they form the core of many Google services that each have over a billion monthly active users.

Browsing the web

The user perception of browser speed is based on page load time, smooth scrolling, and fast switching between tabs. The authors study the energy demands in scrolling and tab switching tasks (both of which also involve page loading).


On average across all pages in the test, 41.9% of page scrolling energy is spent on texture tiling and colour blitting:

For scrolling through a Google Docs web page, fully 77% of the total energy consumption turns out to be due to data movement.

Texture tiling is the biggest culprit. The standard texture tiling process reads linear bitmaps in from memory, converts them to tiled textures, and writes the tiles back to memory (see left-hand side of the figure below). If we can offload texture tiling to PIM, we can save all that data movement back-and-forth:

Our analysis indicates that texture tiling requires only simple primitives: memcopy, bitwise operations, and simple arithmetic operations (e.g. addition). These operations can be performed at high performance on our PIM core, and are amenable to be implemented as a fixed-function PIM accelerator.

19.1% of the total system energy used during page scrolling is attributable to color blitting, and 63.9% of that to data movement. Color blitting also requires only low-cost computations such as memset, simple arithmetic, and shift operations. Thus it can also be moved to PIM.

Switching tabs

When under memory pressure, Chrome compresses the memory pages of inactive tabs and places them into a DRAM-based memory pool called ZRAM. Swapping data in and out and the associated compression and decompression causes a lot of data movement.

Compression and decompression are both good candidates to move to PIM-based execution. Chrome’s ZRAM uses LZO compression which uses simple operations and can execute without performance loss on the PIM core.


The following chart compares the energy costs of the CPU-only approach versus using the general purpose PIM core or special purpose PIM accelerators. On average, PIM core reduces energy consumption by 51.3%, and PIM accelerators by 61.0%. Moreover, PIM-Core and PIM-Acc both significantly outperform the CPU only approach, improving performance on average by 1.6x and 2.0x respectively.

Feeding it forwards (TensorFlow Mobile)

Within TensorFlow mobile, an energy analysis reveals that the most energy-hungry operations are packing/unpacking and quantization (39.3% of total system energy). Together these are responsible for 54.4% of the data movement energy.

They are also responsible for a large chunk of the execution time:

Packing and unpacking reorder elements of matrices to minimise cache misses during multiplication, and quantization converts 32-bit floating point and integer values into 8-bit integers.

Our analysis reveals that packing is a simple data reorganization process and requires simple arithmetic operations to scan over matrices and compute new indices. As a result, packing and unpacking can be performed at high performance on our PIM core, or on a PIM accelerator…

Quantization is also a simple data conversion operation and can be similarly offloaded:

The end result is a 50.9% (54.9%) reduction in energy usage for PIM-Core (PIM-Accelerator) compared to the CPU-only benchmark.

Furthermore, the PIM-based approaches also show speedups of up to 2x, depending on the number of GEMM (Generalized Matrix Multiplication) performed.

Watching and recording videos

The paper contains an analysis of both software and hardware based VP9 decoding (an open-source codec widely used by video-based applications). For higher resolution playback, hardware decoders are essential. Even in the hardware case, the authors still find a significant amount of off-chip data movement generated by the hardware decoder. The main culprit is a process called sub-pixel interpolation, which takes place in the motion compensation (MC) unit and requires reading reference frame data (2.9 reference frame pixels are read from DRAM for every pixel of the current frame).

The MC unit is a good fit for PIM execution, because we can eliminate the need to move reference frames from memory to the on-chip decoder, thus saving significant energy.

Recording (encoding) shows a similar high rate of data movement due to reference frame fetching. The Motion Estimation unit responsible for this can also be moved to PIM execution.

In the video case though, with compression enabled, PIM-Core actually uses 63.4% more energy than the VP9 hardware baseline. But PIM accelerators still bring benefits: reducing energy by 75.1% during decoding, and 68.9% during encoding.

In summary

Our comprehensive experimental evaluation shows that PIM cores are sufficient to eliminate a majority of data movement, due to the computational simplicity and high memory intensity of the PIM targets. On average across all of the consumer workloads that we examine, PIM cores provide a 49.1% energy reduction (up to 59.4%) and a 44.6% performance improvement (up to 2.2x) for a state-of-the-art consumer device. We find that PIM accelerators provide larger benefits, with an average energy reduction of 55.4% (up to 73.5%) and performance improvement of 54.2% (up to 2.5x) for a state-of-the-art consumer device.

Securing wireless neurostimulators

April 17, 2018

Securing wireless neurostimulators Marin et al., CODASPY’18

There’s a lot of thought-provoking material in this paper. The subject is the security of a class of Implantable Medical Devices (IMD) called neurostimulators. These are devices implanted under the skin near the clavicle, and connected directly to the patient’s brain through several leads. They can help to relieve chronic pain symptoms and movement disorders (e.g., Parkinsons). The latest generation of such devices come with remote monitoring and reprogramming capabilities, via an external device programmer. The manufacturers seem to have relied on security through obscurity (when will we ever learn!) with the very predictable result that the interface turns out not be secure at all. So we end up with a hackable device connected directly to someone’s brain. Some of the possible attacks, and also the secure alternative mechanism that Marin et al. develop, sound like they come straight out of a sci-fi novel.

It’s not at all surprising to learn that you can do a lot of harm to a patient by sending signals to their brain. “For example, adversaries could change the settings of the neurostimulator to increase the voltage of the signals that are continuously delivered to the patient’s brain. This could prevent the patient from speaking or moving, cause irreversible damage to their brain, or even worse, be life-threatening.

What I hadn’t foreseen is the possible privacy attacks, especially when future devices support sending of the P-300 wave brain signals (which will allow more precise and effective therapies):

… adversaries could capture and analyze brain signals such as the P-300 wave, a brain response that begins 300ms after a stimulus is presented to a subject. The P-300 wave shows the brain’s ability to recognize familiar information. Martinovic et al. performed a side-channel attack on a Brain Computer Interface (BCI) while connected to several subjects, and showed that the P-300 wave can leak sensitive personal information such as passwords, PINs, whether a person is known to the subject, or even reveal emotions and thoughts.

Implantable medical devices, a sorry story

This isn’t the first time that IMDs have been shown to have poor security. Hei et al. showed a denial of service attack that can deplete an IMD’s resources and reduce battery lifetime from several years to a few weeks (replacement requires surgery). Halperin et al. showed attacks on implantable cardioverter defibrillators, and Marin et al. showed how to do the same using standard hardware without needing to be close to the patient. Li et al. demonstrated attacks against insulin pumps.

IMD devices are resource-constrained with tight power and computational constraints and a limited battery capacity. They lack input and output interfaces, and can’t be physically accessed once implanted.

IMDs need to satisfy several important requirements for proper functioning, such as reliability, availability, and safety. Adding security mechanisms into IMDs is challenging due to inherent tensions between some of these requirements and the desirable security and privacy goals.

Previous work has looked at out-of-band (OOB) channels for exchanging cryptographic keys to secure communication, including the heart-to-heart (H2H) protocol based on the inter-pulse interval. Unfortunately IPI is fairly easy to remotely monitor (even using a standard webcam), and the protocol has a large communication and computation cost.

Reverse engineering a neurostimulator

The authors build a couple of simple antennas, change neurostimulator settings using a device programmer, inspect the format of the transmitted messages.

Using this method, it was possible to reverse engineer the protocol. The devices transmit at 175 KHz using an on-off keying (OOK) modulation scheme. An example signal looks like this:

After a little bit of detective work, we arrive at this:

Communication between a device programmer and the neurostimulator happens in three phase. In the initialization phase a handshake swaps model and serial number of the device programmer and neurostimulator. After this, in the reprogramming phase the device programmer can modify the settings as many times as desired within the some protocol session. The termination phase ends a session, allowing the neurostimulator to switch back to a power saving mode.

Demonstrated attacks

  • Replay attacks: we were able to modify any of the neurotransmitter settings by intercepting and replaying past transmissions sent by legitimate device programmers.
  • Spoofing attacks: we were able to create any arbitrary message and send it to the neurostimulator. This should require knowledge of the neurostimulators serial number (which can be obtained via interception). however, the team found an easier way in that messages were accepted with an empty neurostimulator SN field! “The impact of this attack is quite large, because an adversary could create a valid message, without a SN, and reuse it for all neurostimulators.
  • Privacy attacks:

I generated this encryption key using signals from my brain…

So far we’ve seen that neurostimulators don’t offer any security. How can we do better? We need to establish a secure communication channel between the device programmer and the neurostimulator, which can be accomplished using a shared session key and symmetric encryption. There are two main challenges here: how to generate the key in first place, and how to securely transmit the generated session key to the other party.

Generating the key on the device programmer is appealing in the sense that we have more powerful hardware available to us. Unfortunately, “most OOB channels that allow to transport a key from the device programmer to the IMD, are shown to be vulnerable to security attacks or require to add extra hardware components in the neurostimulator.” So plan B is to generate the key in the neurostimulator itself.

Unfortunately, IMDs typically use microcontroller-based platforms and lack dedicated true random number generators (TRNGs), making it non-trivial to generate random numbers on these devices… Recently, the use of physiological signals extracted from the patient’s body has been proposed for generating cryptographic keys.

The brain produces a signal called the local field potential (LFP), which refers to the electrical potential in the extracellular space around neurons. Neurotransmitters can measure this signal with the existing lead configurations, but it cannot be obtained remotely – you need direct contact with the brain. So that makes it hard to attack! Here’s an example of LFP data obtained from a mouse brain:

The least significant bits of the sampled LFP data turn out to be a good initial source of a raw random number. This is then passed through a two-stage parity filter to increase entropy density. The resulting bits are relatively uniformly distributed. The estimated Shannon-entropy is around 0.91/bit.

Once we’ve generated a session key from the LFP of the patient’s brain, we need a secure way to transmit it to the programming device:

Our technique for key transportation leverages the fact that both the patient’s skin and the neurostimulator’s case are conductive. We propose to apply an electrical signal (with the key bit embedded in it) to the neurostimulator’s case such that the device programmer can measure this signal by touching the patient’s skin.

To prevent eavesdropping attacks, a low transmission power can be used. The signal can still be received with direct contact using an amplitude less than 1mV. Even at a distance of a few centimetres though, eavesdropping is not possible. This comes at the cost of a lower data rate, but that’s not an issue for this use case. In fact, having a requirement for skin contact for a few seconds (currently about 1.5) is an advantage because it’s assumed a patient would notice an adversary contacting their skin in this way. (Attacks such as hacking a smart watch worn by the patient are currently out of scope).

PrivacyGuide: towards an implementation of the EU GDPR on Internet privacy policy evaluation

April 16, 2018

PrivacyGuide: Towards an implementation of the EU GDPR on Internet privacy policy evaluation Tesfay et al., IWSPA’18

(Note: the above link takes you to the ACM Digital Library, where the paper should be accessible when accessed from the blog site. If you’re reading this via the email subscription and don’t have ACM DL access, please follow the link via my blog.)

…if a user was to read the privacy policy of every service she visits on the Internet, she would on average need 244 hours annually which is slightly more than half of the average time a user would spend on the Internet by then.

Studies have shown that only 1% or less of total users click on privacy policies, and those that do rarely actually read them. The GDPR requires clear succinct explanations and explicit consent (i.e., no burying your secrets on page 37 of a 70 page document), but that’s not the situation on the ground right now, and it’s hard to see that changing overnight on May 25th.

So we know that privacy is an important matter, and that a solution involving reading lengthy terms of service to determine the privacy implications of using a particular app/service/site is untenable. What can we do (beyond legislating for shorter ToS’s)? We could try either:

  1. Crowdsourcing interpretations of privacy policies. For example, Terms of Service: Didn’t Read (ToS:DR) is a community based project to evaluate privacy policies by crowdsourcing, and provides a browser add-on. Unfortunately it lacks coverage (68 web sites covered since its launch in 2012, and only 11 of them given a final grade), and is nearly impossible to keep up to date since companies frequently update their policies.
  2. Automating the interpretation of privacy policies to highlight the key information a prospective user really needs to know. That’s what PrivacyGuide does. It turns lengthy terms of service documents into this:

PrivacyGuide breaks down privacy policies into 11 standard categories, based on an analysis of the GDPR requirements. It presents a simple dashboard summary with green (low risk), orange (average risk) and red (high risk) indicators for each. You can drill down into the detail behind the icons to see a short description of the meaning and consequence of the risk class, and the closest original sentence or paragraph on which the assessment was performed. The idea is to present a consistent format across sites, similar to the way nutrition labels are standardised. This would be a big step forward. But even then, ‘The curious case of the pdf converter that liked Mozart’ study that we looked at last year showed that people still tend to just click-through. The dialogs developed in that study showing the privacy implications of permissions granted by a policy remains my favourite.

Understanding privacy policies

To start with, the authors conducted an extensive analysis of the GDPR with the support of legal experts. Comparing the privacy aspects of the GDPR with classification categories used in previous studies, resulted in the following set of 11 categories:


Within each category a group of privacy experts determined three risk levels. Here are examples for the third-party sharing and account deletion categories:


With this framework in hand, a privacy corpus was built using the privacy policies of the 45 most accessed websites in Europe according to the results provided by Alexa. (It’s a shame the authors don’t make this assessment available anywhere that I’m aware of…). 35 participants manually extracted on average 12,666 text passages from a single privacy policy, of which 433 ended up being assigned to a privacy category and classified with a risk level.


The next step is to train a multi-class classifier (11 privacy categories, and three risk levels per category) using this corpus. The team experimented with Naive Bayes, SVM, Decision Trees, and Random Forests: Naive Bayes turned out to work the best for this problem.

The PrivacyGuide workflow looks like this:

Content extraction is done using the Boilerpipe library. The resulting text is then split into individual sentences, and a set of keywords (obtained from the training data) used to identify informative sentences. Fragments not including any of these keywords are filtered out at this stage.

From the resulting set of sentences, stop word removal, tokenisation, and stemming are performed, and WEKA’s StringToWordVector filter is used to create sentence vectors. TF-IDF is also used to consider the most relevant words.

Using these features, classification is done in two steps.

  1. The Privacy Aspect Prediction Engine (PAPE) predicts a privacy category, and is trained using 10-fold cross-validation.
  2. Given a category, the Risk Prediction Engine (RiPE) predicts an associated risk class. In order to provide explanations to the user, the best representative sentence on which the risk classification is based is also captured.


The next 10 most accessed sites in Europe (according to Alexa) are then used for evaluation.

The table below shows the performance of the classifier on these sites.

Not perfect, but considering you weren’t going to read the policy otherwise and PrivacyGuide can produce its results in less than 2 seconds, it’s pretty good! Note the incidental discovery that none of the sites specified data breach notification plans in their policies.

End of Term

April 2, 2018

We’ve reached end of term again, and The Morning Paper will be taking a two-week break, beginning again on Monday 16th April.

I hope you enjoyed the selections from the last three months. So much great research, it’s hard to highlight just a few! But in case you missed them, here’s a small selection covering a diverse range of topics.

See you in a couple of weeks, and thanks for reading!