
Bolt: I know what you did last summer… in the cloud

May 24, 2017

Bolt: I know what you did last summer… in the cloud Delimitrou & Kozyrakis, ASPLOS’17

You get your run-of-the-mill noisy neighbours – the ones who occasionally have friends round and play music a little too loud until a little too late. And then in the UK at least you get what we used to call “Asbos,” named after the Anti-Social Behaviour Orders that were introduced to try and curb deliberate disturbances of the peace. The same thing happens on the public cloud. You have accidental interference from neighbouring processes, and then you have your cloud ASBOs – which make deliberate attacks on neighbouring processes.

… unmanaged contention leads to security and privacy vulnerabilities. This has prompted significant work on side-channel and distributed denial of service attack (DDoS), data leakage exploitations, and attacks that pinpoint target VMs in a cloud system. Most of these schemes leverage the lack of strictly enforced resource isolation between co-scheduled instances and the naming conventions cloud frameworks use for machines to extract confidential information from victim applications, such as encryption keys.

By the time you’ve glanced through the long list of references to such attacks in the related work section (we’ll look at a few attack examples later on), you’ll be quite glad you’re not trying to run a public cloud service yourself! It’s a bit of an arms race of course, with cloud providers also mounting defences. Today’s paper selection, Bolt, shows just how hard that job is becoming. When Bolt pops up as your cloud neighbour, it is able to fingerprint the resources used by your own process to determine what type of process you are running (e.g., memcached, Spark job, …). That’s invaluable information to an attacker.

Extracting this information enables a wide spectrum of previously impractical cloud attacks, including denial of service attacks (DoS) that increase tail latency by 140x, as well as resource freeing (RFA) and co-residency attacks.

Using the information that Bolt provides, attackers can mount these attacks in a way that circumvents traditional detection and prevention mechanisms. The paper concludes with a look at the isolation mechanisms available to see how well they can protect against Bolt’s process detection.

Existing isolation techniques are insufficient to mitigate security vulnerabilities, and techniques that provide reasonable security guarantees sacrifice performance or cost efficiency, through low utilization. This highlights the need for new fine-grain, and coordinated isolation techniques that guarantee security at high utilization for shared resources.

Let’s now take a look at (a) the process detection mechanism, (b) the security attacks that can take advantage of it, and (c) how well existing isolation mechanisms can defend against them. Ready?

I know what you’re doing in the cloud

Let’s start with the simple case of one Bolt detection process co-scheduled with a victim process. Bolt attempts to figure out how much the victim process depends on different types of shared resources, resulting in a fingerprint of shared resource sensitivity for the victim. It does this by running a series of micro-benchmarks, each targeting a given shared resource (L1 instruction and L1 data cache, L2 and last level cache, memory capacity and memory bandwidth, CPU, network bandwidth, disk capacity and disk bandwidth).

Each microbenchmark progressively increases its intensity from 0 to 100% until it detects pressure from the co-scheduled workload, i.e., until the microbenchmark’s performance is worse than its expected value when running in isolation. The intensity of the microbenchmark at that point captures the pressure the victim puts on that shared resource.

Bolt runs microbenchmarks for 2-5 seconds, with one core benchmark and one ‘uncore’ benchmark (zero pressure will be detected in the core resource if it is not shared). This is long enough to gather all the data Bolt needs.
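Concretely, the ramp-up-until-interference idea might look like the following sketch (the helper run_microbenchmark, the isolated_perf baseline table, and the 5% tolerance are illustrative assumptions, not Bolt’s implementation):

```python
# Hedged sketch of Bolt's pressure-detection idea (not the authors' code).
# run_microbenchmark(resource, intensity) is assumed to return a performance
# score, and isolated_perf[resource][intensity] the expected score when the
# microbenchmark runs alone.

def detect_pressure(resource, isolated_perf, run_microbenchmark,
                    tolerance=0.05, step=5):
    """Ramp microbenchmark intensity until contention from the co-scheduled
    victim makes performance noticeably worse than the isolated baseline.
    Returns the intensity at which pressure was first detected (the victim's
    estimated pressure on this resource), or 0 if none was detected."""
    for intensity in range(step, 101, step):
        observed = run_microbenchmark(resource, intensity)
        expected = isolated_perf[resource][intensity]
        # Performance is "worse" when it falls below the isolated baseline
        # by more than the tolerance margin.
        if observed < expected * (1.0 - tolerance):
            return intensity
    return 0

# Running this once per shared resource (L1i, L1d, L2, LLC, memory, ...)
# yields one fingerprint entry per resource.
```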


At this point we’ve essentially got what looks like a classification problem, mapping resource fingerprint to workload type. At the core of this process Bolt uses SVD and Pearson correlation to figure out how similar the victim process resource profile is to workloads that have been seen before. The analysis also reveals information on the resources the victim is most sensitive to, which is useful in crafting attacks.
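A rough numpy sketch of that matching idea (truncated SVD for a denoised view of known profiles, Pearson correlation to rank similarity; the matrix A, vector v, and parameters are illustrative assumptions, and Bolt’s actual recommender also deals with sparse, partially observed profiles):

```python
# Hedged sketch of the fingerprint-matching step (not the authors' code).
import numpy as np
from scipy.stats import pearsonr

def most_similar_workloads(A, v, rank=3, top_k=5):
    """A: known fingerprints (workloads x resources); v: victim fingerprint.
    Build a rank-k reconstruction of the known profiles via SVD, then rank
    them by Pearson correlation with the victim's profile."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # low-rank reconstruction
    scores = [pearsonr(row, v)[0] for row in A_hat]
    order = np.argsort(scores)[::-1]
    return [(int(i), float(scores[i])) for i in order[:top_k]]
```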

Here you can see, for example, that high LLC pressure and very high L1-instruction cache pressure indicate a high probability that the victim process is memcached.

With multiple co-residents, Bolt needs to disentangle the contention caused by several jobs. If one of the co-resident applications shares a core, we can exploit that. Otherwise Bolt runs a shutter profiling mode using frequent brief profiling phases in uncore resources. Say there are two co-resident victim processes. If Bolt manages to capture a moment when one of them is at low load, the resource pressure mechanism will enable effective detection of the other (highly loaded) process.

How well does the detection process work?

The first evaluation is conducted on a 40-node cluster of 8-core servers and a total of 108 different workloads. The training set comprises 120 diverse applications selected to provide sufficient coverage of the space of resource characteristics (everything from web servers to analytics and databases).

Here we see Bolt’s detection accuracy under a standard ‘least loaded’ scheduling algorithm, and also under Quasar, which also uses job fingerprinting to deliberately try to place jobs where they will see the least interference.

Note that, although in Table 1 we group applications by programming framework or online service, each framework does not correspond to a single resource profile. Profiles of applications within a framework can vary greatly depending on functionality, complexity, and dataset features. Bolt’s recommender system matches resource profiles to specific algorithmic classes and dataset characteristics within each framework.

Here’s an example of two known Hadoop jobs (a recommender, and word count) being compared to a just captured fingerprint of a new unknown process:

When there is only one co-resident, detection accuracy exceeds 95%. Accuracy drops with more co-residents, reaching 67% with 5 co-scheduled applications.

A second evaluation took place on EC2 in a 200-node cluster of c3.8xlarge instances. 20 independent users were asked to submit several applications of their choosing, and Bolt had to figure out what they were. A total of 436 jobs were submitted to the cluster, 277 of them were correctly labelled. Bolt could not of course assign labels to application types it had not seen before (such as email clients), but it still managed to gather resource profiles for these.

Security attacks

The authors discuss how Bolt can help with three classes of attacks or attack steps, internal DoS attacks, resource freeing attacks, and VM co-residency detection.

Internal DoS

Internal DoS attacks take advantage of IaaS and PaaS cloud multi-tenancy to launch adversarial programs on the same host as the victim and degrade its performance.

For these attacks, the information provided by Bolt is used to create custom contentious workloads, combining the microbenchmarks that target the resources the victim is most sensitive to. Using this approach against the 108 applications in the first evaluation, performance of the victim was degraded by 2.2x on average and up to 9.8x overall.

Degradation is much more pronounced for interactive workloads like key-value stores, with tail latency increasing by 8-140x, a dramatic impact for applications with strict tail latency SLAs.

Typical detection of such attacks relies on detecting resource saturation. But Bolt’s finely-tuned mechanism is able to keep the utilization low enough to evade detection while still negatively impacting the victim.

Resource Freeing Attack

The goal of a resource freeing attack (RFA) is to modify the workload of a victim such that it frees up resources for the adversary, improving its performance.

Bolt adds load on the victim’s critical resources, which causes it to stall, which in turn lessens its pressure on other resources. These resources are then gobbled up by the adversary.

Here are the results from Bolt-powered RFA attacks on a selection of victim apps:

(We don’t really care about the uplift for the beneficiary app (mcf in this case); it’s the performance degradation column that matters.)

Co-residency detection

… a malicious user is rarely interested in a random service running on a public cloud. More often, they target a specific workload, e.g., a competitive e-commerce site.

This requires an efficient means of finding the victims. Good (bad?) news! Bolt can help with that. First you launch a bunch of Bolt processes looking for co-residents of the appropriate type. Having found a candidate, apply appropriate resource contention and at the same time make an external request to the server from a collaborating ‘receiver’. If the external receiver sees a slower response than normal, there’s a good chance you’ve found your target. The paper demonstrates the attack in a 40-node cluster using 10 probes which detect 3 VMs running SQL.

It then introduces memory interference, as the receiver initiates a large number of SQL queries. While these queries have 8.16msec mean latency without interference, latency now becomes 26.14msec. Given a ∼3× latency increase, we can conclude that one of the co-residents is the victim SQL server. Detection required 6sec from instantiation to receiver detection, and 11 adversary VMs.

Better isolation to the rescue?

The authors evaluate baremetal, containerized, and virtualized environments and apply or simulate thread pinning, network isolation, DRAM bandwidth isolation, and last-level cache (LLC) isolation. One isolation mechanism at a time is added, which reveals their contribution to decreasing detection accuracy.

Unfortunately, even when all techniques are on, accuracy is still 50% due to two reasons: current techniques are not fine-grain enough to allow strict and scalable isolation and core resources (L1/L2 caches, CPU) are prone to interference due to contending hyperthreads.

Ensuring hyperthreads of different jobs are never scheduled on the same core does help (grey bars in the figure above), but adds approximately a 34% performance overhead. Over-provisioning to compensate leads to a 45% drop in utilisation.

The last word

We hope that this study will motivate public cloud providers to introduce stricter isolation solutions in their platforms and systems architects to develop fine-grained isolation techniques that provide strong isolation and performance predictability at high utilization.

Determining application-specific peak power and energy requirements for ultra-low power processors

May 23, 2017

Determining application-specific peak power and energy requirements for ultra-low power processors Cherupalli et al., ASPLOS’17

We’re straying a little bit out of The Morning Paper comfort zone again this morning to look at one of the key hardware issues affecting the design of IoT devices: how much energy they use, and the related question of their peak power requirements. Why is this interesting? Firstly, there are way more ultra-low power (ULP) based devices out there than you might imagine. And secondly, since we can’t accurately gauge the energy and peak-power requirements, we tend to over provision, leading to larger, heavier systems than we could otherwise build. Cherupalli et al. demonstrate a technique that gives much more accurate estimates of peak power and energy requirements, with tighter bounds. By understanding these requirements in detail, the results can also be used to guide optimisations that reduce power usage. We’ll get to all that soon, but first we need to dig a little deeper into the world of ULP devices.

The world of ultra-low power processors

What’s the most widely deployed type of processor in the world? It’s not the processors that power our PCs and laptops of course, nor is it even the processors you’ll find in mobile phones.

Ultra-low power (ULP) processors have rapidly become the most abundant type of processor in production today.

This is driven by new and emerging power- and energy- constrained applications such as IoT, wearables, implantables (I didn’t know that was a word!), and sensor networks. ULPs are projected to continue being the most widely deployed type of processor in the future.

… applications will continue to rely on simple single-core ultra-low power processors, powered by batteries and energy harvesting, and will have even tighter peak power and energy constraints than the ULP systems of today.

In the world of ULPs, power is everything. There are three basic types of systems depending on the source(s) of power:

  1. Type 1 systems are powered directly by energy harvesting (e.g., a solar cell)
  2. Type 2 systems are powered by a battery, which in turn is charged via energy harvesting
  3. Type 3 systems just have a battery

It’s the size of the energy harvesting and/or energy storage (battery) components that ultimately determines the form factor, size, and weight of a ULP-based device. For example:

Going one step further, since the energy harvesting and storage requirements of a ULP system are determined by its power and energy requirements, the peak power and energy requirements of a ULP system are the primary factors that determine critical system characteristics such as size, weight, cost, and lifetime

Peak power and energy consumption impact harvester and battery size calculations in different ways depending on the type of device:

Current approaches for determining peak power and energy

There are three basic approaches for determining the peak power and energy requirements of a ULP processor for a given application.

  1. For a very conservative upper bound, you can just look at the data sheets for the processor, which tell you the peak power that can be consumed by the hardware.
  2. You can run a stressmark: an application that attempts to activate the hardware in a way that maximises peak power or energy. “A stressmark may be less conservative than a design specification, since it may not be possible for an application to exercise all parts of the hardware at once.”
  3. You can perform application profiling on the processor by measuring power consumption while running the target application on the hardware.

A bit more work, but #3 sounds like an obviously good idea. So what’s the catch?

… since profiling is performed with specific input sets under specific operating conditions, peak power or energy bounds determined by profiling might be exceeded during operation if application inputs or system operating conditions are different than profiling.

Things tend not to work out well if a processor does try to operate outside of the peak power and energy bounds available to it, so best practice is to add a guardband (power buffer) to any profiling-based results. A typical guardbanding factor might be 33%. In other words, take the profiling results, add 33%, and provision for that.

Input-independent peak power and energy profiling

Since the peak power and energy requirements of an application can vary based on application inputs, a technique that determines application-specific peak power requirements must bound peak power for all possible inputs. Exhaustive profiling for all possible inputs is not possible for most applications, so we have created a novel approach for activity analysis that uses unknown logic values (Xs) for inputs to efficiently characterize activity for all possible inputs with minimum simulation effort.

The core idea is to run a symbolic simulation of an application binary on ‘the gate-level netlist of a processor.’ (What is a netlist?). Most ULP systems tend to have simple processors and applications, making simulation feasible. For example, even the most complex benchmark analysed in the paper completed full simulation in 2 hours.

During the simulation, a special ‘X’ value is propagated for all signal values that can’t be constrained based on the application. We start with all gates and memory locations not explicitly loaded with the binary set to X. Any input values during simulation are also set to X.

As simulation progresses, the simulator dynamically constructs an execution tree describing all possible execution paths through the application. If an X symbol propagates to the inputs of the program counter (PC) during simulation, indicating an input-dependent control sequence, a branch is created in the execution tree. Normally, the simulator pushes the state corresponding to one execution path onto a stack for later analysis and continues down the other path.

At the end of this process we know the activity of each gate at each point in the execution tree. A gate that is not marked as toggled (0 to 1, or 1 to 0) at a particular location in the tree can never be toggled at that location in the application. We can use this knowledge encoded in the execution tree to generate peak power requirements as follows:

  1. Concatenate all of the execution paths into a single execution trace.
  2. Assign all of the Xs in the trace in such a way that power for each cycle is maximised. Power is maximised when a gate toggles, but a transition requires two cycles: one to prepare and one to make the transition. Since we don’t know the best way to align transitions with cycles, two separate value change dump (VCD) files are produced: one that maximises power in all even cycles, and one that maximises power in odd cycles.
  3. Combine the even and odd power traces into a single peak power trace by taking power values from even cycles in the even trace, and odd cycles in the odd trace.
  4. The peak power requirement of the application is the maximum power cycle value found in the peak power trace.
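Steps 3 and 4 are mechanical enough to sketch directly (a hedged illustration, with lists of per-cycle power values as an assumed representation of the traces):

```python
# Hedged sketch of merging the even- and odd-cycle-maximized power traces
# into a single peak power trace (names are illustrative, not the authors'
# implementation).

def peak_power_trace(even_trace, odd_trace):
    """even_trace[i] maximizes power in even cycles, odd_trace[i] in odd
    cycles; take each cycle's value from the trace that maximized it."""
    assert len(even_trace) == len(odd_trace)
    return [even_trace[i] if i % 2 == 0 else odd_trace[i]
            for i in range(len(even_trace))]

def peak_power(even_trace, odd_trace):
    # Step 4: the application's peak power requirement is the maximum
    # per-cycle value in the combined trace.
    return max(peak_power_trace(even_trace, odd_trace))
```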

The peak power trace can be used to generate peak energy requirements.

… the peak energy of an application is bounded by the execution path with the highest sum of per-cycle peak power multiplied by the clock period.

When making this calculation, for input-dependent branches the algorithm always takes the most expensive one. For loops where the maximum number of iterations can be determined, it simply takes the energy for one iteration and multiplies it by that maximum. “If neither is possible, it may not be possible to compute the peak energy of the application; however, this is uncommon in embedded applications.”
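A hedged sketch of that bound over an execution tree (the dictionary-based node representation is an assumption for illustration, not the paper’s data structure):

```python
# Hedged sketch of the peak-energy bound. Each node carries the per-cycle
# peak power values of its straight-line segment; "branches" cover
# input-dependent control flow; a loop node carries its maximum iteration
# count.

def peak_energy(node, clock_period):
    """Peak energy is bounded by the most expensive root-to-leaf path:
    the sum of per-cycle peak power times the clock period, taking the
    costlier side of every input-dependent branch and multiplying loop
    bodies by their maximum iteration count."""
    segment = sum(node["cycle_peak_power"]) * clock_period
    segment *= node.get("max_iterations", 1)   # loop bound, if known
    children = node.get("branches", [])
    if not children:
        return segment
    return segment + max(peak_energy(c, clock_period) for c in children)
```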

Results

The chart below shows the peak power requirements determined using the above process as compared to the conventional techniques for determining peak power.

By accounting for all possible inputs using symbolic simulation, our technique can bound peak power and energy for all possible application executions without guardbanding. The peak power requirements reported by our technique are 15% lower than guardbanded application-specific requirements, 26% lower than guardbanded stressmark-based requirements, and 27% lower than design specification-based requirements, on average.

Here are the results of the peak energy calculations:

… the peak energy requirements reported by our technique are 17% lower than guardbanded application-specific requirements, 26% lower than guardbanded stressmark-based requirements, and 47% lower than design specification-based requirements, on average.

The following tables show how this would translate into harvester area and battery volume reductions.

Optimising peak power guided by profile results

The technique can also be used to guide application-based optimisation by analysing the processor’s behaviour during the cycles of peak power consumption. Three different optimisations can then be applied as appropriate:

  • Replace a complex instruction that induces a lot of activity in one cycle with a sequence of simpler instructions, thus spreading out the activity over several cycles.
  • Delay the activation of one or more modules, previously activated in a peak cycle, until a later cycle.
  • Avoid the multiplier being active simultaneously with the processor core by inserting a NOP into the pipeline during the cycle in which the multiplier is active.

Taking combined figures across the benchmarks in the paper, these produced peak power reductions of up to 10% (5% on average), with up to 34% (18% on average) reduction in peak power dynamic range.

Typed Architectures: architectural support for lightweight scripting

May 22, 2017

Typed Architectures: architectural support for lightweight scripting Kim et al., ASPLOS’17

JavaScript, Python, Ruby, Lua, and related dynamically typed scripting languages are increasingly popular for developing IoT applications. For example, the Raspberry Pi is associated with Python; Arduino and Intel’s Galileo and Edison are associated with JavaScript. In these constrained hardware environments though, using JITs is often not viable, so scripting engines employ interpreter-based VMs without JITs. Thus, for production-grade applications, scripting languages may be too slow.

Dynamic types are one of the major sources of inefficiency for scripting languages… each variable must carry a type tag, and a type guard must be executed before any operation that can be overloaded. This significantly increases dynamic instruction count, memory footprint, and hence energy consumption, compared to statically typed languages.

A recent study showed that around 25% of total execution time in the V8 JavaScript engine is spent on dynamic type checking. Back in the 1980s, LISP machines such as the Symbolics 3600 provided hardware support for runtime type checking. Could we do something similar for dynamically typed languages today?

It turns out that yes we can. Kim et al. introduce Typed Architectures, “a high-efficiency, low-cost execution substrate for dynamic scripting languages.”

Typed Architectures calculate and check the dynamic type of each variable implicitly in hardware, rather than explicitly in software, hence significantly reducing instruction count for dynamic type checking.

An FPGA-based evaluation showed geomean speedups of 11.2% and 9.9% (max speedups 32.6% and 43.5%) for production grade JavaScript and Lua scripting engines respectively. And all this with a 16%-19% improvement in energy-delay product.

Which all goes to prove you can optimise all you like in software, but at the end of the day it’s hard to beat hardware! What’s especially neat about Typed Architectures is that the authors seem to have found a nice sweet spot whereby the hardware support is general enough to efficiently support a wide range of dynamically typed languages.

The overhead of dynamic typing

With dynamic typing, values are tagged with their type, and these type tags are checked when dispatching operations (e.g., a polymorphic ADD operation must be dispatched to type-specific implementations).

The overhead of dynamic type checking is known to be significant.

Here’s a breakdown of dynamic bytecodes in 11 Lua scripts. Lua has 47 distinct bytecodes, but less than 10 dominate the total count.

If we drill into the five most popular bytecodes, we can see the number of dynamic instructions per bytecode for each of these:

These five bytecodes are all polymorphic and require type guards to select the right function depending on the operand types.

Dynamic typing overhead can be broken down into three components: tag insertion (adding tags to values), tag extraction (getting tags for values), and tag checking. The following example of a polymorphic add function in Lua (with corresponding Lua bytecode and RISC-V assembly code) makes the point pretty well.


Typed Architectures – ISA extensions

The three primary design objectives for Typed Architectures are:

  • High performance – significant speedups by offloading type checking operations to hardware
  • Flexibility – the ISA extension should be flexible enough to support multiple production grade scripting engines
  • Low cost – in terms of area and power

The baseline RISC ISA is extended with a unified register file, tagged ALU instructions, and tagged memory instructions.

The register file gains two new fields (in addition to the existing register value field): an 8-bit type tag (capable therefore of representing 256 distinct types), and a 1-bit flag indicating whether the value is of an integer or floating-point subtype.

There are three new tagged ALU instructions, xadd, xsub, and xmul which perform type checking in parallel with value calculation. A Handler register is used to handle type mispredictions.

When a tagged ALU instruction is executed, Typed Architecture looks up in a Type Rule Table with the two source type tags and the instruction’s op-code as the key. If it hits, the pipeline executes normally to write back the output type tag retrieved from the Type Rule Table together with the output value to the destination register. If not, a type misprediction has happened, and the PC is redirected to the slow path pointed to by the handler register to go through the original software-based type checking.

The Type Rule Table itself is pre-loaded at program launch.
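In software terms, the semantics of a tagged ALU instruction such as xadd might be sketched like this (a hedged model of the mechanism, not the hardware or the authors’ code; the table contents are illustrative):

```python
# Hedged software model of a tagged ALU instruction such as xadd.
# Registers hold (value, type_tag) pairs; the Type Rule Table maps
# (opcode, src1_tag, src2_tag) -> output tag and is pre-loaded at launch.

TYPE_RULE_TABLE = {
    ("xadd", "int", "int"): "int",
    ("xadd", "float", "float"): "float",
    # Missing entries (e.g. int + string) cause a type misprediction.
}

def xadd(src1, src2, slow_path_handler):
    val1, tag1 = src1
    val2, tag2 = src2
    out_tag = TYPE_RULE_TABLE.get(("xadd", tag1, tag2))
    if out_tag is None:
        # Type misprediction: redirect to the original software-based
        # type-checking slow path (pointed to by the handler register).
        return slow_path_handler(src1, src2)
    # Hit: the value calculation proceeds alongside the tag lookup, and the
    # output tag is written back with the output value.
    return (val1 + val2, out_tag)
```

A miss simply falls back to the scripting engine’s existing guard code, which is why correctness is unaffected on the slow path.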

There are two new instructions for memory operations: tld (tagged load) and tsd (tagged store). These load / store a requested value together with its type tag and integer/FP flag bit in a single operation.

A type tag is extracted from an adjacent 64-bit double-word (or the same double-word as the value) by applying shift-and-mask.

Three new registers control the shift-and-mask process: an offset register indicates which double-word the tag will be extracted from, a shift register encodes the starting point of the type field within the double-word, and the mask register holds an 8-bit mask to extract a type tag of the same width.
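A hedged sketch of that extraction (purely illustrative; the parameter names mirror the registers described above):

```python
# Hedged sketch of the shift-and-mask tag extraction used by tld/tsd.
# The offset register picks which 64-bit double-word holds the tag (relative
# to the value), the shift register gives the tag's starting bit, and the
# mask register holds an 8-bit mask.

def extract_tag(memory_dwords, value_index, offset, shift, mask=0xFF):
    """Extract an 8-bit type tag from the double-word at value_index + offset,
    e.g. offset=0 when the tag lives in the same double-word as the value."""
    dword = memory_dwords[value_index + offset]
    return (dword >> shift) & mask
```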

There are four additional instructions we’ve not yet mentioned:

  • thdl sets the value of the misprediction handler register
  • tchk performs type checking without value calculation
  • tget reads the type of a register
  • tset writes the type of a register

Using our new super powers, here’s how the original ADD bytecode from above is transformed:

Here’s the full summary of the extended ISA:


Typed Architectures – Pipeline

The extended ISA is implemented with a pipeline structure that looks as follows:

We add a unified register file, a Type Rule Table, and a tag extract/insert logic to the baseline (shaded in gray).

The execution of, say xadd, differs from a static add instruction in three ways:

  1. It selects the calculation path (integer ALU or FP ALU) at the decode stage
  2. It accesses the Type Rule Table for type checking in hardware. If it hits, the output tag is propagated to the writeback stage and finally to the type tag of the destination register. If not, it goes through the type misprediction path.
  3. It may suffer type mispredictions. The misprediction handler is simply the original code with software-based type checking.

Typed Architectures – Memory access

The data layouts for storing tag-value pairs differ across languages and implementations. The three special purpose registers controlling the shift-and-mask operation can be customised per language to obtain the desired behaviour. As an example, here are the register settings for Lua and SpiderMonkey (the Firefox JavaScript engine).

Implementation and evaluation

For both Lua and SpiderMonkey, bytecode profiling is used to identify the top five hot bytecodes which execute type guards. These are then retargeted to Typed Architecture.

Our model is based on open-source 64-bit RISC-V v2 Rocket core with the default RISC-V/Newlib target. We have integrated custom performance counters for performance analysis, such as I-cache miss rate, branch misprediction rate, and so on. It is a fully synthesizable RTL model written in Chisel language. This model is compiled into Verilog RTL, and then synthesized for FPGA emulation and area/power estimation. We use Xilinx ZC706 FPGAs for instruction and cycle counts. Table 6 (below) summarizes the parameters used for evaluation.

Here we see the overall speedups obtained for Lua and JavaScript across a range of benchmarks:

Typed Architecture achieves geomean speedups of 9.9% and 11.2% for Lua and SpiderMonkey, respectively, with maximum speedups of 43.5% and 32.6%.

A key source of performance improvement is the reduction in dynamic instruction count.

Typed Architecture also reduces resource pressure on the branch predictor, instruction cache, registers, and so on. “This also brings significant benefits to some benchmarks.”

The total area and power of the Rocket core augmented with Typed Architecture are increased by 1.6% and 3.7% respectively. Combined with the speedups, the EDP (energy-delay product) is improved by 16.5% for Lua and by 19.3% for JavaScript.

Who controls the Internet? Analyzing global threats using property traversal graphs

May 19, 2017

Who controls the Internet? Analyzing global threats using property traversal graphs Simeonovski et al., WWW’17

Who controls the Internet? How much influence do they have? And what would happen if one of those parties launched an attack or was compromised and used to launch an attack? Previous works have looked at the individual core services, but this paper focuses on their inter-dependencies along attack propagation paths.

An increasing number of reports and studies are showing that a limited number of players have an important influence on the overall security of the Internet infrastructure… we have a rather limited capability to assess the impact of attacks against, or performed by, core service providers.

What kind of attacks are we talking about? Three large-scale security incidents form the initial motivation:

  1. The Great Cannon DDoS Attack of March 16th 2015, a massive DDoS attack caused by malicious JavaScript code injected into TCP connections crossing Chinese network borders. The injected code aggressively requested resources from the DDoS targets.
  2. The PRISM program (2013), an NSA surveillance program with direct access to Internet communications and stored information. “While the direct involvement of popular tech providers is still unclear, in this paper we make the assumption that establishing this type of collaboration is possible and can be voluntary, or coerced by authorities by means of law and court orders.”
  3. The DDoS attack against Dyn.com of October 21st 2016. The attack caused Dyn.com customers including Amazon, Netflix, Twitter, Reddit, and Spotify to experience outages on name resolution, affecting hundreds of millions of Internet users who could not access their services.

Four different attack vectors are analysed: email sniffing, redirection via malicious domain resolution, in-path content injection, and hosting malicious content.

Gathering information

The authors crawl the web starting from the top 100K Alexa domains, expanding to server and network information, and then adding in organisations and countries. This ultimately leads to a labeled graph containing 1.8M nodes, of which 350K are unique IP addresses. The nodes are connected by 4.7M relationships.

The following table shows labels (think node and edge types) in the graph:

When considering the impact of an attack, nodes can be marked at one of three different compromise levels: compromised, partially compromised, and non-compromised. Taint-style propagation rules can then be written which capture how attacks can spread through the network. For example, if a node n is compromised and there is an edge from n to m labeled as A (name lookup), then m is marked as compromised.
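A minimal sketch of such a propagation rule over the labelled graph (the adjacency representation and rule set are illustrative assumptions, not the authors’ implementation):

```python
# Hedged sketch of taint-style compromise propagation. The graph maps each
# node to a list of (edge_label, neighbour) pairs, e.g. label "A" for a
# name-lookup edge.
from collections import deque

def propagate(graph, initially_compromised, rules=frozenset({"A"})):
    """Mark every node reachable from a compromised node over an edge whose
    label is covered by a propagation rule."""
    compromised = set(initially_compromised)
    frontier = deque(initially_compromised)
    while frontier:
        n = frontier.popleft()
        for label, m in graph.get(n, []):
            if label in rules and m not in compromised:
                compromised.add(m)
                frontier.append(m)
    return compromised
```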

Identifying the most promising attack targets

Before assessing attacks, we use our model to select entities that can be either attack victims or the attackers. The selection criteria are based on metrics that reflect the popularity and the influence of entities.

The most promising attack targets (or viewed from another perspective, the entities with the most power over Internet infrastructure) are identified via six metrics.

Who hosts the most Alexa domains?

The analysis is done by country (giving a lens into the power of nation-state attackers), and also by ‘Autonomous Systems’ (AS) – a collection of IP networks and routers under the control of a given network operator.

Under this metric, these are the most powerful countries:

And these are the most powerful network operators:

Who has the most JavaScript hosting servers?

By country:

And by network operator:

Who hosts the most email servers?

By country:

And by network operator:

Who hosts the most name servers?

By country:

And by network operator:

Who has the most power over JavaScript providers?

This metric measures the number of JS hosting servers whose authoritative name server is hosted in a given country or by a given network operator.

By country:

And by network operator:

Who controls the most email server name servers?

The number of domains of email servers hosted by a given country or network operator.

By country:

And by network operator:

Evaluating the impact of potential attacks

Now we’re in a position to evaluate the potential impact of three different attacks: distribution of malicious JavaScript content, email sniffing, and a DDoS attack against a core service provider. In each case a target can be selected by consulting the tables above.

Distributing malicious JavaScript content

The authors consider three ways to do this: – directly compromising (or colluding with) web servers hosting JS code; injecting malicious JavaScript when JS libraries are accessed over unprotected connections (HTTP instead of HTTPS); and redirecting requests for JS content via compromised name resolution.

Here we see the number of Alexa domains that can be reached via the first two of these:

The attack results show that countries can be very powerful attackers. For example, the United States hosts 47K JS hosting providers, which could distribute malicious code to about 16% of the top 100K Alexa domains. However, ASes are also very powerful and affect a fraction of websites that is even larger than that of individual countries, and even groups of countries. For example, the AS of Google can affect about 9% of Alexa domains.

When we look at JS inclusion over unprotected connections, 1,079 of them cross the Chinese network borders, but the United States, the Netherlands, Russia, Germany, and Japan all have even greater influence.

In malicious name resolution redirection the authoritative name server of a domain hosting JS redirects users to a malicious server. The attack result is the number of websites including a resource hosted on a server whose name server is colluding or compromised.

The United States, Google, and DynDNS stand out here.

Email sniffing

To acquire a large number of emails, an attacker can rely on various techniques. In this paper we consider two. The first one is by acquiring them directly from the email server. The second one is by redirecting an email client toward a malicious mail server, which will accept the email, keep a copy, and forward it to the intended recipient. This attack can be performed by a provider or by a country. Tables 3(c) and 3(d) show the attack results. All values are the number of Alexa domains that will be affected by this attack grouped by technique and attacker.

Email sniffing by a malicious email provider:

The United States alone can acquire emails for 25% of the most popular websites!

Malicious name resolution for email sniffing:

Note how Google has much less power in this attack vector – most websites that use Google’s email servers do so via name servers which are not hosted by Google.

DDoS against a core service provider

What happens if a service provider is the victim of an attack and is made unavailable? The data we already have can be used to figure this out. For example, consider the Dyn.com DoS attack from October 2016. DynDNS does not host a relevant number of mail servers and JS hosting providers, but it does host 364 domain servers.

These name servers are authoritative for 3,570 domains hosting JS that provide JS to 5,559 top 100K Alexa domains (not shown in Table 3), of which 4,331 are unprotected JS inclusion. Furthermore, the name servers hosted by DynDNS are authoritative for 1,523 domains running mail servers which are used by 1,178 top Alexa domains. If the Dyn.com DNS infrastructure is attacked, then a fraction that ranges from 1 to 5% of the top 100K Alexa domains would be affected.

So who controls the Internet?

Our results show that already just a few players may have an extensive power: 14 countries and 14 autonomous systems can, directly or indirectly, affect the security of about 23% of websites… In addition, our results show that little has been learned from past attacks. For example, 70% of JavaScript (JS) inclusion is still done over unprotected connections, i.e., via HTTP URLs, which can be used to mount the Great Cannon attack.

BOAT: Building auto-tuners with structured Bayesian optimization

May 18, 2017

BOAT: Building auto-tuners with structured Bayesian optimization Dalibard et al., WWW’17

Due to their complexity, modern systems expose many configuration parameters which users must tune to maximize performance… From the number of machines used in a distributed application, to low-level parameters such as compiler flags, managing configurations has become one of the main challenges faced by users of modern systems.

The time and expertise required to optimise configuration settings simply doesn’t exist in many DevOps teams focused on rapidly delivering application functionality. That’s a shame, because optimising the configuration (without changing a single line of application code) can have way bigger benefits than I ever would have imagined before I first started looking into this space. Today’s paper choice, BOAT, provides further evidence of that, reducing 99%-ile latency from 19ms to 7ms for a Cassandra workload with an optimised setting vs the defaults. In addition, two hours spent tuning the configuration of a TensorFlow computation delivered at least a 2x speed-up in the median time per SGD iteration.

I also know this because I work with a company called Skipjaq. I should declare my interest here: Accel is an investor, I’m a personal investor and board member, and the (highly technical!) CEO is a long-standing friend of mine. So I probably couldn’t be more biased if I tried! Like BOAT (and CherryPick that we also looked at recently), Skipjaq use Bayesian Optimisation to auto-tune applications. As a recent example, after putting the Skipjaq-optimised configuration settings into production, one company observed a 50% reduction in 95%-ile page load times. And as many studies have shown, page load times have an inversely proportional relationship to revenue. I.e., the faster your page loads, the better the engagement and conversion rates.

For another data point, you might want to check out Ramki Ramakrishna from Twitter’s talk on “Continuous optimization of microservices using machine learning.” As Ramki concludes, “continuous automated optimization of microservices is possible, even inevitable.”

Hopefully you’re now feeling sufficiently motivated to learn a little more about how BOAT, the BespOke Auto-Tuner works!

Challenges in auto-tuning

… we consider the task of automatically tuning configuration parameters associated with system performance. This can include any parameter for which the optimal value is dependent on the hardware or software context, ranging from low level parameters such as compiler flags or configuration files, to higher level ones like the number of machines used in a computation.

Two primary challenges make this difficult:

  1. Each performance run is expensive, which means that evolutionary or hill-climbing approaches, often requiring thousands of evaluations, cannot be applied. One answer to this, as we saw with CherryPick, is to use Bayesian optimisation.
  2. The configuration space of most real world problems is too complex. “Bayesian optimization tends to require fewer iterations to converge than other optimisers, but fails to work in problems with many dimensions (more than 10).”

In the neural network case study included in the paper, BOAT is applied to a problem with over 30 dimensions. (Skipjaq have also had success using an approach based on Bayesian optimisation, with around 30 dimensions).

Structured Bayesian Optimisation

We covered the basic idea of Bayesian optimisation in the CherryPick write-up a couple of weeks ago, so I’ll just give the briefest of recaps here. Bayesian optimisation incrementally builds a probabilistic model (typically a Gaussian process) of the objective function being minimised. In each iteration it finds a candidate point in the exploration space to explore using an acquisition function (for example, based on expected improvement), and then takes a measurement of the objective function at that point (i.e., runs a performance test in this case). The probabilistic model is then updated using this new measurement.
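For readers who like to see the loop written out, here’s a minimal, generic sketch (not BOAT’s or CherryPick’s code) using scikit-learn’s Gaussian process and an expected-improvement acquisition function; the candidate configurations and the objective (a performance test) are assumed inputs:

```python
# Hedged sketch of a vanilla Bayesian optimization loop (minimization).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_candidates, y_best):
    mu, sigma = gp.predict(X_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma            # improvement relative to best so far
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, candidates, n_init=3, n_iter=10, seed=0):
    """candidates: 2D array of configuration points to choose from."""
    rng = np.random.default_rng(seed)
    X = list(candidates[rng.choice(len(candidates), n_init, replace=False)])
    y = [objective(x) for x in X]        # e.g. run a performance test
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), y)
        ei = expected_improvement(gp, candidates, min(y))
        x_next = candidates[int(np.argmax(ei))]
        X.append(x_next)
        y.append(objective(x_next))      # measure, then update the model
    return X[int(np.argmin(y))], min(y)
```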

A key failure mode for Bayesian optimisation is when the probabilistic model fails to accurately capture the objective function landscape after a reasonable number of iterations. This can happen when there are too many dimensions in the model.

Enter structured Bayesian optimisation. The central idea is to improve the accuracy of the probabilistic model (as used by the acquisition function) by exploiting some basic knowledge of the problem at hand instead of starting with a blank slate Gaussian process.

Structured Bayesian optimization (SBO) extends the Bayesian optimization methodology to take advantage of the known structure of the objective function. It uses a structured probabilistic model, provided by the developer, instead of a simple Gaussian process. In BOAT, those models are implemented in our probabilistic programming library.

The SBO process looks just like the standard Bayesian optimisation process, but with the probabilistic program used to build the underlying model.

Using a bespoke model captures the user’s understanding of the behaviour of the system, which can drastically reduce the number of iterations needed for the model to converge towards the true objective function. It also allows for monitoring of runtime properties that are reflected in the model, so that they can be used for inference.

An example should help at this point. In the GC case study, the goal is to tune the JVM garbage collector for a Cassandra workload. This optimisation considers three parameters: young generation size, survivor ratio, and max tenuring threshold. The model that the team built included the rate and average duration statistics for minor collections, predicting both of these values given the configuration. It then predicted latency as a function of the configuration values, rate, and average duration.

When using the model in BOAT, we collected the true value of these statistics from the GC logs after each evaluation and used them for inference. Further, we declared how we believed each of the three model components behaved as a function of their inputs. For example, we noticed the rate of minor collections was inversely proportional to the size of the eden memory region in the JVM-heap, which is where objects are initially allocated. This intuition was included in the corresponding model by building a semi-parametric model, which can successfully combine a user-defined trend with empirical data.

Here’s what a model component for predicting rate looks like in code:
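The listing itself uses BOAT’s probabilistic programming library; as a rough, hypothetical stand-in in plain Python (the class name, the 1/eden-size trend, and the use of scikit-learn’s Gaussian process are assumptions for illustration, not BOAT’s API), a semi-parametric rate component might look like this:

```python
# Hedged sketch of a semi-parametric component for the minor-GC rate:
# a parametric trend (rate ~ c / eden_size) plus a Gaussian process that
# learns the residual between the trend and observed data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

class RateComponent:
    def fit(self, eden_sizes, observed_rates):
        X = np.asarray(eden_sizes, dtype=float).reshape(-1, 1)
        y = np.asarray(observed_rates, dtype=float)
        inv = 1.0 / X[:, 0]
        # Least-squares estimate of c in rate = c / eden_size.
        self.c = float(np.sum(y * inv) / np.sum(inv ** 2))
        residual = y - self.c * inv
        # Non-parametric part models whatever the trend misses.
        self.gp = GaussianProcessRegressor(normalize_y=True).fit(X, residual)
        return self

    def predict(self, eden_sizes):
        X = np.asarray(eden_sizes, dtype=float).reshape(-1, 1)
        return self.c / X[:, 0] + self.gp.predict(X)
```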

In practice, adding only a little structure can be sufficient to make the optimization converge in a reasonable time. This is useful as simpler models are able to adapt to a broad variety of behaviors, such as widely different hardware and workloads. This allows the construction of bespoke autotuners providing global performance portability.

Building probabilistic models in BOAT

The authors give two recommendations for building useful models. Firstly, models should consist of a combination of independent components, with each component predicting a single observable value. Secondly, each component should be a semi-parametric model. What’s that all about?

Consider a model predicting the time it takes to insert an element into a sorted vector. Imagine that so far we have taken five observations. If we are using a parametric model such as linear regression, we’ll try to fit a straight line, resulting in underfitting:

If on the other hand we use a non-parametric model such as a Gaussian process we may end up fitting the data points, but missing the general trend (overfitting):

Semi-parametric models combine parametric and non-parametric models. “The non-parametric model is used to learn the difference between the parametric model and the observed data.” For example, if we simply combine the previous two models, we arrive at our Goldilocks moment:

When building a model’s component in BOAT, we recommend starting with a non-parametric model and adding structure until the optimization converges in satisfactory time.

To combine several of these individual semi-parametric models (each predicting a single observable value) into a single global model, developers define the flow of data between them using a DAG. The example Cassandra model below combines rate, duration and latency.

Evaluation results

For the Cassandra optimisation, the BOAT auto-tuner using the model above was run for 10 iterations on YCSB workloads.

Our optimised configuration outperforms the Cassandra default configuration by up to 63%… Each optimization converges to within 10% of the best found performance by the second iteration.

The second case study involves tuning a TensorFlow-based distributed computation.

Our tuner takes as input a neural network architecture, a set of available machines and a batch size. It then performs ten iterations, each time evaluating a distributed configuration, before returning the one with highest performance. The tuning always finishes within two hours which is small compared to the typical training time of neural networks.

The probabilistic model again contains three components:

  1. The individual device (CPU or GPU) computation time needed for a worker to perform its assigned workload
  2. The individual machine computation time for machines with multiple devices.
  3. The communication time, based on learning the connection speed per type of machine and predicting communication time given the amount of data that must be transferred in each iteration by a given machine.

In our experiments, we tune the scheduling over 10 machines, setting 30-32 parameters. In our largest experiment there are approximately 10^53 possible valid configurations, most of which lead to poor performance.

Here you can see the results of the optimisation as compared to a uniform devices configuration which simply balances load equally among all devices, and a uniform GPUs configuration which balances load equally among all GPUs.

BOAT is open source and available at https://github.com/VDalibard/BOAT.

A blatant Skipjaq plug

If you like the sound of the performance optimisation that Bayesian optimisation-based auto-tuning of configuration can deliver, but don’t have the time or inclination to set up your own Bayesian optimiser, define load tests that mirror production workloads, and run performance tests in such a way that results give a true indication of likely production performance, then you might want to give Skipjaq a try to see what difference it can make with your own apps. Optimisation without any of the hassle. If you’re quick you might still be able to nab a place in the private beta!

DeepSense: a unified deep learning framework for time-series mobile sensing data processing

May 17, 2017

DeepSense: a unified deep learning framework for time-series mobile sensing data processing Yao et al., WWW’17

DeepSense is a deep learning framework that runs on mobile devices, and can be used for regression and classification tasks based on data coming from mobile sensors (e.g., motion sensors). An example of a classification task is heterogeneous human activity recognition (HHAR) – detecting which activity someone might be engaged in (walking, biking, standing, and so on) based on motion sensor measurements. Another example is biometric motion analysis where a user must be identified from their gait. An example of a regression task is tracking the location of a car using acceleration measurements to infer position.

Compared to the state-of-the-art, DeepSense provides an estimator with far smaller tracking error on the car tracking problem, and outperforms state-of-the-art algorithms on the HHAR and biometric user identification tasks by a large margin.

Despite a general shift towards remote cloud processing for a range of mobile applications, we argue that it is intrinsically desirable that heavy sensing tasks be carried out locally on-device, due to the usually tight latency requirements, and the prohibitively large data transmission requirement as dictated by the high sensor sampling frequency (e.g., accelerometer, gyroscope). Therefore we also demonstrate the feasibility of implementing and deploying DeepSense on mobile devices by showing its moderate energy consumption and low overhead for all three tasks on two different types of smart device.

I’d add that on-device processing is also an important component of privacy for many potential applications.

In working through this paper, I ended up with quite a few sketches in my notebook before I reached a proper understanding of how DeepSense works. In this write-up I’m going to focus on taking you through the core network design, and if that piques your interest, the rest of the evaluation details etcetera should then be easy to pick up from the paper itself.

Processing the data from a single sensor

Let’s start off by considering a single sensor (ultimately we want to build applications that combine data from multiple sensors). The sensor may provide multi-dimensional measurements. For example, a motion sensor reports motion along the x, y and z axes. We collect sensor readings in each of these d dimensions at regular intervals (i.e., a time series), which we can represent in matrix form as follows:

We’re going to process the data in non-overlapping windows of width \tau. Dividing the number of data points in the time series sample by \tau gives us the total number of windows, T. For example, if we have 5 seconds of motion sensor data and we divide it into windows lasting 0.25s each, we’ll have 20 windows.

Finding patterns in the time series data works better in the frequency dimension than in the time dimension, so the next step is to take each of the T windows, and pass them through a Fourier transform resulting in f frequency components, each with a magnitude and phase. This gives us a d \times 2f matrix for each window.

We’ve got T of these, and we can pack all of that data into a d \times 2f \times T tensor.
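A minimal numpy sketch of this preprocessing step (an illustration of the description above, not the authors’ code; the window width is given in samples):

```python
# Hedged sketch: split a d-dimensional time series into T non-overlapping
# windows, take an FFT per window, and stack magnitude and phase of the
# frequency components into a d x 2f x T tensor.
import numpy as np

def preprocess(samples, tau_samples):
    """samples: array of shape (d, n) -- d sensor dimensions, n readings.
    tau_samples: number of readings per window."""
    d, n = samples.shape
    T = n // tau_samples
    windows = samples[:, :T * tau_samples].reshape(d, T, tau_samples)
    spectrum = np.fft.rfft(windows, axis=2)                   # (d, T, f) complex
    f = spectrum.shape[2]
    out = np.empty((d, 2 * f, T))
    out[:, :f, :] = np.abs(spectrum).transpose(0, 2, 1)       # magnitudes
    out[:, f:, :] = np.angle(spectrum).transpose(0, 2, 1)     # phases
    return out                                                # shape (d, 2f, T)
```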

It’s handy for the implementation to have everything nicely wrapped up in a single tensor at this point, but actually we’re going to process slice by slice in the T dimension (one window at a time). Each d \times 2f window slice is passed through a convolution neural network component comprising three stages as illustrated below:

First we use 2D convolutional filters to capture interactions among dimensions and in the local frequency domain. The output is then passed through 1D convolutional filter layers to capture high-level relationships. The output of the last filter layer is flattened to yield the sensor feature vector.

Combining data from multiple sensors

Follow the above process for each of the K sensors that are used by the application. We now have K sensor feature vectors, that we can pack into a matrix with K rows.

The sensor feature matrix is then fed through a second convolutional neural network component with the same structure as the one we just looked at. That is, a 2D convolutional filter layer followed by two 1D layers. Again, we take the output of the last filter layer and flatten it into a combined sensors feature vector. The window width \tau is tacked onto the end of this vector.

For each convolutional layer, DeepSense learns 64 filters, and uses ReLU as the activation function. In addition, batch normalization is applied at each layer to reduce internal covariate shift.

Now we have a combined sensors feature vector for one time window. Repeat the above process for all T windows.

Use an RNN to learn patterns across time windows

So now we have T combined sensor feature vectors, each learning intra-window interactions. But of course it’s also important to learn inter-window relationships across time windows. To do this the T feature vectors are fed into an RNN.

At this point I think we’re ready for the big picture.

Instead of using LSTMs, the authors choose to use Gated Recurrent Units (GRUs) for the RNN layer.

… GRUs show similar performance to LSTMs on various tasks, while having a more concise expression, which reduces network complexity for mobile applications.

DeepSense uses a stacked GRU structure with two layers. These can run incrementally when there is a new time window, resulting in faster processing of stream data.

Top it all with an output layer

The output of the recurrent layer is a series of T vectors, \{\mathbf{x}_{t}^{(r)}\}, one for each time window.

For regression-based tasks (e.g., predicting car location), the output layer is a fully connected layer on top of each of those vectors, sharing weights \mathbf{W}_{out} and bias term \mathbf{b}_{out} to learn \mathbf{\hat{y}}_{t} = \mathbf{W}_{out} \cdot \mathbf{x}_{t}^{(r)} + \mathbf{b}_{out}.

For classification tasks, the individual vectors are composed into a single fixed-length vector for further processing. You could use something fancy like a weighted average over time learned by an attention network, but in this paper excellent results are obtained simply by averaging over time (adding up the vectors and dividing by T). This final feature vector is fed into a softmax layer to generate the final category prediction.
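A hedged numpy sketch of the two output variants (shapes and names are illustrative assumptions):

```python
# xs is the recurrent layer's output: T vectors stacked into shape (T, hidden_dim).
import numpy as np

def regression_output(xs, W_out, b_out):
    # One prediction per time window, sharing W_out and b_out across windows:
    # y_hat_t = W_out @ x_t + b_out
    return xs @ W_out.T + b_out

def classification_output(xs, W_softmax, b_softmax):
    # Average the per-window features over time, then apply a softmax layer.
    pooled = xs.mean(axis=0)
    logits = W_softmax @ pooled + b_softmax
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```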

Customise for the application in hand

To tailor DeepSense for a particular mobile sensing and computing task, the following steps are taken:

  • Identify the number of sensor inputs, K, and pre-process the inputs into a set of d \times 2f \times T tensors.
  • Identify the type of the task and select the appropriate output layer
  • Optionally customise the cost function. The default cost function for regression oriented tasks is mean squared error, and for classification it is cross-entropy error.

For the activity recognition (HHAR) and user identification tasks in the evaluation the default cost function is used. For car location tracking a negative log likelihood function is used (see section 4.2 for details).

Key results

Here’s the accuracy that DeepSense achieves on the car tracking task, versus the sensor-fusion and eNav algorithms. The map-aided accuracy column shows the accuracy achieved when the location is mapped to the nearest road segment on a map.

On the HHAR task DeepSense outperforms other methods by 10%.

And on the user identification task by 20%:

We evaluated DeepSense via three representative mobile sensing tasks, where DeepSense outperformed state of the art baselines by significant margins while still claiming its mobile-feasibility through moderate energy consumption and low latency on both mobile and embedded platforms.

The evaluation tasks focused mostly on motion sensors, but the approach can be applied to many other sensor types including microphone, Wi-Fi signal, barometer, and light sensors.

Usage patterns and the economics of the public cloud

May 16, 2017

Usage patterns and the economics of the public cloud Kilcioglu et al., WWW’17

Illustrating the huge diversity of topics covered at WWW, following yesterday’s look at recovering mobile user trajectories from aggregate data, today’s choice studies usage variation and pricing models in the public cloud. The basis for the study is data from ‘a major provider’s public cloud datacenters.’ Unless Google or Amazon are sending their data to three researchers from Microsoft, it’s a fair bet we’re looking at Azure.

Research in economics and operations management posits that dynamic pricing is critically important when capacity is fixed (at least in the short run) and fixed costs represent a substantial fraction of total costs.

For example, the fixed capacity of a data center (at least in the short run) and the relatively fixed costs of keeping it running.

In these markets, firms change prices so that demand is equal or close to the fixed capacity. The degree to which this can be accomplished depends on demand responsiveness and the firm’s price setting ability, which jointly determine how much supply goes unused.

In other words, you make prices cheaper in periods of lower demand to stimulate usage, and you make prices higher in periods of high demand to discourage overuse.

Our scientific understanding of peak-load pricing focuses on three industries: electricity, airlines and hotels. Cloud computing thus seems to offer a fourth large industry where dynamic prices are expected to be important.

And yet cloud providers overwhelmingly use static pricing models. So what’s going on? Here’s the short summary: the data shows that there is actually very little volatility in demand at cloud datacenters at the moment, so the current pricing model makes sense. If you look more closely at actual CPU utilisation rates though, you see that behind the constantly powered-on VMs there are real variations in usage patterns. Therefore as we move to cloud-native applications, and especially to models such as serverless that can scale up and down far more easily and at a finer granularity in response to changing demand, we can expect the optimal pricing models to change too. Even then, it appears that having just two price bands, peak and off-peak, with off-peak times set in advance, would capture the majority of the available efficiency gains.

Let’s take a deeper look at what the data tells us about current cloud usage patterns and where we might be heading. The study is based on four months of usage data from a cloud region. Supply is fixed for the duration of the period (no new datacenters were brought online, and any fluctuations in capacity due to failures, replacements, and so on are negligible relative to the overall capacity). During this time, none of Amazon, Google, or Azure changed their pricing.

Usage patterns at the regional level

The authors look at the average max-min range, capturing average peak-to-trough variation within a unit of time (day, week, month), and also the standard deviation.
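As a rough sketch of these two metrics (synthetic data, not the paper’s; the time index and resample frequencies are just placeholders):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for normalized regional usage: one observation per hour.
idx = pd.date_range("2017-01-01", periods=24 * 120, freq="H")
usage = pd.Series(0.8 + 0.02 * np.random.randn(len(idx)), index=idx)

def volatility(series, freq):
    """Average peak-to-trough (max - min) range and average std dev per period."""
    grouped = series.resample(freq)          # 'D' = day, 'W' = week, 'M' = month
    return (grouped.max() - grouped.min()).mean(), grouped.std().mean()

for freq in ["D", "W", "M"]:
    print(freq, volatility(usage, freq))
```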

Usage is normalized relative to the maximally observed value, which masks confidential information about datacenter efficiency. So a value of, e.g., 0.04 indicates 4% of the maximally observed value.

Daily and weekly ranges are only 2-3% on average, with the majority of variation occurring within a day. Monthly ranges show more variation, but this is explained by growth due to new customers joining, not variation in usage by individual customers. Here you can see visually the relatively low variation in usage overlaid on an underlying growth trend:

…predictable usage enables providers to run datacenters at high utilization efficiency without the need to use prices to shape demand.

We might have expected to see correlated demand spikes – such as increased demand on weekdays during business hours. To understand why we don’t, we need to look deeper into usage patterns at the individual customer level.

Usage patterns at the individual customer level

The authors study two groups of customers, the top 100 customers (capturing the behaviour of large enterprises), and a random set of 100 customers consisting mostly of small businesses.

We start by examining the relationship between each customer’s demand and the entire demand for the region they are deployed in.

This is based on a simple linear regression, IndividualUsage = \alpha + \beta \times RegionUsage. The coefficient \beta captures the relationship between an individual customer’s utilisation and that of the region overall. “All data is de-trended and normalized so that \beta = 1 signifies that when the datacenter demand increases by 1%, the customer’s demand tends to increase by the same percentage.”
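A minimal sketch of this per-customer regression, assuming the usage series have already been de-trended and normalized (the data here is synthetic):

```python
import numpy as np

def demand_beta(individual, region):
    """OLS fit of IndividualUsage = alpha + beta * RegionUsage (1-D arrays)."""
    X = np.column_stack([np.ones_like(region), region])
    (alpha, beta), *_ = np.linalg.lstsq(X, individual, rcond=None)
    return alpha, beta

# Synthetic example: a customer whose demand tracks the region only weakly.
region = np.random.randn(1000)
individual = 0.3 * region + 0.1 * np.random.randn(1000)
print(demand_beta(individual, region))     # beta well below 1
```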

The histogram above shows us that customer usage tends to be positively correlated with overall region usage. Values above 1 are rare – very few customers exacerbate regional level fluctuations.

… many customers are close to zero or negative, indicating that they are either negatively correlated or uncorrelated with market-level demand shocks. These findings help explain the relatively smooth utilization at the datacenter level.

Here are the customer volatility metrics for the two groups, as compared to the region overall.

The higher volatility of regions overall is again explained by the fact that regions attract new customers over time, whereas individual customers tend to have more stable demand.

Note the reasonable number of outliers in the above data. The dataset reveals three broad classes of customer. The first (majority) class seems to represent lift-and-shift of existing applications onto public cloud infrastructure: the application stacks are configured to run with a relatively fixed amount of computational resource, and indeed that’s what they do. A second group of customers exhibits demand that floats within a relatively small range, indicating some dynamic scaling. A minority of customers appear to be deploying cloud-native style applications that more fully exploit dynamic scaling:

These customers are likely using dynamic scaling techniques to minimise costs and thus appear as outliers relative to the typical customer. However, since these customers are not the norm and, as we previously showed, the bursts are not strongly correlated with broader market demand, we do not observe fluctuations of this magnitude at the regional level.

CPU utilisation patterns

So now we know what you probably already suspected: most people simply power on a number of VMs and leave them running.

Since deployments are billed by the minute, there is a financial incentive to re-architect software to take advantage of dynamic scaling, and thus we can expect these techniques to become more widely adopted over time. Actual CPU utilization offers a lens into a potential future where such adoption is widespread.

(AWS Lambda pricing goes down to time slices of 100ms!).

If we look at max CPU utilization in a datacenter over a period of one week we do indeed see much more variation:

Midday has higher CPU utilisation for all weekdays, and weekend utilisation is steady and lower.

Variation in max CPU utilization measured at 5-minute intervals is 10x higher than the variation in VM usage. If CPU usage is indeed a lens into future behavior, then we should expect datacenter utilization to transition from the current regime of low volatility, to much more meaningful swings in demand. This would in turn raise the efficiency gains from using dynamic prices.

The kinds of fluctuations seen at the CPU usage level – and that we might see at the region level with more widespread deployment of serverless and cloud-native application architectures – are similar to the demand fluctuations in the classic peak-load pricing industries.

Given that 70% of the variation is predictable based on time alone, a simple, predictable time-of-day pricing mechanism is adequate to obtain most of the efficiency gains. The suggested scheme is simply to have peak and off-peak pricing, with off-peak times set in advance (e.g., evenings and weekends).
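To illustrate the suggested scheme, here’s a toy sketch (entirely illustrative, with made-up utilization data and prices): compute an average utilization profile by hour-of-week, declare the below-average slots off-peak in advance, and charge one of just two prices accordingly.

```python
import numpy as np
import pandas as pd

# Synthetic hourly CPU utilization with a weekday, midday peak.
idx = pd.date_range("2017-01-01", periods=24 * 7 * 8, freq="H")
util = pd.Series(
    0.5
    + 0.2 * ((idx.dayofweek < 5) & idx.hour.isin(range(9, 18)))
    + 0.05 * np.random.randn(len(idx)),
    index=idx,
)

# Average utilization per hour-of-week slot; off-peak slots are fixed in advance.
hour_of_week = idx.dayofweek * 24 + idx.hour
profile = util.groupby(hour_of_week).mean()
off_peak_slots = set(profile[profile < profile.mean()].index)

def price(ts, peak=0.12, off_peak=0.08):      # made-up prices per VM-hour
    """Two-band pricing: cheaper during the pre-announced off-peak slots."""
    return off_peak if (ts.dayofweek * 24 + ts.hour) in off_peak_slots else peak

print(price(pd.Timestamp("2017-03-04 11:00")))   # a Saturday morning -> off-peak price
```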