
Who controls the Internet? Analyzing global threats using property traversal graphs

May 19, 2017

Who controls the Internet? Analyzing global threats using property traversal graphs Simeonovski et al., WWW’17

Who controls the Internet? How much influence do they have? And what would happen if one of those parties launched an attack or was compromised and used to launch an attack? Previous works have looked at the individual core services, but this paper focuses on their inter-dependencies along attack propagation paths.

An increasing number of reports and studies are showing that a limited number of players have an important influence on the overall security of the Internet infrastructure… we have a rather limited capability to assess the impact of attacks against, or performed by, core service providers.

What kind of attacks are we talking about? Three large-scale security incidents form the initial motivation:

  1. The Great Cannon DDoS Attack of March 16th 2015, a massive DDoS attack caused by malicious JavaScript code injected into TCP connections crossing Chinese network borders. The injected code aggressively requested resources from the DDoS targets.
  2. The PRISM program (2013), an NSA surveillance program with direct access to Internet communications and stored information. “While the direct involvement of popular tech providers is still unclear, in this paper we make the assumption that establishing this type of collaboration is possible and can be voluntary, or coerced by authorities by means of law and court orders.”
  3. The DDoS attack against Dyn.com of October 21st 2016. The attack caused Dyn.com customers including Amazon, Netflix, Twitter, Reddit, and Spotify to experience outages on name resolution, affecting hundreds of millions of Internet users who could not access their services.

Four different attack vectors are analysed: email sniffing, redirection via malicious domain resolution, in-path content injection, and hosting malicious content.

Gathering information

The authors crawl the web starting from the top 100K Alexa domains, expanding to server and network information, and then adding in organisations and countries. This ultimately leads to a labeled graph containing 1.8M nodes, of which 350K are unique IP addresses. The nodes are connected by 4.7M relationships.

The following table shows labels (think node and edge types) in the graph:

When considering the impact of an attack, nodes can be marked at one of three compromise levels: compromised, partially compromised, and non-compromised. Taint-style propagation rules can then be written which capture how an attack can spread through the network. For example, if a node n is compromised and there is an edge from n to m labeled A (name lookup), then m is marked as compromised too.
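To make the idea concrete, here’s a little sketch of such propagation rules as a fixed-point computation over the labeled graph. This is my illustration, not the authors’ implementation: the node names, edge labels, and rule set are invented for the example.

```python
# Illustrative taint-style propagation over a labeled graph (not the paper's implementation).
RANKS = {"non-compromised": 0, "partially compromised": 1, "compromised": 2}

def propagate(node_status, edges, rules):
    """Apply propagation rules until no node's compromise level changes.

    node_status: dict mapping node -> compromise level
    edges: list of (source, target, label) tuples
    rules: dict mapping (source level, edge label) -> level to assign to the target
    """
    changed = True
    while changed:
        changed = False
        for src, dst, label in edges:
            new_level = rules.get((node_status[src], label))
            if new_level and RANKS[new_level] > RANKS[node_status[dst]]:
                node_status[dst] = new_level
                changed = True
    return node_status

# Example rule: a compromised authoritative name server (edge label "A", name lookup)
# compromises the domains that resolve through it.
rules = {("compromised", "A"): "compromised"}
status = propagate({"evil-ns": "compromised", "example.com": "non-compromised"},
                   [("evil-ns", "example.com", "A")], rules)
print(status)  # {'evil-ns': 'compromised', 'example.com': 'compromised'}
```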

Identifying the most promising attack targets

Before assessing attacks, we use our model to select entities that can be either attack victims or the attackers. The selection criteria are based on metrics that reflect the popularity and the influence of entities.

The most promising attack targets (or viewed from another perspective, the entities with the most power over Internet infrastructure) are identified via six metrics.

Who hosts the most Alexa domains?

The analysis is done by country (giving a lens into the power of nation-state attackers), and also by ‘Autonomous Systems’ (AS) – a collection of IP networks and routers under the control of a given network operator.

Under this metric, these are the most powerful countries:

And these are the most powerful network operators:

Who has the most JavaScript hosting servers?

By country:

And by network operator:

Who hosts the most email servers?

By country:

And by network operator:

Who hosts the most name servers?

By country:

And by network operator:

Who has the most power over JavaScript providers?

This metric measures the number of JS hosting servers whose authoritative name server is hosted in a given country or by a given network operator.

By country:

And by network operator:

Who controls the most email server name servers?

The number of email server domains whose authoritative name servers are hosted in a given country or by a given network operator.

By country:

And by network operator:

Evaluating the impact of potential attacks

Now we’re in a position to evaluate the potential impact of three different attacks: distribution of malicious JavaScript content, email sniffing, and a DDoS attack against a core service provider. In each case a target can be selected by consulting the tables above.

Distributing malicious JavaScript content

The authors consider three ways to do this: directly compromising (or colluding with) web servers hosting JS code; injecting malicious JavaScript when JS libraries are accessed over unprotected connections (HTTP instead of HTTPS); and redirecting requests for JS content via compromised name resolution.

Here we see the number of Alexa domains that can be reached via the first two of these:

The attack results show that countries can be very powerful attackers. For example, the United States hosts 47K JS hosting providers, which could distribute malicious code to about 16% of the top 100K Alexa domains. However, ASes are also very powerful and affect a fraction of websites that is even larger than that of individual countries, and even groups of countries. For example, the AS of Google can affect about 9% of Alexa domains.

When we look at JS inclusion over unprotected connections, 1,079 of them cross the Chinese network borders, but the United States, the Netherlands, Russia, Germany, and Japan all have even greater influence.

In malicious name resolution redirection the authoritative name server of a domain hosting JS redirects users to a malicious server. The attack result is the number of websites including a resource hosted on a server whose name server is colluding or compromised.

The United States, Google, and DynDNS stand out here.

Email sniffing

To acquire a large number of emails, an attacker can rely on various techniques. In this paper we consider two. The first one is by acquiring them directly from the email server. The second one is by redirecting an email client toward a malicious mail server, which will accept the email, keep a copy, and forward it to the intended recipient. This attack can be performed by a provider or by a country. Tables 3(c) and 3(d) show the attack results. All values are the number of Alexa domains that will be affected by this attack grouped by technique and attacker.

Email sniffing by a malicious email provider:

The United States alone can acquire emails for 25% of the most popular websites!

Malicious name resolution for email sniffing:

Note how Google has much less power in this attack vector – most websites that use Google’s email servers do so via name servers which are not hosted by Google.

DDoS against a core service provider

What happens if a service provider is the victim of an attack and is made unavailable? The data we already have can be used to figure this out. For example, consider the Dyn.com DDoS attack from October 2016. DynDNS does not host a significant number of mail servers or JS hosting providers, but it does host 364 name servers.

These name servers are authoritative for 3,570 domains hosting JS that provide JS to 5,559 top 100K Alexa domains (not shown in Table 3), of which 4,331 are unprotected JS inclusions. Furthermore, the name servers hosted by DynDNS are authoritative for 1,523 domains running mail servers which are used by 1,178 top Alexa domains. If the Dyn.com DNS infrastructure is attacked, then a fraction that ranges from 1 to 5% of the top 100K Alexa domains would be affected.

So who controls the Internet?

Our results show that already just a few players may have an extensive power: 14 countries and 14 autonomous systems can, directly or indirectly, affect the security of about 23% of websites… In addition, our results show that little has been learned from past attacks. For example, 70% of JavaScript (JS) inclusion is still done over unprotected connections, i.e., via HTTP URLs, which can be used to mount the Great Cannon attack.

BOAT: Building auto-tuners with structured Bayesian optimization

May 18, 2017

BOAT: Building auto-tuners with structured Bayesian optimization Dalibard et al., WWW’17

Due to their complexity, modern systems expose many configuration parameters which users must tune to maximize performance… From the number of machines used in a distributed application, to low-level parameters such as compiler flags, managing configurations has become one of the main challenges faced by users of modern systems.

The time and expertise required to optimise configuration settings simply doesn’t exist in many DevOps teams focused on rapidly delivering application functionality. That’s a shame, because optimising the configuration (without changing a single line of application code) can have way bigger benefits than I ever would have imagined before I first started looking into this space. Today’s paper choice, BOAT, provides further evidence of that, reducing 99th-percentile latency for a Cassandra workload from 19ms to 7ms with an optimised configuration vs the defaults. In addition, two hours spent tuning the configuration of a TensorFlow computation delivered at least a 2x speed-up in the median time per SGD iteration.

I also know this because I work with a company called Skipjaq. I should declare my interest here: Accel is an investor, I’m a personal investor and board member, and the (highly technical!) CEO is a long-standing friend of mine. So I probably couldn’t be more biased if I tried! Like BOAT (and CherryPick that we also looked at recently), Skipjaq use Bayesian Optimisation to auto-tune applications. As a recent example, after putting the Skipjaq-optimised configuration settings into production, one company observed a 50% reduction in 95%-ile page load times. And as many studies have shown, page load times have an inversely proportional relationship to revenue. I.e., the faster your page loads, the better the engagement and conversion rates.

For another data point, you might want to check out Ramki Ramakrishna from Twitter’s talk on “Continuous optimization of microservices using machine learning.” As Ramki concludes, “continuous automated optimization of microservices is possible, even inevitable.”

Hopefully you’re now feeling sufficiently motivated to learn a little more about how BOAT, the BespOke Auto-Tuner works!

Challenges in auto-tuning

… we consider the task of automatically tuning configuration parameters associated with system performance. This can include any parameter for which the optimal value is dependent on the hardware or software context, ranging from low level parameters such as compiler flags or configuration files, to higher level ones like the number of machines used in a computation.

Two primary challenges make this difficult:

  1. Each performance run is expensive, which means that evolutionary or hill-climbing approaches, which often require thousands of evaluations, cannot be applied. One answer to this, as we saw with CherryPick, is to use Bayesian optimisation.
  2. The configuration space of most real world problems is too complex. “Bayesian optimization tends to require fewer iterations to converge than other optimisers, but fails to work in problems with many dimensions (more than 10).”

In the neural network case study included in the paper, BOAT is applied to a problem with over 30 dimensions. (Skipjaq have also had success using an approach based on Bayesian optimisation, with around 30 dimensions.)

Structured Bayesian Optimisation

We covered the basic idea of Bayesian optimisation in the CherryPick write-up a couple of weeks ago, so I’ll just give the briefest of recaps here. Bayesian optimisation incrementally builds a probabilistic model (typically a Gaussian process) of the objective function being minimised. In each iteration it finds a candidate point in the exploration space to explore using an acquisition function (for example, based on expected improvement), and then takes a measurement of the objective function at that point (i.e., runs a performance test in this case). The probabilistic model is then updated using this new measurement.
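For reference, here’s roughly what that loop looks like in code. It’s a generic sketch (not BOAT’s or CherryPick’s implementation), using a plain Gaussian process surrogate and an expected-improvement acquisition function over a fixed set of candidate configurations.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayesian_optimize(objective, candidates, n_iter=10):
    """Generic Bayesian optimisation loop (minimisation) with expected improvement.

    objective: runs a performance test for a configuration and returns the measurement
    candidates: list of configuration vectors (numpy arrays) to consider
    """
    X = [candidates[np.random.randint(len(candidates))]]   # random initial configuration
    y = [objective(X[0])]
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(np.array(X), np.array(y))                   # update the probabilistic model
        mu, sigma = gp.predict(np.array(candidates), return_std=True)
        best = min(y)
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = candidates[int(np.argmax(ei))]            # acquisition picks the next point
        X.append(x_next)
        y.append(objective(x_next))                        # run the performance test
    return X[int(np.argmin(y))]
```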

A key failure mode for Bayesian optimisation is when the probabilistic model fails to accurately capture the objective function landscape after a reasonable number of iterations. This can happen when there are too many dimensions in the model.

Enter structured Bayesian optimisation. The central idea is to improve the accuracy of the probabilistic model (as used by the acquisition function) by exploiting some basic knowledge of the problem at hand instead of starting with a blank slate Gaussian process.

Structured Bayesian optimization (SBO) extends the Bayesian optimization methodology to take advantage of the known structure of the objective function. It uses a structured probabilistic model, provided by the developer, instead of a simple Gaussian process. In BOAT, those models are implemented in our probabilistic programming library.

The SBO process looks just like the standard Bayesian optimisation process, but with a probabilistic program used to build the underlying model.

Using a bespoke model captures the user’s understanding of the behaviour of the system, which can drastically reduce the number of iterations needed for the model to converge towards the true objective function. It also allows for monitoring of runtime properties that are reflected in the model, so that they can be used for inference.

An example should help at this point. In the GC case study, the goal is to tune the JVM garbage collector for a Cassandra workload. This optimisation considers the three parameters young generation size, survivor ratio, and max tenuring threshold. The model that the team built included the rate and average duration statistics for minor collections, predicting both of these values given the configuration. It then predicted latency as a function of the configuration values, rate, and average duration.

When using the model in BOAT, we collected the true value of these statistics from the GC logs after each evaluation and used them for inference. Further, we declared how we believed each of the three model components behaved as a function of their inputs. For example, we noticed the rate of minor collections was inversely proportional to the size of the eden memory region in the JVM-heap, which is where objects are initially allocated. This intuition was included in the corresponding model by building a semi-parametric model, which can successfully combine a user-defined trend with empirical data.

Here’s what a model component for predicting rate looks like in code:
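In rough Python-flavoured pseudocode of my own (BOAT’s real models are written with its C++ probabilistic programming library, so these names are illustrative, not BOAT’s API), the idea is a semi-parametric component whose parametric trend says the minor collection rate is inversely proportional to eden size, with a Gaussian process learning whatever the trend misses:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

class RateModel:
    """Semi-parametric component: parametric trend k / eden_size plus a GP on the residuals."""

    def fit(self, eden_sizes, observed_rates):
        X = np.asarray(eden_sizes, dtype=float).reshape(-1, 1)
        y = np.asarray(observed_rates, dtype=float)
        inv = 1.0 / X[:, 0]
        # Least-squares fit of the user-supplied trend: rate ~ k / eden_size.
        self.k = float(np.dot(inv, y) / np.dot(inv, inv))
        # The non-parametric part learns the difference between the trend and the data.
        self.gp = GaussianProcessRegressor(normalize_y=True).fit(X, y - self.k * inv)
        return self

    def predict(self, eden_size):
        trend = self.k / eden_size
        residual = self.gp.predict(np.array([[eden_size]]))[0]
        return trend + residual
```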

In practice, adding only a little structure can be sufficient to make the optimization converge in a reasonable time. This is useful as simpler models are able to adapt to a broad variety of behaviors, such as widely different hardware and workloads. This allows the construction of bespoke autotuners providing global performance portability.

Building probabilistic models in BOAT

The authors give two recommendations for building useful models. Firstly, models should consist of a combination of independent components, with each component predicting a single observable value. Secondly, each component should be a semi-parametric model. What’s that all about?

Consider a model predicting the time it takes to insert an element into a sorted vector. Imagine that so far we have taken five observations. If we are using a parametric model such as linear regression, we’ll try to fit a straight line, resulting in underfitting:

If on the other hand we use a non-parametric model such as a Gaussian process we may end up fitting the data points, but missing the general trend (overfitting):

Semi-parametric models combine parametric and non-parametric models. “The non-parametric model is used to learn the difference between the parametric model and the observed data.” For example, if we simply combine the previous two models, we arrive at our Goldilocks moment:

When building a model’s component in BOAT, we recommend starting with a non-parametric model and adding structure until the optimization converges in satisfactory time.

To combine several of these individual semi-parametric models (each predicting a single observable value) into a single global model, developers define the flow of data between them using a DAG. The example Cassandra model below combines rate, duration and latency.
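A rough sketch of what that composition might look like (again my illustration, with a hypothetical interface rather than BOAT’s API): each component predicts one observable value, and the measured values from the GC logs are fed back to the component that predicts them.

```python
class CassandraLatencyModel:
    """Toy DAG of components: configuration -> (rate, duration) -> latency."""

    def __init__(self, rate_model, duration_model, latency_model):
        # Each argument is assumed to be a semi-parametric component exposing
        # predict(...) and observe(...) methods along the lines of RateModel above.
        self.rate_model = rate_model
        self.duration_model = duration_model
        self.latency_model = latency_model

    def predict_latency(self, config):
        rate = self.rate_model.predict(config)          # minor collections per second
        duration = self.duration_model.predict(config)  # average minor collection duration
        return self.latency_model.predict(config, rate, duration)

    def observe(self, config, measured_rate, measured_duration, measured_latency):
        # After each evaluation the true GC-log statistics are used for inference,
        # so every component is trained against the observable it predicts.
        self.rate_model.observe(config, measured_rate)
        self.duration_model.observe(config, measured_duration)
        self.latency_model.observe(config, measured_rate, measured_duration, measured_latency)
```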

Evaluation results

For the Cassandra optimisation, the BOAT auto-tuner using the model above was run for 10 iterations on YCSB workloads.

Our optimised configuration outperforms the Cassandra default configuration by up to 63%… Each optimization converges to within 10% of the best found performance by the second iteration.

The second case study involves tuning a TensorFlow-based distributed computation.

Our tuner takes as input a neural network architecture, a set of available machines and a batch size. It then performs ten iterations, each time evaluating a distributed configuration, before returning the one with highest performance. The tuning always finishes within two hours which is small compared to the typical training time of neural networks.

The probabilistic model again contains three components:

  1. The individual device (CPU or GPU) computation time needed for a worker to perform its assigned workload
  2. The individual machine computation time for machines with multiple devices.
  3. The communication time, based on learning the connection speed per type of machine and predicting communication time given the amount of data that must be transferred in each iteration by a given machine.

In our experiments, we tune the scheduling over 10 machines, setting 30-32 parameters. In our largest experiment there are approximately 10^53 possible valid configurations, most of which lead to poor performance.

Here you can see the results of the optimisation as compared to a uniform devices configuration which simply balances load equally among all devices, and a uniform GPUs configuration which balances load equally among all GPUs.

BOAT is open source and available at https://github.com/VDalibard/BOAT.

A blatant Skipjaq plug

If you like the sound of the performance optimisation that Bayesian optimisation-based auto-tuning of configuration can deliver, but don’t have the time or inclination to set up your own Bayesian optimiser, define load tests that mirror production workloads, and run performance tests in such a way that results give a true indication of likely production performance, then you might want to give Skipjaq a try to see what difference it can make with your own apps. Optimisation without any of the hassle. If you’re quick you might still be able to nab a place in the private beta!

DeepSense: a unified deep learning framework for time-series mobile sensing data processing

May 17, 2017

DeepSense: a unified deep learning framework for time-series mobile sensing data processing Yao et al., WWW’17

DeepSense is a deep learning framework that runs on mobile devices, and can be used for regression and classification tasks based on data coming from mobile sensors (e.g., motion sensors). An example of a classification task is heterogeneous human activity recognition (HHAR) – detecting which activity someone might be engaged in (walking, biking, standing, and so on) based on motion sensor measurements. Another example is biometric motion analysis where a user must be identified from their gait. An example of a regression task is tracking the location of a car using acceleration measurements to infer position.

Compared to the state-of-the-art, DeepSense provides an estimator with far smaller tracking error on the car tracking problem, and outperforms state-of-the-art algorithms on the HHAR and biometric user identification tasks by a large margin.

Despite a general shift towards remote cloud processing for a range of mobile applications, we argue that it is intrinsically desirable that heavy sensing tasks be carried out locally on-device, due to the usually tight latency requirements, and the prohibitively large data transmission requirement as dictated by the high sensor sampling frequency (e.g., accelerometer, gyroscope). Therefore we also demonstrate the feasibility of implementing and deploying DeepSense on mobile devices by showing its moderate energy consumption and low overhead for all three tasks on two different types of smart device.

I’d add that on-device processing is also an important component of privacy for many potential applications.

In working through this paper, I ended up with quite a few sketches in my notebook before I reached a proper understanding of how DeepSense works. In this write-up I’m going to focus on taking you through the core network design, and if that piques your interest, the rest of the evaluation details etcetera should then be easy to pick up from the paper itself.

Processing the data from a single sensor

Let’s start off by considering a single sensor (ultimately we want to build applications that combine data from multiple sensors). The sensor may provide multi-dimensional measurements. For example, a motion sensor that reports motion along the x, y, and z axes. We collect sensor readings in each of these d dimensions at regular intervals (i.e., a time series), which we can represent in matrix form as follows:

We’re going to process the data in non-overlapping windows of width \tau. Dividing the number of data points in the time series sample by \tau gives us the total number of windows, T. For example, if we have 5 seconds of motion sensor data and we divide it into windows lasting 0.25s each, we’ll have 20 windows.

Finding patterns in the time series data works better in the frequency dimension than in the time dimension, so the next step is to take each of the T windows, and pass them through a Fourier transform resulting in f frequency components, each with a magnitude and phase. This gives us a d \times 2f matrix for each window.

We’ve got T of these, and we can pack all of that data into a d \times 2f \times T tensor.
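As a minimal sketch of this preprocessing for a single sensor (my reconstruction, assuming a fixed sample rate; the paper’s exact frequency representation may differ in detail):

```python
import numpy as np

def preprocess(sensor_readings, tau, sample_rate):
    """Turn a (d, n_samples) time series from one sensor into a d x 2f x T tensor.

    sensor_readings: d-dimensional measurements sampled at sample_rate Hz
    tau: window width in seconds
    """
    d, n = sensor_readings.shape
    samples_per_window = int(tau * sample_rate)
    T = n // samples_per_window          # number of non-overlapping windows
    f = samples_per_window // 2          # frequency components kept per dimension
    out = np.zeros((d, 2 * f, T))
    for t in range(T):
        window = sensor_readings[:, t * samples_per_window:(t + 1) * samples_per_window]
        spectrum = np.fft.fft(window, axis=1)[:, :f]   # d x f complex values
        out[:, 0::2, t] = np.abs(spectrum)             # magnitudes
        out[:, 1::2, t] = np.angle(spectrum)           # phases
    return out  # e.g. 5s of 3-axis data in 0.25s windows gives shape (3, 2f, 20)
```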

It’s handy for the implementation to have everything nicely wrapped up in a single tensor at this point, but actually we’re going to process slice by slice in the T dimension (one window at a time). Each d \times 2f window slice is passed through a convolution neural network component comprising three stages as illustrated below:

First we use 2D convolutional filters to capture interactions among dimensions and in the local frequency domain. The output is then passed through 1D convolutional filter layers to capture high-level relationships. The output of the last filter layer is flattened to yield the sensor feature vector.

Combining data from multiple sensors

Follow the above process for each of the K sensors that are used by the application. We now have K sensor feature vectors, which we can pack into a matrix with K rows.

The sensor feature matrix is then fed through a second convolutional neural network component with the same structure as the one we just looked at. That is, a 2D convolutional filter layer followed by two 1D layers. Again, we take the output of the last filter layer and flatten it into a combined sensors feature vector. The window width \tau is tacked onto the end of this vector.

For each convolutional layer, DeepSense learns 64 filters, and uses ReLU as the activation function. In addition, batch normalization is applied at each layer to reduce internal covariate shift.
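In PyTorch-style code, the per-window convolutional component might look something like this. The 64 filters, ReLU and batch normalisation come from the paper; the kernel sizes are guesses of mine, purely for illustration.

```python
import torch.nn as nn

class SensorConvNet(nn.Module):
    """Per-sensor convolutional component applied to one d x 2f window slice (a sketch)."""

    def __init__(self, d, n_filters=64):
        super().__init__()
        # 2D convolution over the d x 2f slice: interactions across sensor dimensions
        # and nearby frequency components.
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, n_filters, kernel_size=(d, 5)),
            nn.BatchNorm2d(n_filters), nn.ReLU())
        # Two 1D convolutional layers over the remaining frequency axis
        # to capture higher-level relationships.
        self.conv1d = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, kernel_size=3),
            nn.BatchNorm1d(n_filters), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=3),
            nn.BatchNorm1d(n_filters), nn.ReLU())

    def forward(self, x):          # x: (batch, 1, d, 2f) -- one time window per sample
        h = self.conv2d(x)         # (batch, 64, 1, 2f - 4)
        h = h.squeeze(2)           # (batch, 64, 2f - 4)
        h = self.conv1d(h)         # (batch, 64, 2f - 8)
        return h.flatten(1)        # flattened sensor feature vector
```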

Now we have a combined sensors feature vector for one time window. Repeat the above process for all T windows.

Use an RNN to learn patterns across time windows

So now we have T combined sensor feature vectors, each capturing intra-window interactions. But of course it’s also important to learn relationships across time windows. To do this the T feature vectors are fed into an RNN.

At this point I think we’re ready for the big picture.

Instead of using LSTMs, the authors choose to use Gated Recurrent Units (GRUs) for the RNN layer.

… GRUs show similar performance to LSTMs on various tasks, while having a more concise expression, which reduces network complexity for mobile applications.

DeepSense uses a stacked GRU structure with two layers. These can run incrementally when there is a new time window, resulting in faster processing of stream data.

Top it all with an output layer

The output of the recurrent layer is a series of T vectors, \{\mathbf{x}_{t}^{(r)}\}, one for each time window.

For regression-based tasks (e.g., predicting car location), the output layer is a fully connected layer on top of each of those vectors, sharing weights \mathbf{W}_{out} and bias term \mathbf{b}_{out} to learn \mathbf{\hat{y}}_{t} = \mathbf{W}_{out} \cdot \mathbf{x}_{t}^{(r)} + \mathbf{b}_{out}.

For classification tasks, the individual vectors are combined into a single fixed-length vector for further processing. You could use something fancy like a weighted average over time learned by an attention network, but in this paper excellent results are obtained simply by averaging over time (adding up the vectors and dividing by T). This final feature vector is fed into a softmax layer to generate the final category prediction.
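Sticking with PyTorch-style pseudocode, the two kinds of output layer might be sketched as follows (my illustration, not the authors’ code):

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Shared affine output applied to each time window's recurrent output."""

    def __init__(self, hidden_size, out_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_size, out_dim)   # shared W_out and b_out across windows

    def forward(self, x_r):        # x_r: (batch, T, hidden_size) from the stacked GRU
        return self.fc(x_r)        # one prediction per time window, e.g. car location

class ClassificationHead(nn.Module):
    """Average the T recurrent outputs over time, then apply a softmax layer."""

    def __init__(self, hidden_size, n_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x_r):
        pooled = x_r.mean(dim=1)                    # simple average over the T windows
        return torch.softmax(self.fc(pooled), dim=1)
```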

Customise for the application in hand

To tailor DeepSense for a particular mobile sensing and computing task, the following steps are taken:

  • Identify the number of sensor inputs, K, and pre-process the inputs into a set of d \times 2f \times T tensors.
  • Identify the type of the task and select the appropriate output layer
  • Optionally customise the cost function. The default cost function for regression oriented tasks is mean squared error, and for classification it is cross-entropy error.

For the activity recognition (HHAR) and user identification tasks in the evaluation the default cost function is used. For car location tracking a negative log likelihood function is used (see section 4.2 for details).

Key results

Here’s the accuracy that DeepSense achieves on the car tracking task, versus the sensor-fusion and eNav algorithms. The map-aided accuracy column shows the accuracy achieved when the location is mapped to the nearest road segment on a map.

On the HHAR task DeepSense outperforms other methods by 10%.

And on the user identification task by 20%:

We evaluated DeepSense via three representative mobile sensing tasks, where DeepSense outperformed state of the art baselines by significant margins while still claiming its mobile-feasibility through moderate energy consumption and low latency on both mobile and embedded platforms.

The evaluation tasks focused mostly on motion sensors, but the approach can be applied to many other sensor types including microphone, Wi-Fi signal, barometer, and light sensors.

Usage patterns and the economics of the public cloud

May 16, 2017

Usage patterns and the economics of the public cloud Kilcioglu et al., WWW’17

Illustrating the huge diversity of topics covered at WWW, following yesterday’s look at recovering mobile user trajectories from aggregate data, today’s choice studies usage variation and pricing models in the public cloud. The basis for the study is data from ‘a major provider’s public cloud datacenters.’ Unless Google or Amazon are sending their data to three researchers from Microsoft, it’s a fair bet we’re looking at Azure.

Research in economics and operations management posits that dynamic pricing is critically important when capacity is fixed (at least in the short run) and fixed costs represent a substantial fraction of total costs.

For example, the fixed capacity of a data center (at least in the short run) and the relatively fixed costs of keeping it running.

In these markets, firms change prices so that demand is equal or close to the fixed capacity. The degree to which this can be accomplished depends on demand responsiveness and the firm’s price setting ability, which jointly determine how much supply goes unused.

In other words, you make prices cheaper in periods of lower demand to stimulate usage, and you make prices higher in periods of high demand to discourage overuse.

Our scientific understanding of peak-load pricing focuses on three industries: electricity, airlines and hotels. Cloud computing thus seems to offer a fourth large industry where dynamic prices are expected to be important.

And yet cloud providers overwhelmingly use static pricing models. So what’s going on? Here’s the short summary: the data shows that there is actually very little variation in demand volatility for cloud datacenters at the moment, thus the current pricing model makes sense. If you look more closely at actual CPU utilisation rates though, you see that behind the constantly powered-on VMs, there are true variations in usage patterns. Therefore as we move to cloud-native applications, and especially to models such as serverless that can much more effortlessly and granularly scale up and down in response to changing demands, we can expect the optimum pricing models to also change. Even then, it appears that having just two price bands, peak and off-peak, with off-peak times set in advance, would obtain the majority of the efficiency gains available.

Let’s take a deeper look at what the data tells us about current cloud usage patterns and where we might be heading. The study is based on four months of usage data from a cloud region. Supply is fixed for the duration of the period (no new datacenters were brought online, and any fluctuations in capacity due to failures and replacements etc. are negligible relative to the overall capacity). During this time, neither Amazon, Google, or Azure changed their pricing.

Usage patterns at the regional level

The authors look at the average max-min range, capturing average peak-to-trough variation within a unit of time (day, week, month), and also the standard deviation.

Usage is normalized relative to the maximally observed value, which masks confidential information about datacenter efficiency (so a value of, e.g., 0.04 indicates 4% of the maximally observed value).

Daily and weekly ranges are only 2-3% on average, with the majority of variation occurring within a day. Monthly ranges show more variation, but this is explained by growth due to new customers joining, not variation in usage by individual customers. Here you can see visually the relatively low variation in usage overlaid on an underlying growth trend:

…predictable usage enables providers to run datacenters at high utilization efficiency without the need to use prices to shape demand.

We might have expected to see correlated demand spikes – such as increased demand on weekdays during business hours. To understand why we don’t, we need to look deeper into usage patterns at the individual customer level.

Usage patterns at the individual customer level

The authors study two groups of customers, the top 100 customers (capturing the behaviour of large enterprises), and a random set of 100 customers consisting mostly of small businesses.

We start by examining the relationship between each customer’s demand and the entire demand for the region they are deployed in.

This is based on a simple linear regression, IndividualUsage = \alpha + \beta \times RegionUsage. The coefficient \beta captures the relationship of the utilisation for an individual customer as compared to the region overall. “All data is de-trended and normalized so that \beta = 1 signifies that when the datacenter demand increases by 1%, the customer’s demand tends to increase by the same percentage”.

The histogram above shows us that customer usage tends to be positively correlated with overall region usage. Values above 1 are rare – very few customers exacerbate regional level fluctuations.

… many customers are close to zero or negative, indicating that they are either negatively correlated or uncorrelated with market-level demand shocks. These findings help explain the relatively smooth utilization at the datacenter level.

Here are the customer volatility metrics for the two groups, as compared to the region overall.

The higher volatility of regions overall is again explained by the fact that regions attract new customers over time, whereas individual customers tend to have more stable demand.

Note the reasonable number of outliers in the above data. The dataset reveals three broad classes of customer. The first (majority) class of customers seems to represent lift-and-shift of existing applications onto public cloud infrastructure. The application stacks are defined to run with a relatively fixed amount of computational resource, and indeed that’s what they do. A second group of customers exhibit usage demand floating around a relatively small range, indicating some dynamic scaling. A minority of customers appear to be deploying cloud-native style applications more fully exploiting dynamic scaling:

These customers are likely using dynamic scaling techniques to minimise costs and thus appear as outliers relative to the typical customer. However, since these customers are not the norm and, as we previously showed, the bursts are not strongly correlated with broader market demand, we do not observe fluctuations of this magnitude at the regional level.

CPU utilisation patterns

So now we know what you probably already suspected: most people simply power on a number of VMs and leave them running.

Since deployments are billed by the minute, there is a financial incentive to re-architect software to make use of these benefits and thus we expect them to be more widely adopted as time goes on. Actual CPU utilization offers a lens into a potential future where adoption is widespread.

(AWS Lambda pricing goes down to time slices of 100ms!).

If we look at max CPU utilization in a datacenter over a period of one week we do indeed see much more variation:

Midday has higher CPU utilisation for all weekdays, and weekend utilisation is steady and lower.

Variation in max CPU utilization measured at 5-minute intervals is 10x higher than the variation in VM usage. If CPU usage is indeed a lens into future behavior, then we should expect datacenter utilization to transition from the current regime of low volatility, to much more meaningful swings in demand. This would in turn raise the efficiency gains from using dynamic prices.

The kind of fluctuations seen at the CPU usage level – and that we might see at the region level with more widespread deployment of serverless and cloud-native application architectures – are similar to the demand fluctuations in the classic peak-load pricing industries.

Given that 70% of the variation is predictable based on time alone, a simple predictable time of day pricing mechanism is adequate to obtain most of the efficiency gains. The suggested scheme is simply to have peak and off-peak pricing, with off-peak times set in advance (e.g., evenings and weekends).

Trajectory recovery from Ash: User privacy is NOT preserved in aggregated mobility data

May 15, 2017

Trajectory recovery from Ash: User privacy is NOT preserved in aggregated mobility data Xu et al., WWW’17

Borrowing a little from Simon Wardley’s marvellous Enterprise IT Adoption Cycle, here’s roughly how my understanding progressed as I read through this paper:

Huh? What? How? Nooooo, Oh No, Oh s*@#!

Xu et al. show us that even in a dataset in which you might initially think there is no chance of leaking information about individuals, they can recover data about individual users with between 73% and 91% accuracy. Even in datasets which aggregate data on tens of thousands to hundreds of thousands of users! Their particular context is mobile location data, but underpinning the discovery mechanism is a reliance on two key characteristics:

  1. Individuals tend to do the same things over and over (regularity) – i.e., there are patterns in the data relating to given individuals, and
  2. These patterns are different across different users (uniqueness).

Therefore, statistical data with similar features is likely to suffer from the same privacy breach. Unfortunately, these two features are quite common in the traces left by humans, which have been reported in numerous scenarios, such as credit card records, mobile application usage, and even web browsing. Hence, such a privacy problem is potentially severe and universal, which calls for immediate attention from both academia and industry.

(Emphasis mine). As we’ll see, there are a few other details that matter which in my mind might make it harder to transfer to other problem domains – chiefly the ability to define appropriate cost functions – but even if it does relate only to location-based data, it’s still a very big deal.

Let’s take a look at what’s in a typical aggregated mobility dataset, and then go on to see how Xu et al. managed to blow it wide open.

Aggregated mobility data (ash)

In an attempt to preserve user privacy, owners of mobility data tend to publish only aggregated data, for example, the number of users covered by a cellular tower at a particular timestamp. The statistical data is of broad interest with applications in epidemic control, transportation scheduling, business intelligence, and more. The aggregated data is sometimes called the ash of the original trajectories.

The operators believe that such aggregation will preserve users’ privacy while providing useful statistical information for academic research and commercial usage.

To publish aggregated mobility data, first the original mobility records of individual mobile users are grouped by time slots (windows), then for each window, aggregate statistics are computed (e.g., the number of mobile users covered by each base station).

Two desirable properties are assumed to hold from following such a process:

  1. No individual information can be directly acquired from the datasets, with the aggregated mobility data complying with the k-anonymity privacy model.
  2. The aggregate statistics are accurate.

Although the privacy leakage in publishing anonymized individual’s mobility records has been recognized and extensively studied, the privacy issue in releasing aggregated mobility data remains unknown.

Looking at this from a differential privacy perspective, you might already be wondering about the robustness of being able to detect e.g., whether or not a particular individual is represented in the dataset. But what we’re about to do goes much further. Take a dataset collected over some time period, which covers a total of M locations (places). In each time slot, we have the number of users in each of those places (i.e., an M-dimensional vector). That’s all you get. From this information only, it’s possible to recover between 73% and 91% of all the individual users’ trajectories – i.e., which locations they were in at which times, and hence how they moved between them.

Huh? What? How?

Revealing individual users from aggregate data

The first thing it’s easy to work out is how many individual users there are at any point in time: just sum up all of the user counts for each place at that time. Call that N.

Now we need to consider what other clues might be available in the data. Individual users tend to have fairly coherent mobility trajectories (what they do today, they’re likely to do again tomorrow), and these trajectories are different across different users. See for example the trajectories of five randomly selected users over two days in the figure below. Each user’s pattern is unique, with strong similarity across both days.

The key to recovering individual trajectories is to exploit these twin characteristics of regularity for individual users, and uniqueness across users. Even so, it seems a stretch when all we’ve got is user counts by time and place!

We’re going to build up estimates for the trajectories of each of the N users, one time step at a time. Let the estimate for the trajectory of the ith individual user at time step t, s^{t}_{i}, be represented by a sequence of locations [q^{1}_{i}, q^{2}_{i}, q^{3}_{i}, ..., q^{t}_{i}]. For example, here’s a world with M=9 locations:

And here’s a trajectory for a user to time t, represented as a sequence of t locations:

We want to decide on the most likely location for each user at time step t+1, subject to the overall constraint that we know how many users in total there must be at each location at time t+1. The description of how this next part works is very terse in the paper itself, but with a little bit of reverse engineering on my part I believe I’ve reached an understanding. The secret is to formulate the problem as one of completing a decision matrix. This decision matrix takes a special form, it has N rows (one for each user trajectory to date), and N columns. Say we know that there must be two users in location 1, and four users in location 2. Then the first two columns in the matrix will represent the two ‘slots’ available in location 1, and the next four columns will represent the four slots available in location 2, and so on.

In the decision matrix X^t, x^{t}_{i,j} = 1 if next location for trajectory i is the location identified by column j, and zero otherwise. A valid completion of the matrix has every row adding up to one (each trajectory is assigned one and only one next location), and every column adding up to one (each slot is filled by one and only one user).

Thus the example decision matrix above indicates that the next location for the trajectory of user #3 has been assigned as location 5.

When we complete the decision matrix, we don’t just make random assignments of slots of course! That’s where the cost matrix C^t comes in. The cost matrix is also an NxN matrix, with the same row and column structure as the decision matrix. Instead of being filled with 1’s and 0’s though, c^{t}_{i,j} contains a value representing the cost of moving to the location for slot j given the trajectory so far for user i. Take for example the trajectory for a user in the illustration below, which currently finishes in location 6. We might use as the cost of moving to each potential next location simply the number of hops in the grid to get there (red numbers). The actual cost functions used are more complex than this; this example is just to help you get the idea.

Here then is what a subsection of the cost matrix for the user in the above sketch might look like:

We’ll return to how to define the cost functions in a moment. For now note that the problem has now become the following:

The above formulated problem is equivalent to the Linear Sum Assignment Problem, which has been extensively studied and can be solved in polynomial time with the Hungarian algorithm.

Space prevents me from explaining the Hungarian algorithm, but the Wikipedia link above does a pretty good job of it (it’s actually pretty straightforward, check out the section on the ‘Matrix interpretation’ to see how it maps in our case).
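In practice you don’t even need to implement the algorithm yourself: here’s a hedged sketch of a single recovery step using SciPy’s off-the-shelf solver for the assignment problem (my illustration, not the authors’ code).

```python
from scipy.optimize import linear_sum_assignment

def extend_trajectories(trajectories, cost_matrix, slot_locations):
    """One recovery step: assign each partial trajectory a location slot for time t+1.

    cost_matrix[i, j]: cost of sending trajectory i to the location owning slot j
    slot_locations[j]: the location that slot j belongs to, e.g. [1, 1, 2, 2, 2, 2]
                       when two users must be at location 1 and four at location 2
    """
    rows, cols = linear_sum_assignment(cost_matrix)   # Hungarian-style optimal assignment
    for i, j in zip(rows, cols):
        trajectories[i].append(slot_locations[j])
    return trajectories
```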

Thus far then, we’ve discovered a way to represent the problem such that we can recover each step of individual trajectories in polynomial time, so long as we can define suitable cost matrices.

Noooooo.

Building cost matrices – it’s night and day

At night time, people tend not to move around very much, as illustrated by these plots from two different datasets (the ‘operator’ dataset and the ‘app’ dataset) used in the evaluation:

Not only that, but the night time location of individual users tends to be one of their top most visited locations (often the top location):

For the night time then, it makes sense to use the distance between the location of the user trajectory at time t and the location being considered for time t+1 as the cost.

In the daytime, people tend to move about.

The key insight is the continuity of human mobility, which enables the estimation of next location using the current location and velocity.

Let the estimated next location using this process be l; then we can use as the cost function the distance between l and the location being considered for time t+1.
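Putting the night and day regimes together, a toy cost matrix construction might look like this (the paper’s actual cost functions are more sophisticated; the coordinates, the simple velocity extrapolation, and the night/day switch are simplifications of mine):

```python
import numpy as np

def build_cost_matrix(trajectories, slot_locations, coords, night):
    """Toy day/night cost matrix: cost[i, j] = distance from trajectory i's predicted
    next position to the location owning slot j. coords maps location id -> (x, y)."""
    N = len(trajectories)
    cost = np.zeros((N, N))
    for i, traj in enumerate(trajectories):
        current = np.array(coords[traj[-1]])
        if night or len(traj) < 2:
            predicted = current                          # night: people barely move
        else:
            velocity = current - np.array(coords[traj[-2]])
            predicted = current + velocity               # day: extrapolate from current velocity
        for j, loc in enumerate(slot_locations):
            cost[i, j] = np.linalg.norm(predicted - np.array(coords[loc]))
    return cost
```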

It is worth noting that the Hungarian algorithm is currently the most efficient algorithm to solve [the linear sum assignment problem], but still has computational complexity of O(n^3). To speed things up, we adopt a suboptimal solution to reduce the dimension of the cost matrix by taking out the pairs of trajectories and location points with cost below a predefined threshold and directly linking them together.

Linking trajectories across days

Using the night and day approaches, we can recover mobile users’ sub-trajectories for each day. Now we need to link sub-trajectories together across days. Here we exploit the regularity exhibited in the day-after-day movements of individual users.

Specifically, we use the information gain of connecting two sub-trajectories to measure their similarities.

The entropy in a trajectory is modelled based on the frequency of visiting different locations, and the information gain from linking two sub-trajectories is modelled as the difference between the entropy of the combined trajectory, and the sum of the entropies of the individual trajectories over two. For the same user, we should see relatively little information gain if the two trajectories are similar, whereas for different users we should see much larger information gain. And indeed we do:
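Here’s a small sketch of that similarity measure as I read it, using location-visit frequencies as the empirical distribution:

```python
import numpy as np
from collections import Counter

def entropy(trajectory):
    """Entropy of the empirical distribution over visited locations."""
    counts = np.array(list(Counter(trajectory).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(sub_traj_a, sub_traj_b):
    """Small gain suggests the two daily sub-trajectories belong to the same user."""
    combined = entropy(sub_traj_a + sub_traj_b)
    return combined - (entropy(sub_traj_a) + entropy(sub_traj_b)) / 2.0
```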

To conclude, we design an unsupervised attack framework that utilizes the universal characteristics of human mobility to recover individuals’ trajectories in aggregated mobility datasets. Since the proposed framework does not require any prior information of the target datasets, it can be easily applied on other aggregated mobility datasets.

Once the individual trajectories are separated out, it has been shown to be comparatively easy, with the help of small amounts of external data such as credit card records, to re-identify individual users (associating them with trajectories).

Oh no.

Evaluation

The authors evaluate the technique on two real world datasets. The ‘app’ dataset contains data for 15,000 users collected by a mobile app which records a mobile user’s location when activated, over a two-week period. The ‘operator’ dataset contains data for 100,000 mobile users from a major mobile network operator, over a one week period. Tests are run on aggregate data produced from these datasets, and the recovered trajectories are compared against ground truth.

In the figures that follow, stage #1 represents night time trajectory recovery, stage #2 day time trajectory recovery, and stage #3 the linking of sub-trajectories across days. Here we can see the recovery accuracy for the two datasets:

For the app dataset, 98% of night time trajectories are correctly recovered, falling to 91% accuracy by the final step. The corresponding figures for the operator dataset are 95% and 73%.

For the recovered trajectories, the following chart shows the percentage that can be uniquely identified given just the top-k locations for k=1 to 5.

From the results, we can observe that given the two most frequent locations of the recovered trajectories, over 95% of them can be uniquely distinguished. Therefore, the results indicate that the recovered trajectories are very unique and vulnerable to be reidentified with little external information.

To put that more plainly, given the aggregated dataset, and knowledge of your home and work locations, there’s a very good chance I can recover your full movements!!!

Oh s*@#!

Decreasing the spatial resolution (i.e., using more coarse-grained locations) actually increases the chances of successful trajectory recovery (but only to the location granularity of course). It’s harder to link these recovered trajectories to individual people though as human mobility becomes less unique in coarser-grained datasets.

Decreasing the temporal resolution (only releasing data for larger time windows) increases both the chances of successful trajectory recovery, and of re-identification.

The best defence for preserving privacy in aggregated mobility datasets should come as no surprise to you – we need to add some carefully designed random noise. What that careful design is though, we’re not told!

… a well designed perturbation scheme can reduce the regularity and uniqueness of mobile users’ trajectories, which has the potential for preserving mobile users’ privacy in aggregated mobility data.

Learning to act by predicting the future

May 12, 2017

Learning to act by predicting the future Dosovitskiy & Koltun, ICLR’17

The Direct Future Prediction (DFP) model won the ‘Full Deathmatch’ track of the Visual Doom AI Competition in 2016. The competition pits agents against each other, with their performance measured by how many ‘frags’ they get. (A frag is a kill, apparently – not really my scene!). The action takes place in environments which the agents have never seen before. DFP outperformed the second best submission by more than 50% even though its model architecture is simpler and it received less supervisory training.

How does DFP perform so well? By re-thinking the fundamental approach to learning gameplay (or ‘sensorimotor control in immersive environments’ as it’s called in the paper!).

Machine learning problems are commonly divided into three classes: supervised, unsupervised, and reinforcement learning. In this view, supervised learning is concerned with learning input-output mappings, unsupervised learning aims to find hidden structure in data, and reinforcement learning deals with goal-directed behavior.

(Very nicely expressed!).

So goal-directed problems such as learning to perform well in a deathmatch seem to most naturally fit with reinforcement learning. And indeed we’ve seen DQN and A3C used for gameplay in earlier editions of The Morning Paper, as well as the deep reinforcement learning approach of “Playing FPS Games with Deep Reinforcement Learning“.

But there’s another view on the best way to tackle the problem, that goes way back:

The supervised learning (SL) perspective on learning to act by interacting with the environment dates back decades. Jordan & Rumelhart (1992) analyze this approach, review early work, and argue that the choice of SL versus RL should be guided by the characteristics of the environment. Their analysis suggests that RL may be more efficient when the environment provides only a sparse scalar reward signal, whereas SL can be advantageous when temporally dense multidimensional feedback is available.

For “temporally dense multidimensional feedback” think “short feedback loops with lots of information.”

Dosovitskiy & Koltun observe that the combination of a sensory stream (e.g., video frames from Doom) and a measurement stream (e.g., health, ammunition levels, number of adversaries in Doom) do indeed provide for the kind of rich supervisory signal that allow the problem to be reformulated as a supervised learning problem.

Our approach departs from the reward-based formalization commonly used in RL. Instead of a monolithic state and a scalar reward, we consider a stream of sensory input {s_t} and a stream of measurements {m_t}. The sensory stream is typically high-dimensional and may include the raw visual, auditory, and tactile input. The measurement stream has lower dimensionality and constitutes a set of data that pertain to the agent’s current state.

Instead of directly training an agent to maximise a reward, the agent is trained to predict the effect of different actions on future measurements, conditioned on the present sensory input, measurements, and goal.

Assuming that the goal can be expressed in terms of future measurements, predicting these provides all the information necessary to support action. This reduces sensorimotor control to supervised learning while supporting learning from raw experience and without extraneous data.

All the agent has to do is act, and observe the effects of its actions: the measurement stream provides ‘rich and temporally dense’ supervision that can stabilise and accelerate training.

High level overview

At every time step t the agent receives an observation \mathbf{o_t} with structure \langle \mathbf{s_t, m_t} \rangle where \mathbf{s_t} is the raw sensory input and \mathbf{m_t} is a set of measurements. The measurement vector differs from the other sensory inputs in two important ways: (i) it is what the agent will try to predict, and (ii) goals are expressed in terms of it. More specifically, goals are defined in terms of differences between current and future measurements over time. Any parametric function can be used, and the experiments in this paper use goals expressed as a simple linear combination of future measurements.

A parameterized function approximator, F, is used to predict future measurements. The prediction is a function of the current observation, the considered action, and the goal. Once trained, a deployed agent can make predictions for all possible actions, and choose the action that yields the best predicted outcome in accordance with its current goal (which need not be identical to any goal seen during training).

The predictor F is trained on experiences collected by the agent itself, through interacting with the environment. The loss function is based on the sum of the squared distances between the predicted measurement values and the actually observed values.

Two training regimes were evaluated: one in which the goal vector is fixed throughout the training process, and one in which the goal vector for each training episode is generated at random.

In both regimes, the agent follows an ε-greedy policy: it acts greedily according to the current goal with probability 1 – ε, and selects a random action with probability ε. The value of ε is initially set to 1 and is decreased during training according to a fixed schedule.
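A minimal sketch of the resulting action selection (my illustration; the predictor interface here is an assumption):

```python
import numpy as np

def choose_action(predictor, sensory, measurements, goal, epsilon, n_actions):
    """DFP-style epsilon-greedy action selection (sketch).

    predictor is assumed to return an (n_actions, n_future_measurements) array of
    predicted future measurement changes for the current observation and goal.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)        # explore
    predictions = predictor(sensory, measurements, goal)
    scores = predictions @ goal                    # goal = linear weights over predictions
    return int(np.argmax(scores))                  # act greedily w.r.t. the current goal
```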

The predictor network

The predictor F is a deep network that looks like this:

There are three input modules: a perception module S, which processes the sensory input (in the experiment, S is a convolutional network processing an input image); a measurement module M, and a goal module G. The outputs of all three are concatenated.

Future measurements are predicted based on this input representation. The network emits predictions of future measurements for all actions at once… we build on the ideas of Wang et al. (2016) and split the prediction module into two streams: an expectation stream and an action stream.

The expectation stream predicts the average of the future measurements across all potential actions. The action stream considers actions individually. A normalisation layer following the action stream subtracts the average future action value from all of the predicted action values. In this way, when the two are recombined the expectation stream is forced to compensate by predicting the average.

The final stage of the network sums the expectation stream and the action stream predictions, to give a prediction of future measurements for each action.
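In code, the split and recombination might look roughly like this (a PyTorch-style sketch; the single linear layers stand in for the paper’s fuller prediction streams):

```python
import torch.nn as nn

class PredictionModule(nn.Module):
    """Expectation stream plus normalised action stream (sketch of the split described above)."""

    def __init__(self, in_dim, n_actions, n_targets):
        super().__init__()
        self.expectation = nn.Linear(in_dim, n_targets)         # average over all actions
        self.action = nn.Linear(in_dim, n_actions * n_targets)  # per-action predictions
        self.n_actions, self.n_targets = n_actions, n_targets

    def forward(self, j):   # j: concatenated perception, measurement and goal features
        e = self.expectation(j).unsqueeze(1)                          # (batch, 1, targets)
        a = self.action(j).view(-1, self.n_actions, self.n_targets)  # (batch, actions, targets)
        a = a - a.mean(dim=1, keepdim=True)   # normalisation: zero mean across actions
        return e + a                          # predicted future measurements for every action
```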

Let’s play Doom!

The ViZDoom platform is used for evaluation.

We compare the presented approach to state-of-the-art deep RL methods in four scenarios of increasing difficulty…

  1. Basic: gathering health kits in a square room
  2. Navigation: gathering health kits and avoiding poison vials in a maze
  3. Battle: defending against adversaries while gathering health and ammunition in a maze
  4. Battle 2: as above, but with a more complex maze.

The agent can move forward or backward, turn left or right, strafe left or right, run, and shoot. Any combination of these eight actions can be used, resulting in 256 possible actions. The agent is given three measurements: health, ammunition, and frag count.

DFP (this paper) is compared against DQN (baseline for performance on Atari games), A3C (regarded as state-of-the-art in this area), and DSR (described in a recent technical report and also evaluated in ViZDoom). Here we can see a comparison of the performance levels reached during training:

And here’s how well they do after training:

Far more interesting than staring at the table though, is to watch them play via this video:

We now evaluate the ability of the presented approach to learn without a fixed goal at training time, and adapt to varying goals at test time. These experiments are performed in the Battle scenario. We use three training regimes: (a) fixed goal vector during training, (b) random goal vector with each value sampled uniformly from [0, 1] for every episode, and (c) random goal vector with each value sampled uniformly from [−1, 1] for every episode.

Here we see the results:

Models trained without knowing the goal in advance (b,c) perform nearly as well as a dedicated model trained for the eventual goal (a). All models generalise to new goals, but those trained with a variety of goals (b,c) generalise much better.

An ablation study showed that:

… predicting multiple measurements significantly improves the performance of the learned model, even when it is evaluated by only one of those measurements. Predicting measurements at multiple futures is also beneficial. This supports the intuition that a dense flow of multivariate measurements is a better training signal than a scalar reward.

Where next?

Our experiments have demonstrated that this simple approach outperforms sophisticated deep reinforcement learning formulations on challenging tasks in immersive environments… The presented work can be extended in multiple ways that are important for broadening the range of behaviors that can be learned.

  1. The present model is purely reactive; integrating a memory component ‘may yield substantial advances.’
  2. Temporal abstraction and hierarchical organisation of learned skills will likely be necessary for significant progress in behavioural sophistication.
  3. The ideas could be applied to continuous actions (not just the discrete action spaces used in this work).
  4. Predicting features learned directly from rich sensory input could blur the distinction between sensory and measurement streams.

Understanding deep learning requires re-thinking generalization

May 11, 2017

Understanding deep learning requires re-thinking generalization Zhang et al., ICLR’17

This paper has a wonderful combination of properties: the results are easy to understand, somewhat surprising, and then leave you pondering over what it all might mean for a long while afterwards!

The question the authors set out to answer was this:

What is it that distinguishes neural networks that generalize well from those that don’t? A satisfying answer to this question would not only help to make neural networks more interpretable, but it might also lead to more principled and reliable model architecture design.

By “generalize well,” the authors simply mean “what causes a network that performs well on training data to also perform well on the (held out) test data?” (As opposed to transfer learning, which involves applying the trained network to a related but different problem). If you think about that for a moment, the question pretty much boils down to “why do neural networks work as well as they do?” Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions. So it would be somewhat troubling, would it not, if the answer to the question “why do neural networks work (generalize) as well as they do?” turned out to be “we don’t really know!”

The curious case of the random labels

Our story begins in a familiar place – the CIFAR 10 (50,000 training images split across 10 classes, 10,000 validation images) and the ILSVRC (ImageNet) 2012 (1,281,167 training, 50,000 validation images, 1000 classes) datasets and variations of the Inception network architecture.

Train the networks using the training data, and you won’t be surprised to hear that they can reach zero errors on the training set. This is highly indicative of overfitting – memorising training examples rather than learning true predictive features. We can use techniques such as regularisation to combat overfitting, leading to networks that generalise better. More on that later.

Take the same training data, but this time randomly jumble the labels (i.e., such that there is no longer any genuine correspondence between the label and what’s in the image). Train the networks using these random labels and what do you get? Zero training error!
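The randomisation itself is trivial; the surprise is entirely in what happens when you train on it. A minimal sketch with stand-in data (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a real dataset (e.g. the 50,000 CIFAR-10 training images).
train_images = rng.random((50000, 32, 32, 3), dtype=np.float32)
train_labels = rng.integers(0, 10, size=50000)

# Replace every label with one drawn uniformly at random from the 10 classes,
# destroying any correspondence between image content and label.
random_labels = rng.integers(0, 10, size=len(train_labels))

# Training then proceeds exactly as before, just on (train_images, random_labels);
# with enough capacity the network still reaches zero training error.
```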

In [this] case, there is no longer any relationship between the instances and the class labels. As a result, learning is impossible. Intuition suggests that this impossibility should manifest itself clearly during training, e.g., by training not converging or slowing down substantially. To our surprise, several properties of the training process for multiple standard architectures are largely unaffected by this transformation of the labels.

As the authors succinctly put it, “Deep neural networks easily fit random labels.” Here are three key observations from this first experiment:

  1. The effective capacity of neural networks is sufficient for memorising the entire data set.
  2. Even optimisation on random labels remains easy. In fact, training time increases by only a small constant factor compared with training on the true labels.
  3. Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

If you take the network trained on random labels, and then see how well it performs on the test data, it of course doesn’t do very well at all because it hasn’t truly learned anything about the dataset. A fancy way of saying this is that it has a high generalisation error. Put all this together and you realise that:

… by randomizing labels alone we can force the generalization error of a model to jump up considerably without changing the model, its size, hyperparameters, or the optimizer. We establish this fact for several different standard architectures trained on the CIFAR 10 and ImageNet classification benchmarks. (Emphasis mine).

Or in other words: the model, its size, hyperparameters, and the optimiser cannot explain the generalisation performance of state-of-the-art neural networks. This must be the case because the generalisation performance can vary significantly while they all remain unchanged.

The even more curious case of the random images

What happens if we don’t just mess with the labels, but also mess with the images themselves? In fact, what if we just replace the true images with random noise? In the figures this is labeled as the ‘Gaussian’ experiment because a Gaussian distribution with matching mean and variance to the original image dataset is used to generate random pixels for each image.

It turns out that the networks still train to zero training error, and they get there even faster than in the random labels case! A hypothesis for why this happens is that random-pixel images are better separated from one another than natural images that originally all belonged to the same class but must now be learned as different classes because of the label swaps.

The team experiment with a spectrum of changes introducing different degrees and kinds of randomisation into the dataset:

  • true labels (original dataset without modification)
  • partially corrupted labels (mess with some of the labels)
  • random labels (mess with all of the labels)
  • shuffled pixels (choose a pixel permutation, and then apply it uniformly to all images)
  • random pixels (apply a different random permutation to each image independently)
  • Gaussian (just make stuff up for each image, as described previously)

All the way along the spectrum, the networks are still able to perfectly fit the training data.
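For concreteness, here is a sketch of how those variants could be generated from an image array X of shape (n, H, W, C) and labels y; these transforms are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y, p, num_classes=10):
    """Replace a fraction p of the labels with uniformly random ones."""
    y = y.copy()
    mask = rng.random(len(y)) < p
    y[mask] = rng.integers(0, num_classes, size=mask.sum())
    return y                       # p=1 gives the fully random-label case

def shuffled_pixels(X):
    """One fixed pixel permutation applied to every image."""
    n, h, w, c = X.shape
    perm = rng.permutation(h * w)
    return X.reshape(n, h * w, c)[:, perm, :].reshape(n, h, w, c)

def random_pixels(X):
    """An independent pixel permutation for each image."""
    n, h, w, c = X.shape
    flat = X.reshape(n, h * w, c)
    return np.stack([img[rng.permutation(h * w)] for img in flat]).reshape(X.shape)

def gaussian_images(X):
    """Pure noise with the same mean and variance as the original data."""
    return rng.normal(X.mean(), X.std(), size=X.shape)
```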

We furthermore vary the amount of randomization, interpolating smoothly between the case of no noise and complete noise. This leads to a range of intermediate learning problems where there remains some level of signal in the labels. We observe a steady deterioration of the generalization error as we increase the noise level. This shows that neural networks are able to capture the remaining signal in the data, while at the same time fit the noisy part using brute-force.

For me that last sentence is key. Certain choices we make in model architecture clearly do make a difference in the ability of a model to generalise (otherwise all architectures would generalise the same). The best generalising network in the world is still going to have to fall back on memorisation when there is no other true signal in the data, though. So maybe we need a way to tease apart the true potential for generalisation that exists in the dataset, and how efficient a given model architecture is at capturing this latent potential. A simple way of doing that is to train different architectures on the same dataset! (Which we do all the time of course). That still doesn’t help us with the original quest though – understanding why some models generalise better than others.

Regularization to the rescue?

The model architecture itself is clearly not a sufficient regulariser (can’t prevent overfitting / memorising). But what about commonly used regularisation techniques?

We show that explicit forms of regularization, such as weight decay, dropout, and data augmentation, do not adequately explain the generalization error of neural networks: Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

Explicit regularisation seems to be more of a tuning parameter that helps improve generalisation, but its absence does not necessarily imply poor generalisation error. It is certainly not the case, though, that all models that fit the training data generalise well. An interesting piece of analysis in the paper shows that we pick up a certain amount of regularisation just through the process of using gradient descent:

We analyze how SGD acts as an implicit regularizer. For linear models, SGD always converges to a solution with small norm. Hence, the algorithm itself is implicitly regularizing the solution… Though this doesn’t explain why certain architectures generalize better than other architectures, it does suggest that more investigation is needed to understand exactly what the properties are that are inherited by models trained using SGD.
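The linear-model claim is easy to check numerically: run plain SGD from a zero initialisation on an underdetermined least-squares problem, and it lands on the same minimum-norm interpolating solution that the pseudo-inverse gives. A small sketch (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                    # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Plain SGD on squared loss, starting from the all-zeros weight vector.
w = np.zeros(d)
lr = 0.01
for step in range(20000):
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]
    w -= lr * grad

# Minimum-norm interpolating solution.
w_min_norm = np.linalg.pinv(X) @ y

print(np.abs(X @ w - y).max())            # ~0: both solutions fit the data
print(np.linalg.norm(w - w_min_norm))     # ~0: SGD found the min-norm solution
```

Because every SGD update is a multiple of some row of X, the iterate never leaves the row space of X, which is exactly why the interpolating solution it reaches is the minimum-norm one.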

The effective capacity of machine learning models

Consider neural networks working with a finite sample of size n. As soon as the number of parameters p exceeds n, even a simple two-layer neural network can represent any function of the input sample. The authors prove (in an appendix) the following theorem:

There exists a two-layer neural network with ReLU activations and 2n + d weights that can represent any function on a sample of size n in d dimensions.

Even depth-2 networks of linear size can already represent any labeling of the training data!
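The construction behind the theorem is concrete enough to write out: project the sample to one dimension with a vector a (d weights), place one ReLU breakpoint b_j just below each sorted projection (n weights), and solve the resulting lower-triangular system for the output weights w (n weights), giving 2n + d in total. A sketch of that construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
Z = rng.normal(size=(n, d))       # n distinct sample points in d dimensions
y = rng.normal(size=n)            # arbitrary real-valued targets

# 1. Project to one dimension; with probability 1 all projections are distinct.
a = rng.normal(size=d)            # d weights
x = Z @ a
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

# 2. One ReLU threshold strictly below each sorted projection.
b = np.empty(n)                   # n weights
b[0] = x_sorted[0] - 1.0
b[1:] = (x_sorted[:-1] + x_sorted[1:]) / 2

# 3. A[i, j] = relu(x_i - b_j) is lower triangular with a positive diagonal,
#    so the output weights w solving A w = y always exist.
A = np.maximum(x_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y_sorted)  # n weights -> 2n + d weights in total

# The two-layer ReLU network fits every training point exactly.
preds = np.maximum((Z @ a)[:, None] - b[None, :], 0.0) @ w
print(np.abs(preds - y).max())    # ~0, up to numerical precision
```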

So where does this all leave us?

This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. We argue that we have yet to discover a precise formal measure under which these enormous models are simple. Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons for why optimization is empirically easy must be different from the true cause of generalization.