Skip to content

Spectre attacks: exploiting speculative execution

January 16, 2018

Spectre attacks: exploiting speculative execution Kocher et al., 2018

Yesterday we looked at Meltdown and some of the background on how modern CPUs speculatively execute instructions. Today it’s the turn of Spectre of course, which shares some of the same foundations but is a different attack, not mitigated by KAISER. On a technical front, Spectre is as fascinating as it is terrifying, introducing a whole new twist on ROP.

This paper describes practical attacks that combine methodology from side-channel attacks, fault attacks, and return-oriented programming that can read arbitrary memory from the victim’s process… These attacks represent a serious threat to actual systems, since vulnerable speculative execution capabilities are found in microprocessors from Intel, AMD, and ARM that are used in billions of devices.

Spectre is a step-up from the already very bad Meltdown:

  • it applies to AMD and ARM-based processors (so works on mobile phones, tablets, and so on)
  • it only assumes that speculatively executed instructions can read from memory the victim process could access normally (i.e., no page fault or exception is triggered)
  • the attack can be mounted in just a few lines of JavaScript – and a few lines of JavaScript can easily be delivered to your system from, for example, an ad displayed on a site that you would otherwise trust.

… in order to mount a Spectre attack, an attacker starts by locating a sequence of instructions within the process address space which when executed acts as a covert channel transmitter which leaks the victim’s memory or register contents. The attacker then tricks the CPU into speculatively and erroneously executing this instruction sequence…

Let’s play tennis

You’re watching an epic clay-court tennis match with long rallies from the baseline. Every time the ball is received deep in the forehand side of the court, our would-be-champion plays a strong cross-court forehand. Her opponent plays the ball deep into the forehand side of the court once more, and almost reflexively starts moving across the court to get into position for the expected return. But this time her speculative execution is in vain — the ball is played down the line at pace leaving her wrong-footed. By forcing a branch misprediction the now champion tennis player has won the point, game, set and match!

Spectre exploits speculative execution of instructions following a branch. Inside the CPU, a Branch Target Buffer (BTB) keeps a mapping from addresses of recently executed branch instructions to destination addresses. The BTB is used to predict future code addresses (“it’s going to be a cross-court forehand”) even before decoding the branch instructions. Speculative execution of the predicted branch improves performance.

Attack foundations

Consider this simple code fragment:

	if (x < array1_size)
	  y = array2[array1[x] * 256];

On line 1 we test to see whether x is in bounds for array1, and if it is, we use the value of x as in index into array1 on line 2. We can execute this code fragment lots of times, always providing a value of x that is in bounds (the cross-court forehand). The BTB learns to predict that the condition will evaluate to true, causing speculative execution of line 2. Then, wham, we play the forehand down-the-line, passing a value of x which is out of bounds. Speculative execution of line 2 goes ahead anyway, before we figure out that actually the condition was false this time.

Suppose we want to know the value of sensitive memory at address $a$. We can set x = a - (base address of array1) and then the lookup of array1[x] on line 2 will resolve to the secret byte at $a$. If we ensure that array1_size and array_2 are not present in the processor’s cache (e.g., via a clflush) before executing this code, then we can leak the sensitive value.

When the code runs…

  1. The processor compares the malicious value of x against array1_size. This results in a cache miss.
  2. While waiting for array1_size to be fetched, speculative execution of line 2 occurs, reading the data, $k$, at the target address.
  3. We then compute k * 256 and use this as an index into array2. That’s not in the cache either, so off we go to fetch it.
  4. Meanwhile, the value of array1_size arrives from DRAM, the processor realises that the speculative execution was erroneous in this case, and rewinds its register state. But crucially, the access of array2 affects the cache as we saw yesterday with Meltdown.
  5. The attacker now recovers the secret byte $k$, since accesses to array2[n*256] will be fast for the case when $n = k$, and slow otherwise.

You can even do this from JavaScript, using code like this:


Note that there’s no access to the clflush instruction from JavaScript, but reading a series of addresses at 4096-byte intervals out of a large array does the same job. There’s also a small challenge getting the timer accuracy needed to detect the fast access in array2. The Web Workers feature of HTML5 provides the answer – creating a separate thread that repeatedly decrements a value in a shared memory location yields a timer that provides sufficient resolution.

For a C code example, see appendix A in the paper. Unoptimised, this code can read about 10KB/second an an i7 Surface Pro 3.

The art of misdirection

With indirect branches we can do something even more special. An indirect branch is one that jumps to an address contained in a register, memory location, or on the stack.

If the determination of the destination address is delayed due to a cache miss and the branch predictor has been mistrained with malicious destinations, speculative execution may continue at a location chosen by the adversary.

This means we can make the processor speculatively execute instructions from any memory location of our choosing. So all we have to do, as in the gadgets of Return-Oriented Programming (ROP), is find some usable gadgets is the victim binary.

… exploitation is similar to return-oriented programming, except that correctly-written software is vulnerable, gadgets are limited in their duration, but need not terminate cleanly (since the CPU will eventually recognize the speculative error), and gadgets must exfiltrate data via side channels rather than explicitly.

In other words, this is an even lower barrier than regular ROP gadget chaining. Code executing in one hyper-thread of x86 processors can mistrain the branch predictor for code running on the same CPU in a different hyper-thread. Furthermore, the branch predictor appears to use only the low bits of the virtual address: “as a result, an adversary does not need to be able to even execute code at any of the memory addresses containing the victim’s branch instruction.

A Windows example

A proof-of-concept program was written that generates a random key the goes into a infinite loop calling Sleep(0), loading the first bytes of a file, calling Windows crypto functions to compute the SHA-1 hash of the key and file header, and then prints out the hash whenever the header changes. When compiled with optimisation flags, the call to Sleep(0) will be made with file data in registers ebx and edi (normal behaviour, nothing special was done to cause this).

In ntdll.dll we find the byte sequence 13 BC 13 BD 13 BE 13 12 17, which when executed corresponds to:

    adc  edi, dword ptr [ebx+edx+13BE13BDh]
    adc  dl, byte ptr [edi]

If we control ebx and edi (which we do, via the file header), then we can control which address will be read by this code fragment.

Now the first instruction of the sleep function is jmp dword ptr ds:[76AE0078h], which we can target for branch mistraining. (The actual destination changes per reboot due to ASLR).

  • Simple pointer operations were used to locate the indirect jump at the entry point for sleep, and the memory location holding the destination for the jump
  • The memory page containing the destination for the jump was made writable using copy-on-write, and modified to change the jump destination to the gadget address. Use the same method, a ret 4 instruction was written at the location of the gadget. (These changes are only visible to the attacker, not the victim).
  • A set of threads are launched to mistrain the branch predictor. This is a bit fiddly (see the details in section 5.2), but once an effective mimic jump sequence is found the attacker is able to read through the victim’s address space.

Variations and mitigations

Spectre is not really so much an individual attack, as a whole new class of attacks: speculative execution may affect the state of other microarchitectural components, and virtually any observable effect of speculative execution can lead to leaks of sensitive information. Timing effects from memory bus contention, DRAM row address selection status, availability of virtual registers, ALU activity, and the state of the branch predictor itself need to be considered.

The conditional branch vulnerability can be mitigated if speculative execution can be halted on potentially-sensitive execution paths… Indirect branch poisoning is even more challenging to mitigate in software. It might be possible to disable hyperthreading and flush branch prediction state during context switches, although there does not appear to be any architecturally-defined method for doing this.

Anything we can do in software or microcode though, should at best be seen as stop-gap countermeasures pending further research.

The last word

The vulnerabilities in this paper, as well as many others, arise from a longstanding focus in the technology industry on maximizing performance. As a result, processors, compilers, device drivers, operating systems, and numerous other critical components have evolved compounding layers of complex optimizations that introduce security risks. As the costs of insecurity rise, these design choices need to be revisited, and in many case alternate implementations optimized for security will be required.


January 15, 2018

Meltdown Lipp et al., 2018

I’m writing this approximately one week ahead of when you get to read it, so it’s entirely possible by this time that you’ve already heard more than you can stand about Meltdown and Spectre! Behind the news headlines though, there’s a lot of good information in the accompanying papers, and I certainly wanted to read them, so that’s what we’re going to be starting out with this week.

Meltdown breaks all security assumptions given by the CPU’s memory isolation capabilities. We evaluated the attack on modern desktop machines and laptops, as well as servers in the cloud. Meltdown allows an unprivileged process to read data mapped in the kernel address space… this may include physical memory of other processes, the kernel, and in case of kernel-sharing sandbox solutions (e.g., Docker, LXC) or Xen in paravirtualization mode, memory of the kernel (or hypervisor) and other co-located instances.

Depending on the capabilities of the target machine, it’s possible to dump memory at up to 503 KB/s. There’s an interesting footnote on the first page of the paper that says “This attack was independently found by the authors of this paper and Jann Horn from Google Project Zero.” Two independent discoveries of the same vulnerability, which it turns out has existed on Intel microarchitectures since 2010, certainly makes you wonder how many other ‘discoveries’ of this there might have been in the intervening period.

You’re bang out-of-order

(That might be just a British expression, if it doesn’t make sense to you, just ignore it!).

The roots of the problem go way back to 1967, when Tomasulo first developed an algorithm to allow out-of-order execution of instructions. Inside a CPU there are multiple execution units. To keep things moving at pace, while the CPU is busy executing the current instruction in one execution unit, other units can run ahead, speculatively executing the instructions that follow. The particular case of out-of-order execution this paper concerns itself with is an instruction sequence following a branch.

If we look inside a Skylake core, we’d find something like this:

x86 instructions are fetched from memory by the front-end, and decoded to micro-operations which are continuously sent to the execution engine. The reorder buffer is responsible for swizzling the registers so that instructions operating on the same physical registers use the last logical value. Execution units are connected via a common data bus, and if a required operand is not immediately available then it’s possible to listen on the bus and directly begin instruction execution the moment it appears.

Since CPUs usually do not run linear instruction streams, they have branch prediction units that are used to obtain an educated guess of which instruction will be executed next. Branch predictors try to determine which direction of a branch will be taken before its condition is actually evaluated. Instructions that lie on that path and do not have any dependencies can be executed in advance and their results immediately used if the prediction was correct.

If the guess was wrong, the results can just be rolled back and it’s as if they never happened. Or at least that’s the plan! But as we all now know, unfortunately some traces do get left behind…

Although the instructions executed out of order do not have any visible architectural effect on registers or memory, they have microarchitectural side effects. During the out-of-order execution, the referenced memory is fetched into a register and is also stored in the cache. If the out-of-order execution has to be discarded, the register and memory contents are never committed. Nevertheless, the cached memory contents are kept in the cache.

Consider the following simple example:

// the line below is never reached
access(probe_array[data * 4096]);

Although we know the code on line 3 will never be reached, it still might be speculatively executed. We’d like to know what the value of data was during that execution. Since we multiply data by 4096, every access to the probe array is one memory page (4kB) apart: therefore each unique value of data maps to a different unique page. If we ensure that the probe array is not in cache (is flushed) before the execution, then after the speculative execution, the only page of the array that will (now) be in the cache is the page indexed by the data value.

Now all we have to do is iterate over all of the 256 possible values of the data byte and observe how many cycles it takes to read each page of the probe array. This combination is known as a ‘Flush+Reload’ cache side channel.

Figure 4 shows the result of a Flush+Reload measurement iterating over all pages, after executing the out-of-order snipped with data = 84. Although the array access should not have happened due to the exception, we can clearly see that the index which would have been accessed is cached. Iterating over all pages (e.g. in the exception handler) shows only a cache hit for page 84.

Let us call instructions executed out-of-order which leave behind measurable side effects transient instructions. Such instructions introduce an exploitable side channel if their operation depends on a secret value. We’d like to, for example, dump the memory of the kernel address space (which will include all of the physical memory on Linux and OS X, and pretty much all of it on Windows). The kernel memory is protected of course, and if we try to access it that will trigger an exception. But not to worry – if the exception occurs after the execution of the transient instruction sequence the damage is already done, and we can just handle it. On more modern processors we can wrap everything up in hardware transactions using Intel TSX and suppress the exception altogether (which happens automatically as part of transaction rollback). One trivial exception handling technique is to fork the attacking application before accessing the invalid memory location, and only access that location in the child process. The parent process can then recover the secret.

I’m having a Meltdown

The full Meltdown attack consists of three steps:

  1. The content of an attacker-chosen memory location, which is inaccessible to the attacker, is loaded into a register
  2. A transient instruction accesses a cache line based on the secret content of the register
  3. The attacker uses Flush+Reload to determine the accessed cache line and hence the secret stored at the memory location.

The core implementation of the transient instruction sequence and sending part of the covert channel looks like this:

Line 4 loads a byte value located at a target kernel address. While this is going on, the subsequent instructions on lines 5-7 are likely to be speculatively executed. Line 5 multiplies the secret value by the page size (4 KB, or 2^12 — 0xc), and line 7 contains an indirect memory access based on the result. This will affect the cache state based on the secret value. Iterating over all 256 pages of the probe array and measuring the access time reveals the secret as before.

It just remains to explain what that jz retry instruction is doing on line 6. There’s a race condition between the speculative execution and cache probing, and the triggering of an exception caused by trying to read from an inaccessible kernel address on line 5. When the exception is triggered, the register where the data should be stored gets zero’d out. But if the zeroing out of the register happen before the execution of line 5, we’d read a false zero as the value of the secret.

To prevent the transient instruction sequence from continuing with a wrong value, i.e., ‘0’, Meltdown retries reading the address until it encounters a value different from ‘0’ (line 6). As the transient instruction sequence terminates after the exception is raised, there is no cache access if the secret value (truly is) 0. Thus, Meltdown assumes that the secret value is indeed ‘0’ if there is no cache hit at all. The loop is terminated by either the read value not being ‘0’ or by the raised exception of the invalid memory access.

Meltdown was successfully reproduced on the following systems:

Using the exception handling strategy, it was possible to leak kernel memory at a rate of 122 KB/s. Using Intel TSX to suppress exceptions that goes up to 502 KB/s.

Meltdown was also demonstrated within containers (Docker, LXC, OpenVZ). This gives access to information both from the underlying kernel and also all other containers running on the same physical host.

… the isolation of containers sharing a kernel can be fully broken using Meltdown.


Kernal address space layout randomisation (KASLR) makes it harder to find the kernel memory, but not by much it turns out. An attacker can simply search through the address space, or de-randomise the direct-physical map: “assuming that the target system has at least 8GB of physical memory, the attacker can test defences in steps of 8GB, resulting in a maximum of 128 memory locations to test.

The KAISER patch by Gruss et al. (or its equivalents) seems to be what is going into the OS patches (mitigations really – this isn’t a software vulnerability remember). It provides stronger isolation between kernel and user space by mapping only the bare minimum of kernel memory in user space (and still with the protection bit set of course).

Consequently, Meltdown cannot leak any kernel or physical memory except for the few memory locations which have to be mapped in user space. We verified that KAISER indeed prevents Meltdown, and there is no leakage of any kernel or physical memory.

That sounds alright then? Apply the patches, pay the performance penalty, forget about Meltdown and move onto worrying about Spectre? Not quite… it’s important to keep reading into section 7.2 of the paper where we find this:

Although KAISER provides basic protection against Meltdown, it still has some limitations. Due to the design of the x86 architecture, several privileged memory locations are required to be mapped in user space. This leaves a residual attack surface for Meltdown… Even though these memory locations do not contain any secrets, such as credentials, they might still contain pointers. Leaking one pointer can be enough to again break KASLR, as the randomization can be calculated from the pointer value. Still, KAISER is the best short-term solution currently available and therefore be deployed on all systems immediately.

The authors were unable to reproduce Meltdown on ARM and AMD CPUs, though that’s not quite the same thing as saying that variations of the attack will not be possible.

The world will never be the same again…

Meltdown fundamentally changes our perspective on the security of hardware optimizations that manipulate the state of microarchitectural elements. The fact that hardware optimizations can change the state of microarchitectural elements, and thereby imperil secure software implementations, is known since more than 20 years. Both industry and the scientific community so far accepted this as a necessary evil for efficient computing… Meltdown changes the situation entirely… it is nothing any (cryptographic) algorithm can protect itself against. KAISER is a short-term software fix, but the problem we uncovered is much more significant.

One of the sobering things here is that improved performance is an observable effect which creates a side-channel to leak information. (That’s not a new observation, witness all the variations on the Flush+Reload cache attacks for example). But what would it mean to eliminate that side-channel? The whole point of performance improvements is that they’re observable! If you don’t see improved performance, it’s not a performance improvement. Moreover, if we think about adding noise we can only slow down discovery but not eliminate it — it’s possible to add noise that makes the occasional cached data be returned more slowly, but not the other way round. Making everything always go as slowly as the slowest possible route doesn’t seem all that appealing either!

And then there’s this:

Even if Meltdown is fixed, Spectre will remain an issue. Spectre and Meltdown need different defenses. Specifically mitigating only one of them will leave the security of the entire system at risk.

You locked the front door, but you left the backdoor wide open.

One model to learn them all

January 12, 2018

One model to learn them all Kaiser et al., arXiv 2017

You almost certainly have an abstract conception of a banana in your head.

Suppose you ask me if I’d like anything to eat. I can say the word ‘banana’ (such that you hear it spoken), send you a text message whereby you see (and read) the word ‘banana,’ show you a picture of a banana, and so on. All of these different modalities (the sound waves, the written word, the visual image) tie back to the same concept – they are different ways of ‘inputting’ the banana concept. Your conception of bananas is independent of the way the thought popped into your head. Likewise, as an ‘output’ I could ask you to say the word banana, write the word banana, draw a picture of a banana, and so on. We are able to reason about such concepts independently of the input and output modalities. And we seem able to reuse our conceptual knowledge of bananas in many different contexts (i.e., across many different tasks).

Deep neural networks are typically designed and tuned for the problem at hand. Generalisation helps such a network to do well on new instances of the same problem not seen before, and transfer learning sometimes gives us a leg up by reusing e.g., learned feature representations from within the same domain. There do exist multi-task models, “but all these models are trained on other tasks from the same domain: translation tasks are trained with other translation tasks, vision tasks with other vision tasks, speech tasks with other speech tasks.” It’s as if we had one concept for the written word ‘banana’, another concept for pictures of bananas, and another concept for the spoken word ‘banana’ – but these weren’t linked in any way. The central question in today’s paper choice is this:

Can we create a unified deep learning model to solve tasks across multiple domains?

What would we need in order to be able to do that? We’d need to be able to support different input and output modalities (as required by the task in hand), we’d need a common representation of the learned knowledge that was shared across all of these modalities, and we’d need sufficient ‘apparatus’ such that tasks which need a particular capability (e.g. attention) are able to exploit it. ‘One model to rule them all’ introduces a MultiModel architecture with exactly these features, and it performs impressively well.

A single instance of the MultiModel architecture is trained simultaneously on eight different different tasks based on the following datasets:

  1. WSJ speech corpus
  2. ImageNet
  3. COCO image captioning dataset
  4. WJS parsing dataset
  5. WMT English-German translation corpus
  6. The reverse of the above, German-English
  7. WMT English-French translation corpus
  8. The reverse of the above, French-English (the paper says ‘German-French’ here, but that’s not the reverse, and looks to be a cut-and-paste error?)

Here are some examples of the single trained model performing a variety of different tasks:

… it is clear that it can caption images, categorize them, translate to French and German and construct parse trees.

It may not achieve state-of-the-art results on all of these tasks, but it does beat many recently studied task-specific models.

MultiModel under the hood

At a high level, the MultiModel architecture looks like this:

There are small, modality-specific sub-networks that convert into a unified representation and back from it.

We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation. We design modality nets to be computationally minimal, promoting heavy feature extraction and ensuring that the majority of computation is performed within the domain-agnostic body of the model.

Different tasks from the some domain (e.g., different speech tasks) share the same modality nets. We do not have one modality net per task, merely one modality net per modality. Another important design decision was to allow the unified representation to be variable in size (as opposed to a fixed-size representation which ended up creating a bottleneck and limiting performance).

The outputs of the modality nets become the inputs to a shared encoder which creates the unified representation. An I/O mixer combines the encoded inputs with the previous outputs (the MultiModel is autoregressive, i.e., it uses past output values to help predict the next output), and a decoder processes the inputs and the mixture to generate new outputs.

To allow the decoder to produce outputs for different tasks even with the same modality, we always start decoding with a command-token, such as ‘To-English’ or ‘To-Parse-Tree.’ We learn an embedding vector corresponding to each of the tokens during training.

As we observed previously, to ensure good performance across a variety of tasks, the MultiModel needs the right apparatus at its disposal. To this end, the MultiModel incorporates building blocks from multiple domains including separable convolutions (first introduced in the context of image problems), an attention mechanism, and sparsely-gated mixture-of-experts layers (first introduced for language processing).

We find that each of these mechanisms is indeed crucial for the domain it was introduced, e.g., attention is far more important for language-related tasks than for image-related ones. But, interestingly, adding these computational blocks never hurts performance, even on tasks they were not designed for. In fact, we find that both attention and mixture-of-experts layers slightly improve performance of MultiModel on ImageNet, the task that needs them least.

Putting all these pieces together we end up with an architecture that looks like this:


The encoder, mixer and decoder are structurally similar to previous fully convolutional sequence models, but use different computational blocks. The encoder has 6 repeated convolutional blocks with a mixture-of-experts layer in the middle. The mixer has an attention block and 4 convolutional blocks. The decoder has 4 blocks of convolution and attention, with a mixture-of-experts layer in the middle.

MultiModel in action

After being simultaneously trained on the eight tasks, the authors set out to determine:

  1. How close the MultiModel gets to state-of-the-art results in each task
  2. How training on 8 tasks simultaneously compares to training on each task separately, and
  3. How the different computational blocks influence different tasks.

The results achieved by MultiModel are similar to the ones that task-specific models get without heavy tuning (‘E.g., on English-French translation we improve on the Extended Neural GPU results reported last year’). Since there wasn’t much tuning done on the MultiModel, it is reasonable to expect the gap to close further.

The jointly trained model turns out to perform similarly to individually trained models on tasks where large amounts of data are available. But most interestingly, it performs better, sometimes significantly, on tasks where less data is available, such as parsing.

Further investigation reveals that…

…it seems there are computational primitives shared between different tasks that allow for some transfer learning even between such seemingly unrelated tasks as ImageNet and parsing.

This ability to learn from domains with large amounts of data available and give a boost in performance in domains where less data is available feels like it has a lot of potential.

Regarding the third question, by including or excluding different block types it is possible to understand their effect. Both attention and mixture-of-experts mechanisms were designed with machine translation in mind, and in theory ImageNet is the problem that should benefit the least from these blocks. But the results show that even on the ImageNet task, the presence of such blocks does not detract from performance, and may even slightly improve it.

This leads us to conclude that mixing different computation blocks is in fact a good way to improve performance on many various tasks.

The last word

We demonstrate, for the first time, that a single deep learning model can jointly learn a number of large-scale tasks from multiple domains. The key to success comes from designing a multi-modal architecture in which as many parameters as possible are shared and from using computational blocks from different domains together. We believe that this treads a path towards interesting future work on more general deep learning architectures, especially since our model shows transfer learning from tasks with a large amount of available data to ones where data is limited.

Emergent complexity via multi-agent competition

January 11, 2018

Emergent complexity via multi-agent competition Bansal et al., Open AI TR, 2017

(See also this Open AI blog post on ‘Competitive self-play’).

Today’s action takes place in 3D worlds with simulated physics (using the MuJoCo framework). There are two types of agents, ants:

And humanoids:

These learn to play against each other (ant vs ant, and humanoid vs humanoid) in four different games:

  1. Run to goal: agents start of facing each other in a 3D world, with goals each agent has a goal location on the opposite side of the world. The agent that reaches its goal first wins.
  2. You shall not pass: the same setup as run to goal, but now one agent (the blocker) is set the goal of blocking the other agent from reaching it’s goal. If the blocker is successful in this task and is still standing at the end of an episode then it wins. The opponent of course wins by reaching it’s goal.
  3. Sumo: The agents compete on a round arena, with the goal of either knocking the other agent to the ground or pushing it out of the arena.
  4. Kick and defend: a penalty shootout-like scenario in which one agent must kick a ball through a goal while the other agent defends.

The agents are trained using reinforcement learning with policy gradients using proximal policy optimisation. We’ll get into some of the details shortly. As with Alpha[Go] Zero that we looked at yesterday, the agents learn through self-play: training consists of playing lots of one-on-one contests between agents. Just as AlphaZero is taught nothing about chess openings, strategies, or tactics, so the agents in the 3D world are taught nothing of tactics, they simply know the reward they obtain at the end of play for meeting their objective (or not).

… a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself.

The cool part of this paper is the behaviours that the agents discover all by themselves. Seeing is better than reading here, so if you have the time, take 2 minutes to watch this highlights reel:

  • On Run-to-Goal, the ants demonstrate behaviours such as blocking, standing against opposing forces, using legs to topple the opponent and running towards the goal. The smaller humanoids try to avoid each other and run towards their goal really fast.
  • On You-shall-not-pass the blocking humanoid learned to block by raising its hand, while its opponent eventually learned to duck underneath in order to cross to the other side.
  • In Sumo, humanoids learn a stable fighting stance — as well as using their heads to knock into the opponent (no penalty for that in 3D world!).
  • On Kick-and-defend the kicker learns to use its feet to kick the ball high and try to avoid the defender. It also makes dummy movements to try and fool the defender. In turn, the defender learns use its hands and legs to obstruct the ball.

There are no rewards for any of these behaviours, beyond the ultimate reward that is the outcome of the game. And yet once more, watching the agents in action, they feel quite ‘human’ in their strategies.

These discovered skills show a degree of transference to other tasks too: taking an agent trained on Sumo and subjecting it to strong wind forces, the agent manages to stay upright despite never experiencing windy environments before.

So that’s the fun stuff. Now lets take a look at some of training techniques used under the covers.

Training agents using self-play

When trained with self-play, the competitive multi-agent environment provides the agents with a perfect curriculum. This happens because no matter how weak or strong an agent is, an environment populated with other agents of comparable strength provides the right challenge to the agent, facilitating maximally rapid learning and avoiding getting stuck.

Agents are trained using a distributed training system. The policy network produces a probability distribution over possible actions, and then an action to take is sampled from this distribution. Rollouts (self-play) continue from this point to figure out the ultimate outcome, and the PPO (proximal policy optimisation) gradient algorithm is used update the network. (Andrej Karpathy has a good overview of deep reinforcement learning with policy networks and policy gradients on his blog).

Essentially, we do multiple rollouts in parallel for each agent and have separate optimizers for each agent. We collect a large amount of rollouts from the parallel workers and for each agent optimize the objective with the collected batch on 4 GPUs… we estimate the generalize advantage estimate (GAE) from the full rollouts. This is important as the competitive reward is a sparse reward, which is only given at the termination of the episode.

A number of other factors needed to be taken into account during training: encouraging the agents to explore, stopping them from getting stuck in a rut with particular opponents, and keeping enough randomisation in the world to prevent overfitting. We’ll look at each of these in turn next.

Encouraging exploration

The agents aren’t likely to achieve much success unless they learn some basic motor skills such as the ability to walk. Hence they only very small chances of receiving rewards. “This is the general reinforcement learning problem of training from sparse reward which is ac active area of current research.

The problem is solved by adding extra rewards (dense rewards) in the beginning phase of training to allow agents to learn basic motor skills. These exploration rewards are gradually reduced to zero as training progresses, in favour of the overall competition reward, using a simple linear annealing factor. Typically the dense reward is a factor in about 10-15% of the training epoch.

Agents training without this exploration curriculum exhibit non-optimal behaviours. For example, in Sumo the agents just learn to stand and move towards the center of the arena. It also takes more samples to train them.

Mixing up opponents

We found that training agents against the most recent opponent leads to imbalance in training where one agent becomes more skilled than the other agent early in training and the other agent is unable to recover.

The fix for this is to train against randomly selected older versions of the opponent rather than simply always using the most recent. The following tables show the results in Sumo training for the humanoid and ant agents ($\delta = 1.0$ corresponds to the most recent agent, and \delta = 0.0 corresponds to uniform sampling over the whole history).

Sampling over the whole history works best for ants, but for humanoids \delta = 0.5 gives the best win rate.

The differences in these win-rates for different sampling strategies show that the choice of the opponent during sampling is important and care must be taken while designing training algorithms for such competitive environments.

Another related problem is overfitting to the behaviour of the opponent when trained for long periods, resulting in policies that don’t generalise well to new opponents. To overcome this, multiple policies (an ensemble) are learned simultaneously — in each rollout one of the other policies is selected at random as the opponent.

We find that training in ensemble performs significantly better for the humanoid body, whereas for ant the performance is similar to training a single policy. Again we suspect this is because there is not enough variability in the behavior of ant across different runs.

Mixing up the world

With little or no variation in the environment agents can also overfit. For example, running straight to the expected location of the ball in a game of Kick-and-Defend. Too much randomisation is a problem in the other direction: with a lot of randomisation in ball and agent positions, agents are unable to learn to kick. The solution is to start with a small amount of randomisation, which is easier for the agents to solve, and then gradually increase the randomisation during training.

We found this curriculum to work well for all the environments.

In conclusion

By adding a simple exploration curriculum to aid exploration in the environment we find that agents learn a high level of dexterity in order to achieve their goals, in particular we find numerous emergent skills for which it may be difficult to engineer a reward. Specifically, the agents learned a wide variety of skills and behaviors that include running, blocking, ducking, tackling, fooling opponents, kicking, and defending using arms and legs.

Mastering chess and shogi by self-play with a general reinforcement learning algorithm

January 10, 2018

Mastering chess and shogi by self-play with a general reinforcement learning algorithm Silver et al., arXiv 2017

We looked at AlphaGo Zero last year (and the first generation of AlphaGo before that), but this December 2017 update is still fascinating in its own right. Recall that AlphaGo Zero learned to play Go with only knowledge of the rules and self-play. The obvious question is can you generalise that approach to other games or situations? In this paper, the DeepMind team make another small tweak to their self-play reinforcement learning engine to give us AlphaZero, and train it to play chess, shogi, and also Go once more. It’s the chess story that grabs me here – I vividly remember DeepBlue playing Gary Kasparov back in 1997, and my understanding of chess play is much better than my understanding of Go strategies and tactics. So I can have a better appreciation for AlphaZero’s play in this domain.

The game of chess represented the pinnacle of AI research over several decades. State-of-the-art programs are based on powerful engines that search many millions of positions, leveraging handcrafted domain expertise and sophisticated domain adaptations. AlphaZero is a generic reinforcement learning algorithm — originally devised for the game of Go — that achieved superior results within a few hours, searching a thousand times fewer positions, given no domain knowledge except the rules of chess.

But isn’t chess old news?

Computers achieved super-human performance in chess 20 years ago, so what’s the big deal about AlphaZero learning to play chess well? As with AlphaGo Zero, AlphaZero taught itself through self-play with only knowledge of the rules – no opening book, no endgame database, no chess specific tactical knowledge etc.. And while the game of Go has a natural match between the structure of convolutional networks and the translation-invariant nature of Go, the fit with chess and shogi is not so clear cut:

Chess and shogi are, arguably, less innately suited to AlphaGo’s neural network architectures. The rules are position-dependent (e.g., pawns may move two steps forward from the second rank and promote on the eight rank) and asymmetric (e.g., paws only move forward and castling is different on kingside and queenside). The rules include long-range interactions (e.g., the queen may traverse the board in one move, or checkmate the king from the far side of the board..

AlphaZero playing chess

Trained AlphaZero engines played chess against Stockfish, shogi against Elmo, and Go against the previous version of AlphaGo Zero. In each case 100 game matches were played with one minute per move time limits. AlphaZero used four TPUs, as did AlphaGo Zero, Stockfish and Elmo played at their strongest levels using 64 threads and a 1GB hash size.

AlphaZero convincingly defeated all opponents, losing zero games to Stockfish and eight games to Elmo, as well as defeating the previous version of AlphaGo Zero.

It took just 4 hours for AlphaZero to outperform Stockfish, and 8 hours to beat AlphaGo Lee. That’s elapsed hours of course. 5000 TPUs were used to generate self-play games, and 64 TPUs were used to train the networks. So perhaps a more accurate statement would be “it took approximately 20,000 TPU-hours” for AlphaZero to outperform Stockfish.

So the headlines are impressive. But it’s the way that AlphaZero plays chess that really captures my attention. AlphaZero seems to play in a much more ‘human’ style. It has élan, it has flair. To see what I mean, it’s most instructive to look at some moments from the games.

Here’s IM Daniel Rensch’s commentary on some of the match highlights:

I also really enjoyed agadmator’s analysis of this game where AlphaZero trades a pawn to leave black’s bishop in a weak position, crippling black for much of the game:

Bearing in mind AlphaZero has no openings database, it’s also interesting to see that it rediscovers popular openings all by itself – seeming to end up with a preference for the English Opening and the Queens Gambit.

Each of these openings is independently discovered and played frequently by AlphaZero during self-play training. When starting from each human opening, AlphaZero convincingly defeated Stockfish, suggesting it has indeed mastered a wide spectrum of chess play.

There’s some evidence to back my assertion that AlphaZero’s play feels more like human play too: AlphaZero uses a Monte-Carlo tree search (MCTS) algorithm which searches 80 thousand positions per second in chess. Granted that’s many more positions than any human! But compared to the 70 million positions per second Stockfish crunches through, it’s tiny. Three orders of magnitude fewer! So AlphaZero must genuinely have (if you’ll excuse the pun), a deeper understanding of the chess positions in order to focus that much more limited energy in the places where it matters the most:

AlphaZero compensates for the lower number of evaluations by using its deep neural network to focus more selectively on the most promising variations — arguably a more ‘human-like’ approach to search…

Under the hood: from AlphaGo Zero to AlphaZero

So how does AlphaZero do it? The heart of AlphaZero is very similar to AlphaGo Zero that we looked at last year, but in an even more general form. There is the same single network structure which outputs both a vector of move probabilities and a scalar estimating the expecting outcome from the current position. And the network is similarly trained from self-play.

Games are played by selecting moves for both players using Monte-Carlo tree search… At the end of the game, the terminal position is scored according to the rules of the game to compute the game outcome… The neural network parameters \theta are updated so as to minimise the error between the predicted outcome and the game outcome, and to maximise the similarity of the policy vector to the search probabilities.

The main differences to AlphaGo Zero are as follows:

  • AlphaGo Zero assumes binary win/loss outcomes, AlphaZero also takes draws into account.
  • The AlphaGo family exploit Go’s invariance to rotation and reflection, generating 8 symmetries for each position and randomly selecting rotations or reflections before training. AlphaZero cannot do this as chess (and shogi) do not have the same invariance properties.
  • AlphaZero continually updates a single neural network during self-play. In contrast AlphaGo Zero moved forward using a batch process whereby the performance of a new player after a training iteration was measured against the previous best and only took over as the new generation source for self-play games if it won by a margin of 55%.
  • AlphaGo Zero tuned its hyperparameters using Bayesian optimisation. AlphaZero simply reuses the same hyper-parameters for all games without game-specific tuning.

And of course, AlphaZero needs a board input representation suitable for the games it is playing. For chess, the board is represented by 16 8×8 planes. There are six 8×8 planes representing the position of the white pieces (one for the pawn positions, one for knights, one for bishops, one for rooks, one for the king, and one for the queen), and a similar set to represent the positions of the black pieces. In addition there are a number of constant-value inputs for the player’s colour, total move count, legality of castling, move repetition count (3 repetitions is a draw in chess), and the number of moves without progress (50 moves without progress is a draw).

The policy encodes a probability distribution over 4,672 possible moves in an 8 x 8 x 73 stack of planes.

Each of the 8 x8 positions identifies the square from which to ‘pick up’ a piece. The first 56 planes encode possible ‘queen moves’ for any piece: a number of squares [1..7] in which the piece will be moved, along one of the eight relative compass directions {N, NE, E, SE, S, SW, W, NW }. The next 8 planes encode possible knight moves for that piece. The final 9 planes encode possible underpromotions for pawn moves or captures in two possible diagonals, to knight, bishop, or rook respectively. Other pawn moves or captures from the seventh rank are promoted to a queen.

Tomorrow we’ll take a look at another impressive application of self-play in reinforcement learning…

The case for learned index structures – Part II

January 9, 2018

The case for learned index structures Kraska et al., arXiv Dec. 2017

Yesterday we looked at the big idea of using learned models in place of hand-coded algorithms for select components of systems software, focusing on indexing within analytical databases. Today we’ll be taking a closer look at range, point, and existence indexes built using this approach. Even with a CPU-based implementation the results are impressive. For example, with integer datasets:

As can be seen, the learned index dominates the B-Tree index in almost all configurations by being up to 3x faster and being up to an order-of-magnitude smaller.

As we left things yesterday, the naive learned index was 2-3x slower! Let’s take a look at what changed to get these kinds of results.

Learned range indexes

Recall that a single learned CDF has difficulty accurately modelling the fine structure of a data distribution. Replacing just the first two layers of a B-Tree though, is much easier to achieve even with simple models. And once we’re dealing with a narrower portion of the data set, learning the fine structure of that becomes much more tractable. So the overall solution is to layer models in a recursive regression model:

…we build a hierarchy of models, where at each stage the model takes the key as an input and based on it picks another model, until the final stage predicts the position.

The recursive models are not trees though – different models at the same stage may map to the same models at the next stage. Nor do the models necessarily uniformly divide the key space. Think of model selection more like ‘picking an expert with better knowledge about certain keys.’

Note that we also have the freedom to use different models in the different stages:

For example, whereas on the top-layer a small ReLU neural net might be the best choice as they are usually able to learn a wide-range of complex data distributions, the models at the bottom of the hierarchy might be thousands of simple linear regression models as they are inexpensive in space and execution time.

The entire index (all stages) can be represented as a sparse matrix multiplication for a TPU/GPU, and can be trained end-to-end (see algorithm 1 in §3.3 of the paper).

The output of this Recursive Model Index (RMI) is a page. In absence of any knowledge of the data distribution, traditionally either binary search or scanning (for small page sizes) have been shown to be the fastest strategies. But hold on…

Yet again, learned indexes might have an advantage here as well: the models actually predict the position of the key, which likely to be much closer to the actual position of the record, while the min- and max-error is presumably larger.

Based on this insight, the authors develop a new biased quaternary search strategy. In a quaternary search, instead of picking one new middle point to to test as in binary search, three new test points are chosen (thus dividing the dataset into four quarters, as opposed to two halves). The three initial middle points for the quaternary search are set to be pos - \sigma, pos, and pos + \sigma, where pos is the predicted position. “That is we make a guess that most of our predictions are accurate and focus our attention first around the position estimate and then we continue with traditional quaternary search.

For evaluation purposes 4 secondary indexes are created over 3 real-world datasets (with up to 200M rows) and one synthetic dataset designed to exercise heavy-tailed distributions. For all datasets a comparison is done between a very competitive B-Tree implementation, using a variety of page sizes, and 2-stage RMI models with varied second-stage sizes (10K, 50K, 100K, and 200K). Training a full RMI model takes just a few seconds.

Let’s look at the map, weblog, and synthetic log-normal dataset results first, since these all use integer keys.

For the maps dataset, an index is created for the longitude of approximately 200M map features (roads, museums, coffee shops, …). Here’s how the B-Tree and RMI models faired, with total lookup time broken down into model execution time (to find the page) and search time within the page. Results are normalised compared to a B-Tree using 128 page size.

The learned index dominates the B-Tree index: it is up to three times faster and uses significantly less space.

The weblog dataset contains every request to a major university website over several years, and is indexed by timestamp. This is a challenging dataset for the learned model, and yet the learned index still dominates the B-Tree.

The story is similar with the synthetic log-normal dataset.

The web-documents dataset contains 10M non-continuous document-ids of a large web index. The keys here are strings, and the approach is to turn each string into a max-input length vector with one element per character. Other more sophisticated encodings could be used in the future, exploiting the ability of e.g. RNNs to learn interactions between characters. A hybrid model is included in this set of results, that replaces 2nd-stage models with errors above a threshold with a local B-Tree.

Here the learned index is less obviously better (though it always gives good space savings). Model execution with the vector input is more expensive – a problem that using a GPU/TPU rather than a CPU would solve.

The current design, even without any signification modifications, is already useful to replace index structures as used in data warehouses, which might only be updated once a day, or BigTable where B-Trees are created in bulk as part of the SStable merge process.

It’s possible that learned models have an advantage for certain types of update workloads too – especially if the updates follow the existing data distribution. If the distribution does change, one option is to build a delta-index of inserts and periodically merge it with the main index, including a potential retraining.

Learned point indexes

The key challenge for hash maps is preventing too many distinct keys from mapping to the same position inside the map (a conflict). With e.g., 100M records and 100M slots, there’s roughly a 33% chance of conflict. At this point, we have to use a secondary strategy (e.g., scanning a linked list) to search with the slot.

Therefore, most solutions often allocate significantly more memory (i.e., slots) than records to store… For example, it is reported that Google’s Dense-hashmap has a typical overhead of about 78% of memory.

But if we could learn a model which uniquely maps every key into a unique position inside the array we could avoid conflicts. If you have plenty of memory available, it’s going to be hard to beat the performance of traditional hashmaps. But at higher occupancy rates (e.g., less than 20% overhead) performance starts to slow down significantly and learned models may be able to do better.

Surprisingly, learning the CDF of the key distribution is one potential way to learn a better hash function… we can scale the CDF by the targeted size M of the hash-map and us h(K) = F(K)*M, with key K as our hash-function. If the model F perfectly learned the CDF, no conflicts would exist.

This strategy can be used to replace the hash function in any existing hash-map implementation. Here’s an evaluation comparing two-stage RMI models with 100K models in the second stage, against a fast hash function using 3 bitshifts, 3 XORs, and 2 multiplications. For each dataset, the two approaches are evaluated with the number of available slots in the map able to fit from 75% to 125% of the data.

The index with the model hash function overall has similar performance while utilizing the memory better.

Learned existence indexes

Bloom-filters are used to test whether an element is a member of a set, and are commonly used to determine if for example a key exists on cold storage. Bloom filters guarantee that there are no false negatives (if a Bloom filter says the key is not there, then it isn’t there), but do have potential for false positives. Bloom filters are highly space efficient, but can still occupy quite a bit of memory. For 100M records with a target false positive rate of 0.1%, we need about 14x more bits than records. And once more…

… if there is some structure to determine what is inside versus outside the set, which can be learned, it might be possible to construct more efficient representations.

From an ML perspective, we can think of a Bloom-filter as a classifier, which outputs the probability that a given key is in the database. We can set the probability threshold \tau for considering a key to be in the set such that we can achieve any given target false positive rate. Unlike Bloom filters though, we can’t guarantee the absence of false negatives. The solution to this problem is neat: take the set of false negatives produced by the learned function f , and create a Bloom filter just for this subset of keys. Now on a lookup if f says the key is present with probability greater than \tau we can respond in the affirmative. If the probability is less than this threshold, a second check to the false negatives Bloom filter is added before returning the final decision.

The approach was tested using a dataset of 1.7M unique blacklisted phishing URLs together with a negative set of valid URLs. A character-level RNN (GRU) is trained as the classifier to predict which set a URL belongs to. Compared to a normal Bloom filter, the model + spillover Bloom filter gives a 47% reduction in size for the same false positive rate (1%). At 0.01%, the space reduction is 28%.

The more accurate the model, the better the savings in Bloom filter size. There’s nothing stopping the model using additional features to improve accuracy, such as WHOIS data or IP information.

The last word

…we have demonstrated that machine learned models have the potential to provide significant benefits over state-of-the-art database indexes, and we believe this is a fruitful direction for future research.

I look forward to seeing more work along on this line!

The case for learned index structures – part I

January 8, 2018

The case for learned index structures Kraska et al., arXiv Dec. 2017

Welcome to another year of papers on The Morning Paper. With the rate of progress in our field at the moment, I can’t wait to see what 2018 has in store for us!

Two years ago, I started 2016 with a series of papers from the ‘Techniques everyone should know’ chapter of the newly revised ‘Readings in Database Systems.’ So much can happen in two years! I hope it doesn’t take another ten years for us to reach the sixth edition of the ‘Red Book,’ but if it does, in today’s paper choice Kraska et al., are making a strong case for the inclusion of applied machine learning in a future list of essential techniques for database systems. I can’t think of a better way to start the year than wondering about this blend of old and new — of established database systems research and machine learning models — and what it all might mean for systems software of the future. I’m going to break my summary up into two posts (today and tomorrow) so that we can focus on the big picture today and dive into the details tomorrow.

The thing I like best about this paper is that it opens the door to a different way of thinking about components of systems software. By cleverly framing the problem of indexing, the authors show the potential to use machine learning deep inside a data management system. There are some constraints in applicability of the current implementation (which we’ll get to), but even then results which show up to 70% speed improvements while simultaneously saving an order-of-magnitude in memory are worth paying attention to. Remember, these improvements are obtained against algorithms which have been tuned over decades. Rather than getting too hung on up the particulars of the indexing use case straight away though, I want to dwell on the bigger picture for a few moments:

…we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.

When does it make sense (in systems software) to replace a hand-engineered data structure and/or algorithm with a machine-learned model? The best analogy I can come up with at present to address this question is personalisation. Early web sites and apps showed the same content to everyone. But today, most sites offer you a highly personalised experience, having found that this works better both for them and (hopefully!) for you. The generic experience might be good, but the personalised experience is better. In this analogy, the equivalent of a user in the world of systems software, is I think, a workload — where a workload is a combination of data & access patterns. And one day we might say something like this: “Early systems software used the same generic configuration, data structures, and algorithms for all workloads. But today, most systems learn from and optimise for the specific workloads they are running.”

An easy to understand example that we looked at a few times last year is configuration: most systems ship with a default configuration that’s ok across a broad set of workloads, but tuning the configuration to suit your specific workload can yield big performance improvements. That’s the pattern I think we’re looking for: workload personalisation makes sense to explore when the performance of the system is sensitive to the characteristics of the particular workload it’s running, and if we had known those characteristics a priori we could have optimised the system for them. Frankly, pretty much all systems software is sensitive to the characteristics of the workload running on it! This puts me very much in mind of the ‘no free lunch theorem’ too: “…if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.

A couple of other considerations come to mind: the workload personalisation needs to occur within some bounded component of the system (one that can be abstracted as a function), and we need a way of preserving guarantees / honouring constraints. Consider the use of probabilistic data structures today, where we’re already comfortable with the fact that they’re approximations (as a learned function will be). Perhaps the best known is the Bloom filter, which tells us whether or not an element is likely to be in a set. Bloom filters come with an important guarantee – there are no false negatives. Learning a function that has ‘only a small probability’ of a false negative is not the same at all in many use cases. It turns out the authors have a really neat solution to this problem that we’ll look at tomorrow!

Unlocking the full potential of workload personalisation needs an online learning/optimisation approach, since:

the engineering effort to build specialized solutions for every use case is usually too high… machine learning opens up the opportunity to learn a model that reflects the patterns and correlations in the data.

In this paper, the approach is applied to the automatic synthesis of specialized indexed structures, which the authors term learned indexes.

On CPUs, GPUs, and TPUs

The implementations in this paper were done using CPUs, but there’s another reason why neural net style models will become increasingly attractive in the future: they can run very efficiently on GPUs and TPUs. While CPUs are plateauing in performance, NVIDIA predict a 1000x increase in GPU speed by 2025. With closer integration of CPU/GPU/TPU units to reduce the handover costs, the GPU/TPU performance curve looks to be the one to ride to achieve maximum performance into the next decade.

Introducing learned (range) indexes

This paper is primarily concerned with read-only in-memory analytical workloads. In a typical B-Tree index found in an analytics database today, the B-Tree provides a mapping from a lookup key to a position inside a sorted array of records. The index contains entries for every nth key, for example, the first key in a page. The guarantee from such a B-Tree is that the key will be found within ‘page-size’ of the returned location, if it exists. It’s a function from key to location, with error bound guarantees. We can learn a function, but what about the guarantees?

At first sight it may be hard to provide the same error guarantees with other types of ML models, but it is actually surprisingly simple. The B-tree only provides this guarantee over the stored data, not for all possible data. For new data, B-Trees need to be re-balanced, or in machine learning terminology re-trained, to still be able to provide the same error guarantees. This significantly simplifies the problem: the min- and max-error is the maximum error of the model over the training (i.e. the stored) data.

If we execute the model for every key, and remember the worst over- and under-predictions, then we know that a lookup for any key that exists must fall within these bounds. Using a machine learned model becomes possible therefore, and transforms an O(log n) B-Tree look-up into a constant operation.

Traversing a single B-Tree page of size 100 takes roughly 50 cycles (with any cache-misses), and we would need log100N of those traversals for a lookup in a table with N keys. In comparison, you can do 1 million neural net operations in 30 cycles on NVIDIAs latest Tesla V100 GPU. On modern CPUs we have a budget of about log100N 400 arithmetic operations to beat the B-Tree.

A model that predicts the position of a key within a sorted array is effectively approximating the cumulative distribution function (CDF):

…estimating the distribution for a data set is a well-known problem and learned indexes can benefit from decades of research.

A naive first implementation using a two-layer fully-connected neural network with 32 neurons per layer achieved approximately 1250 predictions per second. The B-Tree was still 2-3x faster though! Why?

  1. The Tensorflow invocation overhead (especially with a Python front-end) is too great.
  2. B-Trees ‘overfit’ the data very effectively. In contrast, a learned CDF works well to approximate the general shape, but doesn’t do so well at the individual data instance level (as shown in the callout in the figure above). “Many datasets have exactly this behaviour: from the top the data distribution appears very smooth, whereas as we zoom in more, the harder it is to approximate the CDF because of the randomness on the individual level.” This makes the ‘last mile’ tougher for single neural nets.
  3. Instead of minimising the average error (typical for ML optimisation), we really want to minimise the min- and max-error bounds.
  4. B-Trees are extremely cache efficient, whereas standard neural nets are less so.

If we stopped here of course, it wouldn’t be much of a story. But in tomorrow’s post we’ll look at the learning index framework (LIF) which the authors develop to overcome these problems. In addition to range indexes, we’ll also take a look at point indexes (essentially, learned alternatives to hash maps), and existence indexes (learned alternatives to Bloom filters).