Establishing software root of trust unconditionally

April 3, 2019

Establishing software root of trust unconditionally Gligor & Woo, NDSS’19

The authors won a best paper award for this work at NDSS this year. The main result is quite something, but as you might expect the lines of argument are detailed and not always easy to follow (and certainly not critically!) for non-experts like me. I’ll use today’s write-up to convey the big ideas as I understand them. Close reading of the original paper (and its accompanying technical report) will be essential if you then want to dig deeper.

The problem: establishing a root of trust

Root of trust (RoT) establishment on an untrusted system ensures that a system state comprises all and only content chosen by the user, and the user’s code begins execution in that state. All implies that no content is missing, and only that no extra content exists. If a system state is initialized to content that satisfies security invariants and RoT establishment succeeds, a user’s code begins execution in a secure initial state.

That is to say, even in the presence of persistent malware e.g. in the firmware of peripheral controllers, network interface cards, disk and USB controllers, and so on, root of trust can assure us that the system has booted into a known and trusted state (or alternatively, uncover the presence of malware with high probability).

The word “unconditionally” from the paper title is also highly significant here. In this context it means there are no external dependencies on e.g. secrets, trusted hardware modules, or special instructions (e.g. TPMs, SGX, …), and no polynomial bounds on the adversary’s computing power (so, for example, the guarantees still hold even in a post-quantum world).

By definition, a solution to a security or cryptography problem is unconditional if it depends only on the existence of physical randomness and the ability to harvest it… We know of no other protocols that establish RoT provably and unconditionally. Nor do we know any other software security problem that has been solved unconditionally in any realistic computational model.

The main result: unconditional RoT establishment

The untrusted system (the system to be verified) is connected to a trusted local verifier. The adversary can do pretty much anything: firmware attacks, remote control of malware and secret extraction over a network channel, reading from and writing to the verifier’s I/O channel, and unlimited computing power on tap. The adversary does not, however, have access to the verifier’s device or its external source of true random numbers.

The big picture comes together like this:

  • The verifier asks the system to initialise its random access memory M and register set R to the chosen content (i.e., known good state of a trustworthy program).
  • Then the verifier selects a random nonce and sends it to the system with a challenge to perform a (very!) carefully chosen computation over M and R and the nonce.
  • The computation is such that it is space-time optimal (i.e., not possible to do it any faster than time t, and not possible to use any less space than m).
  • The computation is non-interruptible, and is timed with very high accuracy. The verifier expects a result after time t, such that there isn’t time to slip in even a single additional instruction.
  • Moreover, the computation is second pre-image free. This means that you can’t easily find a second input (e.g., a somehow corrupted version of the desired system state in this case) that produces the same output as the original input (the system state we expect the system to be in).

Intuitively, second pre-image freedom and m-t optimality can jointly prevent an adversary from using fewer than m words or less time than t, or both, and hence from leaving unaccounted for content (e.g. malware) or executing arbitrary code in the system.
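To make the shape of the protocol concrete, here is a minimal verifier-side sketch in Python. It is only an illustration of the challenge-response-and-time check: `channel` and `compute_expected` are hypothetical hooks (not from the paper), and the real protocol needs cycle-accurate timing over a dedicated DMA channel, not wall-clock timing like this.

```python
import os
import time

def verify_device(channel, chosen_state, compute_expected, t_expected_ns, tolerance_ns):
    """Hedged sketch of the verifier's challenge-response check.

    `channel` (send/receive to the untrusted device) and `compute_expected`
    (the verifier's local copy of the Horner(H) computation) are hypothetical
    placeholders.
    """
    # 1. Ask the device to initialise M and R to the chosen content.
    channel.send(("init", chosen_state))

    # 2. Challenge with a fresh random nonce and start the clock.
    nonce = os.urandom(32)
    start = time.perf_counter_ns()
    channel.send(("challenge", nonce))

    # 3. The device must return the result of the computation over (M, R, nonce).
    result = channel.receive()
    elapsed = time.perf_counter_ns() - start

    # 4. Accept only if both the value and the timing are right: a late answer
    #    implies unaccounted-for memory or computation somewhere.
    return (result == compute_expected(chosen_state, nonce)
            and abs(elapsed - t_expected_ns) <= tolerance_ns)
```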

So all we need now is a super-accurate timer, and a rather special computation family…

Building blocks

The lower bounds of a computation are established by a proof that holds for all possible algorithms for performing it. For example, if we’re going to add two numbers we at least need to look at those two numbers… A particular algorithm for that computation is space-time optimal if its bounds match the space and time lower bounds of its computation. Asymptotic lower bounds are no use here (because an adversary might always be able to find slightly better concrete bounds), and nor are purely theoretical lower bounds: we must have a concrete algorithm that exactly matches them. Moreover, the algorithm needs to be executable in a realistic computer model (e.g., accounting for general ISAs, I/Os, interrupts, multi-processors, and so on).

Word Random Access Machine (WRAM) models are the closest to real computers. The cWRAM model has a constant word length and at most two operands per instruction.

The cWRAM includes all known register-to-register, register-to-memory, and branching instructions of real system ISAs, as well as all integer, logic, and shift/rotate computation instructions… All cWRAM instructions execute in unit time.

Polynomial evaluation based on Horner’s rule can be implemented in cWRAM using a sequence of four instructions per step. A d-degree polynomial can be evaluated in d steps in a Horner-rule program. I.e., it can form the basis of our special computation.
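As a reminder of what Horner’s rule actually computes, here is a plain Python version of the arithmetic (not the cWRAM instruction sequence itself): each of the d steps is one multiply, one add, and a reduction mod p.

```python
def horner_eval(coeffs, x, p):
    """Evaluate a polynomial of degree d = len(coeffs) - 1 at x, mod p.

    coeffs[0] is the leading coefficient. Each loop iteration is one Horner
    step; the cWRAM program performs the same work in a short, fixed
    sequence of instructions per step.
    """
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % p
    return acc

# Example: 3x^2 + 2x + 1 at x = 5 (mod 97) -> 3*25 + 2*5 + 1 = 86
assert horner_eval([3, 2, 1], 5, 97) == 86
```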

The honest one-time evaluation of P_{d}(x) by a Horner-rule program is uniquely space-time optimal whenever the cWRAM execution time and memory are simultaneously minimized; i.e., no other programs can use fewer than both d+11 storage words and 6d time units.

(Proof in appendix B of the accompanying technical report).

What kind of polynomials can we evaluate that will have the desired second pre-image resistance? Good old hash functions! (As an aside, it’s quite amazing how many incredible uses hash functions have been put to!). We’ll compute the equivalent of k-independent and (almost) universal hash functions over the contents of the memory, the registers, and the nonce. The family H of randomized polynomials defined in section IV.A of the paper have the desired property.
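To convey the flavour of the computation (and only the flavour: the actual family H in §IV.A of the paper is constructed much more carefully, with coefficients randomised by the nonce), here is a hedged sketch in which every word of M, R, and the nonce feeds into one Horner step, so omitting or altering any word changes the result.

```python
def memory_checksum(memory_words, register_words, nonce_words, x, p):
    """Hedged sketch of a Horner-rule hash over the device state.

    This is NOT the paper's exact family H (see section IV.A and the
    technical report); it only illustrates the shape: a polynomial over
    all words of the nonce, the registers, and the memory, evaluated at
    a random point x modulo a large prime p.
    """
    acc = 0
    for w in list(nonce_words) + list(register_words) + list(memory_words):
        acc = (acc * x + w) % p
    return acc
```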

Theorem 3 asserts that no program can use both fewer than k+22 storage words and (6k-4)6d time units in an honest one-time evaluation of Horner(H). And theorem 5 asserts that an adversary can do better only with probability at most 3/p (p is the large prime used to define the family H).

Putting it all together

A system comprises c connected devices, where device i has random access memory M_i and processor registers R_i. We have to verify all of the devices concurrently, with a transactional verifier protocol such that verification only passes if all checks pass. The difference between the start and end times of any two verification programs must be small enough that neither device can detectably perform any computation for the other. The values of d and k in the Horner(H) problem for a device can be chosen such that the duration requirement is satisfied across devices of varying speed.

To enable accurate time measurement we need to:

  • Attach the verifier device via a DMA interface such that there is no peripheral device controller in the path, and the channel delay and variation is small enough that any untrusted communication can be detected. The verifier could also run as a co-processor connected to the main system bus.
  • Disable caches, virtual memory, and the TLB (this can be done verifiably, as shown in theorem 6 in section IV.E)
  • Eliminate the effects of random clock jitter using random sequential protocols as developed for embedded real-time systems.

Implementation

Performance measurements for Horner-rule evaluation of randomised polynomials show implementation practicality on a commodity hardware platform.

The timing for k=64 and d=10M is 54,578ms. For the baseline d = 128M [covering the entire SDRAM], the running time is close to 700 seconds…

Good things come to those who wait!

The last word

In this paper we showed that, with a proper theory foundation, RoT establishment can be both provable and unconditional. We know of no other software security problem that has had such a solution, to date. Finally, the security of time measurements on untrusted systems has been a long-standing unsolved engineering problem. Here, we also showed that this problem can be readily solved given the provable atomicity of the verifier’s protocol.

The crux of voice (in)security: a brain study of speaker legitimacy detection

April 1, 2019

The crux of voice (in)security: a brain study of speaker legitimacy detection Neupane et al., NDSS’19

The key results of this paper are easy to understand, but the implications are going to take us a long time to unravel. Speech morphing (voice morphing) is the process of translating a speaker’s voice to sound like a given impersonation target. This capability is now available off-the-shelf (this paper uses the CMU Festvox voice converter) and is getting better all the time. Being able to impersonate someone’s voice takes things like social engineering attacks to a whole new level…

…voice imitation is an emerging class of threats, especially given the advancement in speech synthesis technology seen in a variety of contexts that can harm a victim’s reputation and her security/safety. For instance, the attacker could publish the morphed voice samples on social media, impersonate the victim in phone conversations, leave fake voice messages to the victim’s contacts, and even launch man-in-the-middle attacks against end-to-end encryption technologies that require users to verify the voices of the callers, to name a few instances of such attacks.

So voice should sit alongside images and video as a source we can’t trust in our new post-reality world. (It seems especially powerful when you combine e.g. voice impersonation and video). But that’s something we already knew. It turns out though that voice may be a particularly devastating attack vector because ‘deep down,’ inside our brains, we genuinely can’t tell the difference between a real voice and a morphed voice impersonating it.

Previous studies have investigated brain activity when users are looking at real and fake websites and images, and found that “subconscious neural differences exist when users are subject to real vs. fake artifacts, even though users themselves may not be able to tell the two apart behaviorally.” The work reported in this paper began with the hypothesis that there should be similar neural differences between listening to the original and fake voices of a speaker. Despite trying really hard though, no such difference could be detected!

Our key insight is that there may be no statistically significant differences in the way the human brain processes the legitimate speakers vs synthesized speakers, whereas clear differences are visible when encountering legitimate vs different other human speakers… Overall, our work … reveals user’s susceptibility to voice synthesis attacks at a biological level.

However much we teach people to be on the lookout for voice impersonation, it seems they’re not going to be able to reliably detect it; especially as voice synthesis techniques will continue to improve. So to defend against voice impersonation attacks we’re going to need machine assistance. But if voice synthesis is trained in an adversarial style, will even that be possible?

If there’s a silver lining to be found here then perhaps it’s this: the fact that we can’t tell the difference suggests that current morphing technology may be ready to serve those who have lost their voices.

Analysing neural activity using fNIRS

Neural activity is studied using functional near-infrared spectroscopy (fNIRS). fNIRS is a non-invasive imaging method (using lots of external probes as per the image below) which measures the relative concentration of oxygenated hemoglobin (oxy-Hb) and deoxygenated hemoglobin (deoxy-Hb) in the brain cortex. It provides better temporal resolution than fMRI (without requiring the subject to lie supine in a scanner) and better spatial resolution than EEG.

Experiment set-up

Voice samples were collected from the Internet for Oprah Winfrey and Morgan Freeman (both of whom have very distinctive and easily recognisable voices).

Twenty American speakers were then recruited via Amazon Mechanical Turk to record the speech of these two celebrities in their own voices, imitating as best they could the original speaking style, pace, and emotion. One male and one female speaker from among these twenty were then selected (how?), and their speech was fed through the CMU Festvox voice converter to generate morphed versions of the voices of Morgan Freeman and Oprah Winfrey.

Twenty experiment participants were then recruited: 10 male and 10 female, all English speaking and in the age range 19-36. Each participant was familiarised with the voice of a victim speaker for 1 minute, and then played 12 randomly selected speech samples: 4 in the original speaker’s voice, 4 in a morphed voice designed to impersonate the original speaker, and 4 in a different speaker’s voice. The participants were asked to identify the legitimate and fake voices of the victim speakers, but were not explicitly told about voice morphing technology.

The experiment was conducted 4 times for each participant: twice with one of the celebrity voices as the victim, and twice with a voice with which they were only ‘briefly familiar’, i.e., one encountered for the first time during the experiment.

While all this was going on, the fNIRS probe-cap captured brain activity across several regions of the brain.

Key results

With the original voices, participants were correctly able to identify the speaker as the victim voice 82% of the time. The morphed voices were (incorrectly) identified as the authentic victim’s voice 58% of the time. As a baseline, the different voice impersonating the original speaker was (incorrectly) identified as the authentic victim’s voice 33% of the time.

When comparing the neural activation in the brain between original speaker speech samples and the morphed voice impersonating that speaker, no statistically significant differences were observed. At this point you might be thinking “perhaps the experiment just didn’t monitor the part of the brain where the difference shows” (I certainly was). And that’s still a possibility. However, comparing the neural activation between an original speaker speech sample and a different voice impersonating the speaker did show statistically significant differences. I.e., the probes were at least measuring brain activity in some of the areas that matter!

Differences were also observed in brain activation between familiar voices (the celebrities) and the ‘briefly familiar’ voices (those they were exposed to only as part of the experiment). Famous speakers led to higher activation in the frontopolar area and middle temporal gyrus.

Having failed to detect any statistically significant difference in brain activity for real and morphed voices using traditional statistical techniques, the authors then tried training a machine learning classifier on the brain activation data to see if it could learn to distinguish. The best performing model only achieved 53% accuracy. I.e., there genuinely seems to be nothing in the captured brain activity that can tell the difference between the two cases.

A final thought:

Since this potential indistinguishability of real vs. morphed lies at the core of human biology, we posit that the problem is very severe, as the human detection of synthesized attacks may not improve over time with evolution. Further, in our study, we use an off-the-shelf, academic voice morphing tool based on voice conversion, CMU Festvox, whereas with the advancement in the voice synthesizing technologies (e.g., newer voice modeling techniques such as those offered by Lyrebird and Google WaveNet), it might become even more difficult for users to identify such attacks. Also, our study participants are mostly young individuals and with no reported hearing disabilities, while older population samples and/or those having hearing disabilities may be more prone to voice synthesis attacks.

Bear in mind that participants in the study, while not primed on voice morphing technology, were explicitly asked to try to distinguish between real and impersonated voices. In a real-world attack, victims are unlikely to be so attuned to the possibility of impersonation.

Calvin: fast distributed transactions for partitioned database systems

March 29, 2019

Calvin: fast distributed transactions for partitioned database systems Thomson et al., SIGMOD’12

Earlier this week we looked at Amazon’s Aurora. Today it’s the turn of Calvin, which is notably used by FaunaDB (strictly, “FaunaDB uses patent-pending technology inspired by Calvin…”). As the paper title suggests, the goal of Calvin is to put the ACID back into distributed databases. I’ve put these systems back-to-back because they are both vying for attention in the serverless world, but note that really this is an apples-to-oranges thing as Aurora scales out MySQL and PostgreSQL with single writer multi-reader models, whereas Calvin supports distributed transactions over distributed databases. A closer comparison with Calvin in the AWS world would be DynamoDB, which recently added limited transaction support (single region tables). Calvin is more commonly compared against Google’s Spanner. See Daniel Abadi on this topic (note that Abadi is one of the authors of the Calvin paper, and an advisor to FaunaDB), and also on “It’s time to move on from two-phase commit.”

Calvin [is] a transaction processing and replication layer designed to transform a generic, non-transactional, un-replicated data store into a fully ACID, consistently replicated distributed database system. Calvin supports horizontal scalability of the database and unconstrained ACID-compliant distributed transactions while supporting both asynchronous and Paxos-based synchronous replication, both within a single data center and across geographically separated data centers.

The secret to Calvin’s scalability is that it sets things up in such a way that distributed commit protocols are not needed. By avoiding this bottleneck the (2012) evaluation showed Calvin able to sustain half-a-million TPC-C transactions per second on a (100-node) cluster of commodity EC2 instances, “which is immediately competitive with the world-record results currently published on the TPC-C website that were obtained with much higher-end hardware.”

The big idea: minimise the contention window

The thing that gets you into scalability problems with distributed transactions is holding locks while coordinating with remote nodes to check everyone is in agreement. This makes a large contention window during which other transactions can get backed up waiting for locks on the same items (or in optimistic schemes, a lot of aborts).

In Calvin, machines agree on the plan for handling a particular transaction outside of any transaction boundaries (i.e., before any locks are acquired). With the plan agreed a transaction can begin and each node must follow the plan exactly. At this point, not even node failure can cause the transaction to abort:

If a node fails, it can recover from a replica that had been executing the same update plan in parallel, or alternatively, it can replay the history of planned activity for the node.

There’s one important requirement for this to work though: activity plans must be deterministic so that replicas cannot diverge and history will always be recreated in exactly the same way. Calvin achieves the desired deterministic outcome by (a) using a deterministic locking protocol for transaction execution, and (b) reaching a global agreement about which transactions to attempt and in what order.

It’s interesting to observe here that whereas many of the works we’ve been looking at recently deal with the problem of ordering through commutativity, Calvin instead deals with the problem of ordering by enforcing one and only one order. That has some interesting ‘tail’ implications that we’ll come back to later.

Another interesting point of comparison, this time with Aurora, is that whereas Aurora keeps only the redo commit logs and pushes those out to the storage layer, Calvin takes the opposite approach and doesn’t use redo logging at all (instead it replicates transaction inputs in conjunction with its determinism guarantee).

System architecture

The essence of Calvin lies in separating the system into three separate layers of processing…

The sequencing layer takes transactional inputs and reaches distributed agreement on a global ordering of transactions. (I.e., we’re still solving a form of distributed consensus right here).

The scheduling layer then uses a deterministic locking scheme to orchestrate transaction execution guaranteeing equivalence to the serial order determined by the sequencing layer.

The storage layer handles all physical data layout. Any storage engine featuring a basic CRUD interface can be plugged into Calvin.

By separating the replication mechanism, transactional functionality and concurrency control (in the sequencing and scheduling layers) from the storage system, the design of Calvin deviates significantly from traditional database design which is highly monolithic, with physical access methods, buffer manager, lock manager, and log manager highly integrated and cross-reliant.

Note though that in contrast to Aurora, the storage layer component is co-located with the scheduler and sequencer components (each grey box in the figure above is a server). Each node is responsible for a partition, and a replica is a set of nodes covering all partitions.

The sequencing layer divides time into 10-millisecond epochs, with an epoch number synchronously incremented across the entire system each time (how? what if a node is unresponsive? what if a node fails…?). Each sequencer collects transaction requests from clients during the 10ms window and compiles them into a batch, then the batches are replicated across the sequencers (hence a 10ms lower bound on transaction latency). Every scheduler pieces together its own view of the global transaction order by interleaving all sequencers’ batches in a deterministic round-robin order. Calvin supports either asynchronous replication or Paxos-based synchronous replication.
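A toy sketch of the deterministic interleaving step (batch collection, replication, and Paxos agreement are elided; `batches_by_sequencer` is assumed to already hold each sequencer’s replicated batch for one epoch):

```python
def global_order_for_epoch(batches_by_sequencer):
    """Hedged sketch: derive the global transaction order for one epoch.

    `batches_by_sequencer` maps sequencer id -> that sequencer's (already
    replicated) batch of transaction requests. Interleaving the batches in
    a fixed round-robin order over sequencer ids means every scheduler, on
    every replica, derives exactly the same global order.
    """
    order = []
    for seq_id in sorted(batches_by_sequencer):  # deterministic sequencer order
        order.extend(batches_by_sequencer[seq_id])
    return order
```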

In async mode, one of the replicas is designated as a master and all transaction requests are forwarded to its sequencers. These assemble all received requests into a master list and then forward the list to the other replicas in the group. The challenge with this mode is that failover becomes complex. Hence the Paxos mode in which Paxos is used to agree on a combined batch of transaction requests for an epoch (the Calvin implementation uses ZooKeeper).

With Paxos, transaction latency is up in the 100ms+ range (could we use CURP here?):

With the storage layer separated from the scheduler, …

… both the logging and concurrency protocols have to be completely logical, referring only to record keys rather than physical data structures.

Calvin logs the inputs at the sequencing layer, with occasional checkpoints taken by the storage layer. This is possible because of the deterministic execution. Recall that this requires not just global agreement on transaction ordering, but also deterministic lock acquisition during transaction execution. Calvin uses strict two-phase locking with two extra rules:

  1. If any pair of transactions A and B both requests exclusive locks on a local record R, then whichever transaction comes first in the global order must acquire the lock on R first.
  2. The lock manager must grant each lock to requesting transactions strictly in the order in which those transactions requested the lock.
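A minimal sketch of a lock manager obeying these two rules (exclusive locks only; shared locks, error handling, and the execution engine itself are elided, and none of this is Calvin’s actual code):

```python
from collections import defaultdict, deque

class DeterministicLockManager:
    """Hedged sketch of Calvin-style deterministic lock acquisition.

    Lock requests arrive in global transaction order (rule 1), and each
    lock's queue is served strictly in request order (rule 2).
    """
    def __init__(self):
        self.queues = defaultdict(deque)   # record key -> FIFO queue of txn ids

    def request_locks(self, txn_id, keys):
        # Called in global transaction order, so earlier transactions are
        # always queued ahead of later ones on every record they touch.
        for key in keys:                   # full read/write set known up front
            self.queues[key].append(txn_id)

    def can_run(self, txn_id, keys):
        # A transaction runs once it is at the head of every queue it is in.
        return all(self.queues[key][0] == txn_id for key in keys)

    def release(self, txn_id, keys):
        for key in keys:
            assert self.queues[key][0] == txn_id
            self.queues[key].popleft()
```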

One thing that leaps out here is that this requires us to know the read/write sets of transactions ahead of time. And indeed,…

Transactions which must perform reads in order to determine their full read/write sets (which we term dependent transactions) are not natively supported in Calvin since Calvin’s deterministic locking protocol requires advance knowledge of all transactions’ read/write sets before transaction execution can begin.

For such transactions you have to modify the application code a little to introduce a pre-flight “optimistic lock location prediction” (OLLP) step. This is a reconnaissance query to figure out what the read/write set is likely to be. If the actual read/write set differs during execution we restart.
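The pattern looks roughly like this (a hedged sketch: `recon_read_write_set` and `submit` are hypothetical hooks standing in for the application’s reconnaissance query and Calvin’s submission path):

```python
def run_dependent_transaction(txn, recon_read_write_set, submit):
    """OLLP sketch for a dependent transaction."""
    while True:
        # Reconnaissance query, run outside the transaction: we only need a
        # *prediction* of the records that will be touched.
        predicted = recon_read_write_set(txn)

        # Submit annotated with the predicted set so locks can be acquired
        # deterministically up front.
        result = submit(txn, read_write_set=predicted)

        # If the records actually touched match the prediction we are done;
        # otherwise restart with the refreshed prediction.
        if result.actual_read_write_set == predicted:
            return result
```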

Dealing with disk latency

Prior work from the same team assumed an in-memory database. Whereas a commutative or non-deterministic system has freedom to execute transactions in any serial order, Calvin needs to execute transactions in the pre-determined order. So if a transaction A is stalled waiting for disk access we have to wait (whereas a non-deterministic system would be able to go ahead and run other non-conflicting transactions). To mitigate this impact, Calvin tries to ensure that transactions aren’t holding locks and waiting on disk I/O.

Any time a sequencer component receives a request for a transaction that may incur a disk stall, it introduces an artificial delay before forwarding the transaction request to the scheduling layer and meanwhile sends requests to all relevant storage components to “warm up” the disk-resident records that the transaction will access. If the artificial delay is greater than or equal to the time it takes to bring all the disk-resident records into memory, then when the transaction is actually executed, it will access only memory-resident data.

So long as the transaction really would have had to access data from disk anyway, then this additional up-front delay adds nothing to the end-to-end latency.

Now we have two new problems though: (a) how long should the delay before forwarding be (i.e., how do we predict disk I/O latency)? And (b) we need to track which keys are actually in memory across all storage nodes so that we know when we need to do prefetching.

Predicting disk I/O latency is both difficult (especially if you have heterogeneous nodes!) and also “a particularly interesting and crucial parameter when tuning Calvin to perform well on disk-resident data under high contention.” In the evaluation, Calvin was tuned so that at least 99% of disk-accessing transactions were scheduled after their corresponding pre-fetching requests had completed.

The paper offers no scalable solution to tracking the data in memory, as of 2012 that remained ‘future work.’

Checkpointing

Calvin has three checkpointing modes, discussed in §5 of the paper. I don’t have the space to discuss them all here, so I refer you to the paper for details. Of note though is that recovery starts from a checkpoint and then exploits the deterministic nature of the database to recover by replaying transaction inputs (i.e., re-running (side-effect free, we hope!) transactions).

Throughput

Here’s Calvin scaling to 500,000 tps on NewOrder (TPC-C):

Note that the performance per-machine shows a gradual decline as more nodes are added at higher contention.

Suppose, under a high contention workload, that machine A starts executing a distributed transaction that requires a remote read from machine B, but B hasn’t gotten to that transaction yet (B may still be working on earlier transactions in the sequence, and it can not start working on the transaction until locks have been acquired for all previous transactions in the sequence). Machine A may be able to begin executing some other non-conflicting transactions, but soon it will simply have to wait for B to catch up before it can commit the pending distributed transaction and execute subsequent conflicting transactions. By this mechanism, there is a limit to how far ahead of or behind the pack any particular machine can get. The higher the contention, the tighter this limit.

This makes Calvin sensitive to slow machines (e.g. that unlucky EC2 instance), and execution process skew. With higher contention rates it becomes more likely that each machine’s random slowdowns will cause other machines to slow their execution as well.

Even so, compared to two-phase commit Calvin performs much better under high contention.

Update: Matt Freels, CTO of FaunaDB, got in touch to point me at two great blog posts on fauna.com describing how FaunaDB uses Calvin:

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes

March 27, 2019

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes, Verbitski et al., SIGMOD’18

This is a follow-up to the paper we looked at earlier this week on the design of Amazon Aurora. I’m going to assume a level of background knowledge from that work and skip over the parts of this paper that recap those key points. What is new and interesting here are the details of how quorum membership changes are dealt with, the notion of heterogeneous quorum set members, and more detail on the use of consistency points and the redo log.

Changing quorum membership

Managing quorum failures is complex. Traditional mechanisms cause I/O stalls while membership is being changed.

As you may recall though, Aurora is designed for a world with a constant background level of failure. So once a quorum member is suspected faulty we don’t want to have to wait to see if it comes back, but nor do we want to throw away the benefits of all the state already present on a node that might in fact come back quite quickly. Aurora’s membership change protocol is designed to support continued processing during the change, to tolerate additional failures while changing membership, and to allow member re-introduction if a suspected faulty member recovers.

Each membership change is made via at least two transitions. Say we start out with a protection group with six segment members, A-F. A write quorum is any 4 of 6, and a read quorum is any 3 of 6. We miss some heartbeats for F and suspect it of being faulty.

The first move is to increment the membership epoch and introduce a new node G. All read and write requests, and any gossip messages, carry the epoch number; any request with a stale epoch number will be rejected. Making the epoch change requires a write quorum, just like any other write. The new membership epoch established through this process now requires a write quorum to be any four of ABCDEF and any four of ABCDEG. Notice that whether we ultimately choose to reinstate F or stick with G, we have valid quorums at all points under both paths. For a read quorum we need any 3 of ABCDEF or any 3 of ABCDEG.

You can probably see where this is going. If F comes back before G has finished hydrating from its peers, then we make a second membership transition back to the ABCDEF formation. If it doesn’t, we can make a transition to the ABCDEG formation.

Additional failures are handled in a similar manner. Suppose we’re in the transition state with a write quorum of (4/6 of ABCDEF) AND (4/6 of ABCDEG) and, wouldn’t you know it, now there’s a problem with E! Meet H. We can transition to a write quorum that is (4/6 of ABCDEF AND 4/6 of ABCDEG) AND (4/6 of ABCDHF AND 4/6 of ABCDHG). Note that even here, simply writing to ABCD fulfils all four conditions.
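A small sketch of what the transitional write-quorum check looks like, using the membership sets from the worked example above (real Aurora also carries epoch numbers on every request, which is elided here):

```python
def is_write_quorum(acked_segments, memberships, needed=4):
    """True if the acked set satisfies a write quorum in *every* membership
    that is currently in play (i.e. during a transitional state)."""
    acked = set(acked_segments)
    return all(len(acked & set(m)) >= needed for m in memberships)

# Transitional state from the example: F suspected faulty, G being hydrated.
memberships = ["ABCDEF", "ABCDEG"]
assert is_write_quorum("ABCD", memberships)        # 4/6 of both memberships
assert not is_write_quorum("ACEF", memberships)    # only 3/6 of ABCDEG
```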

Heterogeneous quorums

Quorums are generally thought of as a collection of like members, grouped together to transparently handle failures. However, there is nothing in the quorum model to prevent unlike members with differing latency, cost, or durability characteristics.

Aurora exploits this to set up protection groups with three full segments, which store both redo log records and materialized data blocks, and three tail segments which store just the redo logs. Since data blocks typically take more space than redo logs the costs stay closer to traditional 3x replication than 6x.

With the split into full and tail segments a write quorum becomes any 4/6 segments, or all 3 full segments. (“In practice, we write log records to the same 4/6 quorum as we did previously. At least one of these log records arrives at a full segment and generates a data block“). A read quorum becomes 3/6 segments, to include at least one full segment. In practice though, data is read directly from a full segment avoiding the need for a quorum read – an optimisation we’ll look at next.

There are many options available once one moves to quorum sets of unlike members. One can combine local disks to reduce latency and remote disks for durability and availability. One can combine SSDs for performance and HDDs for cost. One can span quorums across regions to improve disaster recovery. There are numerous moving parts that one needs to get right, but the payoffs can be significant. For Aurora, the quorum set model described earlier lets us achieve storage prices comparable to low-cost alternatives, while providing high durability, availability, and performance.

Efficient reads

So we spent the previous paper and much of this one nodding along with read quorums that must overlap with write quorums, understanding the 4/6 and 3/6 requirements and so on, only to uncover the bombshell that Aurora doesn’t actually use read quorums in practice at all! What!? (Ok, it does use read quorums, but only during recovery).

The thing is there can be a lot of reads, and there’s an I/O amplification effect as a function of the size of the read quorum. Whereas with write amplification we’re sending compact redo log records, with reading we’re looking at full data blocks too. So Aurora avoids quorum reads.

Aurora does not do quorum reads. Through its bookkeeping of writes and consistency points, the database instance knows which segments have the last durable version of a data block and can request it directly from any of those segments… The database will usually issue a request to the segment with the lowest measured latency, but occasionally also query one of the others in parallel to ensure up to date read latency response times.

If a read is taking a long time, Aurora will issue a read to another storage node and go with whichever node returns first.

The bookkeeping that supports this is based on read views that maintain snapshot isolation using Multi-Version Concurrency Control (MVCC). When a transaction commits, the log sequence number (LSN) of its commit redo record is called the System Commit Number or SCN. When a read view is established we remember the SCN of the most recent commit, and the list of transactions active as of that LSN.

Data blocks seen by a read request must be at or after the read view LSN and back out any transactions either active as of that LSN or started after that LSN… Snapshot isolation is straightforward in a single-node database instance by having a transaction read the last durable version of a database block and apply undo to rollback any changes.
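In outline, the visibility check for a block version against a read view might look something like the following (a hedged sketch only; the actual MySQL/Aurora data structures and undo handling differ, and `read_view` here is a hypothetical object with the two fields described above):

```python
def visible_to_read_view(version_commit_lsn, version_txn, read_view):
    """Hedged sketch of snapshot-isolation visibility for a block version.

    `read_view.scn` is the most recent commit LSN when the view was
    established; `read_view.active_txns` is the set of transactions still
    active at that point. Applying undo to roll back invisible changes is
    elided.
    """
    if version_commit_lsn is None:                 # uncommitted change
        return False
    if version_commit_lsn > read_view.scn:         # committed after the view
        return False
    if version_txn in read_view.active_txns:       # active when view was taken
        return False
    return True
```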

Storage consistency points

Aurora is able to avoid much of the work of consensus by recognizing that, during normal forward processing of a system, there are local oases of consistency. Using backward chaining of redo records, a storage node can tell if it is missing data and gossip with its peers to fill in gaps. Using the advancement of segment chains, a database instance can determine whether it can advance durable points and reply to clients requesting commits. Coordination and consensus is rarely required….

Recall that the only writes which cross the network from the database instance to the storage node are log redo records. Redo log application code is run within the storage nodes to materialize blocks in the background or on-demand to satisfy read requests.

Log records form logical chains. Each log record stores the LSN of the previous log record in the volume, the previous LSN for the segment, and the previous LSN for the block being modified.

  • The block chain is used to materialise individual blocks on demand
  • The segment chain is used by each storage node to identify records it has not received and fill those holes via gossip
  • The full log chain provides a fallback path to regenerate storage volume metadata in case of a “disastrous loss of metadata state.”

…all log writes, including those for commit redo log records, are sent asynchronously to storage nodes, processed asynchronously at the storage node, and asynchronously acknowledged back to the database instance.

Individual nodes advance a Segment Complete LSN (SCL), representing the latest point in time for which it has received all log records. The SCL is sent as part of a write acknowledgement. Once the database instance has seen the SCL advance at 4/6 members of a protection group, it advances the Protection Group Complete LSN (PGCL) – the point at which the protection group has made all writes durable. In the following figure for example, the PGCL for PG1 is 103 (because 105 has not met quorum), and the PGCL for PG2 is 104.

The database advances a Volume Complete LSN (VCL) once there are no pending writes preventing the PGCL from advancing for any of its protection groups. No consensus is required to advance the SCL, PGCL, or VCL: this can all be done via local bookkeeping.
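Locally, the bookkeeping can be as simple as the following (a hedged sketch that ignores all the asynchrony: the PGCL is the highest LSN acknowledged as complete by a write quorum of segments, and one plausible reading of the VCL is the minimum across protection groups):

```python
def protection_group_complete_lsn(segment_complete_lsns, write_quorum=4):
    """PGCL: the highest LSN for which a write quorum of segments report
    completeness, i.e. the write_quorum-th largest SCL in the group."""
    return sorted(segment_complete_lsns, reverse=True)[write_quorum - 1]

def volume_complete_lsn(pgcls):
    """VCL sketch: the volume is complete only up to its slowest protection group."""
    return min(pgcls)

# In the spirit of the figure: only 3/6 segments have seen LSN 105,
# so PG1's PGCL stays at 103.
assert protection_group_complete_lsn([105, 105, 105, 103, 103, 103]) == 103
```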

This is possible because storage nodes do not have a vote in determining whether to accept a write, they must do so. Locking, transaction management, deadlocks, constraints, and other conditions that influence whether an operation may proceed are all resolved at the database tier.

Commits are acknowledged by the database once the VCL has advanced beyond the System Commit Number of the transaction’s commit redo log record.

Amazon Aurora: design considerations for high throughput cloud-native relational databases

March 25, 2019

Amazon Aurora: design considerations for high throughput cloud-native relational databases Verbitski et al., SIGMOD’17

Werner Vogels recently published a blog post describing Amazon Aurora as their fastest growing service ever. That post provides a high level overview of Aurora and then links to two SIGMOD papers for further details. Also of note is the recent announcement of Aurora serverless. So the plan for this week on The Morning Paper is to cover both of these Aurora papers and then look at Calvin, which underpins FaunaDB.

Say you’re AWS, and the task in hand is to take an existing relational database (MySQL) and retrofit it to work well in a cloud-native environment. Where do you start? What are the key design considerations and how can you accommodate them? These are the questions our first paper digs into. (Note that Aurora supports PostgreSQL as well these days).

Here’s the starting point:

In modern distributed cloud services, resilience and scalability are increasingly achieved by decoupling compute from storage and by replicating storage across multiple nodes. Doing so lets us handle operations such as replacing misbehaving or unreachable hosts, adding replicas, failing over from a writer to a replica, scaling the size of a database instance up or down, etc.

So we’re somehow going to take the backend of MySQL (InnoDB) and introduce a variant that sits on top of a distributed storage subsystem. Once we’ve done that, network I/O becomes the bottleneck, so we also need to rethink how chatty network communications are.

Then there are a few additional requirements for cloud databases:

  • SaaS vendors using cloud databases may have numerous customers of their own. Many of these vendors use a schema/database as the unit of tenancy (vs a single schema with tenancy defined on a per-row basis). “As a result, we see many customers with consolidated databases containing a large number of tables. Production instances of over 150,000 tables for small databases are quite common. This puts pressure on components that manage metadata like the dictionary cache.”
  • Customer traffic spikes can cause sudden demand, so the database must be able to handle many concurrent connections. “We have several customers that run at over 8000 connections per second.”
  • Frequent schema migrations for applications need to be supported (e.g. Rails DB migrations), so Aurora has an efficient online DDL implementation.
  • Updates to the database need to be made with zero downtime

The big picture for Aurora looks like this:

The database engine is a fork of “community” MySQL/InnoDB and diverges primarily in how InnoDB reads and writes data to disk.

There’s a new storage substrate (we’ll look at that next), which you can see at the bottom of the figure, isolated in its own storage VPC network. This is deployed on a cluster of EC2 VMs provisioned across at least 3 AZs in each region. The storage control plane uses Amazon DynamoDB for persistent storage of cluster and storage volume configuration, volume metadata, and S3 backup metadata. S3 itself is used to store backups.

Amazon RDS is used for the control plane, including the RDS Host Manager (HM) for monitoring cluster health and determining when failover is required.

It’s nice to see Aurora built on many of the same foundational components that are available to us as end users of AWS too.

Durability at scale

The new durable, scalable storage layer is at the heart of Aurora.

If a database system does nothing else, it must satisfy the contract that data, once written, can be read. Not all systems do.

Storage nodes and disks can fail, and at large scale there’s a continuous low level background noise of node, disk, and network path failures. Quorum-based voting protocols can help with fault tolerance. With V copies of a replicated data item, a read must obtain V_r votes, and a write must obtain V_w votes. Each write must be aware of the most recent write, which can be achieved by configuring V_w > V/2. Reads must also be aware of the most recent write, which can be achieved by ensuring V_r + V_w > V. A common approach is to set V = 3 and V_r = V_w = 2.
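These two constraints are easy to express as a quick sanity check (a tiny sketch using V, V_w, and V_r as defined above):

```python
def valid_quorum(v, v_w, v_r):
    """Write quorums must overlap each other (V_w > V/2) and every read
    quorum must overlap every write quorum (V_r + V_w > V)."""
    return 2 * v_w > v and v_r + v_w > v

assert valid_quorum(v=3, v_w=2, v_r=2)      # the classic 2/3 configuration
assert valid_quorum(v=6, v_w=4, v_r=3)      # Aurora's 6-way configuration
assert not valid_quorum(v=6, v_w=3, v_r=3)  # writes would not overlap
```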

We believe 2/3 quorums are inadequate [even when the three replicas are each in a different AZ]… in a large storage fleet, the background noise of failures implies that, at any given moment in time, some subset of disks or nodes may have failed and are being repaired. These failures may be spread independently across nodes in each of AZ A, B, and C. However, the failure of AZ C, due to a fire, roof failure, flood, etc., will break quorum for any of the replicas that concurrently have failures in AZ A or AZ B.

Aurora is designed to tolerate the loss of an entire AZ plus one additional node without losing data, and an entire AZ without losing the ability to write data. To achieve this data is replicated six ways across 3 AZs, with 2 copies in each AZ. Thus V = 6; V_w is set to 4, and V_r is set to 3.

Given this foundation, we want to ensure that the probability of double faults is low. Past a certain point, reducing MTTF is hard. But if we can reduce MTTR then we can narrow the ‘unlucky’ window in which an additional fault will trigger a double fault scenario. To reduce MTTR, the database volume is partitioned into small (10GB) fixed size segments. Each segment is replicated 6-ways, and the replica set is called a Protection Group (PG).

A storage volume is a concatenated set of PGs, physically implemented using a large fleet of storage nodes that are provisioned as virtual hosts with attached SSDs using Amazon EC2… Segments are now our unit of independent background noise failure and repair.

Since a 10GB segment can be repaired in 10 seconds on a 10Gbps network link, it takes two such failures in the same 10 second window, plus a failure of an entire AZ not containing either of those two independent failures to lose a quorum. “At our observed failure rates, that’s sufficiently unlikely…

This ability to tolerate failures leads to operational simplicity:

  • hotspot management can be addressed by marking one or more segments on a hot disk or node as bad, and the quorum will quickly be repaired by migrating it to some other (colder) node
  • OS and security patching can be handled like a brief unavailability event
  • Software upgrades to the storage fleet can be managed in a rolling fashion in the same way.

Combating write amplification

A six-way replicating storage subsystem is great for reliability, availability, and durability, but not so great for performance with MySQL as-is:

Unfortunately, this model results in untenable performance for a traditional database like MySQL that generates many different actual I/Os for each application write. The high I/O volume is amplified by replication.

With regular MySQL, there are lots of writes going on as shown in the figure below (see §3.1 in the paper for a description of all the individual parts).

Aurora takes a different approach:

In Aurora, the only writes that cross the network are redo log records. No pages are ever written from the database tier, not for background writes, not for checkpointing, and not for cache eviction. Instead, the log applicator is pushed to the storage tier where it can be used to generate database pages in background or on demand.

Using this approach, a benchmark with a 100GB data set showed that Aurora could complete 35x more transactions than a mirrored vanilla MySQL in a 30 minute test.

Using redo logs as the unit of replication means that crash recovery comes almost for free!

In Aurora, durable redo record application happens at the storage tier, continuously, asynchronously, and distributed across the fleet. Any read request for a data page may require some redo records to be applied if the page is not current. As a result, the process of crash recovery is spread across all normal foreground processing. Nothing is required at database startup.

Furthermore, whereas in a regular database more foreground requests also mean more background writes of pages and checkpointing, Aurora can reduce these activities under burst conditions. If a backlog does build up at the storage system then foreground activity can be throttled to prevent a long queue forming.

The complete IO picture looks like this:

Only steps 1 and 2 above are in the foreground path.

The distributed log

Each log record has an associated Log Sequence Number (LSN) – a monotonically increasing value generated by the database. Storage nodes gossip with other members of their protection group to fill holes in their logs. The storage service maintains a watermark called the VCL (Volume Complete LSN), which is the highest LSN for which it can guarantee availability of all prior records. The database can also define consistency points through consistency point LSNs (CPLs). A consistency point is always less than the VCL, and defines a durable consistency checkpoint. The most recent consistency point is called the VDL (Volume Durable LSN). This is what we’ll roll back to on recovery.
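In other words, the VDL is just the highest consistency point that has not outrun the VCL. A hedged sketch of that relationship (the example values are illustrative, not taken from a real log):

```python
def volume_durable_lsn(cpls, vcl):
    """Hedged sketch: the VDL is the highest consistency point (CPL) that is
    not beyond the VCL; on recovery the log is truncated above it."""
    eligible = [cpl for cpl in cpls if cpl <= vcl]
    return max(eligible) if eligible else None

# If CPLs were issued at 900, 1000 and 1100 but the VCL is only 1007,
# recovery rolls the volume back to the consistency point at 1000.
assert volume_durable_lsn([900, 1000, 1100], vcl=1007) == 1000
```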

The database and storage subsystem interact as follows:

  • Each database-level transaction is broken up into multiple mini-transactions (MTRs) that are ordered and must be performed atomically
  • Each mini-transaction is composed of multiple contiguous log records
  • The final log record in a mini-transaction is a CPL

When writing, there is a constraint that no LSN be issued which is more than a configurable limit— the LSN allocation limit— ahead of the current VDL. The limit is currently set to 10 million. It creates a natural form of back-pressure to throttle incoming writes if the storage or network cannot keep up.

Reads are served from pages in a buffer cache and only result in storage I/O requests on a cache miss. The database establishes a read point: the VDL at the time the request was issued. Any storage node that is complete with respect to the read point can be used to serve the request. Pages are reconstructed using the same log application code.

A single writer and up to 15 read replicas can all mount a single shared storage volume. As a result, read replicas add no additional costs in terms of consumed storage or disk write operations.

Aurora in action

The evaluation in section 6 of the paper demonstrates the following:

  • Aurora can scale linearly with instance sizes, and on the highest instance size can sustain 5x the writes per second of vanilla MySQL.
  • Throughput in Aurora significantly exceeds MySQL, even with larger data sizes and out-of-cache working sets:

  • Throughput in Aurora scales with the number of client connections:

  • The lag in an Aurora read replica is significantly lower than that of a MySQL replica, even with more intense workloads:

  • Aurora outperforms MySQL on workloads with hot row contention:

Customers migrating to Aurora see lower latency and practical elimination of replica lag (e.g., from 12 minutes to 20ms).

Slim: OS kernel support for a low-overhead container overlay network

March 22, 2019

Slim: OS kernel support for a low-overhead container overlay network Zhuo et al., NSDI’19

Container overlay networks rely on packet transformations, with each packet traversing the networking stack twice on its way from the sending container to the receiving container.

There are CPU, throughput, and latency overheads associated with those traversals.

In this paper, we ask whether we can design and implement a container overlay network, where packets go through the OS kernel’s network stack only once. This requires us to remove packet transformation from the overlay network’s data-plane. Instead, we implement network virtualization by manipulating connection-level metadata at connection setup time, saving CPU cycles and reducing packet latency.

Slim comes with some caveats: it requires a kernel module for secure deployment, has longer connection establishment times, doesn’t fit with packet-based network policies, and only handles TCP traffic. For UDP, ICMP, and for its own service discovery, it also relies on an existing container overlay network (Weave Net). But for longer lasting connections managed using connection-based network policies it delivers some impressive results:

  • memcached throughput up by 71%, with latency reduced by 42%, and CPU utilisation reduced by 56%.
  • Nginx CPU utilisation reduced by 22-24%
  • PostgreSQL CPU utilisation reduced by 22%
  • Apache Kafka CPU utilisation reduced by 10%

Since Slim both builds on and compares to Weave Net, I should say at this point that Weave Net was the very first open source project from Weaveworks, the “GitOps for Kubernetes” company. Accel is an investor in Weaveworks, and I am also a personal investor. If you’re using Kubernetes, you should definitely check them out. Anyway, on with the show…

Container networking

In theory there are four possible modes for container networking: a bridge mode for containers on the same host; host mode in which containers use the IP address of their host network interface; macvlan mode (or similar hardware mechanisms) to give each container its own IP address; and overlay mode in which each container is given its own virtual network interface and each application has its own network namespace.

In practice, there are management and deployment challenges with the bridge, host, and macvlan approaches, so overlay networks such as Weave Net are the preferred solution.

Overlay packets are encapsulated with host network headers when routed on the host network. This lets the container overlay network have its own IP address space and network configuration that is disjoint from that of the host network; each can be managed completely independently. Many container overlay network solutions are available today— such as Weave, Flannel, and Docker Overlay— all of which share similar internal architectures.

Overlay network overheads

The overheads in overlay networking come from the per-packet processing inside the OS kernel: delivering a packet on the overlay network requires one extra traversal of the network stack and also packet encapsulation and decapsulation.

Here are some test measurements comparing Weave Net in fast dataplane mode to host networking to give an example:

In this test, compared to direct host networking, for two containers on the same host (Intra) the overlay network reduces throughput by 23% and increases latency by 34%. For containers communicating across hosts (Inter), throughput reduces by 48% and latency increases by 85%. The overheads are lower when communicating on the same host since packet encapsulation is not required.

Compared to host networking, the CPU utilisation also increases by 93%.

There are several known techniques to reduce the data plane overhead. Packet steering creates multiple queues, one per CPU core, for a network interface and uses consistent hashing to map packets to different queues. In this way, packets in the same network connection are processed only on a single core. Different cores therefore do not have access to the same queue, removing the overhead due to multi-core synchronization.
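The essence of the mapping is a flow hash over the connection 4-tuple, so every packet of a flow lands on the same per-core queue. A hedged illustration (the kernel’s RPS/RFS implementation uses its own flow hash in C, not anything like hashlib; this just shows the idea):

```python
import hashlib

def queue_for_packet(src_ip, src_port, dst_ip, dst_port, n_queues):
    """Illustrative sketch: hash the connection 4-tuple to a queue index so
    all packets of one flow are processed on the same core."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_queues

# All packets of one connection map to the same queue/core.
q = queue_for_packet("10.0.0.1", 40000, "10.0.0.2", 6379, n_queues=8)
assert q == queue_for_packet("10.0.0.1", 40000, "10.0.0.2", 6379, n_queues=8)
```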

The authors integrated the above Receive Packet Steering (RPS), and also an enhancement called Receive Flow Steering (RFS), which further ensures that interrupt processing occurs on the same core as the application, into Weave Net. With this enhancement, throughput is within 9% of that achieved with host networking, but it makes almost no difference to latency.

Introducing Slim

The big idea in Slim is to reduce CPU utilisation and latency overheads by having packets traverse the network stack only once. That means you can’t do per-packet processing. Instead, Slim works at the connection level.

Slim virtualizes the network by manipulating connection-level metadata. SlimSocket exposes the POSIX socket interface to application binaries to intercept invocations of socket-related system calls. When SlimSocket detects an application is trying to set up a connection, it sends a request to SlimRouter. After SlimRouter sets up the network connection, it passes access to the connection as a file descriptor to the process inside the container. The application inside the container then uses the host namespace file descriptor to send/receive packets directly to/from the host network. Because SlimSocket has the exact same interface as the POSIX socket, and Slim dynamically links SlimSocket into the application, the application binary need not be modified.
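The enabling OS primitive here is ordinary file-descriptor passing over a Unix domain socket (SCM_RIGHTS). A minimal sketch of that primitive in Python (3.9+, Unix only): this shows the mechanism, not Slim’s actual code, and `unix_sock`/`host_conn` are placeholders for sockets the caller has already set up.

```python
import socket

def send_connection(unix_sock, host_conn):
    """Router side: hand an established host-namespace TCP connection
    (its file descriptor) to the process inside the container."""
    socket.send_fds(unix_sock, [b"conn"], [host_conn.fileno()])

def receive_connection(unix_sock):
    """Container side: receive the descriptor and wrap it in a socket object;
    subsequent send/recv then bypass the overlay data plane entirely."""
    _, fds, _, _ = socket.recv_fds(unix_sock, 1024, 1)
    return socket.socket(fileno=fds[0])
```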

Given that Slim is out of the picture once the connection is established, a separate mechanism is needed to support control plane and data plane policies. SlimRouter stores control plane policies and enforces them at connection setup time. If the policy changes, SlimRouter scans existing connections and removes the file descriptors for any connection violating the new policy. This requires the support of a kernel module, SlimKernModule. To avoid containers learning the IP addresses of host machines, SlimKernModule (in secure mode) also prohibits unsafe system calls on file descriptors (e.g. getpeername). Existing kernel mechanisms are used to enforce data plane policies.

This is what it looks like when Slim is used with blocking I/O:

Calls to the POSIX socket interface are intercepted by the SlimSocket shim and forwarded to the SlimRouter. For non-blocking I/O (e.g., select, epoll) Slim also intercepts these API calls and maintains mappings for epoll file descriptor sets. The SlimRouter needs to know the IP address and port mappings in order to establish connections. It does this using a Weave Net overlay network!

When the client calls connect, it actually creates an overlay network connection on the standard container overlay network. When the server receives an incoming connection on the standard overlay network, SlimSocket queries SlimRouter for the physical IP address and port and sends them to the client side inside the overlay connection. In secure mode (§4.3), the result queried from SlimRouter is encrypted. SlimSocket on the client side sends the physical IP address and port (encrypted if in secure mode) to its SlimRouter and the SlimRouter establishes the host connection. This means connection setup time is longer in Slim than that on container overlay networks based on packet transformation.

Weave Net is also used for packets that require data plane handling such as ICMP and UDP.

Evaluation

Microbenchmarks compare Slim to Weave Net with RFS. Creating a TCP connection with Weave Net takes 270µs; with Slim it takes 556µs (440µs in insecure mode). For applications with persistent connections, this additional overhead will not be significant. Compared to Weave Net, Slim reduces CPU overhead by 22-41%.

Slim and Weave Net are then further compared on four application workloads based on Memcached, Nginx, PostgreSQL, and Apache Kafka respectively.

For Memcached, Slim improves throughput by 71% and reduces latency by 42%, with 25% lower CPU utilisation.

For Nginx and PostgreSQL, the main advantage of Slim is reduced CPU utilisation (around a 22% reduction). For Kafka the CPU utilisation reduction is around 10%, but latency is also reduced by 28%.

Slim’s source code is available at https://github.com/danyangz/slim.

Understanding lifecycle management complexity of datacenter topologies

March 20, 2019

Understanding lifecycle management complexity of datacenter topologies Zhang et al., NSDI’19

There has been plenty of interesting research on network topologies for datacenters, with Clos-like tree topologies and Expander based graph topologies both shown to scale using widely deployed hardware. This research tends to focus on performance properties such as throughput and latency, together with resilience to failures. Important as these are, note that they're also what's right in front of you as a designer, and relatively easy to measure. The great thing about today's paper is that the authors look beneath the surface to consider the less visible but still very important "lifecycle management" implications of topology design. In networking, this translates into how easy it is to physically deploy the network, and how easy it is to subsequently expand. They find a way to quantify the associated lifecycle management costs, and then use this to help drive the design of a new class of topologies, called FatClique.

… we show that existing topology classes have low lifecycle management complexity by some measures, but not by others. Motivated by this, we design a new class of topologies, FatClique, that, while being performance-equivalent to existing topologies, is comparable to, or better than them by all our lifecycle management complexity metrics.

Now, there’s probably only a relatively small subset of The Morning Paper readers involved in designing and deploying datacenter network topologies. So my challenge to you as you read through this paper, is to think about where the hidden complexity and costs are in your own systems. Would you do things differently if these were made more visible? It would be great to see more emphasis for example on things like developer experience (DX) and operational simplicity – in my experience these kinds of attributes can have an outsize impact on the long-term success of a system.

Anyway, let’s get back to cables and switches…

Physically deploying network topologies

When it comes to laying out a network topology for real in a datacenter, you need to think about packaging, placement, and bundling. Packaging is how you group things together, e.g. the arrangement of switches in racks, and placement concerns how these racks are physically placed on the datacenter floor. Placement in turn determines the kinds of cabling you need, and for optical cables the power of the transceivers. Within a rack we might package several connected switches into a single chassis using a backplane. At the other end of the scale, blocks are larger units of co-placement and packaging that combine several racks. With all those connections, it makes things a lot easier to group together multiple fibres all connecting the same two endpoints (racks) into bundles, which contain a fixed number of identical length fibres.

Manufacturing bundles is simpler than manufacturing individual fibres, and handling such bundles significantly simplifies operational complexity.

Patch panels make bundling easier by providing a convenient aggregation point to create and route bundles. Bundles and fibres are physically routed through the datacenter on cable trays. The trays themselves have capacity constraints of course.

Here’s an example of a logical Clos topology and its physical instantiation:

The authors identify three key metrics that together capture much of the deployment complexity in a topology:

  1. The number of switches. More switches equals more packaging complexity.
  2. The number of patch panels, which is a function of topological structure and a good proxy for wiring complexity.
  3. The number of bundle types. This metric captures the other important part of wiring complexity – how many distinct bundle types are needed. A bundle type is represented by its capacity (how many fibres) and its length.

These complexity measures are complete. The number of cable trays, the design of the chassis, and the number of racks can be derived from the number of switches (and the number of servers and the datacenter floor dimensions, which are inputs to the topology design). The number of cables and transceivers can be derived from the number of patch panels.
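
The third metric is worth a tiny worked example, since it's the least familiar. A bundle type is just a distinct (capacity, length) pair, so counting bundle types means counting distinct pairs across the whole physical layout (toy numbers below, not from the paper):

```python
# A toy illustration (not from the paper) of the bundle-types metric: a bundle
# type is identified by its (capacity, length) pair, so the metric is simply the
# number of distinct such pairs across the layout.
from collections import Counter

# Hypothetical bundles: (fibre_count, length_in_metres) for each rack-pair bundle.
bundles = [(16, 10), (16, 10), (16, 25), (64, 25), (64, 25), (16, 10)]

bundle_types = Counter(bundles)
print(len(bundle_types))   # number of bundle types -> 3
print(bundle_types)        # how many bundles of each type
```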

Here’s how Clos and Expander (Jellyfish) representative topologies for the same number of servers stack up against these metrics:

The expander graph topology shows much higher deployment complexity in terms of the number of bundle types. Clos also exposes far fewer ports outside of a rack (it has better port hiding).

Expanding existing networks

When you want to expand an existing network, first you need to buy all the new gear and lay it out on the datacenter floor, and then you can begin a re-wiring process. This is all going on with live traffic flowing, so expansion is carried out in steps. During each step the capacity of the topology is guaranteed to be at least some percentage of the existing topology capacity. This percentage is sometimes known as the expansion SLO.

During a step, the existing links to be re-wired are drained, then human operators physically rewire links at patch panels. The new links are tested and then undrained (strange word!), i.e., brought into service.
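
As a back-of-the-envelope way to see why the expansion SLO matters (my own simplification, not the paper's model): if each step may drain at most a fraction 1 − SLO of the live capacity, then the number of steps is bounded below by the capacity touched by the rewiring divided by that per-step budget.

```python
# A back-of-the-envelope sketch (my own simplification, not the paper's model) of
# how the expansion SLO drives the number of expansion steps.
import math

def min_expansion_steps(links_to_rewire: int, total_links: int, slo: float) -> int:
    per_step_budget = (1.0 - slo) * total_links   # links we may drain at once
    return math.ceil(links_to_rewire / per_step_budget)

print(min_expansion_steps(links_to_rewire=400, total_links=4000, slo=0.90))   # -> 1
print(min_expansion_steps(links_to_rewire=400, total_links=4000, slo=0.975))  # -> 4
```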

For example, here’s a logical expansion (top row) and its physical realisation:

The most time-consuming part of all this is the physical rewiring. The two metrics that capture expansion complexity are therefore:

  1. The number of expansion steps, and
  2. The average number of rewired links in a patch panel rack.

Here’s how Clos and Expander stack up on those metrics for the same networks we saw earlier:

This time the victory goes to Expander (Jellyfish). Jellyfish has a much higher north-to-south capacity ratio. Northbound links exit a block, and southbound links are to/from servers within a block. “Fat edges” have more northbound than southbound links, and the extra capacity means you can accomplish more movement in each step.

Clos topologies re-wire more links in each patch panel during an expansion step and require many steps because they have a low north-south capacity ratio.

Enter the FatClique

Inspired by these insights, the authors define a new class of topologies called FatClique, which combines the hierarchical structure of Clos with the edge expansion capabilities of expander graphs.

There are three levels in the hierarchy. A clique of switches forms a sub-block. Cliques of sub-blocks come together to form blocks. And cliques of blocks come together to form the full FatClique topology.

Four key design variables determine the particular instantiation of a FatClique topology: the number of ports in a switch that connect to servers; the number of ports in a switch that connect to other sub-blocks in a block; the number of switches in a sub-block; and the number of sub-blocks in a block. A synthesis algorithm takes a set of six input constraints (see §5.1) and determines the values for these four design variables.
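
To make the hierarchy concrete, here's a toy sketch of how those four variables pin down block sizes. The variable names are mine, and the assumption that each switch spends `switches_per_subblock - 1` ports on its intra-sub-block clique links is my own inference from the clique structure; the paper's synthesis algorithm does considerably more than this.

```python
# A toy sketch (not the paper's synthesis algorithm) of how the four FatClique
# design variables, plus the switch radix, pin down block sizes.
def fatclique_block_sizes(radix: int,
                          ports_to_servers: int,
                          ports_to_other_subblocks: int,
                          switches_per_subblock: int,
                          subblocks_per_block: int):
    # Assumption: each switch connects once to every other switch in its sub-block.
    intra_subblock_ports = switches_per_subblock - 1
    # Whatever is left over on each switch can go to block-to-block (clique) links.
    ports_to_other_blocks = (radix - ports_to_servers
                             - ports_to_other_subblocks - intra_subblock_ports)
    switches_per_block = switches_per_subblock * subblocks_per_block
    servers_per_block = switches_per_block * ports_to_servers
    return ports_to_other_blocks, switches_per_block, servers_per_block

# Example with made-up numbers: a 32-port switch with 8 server-facing ports,
# 8 ports to other sub-blocks, 8 switches per sub-block, 4 sub-blocks per block.
print(fatclique_block_sizes(32, 8, 8, 8, 4))   # -> (9, 32, 256)
```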

There is plenty more detail in section 5 of the paper which I don't have the space to do justice to here.

FatClique vs Clos vs Expander

The evaluation compares FatClique to Clos, Xpander, and Jellyfish at different network topology sizes, as shown in the table below.


Here’s how they stack up against the complexity metrics:

Number of switches

Number of patch panels

Number of bundle types

and associated cabling costs:

Number of expansion steps

Average number of rewired links

We find that FatClique is the best at most scales by all our complexity metrics. (The one exception is that at small and medium scales, Clos has slightly fewer patch panels). It uses 50% fewer switches and 33% fewer patch panels than Clos at large scale, and has a 23% lower cabling cost (an estimate we were able to derive from published cable prices). Finally, FatClique can permit fast expansion while degrading network capacity by small amounts (2.5-10%): at these levels, Clos can take 5x longer to expand the topology, and each step of Clos expansion can take longer than FatClique because the number of links to be rewired at each step per patch panel can be 30-50% higher.

The one thing I couldn’t find in the evaluation is any data to back up the opening claim that FatClique achieves all of this “while being performance-equivalent to existing topologies.”

The last word

As the management complexity of networks increases, the importance of designing for manageability will increase in the coming years. Our paper is only a first step in this direction…