
Veritas: shared verifiable databases and tables in the cloud

January 30, 2019

Veritas: shared verifiable databases and tables in the cloud Allen et al., CIDR’19

Two (or more) parties want to transact based on the sharing of information (e.g. current offers). In order to have trust in the system and provide a foundation for resolving disputes, we’d like a tamperproof and immutable audit log of all shared data and actions, such that an independent auditor can reconstruct the state of the system at any point in time.

Enter the blockchain?! Not so fast, say Allen et al.: blockchain technology as we know it today is ‘one step forward, two steps back’ ;).

Today, for gaining immutability and auditability with new blockchain platforms, we give up decades of research in data management— and hardened, enterprise-ready code that implements these ideas.

We’d still like to be able to use SQL, for example. We want transaction throughput much closer to that of a traditional database, and we want to take advantage of query optimisation and sophisticated query processing engines. We could try adding database-like features to blockchain systems, but that looks to be a long road:

There are now a gazillion start-ups that are adding these basic database features to blockchains, but it will take years if not decades to catch up.

How about trying it the other way round then? Start with a mature database system, and add a sprinkling of blockchain?

Instead of adding database capabilities to blockchains, we propose to address the problem from the opposite approach: we add trust and auditability to existing database management systems.

The key notions in the paper are verifiable databases and verifiable tables. A verifiable database has all the features of a regular database, but in addition it supports tamper-evident collaboration across mutually untrusted entities.

The idea of a shared verifiable table goes one step further: integrating a special table directly into the existing databases of the transacting parties. The same instance of the table is visible to all parties, and all activities are written to a tamper-proof log. There is an N:1 relationship between shared tables and tamper-proof logs.

Verifiable databases (and tables) provide a set of cryptographic guarantees:

  • each party can verify the actions (updates) of all other parties and provide proof of its own actions
  • all parties can verify that the state of the shared database (or table) and its responses to queries are consistent with the prior actions of legitimate actors
  • unauthorized parties (hackers or operators with administrative privileges) cannot tamper with the state of the verifiable database (table) without being detected by the verification mechanism

So we’re looking at a permissioned system supporting a set of verifiers. The assumption in this work is that verifiers have access to the full log. Confidentiality is an orthogonal concern that could be addressed by frameworks such as Coco, Quorum, Spice, or [Corda](https://www.corda.net/).

Introducing Veritas*

(* Yes, that’s a very confusing name that I don’t think has anything to do with Veritas the data management company).

If we want to retain database-like throughput and latency then we can’t write everything into a blockchain. So instead we need to use the blockchain as the anchor of trust.

At a high level, we’ll let verifiers verify the log produced by the database and then write their votes into the blockchain. Instead of acknowledging each log record individually, they can further reduce blockchain writes by batching – essentially recording the high-watermark of the log position to which they have verified. With the votes of the verifiers and a copy of the original log all parties can reconcile the history using a protocol called ‘Caesar Consensus’ first devised for Volt.
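To make that concrete, here is a rough Python sketch (names, structures, and the `blockchain.append` interface are my own invention, not the paper's) of how a verifier might batch its acknowledgements into a single high-watermark vote:

```python
# Hypothetical sketch of a verifier batching acknowledgements as high-watermark votes.
import hashlib

class Verifier:
    def __init__(self, verifier_id, blockchain, batch_size=1000):
        self.verifier_id = verifier_id
        self.blockchain = blockchain        # assumed to expose append(vote)
        self.batch_size = batch_size
        self.verified_upto = 0              # high-watermark of verified log positions
        self.running_hash = b""

    def verify(self, position, record):
        # Check the record (bytes) against local state/signatures (details elided),
        # then fold it into a running hash so the vote commits to the whole log prefix.
        self.running_hash = hashlib.sha256(self.running_hash + record).digest()
        self.verified_upto = position
        # Instead of acknowledging every record, post one vote per batch.
        if position % self.batch_size == 0:
            self.blockchain.append({
                "verifier": self.verifier_id,
                "upto": self.verified_upto,
                "digest": self.running_hash.hex(),
            })
```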

To avoid the overhead of shipping logs to the verifiers, one interesting option is to run a secure enclave alongside the database that performs verification, and just have all of the verifiers attest to the integrity of the trusted execution environment as it starts up.

In terms of the content of the log itself, Veritas opts for a fine-grained log with the verifier checking every record that is read or created as an intermediate or final query result. This makes the verifier simpler, giving it a smaller TCB and footprint.

The verifier of the Concerto key-value store which is the design we adopted for Veritas, for instance, is composed of two simple AES blocks implemented in an FPGA and provably verifies the effects of a key-value store with hundreds of thousands of lines of code. It is also fully streaming and only keeps a few hashes as internal state.

Veritas also uses deferred verification (that is, transaction commit is not dependent on successful verification). The interesting argument is that online verification would give no stronger guarantee of detecting malicious behaviour, since even in the online case a transaction failure could be forced by other changes outside of the transaction environment – e.g. by an earlier change that causes an integrity constraint to be violated. In this instance, just rolling back the transaction in question would not be enough.

So far we’ve been talking about verifiable databases. Verifiable tables are a way to integrate data from a shared, verifiable database into a local database.

Conceptually, the big difference between verifiable databases and verifiable tables is concurrency control. Concurrency control is centralized in a verifiable database and implemented in the same way as in any traditional database system… In contrast, concurrency control is fundamentally distributed in a system that supports shared, verifiable tables.

The Veritas prototype happens to use a distributed transaction scheme based on a centralised ordering service, but any suitable scheme could be plugged in here.

Evaluation

The Veritas prototype is ~1,500 lines of C# code. The evaluation is focused on exploring verification overheads as a function of the number of nodes in the system. Experiments use the YCSB benchmark and a single shared table with one million transactions evenly distributed across all nodes. Three different workloads are used. Workload A is an update-heavy workload. Workload B is read-heavy, and workload C is read-only.

The following chart shows throughput for each workload as we vary the number of nodes:

It’s a good news, bad news story:

…even for read-only workloads (C), the best case for the optimistic concurrency control scheme currently implemented in Veritas, the overhead is substantial. Nevertheless, Figure 8 (above) also shows that the overhead is not catastrophic; even with the simple scheme currently used in Veritas it is possible to achieve tens of thousands of transactions per second.

When we hold the number of nodes at 4 and vary the database size, we can explore the relative impact of concurrency control and verification. (Smaller database sizes lead to a greater number of conflicts.)

… verification and distributed concurrency control has a price… but the bottlenecks of a database system that uses verification (and distributed trust) are the same as the bottlenecks of a traditional database system.

The case for network-accelerated query processing

January 28, 2019

The case for network-accelerated query processing Lerner et al., CIDR’19

Datastores continue to advance on a number of fronts. Some of those that come to mind are adapting to faster networks (e.g. ‘FARM: Fast Remote Memory’) and persistent memory (see e.g. ‘Let’s talk about storage and recovery methods for non-volatile memory database systems’), deeply integrating approximate query processing (e.g. ‘ApproxHadoop: Bringing approximations to MapReduce frameworks’ and ‘BlinkDB’), embedding machine learning in the core of the system (e.g. ‘SageDB’), and offloading processing into the network (e.g. KV-Direct) — one particular example of exploiting hardware accelerators. Today’s paper gives us an exciting look at the untapped potential for network-accelerated query processing. We’re going to need all that data structure synthesis and cost-model based exploration coupled with self-learning to unlock the potential that arises from all of these advances in tandem!

NetAccel uses programmable network devices to offload some query patterns for MPP databases into the switch.

Thus, for the first time, moving data through networking equipment can contribute to query execution. Our preliminary results show that we can improve response times on even the best agreed upon plans by more than 2x using 25Gbps networks. We also see the promise of linear performance improvement with faster speeds. [Emphasis mine]

That promise of linear performance hints at an 8x speedup with a 100Gbps network in a data center!

Programmable switches and MAUs

Modern programmable network devices have packet-processing hardware in the form of Match-Action Units or MAUs. A MAU combines a match engine (is this a packet of interest?), with an action engine (what should I do with it?). The match engine holds data in a table format. MAUs are programmable in the sense that the table layout, type of lookup to perform, and processing done at a match event can all be specified. Combining several MAUs together in pipeline fashion yields a programmable dataplane.

When programming MAUs we must take care to ensure progress can be made at line speed. This requires constant-time lookups and deterministic actions. For example, no loops or dynamic resource allocation are permitted, as these could make a program’s runtime non-deterministic. The P4 language can be used to program collections of MAUs, and its language restrictions and compiler ensure that a given program can run properly on a given target. Current hardware devices have e.g. 12-20 MAUs.
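As a purely conceptual sketch (real MAUs are programmed in P4 and must execute in constant time at line rate; the names and packet fields below are invented), a match-action pipeline looks something like this:

```python
# Conceptual model of a match-action pipeline (illustrative only).
class MAU:
    def __init__(self, match_fn, table, action_fn):
        self.match_fn = match_fn    # extracts a lookup key from the packet
        self.table = table          # key -> table entry (the match engine's state)
        self.action_fn = action_fn  # rewrites the packet given the matched entry

    def process(self, packet):
        entry = self.table.get(self.match_fn(packet))
        return self.action_fn(packet, entry) if entry is not None else packet

# A programmable dataplane is several MAUs applied in pipeline order.
def pipeline(maus, packet):
    for mau in maus:
        packet = mau.process(packet)
    return packet

# Example: a single MAU that assigns an egress port for matching destinations.
forward = MAU(lambda p: p["dst_ip"], {"10.0.0.2": {"port": 7}},
              lambda p, e: {**p, "egress_port": e["port"]})
print(pipeline([forward], {"dst_ip": "10.0.0.2", "payload": b"tuple"}))
```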

Processing in the network

Traditionally, the fastest plans in MPP databases are those with the least amount of data movement across nodes, and network switches are passive elements that just route packets. The tuples generated by query execution are an opaque payload from the perspective of the network.

A query such as Query 20 from the TPC-H benchmark (Fig 2a), which joins five relations, has a query plan designed to minimise data motion (Fig 2b). With programmable switches though, it becomes possible to implement some parts of the query in the network. This requires an updated query plan (Fig 2c) to take advantage of the possibility.



For the first time, entire segments of a query plan can be performed on the switch, with potentially strong consequences to data placement, query execution, and optimization. The tuples are processed on the switch at “line-speed” – up to 100 Gbps in our case – yielding performance improvements that increase with network speed.

Introducing NetAccel

NetAccel is an MPP database (Greenplum) extended with three novel components:

  • a set of network accelerated query operators that can be instantiated on a switch and combined to implement segments of a query
  • a deparser that takes care of the communication between the MPP nodes and the switch
  • a network scheduler (controller) that identifies appropriate queries, determines the placement of network-accelerated operations, and monitors their execution

The network scheduler decides how the MAUs will be organised. One strategy is to implement fixed common query patterns, e.g. join-and-group-by. Alternatively, the MAU allocation could be adaptive. In the prototype a fixed join-and-group-by pattern is used. The scheduler also puts in place a strategy to deal with overflowing tuples: that is, what should happen if incoming data tuples do not fit in the data space allocated at the MAU. One simple strategy is to overflow to the control plane, which can be reached simply by routing packets to it. Once a tuple has overflowed, any further operation on that tuple is also redirected.

The deparser manages communication between the MPP node and the switch. This includes the tuples themselves, as well as the instructions for what is to be done with them.

An important consideration here is the choice of a network protocol. For instance, simply making a tuple be the payload of a TCP packet would not work. The switch drops many packets… and moreover creates new packets dynamically… TCP being stateful, making such changes to the packet flow without the receiver treating them as anomalies would be an overhead. Better network protocol stacks exist for our case. We could use a traditional IP stack and a connectionless protocol such as UDP. Or use a lightweight protocol directly atop of Ethernet. We are currently exploring the latter…

At 100 Gbps and tuples of less than 40 bytes, we may need to forward more than 148M tuples per second per port. Normal OS and TCP/IP stacks cannot operate at this pace: NetAccel bypasses both.

The fun part of course is the network-accelerated operators themselves. The authors discuss operators for hash-joins, hash-based aggregation, data motion, and data reloading.

Hash-joins require as a minimum two MAUs: one to store a hash table, and one to keep track of overflow. But we can extend capacity by using a chain of MAUs to store larger tables in each case. Each additional MAU in the chain stores an additional position in the collision chain for each location in the hash table.

If insertion fails at all MAUs (the collision chain is full), then the packet is overflowed. Packet metadata is updated to indicate that an insertion was not possible, and we also record that the chain to which the packet hashes is full. The full algorithm looks like this:
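The paper presents the full algorithm as a figure (not reproduced here). A rough Python sketch of the insertion logic just described, with slot counts and field names assumed, might look like this:

```python
# Sketch of build-side insertion across a chain of MAU hash tables (illustrative;
# the real implementation runs as P4 match-action stages on the switch).
NUM_SLOTS = 65536

def insert(mau_chain, full_chains, packet, key, value):
    slot = hash(key) % NUM_SLOTS
    for mau_table in mau_chain:            # each MAU stores one collision-chain position
        if mau_table[slot] is None:
            mau_table[slot] = (key, value)
            return "stored"
    # Collision chain is full: mark the packet metadata and remember that this
    # chain overflowed, so later operations on the same chain are also redirected.
    packet["overflow"] = True
    full_chains.add(slot)
    return "overflowed_to_control_plane"

# Example setup: a chain of three MAUs plus an overflow-tracking set.
mau_chain = [[None] * NUM_SLOTS for _ in range(3)]
full_chains = set()
```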

See section 3.3 in the paper for the details of the hash-based aggregation algorithm.

Data motion operators assist in sending the results of an operation to downstream nodes, and data reloading operators are used to relocate data within the switch.

Prototype and evaluation

The prototype implements a join-and-group-by-one query pattern on the MAUs of a Tofino switch. The pattern uses 10 MAUs, as shown in the figure below.



The experiments use a four-node Greenplum Parallel Database instance (three segment nodes and one control node), running Query 20 from the TPC-H benchmark with scale factor of 100.

For the most expensive join-and-group-by segment of this query, the normal Greenplum query plan completes in 1702ms, whereas the network accelerated plan completes in 834ms.

In the normal plan, Greenplum tries to move as little data as possible and to perform the join-and-group-by locally. This is the accepted best practice for distributed plans. Conversely, the accelerated plan pushes both relations onto the switch. The effective network speed the deparser achieved in that case was within 2% of the nominal maximum of 25 Gbps. The accelerated plan ran 2.04x faster.

A second experiment explores the allocation of MAU tables to either the join or group-by stages, finding that it is more advantageous to assign extra MAUs to the join rather than to the group-by.

A final experiment investigates the effects of varying network speed. In the chart below, the ‘original’ bar shows the plan originally chosen by Greenplum, and ‘normal’ is the best possible plan of the query without using network acceleration. The ‘accel’ bar shows the performance achieved with network acceleration, and the ‘min’ bar shows the theoretical minimum time it would take if the network were systematically saturated.

The current limitation to getting the ‘accel’ performance closer to ‘min’ is the deparser, which plateaus at 29 Gbps.

Realizing the potential speed— the difference between the ‘min’ bar at a given speed and the ‘accel’ one— requires some future work (e.g., by using RDMA).

Research agenda

NetAccel opens several fundamental research directions to be explored, including:

  • CPU offloading, so that tuples can be deparsed onto the network without involving the CPU (e.g., via RDMA or smart NICs).
  • Richer offloading strategies beyond simple control plane overflow
  • MAU allocation strategies for efficient and flexible allocation
  • Switch parallelism that distributes work across several pipelines
  • Updating the query optimiser and cost model to accommodate the option of in-network processing. Whereas traditional plans aim to minimise data movement, network-accelerated strategies benefit from minimising intermediate states.

While realizing the full potential of our vision will take years, we are excited by the prospect of NetAccel and by the new data processing avenues opening up with the advent of next-generation networking equipment.

I have to say, I can’t wait to see where this will take us. I have a feeling it’s going to be a big part of large scale data processing in the future.

Programming paradigms for dummies: what every programmer should know

January 25, 2019

Programming paradigms for dummies: what every programmer should know Peter Van Roy, 2009

We’ll get back to CIDR’19 next week, but chasing the thread starting with the Data Continuum paper led me to this book chapter by Peter Van Roy mapping out the space of programming language designs. (Thanks to TuringTest for posting a reference to it in a HN thread). It was too good not to take a short detour to cover it! If you like the chapter, you’ll probably enjoy the book, ‘Concepts, Techniques, and Models of Computer Programming’ by Van Roy & Haridi, on which much of this chapter was based.

This chapter gives an introduction to all the main programming paradigms, their underlying concepts, and the relationships between them… We give a taxonomy of about 30 useful programming paradigms and how they are related.

Programming paradigms are approaches based on a mathematical theory or particular set of principles, each paradigm supporting a set of concepts. Van Roy is a believer in multi-paradigm languages: solving a programming problem requires choosing the right concepts, and many problems require different sets of concepts for different parts. Moreover, many programs have to solve more than one problem! “A language should ideally support many concepts in a well-factored way, so that the programmer can choose the right concepts whenever they are needed without being encumbered by the others.” That makes intuitive sense, but in my view does also come with a potential downside: the reader of a program written in such a language needs to be fluent in multiple paradigms and how they interact. (Mitigating this is probably what Van Roy had in mind with the ‘well-factored’ qualification: a true multi-paradigm language should avoid cross-paradigm interference, not just support a rag-bag of concepts). As Van Roy himself says later on when discussing state: “The point is to pick a paradigm with just the right concepts. Too few and programs become complicated. Too many and reasoning becomes complicated.”

There are a huge number of programming languages, but many fewer paradigms. But there are still a lot of paradigms. This chapter mentions 27 different paradigms that are actually used.

The heart of the matter is captured in the following diagram, “which rewards careful study.” Each box is a paradigm, and the arrows between boxes show the concept(s) that need to be added to move between them.



Figure 2 is organised according to the creative extension principle:

Concepts are not combined arbitrarily to form paradigms. They can be organized according to the creative extension principle… In a given paradigm, it can happen that programs become complicated for technical reasons that have no direct relationship to the specific problem that is being solved. This is a sign that there is a new concept waiting to be discovered.

The most common ‘tell’ is a need to make pervasive (nonlocal) modifications to a program in order to achieve a single objective. (I’m in danger of climbing back on my old AOP soapbox here!). For example, if we want any function to be able to detect an error at any time and transfer control to an error correction routine, that’s going to be invasive unless we have a concept of exceptions.

Two key properties of a programming paradigm are whether or not it has observable non-determinism, and how strongly it supports state.

… non-determinism is observable if a user can see different results from executions that start at the same internal configuration. This is highly undesirable… we conclude that observable nondeterminism should be supported only if its expressive power is needed.

Regarding state, we’re interested in how a paradigm supports storing a sequence of values in time. State can be unnamed or named; deterministic or non-deterministic; and sequential or concurrent. Not all combinations are useful! Figure 3 below shows some that are:

The horizontal axis in the main paradigms figure (figure 2) is organised according to the bold line in the figure above.

The four most important programming concepts

The four most important programming concepts are records, lexically scoped closures, independence (concurrency) and named state.

Records are groups of data items with indexed access to each item (e.g. structs). Lexically scoped closures combine a procedure with its external references (things it references outside of itself at its definition). They allow you to create a ‘packet of work’ that can be passed around and executed at a future date. Independence here refers to the idea that activities can evolve independently. I.e., they can be executed concurrently. The two most popular paradigms for concurrency are shared-state and message-passing. Named state is at the simplest level the idea that we can give a name to a piece of state. But Van Roy has a deeper and very interesting argument that revolves around named mutable state:

State introduces an abstract notion of time in programs. In functional programs, there is no notion of time… Functions do not change. In the real world, things are different. There are few real-world entities that have the timeless behaviour of functions. Organisms grow and learn. When the same stimulus is given to an organism at different times, the reaction will usually be different. How can we model this inside a program? We need to model an entity with a unique identity (its name) whose behaviour changes during the execution of the program. To do this, we add an abstract notion of time to the program. This abstract time is simply a sequence of values in time that has a single name. We call this sequence a named state.
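A small Python illustration of the last two concepts together: a lexically scoped closure capturing a piece of named state whose values form a sequence over time (my example, not Van Roy's):

```python
# Illustrative only: a lexically scoped closure plus named state.
def make_counter():
    count = 0                 # named state: a sequence of values over time, under one name
    def increment():          # closure: captures 'count' from its defining scope
        nonlocal count
        count += 1
        return count
    return increment

tick = make_counter()         # a 'packet of work' that can be passed around
print(tick(), tick(), tick()) # 1 2 3 -- same call, different results as time advances
```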

Then Van Roy goes on to give what seems to me to be conflicting pieces of advice: “A good rule is that named state should never be invisible: there should always be some way to access it from the outside” (when talking about correctness), and “Named state is important for a system’s modularity” (think information hiding).

Abstracting data

A data abstraction is a way to organize the use of data structures according to precise rules which guarantee that the data structures are used correctly. A data abstraction has an inside, an outside, and an interface between the two.

Data abstractions can be organised along two main dimensions: whether or not the abstraction uses named state, and whether or not the operations are bundled into a single entity with the data.

Van Roy then goes on to discuss polymorphism and inheritance (note that Van Roy prefers composition to inheritance in general, but if you must use inheritance then make sure to follow the substitution principle).

Concurrency

The central issue in concurrency is non-determinism.

Nondeterminism is very hard to handle if it can be observed by the user of the program. Observable nondeterminism is sometimes called a race condition.

Not allowing non-determinism would limit our ability to write programs with independent parts. But we can limit the observability of non-determinate behaviour. There are two options here: defining a language in such a way that non-determinism cannot be observed; or limiting the scope of observable non-determinism to those parts of the program that really need it.

There are at least four useful programming paradigms that are concurrent but have no observable non-determinism (no race conditions). Table 2 (below) lists these four together with message-passing concurrency.

Declarative concurrency is also known as monotonic dataflow. Deterministic inputs are received and used to calculate deterministic outputs.

In functional reactive programming (FRP, aka ‘continuous synchronous programming’) we write functional programs, but the function arguments can be changed and the change is propagated to the output.

Discrete synchronous programming (aka reactive) systems wait for input events, perform internal calculations, and emit output events. The main difference between reactive and FRP is that in reactive programming time is discrete instead of continuous.

Constraints

In constraint programming we express the problem to be solved as a constraint satisfaction problem (CSP)… Constraint programming is the most declarative of all practical programming paradigms.

Instead of writing a set of instructions to be executed, in constraint programming you model the problem: representing the problem as a set of variables with constraints over those variables and propagators that implement the constraints. You then pass this model to a solver.
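As a toy illustration of that modelling style (hand-rolled propagators over two variables; a real constraint solver also interleaves search with propagation, which is omitted here):

```python
# Toy sketch: variables with domains, constraints, and propagators that prune domains.
x = set(range(0, 10))          # domain of X
y = set(range(0, 10))          # domain of Y

def propagate_sum_equals(total):
    """Propagator for the constraint X + Y == total: prune unsupported values."""
    global x, y
    x = {vx for vx in x if any(vx + vy == total for vy in y)}
    y = {vy for vy in y if any(vx + vy == total for vx in x)}

def propagate_less_than():
    """Propagator for X < Y."""
    global x, y
    x = {vx for vx in x if any(vx < vy for vy in y)}
    y = {vy for vy in y if any(vx < vy for vx in x)}

for _ in range(3):             # run propagators to a fixpoint (domains are tiny here)
    propagate_sum_equals(10)
    propagate_less_than()

print(sorted(x), sorted(y))    # remaining candidate values for X and Y
```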

Language design guidelines

Now that we’ve completed a whirlwind tour through some of the concepts and paradigms, I want to finish up with some of Van Roy’s thoughts on designing a programming language. One interesting class of language is the ‘dual-paradigm’ language. A dual-paradigm language typically supports one paradigm for programming in the small, and another for programming in the large. The second paradigm is typically chosen to support abstraction and modularity. For example, solvers supporting constraint programming embedded in an OO language.

More generally, Van Roy sees a layered language design with four core layers, a structure which has been independently discovered across multiple projects:

The common language has a layered structure with four layers: a strict functional core, followed by declarative concurrency, then asynchronous message passing, and finally global named state. This layered structure naturally supports four paradigms.

Van Roy draws four conclusions from his analysis here:

  1. Declarative programming is at the very core of programming languages.
  2. Declarative programming will stay at the core for the foreseeable future, because distributed, secure, and fault-tolerant programming are essential topics that need support from the programming language.
  3. Deterministic concurrency is an important form of concurrency that should not be ignored. It is an excellent way to exploit the parallelism of multi-core processors.
  4. Message-passing concurrency is the correct default for general-purpose concurrency instead of shared-state concurrency.

For large-scale software systems, Van Roy believes we need to embrace a self-sufficient style of system design in which systems become self-configuring, healing, adapting, etc. The system has components as first-class entities (specified by closures) that can be manipulated through higher-order programming. Components communicate through message-passing. Named state and transactions support system configuration and maintenance. On top of this, the system itself should be designed as a set of interlocking feedback loops. Here I’m reminded of systems thinking and causal loop diagrams.

The last word

Each paradigm has its own “soul” that can only be understood by actually using the paradigm. We recommend that you explore the paradigms by actually programming in them…

The data calculator: data structure design and cost synthesis from first principles and learned cost models

January 23, 2019

The Data Calculator: data structure design and cost synthesis from first principles and learned cost models Idreos et al., SIGMOD’18

This paper preceded the work on data continuums that we looked at last time, and takes a more general look at interactive and semi-automated design of data structures. A data structure here is defined as a combination of (1) a data layout describing how the data is stored, and (2) algorithms that describe how its basic functionality is achieved over the specific data layout. For data structures with just two different types of nodes (e.g., leaf-nodes and non-leaf nodes in a tree), the authors estimate there are more than 10^{32} possible valid data structure designs! Dozens of new data structures are published each year, but we’re still only scratching the surface.

Our intuition is that most designs (and even inventions) are about combining a small set of fundamental concepts in different ways or tunings… Our vision is to build the periodic table of data structures so we can express their massive design space. We take the first step in the paper, presenting a set of first principles that can synthesize an order of magnitude more data structure designs than what has been published in the literature.

Alongside the ability to synthesise data structure designs the authors also build a cost model so that designs can be evaluated and ranked. The cost model is informed by a combination of analytical models, benchmarks, and machine learning for a small set of fundamental access primitives.

The design primitives and cost model are brought together in a tool called the ‘Data Calculator’, which computes the performance of arbitrary data structure designs as combinations of the primitives.

It is an interactive tool that accelerates the process of design by turning it into an exploration process, improving the productivity of researchers and engineers; it is able to answer what-if data structure design questions to understand how the introduction of new design choices, workloads, and hardware affect the performance (latency) of an existing design. It currently supports read queries for basic hardware conscious layouts.



Data layout primitives

At the core of the Data Calculator are a set of design primitives, organised into five different groups (classes): how nodes are organised, filtered, and partitioned, the layout of child nodes, and whether or not recursion is allowed. Scanning the ‘domain’ column in the table below, these read like configuration parameters for a data structure generator. (The rightmost columns show the parameter values used by some common data structures – see the key at the bottom of the table). The primitives for dictating physical placement are critical to computing the cache-friendliness of a data structure, and hence the cost of traversing it.



As an example, a B+ tree uses sorted nodes with key retention none for internal nodes (they only store fences and pointers), and full for leaf nodes.

Primitives combine to form ‘elements’: the full specification of a node type within the data structure. For example, internal (non-terminal) elements and leaf (terminal) elements. Radar plots, which show the design primitives selected and the particular configuration choice made for each primitive (the segments on the radii), can be used to visually compare designs.



For two-element structures the design space describes 10^{32} possible designs, and for three-element structures (e.g. MassTree) there are 10^{48} possibilities. The Data Calculator tool provides guidance to help navigate the design space, and test a given design against a workload and hardware setting.

How much is that data structure in the window?

Traditional cost analysis in systems and data structures happens through experiments and the development of analytical cost models. Both options are not scalable when we want to quickly test multiple different parts of the massive design space we define in this paper.

Each data access primitive characterises one aspect of how data is accessed (e.g. binary search, scan, …), and has a number of different possible embodiments (‘level 2 primitives’). The idea is to model the behaviour of these level 2 primitives, and then combine the models to understand the behaviour of data structures synthesised from combinations of the primitives.

For every level 2 primitive, the Data Calculator contains one or more models that describe its performance (latency) behavior. These are not static models; they are trained and fitted for combinations of data and hardware profiles as both these factors drastically affect performance…. The models are simple parametric models; given the design decision to keep primitives simple (so they can be easily reused), we have domain expertise to expect how their performance behavior will look like (sic).

The access primitives and the types of models used for them are shown in the following table.



For example, when binary searching a sorted array we know that it should take time O(\log n), and the model to be fitted is of the form f(n) = c_1 n + c_2 \log n + y_0, where c_1, c_2 and y_0 are coefficients learned through linear regression.
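For instance, fitting that model is a small least-squares problem. Here's a minimal sketch (the measurements below are made up purely for illustration):

```python
# Sketch: fit f(n) = c1*n + c2*log(n) + y0 to measured binary-search latencies
# with ordinary least squares, as an example of fitting a 'level 2' primitive model.
import numpy as np

n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])              # input sizes benchmarked
latency = np.array([0.10, 0.14, 0.19, 0.25, 0.40])   # measured latency (invented numbers)

X = np.column_stack([n, np.log(n), np.ones_like(n)]) # features: n, log n, intercept
(c1, c2, y0), *_ = np.linalg.lstsq(X, latency, rcond=None)

predict = lambda m: c1 * m + c2 * np.log(m) + y0     # the fitted cost model
print(predict(5e6))
```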

Given a data layout specification, hardware profile, and a workload, the calculator uses a rule-based system to compute the overall cost of an operation. The following figure, which should be read starting from the top-right, shows the process for estimating the cost of a Get operation. As part of the process the calculator simulates populating the data structure to figure out how many nodes will exist, the height of the structure, and so on. This information is needed to estimate the operation cost. The output is an abstract syntax tree with the access patterns of the path that needed to be traversed. These can be costed using the fitted models, and the overall cost is calculated as their sum.



A key part of figuring out performance is whether data is ‘hot’ or ‘cold’, and if it’s hot, where it sits in the cache hierarchy. The modelling intuition is that we’ll see step-like performance changes at each cache level depending on whether or not the data is cached there. We can therefore model each layer with a sigmoid function, and the overall behaviour across a cache hierarchy as the sum of a set of sigmoid functions.
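A minimal sketch of that idea, with invented cache sizes and latency constants:

```python
# Sum-of-sigmoids sketch: each cache level contributes a smooth step in access
# latency as the working set outgrows that level (all constants are made up).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

capacities = [32e3, 256e3, 8e6]     # L1, L2, L3 sizes in bytes (assumed)
extra_cost = [3, 10, 60]            # added ns once the working set spills past each level

def random_access_ns(working_set_bytes, sharpness=8.0):
    cost = 1.0                      # baseline cost when everything fits in L1
    for capacity, extra in zip(capacities, extra_cost):
        # smooth step from 0 to 'extra' as the working set crosses this capacity
        cost += extra * sigmoid(sharpness * math.log(working_set_bytes / capacity))
    return cost

print(random_access_ns(16e3), random_access_ns(1e6), random_access_ns(64e6))
```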

For a given workload, the calculator figures out how many times each node is going to be accessed, allowing us to compute a popularity factor p: the number of accesses for the given node, divided by the total number of accesses across all nodes. The random access cost for an operation is then derived from the fitted cache model and a weighting factor w = 1/(p * sid). The parameter sid is the cycling sequence number of the operation in the workload. The end result is that we predict smaller access costs for more frequently accessed nodes, and vice-versa.

The evaluation shows that the calculator does a good job of estimating latency for a variety of different data structure designs (columns in the figure below), and hardware (the rows).



Answering ‘what-if?’ questions

By varying the input parameters to the calculator we can ask ‘what-if?’ questions. For example, “what would be the performance impact if I change my B-tree design by adding a bloom filter in each leaf?”, or “what would be the performance impact if I switched to this instance type on AWS?”. Training the level 2 primitives on a new hardware platform takes only a few minutes, and is a one-off task for each platform.

The evaluation gives an indication of the power of being able to ask these kinds of questions using a design based on a B-tree and a workload of uniform data and queries with 1 million inserts and 100 Gets:

  • It takes only 20 seconds to assess the impact of a hardware change, revealing in this case that performance would drop.
  • It takes 47 seconds to then ask, “if I do move to that hardware, is there a better choice of data structure?” (The search was restricted to exploring just five possible elements).
  • It takes 20 seconds to confirm a hypothesis that adding a Bloom filter in all B-tree leaves would be beneficial.
  • It takes 24 seconds to explore what would happen if skew is introduced into the workload (performance suffers with the current design), and a further 47 seconds to compute changes that will yield better performance with the skewed workload.

For the final part of the evaluation, the Data Calculator is asked to come up with designs for two different workload scenarios, on the same hardware profile, starting from a blank slate. The calculator is, however, given guidance to explore designs that combine hashing, B-tree-like indexing, and a simple log. The synthesis costs vary based on the estimated workload size: from a few seconds to 30 minutes.

Thus, the Data Calculator quickly answers complex questions that would normally take humans days or even weeks to test fully.

In the first scenario there are mixed reads and writes, with all reads being point queries in 20% of the domain. The calculator finds a design that use hashing at the upper levels of the hierarchy for fast access to data, but then most of the remaining domain is modelled using simple unsorted pages, with B+-tree style indexing for the read intensive part.

In the second scenario 50% of reads are point reads touching 10% of the domain, and the other 50% are range queries touching a different 10% of the domain. The calculator chooses to use hashing for the part of the domain receiving point queries, and B+-tree style indexing for the range query part.

The quest for the first principles of data structures needs to continue to find the primitives for additional significant classes of designs including updates, compression, concurrency, adaptivity, graphs, spatial data, version control management, and replication.


Design continuums and the path toward self-designing key-value stores that know and learn

January 21, 2019

Design continuums and the path toward self-designing key-value stores that know and learn Idreos et al., CIDR’19

We’ve seen systems that help to select the best data structure from a pre-defined set of choices (e.g. ‘Darwinian data structure selection’), systems that synthesise data structure implementations given an abstract specification (‘Generalized data structure synthesis’), systems that use learning to tune configurations given a pre-defined set of configuration options (e.g, ‘BOAT: Building auto-tuners with structured Bayesian optimization’), systems that use learned models as key inputs to algorithms (e.g. ‘SageDB: a learned database system’), and systems that use reinforcement learning to discover fit-for-workload policies (‘Towards a hands-free query optimizer through deep learning’). Today’s paper choice jumps right into the middle of this mix, showing how families of data structures can be organised into design continuums with well-understood design parameters and cost models informing selection of a given data-structure design from within the space. As well as helping us to understand and reason about data structures within the space, the constrained feature space with quick-to-evaluate cost models opens the way to practical machine learning guided data structure synthesis and selection. The notion of design continuums is explored in the context of one specific continuum: data structures for key-value stores.

…these properties allow us to envision a new class of self-designing key-value stores with a substantially improved ability to adapt to workload and hardware changes by transitioning between drastically different data structure designs to assume a diverse set of performance properties at will.

The paper jumps back and forward between the details of the design continuum for KV-stores, and the idea of a design continuum itself. I shall put more emphasis on the latter in this write-up.

What is a design continuum?

Selecting a data structure involves making trade-offs: what’s right for one workload and set of objectives may not be right for another. When we’re building systems by hand, we tend to create either (a) general purpose systems, or (b) systems positioned towards extremes in the design space, specialised for a particular task. There’s a lot of ground between the two. The picture that emerges for me is that we’re heading towards a world of ‘personalisation’ (mass-customisation) for system software. In consumer facing software, personalisation is the idea that the system can learn the interests and preferences of its individual users, and tailor the user experience for each. Whereas we all used to get a general purpose user-experience, now personalisation is everywhere. In system software, the ‘user’ is a workload, and by learning the ‘preferences’ (characteristics) of that workload we can deliver an optimised ‘experience’ when executing it. Instead of general purpose we have fit-for-(this exact)-purpose.

Mass customisation requires (a) that we be able to enumerate points in the data structure space such that we can explore it efficiently, and (b) a way of efficiently evaluating the performance of a given data structure design. In general this is a really hard problem because the design space is vast.

Our insight is that there exist “design continuums” embedded in the design space of data structures. An intuitive way to think of design continuums is as a performance hyperplane that connects a specific set of data structure designs. Design continuums are effectively a projection of the design space, a “pocket” of designs where we can identify unifying properties among its members.

A properly constructed design continuum has a small set of design parameters forming a continuous performance tradeoff for fundamental performance metrics.

Constructing design continuums

A design continuum has five main elements.

Firstly there are a set of environment parameters which give us critical information about the workload and the environment (e.g. hardware) it runs on. For the key-value store design continuum, the authors use the following environment parameters:

Secondly, a super-structure shows how design primitives can be combined in structured ways to create families of data structures. In the key-value store example, the super-structure is based on the founding idea of a number of layers or levels, with upper levels being hot, and lower levels progressively colder.

Third, a set of design parameters form the externalised ‘knobs’ that can be tuned to influence the design. The ideal is to have the smallest set of movable design abstractions that permit differentiating among target designs in the continuum.

Specifically, fewer design parameters (for the same target designs) lead to a cleaner abstraction which in turn makes it easier to come up with algorithms that automatically find the optimal design. We minimize the number of design parameters in two ways: 1) by adding deterministic design rules which encapsulate expert knowledge about what is a good design, and 2) by collapsing more than one interconnected design decision into a single design parameter.

For the key-value store design continuum, we have the following design parameters:

The design parameters inform, but do not themselves directly specify, the choices to be made within the super-structure. For this we have a set of design rules, parameterized by the environment and design parameters. “These rules enable instantiating specific designs by deterministically deriving key design aspects.” For example, in the key-value continuum one of the rules is that the capacity of levels grows exponentially at each level by a factor of T (a design parameter). And a memory budget will first be used for fence pointers, with the remainder assigned to Bloom filters. Here are the design rules for the KV continuum:
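(The full rule set is given as a figure in the paper.) Just to give a flavour of the two rules mentioned above, here's a minimal sketch with assumed constants (e.g. one fence pointer per page, 64 bits per pointer):

```python
# Sketch of two design rules: levels grow by a factor T, and a memory budget goes
# to fence pointers first, with whatever remains assigned to Bloom filters.
def level_capacities(buffer_entries, T, num_levels):
    return [buffer_entries * (T ** i) for i in range(1, num_levels + 1)]

def split_memory(budget_bits, total_entries, entries_per_page, bits_per_pointer=64):
    fence_pointer_bits = (total_entries / entries_per_page) * bits_per_pointer
    bloom_filter_bits = max(0.0, budget_bits - fence_pointer_bits)
    return fence_pointer_bits, bloom_filter_bits

print(level_capacities(buffer_entries=1_000_000, T=4, num_levels=5))
print(split_memory(budget_bits=8e9, total_entries=1e9, entries_per_page=128))
```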

Finally, a cost model provides a closed-form equation for each one of the core performance metrics. The cost model provides an understanding of how changes in the design parameters (configuration tuning) impact the various performance metrics. The cost model for the KV continuum looks like this:

Using the cost model, a theory-of-constraints style approach can be used to find an optimal design: iteratively finding the current limiting factor and the knobs that can be tweaked to alleviate it.

The overall process of putting together a design continuum looks like this, where we proceed from left to right:

An ideal design continuum has six key properties:

  1. It is functionally intact; all possible designs should be able to support all operation types (i.e., are substitutable).
  2. It is Pareto-optimal: the continuum should contain no dominated designs, i.e., no design that another member beats on one or more of the performance metrics while matching it on all the others.
  3. It has a one-to-one mapping from the domain of design knobs to the co-domain of performance and memory trade-offs (bijective).
  4. It enables a diverse set of performance properties.
  5. The time complexity for navigating the continuum to converge on an optimal or near-optimal design should be tractable. (I.e., it is navigable).
  6. It is likely to be layered in order to provide a trade-off between diversity and navigability.

Growing a design continuum

One way to evolve a design continuum is to start with a small coherent core and then incrementally extend it. This is a three-step process: bridging, patching, and then costing. Bridging is the initial model extension, which in order of preference can take place by (1) introducing new design rules, (2) expanding the domains of existing design parameters, or (3) adding new design parameters. The bridging process may introduce many new intermediate designs, some of which may not be functionally intact (workable). Patching adds new design rules and/or constraints on permissible design parameter settings to avoid these broken designs. Care should be taken that patching only constrains new parts of the continuum opened up by bridging. The final step is to generalize the cost model to account for all the new designs.

Self-designing systems

Knowing which design is the best for a workload opens the opportunity for systems that can adapt on-the-fly. While adaptivity has been studied in several forms including adapting storage to queries, the new opportunity is morphing among what is typically considered as fundamentally different designs, e.g. from an LSM-tree to a B+ tree, which can allow systems to gracefully adapt to a larger array of diverse workload patterns.

The authors identify three challenges on the way to such self-designing systems. Firstly, if we’re going to adapt on-the-fly, then we need a way to physically transition between any two designs at runtime (and take into account the cost of those transitions). Secondly, we’ll need to generate tailored code for each design point. Given that we have a continuum, it should be possible to write a single generalized algorithm for each operation that can instantiate the concrete algorithm used for that operation in each possible design. Finally, there will be limits to what we can do with rule-based tweaking of design parameters.

The path forward is to combine machine learning with the design continuum. Machine learning is increasingly used to tune exposed tuning knobs in systems. The new opportunity here is the native combination of such techniques with the system design itself… The net result is that design continuums can be blended with ML approaches to co-design a tailored system that both knows how to navigate a vast space of the design space and learns when needed to navigate design options that are hard to deterministically formulate how they will interact with the rest of the design.

The Key-Value store design continuum

I’ve hit my target word limit for a #themorningpaper post, and we haven’t really had a chance to dive into the KV continuum in any detail at all. In brief, the authors show how B+ trees, Log-Structured-Merge trees, and Log-Structured Hash-tables can be combined into one layered design continuum, with different instances falling out from different design parameter choices.



There’s a whole bunch of good detail here that’s interesting in its own right, so if the big ideas in the paper have caught your imagination, I definitely encourage you to go and check it out.

Towards a hands-free query optimizer through deep learning

January 18, 2019

Towards a hands-free query optimizer through deep learning Marcus & Papaemmanouil, CIDR’19

Where the SageDB paper stopped— at the exploration of learned models to assist in query optimisation— today’s paper choice picks up, looking exclusively at the potential to apply learning (in this case deep reinforcement learning) to build a better optimiser.

Why reinforcement learning?

Query optimisers are traditionally composed of carefully tuned and complex heuristics based on years of experience. Feedback from the actual execution of query plans can be used to update cardinality estimates. Database cracking, adaptive indexing, and adaptive query processing all incorporate elements of feedback as well.

In this vision paper, we argue that recent advances in deep reinforcement learning (DRL) can be applied to query optimization, resulting in a “hands-free” optimizer that (1) can tune itself for a particular database automatically without requiring intervention from expert DBAs, and (2) tightly incorporates feedback from past query optimizations and executions in order to improve the performance of query execution plans generated in the future.

If we view query optimisation as a DRL problem, then in reinforcement learning terminology the optimiser is the agent, the current query plan is the state, and each available action represents an individual change to the query plan. The agent learns a policy which informs the actions it chooses under differing circumstances. Once the agent decides to take no further actions the episode is complete and the agent’s reward is (ideally) a measure of how well the generated plan actually performed.

There are a number of challenges, explored in this paper, with making this conceptual mapping work well in practice. Not least of which is that evaluating the reward function (executing a query plan to see how well it performs) is very expensive compared to e.g. computing the score in an Atari game.

The ReJOIN join order enumerator

ReJOIN explores some of these ideas on a subset of the overall query optimisation problem: learning a join order enumerator.

Each query sent to ReJOIN is an episode; the state represents subtrees of a binary join tree together with information about the query join and selection predicates. Actions combine two subtrees into a single tree. Once all input relations are joined the episode ends, and ReJOIN assigns a reward based on the optimiser’s cost model. Its policy network is updated on the basis of this score. The final join ordering is passed to the optimiser to complete the physical plan.
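Schematically, an episode looks something like the sketch below. This is illustrative only, not ReJOIN's implementation: the real system featurizes states and trains a policy network with policy gradients, whereas here the policy and cost model are stand-in functions.

```python
# Schematic sketch of a ReJOIN-style episode.
import random

def run_episode(relations, choose_action, cost_model):
    subtrees = [(r,) for r in relations]         # state: forest of join subtrees
    trajectory = []
    while len(subtrees) > 1:
        i, j = choose_action(subtrees)           # action: pick two subtrees to join
        trajectory.append((list(subtrees), (i, j)))
        joined = (subtrees[i], subtrees[j])
        subtrees = [t for k, t in enumerate(subtrees) if k not in (i, j)] + [joined]
    reward = -cost_model(subtrees[0])            # reward only at the end of the episode
    return subtrees[0], trajectory, reward       # the join tree goes on to physical planning

# Stand-ins for the learned policy and the optimiser's cost model:
random_policy = lambda ts: tuple(random.sample(range(len(ts)), 2))
toy_cost = lambda tree: len(str(tree))
print(run_episode(["A", "B", "C", "D"], random_policy, toy_cost))
```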

Using the optimiser’s cost model as a proxy for the ultimate performance of a generated query plan enables join orderings to be evaluated much more quickly.

The following chart shows how ReJOIN learns to produce good join orders during training. It takes nearly 9000 episodes (queries) to become competitive with PostgreSQL.

Once ReJOIN has caught up with PostgreSQL, it goes on to surpass it, producing orderings with lower cost.

Also of note here is that after training, ReJOIN produces its query plans faster than PostgreSQL’s built-in join enumerator in many cases. The bottom-up nature of ReJOIN’s algorithm is O(n), whereas PostgreSQL’s greedy bottom-up algorithm is O(n^2).

In addition to being limited to just join ordering, ReJOIN’s use of the query optimiser’s cost model to generate reward signals means that it is still dependent on a well-tuned cost model, which is a big part of the problem we wanted to solve in the first place. Ideally we’d like to extend the approach to handle full physical plan generation, and also remove the dependency on having an existing cost model.

Challenges in extending the approach

Once we go from just join ordering to the full search space including operator and access path selection etc., the approach from ReJOIN is unable to learn effective policies in reasonable time. An initial model failed to out-perform random choice even after 72 hours of training.

If we use actual query execution time to generate the reward, then initial policies, which will often generate very inefficient plans, will take a long time to obtain a reward signal. Thus we learn the slowest at exactly the point when we’d like to learn the fastest, and the system takes a prohibitive amount of time to converge to good results. (An experiment with ReJOIN using real query latency instead of optimiser cost confirmed this).

Finally, query latency as a reward signal doesn’t meet the expectations of many DRL algorithms that the reward signal is dense and linear. A dense reward signal is one that provides incremental feedback with every action the agent takes (such as the score updating in an Atari game), not just at the end of the episode. The linear assumption means that an algorithm may attempt to maximise the sum of many small rewards within an episode.

We have identified how the large search space, delayed reward signal, and costly performance indicators provide substantial hurdles to naive applications of DRL to query optimization.

Should we just give up on the idea then? All is not lost yet! The last section of the paper details a number of alternative approaches that could overcome some of these hurdles.

Research directions

Three techniques that may help to make DRL-based query optimisation practical again are learning from demonstration, cost-model bootstrapping, and incremental learning. We’ll look at each of these briefly in turn next (there are no results from applying or evaluating these ideas as yet).

Learning from demonstration

In learning from demonstration, a model is first trained to mimic the behaviour of an existing expert, and then goes on to learn directly from its actions. In the context of query optimisation, we would first train a model to mimic the actions taken by an existing optimiser (indexes, join orderings, pruning of bad plans etc.), and then switch to optimising queries directly bypassing the ‘mentor’ optimiser. In this second phase the agent fine-tunes its own policy. The advantage of this strategy is that mimicking the existing optimiser in the early stages helps the optimiser agent to avoid the ‘obviously bad’ parts of the search space.

Since the behavior of the model in the second phase should not initially stray too far from the behavior of the expert system, we do not have to worry about executing any exceptionally poor query plans. Additionally, since the second training phase only needs to fine-tune an already-performant model, the delayed reward signal is of far less consequence.

Cost model bootstrapping

Cost model bootstrapping uses a two-phase approach that is very similar in spirit. In the first phase, instead of learning to mimic the actions of an existing expert optimiser, the judgements of the existing expert optimiser (i.e., its cost model) are used to bring the agent to an initial level of competence. The optimiser’s cost model is used as the reward signal during initial training, and once the agent has learned to produce good policies according to the cost model, it is then switched to learning from a reward based on actual query latency.

One complication in doing this is that the reward units (scale) need to be consistent across the costs produced by the query optimiser cost model, and the latency measurements of actual query executions. We could apply some kind of normalisation across the two, or alternatively transfer the weights from the first network to a new network (transfer learning).

Incremental learning

Incremental learning attempts to mitigate the issues stemming from poor query plans early in the learning cycle by beginning learning on simpler problems:

… incrementally learning query optimization by first training a model to handle simple cases and slowly introducing more complexity. This approach makes the extremely large search space more manageable by dividing it into smaller pieces.

In the context of query optimisation, we can make problems easier by reducing the number of relations in a query, or by reducing the number of optimisation dimensions (pipeline phases) to be considered. Together these give a two-dimensional problem space.

We could therefore try starting with a small number of pipeline phases and gradually introducing more.

Or we could try starting with small queries and gradually moving to larger and larger ones.

Maybe a hybrid strategy will be best, starting with join order and small queries, and gradually increasing sophistication in both dimensions.
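As a rough sketch of what such a curriculum loop might look like (the `make_queries`, `train_on`, and `evaluate` hooks are all hypothetical, and the promotion threshold is an arbitrary choice of mine):

```python
# Curriculum sketch: train on small join queries first, and only increase the
# number of relations once the agent's plans are competitive at the current level.
def curriculum_train(agent, make_queries, evaluate, max_relations=10):
    for n_relations in range(2, max_relations + 1):
        while True:
            batch = make_queries(n_relations)   # hypothetical workload generator
            agent.train_on(batch)               # hypothetical DRL update
            if evaluate(agent, batch) >= 0.95:  # e.g. within 5% of a reference optimiser
                break                           # difficulty mastered, move on
```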

SageDB: a learned database system

January 16, 2019

SageDB: a learned database system Kraska et al., CIDR’19

About this time last year, a paper entitled ‘The case for learned index structures’ (part I, part II) generated a lot of excitement and debate. Today’s paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.

The core idea behind SageDB is to build one or more models about the data and workload distribution and based on them automatically build the best data structures and algorithms for all components of the database system. This approach, which we call “database synthesis” will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

For the want of a model

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.

Consider an extreme case: we want to store and query ranges of fixed-length records with continuous integer keys. Using a conventional index here makes no sense as the key itself can be used as an offset. A C program loading 100M integers into an array and summing over a range runs in about 300ms. Doing the same operation in Postgres takes about 150 seconds: a 500x overhead for the general purpose design.
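The same point can be made in a few lines of (my own, illustrative) Python: with dense integer keys and fixed-length records the key doubles as the offset, so a range query is just an array slice and no index structure is required. (The paper’s example uses a C program over 100M integers; a smaller array is used here to keep the sketch light.)

```python
import numpy as np

# Record i is stored at offset i, so the key itself is the index.
values = np.arange(10_000_000, dtype=np.int64)

def range_sum(lo, hi):
    # No B-Tree traversal, no hashing: just slice the array directly.
    return values[lo:hi].sum()

print(range_sum(10_000, 20_000))
```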

…we can optimize almost any algorithm or data structure used by the database with knowledge of the exact data distribution. These optimizations can sometimes even change the complexity class of well-known data processing algorithms.

Knowledge of the data distribution comes in the form of a (learned) model. Armed with such a model, the authors argue that we can automatically synthesise index structures, sorting and join algorithms, and even entire query optimisers, leveraging the data distribution patterns for performance gains.

Overfitting is good

What kind of a model makes sense? A histogram, for example, is a very simple model, but for the use cases discussed here it is either too coarse-grained or too big to be useful. At the other end of the spectrum, deep and wide neural nets come with high costs (though these are expected to decrease with advances in hardware). Combine this with the fact that, for this use case, ‘overfitting’ is good: we want to capture the nuances of our exact data as precisely as possible. (The research program to date is largely focused on analytic workloads; some degree of generalisation is clearly beneficial once we start to consider updates.)

As of today, we found that we often need to generate special models to see significant benefits.

As an example, consider the RMI model from ‘The case for learned index structures’:

  1. Fit a simple model (linear regression, simple neural net etc.) over the data
  2. Use the prediction of the model to pick another model, an expert, which more accurately models the subset of the data
  3. Repeat the process until the leaf model is making a final prediction

RMI is just a starting point. For example, it is possible to make the top model or bottom model more complex, replace parts of the models at a particular stage with other types of models, use quantization, vary the feature representation, combine models with other data structures, and so on. We therefore believe we will see an explosion of new ideas on how to most efficiently generate models for database components to achieve the right balance between precision, low latency, space, and execution time for a given workload.
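As a concrete (and deliberately minimal) illustration of the recipe above, here is a two-stage RMI over sorted keys using linear models at both stages. This is my own sketch of the idea, not code from either paper: the stage-one model routes a key to an expert, the expert predicts a position, and a bounded local search around the prediction finishes the job.

```python
import numpy as np

class TwoStageRMI:
    """Minimal recursive-model-index sketch: linear model -> linear experts -> local search."""

    def __init__(self, keys, n_experts=100):
        self.keys = np.asarray(keys, dtype=float)   # assumed sorted ascending
        self.n = len(self.keys)
        self.n_experts = n_experts
        positions = np.arange(self.n)
        # Stage 1: a single linear model mapping key -> approximate position.
        self.a1, self.b1 = np.polyfit(self.keys, positions, 1)
        route = self._route(self.keys)
        self.experts, self.errs = [], []
        for e in range(n_experts):
            mask = route == e
            if mask.sum() >= 2:
                # Stage 2: a per-expert linear model over its slice of the data.
                a, b = np.polyfit(self.keys[mask], positions[mask], 1)
                err = int(np.abs(a * self.keys[mask] + b - positions[mask]).max())
            else:
                a = 0.0
                b = float(positions[mask][0]) if mask.any() else 0.0
                err = 0
            self.experts.append((a, b))
            self.errs.append(err)                    # worst-case error bound for local search

    def _route(self, keys):
        pred = self.a1 * keys + self.b1
        return np.clip((pred / self.n * self.n_experts).astype(int), 0, self.n_experts - 1)

    def lookup(self, key):
        e = int(self._route(np.array([key]))[0])
        a, b = self.experts[e]
        guess = int(a * key + b)
        lo = max(0, guess - self.errs[e] - 1)        # pad by one for int() truncation
        hi = min(self.n, guess + self.errs[e] + 2)
        return lo + int(np.searchsorted(self.keys[lo:hi], key))

# Example: exact positions recovered for 100,000 uniformly distributed keys.
keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=100_000))
rmi = TwoStageRMI(keys)
assert rmi.lookup(keys[12_345]) == 12_345
```

The per-expert error bound is what makes the approximate model usable as an index: the prediction only needs to land within a known distance of the true position.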

Data access

Last year’s paper on ‘The case for learned index structures’ showed that an RMI-based index can outperform state of the art B-Tree implementations by a factor of two while being orders of magnitude smaller (note that the updated arXiv version contains new results). Subsequent work has extended this to data stored on disk, compression, inserts, and multi-dimensional data.

For multi-dimensional data, the baseline is an R-Tree (as opposed to a B-Tree). R-Trees map rectangles to a list of index ranges such that the index of every point lying in the rectangle is contained in the union of these ranges. We can replace an R-Tree with a learned model, just as we could the B-Tree. One of the tricks that makes the RMI B-Tree replacement work is that it is sufficient for the model to get us ‘in the right locality’ and then we can do a local search around the prediction to finish the job. For R-Trees, we also need a layout that enables efficient localised search.

While many possible projection strategies exist, we found that successively sorting and partitioning points along a sequence of dimensions into equally-sized cells produces a layout that is efficient to compute, learnable (e.g., in contrast to z-order, which is very hard to learn), and tight (i.e., almost all points in the union of the index ranges satisfy the query).
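Here is one way to read that description, as a 2-D sketch of my own (the real system presumably generalises the sequence of dimensions and learns where the cell boundaries fall):

```python
import numpy as np

def grid_layout(points, cells_in_dim0=64):
    """Sort by dimension 0, cut into equal-sized partitions, then sort each
    partition by dimension 1. A rectangle query then touches a small union of
    contiguous index ranges, which a learned model can predict."""
    pts = np.asarray(points, dtype=float)
    order = np.argsort(pts[:, 0], kind="stable")            # global sort on dim 0
    pts = pts[order]
    n = len(pts)
    part = max(1, n // cells_in_dim0)
    for start in range(0, n, part):
        end = min(n, start + part)
        sub = np.argsort(pts[start:end, 1], kind="stable")  # sort the cell on dim 1
        pts[start:end] = pts[start:end][sub]
        order[start:end] = order[start:end][sub]
    return pts, order                                       # laid-out points + original row ids
```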

The authors implemented such a learned index over an in-memory column store with compression, and compared it to a full column scan, a clustered index (sorting by the column providing the best overall performance), and an R-Tree. The benchmarks used 60 million records from the lineitem table of the TPC-H benchmark, with query selectivity of 0.25%.

The learned index beats the next best performing implementation by 34x (note the log scales on the charts) and has only a tiny space overhead compared to the clustered solution.

Further analysis revealed that the learned index beats the clustered index on almost every type of query – the exception is when the clustered dimension in the clustered index is the only dimension in the query.

Query execution

This is one of my favourite parts of the paper, because it demonstrates how learned models can even help in the humble and age-old case of sorting. The approach is to use a learned model to put the records into roughly the right order, and then correct the nearly sorted data as a final step. For this, an efficient local sort such as insertion sort can be used, which is very fast on almost-sorted arrays.
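A toy version of the idea, as my own sketch: the ‘model’ here is just an empirical CDF over a small sample standing in for whatever model the system would actually learn, and the pure-Python insertion sort is only there to show the fix-up step (a real implementation would use a tight native loop).

```python
import numpy as np

def learned_sort(values, n_buckets=None):
    vals = np.asarray(values, dtype=float)
    n = len(vals)
    n_buckets = n_buckets or n
    # "Model": an empirical CDF learned from a small sorted sample.
    rng = np.random.default_rng(0)
    sample = np.sort(rng.choice(vals, size=min(1000, n), replace=False))
    ranks = np.interp(vals, sample, np.linspace(0, n_buckets - 1, len(sample))).astype(int)
    # Scatter each value into the bucket the model predicts for it.
    buckets = [[] for _ in range(n_buckets)]
    for v, r in zip(vals, ranks):
        buckets[r].append(v)
    out = [v for b in buckets for v in b]     # nearly sorted already
    # Fix-up pass: insertion sort is cheap on almost-sorted data.
    for i in range(1, n):
        x, j = out[i], i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out

print(learned_sort(np.random.default_rng(1).normal(size=10_000))[:5])
```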

The figure below shows results of a learned approach to sorting for increasingly large data sizes of 64-bit doubles randomly sampled from a normal distribution. In the comparison, Timsort is the default sort in Java and Python, and std::sort comes from the C++ standard library. The learned variant is 18% faster than the next best (Radix sort in this case) on average.

(This doesn’t include the time taken to learn the model).

Learned models can also be used to improve joins. For example, consider a merge-join over two stored join columns with a model per column. We can use the models to skip over data that will not join (the authors don’t detail how the equivalent of ‘local patching’ is supposed to work in this scenario; it’s not immediately obvious to me).

The authors also experimented with workload aware schedulers, implementing a reinforcement-learning based scheduling system using a graph neural network:

Our system represents a scheduling algorithm as a neural network that takes as input information about the data (e.g., using a CDF model) and the query workload (e.g., using a model trained on previous executions of queries) to make scheduling decisions.

On a sample of 10 TPC-H queries, the learned scheduler improved average job completion time by 45% over Spark’s default FIFO scheduler.

The strategy that the scheduler learned to get this improvement was to combine completing short jobs quickly with maximising cluster efficiency, learning to run jobs near their parallelism ‘sweet spot.’

Query optimiser

Traditional query optimizers are extremely hard to build, maintain, and often yield sub-optimal query plans. The brittleness and complexity of the optimizer makes it a good candidate to be learned…

Initial experiments starting with a traditional cost model and refining it over time through learning showed that the model quality can be improved, but that big gains would require significant improvements to cardinality estimation. The research direction now (no reported results as yet) is to explore hybrid model-based approaches to cardinality estimation. These hybrid models combine a learned model of the underlying data patterns and correlations with exception/outlier lists that capture extreme (and hard to learn) anomalies of the particular instance of the data.
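A speculative sketch of what such a hybrid might look like for single-column equality predicates. This is purely my own illustration of the ‘learned model plus outlier list’ idea: `model_predict` stands in for whatever learned model is used, and the tolerance is an arbitrary choice.

```python
import numpy as np

class HybridCardinality:
    def __init__(self, model_predict, column_values, tolerance=2.0):
        self.predict = model_predict          # assumed learned model: value -> estimated count
        self.outliers = {}
        vals, counts = np.unique(column_values, return_counts=True)
        for v, c in zip(vals, counts):
            est = self.predict(v)
            # Remember exact counts for values the model gets badly wrong.
            if est <= 0 or max(c / est, est / c) > tolerance:
                self.outliers[v] = int(c)

    def estimate(self, value):
        # Outlier list first, learned model for everything else.
        return self.outliers.get(value, self.predict(value))
```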

Other areas

Other suggested areas where learned models may prove beneficial in the future include approximate query processing, predictive modelling, and workloads including inserts and updates.

The last word

SageDB presents a radical new approach to build database systems, by using ML models combined with program synthesis to generate system components. If successful, we believe this approach will result in a new generation of big data processing tools, which can better take advantage of GPUs and TPUs, provide significant benefits in regard to storage consumption and space, and, in some cases, even change the complexity class of certain data operations.