F1: A Distributed SQL Database That Scales

F1: A Distributed SQL Database That Scales – Shute et al. (Google), 2013


In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the AdWords product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.

This is the story of F1, the database that Google built to run their AdWords business. Previously in this series we also looked at Photon, the distributed stream-joining solution supporting the same system.

I like this paper not just for the description of how the F1 database is architected, but also for the discussions of why it is the way it is, and how the implications of system design decisions for users (developers) have been taken into consideration. Let’s start right there with some pretty hard-hitting words:

We [also] have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1.

and …

Designing applications to cope with concurrency anomalies in their data is very error-prone, time-consuming, and ultimately not worth the performance gains.

The key design criteria are:

  • Scalability simply by adding resources. No manual sharding and resharding.
  • Availability (this is Google’s core business system)
  • Consistency – full ACID transactions and no need for app devs to ever deal with inconsistencies
  • Usability – Full SQL query support and other functionality expected from a SQL database. “Features like indexes and ad-hoc query are not just nice to have, but absolute requirements for our business.”

Recent publications have suggested that these design goals are mutually exclusive. A key contribution of this paper is to show how we achieved all of these goals in F1’s design, and where we made trade-offs and sacrifices.

F1 builds on top of Spanner for data storage, and accepts higher latency for typical reads and writes in order to provide consistency. This latency is ameliorated through explicit clustering and heavy use of batching, parallelism, and asynchronous reads, which together bring the overall performance for user-facing transactions to the same level as the previous MySQL-based system.
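
To see why that batching and parallelism matters, here is a back-of-the-envelope sketch (all numbers invented for illustration): if every row read pays a Spanner-level round trip, naive one-row-at-a-time access multiplies that latency, while batched and pipelined reads pay it only once or twice.

    # Back-of-the-envelope latency of a user-facing request under different
    # read strategies. The figures below are invented for illustration only.
    RTT_MS = 5                    # assumed per-read round trip to the storage layer
    READS_NEEDED = 40             # rows touched by a typical user-facing request

    sequential_ms = READS_NEEDED * RTT_MS   # naive one-row-at-a-time access
    batched_ms = 1 * RTT_MS                 # all independent keys in one batched read
    pipelined_ms = 2 * RTT_MS               # plus one round trip for dependent reads

    print(f"sequential: {sequential_ms} ms, batched: {batched_ms} ms, "
          f"pipelined: {pipelined_ms} ms")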

To get an idea of the scale that F1 is designed to operate at, consider that…

(the AdWords) database is over 100 TB, serves up to hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day.

There’s a lot in this paper, which makes it hard to do justice in a summary. Some of the distinguishing features of F1 include:

  • Hierarchical Schema – whereby tables can be organized into a hierarchy, with the rows of a child table stored alongside, and interleaved within, the rows of its parent table (think of an aggregate entity, for example). Transactions restricted to a single root usually avoid 2PC and the associated latency (a sketch of this appears after the quote below).
  • Table columns can contain structured data types, in particular Protocol Buffers.

At Google, Protocol Buffers are ubiquitous for data storage and interchange between applications. When we still had a MySQL schema, users often had to write tedious and error-prone transformations between database rows and in-memory data structures. Putting protocol buffers in the schema removes this impedance mismatch and gives users a universal data structure they can use both in the database and in application code… Protocol Buffer columns are more natural and reduce semantic complexity for users, who can now read and write their logical business objects as atomic units, without having to think about materializing them using joins across several tables.
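
A minimal sketch of both ideas, using an AdWords-style Customer → Campaign hierarchy as the example. The storage model, key layout, and CampaignInfo type below are illustrative stand-ins, not F1's actual DDL or client API; the point is only that child keys are prefixed by the root key (so a root and its descendants cluster together) and that a single column can hold a whole structured business object.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class CampaignInfo:                 # stand-in for a Protocol Buffer column type
        name: str
        daily_budget_usd: float
        targeting: Dict[str, str] = field(default_factory=dict)

    # Key -> row, conceptually kept sorted by key; a key is the concatenation of
    # ancestor primary keys, so child rows interleave with their root row.
    Key = Tuple[str, ...]
    storage: Dict[Key, dict] = {}

    def put_customer(customer_id: str) -> None:
        storage[(customer_id,)] = {"CustomerId": customer_id}

    def put_campaign(customer_id: str, campaign_id: str, info: CampaignInfo) -> None:
        # ("cust42", "camp7") sorts immediately after ("cust42",): the child row
        # lives with its root, so a transaction touching only this root's rows
        # can usually commit without two-phase commit across shards.
        storage[(customer_id, campaign_id)] = {"Info": info}

    put_customer("cust42")
    put_campaign("cust42", "camp7", CampaignInfo("Spring sale", 250.0))

    # All rows under one root form a single contiguous key range:
    print(sorted(k for k in storage if k[0] == "cust42"))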

  • Transactional and fully consistent indices, split into local indices and global indices.
  • Support for fully non-blocking schema changes (this is pretty neat!)

The AdWords database is shared by thousands of users and is under constant development. Batches of schema changes are queued by developers and applied daily. This database is mission critical for Google and requires very high availability. Downtime or table locking during schema changes (e.g. adding indexes) is not acceptable.

Schema changes are applied asynchronously on multiple F1 servers. Anomalies are prevented by the use of a schema leasing mechanism that allows only the current and next schema versions to be in use at any time, and by subdividing schema changes into multiple phases in which consecutive pairs of changes are mutually compatible and cannot cause anomalies. The full details of the schema change algorithms are provided in a separate paper.
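
The companion schema-change paper breaks a change such as "add an index" into intermediate states, with any two consecutive states able to coexist safely across servers. Here is a minimal sketch of that idea using the delete-only/write-only progression described there; the function shape and row/index representation are invented, and the backfill and lease plumbing are omitted.

    from enum import Enum

    class IndexState(Enum):
        ABSENT = 0       # index not present in this server's schema version
        DELETE_ONLY = 1  # maintain deletions only: never add new index entries
        WRITE_ONLY = 2   # also add entries on insert/update (not yet readable)
        PUBLIC = 3       # readable, after a backfill has covered existing rows

    def apply_write(row_key: str, row: dict, index: dict,
                    state: IndexState, deleting: bool = False) -> None:
        # Index maintenance for one row write, as performed by a server whose
        # leased schema is in `state`. Because the lease keeps every server
        # within one version of the latest schema, only *consecutive* states
        # ever coexist, and each consecutive pair leaves the index consistent.
        if deleting:
            if state is not IndexState.ABSENT:
                index.pop(row["indexed_col"], None)
            return
        if state in (IndexState.WRITE_ONLY, IndexState.PUBLIC):
            index[row["indexed_col"]] = row_key

    index: dict = {}
    apply_write("cust42/camp7", {"indexed_col": "Spring sale"}, index,
                IndexState.WRITE_ONLY)
    print(index)   # {'Spring sale': 'cust42/camp7'}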

  • Snapshot, pessimistic, and optimistic transaction support, with optimistic transactions being the default mode.
  • Flexible lock granularity, enabling users to specify additional locks covering subsets of columns.

Frequently, F1 users use lock columns in parent tables to avoid insertion phantoms for specific predicates or make other business logic constraints easier to enforce.
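
Here is a minimal sketch of the optimistic flavour (the default) together with the lock-column pattern from the quote above, assuming a toy row store in which every row carries a last-modified timestamp. The names and the commit protocol shown are simplifications, not the F1 client API.

    import time

    # Toy row store: key -> {"value": ..., "ts": last-modified timestamp}.
    rows = {}

    def optimistic_read(keys):
        # Lock-free read phase: return each value together with the timestamp seen.
        return {k: (rows[k]["value"], rows[k]["ts"]) for k in keys if k in rows}

    def optimistic_commit(read_set, writes):
        # Short write phase: abort if anything read has since changed, then apply
        # the writes (in F1 the re-check and the writes happen atomically).
        for key, (_, ts_seen) in read_set.items():
            if key in rows and rows[key]["ts"] != ts_seen:
                raise RuntimeError(f"conflict on {key}: modified since it was read")
        now = time.time()
        for key, value in writes.items():
            rows[key] = {"value": value, "ts": now}

    # Lock-column pattern (one common use): a transaction inserting a child also
    # reads and rewrites a designated lock column on the parent row, so any
    # concurrent transaction that read the same lock column (for a predicate
    # check or a competing insert) fails its commit-time timestamp check instead
    # of slipping in a phantom.
    rows["Customer/cust42/_lock"] = {"value": 0, "ts": 0.0}
    read_set = optimistic_read(["Customer/cust42/_lock"])
    optimistic_commit(read_set, {
        "Campaign/cust42/camp8": {"name": "X"},
        "Customer/cust42/_lock": 1,
    })
    print(rows["Campaign/cust42/camp8"])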

  • A fully integrated change history mechanism (with what sounds like SQL-based Continuous Query (CQ) support for these changes).

Many database users build mechanisms to log changes, either from application code or using database features like triggers. In the MySQL system that AdWords used before F1, our Java application libraries added change history records into all transactions. This was nice, but it was inefficient and never 100% reliable. Some classes of changes would not get history records, including changes written from Python scripts and manual SQL data changes. In F1, Change History is a first-class feature at the database level.
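
A sketch of what consuming that change history might look like. The paper describes per-transaction change records (Protocol Buffers holding the primary key plus before and after values) stored in the database itself; the record shape, field names, and polling loop below are simplified stand-ins, not the real schema or the CQ API.

    from dataclasses import dataclass
    from typing import Any, Dict, List

    @dataclass
    class ChangeRecord:
        commit_ts: float
        table: str
        primary_key: str
        before: Dict[str, Any]
        after: Dict[str, Any]

    change_log: List[ChangeRecord] = []     # in F1: ordinary, queryable tables

    def changes_since(checkpoint_ts: float) -> List[ChangeRecord]:
        # What a downstream consumer (cache invalidation, stream processing, ...)
        # would poll for, resuming from its last checkpoint.
        return sorted((c for c in change_log if c.commit_ts > checkpoint_ts),
                      key=lambda c: c.commit_ts)

    change_log.append(ChangeRecord(101.0, "Campaign", "cust42/camp7",
                                   {"daily_budget_usd": 250.0},
                                   {"daily_budget_usd": 300.0}))
    for change in changes_since(100.0):
        print(change.table, change.primary_key, change.after)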

  • A NoSQL key/value store interface
  • A fully-fledged SQL interface which is used for both OLTP and OLAP queries.

There’s also an interesting insight into ORM. Google found that a distributed store, even one with full SQL support, needed a different ORM approach:

The nature of working with a distributed data store required us to rethink the way our client applications interacted with the database. Many of our client applications had been written using a MySQL-based ORM layer that could not be adapted to work well with F1. Code written using this library exhibited several common ORM anti-patterns… For F1, we replaced this ORM layer with a new, stripped-down API that forcibly avoids these anti-patterns. The new ORM layer does not use any joins and does not implicitly traverse any relationships between records. All object loading is explicit, and the ORM layer exposes APIs that promote the use of parallel and asynchronous read access.
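
A small sketch of the contrast, with an invented client API: objects carry no lazy relationship properties, and the caller loads related data explicitly, in batched and asynchronous reads.

    import asyncio
    from typing import Dict, List

    class Campaign:
        def __init__(self, campaign_id: str, name: str):
            self.campaign_id, self.name = campaign_id, name
            # Deliberately absent: a lazy `ad_groups` property that would issue
            # one hidden query per access (the classic N+1 anti-pattern).

    async def load_campaigns(customer_id: str) -> List[Campaign]:
        await asyncio.sleep(0.005)   # stands in for one explicit F1 read
        return [Campaign(f"{customer_id}/camp{i}", f"Campaign {i}") for i in range(3)]

    async def load_ad_group_counts(campaign_ids: List[str]) -> Dict[str, int]:
        await asyncio.sleep(0.005)   # one explicit, batched read for all campaigns
        return {cid: 2 for cid in campaign_ids}

    async def main() -> None:
        campaigns = await load_campaigns("cust42")
        # The caller decides which related data it needs and fetches it in one
        # parallel step, rather than traversing object references implicitly.
        counts = await load_ad_group_counts([c.campaign_id for c in campaigns])
        for c in campaigns:
            print(c.name, counts[c.campaign_id])

    asyncio.run(main())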

The paper contains a good discussion of how F1 processes queries, for which I refer you to the original.

All of these great developer-friendly features do have to be paid for in the end though, and among the trade-offs made in F1's design this manifests most obviously as increased CPU usage.

Resource costs are usually higher in F1, where queries often use an order of magnitude more CPU than similar MySQL queries. MySQL stored data uncompressed on local disk, and was usually bottlenecked by disk rather than CPU, even with flash disks. F1 queries start from data compressed on disk and go through several layers, decompressing, processing, recompressing, and sending over the network, all of which have significant cost. Improving CPU efficiency here is an area for future work.