The Google File System

The Google File System – Ghemawat, Gobioff & Leung, 2003

Here’s a paper with a lot to answer for! Back in 2003, Ghemawat et al. reported that:

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault-tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

Google’s workloads and use of commodity hardware led to the following observations:

  1. Component failures are the norm rather than the exception.
  2. The files Google worked with were huge by traditional standards: multi-GB files were very common, and managing a large number of small (KB-sized) files became unwieldy, forcing assumptions about block size and I/O to be revisited.
  3. Most Google workloads mutated files only by appending to them; random writes were practically non-existent, and files were typically read sequentially.
  4. Google were able to co-design their applications and the file system API (i.e. abandon the requirement for POSIX compliance) for greater flexibility.

The result was the Google File System, designed to store a modest number (a few million) of large files, with workloads consisting of large streaming reads and small random reads, large sequential writes that append data to files, and semantics supporting multiple clients concurrently appending to the same file. In addition:

High sustained bandwidth is more important than low latency.
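
Those concurrent-append semantics come with a caveat: record append is atomic but at-least-once, so a retried append can leave padding or duplicate records in the file, and the paper notes that readers filter these out using checksums and unique per-record identifiers. A minimal reader-side sketch, assuming each record carries an application-assigned unique id (the id scheme is the application’s responsibility, not something GFS provides):

```python
def dedup_records(records):
    """Filter duplicates from a record-append stream.

    GFS record append is at-least-once: a failed-then-retried append
    can leave duplicate records behind, so readers filter them out.
    Assumes each record is a (unique_id, payload) pair, where the id
    is assigned by the writing application.
    """
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate left behind by a retried append
        seen.add(record_id)
        yield payload
```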

The implementation details probably sound very familiar: a single master and multiple chunkservers, with files divided into fixed-size 64MB chunks, each replicated (three times by default).
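
One consequence of the fixed chunk size is that a client can translate any byte offset into a chunk index locally, ask the master once for that chunk’s handle and replica locations, and then talk to the chunkservers directly. A minimal sketch of the offset-to-chunk mapping (the function name is mine, not the paper’s):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses a fixed 64MB chunk size

def chunk_index(byte_offset: int) -> int:
    """Map a file byte offset to the chunk that contains it.

    A GFS client computes this locally, sends (filename, chunk_index)
    to the master to obtain the chunk handle and replica locations,
    and caches the reply so subsequent reads bypass the master.
    """
    return byte_offset // CHUNK_SIZE

# A read at offset 200MB falls in chunk 3 (chunks 0-2 cover 0-192MB).
assert chunk_index(200 * 1024 * 1024) == 3
```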

We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

We talk a lot about replication; it’s good to remember the fast-recovery part of the equation too. Also of note in the paper is the emphasis on providing good diagnostic tools.

In summary:

The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unique setting, many apply to data processing tasks of a similar magnitude and cost consciousness.

GFS of course went on to inspire Hadoop’s HDFS, and the rest is history. It’s good to go back and look at the workload assumptions that inspired GFS, as a sanity check that your use-cases match. Deploying HDFS just because 67% of other companies install HDFS is not a good strategy!

It’s also interesting to wonder how things would have played out if Google themselves had decided to open source GFS. Google of course subsequently went on to introduce Caffeine and Colossus (aka GFS 2) as they needed to respond in real time rather than in batch mode. HDFS evolved to…?