f4: Facebook’s warm BLOB storage system – Muralidhar et al. 2014
This is a story of system engineering trade-offs, a design informed by data analysis, and hard-won experience. It’s the story of how Facebook implemented a tiered storage solution for BLOBs, with per data class (temperature) tuning of replication factor, latency, and time-to-recovery. If you’re into this kind of thing, it’s also a really good read.
Facebook’s corpus of photos, videos, and other Binary Large OBjects (BLOBs) that need to be reliably stored and quickly accessible is massive and continues to grow. As the footprint of BLOBs increases, storing them in our traditional storage system, Haystack, is becoming increasingly inefficient. To increase our storage efficiency, measured in the effective-replication-factor of BLOBs, we examine the underlying access patterns of BLOBs and identify temperature zones that include hot BLOBs that are accessed frequently and warm BLOBs that are accessed far less often. Our overall BLOB storage system is designed to isolate warm BLOBs and enable us to use a specialized warm BLOB storage system, f4.
f4 was storing about 65PBs of data at the time the paper was written. As of February 2014, this included over 400 billion photos. The underlying file system is HDFS.
Based on a two-week trace, benchmarks of existing systems, and daily snapshots of summary statistics, the Facebook team discovered that:
There is a strong correlation between the age of a BLOB and its temperature. Newly created BLOBs are requested at a far higher rate than older BLOBs. For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. In addition, there is a strong correlation between age and the deletion rate. We use these findings to inform our design…
These data access patterns support a two-tier BLOB storage solution: Haystack is used for ‘hot’ blobs, with content migrating to f4 after three months for photos and after one month for other blob types. By reducing the replication factor for ‘warm’ blobs, the average request latency goes up slightly (from 14ms to 17ms) and the time to recover after a failure increases, but the cost of storage is significantly reduced. The effective replication factor (ratio of physical data size to logical data size) is 3.6 for Haystack; for f4 it has been brought down to 2.1.
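As a rough sketch of what such an age-based tiering policy looks like (the three-month and one-month thresholds come from the paper; the function and names here are hypothetical, not Facebook’s actual routing code):

```python
from datetime import datetime, timedelta

# Hypothetical illustration of the age-based tiering described above:
# photos stay in Haystack for ~3 months, other BLOB types for ~1 month,
# after which they become candidates for migration to f4.
MIGRATION_AGE = {
    "photo": timedelta(days=90),
    "default": timedelta(days=30),
}

def storage_tier(blob_type: str, created_at: datetime, now: datetime) -> str:
    """Return which tier should hold the BLOB at time `now`."""
    threshold = MIGRATION_AGE.get(blob_type, MIGRATION_AGE["default"])
    return "f4" if now - created_at > threshold else "haystack"
```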
Just to give an idea of the overall scale we’re talking about, the unit of acquisition, deployment and roll-out is a cell. Each cell has 14 racks of 15 hosts, with 30 4TB drives per host. That’s 6,300 4TB drives!
Replication Factor reduction
Haystack uses triple replication with RAID-6 giving a replication factor of 3×1.2 = 3.6.
f4 stores volumes of warm BLOBs in cells that use distributed erasure coding, which uses fewer physical bytes than triple replication. It uses Reed-Solomon(10,4) coding and lays blocks out on different racks to ensure resilience to disk, machine, and rack failures within a single datacenter. It uses XOR coding in the wide-area to ensure resilience to datacenter failures. f4 has been running in production at Facebook for over 19 months. f4 currently stores over 65PB of logical data and saves over 53PB of storage.
Using Reed-Solomon coding achieves reliability at lower storage overheads than full replication, at the cost of longer rebuild and recovery times under failure. A (10,4) encoding has a 1.4x expansion factor. Keeping two copies of this would give a 2.8x effective replication factor, but the XOR technique further reduces the replication factor to 2.1.
We pair each volume/stripe/block with a buddy volume/stripe/block in a different geographic region. We store an XOR of the buddies in a third region.
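The arithmetic behind these effective-replication-factor figures is easy to verify (a quick sketch using only the numbers quoted above):

```python
# Haystack: 3 replicas, each stored on RAID-6 (1.2x overhead)
haystack = 3 * 1.2                      # 3.6

# f4: Reed-Solomon(10,4) within a datacenter
rs_expansion = (10 + 4) / 10            # 1.4x

# Two full copies of the RS-encoded data would cost 2 * 1.4 = 2.8x.
two_copies = 2 * rs_expansion           # 2.8

# Instead, a volume, its buddy in a second region, and their XOR in a
# third region store 2 logical volumes in 3 encoded ones: a 1.5x factor.
f4 = 1.5 * rs_expansion                 # 2.1

print(f"{haystack:.1f} {two_copies:.1f} {f4:.1f}")  # 3.6 2.8 2.1
```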
Deletes are handled in a neat way by encrypting individual blobs within a volume, and simply deleting the encryption key when deleting a blob. This helps to keep the overall design much simpler.
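A minimal sketch of the crypto-delete idea, assuming an in-memory key store and using Fernet purely for illustration (the paper doesn’t specify the cipher, and real f4 volumes are append-only files rather than dicts):

```python
from cryptography.fernet import Fernet

# Per-BLOB keys live in a small, mutable key store; the (effectively
# immutable) volume holds only ciphertext.
key_store = {}   # blob_id -> key
volume = {}      # blob_id -> encrypted bytes

def put(blob_id: str, data: bytes) -> None:
    key = Fernet.generate_key()
    key_store[blob_id] = key
    volume[blob_id] = Fernet(key).encrypt(data)

def get(blob_id: str) -> bytes:
    return Fernet(key_store[blob_id]).decrypt(volume[blob_id])

def delete(blob_id: str) -> None:
    # Deleting the key renders the stored ciphertext unreadable, with no
    # need to rewrite or compact the volume at delete time.
    del key_store[blob_id]
```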
Mechanical Sympathy
There are some nice touches of mechanical sympathy showing through in the design.
- Choosing a 1 GB block size for encoding reduces the number of blobs that span multiple blocks and therefore require multiple I/O operations to read (a rough calculation after this list illustrates the effect)
- Separating out storage-intensive tasks from compute-intensive tasks in the architecture (for example, the introduction of a separate transformation tier) enables the use of appropriate hardware for each.
- Candidate hard drives were benchmarked to determine maximum IOPS that could be consistently achieved at the desired latency
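The 1 GB block size choice is easy to build an intuition for: a blob laid down at an arbitrary offset spans a block boundary with probability of roughly blob size divided by block size. The blob sizes below are purely illustrative (the paper doesn’t give a size distribution at this point):

```python
def boundary_crossing_probability(blob_bytes: int, block_bytes: int) -> float:
    """Approximate chance a blob laid out at a random offset spans two blocks."""
    return min(1.0, blob_bytes / block_bytes)

GIB = 2**30
for blob_mb in (1, 4, 16):               # illustrative blob sizes only
    p = boundary_crossing_probability(blob_mb * 2**20, 1 * GIB)
    print(f"{blob_mb} MB blob, 1 GiB block: ~{p:.2%} of blobs need two reads")
```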
An important consideration in the design of f4 was keeping the hardware and software well matched. Hardware that provides capacity or IOPS that are not used by the software is wasteful; software designed with unrealistic expectations of the hardware will not work. The hardware and software components of f4 were co-designed to ensure they were well-matched by using software measurements to inform hardware choices and vice-versa.
The lower CPU requirements of the storage nodes may even enable the use of lower-powered CPUs in the future, introducing a second form of cost saving.
Self-healing
We’re all familiar with the ‘design for failure’ mantra. f4 is designed to handle disk failures, host failures, rack failures, and data center failures. The failure rates are interesting: drives have an annual failure rate of about 1% (so > 1 drive failure per cell per week), hosts fail ‘periodically’, and racks fail ‘multiple times per year’.
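That parenthetical follows directly from the cell size given earlier (6,300 drives at ~1% AFR):

```python
drives_per_cell = 14 * 15 * 30      # racks * hosts * drives = 6,300
annual_failure_rate = 0.01          # ~1% AFR
failures_per_week = drives_per_cell * annual_failure_rate / 52
print(f"~{failures_per_week:.1f} drive failures per cell per week")  # ~1.2
```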
When there are failures in a cell, some data blocks will become unavailable, and serving reads for the BLOBs they hold will require online reconstruction from companion data blocks and parity blocks. Backoff nodes are storage-less, CPU-heavy nodes that handle the online reconstruction of requested BLOBs.
This online reconstruction happens in the request path and rebuilds only the requested blob, not the full block (the blob is typically much smaller than the block). This keeps request latency down. Full block rebuilding is then handled offline by dedicated rebuilder nodes.
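A sketch of what a backoff node’s read path might look like; the `rs_decode_range` helper and the stripe layout here are hypothetical stand-ins, since real Reed-Solomon decoding is done by the underlying coding library:

```python
from typing import Callable, Dict, Optional

# Hypothetical stripe layout: 10 data blocks + 4 parity blocks.
N_DATA, N_PARITY = 10, 4

def read_blob(stripe: Dict[int, Optional[bytes]],
              block_idx: int, offset: int, length: int,
              rs_decode_range: Callable[..., bytes]) -> bytes:
    """Serve a read for one blob living inside `block_idx` of a stripe.

    `stripe` maps block index -> block bytes, or None if unavailable.
    `rs_decode_range` stands in for an erasure-coding routine that
    reconstructs just [offset, offset+length) of a missing data block
    from any 10 surviving blocks of the stripe.
    """
    block = stripe.get(block_idx)
    if block is not None:
        # Normal path: the data block is online, read directly.
        return block[offset:offset + length]

    # Failure path: rebuild only the requested byte range, not the
    # whole 1 GB block, to keep request latency down.
    survivors = {i: b for i, b in stripe.items() if b is not None}
    assert len(survivors) >= N_DATA, "too many failures to reconstruct"
    return rs_decode_range(survivors, block_idx, offset, length)
```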
At large scale, disk and node failures are inevitable. When this happens blocks stored on the failed components need to be rebuilt. Rebuilder nodes are storage-less, CPU-heavy nodes that handle failure detection and background reconstruction of data blocks. Each rebuilder node detects failure through probing and reports the failure to a coordinator node. It rebuilds blocks by fetching n companion or parity blocks from the failed block’s stripe and decoding them. Rebuilding is a heavy-weight process that imposes significant I/O and network load on the storage nodes. Rebuilder nodes throttle themselves to avoid adversely impacting online user requests. Scheduling the rebuilds to minimize the likelihood of data loss is the responsibility of the coordinator nodes.
Coordinator nodes are also storage-less, CPU heavy. As well as scheduling block rebuilding they also ensure that the current data layout minimizes the chances of data unavailability.
… blocks in a stripe are laid out on different failure domains to maximize reliability. However, after initial placement and after failure, reconstruction, and replacement there can be violations where a stripe’s blocks are in the same failure domain. The coordinator runs a placement balancer process that validates the block layout in the cell, and rebalances blocks as appropriate.
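A toy version of the check such a placement balancer performs might look like this (the data structures and names are illustrative only):

```python
from collections import Counter
from typing import Dict, List

def placement_violations(stripes: Dict[str, List[str]],
                         rack_of_block: Dict[str, str]) -> Dict[str, List[str]]:
    """Find stripes whose blocks share a failure domain (here, a rack).

    `stripes` maps stripe id -> list of block ids; `rack_of_block` maps
    block id -> rack id. A rack hosting two blocks of the same stripe is
    a violation the balancer should fix by moving one of the blocks.
    """
    violations = {}
    for stripe_id, blocks in stripes.items():
        racks = Counter(rack_of_block[b] for b in blocks)
        crowded = [rack for rack, count in racks.items() if count > 1]
        if crowded:
            violations[stripe_id] = crowded
    return violations
```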
Thus different components at different layers in the system architecture are continually watching, repairing, and rebalancing the system, like the life-preserving mechanisms in a living organism.
Facebook lessons learned
In the course of designing, building, deploying, and refining f4 we learned many lessons. Among these the importance of simplicity for operational stability, the importance of measuring underlying software for your use case’s efficiency, and the need for heterogeneity in hardware to reduce the likelihood of correlated failures stand out.
Hardware heterogeneity gives immunity against shared weaknesses in homogeneous hardware.
We recently learned about the importance of heterogeneity in the underlying hardware for f4 when a crop of disks started failing at a higher rate than normal. In addition, one of our regions experienced higher than average temperatures that exacerbated the failure rate of the bad disks. This combination of bad disks and high temperatures resulted in an increase from the normal ~1% AFR to an AFR over 60% for a period of weeks.
Takeaways
There’s a lot to like about the way the Facebook team approached this challenge. It all begins with a detailed understanding of the use case and usage patterns, backed up by data analysis. Engineering trade-offs then optimise different parts of the system for their different usage, and these parts of the paper are a nice reminder that there is seldom a ‘right’ one-size-fits-all solution. The software and hardware are considered together, so that the system architecture enables the right hardware to be used for the right job (in a previous version of the system, compute-heavy and storage-heavy tasks were mixed on the same nodes). Finally, the multiple layers of repair and recovery create a self-sustaining, self-healing system that can survive multiple and continual failures.
As ever, it’s hard to decide what to leave out when summarising a paper. I can assure you there’s plenty of meat left on the bone of this one; it’s well worth the time to review the full paper. As always, the link is at the top of this post.