Large-scale cluster management at Google with Borg

May 7, 2015 (updated July 26, 2017) ~ adriancolyer ~ 9 Comments

Large-scale cluster management at Google with Borg - Verma et al. 2015. Borg has been running all of Google's workloads for the last ten years, and the learnings from Borg are being packaged into Kubernetes so that the rest of the world can benefit from them. An important paper then as the rest of us …

The Chubby lock service for loosely coupled distributed systems

February 13, 2015 (updated July 26, 2017) ~ adriancolyer ~ 4 Comments

The Chubby lock service for loosely coupled distributed systems - Burrows '06. This paper describes the Chubby lock service at Google, which was designed as a coarse-grained locking service, found use mostly as a name service and configuration repository, and inspired the creation of Zookeeper. [Chubby's] design is based on well-known ideas that have meshed …

Dremel: interactive analysis of web-scale datasets

January 26, 2015 (updated July 26, 2017) ~ adriancolyer ~ 2 Comments

Dremel: interactive analysis of web-scale datasets - Melnik et al. (Google), 2010. Dremel is Google's interactive ad-hoc query system that can run aggregate queries over trillions of rows in seconds. It scales to thousands of CPUs, and petabytes of data. It was also the inspiration for Apache Drill. Dremel borrows the idea of serving trees …

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

January 19, 2015 (updated July 26, 2017) ~ adriancolyer ~ 5 Comments

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Google 2014. Mesa is another in the tapestry of systems that support Google's advertising business. Previous editions of The Morning Paper have covered Photon, Spanner, F1, and F1's online schema update mechanism. Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related …

The Tail at Scale

January 15, 2015 (updated July 26, 2017) ~ adriancolyer ~ 16 Comments

The Tail at Scale - Dean and Barroso 2013. We've all become familiar with the importance of fault-tolerance and the techniques that can be used to achieve it. Less well-known is the idea of tail-tolerance. A system that doesn't respond quickly enough feels clunky to its users and can have serious negative consequences for site/service …

Spanner: Google’s Globally Distributed Database

January 8, 2015 (updated July 26, 2017) ~ adriancolyer ~ 8 Comments

Spanner: Google's Globally Distributed Database - Google 2012. Since we've spent the last two days looking at F1 and its online asynchronous schema change support, it seems appropriate today to look at Spanner, the system that underpins them both. There are three interesting stories that come out of the paper for me, each of which …

Online, Asynchronous Schema Change in F1

January 7, 2015 (updated July 26, 2017) ~ adriancolyer ~ 4 Comments

Online, Asynchronous Schema Change in F1 - Rae et al. 2013. Continuous deployment and evolution of running services with zero downtime is the holy grail. With stateless services this is comparatively easy to achieve. But once we have stateful services, and especially large volumes of data in a store, things get more difficult. We would ideally …

F1: A Distributed SQL Database That Scales

January 6, 2015 (updated July 26, 2017) ~ adriancolyer ~ 10 Comments

F1: A Distributed SQL Database That Scales - Google 2012. (** updated paper link above, thanks to Brenden Kromhout for pointing out the dead link **) In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use …

Photon: Fault-tolerant and scalable joining of continuous data streams

December 4, 2014 (updated July 26, 2017) ~ adriancolyer ~ 4 Comments

Photon: Fault-tolerant and scalable joining of continuous data streams - Google 2013. To the best of our knowledge, this is the first paper to formulate and solve the problem of joining multiple streams continuously under these system constraints: exactly-once semantics, fault-tolerance at datacenter-level, high scalability, low latency, unordered streams, and delayed primary stream. It's interesting …

The Google File System

October 30, 2014 (updated July 26, 2017) ~ adriancolyer ~ 4 Comments

The Google File System - Ghemawat, Gobioff & Leung, 2003. Here's a paper with a lot to answer for! Back in 2003 Ghemawat et al. reported that "We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault-tolerance while running on inexpensive commodity hardware, …"