Azure Data Lake Store: a hyperscale distributed file service for big data analytics

Azure data lake store: a hyperscale distributed file service for big data analytics Douceur et al., SIGMOD'17 Today's paper takes us inside Microsoft Azure's distributed file service called the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and ... Continue Reading

Spanner: becoming a SQL system

Spanner: becoming a SQL system Bacon et al., SIGMOD'17 This week we'll start digging into some of the papers from SIGMOD'17. First up is a terrific 'update' paper on Google's Spanner which brings the story up to date in the five years since the original OSDI'12 paper. ... in many ways, today's Spanner is very ... Continue Reading

Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing

Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing - Sun et al. 2014 (WalmartLabs) Large-scale classification, where we need to classify hundreds of thousands or millions of items into thousands of classes, is becoming increasingly common in this age of Big Data... So far, however, very little has been published on how large-scale classification ... Continue Reading

Enterprise Database Applications and the Cloud: A difficult road ahead

Enterprise Database Applications and the Cloud: A difficult road ahead - Stonebraker et al. 2014 In the rush to the cloud, stateless application components are well catered for but state always makes things more complicated. In this paper, Stonebraker et al. set out some of the reasons enterprise database applications present challenges to cloud migration. ... Continue Reading

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Google 2014 Mesa is another in the tapestry of systems that support Google's advertising business. Previous editions of The Morning Paper have covered Photon, Spanner, F1, and F1's online schema update mechanism. Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related ... Continue Reading

The Google File System

The Google File System - Ghemawat, Gobioff & Leung, 2003 Here's a paper with a lot to answer for! Back in 2003, Ghemawat et al. reported: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault-tolerance while running on inexpensive commodity hardware, ... Continue Reading