Azure Data Lake Store: a hyperscale distributed file service for big data analytics

July 4, 2017 | July 2, 2017 ~ adriancolyer ~ 3 Comments

Azure data lake store: a hyperscale distributed file service for big data analytics, Douceur et al., SIGMOD'17. Today's paper takes us inside Microsoft Azure's distributed file service called the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and …

Spanner: becoming a SQL system

July 3, 2017 | July 2, 2017 ~ adriancolyer ~ 6 Comments

Spanner: becoming a SQL system, Bacon et al., SIGMOD'17. This week we'll start digging into some of the papers from SIGMOD'17. First up is a terrific 'update' paper on Google's Spanner, which brings the story up to date five years on from the original OSDI'12 paper. ... in many ways, today's Spanner is very …

Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing

January 28, 2016 | July 27, 2017 ~ adriancolyer ~ 2 Comments

Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing - Sun et al. 2014 (WalmartLabs). Large-scale classification, where we need to classify hundreds of thousands or millions of items into thousands of classes, is becoming increasingly common in this age of Big Data... So far, however, very little has been published on how large-scale classification …

Enterprise Database Applications and the Cloud: A difficult road ahead

February 12, 2015 | July 26, 2017 ~ adriancolyer

Enterprise Database Applications and the Cloud: A difficult road ahead - Stonebraker et al. 2014. In the rush to the cloud, stateless application components are well catered for, but state always makes things more complicated. In this paper, Stonebraker et al. set out some of the reasons enterprise database applications present challenges to cloud migration. …

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

January 19, 2015 | July 26, 2017 ~ adriancolyer ~ 5 Comments

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Google 2014. Mesa is another in the tapestry of systems that support Google's advertising business. Previous editions of The Morning Paper have covered Photon, Spanner, F1, and F1's online schema update mechanism. Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related …

Detecting Discontinuities in Large-Scale Systems

November 25, 2014 | July 26, 2017 ~ adriancolyer

Detecting Discontinuities in Large-Scale Systems - Malik et al. 2014. The 7th IEEE/ACM International Conference on Utility and Cloud Computing is coming to London in a couple of weeks' time. Many of the papers don't seem to be online yet, but here's one that is. Malik et al. tackle the problem of long-term forecasting for …

The Google File System

October 30, 2014 | July 26, 2017 ~ adriancolyer ~ 4 Comments

The Google File System - Ghemawat, Gobioff & Leung, 2003. Here's a paper with a lot to answer for! Back in 2003, Ghemawat et al. reported: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault-tolerance while running on inexpensive commodity hardware, …