Coordination Avoidance in Database Systems

March 19, 2015 / July 26, 2017 ~ adriancolyer ~ 23 Comments

Coordination Avoidance in Database Systems - Bailis et al. 2014 The very title of this paper speaks to the theme we've been looking at so far this week - how to reduce the amount of coordination needed in a distributed system. (Which seems fitting having just spent the prior two weeks looking at how costly …

RIPQ: Advanced photo caching on flash for Facebook

February 27, 2015 / July 26, 2017 ~ adriancolyer

RIPQ: Advanced Photo Caching on Flash for Facebook - Tang et al. 2015 It's three for the price of one with this paper: we get to deepen our understanding of the characteristics of flash, examine a number of priority queue and caching algorithms, and get a glimpse into what's behind an important part of Facebook's …

Enterprise Database Applications and the Cloud: A difficult road ahead

February 12, 2015 / July 26, 2017 ~ adriancolyer

Enterprise Database Applications and the Cloud: A difficult road ahead - Stonebraker et al. 2014 In the rush to the cloud, stateless application components are well catered for but state always makes things more complicated. In this paper, Stonebraker et al. set out some of the reasons enterprise database applications present challenges to cloud migration. …

Encapsulation of parallelism in the Volcano query processing system

February 11, 2015 / July 26, 2017 ~ adriancolyer ~ 5 Comments

Encapsulation of parallelism in the Volcano query processing system - Graefe '89. You may have picked up on the throwaway line in the Impala paper: "The execution model is the traditional Volcano-style with Exchange operators." So what exactly is the 'traditional Volcano style', and what are 'exchange operators'? Today's choice is the paper that first …

Impala: a modern, open-source SQL engine for Hadoop

February 5, 2015 / July 26, 2017 ~ adriancolyer ~ 4 Comments

Impala: A modern, open-source SQL engine for Hadoop - Kornacker et al. 2015 (Cloudera*) This is post 4 of 5 in a series looking at the latest research from CIDR'15. Also in the series so far this week: 'The missing piece in complex analytics', 'WANalytics, analytics for a geo-distributed, data intensive world', and 'Liquid: …

Introducing CIDR’15 week on The Morning Paper

February 1, 2015 / January 30, 2015 ~ adriancolyer ~ 4 Comments

The data systems research community are a smart bunch, although it's not their research and papers I'm referring to here. Many conferences move around, but not the Conference on Innovative Data Systems Research (CIDR). CIDR has found a rather nice venue "on the Pacific Ocean, just south of Monterey", and decided to stick there. Schedule …

Dremel: interactive analysis of web-scale datasets

January 26, 2015 / July 26, 2017 ~ adriancolyer ~ 2 Comments

Dremel: interactive analysis of web-scale datasets - Melnik et al. (Google), 2010. Dremel is Google's interactive ad-hoc query system that can run aggregate queries over trillions of rows in seconds. It scales to thousands of CPUs, and petabytes of data. It was also the inspiration for Apache Drill. Dremel borrows the idea of serving trees …

The MADlib Analytics Library

January 23, 2015 / July 26, 2017 ~ adriancolyer ~ 1 Comment

The MADlib Analytics Library - MAD Skills, the SQL - Hellerstein et al. 2012 The way that we use large databases has evolved from being primarily in support of accounting and financial record-keeping, to primarily in support of predictive analytics over a wide range of potentially noisy data. Analytics at scale requires the marriage of …

Architecture of a Database System

January 20, 2015 / July 26, 2017 ~ adriancolyer ~ 9 Comments

Architecture of a Database System - Hellerstein, Stonebraker & Hamilton, 2007. This is a longer read (and hence a slightly longer write-up too) coming in at 119 pages, but it's written in a very easy style so the pages fly by. It oozes wisdom and experience from every paragraph as Joe Hellerstein and Michael Stonebraker …

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

January 19, 2015 / July 26, 2017 ~ adriancolyer ~ 5 Comments

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Google 2014 Mesa is another in the tapestry of systems that support Google's advertising business. Previous editions of The Morning Paper have covered Photon, Spanner, F1, and F1's online schema update mechanism. Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related …