ApproxHadoop: Bringing Approximations to MapReduce Frameworks

April 16, 2015 ~ Adrian Colyer ~ 5 Comments

ApproxHadoop: Bringing Approximations to MapReduce Frameworks - Goiri et al. 2015 Yesterday we saw how including networking concerns in scheduling decisions can increase throughput for MapReduce jobs (and Storm topologies) by ~30%. Today we look at an even more effective strategy for getting the most out of your Hadoop cluster: doing less work! On one ... Continue Reading

WANalytics: Analytics for a geo-distributed, data intensive world

February 3, 2015 ~ Adrian Colyer ~ 2 Comments

WANalytics: analytics for a geo-distributed data intensive world - Vulimiri et al. 2015 ...data is born distributed; we only control data replication and distributed execution strategies. This is true for so many sources of data. Combine this with Dave McCrory's observation that 'Data has Gravity' (i.e. it attracts applications and other data processing workloads to ... Continue Reading

Spanner: Google’s Globally Distributed Database

January 8, 2015 ~ Adrian Colyer ~ 8 Comments

Spanner: Google's Globally Distributed Database - Google 2012 Since we've spent the last two days looking at F1 and its online asynchronous schema change support, it seems appropriate today to look at Spanner, the system that underpins them both. There are three interesting stories that come out of the paper for me, each of which ... Continue Reading

Online, Asynchronous Schema Change in F1

January 7, 2015 ~ Adrian Colyer ~ 4 Comments

Online, Asynchronous Schema Change in F1 - Rae et al. 2013 Continuous deployment and evolution of running services with zero downtime is the holy grail. With stateless services this is comparatively easy to achieve. But once we have stateful services, and especially large volumes of data in a store, things get more difficult. We would ideally ... Continue Reading

F1: A Distributed SQL Database That Scales

January 6, 2015 ~ Adrian Colyer ~ 10 Comments

F1: A Distributed SQL Database That Scales - Google 2012 (** updated paper link above, thanks to Brenden Kromhout for pointing out the dead link **) In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use ... Continue Reading

The Log-Structured Merge-Tree (LSM Tree)

November 26, 2014 ~ Adrian Colyer ~ 3 Comments

The Log-Structured Merge-Tree (LSM Tree) - O'Neil et al. '96. Log-Structured Merge is an important technique used in many modern data stores (for example, BigTable, Cassandra, HBase, Riak, ...). Suppose you have a hierarchy of storage options for data - for example, RAM, SSDs, spinning disks, with different price/performance characteristics. Furthermore, you have a large ... Continue Reading

The Declarative Imperative: Experiences and Conjectures in Distributed Logic

November 13, 2014 ~ Adrian Colyer ~ 13 Comments

The Declarative Imperative: Experiences and Conjectures in Distributed Logic - Hellerstein 2010. This paper is an extended version of an invited talk that Joe Hellerstein gave to the ACM PODS conference in 2010. The primary audience is therefore database researchers, but there's some good food for thought for the rest of us in there too. ... Continue Reading

Highly Available Transactions: Virtues and Limitations

November 7, 2014 ~ Adrian Colyer ~ 9 Comments

Highly Available Transactions: Virtues and Limitations - Bailis et al. 2014. Since yesterday we looked at the Boom Hierarchy, it seemed fitting today to take a selection from the BOOM project (no relation). Thus earning me the Basil Brush award ;) What a great paper this is, I have so many highlights and annotations on ... Continue Reading

Shark: SQL and Rich Analytics at Scale

October 13, 2014 ~ Adrian Colyer ~ Leave a comment

Shark: SQL and Rich Analytics at Scale - Xin et al. 2013. Given the Databricks Spark result reported last week, it seems timely to look at a system built on top of Spark, Shark, that ultimately informed the Spark SQL project. [Shark] leverages a novel distributed memory abstraction to provide a unified engine that can run ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Datastores