Musketeer – Part I : What’s the best data processing system?

April 27, 2015 ~ July 26, 2017 ~ adriancolyer ~ 18 Comments

Musketeer: all for one, one for all in data processing systems - Gog et al. 2015 For between 40% and 80% of the jobs submitted to MapReduce systems, you'd be better off just running them on a single machine... It was EuroSys 2015 last week, and a great new crop of papers was presented. Gog et al. …

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

April 16, 2015 ~ July 26, 2017 ~ adriancolyer ~ 5 Comments

ApproxHadoop: Bringing Approximations to MapReduce Frameworks - Goiri et al. 2015 Yesterday we saw how including networking concerns in scheduling decisions can increase throughput for MapReduce jobs (and Storm topologies) by ~30%. Today we look at an even more effective strategy for getting the most out of your Hadoop cluster: doing less work! On one …

Mojim: A Reliable and Highly-Available Non-Volatile Memory System

April 14, 2015 ~ July 26, 2017 ~ adriancolyer ~ 6 Comments

Mojim: A Reliable and Highly-Available Non-Volatile Memory System - Zhang et al. 2015 This is the second in a series of posts looking at the latest research from the recently held ASPLOS '15 conference. It seems like we've been anticipating NVMM (non-volatile main memory) for a while now, and there has been plenty of research …

Scalable Atomic Visibility with RAMP Transactions

March 27, 2015 ~ July 26, 2017 ~ adriancolyer ~ 4 Comments

Scalable Atomic Visibility with RAMP Transactions - Bailis et al. 2014 RAMP transactions came up last week as part of the secret sauce in 'Coordination Avoidance in Database Systems' that contributed to a 25x improvement on the TPC-C benchmark. So what exactly are RAMP transactions and why might we need them? As soon as you …

Coordination Avoidance in Database Systems

March 19, 2015 ~ July 26, 2017 ~ adriancolyer ~ 23 Comments

Coordination Avoidance in Database Systems - Bailis et al. 2014 The very title of this paper speaks to the theme we've been looking at so far this week - how to reduce the amount of coordination needed in a distributed system. (Which seems fitting having just spent the prior two weeks looking at how costly …

Enterprise Database Applications and the Cloud: A difficult road ahead

February 12, 2015 ~ July 26, 2017 ~ adriancolyer

Enterprise Database Applications and the Cloud: A difficult road ahead - Stonebraker et al. 2014 In the rush to the cloud, stateless application components are well catered for but state always makes things more complicated. In this paper, Stonebraker et al. set out some of the reasons enterprise database applications present challenges to cloud migration. …

Encapsulation of parallelism in the Volcano query processing system

February 11, 2015 ~ July 26, 2017 ~ adriancolyer ~ 5 Comments

Encapsulation of parallelism in the Volcano query processing system - Graefe '89. You may have picked up on the throwaway line in the Impala paper: "The execution model is the traditional Volcano-style with Exchange operators." So what exactly is the 'traditional Volcano style', and what are 'exchange operators'? Today's choice is the paper that first …

Impala: a modern, open-source SQL engine for Hadoop

February 5, 2015 ~ July 26, 2017 ~ adriancolyer ~ 4 Comments

Impala: A modern, open-source SQL engine for Hadoop - Kornacker et al. 2015 (Cloudera*) This is post 4 of 5 in a series looking at the latest research from CIDR'15. Also in the series so far this week: 'The missing piece in complex analytics', 'WANalytics, analytics for a geo-distributed, data intensive world', and 'Liquid: …

WANalytics: Analytics for a geo-distributed, data intensive world

February 3, 2015 ~ July 26, 2017 ~ adriancolyer ~ 2 Comments

WANalytics: analytics for a geo-distributed data intensive world - Vulimiri et al. 2015 ...data is born distributed; we only control data replication and distributed execution strategies. This is true for so many sources of data. Combine this with Dave McCrory's observation that 'Data has Gravity' (i.e. it attracts applications and other data processing workloads to …

Dremel: interactive analysis of web-scale datasets

January 26, 2015 ~ July 26, 2017 ~ adriancolyer ~ 2 Comments

Dremel: interactive analysis of web-scale datasets - Melnik et al. (Google), 2010. Dremel is Google's interactive ad-hoc query system that can run aggregate queries over trillions of rows in seconds. It scales to thousands of CPUs and petabytes of data. It was also the inspiration for Apache Drill. Dremel borrows the idea of serving trees …