Introducing CIDR’15 week on The Morning Paper

The data systems research community are a smart bunch, although it’s not their research and papers I’m referring to here. Many conferences move around, but not the Conference on Innovative Data Systems Research (CIDR). CIDR has found a rather nice venue “on the Pacific Ocean, just south of Monterey”, and decided to stick there. Schedule the conference right at the beginning of January, and you have what I can only imagine is a very nice way to start the new year. To avoid making this too obvious, the conference is only held every other year. We’re not fooled! :)

For 2015 I have a few new things I want to try with The Morning Paper. One of those is to highlight the latest research from the 2015 editions of selected academic conferences of interest – dedicating a week to picking my favourite papers from the year’s crop. Suggestions for conferences you’d like to see me cover are always very welcome. I like to keep a mix of foundations and frontiers in the paper selections, and this seems like a fun way of bringing you some of the latest thinking from different communities.

CIDR ’15 produced a great set of papers, and whittling it down to 5 was hard. Normally I like to keep you guessing about what paper you’re going to find in your inbox each morning (email subscription is the best way to make sure you never miss an issue!), but for these themed weeks I’m going to break with tradition and give you an overview of my choices up front – write-ups will then follow through the week. In no particular order:

The Missing Piece in Complex Analytics

The Missing Piece in Complex Analytics: Low latency scalable model management and serving with Velox – Crankshaw et al. (AMP Lab) ’15. A paper from the AMP lab is always worthy of attention. This one is in an area that was dear to my heart during my time at Pivotal (and still is): how do you connect models generated by offline analysis with runtime systems that need to use and update them?

WANalytics: Analytics for a geo-distributed data-intensive world

WANalytics: Analytics for a geo-distributed data-intensive world – Vulimiri et al. ’15. “We didn’t distribute our data, it was born distributed.” Now how do you make that efficient while running complex data analysis pipelines with a desire to minimise cross-datacenter traffic and honour regulatory constraints? Some impressive early results and a promising direction…

Liquid: Unifying nearline and offline big data integration

Liquid: Unifying nearline and offline big data integration – Fernandez et al. ’15. A look at a key part of LinkedIn’s data processing architecture. Nearline systems aren’t online, but do still require low latency response times and high data throughput. Are distributed file systems the right foundation for this? And if not, what is?

Impala: A modern open-source SQL engine for Hadoop

Impala: A modern open-source SQL engine for Hadoop – Cloudera ’15. A peek inside the design of Cloudera’s Impala system, which provides low latency and high concurrency for BI/analytic query workloads.

Specialized Evolution of the General Purpose CPU

Specialized Evolution of the General Purpose CPU – Rajwar et al. (Intel) ’15. Not quite sure how this snuck into an innovative data systems research conference, but it’s a great tour of developments in CPUs, with some graphs we can extrapolate and a look at what the future might hold.

One for you to read: Immutability Changes Everything

Immutability changes everything – Helland ’15. I had to make the very hard decision not to cover Pat Helland’s paper as part of this week’s selections. This is a great tour of the consequences of (and opportunities created by) immutability up and down the stack, and it is full of Helland’s pithy one-liners (including the title!). It’s been widely cited over the past couple of weeks already though, so perhaps I’ll save it for a future edition.