Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics

March 31, 2016 ~ Adrian Colyer ~ 2 Comments

Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics - Venkataraman et al. 2016 With cloud computing environments such as Amazon EC2, users typically have a large number of choices in terms of the instance types and number of instances they can run their jobs on. Not surprisingly, the amount of memory per core, storage media, ... Continue Reading

MacroBase: Analytic Monitoring for the Internet of Things

March 16, 2016 ~ Adrian Colyer ~ 5 Comments

MacroBase: Analytic Monitoring for the Internet of Things - Bailis et al. 2016 It looks like Peter Alvaro is not the only one to be doing some industrial collaboration recently! MacroBase is the result of Peter Bailis' collaboration with Cambridge Mobile Telematics (CMT), an IoT company. The topic at hand is analytic monitoring - detecting ... Continue Reading

Trajectory Data Mining: An Overview

March 7, 2016 ~ Adrian Colyer ~ 2 Comments

Trajectory Data Mining: An Overview - Zheng 2015 In 'Trajectory Data Mining,' Zheng conducts a high-level tour of the techniques involved in working with trajectory data. This is the data created by a moving object, as a sequence of locations, often with uncertainty around the exact location at each point. This could be GPS trajectories ... Continue Reading

Arabesque: A System for Distributed Graph Mining

January 26, 2016 ~ Adrian Colyer ~ 3 Comments

Arabesque: A System For Distributed Graph Mining - Teixeira et al. 2015 We've studied graph computation systems before in The Morning Paper: systems such as Pregel, Giraph and GraphLab that provide vertex-centric programming models ('think like a vertex') on top of a Bulk Synchronous Parallel compute model. We've also seen some of the limitations of ... Continue Reading

Pregelix: Big(ger) Graph Analytics on a Dataflow Engine

June 3, 2015 ~ Adrian Colyer ~ Leave a comment

Pregelix: Big(ger) Graph Analytics on a Dataflow Engine - Bu et al. 2015 FlashGraph shows us that it's possible to efficiently process graphs that aren't solely in-memory, and GraphX showed us that we can map graph abstractions on top of a dataflow engine. Put the two ideas together, and you get something that looks like ... Continue Reading

Making Sense of Performance in Data Analytics Frameworks

April 20, 2015 ~ Adrian Colyer ~ 7 Comments

Making Sense of Performance in Data Analytics Frameworks - Ousterhout et al. 2015 We all know the causes of poor performance in big data analytics workloads: network I/O, disk I/O, and straggler tasks. Ousterhout et al. set out to try and quantify this, and found that what we think we know isn't necessarily so. Yet ... Continue Reading

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

April 16, 2015 ~ Adrian Colyer ~ 5 Comments

ApproxHadoop: Bringing Approximations to MapReduce Frameworks - Goiri et al. 2015 Yesterday we saw how including networking concerns in scheduling decisions can increase throughput for MapReduce jobs (and Storm topologies) by ~30%. Today we look at an even more effective strategy for getting the most out of your Hadoop cluster: doing less work! On one ... Continue Reading

Impala: a modern, open-source SQL engine for Hadoop

February 5, 2015 ~ Adrian Colyer ~ 4 Comments

Impala: A modern, open-source SQL engine for Hadoop - Kornacker et al. 2015 (Cloudera*) This is post 4 of 5 in a series looking at the latest research from CIDR'15. Also in the series so far this week: 'The missing piece in complex analytics', 'WANalytics, analytics for a geo-distributed, data intensive world', and 'Liquid: ... Continue Reading

WANalytics: Analytics for a geo-distributed, data intensive world

February 3, 2015 ~ Adrian Colyer ~ 2 Comments

WANalytics: analytics for a geo-distributed data intensive world - Vulimiri et al. 2015 ...data is born distributed; we only control data replication and distributed execution strategies. This is true for so many sources of data. Combine this with Dave McCrory's observation that 'Data has Gravity' (i.e. it attracts applications and other data processing workloads to ... Continue Reading

The Missing Piece in Complex Analytics

February 2, 2015 ~ Adrian Colyer ~ 2 Comments

The Missing Piece in Complex Analytics: Low latency scalable model management and serving with Velox - Crankshaw et al. 2015. Analytics at scale can be used to create statistical models for making predictions about the world, but once the data scientists and analysts have done their initial work and a model has been built and ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Analytics