Twitter Heron: Stream Processing at Scale

June 15, 2015July 26, 2017 ~ adriancolyer ~ 14 Comments

Twitter Heron: Stream Processing at Scale - Kulkarni et al. 2015 It's hard to imagine something more damaging to Apache Storm than this. Having read it through, I'm left with the impression that the paper might as well have been titled "Why Storm Sucks", which coming from Twitter themselves is quite a statement. There's a … Continue reading Twitter Heron: Stream Processing at Scale

Naiad: A Timely Dataflow System

June 12, 2015July 26, 2017 ~ adriancolyer ~ 17 Comments

Naiad: A Timely Dataflow System - Murray et al. 2013 Many data processing tasks require low-latency interactive access to results, iterative sub-computations, and consistent intermediate outputs so that sub-computations can be nested and composed. (For example, an) application that performs iterative processing on a real-time data stream, and supports interactive queries on a fresh, consistent … Continue reading Naiad: A Timely Dataflow System

The Drinking Philosophers Problem

June 11, 2015July 26, 2017 ~ adriancolyer

The Drinking Philosophers Problem - Chandy & Misra 1984 How could I resist a paper with a title like that! The Drinking Philosophers is referenced in the PowerGraph paper as a solution to the problem of serializable execution in graph-parallel computation. Vertices are philosophers, and edges are forks (drinks). Let's take a look at how … Continue reading The Drinking Philosophers Problem

A higher order estimate of the optimum checkpoint interval for restart dumps

June 10, 2015July 26, 2017 ~ adriancolyer

A higher order estimate of the optimum checkpoint interval for restart dumps - Daly 2004 TL;DR: if you know how long it takes your system to create a checkpoint/snapshot (δ), and you know the expected mean-time between failures (M), then set the checkpoint interval to be √(2δM) - δ. OK, I grant that today's paper … Continue reading A higher order estimate of the optimum checkpoint interval for restart dumps

Detecting Termination of Distributed Computations Using Markers

June 9, 2015July 26, 2017 ~ adriancolyer

Detecting Termination of Distributed Computations Using Markers - Misra 1983 There's an intriguing line in the Distributed GraphLab paper that caught my eye: "Termination is evaluated using distributed consensus algorithm described in [Ref]." Today's choice is the paper by Misra in 1983 that describes this distributed termination detection algorithm. The solution is similar in spirit … Continue reading Detecting Termination of Distributed Computations Using Markers

A Bridging Model for Parallel Computation

June 8, 2015July 26, 2017 ~ adriancolyer ~ 5 Comments

A Bridging Model for Parallel Computation - Valiant 1990 We've seen a lot of references to the 'Bulk Synchronous Parallel' model over the last two weeks. When it was conceived by Valiant in 1990 though, it was intended as a much more general model than simply an abstraction to support graph processing. As the von … Continue reading A Bridging Model for Parallel Computation

Scalability! But at what COST?

June 5, 2015July 26, 2017 ~ adriancolyer ~ 15 Comments

Scalability! But at what COST? - McSherry et al. 2015 With thanks to Felix Cuadrado, @felixcuadrado, for pointing this paper out to me via twitter. Scalability is highly prized, yet it can be a misleading metric when studied in isolation. McSherry et al. study the COST of distributed systems: the Configuration that Outperforms a Single … Continue reading Scalability! But at what COST?

Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs

June 4, 2015July 26, 2017 ~ adriancolyer

Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs - Yan et al. 2014 We've looked at a lot of different Graph-processing systems over the last couple of weeks (onto a new topic next week I promise!), and despite a variety of different implementation and execution models, one thing they all have in common … Continue reading Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs

Pregelix: Big(ger) Graph Analytics on a Dataflow Engine

June 3, 2015July 26, 2017 ~ adriancolyer

Pregelix: Big(ger) Graph Anayltics on a Dataflow Engine - Bu et al. 2015 FlashGraph shows us that it's possible to efficiently process graphs that aren't solely in-memory, and GraphX showed us that we can map graph abstractions on top of a dataflow engine. Put the two ideas together, and you get something that looks like … Continue reading Pregelix: Big(ger) Graph Analytics on a Dataflow Engine

FlashGraph: Processing Billion Node Graphs on an Array of Commodity SSDs

June 2, 2015July 26, 2017 ~ adriancolyer ~ 1 Comment

FlashGraph: Processing Billion Node Graphs on an Array of Commodity SSDs - Zheng et al. The Web Data Commons project is the largest web corpus available to the public. Their hyperlink (page) graph dataset contains 3.4B vertices and 129B edges contained in over 1TB of data, and a graph diameter of 650. To the best … Continue reading FlashGraph: Processing Billion Node Graphs on an Array of Commodity SSDs