Streams à la carte: Extensible pipelines with object algebras

August 13, 2015 ~ Adrian Colyer

Streams à la carte: Extensible pipelines with object algebras - Biboudis et al. 2015

Streaming APIs are popping up everywhere, allowing the programmer to express streaming computations such as:

    int sum = IntStream.of(v)
        .filter(x -> x % 2 == 0)
        .map(x -> x * x)
        .sum();

On examining the streaming libraries in Java, Scala, and ...

Discretized Streams: Fault Tolerant Stream Computing at Scale

June 19, 2015 ~ Adrian Colyer

Discretized Streams: Fault Tolerant Stream Computing at Scale - Zaharia et al. 2013

This is the Spark Streaming paper, and it sets out very clearly the problem that Discretized Streams were designed to solve: dealing effectively with faults and stragglers when processing streams in large clusters. This is hard to do in the traditional continuous ...

Spinning Fast Iterative Dataflows

June 18, 2015 ~ Adrian Colyer

Spinning Fast Iterative Dataflows - Ewen et al. 2012

Last week we saw how Naiad combines low-latency stream processing with iterative computation, and yesterday we looked in more detail at the Differential Dataflow model for incremental processing (needed for low-latency). The Apache Flink project also combines low-latency stream processing with support for incremental, iterative computation. ...

Differential Dataflow

June 17, 2015 ~ Adrian Colyer

Differential Dataflow - McSherry et al. 2013

The ability to perform complex analyses on [datasets that are constantly being updated] is very valuable; for example, each tweet published on the Twitter social network may supply new information about the community structure of the service’s users, which could be immediately exploited for real-time recommendation services or ...

Twitter Heron: Stream Processing at Scale

June 15, 2015 ~ Adrian Colyer

Twitter Heron: Stream Processing at Scale - Kulkarni et al. 2015

It's hard to imagine something more damaging to Apache Storm than this. Having read it through, I'm left with the impression that the paper might as well have been titled "Why Storm Sucks", which coming from Twitter themselves is quite a statement. There's a ...

Naiad: A Timely Dataflow System

June 12, 2015 ~ Adrian Colyer

Naiad: A Timely Dataflow System - Murray et al. 2013

Many data processing tasks require low-latency interactive access to results, iterative sub-computations, and consistent intermediate outputs so that sub-computations can be nested and composed. (For example, an) application that performs iterative processing on a real-time data stream, and supports interactive queries on a fresh, consistent ...

GraphX: Graph Processing in a Distributed Dataflow Framework

June 1, 2015 ~ Adrian Colyer

GraphX: Graph Processing in a Distributed Dataflow Framework - Gonzalez et al. 2014

This is the second of two weeks dedicated to graph processing. So far in this mini-series we've looked at what we know about networks of complex systems and graphs that model the real world; Google's Pregel which led to a whole set of ...

Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services

May 14, 2015 ~ Adrian Colyer

Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services - Sharma et al. 2015

At Facebook, lots of applications are interested in data being written to Facebook's data stores. Having each of these applications poll the data stores of interest would be untenable, so Facebook built a pub-sub system to identify updates and transmit notifications to ...

Liquid: Unifying nearline and offline big data integration

February 4, 2015 ~ Adrian Colyer

Liquid: Unifying Nearline and Offline Big Data Integration - Fernandez et al. 2015

This is post 3 of 5 in a series looking at the latest research from the CIDR '15 conference. Also in the series so far this week: 'The missing piece in complex analytics' and 'WANalytics: analytics for a geo-distributed, data intensive world'. ...

Photon: Fault-tolerant and scalable joining of continuous data streams

December 4, 2014 ~ Adrian Colyer

Photon: Fault-tolerant and scalable joining of continuous data streams - Google 2013

To the best of our knowledge, this is the first paper to formulate and solve the problem of joining multiple streams continuously under these system constraints: exactly-once semantics, fault-tolerance at datacenter-level, high scalability, low latency, unordered streams, and delayed primary stream. It's interesting ...

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Stream processing