Streams à la carte: Extensible pipelines with object algebras

Streams à la carte: Extensible pipelines with object algebras - Biboudis et al. 2015 Streaming APIs are popping up everywhere, allowing the programmer to express streaming computations such as: int sum = IntStream.of(v) .filter(x -> x % 2 == 0) .map(x -> x * x) .sum(); On examining the streaming libraries in Java, Scala, and … Continue reading Streams à la carte: Extensible pipelines with object algebras

Discretized Streams: Fault Tolerant Stream Computing at Scale

Discretized Streams: Fault Tolerant Stream Computing at Scale - Zaharia et al. 2013 This is the Spark Streaming paper, and it sets out very clearly the problem that Discretized Streams were designed to solve: dealing effectively with faults and stragglers when processing streams in large clusters. This is hard to do in the traditional continuous … Continue reading Discretized Streams: Fault Tolerant Stream Computing at Scale

Spinning Fast Iterative Dataflows

Spinning Fast Iterative Dataflows - Ewen et al. 2012 Last week we saw how Naiad combines low-latency stream processing with iterative computation, and yesterday we looked in more detail at the Differential Dataflow model for incremental processing (needed for low-latency). The Apache Flink project also combines low-latency stream processing with support for incremental, iterative computation. … Continue reading Spinning Fast Iterative Dataflows

Naiad: A Timely Dataflow System

Naiad: A Timely Dataflow System - Murray et al. 2013 Many data processing tasks require low-latency interactive access to results, iterative sub-computations, and consistent intermediate outputs so that sub-computations can be nested and composed. (For example, an) application that performs iterative processing on a real-time data stream, and supports interactive queries on a fresh, consistent … Continue reading Naiad: A Timely Dataflow System

GraphX: Graph Processing in a Distributed Dataflow Framework

GraphX: Graph Processing in a Distributed Dataflow Framework - Gonzalez et al. 2014 This is the second of two weeks dedicated to graph processing. So far in this mini-series we've looked at what we know about networks of complex systems and graphs that model the real-world; Google's Pregel which led to a whole set of … Continue reading GraphX: Graph Processing in a Distributed Dataflow Framework

Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services

Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services - Sharma et al. 2015 At Facebook, lots of applications are interested in data being written to Facebook's data stores. Having each of these applications poll the data stores of interest would be untenable, so Facebook built a pub-sub system to identify updates and transmit notifications to … Continue reading Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services

Liquid: Unifying nearline and offline big data integration

Liquid: Unifying Nearline and Offline Big Data Integration - Fernandez et al. 2015 This is post 3 of 5 in a series looking at the latest research from the CIDR '15 conference. Also in the series so far this week: 'The missing piece in complex analytics' and 'WANalytics: analytics for a geo-distributed, data intensive world'. … Continue reading Liquid: Unifying nearline and offline big data integration

Photon: Fault-tolerant and scalable joining of continuous data streams

Photon: Fault-tolerant and scalable joining of continuous data streams - Google 2013 To the best of our knowledge, this is the first paper to formulate and solve the problem of joining multiple streams continuously under these system constraints: exactly-once semantics, fault-tolerance at datacenter-level, high scalability, low latency, unordered streams, and delayed primary stream. It's interesting … Continue reading Photon: Fault-tolerant and scalable joining of continuous data streams