Streams à la carte: Extensible pipelines with object algebras - Biboudis et al. 2015 Streaming APIs are popping up everywhere, allowing the programmer to express streaming computations such as: int sum = IntStream.of(v).filter(x -> x % 2 == 0).map(x -> x * x).sum(); On examining the streaming libraries in Java, Scala, and …
Tag: Stream processing
Stream processing systems and algorithms.
Discretized Streams: Fault Tolerant Stream Computing at Scale
Discretized Streams: Fault Tolerant Stream Computing at Scale - Zaharia et al. 2013 This is the Spark Streaming paper, and it sets out very clearly the problem that Discretized Streams were designed to solve: dealing effectively with faults and stragglers when processing streams in large clusters. This is hard to do in the traditional continuous …
Spinning Fast Iterative Dataflows
Spinning Fast Iterative Dataflows - Ewen et al. 2012 Last week we saw how Naiad combines low-latency stream processing with iterative computation, and yesterday we looked in more detail at the Differential Dataflow model for incremental processing (needed for low-latency). The Apache Flink project also combines low-latency stream processing with support for incremental, iterative computation. …
Differential Dataflow
Differential Dataflow - McSherry et al. 2013 The ability to perform complex analyses on [datasets that are constantly being updated] is very valuable; for example, each tweet published on the Twitter social network may supply new information about the community structure of the service’s users, which could be immediately exploited for real-time recommendation services or …
Twitter Heron: Stream Processing at Scale
Twitter Heron: Stream Processing at Scale - Kulkarni et al. 2015 It's hard to imagine something more damaging to Apache Storm than this. Having read it through, I'm left with the impression that the paper might as well have been titled "Why Storm Sucks", which coming from Twitter themselves is quite a statement. There's a …
Naiad: A Timely Dataflow System
Naiad: A Timely Dataflow System - Murray et al. 2013 Many data processing tasks require low-latency interactive access to results, iterative sub-computations, and consistent intermediate outputs so that sub-computations can be nested and composed. (For example, an) application that performs iterative processing on a real-time data stream, and supports interactive queries on a fresh, consistent …
GraphX: Graph Processing in a Distributed Dataflow Framework
GraphX: Graph Processing in a Distributed Dataflow Framework - Gonzalez et al. 2014 This is the second of two weeks dedicated to graph processing. So far in this mini-series we've looked at what we know about networks of complex systems and graphs that model the real world; Google's Pregel, which led to a whole set of …
Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services
Wormhole: Reliable pub-sub to support Geo-Replicated Internet Services - Sharma et al. 2015 At Facebook, lots of applications are interested in data being written to Facebook's data stores. Having each of these applications poll the data stores of interest would be untenable, so Facebook built a pub-sub system to identify updates and transmit notifications to …
Liquid: Unifying nearline and offline big data integration
Liquid: Unifying Nearline and Offline Big Data Integration - Fernandez et al. 2015 This is post 3 of 5 in a series looking at the latest research from the CIDR '15 conference. Also in the series so far this week: 'The missing piece in complex analytics' and 'WANalytics: analytics for a geo-distributed, data intensive world'. …
Photon: Fault-tolerant and scalable joining of continuous data streams
Photon: Fault-tolerant and scalable joining of continuous data streams - Google 2013 To the best of our knowledge, this is the first paper to formulate and solve the problem of joining multiple streams continuously under these system constraints: exactly-once semantics, fault-tolerance at datacenter-level, high scalability, low latency, unordered streams, and delayed primary stream. It's interesting …