Range thresholding on streams Qiao et al. SIGMOD 2016 It’s another streaming paper today, also looking at how to efficiently handle a large volume of concurrent queries over a stream, and also claiming a significant performance breakthrough of several orders of magnitude. We’re looking at a different type of query though, known as a range …
Tag: Stream processing
Stream processing systems and algorithms.
Sharing-aware outlier analytics over high-volume data streams
Sharing-aware outlier analytics over high-volume data streams Cao et al. SIGMOD 2016 With yesterday’s preliminaries on skyline queries out of the way, it’s time to turn our attention to the Sharing-aware Outlier Processing (SOP) algorithm of Cao et al. The challenge that SOP addresses is that of building a stream-based outlier detection system that can …
Realtime data processing at Facebook
Realtime Data Processing at Facebook Chen et al. SIGMOD 2016 ‘Realtime Data Processing at Facebook’ provides us with a great high-level overview of the systems Facebook have built to support real-time workloads. At the heart of the paper is a set of five key design decisions for building such systems, together with an explanation of …
StreamScope: Continuous reliable distributed processing of big data streams
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams - Lin et al. NSDI '16 An emerging trend in big data processing is to extract timely insights from continuous big data streams with distributed computation running on a large cluster of machines. Examples of such data streams include those from sensors, mobile devices, and on-line …
MacroBase: Analytic Monitoring for the Internet of Things
MacroBase: Analytic Monitoring for the Internet of Things - Bailis et al. 2016 It looks like Peter Alvaro is not the only one to be doing some industrial collaboration recently! MacroBase is the result of Peter Bailis' collaboration with Cambridge Mobile Telematics (CMT), an IoT company. The topic at hand is analytic monitoring - detecting …
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Asynchronous Complex Analytics in a Distributed Dataflow Architecture - Gonzalez et al. 2015 Here's a theme we've seen before: the programming model offered by large scale distributed systems doesn't always lend itself to efficient algorithms for solving certain classes of problems. In today's paper, Gonzalez et al. examine the growing gap between efficient machine learning …
Mining High-Speed Data Streams
Mining High-Speed Data Streams - Domingos & Hulten 2000 This paper won a 'test of time' award at KDD'15 as an 'outstanding paper from a past KDD Conference beyond the last decade that has had an important impact on the data mining community.' Here's what the test-of-time committee have to say about it: This paper …
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
MillWheel: Fault-Tolerant Stream Processing at Internet Scale - Akidau et al. (Google) 2013 Earlier this week we looked at the Google Cloud Dataflow model which is implemented on top of FlumeJava (for batch) and MillWheel (for streaming): We have implemented this model internally in FlumeJava, with MillWheel used as the underlying execution engine for streaming …
Asynchronous Distributed Snapshots for Distributed Dataflows
Asynchronous Distributed Snapshots for Distributed Dataflows - Carbone et al. 2015 The team behind Apache Flink and data Artisans are a smart group of folks. Their recent blog post on High-throughput, low-latency, and exactly-once stream processing with Apache Flink is well worth reading and has a good description of the evolution of streaming architectures, the …
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - Akidau et al. (Google) - 2015 With thanks to William Vambenepe for suggesting this paper via twitter. Google Cloud Dataflow reached GA last week, and the team behind Cloud Dataflow have a paper accepted at VLDB'15 …