Probabilistically Bounded Staleness for Practical Partial Quorums

Probabilistically Bounded Staleness for Practical Partial Quorums - Bailis et al. 2012, and Quantifying Eventual Consistency with PBS - Bailis et al. 2014 'Probabilistically Bounded Staleness... ' was the original VLDB '12 paper, and then the authors were invited to submit an extended version to the VLDB Journal ('Quantifying Eventual Consistency...') which was published in … Continue reading Probabilistically Bounded Staleness for Practical Partial Quorums

Optimizing Search Engines using Clickthrough Data

Optimizing Search Engines using Clickthrough Data - Joachims, 2002 Today's choice is another KDD 'test-of-time' winner. The paper introduced the problem of ranking documents w.r.t. a query using not explicit user feedback but implicit user feedback in the form of clickthrough data. The author presented the Ranking SVM Algorithm to solve the proposed ranking problem. … Continue reading Optimizing Search Engines using Clickthrough Data

Efficient Algorithms for Public-Private Social Networks

Efficient Algorithms for Public-Private Social Networks - Chierichetti et al. 2015 Today's choice won a best paper award at KDD'15. The authors examine a number of algorithms for computing graph (network) measures in the context of social networks that enable private groups and connections. These are characterised by a large public graph G=(V,E), and for … Continue reading Efficient Algorithms for Public-Private Social Networks

A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes

A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes - Lakkaraju et al. 2015 This is the first of a series of papers from the Knowledge Discovery and Data Mining (KDD'15) conference that we'll look at this week. Today's paper is all about helping high school students in the US who … Continue reading A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

MillWheel: Fault-Tolerant Stream Processing at Internet Scale - Akidau et al. (Google) 2013 Earlier this week we looked at the Google Cloud Dataflow model which is implemented on top of FlumeJava (for batch) and MillWheel (for streaming): We have implemented this model internally in FlumeJava, with MillWheel used as the underlying execution engine for streaming … Continue reading MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Asynchronous Distributed Snapshots for Distributed Dataflows

Asynchronous Distributed Snapshots for Distributed Dataflows - Carbone et al. 2015 The team behind Apache Flink and data Artisans are a smart group of folks. Their recent blog post on High-throughput, low-latency, and exactly-once stream processing with Apache Flink is well worth reading and has a good description of the evolution of streaming architectures, the … Continue reading Asynchronous Distributed Snapshots for Distributed Dataflows

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - Akidau et al. (Google) - 2015 With thanks to William Vambenepe for suggesting this paper via twitter. Google Cloud Dataflow reached GA last week, and the team behind Cloud Dataflow have a paper accepted at VLDB'15 … Continue reading The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing