Machine Learning: The High-Interest Credit Card of Technical Debt

Machine Learning: The High-Interest Credit Card of Technical Debt - Sculley et al. 2014 Today's paper offers some pragmatic advice for the developers and maintainers of machine learning systems in production. It's easy to rush out version 1.0, the authors warn us, but making subsequent improvements can be unexpectedly difficult. You very much get the …

Dapper, A Large Scale Distributed Systems Tracing Infrastructure

Dapper, A Large Scale Distributed Systems Tracing Infrastructure - Sigelman et al. (Google) 2010 I'm going to dedicate the rest of this week to a series of papers addressing the important question of "how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?" As we'll …

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network - Singh et al. (Google) 2015 Let's end the week with something completely different: a look at ten years and five generations of networking within Google's datacenters. Bandwidth demands within the datacenter are doubling every 12-15 months, even faster than the …

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

MillWheel: Fault-Tolerant Stream Processing at Internet Scale - Akidau et al. (Google) 2013 Earlier this week we looked at the Google Cloud Dataflow model, which is implemented on top of FlumeJava (for batch) and MillWheel (for streaming): We have implemented this model internally in FlumeJava, with MillWheel used as the underlying execution engine for streaming …

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - Akidau et al. (Google) 2015 With thanks to William Vambenepe for suggesting this paper via Twitter. Google Cloud Dataflow reached GA last week, and the team behind Cloud Dataflow have a paper accepted at VLDB'15 …

Heracles: Improving Resource Efficiency at Scale

Heracles: Improving Resource Efficiency at Scale - Lo et al. 2015 Until recently, scaling from Moore’s law provided higher compute per dollar with every server generation, allowing datacenters to scale without raising the cost. However, with several imminent challenges in technology scaling, alternate approaches are needed. Those approaches involve increasing server utilization, which is still …

Pregel: A System for Large-Scale Graph Processing

Pregel: A System for Large-Scale Graph Processing - Malewicz et al. (Google) 2010 "Many practical computing problems concern large graphs." Yesterday we looked at some of the models for understanding networks and graphs. Today's paper focuses on the processing of graphs, especially the efficient processing of large graphs, where "large" can mean billions of vertices and …