BigDebug: Debugging primitives for interactive big data processing in Spark

June 7, 2016 ~ Adrian Colyer ~ 2 Comments

BigDebug: Debugging primitives for interactive big data processing in Spark - Gulzar et al. ICSE 2016 BigDebug provides real-time interactive debugging support for Data-Intensive Scalable Computing (DISC) systems, or more particularly, Apache Spark. It provides breakpoints, watchpoints, latency monitoring, forward and backward tracing, crash monitoring, and a real-time fix-and-resume capability. The overheads are low for ... Continue Reading

Machine Learning: The High-Interest Credit Card of Technical Debt

February 29, 2016 ~ Adrian Colyer ~ 9 Comments

Machine Learning: The High-Interest Credit Card of Technical Debt - Sculley et al. 2014 Today's paper offers some pragmatic advice for the developers and maintainers of machine learning systems in production. It's easy to rush out version 1.0, the authors warn us, but making subsequent improvements can be unexpectedly difficult. You very much get the ... Continue Reading

The O-Ring Theory of DevOps

November 11, 2015 ~ Adrian Colyer ~ 12 Comments

The O-Ring Theory of Economic Development - Kremer 1993 Something a little different today, loosely based on the paper cited above, but not a direct review of it. I'm hosting a retrospective evening for the GOTO London conference tonight and plan to share some of these ideas there... The pursuit of excellence is no longer ... Continue Reading

Holistic Configuration Management at Facebook

October 16, 2015 ~ Adrian Colyer ~ 14 Comments

Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015 This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times ... Continue Reading

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

October 13, 2015 ~ Adrian Colyer ~ 4 Comments

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems - Mace et al. 2015 Problems in distributed systems are complex, varied, and unpredictable. By default, the information required to diagnose an issue may not be reported by the system or contained in system logs. Current approaches tie logging and statistics mechanisms into the development path of ... Continue Reading

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures

October 12, 2015 ~ Adrian Colyer ~ 2 Comments

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures - Kasikci et al. 2015 Last week was the 25th ACM Symposium on Operating Systems Principles and the conference has produced a very interesting looking set of papers. I'm going to dedicate the next couple of weeks to reviewing some of these, starting ... Continue Reading

App-Bisect: Autonomous healing for microservices-based apps

October 9, 2015 ~ Adrian Colyer ~ Leave a comment

App-Bisect: Autonomous healing for microservices-based apps - Rajagopalan & Jamjoom 2015 We've become comfortable with the idea of continuous deployment across multiple microservices, but what happens when that deployment introduces a problem? The standard answer comes in two parts: (a) use a canary when rolling out a new version to detect a potential problem before ... Continue Reading

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems

October 8, 2015 ~ Adrian Colyer ~ 6 Comments

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems - Zhao et al. 2014 The Mystery Machine needs a request id in log records that can be used to correlate entries in a trace. What if you don't have that? lprof makes the absolute most of whatever logging your system already has. lprof is novel in ... Continue Reading

The Mystery Machine: End-to-end performance analysis of large-scale internet services

October 7, 2015 ~ Adrian Colyer ~ 11 Comments

The Mystery Machine: End-to-end performance analysis of large-scale internet services - Chow et al. 2014 Google's Dapper paper is very well known, but Facebook's Mystery Machine seems to be much less well known - and that's a shame because I have a hunch the approach could be very relevant to many people. Current debugging and ... Continue Reading

Dapper, A Large Scale Distributed Systems Tracing Infrastructure

October 6, 2015 ~ Adrian Colyer ~ 7 Comments

Dapper, A Large Scale Distributed Systems Tracing Infrastructure - Sigelman et al. (Google) 2010 I'm going to dedicate the rest of this week to a series of papers addressing the important question of "how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?" As we'll ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Operations