BigDebug: Debugging primitives for interactive big data processing in Spark - Gulzar et al. ICSE 2016 BigDebug provides real-time interactive debugging support for Data-Intensive Scalable Computing (DISC) systems, or more particularly, Apache Spark. It provides breakpoints, watchpoints, latency monitoring, forward and backward tracing, crash monitoring, and a real-time fix-and-resume capability. The overheads are low for …
Tag: Operations
Machine Learning: The High-Interest Credit Card of Technical Debt
Machine Learning: The High-Interest Credit Card of Technical Debt - Sculley et al. 2014 Today's paper offers some pragmatic advice for the developers and maintainers of machine learning systems in production. It's easy to rush out version 1.0, the authors warn us, but making subsequent improvements can be unexpectedly difficult. You very much get the …
The O-Ring Theory of DevOps
The O-Ring Theory of Economic Development - Kremer 1993 Something a little different today, loosely based on the paper cited above, but not a direct review of it. I'm hosting a retrospective evening for the GOTO London conference tonight and plan to share some of these ideas there... The pursuit of excellence is no longer …
Holistic Configuration Management at Facebook
Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015 This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times …
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems - Mace et al. 2015 Problems in distributed systems are complex, varied, and unpredictable. By default, the information required to diagnose an issue may not be reported by the system or contained in system logs. Current approaches tie logging and statistics mechanisms into the development path of …
Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures
Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures - Kasikci et al. 2015 Last week was the 25th ACM Symposium on Operating Systems Principles and the conference has produced a very interesting looking set of papers. I'm going to dedicate the next couple of weeks to reviewing some of these, starting …
App-Bisect: Autonomous healing for microservices-based apps
App-Bisect: Autonomous healing for microservices-based apps - Rajagopalan & Jamjoom 2015 We've become comfortable with the idea of continuous deployment across multiple microservices, but what happens when that deployment introduces a problem? The standard answer comes in two parts: (a) use a canary when rolling out a new version to detect a potential problem before …
lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems
lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems - Zhao et al. 2014 The Mystery Machine needs a request id in log records that can be used to correlate entries in a trace. What if you don't have that? lprof makes the absolute most of whatever logging your system already has. lprof is novel in …
The Mystery Machine: End-to-end performance analysis of large-scale internet services
The Mystery Machine: End-to-end performance analysis of large-scale internet services - Chow et al. 2014 Google's Dapper paper is very well known, but Facebook's Mystery Machine seems to be much less well known - and that's a shame because I have a hunch the approach could be very relevant to many people. Current debugging and …
Dapper, A Large Scale Distributed Systems Tracing Infrastructure
Dapper, A Large Scale Distributed Systems Tracing Infrastructure - Sigelman et al. (Google) 2010 I'm going to dedicate the rest of this week to a series of papers addressing the important question of "how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?" As we'll …