Holistic Configuration Management at Facebook

October 16, 2015 ~ July 27, 2017 ~ adriancolyer ~ 14 Comments

Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015 This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times … Continue reading Holistic Configuration Management at Facebook

IronFleet: Proving Practical Distributed Systems Correct

October 15, 2015 ~ July 27, 2017 ~ adriancolyer ~ 10 Comments

IronFleet: Proving Practical Distributed Systems Correct - Hawblitzel et al. (Microsoft Research) 2015 Every so often a paper comes along that makes you re-evaluate your world view. I happily would have told you that full formal verification of non-trivial systems (especially distributed systems) in a practical manner (i.e. something you could consider using for real … Continue reading IronFleet: Proving Practical Distributed Systems Correct

Coz: Finding code that counts with causal profiling

October 14, 2015 ~ July 27, 2017 ~ adriancolyer ~ 1 Comment

Coz: Finding code that counts with causal profiling - Curtsinger & Berger 2015 update: fixed typo in paper title Sticking to the theme of 'understanding what our systems are doing,' but focusing on a single process, Coz is a causal profiler. In essence, it makes the output of a profiler much more useful to you … Continue reading Coz: Finding code that counts with causal profiling

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

October 13, 2015 ~ July 27, 2017 ~ adriancolyer ~ 4 Comments

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems - Mace et al. 2015 Problems in distributed systems are complex, varied, and unpredictable. By default, the information required to diagnose an issue may not be reported by the system or contained in system logs. Current approaches tie logging and statistics mechanisms into the development path of … Continue reading Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures

October 12, 2015 ~ July 27, 2017 ~ adriancolyer ~ 2 Comments

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures - Kasikci et al. 2015 Last week was the 25th ACM Symposium on Operating Systems Principles and the conference has produced a very interesting looking set of papers. I'm going to dedicate the next couple of weeks to reviewing some of these, starting … Continue reading Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures

App-Bisect: Autonomous healing for microservices-based apps

October 9, 2015 ~ July 27, 2017 ~ adriancolyer

App-Bisect: Autonomous healing for microservices-based apps - Rajagopalan & Jamjoom 2015 We've become comfortable with the idea of continuous deployment across multiple microservices, but what happens when that deployment introduces a problem? The standard answer comes in two parts: (a) use a canary when rolling out a new version to detect a potential problem before … Continue reading App-Bisect: Autonomous healing for microservices-based apps

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems

October 8, 2015 ~ July 27, 2017 ~ adriancolyer ~ 6 Comments

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems - Zhao et al. 2014 The Mystery Machine needs a request id in log records that can be used to correlate entries in a trace. What if you don't have that? lprof makes the absolute most of whatever logging your system already has. lprof is novel in … Continue reading lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems

The Mystery Machine: End-to-end performance analysis of large-scale internet services

October 7, 2015 ~ July 27, 2017 ~ adriancolyer ~ 11 Comments

The Mystery Machine: End-to-end performance analysis of large-scale internet services - Chow et al. 2014 Google's Dapper paper is very well known, but Facebook's Mystery Machine seems to be much less well known - and that's a shame because I have a hunch the approach could be very relevant to many people. Current debugging and … Continue reading The Mystery Machine: End-to-end performance analysis of large-scale internet services

Dapper, A Large Scale Distributed Systems Tracing Infrastructure

October 6, 2015 ~ July 27, 2017 ~ adriancolyer ~ 7 Comments

Dapper, A Large Scale Distributed Systems Tracing Infrastructure - Sigelman et al. (Google) 2010 I'm going to dedicate the rest of this week to a series of papers addressing the important question of "how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?" As we'll … Continue reading Dapper, A Large Scale Distributed Systems Tracing Infrastructure

IPFS – Content Addressed, Versioned, P2P File System

October 5, 2015 ~ July 27, 2017 ~ adriancolyer ~ 3 Comments

IPFS - Content Addressed, Versioned, P2P File System - Benet 2014 This paper has sat on my reading list for almost a year! I first heard about it in Joe Armstrong's 2014 talk at CodeMesh "Connecting things together is really difficult but it could and should be rather easy". CodeMesh 2015 is just around the … Continue reading IPFS – Content Addressed, Versioned, P2P File System