IronFleet: Proving Practical Distributed Systems Correct

IronFleet: Proving Practical Distributed Systems Correct - Hawblitzel et al. (Microsoft Research) 2015

Every so often a paper comes along that makes you re-evaluate your world view. I happily would have told you that full formal verification of non-trivial systems (especially distributed systems) in a practical manner (i.e. something you could consider using for real …

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems - Mace et al. 2015

Problems in distributed systems are complex, varied, and unpredictable. By default, the information required to diagnose an issue may not be reported by the system or contained in system logs. Current approaches tie logging and statistics mechanisms into the development path of …

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures - Kasikci et al. 2015

Last week was the 25th ACM Symposium on Operating Systems Principles, and the conference has produced a very interesting-looking set of papers. I'm going to dedicate the next couple of weeks to reviewing some of these, starting …

App-Bisect: Autonomous healing for microservices-based apps

App-Bisect: Autonomous healing for microservices-based apps - Rajagopalan & Jamjoom 2015

We've become comfortable with the idea of continuous deployment across multiple microservices, but what happens when that deployment introduces a problem? The standard answer comes in two parts: (a) use a canary when rolling out a new version to detect a potential problem before …

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems

lprof: A Non-intrusive Request-Flow Profiler for Distributed Systems - Zhao et al. 2014

The Mystery Machine needs a request id in log records that can be used to correlate entries in a trace. What if you don't have that? lprof makes the absolute most of whatever logging your system already has. lprof is novel in …

The Mystery Machine: End-to-end performance analysis of large-scale internet services

The Mystery Machine: End-to-end performance analysis of large-scale internet services - Chow et al. 2014

Google's Dapper paper is very well known, but Facebook's Mystery Machine seems to be much less well known - and that's a shame because I have a hunch the approach could be very relevant to many people. Current debugging and …

Dapper, A Large Scale Distributed Systems Tracing Infrastructure

Dapper, A Large Scale Distributed Systems Tracing Infrastructure - Sigelman et al. (Google) 2010

I'm going to dedicate the rest of this week to a series of papers addressing the important question of "how the hell do I know what is going on in my distributed system / cloud platform / microservices deployment?" As we'll …

Cloud Computing Resource Scheduling and a Survey of its Evolutionary Approaches

Cloud Computing Resource Scheduling and a Survey of its Evolutionary Approaches - Zhan et al. 2015

In both academia and industry, the problem of cloud resource scheduling is seen to be as hard as a Nondeterministic Polynomial (NP) optimization problem, that is, an NP-hard problem, whose intractability increases exponentially with the number of variables if …