Understanding, detecting and localizing partial failures in large system software

Understanding, detecting and localizing partial failures in large system software, Lou et al., NSDI'20

Partial failures (gray failures) occur when some but not all of the functionalities of a system are broken. On the surface everything can appear to be fine, but under the covers things may be going astray. When a partial failure occurs, …

When correlation (or lack of it) can be causation

Rex: preventing bugs and misconfiguration in large services using correlated change analysis, Mehta et al., NSDI'20, and Check before you change: preventing correlated failures in service updates, Zhai et al., NSDI'20

Today's post is a double header. I've chosen two papers from NSDI'20 that are both about correlation. Rex is a tool widely deployed across …

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure, Li et al., NSDI'20

Modern software systems at scale are incredibly complex, ever-changing environments. Despite all the pre-deployment testing you might employ, this makes it really tough to change them with confidence. Thus it's common to use some form of phased rollout, …

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015

This is part 2 of our look at Allspaw's 2015 master's thesis (here's part 1). Today we'll be digging into the analysis of an incident that took place at Etsy on December 4th, 2014. 1:00pm Eastern Standard …

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part 1)

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015

Following on from the STELLA report, today we're going back to the first major work to study the human and organisational side of incident management in business-critical Internet services: John Allspaw's 2015 Master's thesis. The document runs …