Tag: Operations

Understanding, detecting and localizing partial failures in large system software
Understanding, detecting and localizing partial failures in large system software, Lou et al., NSDI'20 Partial failures (gray failures) occur when some but not all of the functionalities of a system are broken. On the surface everything can appear to be fine, but under the covers things may be going astray. When a partial failure occurs, … Continue reading Understanding, detecting and localizing partial failures in large system software
When correlation (or lack of it) can be causation
Rex: preventing bugs and misconfiguration in large services using correlated change analysis, Mehta et al., NSDI'20 and Check before you change: preventing correlated failures in service updates, Zhai et al., NSDI'20 Today's post is a double header. I've chosen two papers from NSDI'20 that are both about correlation. Rex is a tool widely deployed across … Continue reading When correlation (or lack of it) can be causation
Millions of tiny databases
Millions of tiny databases, Brooker et al., NSDI'20 This paper is a real joy to read. It takes you through the thinking processes and engineering practices behind the design of a key part of the control plane for AWS Elastic Block Storage (EBS): the Physalia database that stores configuration information. In the same spirit as … Continue reading Millions of tiny databases
Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure
Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure, Li et al., NSDI'20 Modern software systems at scale are incredibly complex, ever-changing environments. Despite all the pre-deployment testing you might employ, it's really tough to change them with confidence. Thus it's common to use some form of phased rollout, … Continue reading Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure
Meaningful availability
Meaningful availability, Hauer et al., NSDI'20 With thanks to Damien Mathieu for the recommendation. This very clearly written paper describes the Google G Suite team's search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements. … Continue reading Meaningful availability
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015 This is part 2 of our look at Allspaw's 2015 Master's thesis (here's part 1). Today we'll be digging into the analysis of an incident that took place at Etsy on December 4th, 2014. 1:00pm Eastern Standard … Continue reading Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part 1)
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015 Following on from the STELLA report, today we're going back to the first major work to study the human and organisational side of incident management in business-critical Internet services: John Allspaw's 2015 Master's thesis. The document runs … Continue reading Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part 1)
STELLA: report from the SNAFU-catchers workshop on coping with complexity
STELLA: report from the SNAFU-catchers workshop on coping with complexity, Woods 2017, Coping with Complexity workshop "Coping with complexity" is about as good a three-word summary of the systems and software challenges facing us over the next decade as I can imagine. Today's choice is a report from a 2017 workshop convened with that title, … Continue reading STELLA: report from the SNAFU-catchers workshop on coping with complexity
Automating chaos experiments in production
Automating chaos experiments in production, Basiri et al., ICSE 2019 Are you ready to take your system assurance programme to the next level? This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray … Continue reading Automating chaos experiments in production
Nines are not enough: meaningful metrics for clouds
Nines are not enough: meaningful metrics for clouds, Mogul & Wilkes, HotOS'19 It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google! John Wilkes was also one … Continue reading Nines are not enough: meaningful metrics for clouds