Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure, Li et al., NSDI'20 Modern software systems at scale are incredibly complex, ever-changing environments. Despite all the pre-deployment testing you might employ, this makes it really tough to change them with confidence. Thus it's common to use some form of phased rollout, ... Continue Reading

Meaningful availability

Meaningful availability, Hauer et al., NSDI'20 With thanks to Damien Mathieu for the recommendation. This very clearly written paper describes the Google G Suite team's search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements. A good ... Continue Reading

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015 This is part 2 of our look at Allspaw's 2015 master's thesis (here's part 1). Today we'll be digging into the analysis of an incident that took place at Etsy on December 4th, 2014. 1:00pm Eastern Standard ... Continue Reading

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part 1)

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015 Following on from the STELLA report, today we're going back to the first major work to study the human and organisational side of incident management in business-critical Internet services: John Allspaw's 2015 Master's thesis. The document runs ... Continue Reading

Automating chaos experiments in production

Automating chaos experiments in production, Basiri et al., ICSE 2019 Are you ready to take your system assurance programme to the next level? This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray ... Continue Reading