Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

November 28, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 4 Comments

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services Veeraraghavan et al. (Facebook) OSDI 2016 How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live ... Continue Reading

Simple testing can prevent most critical failures

October 6, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 18 Comments

Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems Yuan et al. OSDI 2014 After yesterday's paper I needed something a little easier to digest today, and 'Simple testing can prevent most critical failures' certainly hit the spot. Thanks to Caitie McCaffrey from whom I first heard about ... Continue Reading

On the “naturalness” of buggy code

June 8, 2016 ~ Adrian Colyer ~ Leave a comment

On the 'naturalness' of buggy code - Ray, Hellendoorn, et al. ICSE 2016 Last week we looked at a simpler approach to building static code checkers that by understanding less about the overall code structure and just focusing in on the things that really mattered was able to produce competitive results from very small checker ... Continue Reading

Why do record/replay tests of web applications break?

May 30, 2016 ~ Adrian Colyer ~ 7 Comments

Why do Record/Replay Tests of Web Applications Break? - Hammoudi et al. ICST '16 Your web application regression tests created using record/replay tools are fragile and keep breaking. Hammoudi et al. set out to find out why. If we knew that, perhaps we could design mechanisms to automatically repair broken tests, or to build more ... Continue Reading

Uncovering bugs in Distributed Storage Systems during Testing (not in production!)

May 5, 2016 ~ Adrian Colyer ~ 4 Comments

Uncovering bugs in Distributed Storage Systems during Testing (not in production!) - Deligiannis et al. 2016 We interviewed technical leaders and senior managers in Microsoft Azure regarding the top problems in distributed system development. The consensus was that one of the most critical problems today is how to improve testing coverage so that bugs can ... Continue Reading

Reducing Crash Recoverability to Reachability

February 4, 2016 ~ Adrian Colyer ~ 1 Comment

Reducing Crash Recoverability to Reachability - Koskinen & Yang 2016. Techniques such as shadow paging and write-ahead logging can help with recovery from crashes, but even then it takes a lot of sophistication to get it right and deal with cases such as crashing during recovery itself. In today's paper Koskinen and Yang first provide ... Continue Reading

Minimizing Faulty Executions of Distributed Systems

November 18, 2015 ~ Adrian Colyer ~ 3 Comments

Minimizing Faulty Executions of Distributed Systems - Scott et al. Now that we've spent a couple of days looking at test case minimizing for sequential systems, we're ready to tackle Colin Scott et al.'s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin's ... Continue Reading

Holistic Configuration Management at Facebook

October 16, 2015 ~ Adrian Colyer ~ 14 Comments

Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015 This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times ... Continue Reading

The Art of Testing Less Without Sacrificing Quality

June 25, 2015 ~ Adrian Colyer ~ 4 Comments

The Art of Testing Less Without Sacrificing Quality - Herzig et al. 2015 Why on earth would anyone want to test less? Maybe if you could guarantee the same eventual quality, and save a couple of million dollars along the way... By nature, system and compliance tests are complex and time-consuming although they rarely find ... Continue Reading

Lineage-driven Fault Injection

March 26, 2015 ~ Adrian Colyer ~ 8 Comments

Lineage-driven Fault Injection - Alvaro et al. 2015 (** fixed broken link to SPL paper review **) This is the third of three papers looking at techniques that can help us to build more robust distributed systems. First we saw how the Statecall Policy Language can enforce rules on a single node of a distributed system ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Testing