Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

November 28, 2016 / July 31, 2017 ~ adriancolyer ~ 4 Comments

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services - Veeraraghavan et al. (Facebook) OSDI 2016. How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live …

Simple testing can prevent most critical failures

October 6, 2016 / July 31, 2017 ~ adriancolyer ~ 19 Comments

Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems - Yuan et al. OSDI 2014. After yesterday's paper I needed something a little easier to digest today, and 'Simple testing can prevent most critical failures' certainly hit the spot. Thanks to Caitie McCaffrey from whom I first heard about …

On the “naturalness” of buggy code

June 8, 2016 / July 27, 2017 ~ adriancolyer

On the 'naturalness' of buggy code - Ray, Hellendoorn, et al. ICSE 2016. Last week we looked at a simpler approach to building static code checkers that, by understanding less about the overall code structure and focusing on the things that really mattered, was able to produce competitive results from very small checker …

Why do record/replay tests of web applications break?

May 30, 2016 / July 27, 2017 ~ adriancolyer ~ 7 Comments

Why do Record/Replay Tests of Web Applications Break? - Hammoudi et al. ICST '16. Your web application regression tests created using record/replay tools are fragile and keep breaking. Hammoudi et al. set out to find out why. If we knew that, perhaps we could design mechanisms to automatically repair broken tests, or to build more …

Uncovering bugs in Distributed Storage Systems during Testing (not in production!)

May 5, 2016 / July 27, 2017 ~ adriancolyer ~ 4 Comments

Uncovering bugs in Distributed Storage Systems during Testing (not in production!) - Deligiannis et al. 2016. We interviewed technical leaders and senior managers in Microsoft Azure regarding the top problems in distributed system development. The consensus was that one of the most critical problems today is how to improve testing coverage so that bugs can …

Reducing Crash Recoverability to Reachability

February 4, 2016 / July 27, 2017 ~ adriancolyer ~ 1 Comment

Reducing Crash Recoverability to Reachability - Koskinen & Yang 2016. Techniques such as shadow paging and write-ahead logging can help with recovery from crashes, but even then it takes a lot of sophistication to get it right and deal with cases such as crashing during recovery itself. In today's paper Koskinen and Yang first provide …

Minimizing Faulty Executions of Distributed Systems

November 18, 2015 / July 27, 2017 ~ adriancolyer ~ 3 Comments

Minimizing Faulty Executions of Distributed Systems - Scott et al. Now that we've spent a couple of days looking at test case minimization for sequential systems, we're ready to tackle Colin Scott et al.'s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin's …

Holistic Configuration Management at Facebook

October 16, 2015 / July 27, 2017 ~ adriancolyer ~ 14 Comments

Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015. This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times …

The Art of Testing Less Without Sacrificing Quality

June 25, 2015 / July 26, 2017 ~ adriancolyer ~ 4 Comments

The Art of Testing Less Without Sacrificing Quality - Herzig et al. 2015. Why on earth would anyone want to test less? Maybe if you could guarantee the same eventual quality, and save a couple of million dollars along the way... By nature, system and compliance tests are complex and time-consuming although they rarely find …

Lineage-driven Fault Injection

March 26, 2015 / July 26, 2017 ~ adriancolyer ~ 8 Comments

Lineage-driven Fault Injection - Alvaro et al. 2015 (** fixed broken link to SPL paper review **). This is the third of three papers looking at techniques that can help us to build more robust distributed systems. First we saw how the Statecall Policy Language can enforce rules on a single node of a distributed system …