Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services Veeraraghavan et al. (Facebook) OSDI 2016 How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live …
Tag: Testing
The art of testing software systems.
Simple testing can prevent most critical failures
Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems Yuan et al. OSDI 2014 After yesterday's paper I needed something a little easier to digest today, and 'Simple testing can prevent most critical failures' certainly hit the spot. Thanks to Caitie McCaffrey, from whom I first heard about …
On the “naturalness” of buggy code
On the 'naturalness' of buggy code - Ray, Hellendoorn, et al. ICSE 2016 Last week we looked at a simpler approach to building static code checkers that, by understanding less about the overall code structure and focusing on the things that really mattered, was able to produce competitive results from very small checker …
Why do record/replay tests of web applications break?
Why do Record/Replay Tests of Web Applications Break? - Hammoudi et al. ICST '16 Your web application regression tests created using record/replay tools are fragile and keep breaking. Hammoudi et al. set out to find out why. If we knew that, perhaps we could design mechanisms to automatically repair broken tests, or to build more …
Uncovering bugs in Distributed Storage Systems during Testing (not in production!)
Uncovering bugs in Distributed Storage Systems during Testing (not in production!) - Deligiannis et al. 2016 We interviewed technical leaders and senior managers in Microsoft Azure regarding the top problems in distributed system development. The consensus was that one of the most critical problems today is how to improve testing coverage so that bugs can …
Reducing Crash Recoverability to Reachability
Reducing Crash Recoverability to Reachability - Koskinen & Yang 2016. Techniques such as shadow paging and write-ahead logging can help with recovery from crashes, but even then it takes a lot of sophistication to get it right and deal with cases such as crashing during recovery itself. In today's paper Koskinen and Yang first provide …
Minimizing Faulty Executions of Distributed Systems
Minimizing Faulty Executions of Distributed Systems - Scott et al. Now that we've spent a couple of days looking at test case minimizing for sequential systems, we're ready to tackle Colin Scott et al.'s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin's …
Holistic Configuration Management at Facebook
Holistic Configuration Management at Facebook - Tang et al. (Facebook) 2015 This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook's site is updated thousands of times …
The Art of Testing Less Without Sacrificing Quality
The Art of Testing Less Without Sacrificing Quality - Herzig et al. 2015 Why on earth would anyone want to test less? Maybe if you could guarantee the same eventual quality, and save a couple of million dollars along the way... By nature, system and compliance tests are complex and time-consuming although they rarely find …
Lineage-driven Fault Injection
Lineage-driven Fault Injection - Alvaro et al. 2015 (** fixed broken link to SPL paper review **) This is the third of three papers looking at techniques that can help us to build more robust distributed systems. First we saw how the Statecall Policy Language can enforce rules on a single node of a distributed system …
