Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

Veeraraghavan et al. (Facebook), OSDI 2016. How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live ...

Uncovering bugs in Distributed Storage Systems during Testing (not in production!)

Deligiannis et al., 2016. We interviewed technical leaders and senior managers in Microsoft Azure regarding the top problems in distributed system development. The consensus was that one of the most critical problems today is how to improve testing coverage so that bugs can ...

Holistic Configuration Management at Facebook

Tang et al. (Facebook), 2015. This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps. The configuration for Facebook’s site is updated thousands of times ...