Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

Veeraraghavan et al. (Facebook), OSDI 2016. How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live ...

The Honey Badger of BFT protocols

Miller et al., CCS 2016. The surprising success of cryptocurrencies (blockchains) has led to a surge of interest in deploying large scale, highly robust, Byzantine fault tolerant (BFT) protocols for mission critical applications, such as financial transactions. In a ‘traditional’ distributed system consensus algorithm setting we assume a relatively ...

The load, capacity, and availability of quorum systems

Naor & Wool, SIAM J Computing 1998. Update: fixed 'non-intersection property' to read 'non-empty intersection property.' Quite an important difference! With thanks to those who pointed out my mistake. This is the paper that Howard et al. referenced in Flexible Paxos as defining the “fundamental theorem of ...

Distributed consensus and the implications of NVM on database management systems

Fournier, Arulraj, & Pavlo, ACM Queue Vol 14, issue 3. As you may recall, Peter Bailis and ACM Queue have started a "Research for Practice" series introducing "expert curated guides to the best of CS research." Aka, reading lists for The Morning Paper! I ...

BigDebug: Debugging primitives for interactive big data processing in Spark

Gulzar et al., ICSE 2016. BigDebug provides real-time interactive debugging support for Data-Intensive Scalable Computing (DISC) systems, or more particularly, Apache Spark. It provides breakpoints, watchpoints, latency monitoring, forward and backward tracing, crash monitoring, and a real-time fix-and-resume capability. The overheads are low for ...