Morpheus: Towards automated SLOs for enterprise clusters

December 1, 2016 (updated July 31, 2017) ~ adriancolyer ~ 7 Comments

Morpheus: Towards automated SLOs for enterprise clusters, Jyothi et al., OSDI 2016. I'm really impressed with this paper - it covers all the bases from user studies to find out what's really important to end users, to data-driven engineering, a sprinkling of algorithms, a pragmatic implementation being made available in open source, and of course, …

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

November 28, 2016 (updated July 31, 2017) ~ adriancolyer ~ 4 Comments

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services, Veeraraghavan et al. (Facebook), OSDI 2016. How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live …

The Honey Badger of BFT protocols

November 18, 2016 (updated July 31, 2017) ~ adriancolyer ~ 5 Comments

The Honey Badger of BFT Protocols, Miller et al., CCS 2016. The surprising success of cryptocurrencies (blockchains) has led to a surge of interest in deploying large scale, highly robust, Byzantine fault tolerant (BFT) protocols for mission critical applications, such as financial transactions. In a ‘traditional’ distributed system consensus algorithm setting we assume a relatively …

Simple testing can prevent most critical failures

October 6, 2016 (updated July 31, 2017) ~ adriancolyer ~ 19 Comments

Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems, Yuan et al., OSDI 2014. After yesterday's paper I needed something a little easier to digest today, and 'Simple testing can prevent most critical failures' certainly hit the spot. Thanks to Caitie McCaffrey from whom I first heard about …

The load, capacity, and availability of quorum systems

October 3, 2016 (updated July 31, 2017) ~ adriancolyer ~ 8 Comments

The load, capacity, and availability of quorum systems, Naor & Wool, SIAM J Computing 1998. Update: fixed 'non-intersection property' to read 'non-empty intersection property.' Quite an important difference! With thanks to those who pointed out my mistake. This is the paper that Howard et al. referenced in Flexible Paxos as defining the “fundamental theorem of …

Distributed consensus and the implications of NVM on database management systems

September 28, 2016 (updated July 31, 2017) ~ adriancolyer ~ 2 Comments

Distributed consensus and the implications of NVM on database management systems, Fournier, Arulraj, & Pavlo, ACM Queue Vol 14, issue 3. As you may recall, Peter Bailis and ACM Queue have started a "Research for Practice" series introducing "expert curated guides to the best of CS research." Aka, reading lists for The Morning Paper! I …

Flexible Paxos: Quorum intersection revisited

September 27, 2016 (updated July 31, 2017) ~ adriancolyer ~ 6 Comments

Flexible Paxos: Quorum intersection revisited, Howard et al., 2016. Paxos has been around for 18 (26) years now, and has been extensively studied. (For some background, see the two-week mini-series on consensus that I put together last year.) In this paper, Howard et al. make a simple(?) observation that has significant consequences for improving the fault-tolerance …

Data on the Outside versus Data on the Inside

September 13, 2016 (updated July 31, 2017) ~ adriancolyer ~ 4 Comments

Data on the Outside vs Data on the Inside, Pat Helland, CIDR 2005. Another (modern) classic today, Pat Helland's wonderful 2005 paper on thinking about data in service oriented architectures. Sticking with the contemporary feel I'm going to write SOA as 'microservices' for the rest of this post. Helland shows us that we need to …

On designing and deploying internet-scale services

September 12, 2016 (updated July 31, 2017) ~ adriancolyer ~ 8 Comments

On designing and deploying internet-scale services, James Hamilton, LISA '07. Want to know how to build cloud native applications? You'll be hard-pushed to find a better collection of wisdom, best practices, and hard-won experience than this 2007 paper from James Hamilton. It's amazing to think that all of this knowledge was captured and written down …

BigDebug: Debugging primitives for interactive big data processing in Spark

June 7, 2016 (updated July 27, 2017) ~ adriancolyer ~ 2 Comments

BigDebug: Debugging primitives for interactive big data processing in Spark, Gulzar et al., ICSE 2016. BigDebug provides real-time interactive debugging support for Data-Intensive Scalable Computing (DISC) systems, or more particularly, Apache Spark. It provides breakpoints, watchpoints, latency monitoring, forward and backward tracing, crash monitoring, and a real-time fix-and-resume capability. The overheads are low for …