Practical Byzantine Fault Tolerance

May 18, 2015 (updated July 26, 2017) ~ adriancolyer ~ 50 Comments

Practical Byzantine Fault Tolerance - Castro & Liskov 1999. Oh Byzantine, you conflict me. On the one hand, we know that the old model of a security perimeter around an undefended centre is hopelessly broken (witness Google moves its Corporate Applications to the Internet) - so Byzantine models, which allow for any deviation from expected behaviour …

Extensible Distributed Coordination

May 8, 2015 (updated July 26, 2017) ~ adriancolyer

Extensible Distributed Coordination - Distler et al. 2015. Coordination services such as ZooKeeper offer a deliberately limited API. As a consequence, more complex coordination tasks have to be implemented as multiple RPCs. In Extensible Distributed Coordination, Distler et al. describe a sandboxed extension mechanism for coordination services that allows execution of client logic in the …

Large-scale cluster management at Google with Borg

May 7, 2015 (updated July 26, 2017) ~ adriancolyer ~ 9 Comments

Large-scale cluster management at Google with Borg - Verma et al. 2015. Borg has been running all of Google's workloads for the last ten years, and the learnings from Borg are being packaged into Kubernetes so that the rest of the world can benefit from them. An important paper, then, as the rest of us …

Blade: A data center garbage collector

May 6, 2015 (updated July 26, 2017) ~ adriancolyer ~ 3 Comments

Blade: A data center garbage collector - Terei & Levy 2015. Thanks to Justin Mason (@jmason) for bringing today's choice to my attention. GC times are a major cause of latency in the tail - Blade aims to fix this. By taking a distributed systems perspective rather than just a single node view, Blade collaborates …

Taming uncertainty in distributed systems with help from the network

May 5, 2015 (updated July 26, 2017) ~ adriancolyer ~ 2 Comments

Taming uncertainty in distributed systems with help from the network - Leners et al. 2015. Albatross is a membership service with a very interesting new twist: it exploits SDN functionality to actively enforce partitions! Perhaps it is not immediately obvious why that might be a good thing :). It turns out there are several benefits: …

Putting Consistency Back into Eventual Consistency

May 4, 2015 (updated July 26, 2017) ~ adriancolyer ~ 3 Comments

Putting Consistency Back into Eventual Consistency - Balegas et al. 2015. Today's choice is another pick from the recent crop of EuroSys 2015 papers. Balegas et al. show us that we don't have to put up with weak forms of eventual consistency, even in geo-replicated settings. In Building on Quicksand, Helland argued that we need …

Distributed Snapshots: Determining Global States of Distributed Systems

April 22, 2015 (updated July 26, 2017) ~ adriancolyer ~ 14 Comments

Distributed Snapshots: Determining Global States of Distributed Systems - Chandy & Lamport 1985. What state is your distributed system in? In the absence of a universal clock, is that even a well-formed question? And if you could take a distributed snapshot of system state, would that be useful? Through an algorithm that has simply become …

Cross-layer scheduling in cloud systems

April 15, 2015 (updated July 26, 2017) ~ adriancolyer ~ 3 Comments

Cross-layer scheduling in cloud systems - Alkaff et al. 2015. This paper was presented last month at the 2015 International Conference on Cloud Engineering, and explores what happens when you coordinate application scheduling with network route allocation via SDN (hence: cross-layer scheduling). With clusters of 30 nodes, the authors demonstrate results that can improve the …

Lineage-driven Fault Injection

March 26, 2015 (updated July 26, 2017) ~ adriancolyer ~ 8 Comments

Lineage-driven Fault Injection - Alvaro et al. 2015. (** fixed broken link to SPL paper review **) This is the third of three papers looking at techniques that can help us to build more robust distributed systems. First we saw how the Statecall Policy Language can enforce rules on a single node of a distributed system …

SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems

March 25, 2015 (updated July 26, 2017) ~ adriancolyer ~ 6 Comments

SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems - Leesatapornwongsa et al. 2014. This is the second of three papers we'll be looking at this week on the theme of verifying correctness of, and catching bugs in, distributed systems. Yesterday we saw the Statecall Policy Language and associated tool chain …