PSync: A Partially Synchronous Language for Fault-Tolerant Distributed Algorithms

February 1, 2016 (updated July 27, 2017) ~ adriancolyer ~ 1 Comment

PSync: A Partially Synchronous Language for Fault-Tolerant Distributed Algorithms - Drăgoi et al. 2016

Last month we looked at the RAMCloud team's design pattern for building distributed, concurrent, fault-tolerant modules. Today's paper goes one step beyond a pattern, and introduces a domain-specific language called PSync with the goal of unifying the modeling, programming, and verification …

Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter

January 29, 2016 (updated July 27, 2017) ~ adriancolyer ~ 1 Comment

Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter - Tasci & Demirbas, 2015

Today we return to the theme of distributed transactions, and a paper that won a best paper award from IEEE Big Data in 2015. Panopticon is a centralized lock broker (like Chubby and ZooKeeper) that manages distributed (decentralized) …

Petuum: A New Platform for Distributed Machine Learning on Big Data

January 27, 2016 (updated July 27, 2017) ~ adriancolyer ~ 6 Comments

Petuum: A New Platform for Distributed Machine Learning on Big Data - Xing et al. 2015

How do you perform machine learning with big models (big here could be hundreds of billions of parameters!) over big data sets (terabytes or petabytes)? Take for example state-of-the-art image recognition systems that have embraced large-scale …

A Distributed Systems Seminar Reading List…

January 24, 2016 (updated July 27, 2017) ~ adriancolyer ~ 2 Comments

I stumbled upon Murat Demirbas' 'Distributed Systems Seminar's Reading List for Spring 2016.' If you're taking part in those seminars, you're in for some very interesting papers! I was pleased to discover I've read (and written up) most of them - but there are a few that I haven't. Given the high calibre of the …

Experience with Rules-Based Programming for Distributed Concurrent Fault-Tolerant Code

January 19, 2016 (updated July 27, 2017) ~ adriancolyer ~ 5 Comments

Experience with Rules-Based Programming for Distributed, Concurrent, Fault-Tolerant Code - Stutsman et al. 2015

As we saw in yesterday's paper, the authors of RAMCloud settled on a very effective design pattern for writing distributed, concurrent, fault-tolerant (DCFT) modules within their system. They call this pattern 'rules-based programming' - a collection of (condition, action) pairs that can …

The RAMCloud Storage System

January 18, 2016 (updated July 27, 2017) ~ adriancolyer ~ 11 Comments

The RAMCloud Storage System - Ousterhout et al. 2015

This paper is a comprehensive overview of RAMCloud, published in the ACM Transactions on Computer Systems in August 2015. It's a long read (55 pages), but there's a ton of great material here. The RAMCloud project started in 2009, so this is an overview of …

Consensus on Transaction Commit

January 13, 2016 (updated July 27, 2017) ~ adriancolyer ~ 4 Comments

Consensus on Transaction Commit - Gray & Lamport 2004/5

Last year on The Morning Paper we spent a considerable amount of time looking at consensus protocols. Over the last couple of days we've been looking at the two-phase commit protocol. But isn't two-phase commit just a special-purpose version of the consensus problem, where we …

FIT: A Distributed Database Performance Trade-off

November 25, 2015 (updated July 27, 2017) ~ adriancolyer ~ 1 Comment

FIT: A Distributed Database Performance Trade-off - Faleiro & Abadi, 2015

If the CAP FITs... This paper presents the FIT trade-off for distributed transactions: you can have any two of Fairness, (strong) Isolation, and Throughput, but not all three. Which also implies you can have both strong isolation and high throughput! As a consequence of …

Fail at Scale & Controlling Queue Delay

November 19, 2015 (updated July 27, 2017) ~ adriancolyer ~ 4 Comments

Controlling Queue Delay - Nichols & Van Jacobson, 2012, and Fail at Scale - Maurer, 2015

Fail at Scale (Maurer): Ben Maurer recently wrote a great article for ACM Queue on how Facebook achieves reliability in the face of rapid change: "To keep Facebook reliable in the face of rapid change we study common patterns …"

Minimizing Faulty Executions of Distributed Systems

November 18, 2015 (updated July 27, 2017) ~ adriancolyer ~ 3 Comments

Minimizing Faulty Executions of Distributed Systems - Scott et al.

Now that we've spent a couple of days looking at test case minimization for sequential systems, we're ready to tackle Colin Scott et al.'s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin's …