PSync: A Partially Synchronous Language for Fault-Tolerant Distributed Algorithms - Drăgoi et al. 2016 Last month we looked at the RAMCloud team's design pattern for building distributed, concurrent, fault-tolerant modules. Today's paper goes one step beyond a pattern, and introduces a domain-specific language called PSync with the goal of unifying the modeling, programming, and verification … Continue reading PSync: A Partially Synchronous Language for Fault-Tolerant Distributed Algorithms
Tag: Distributed Systems
Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter
Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter - Tasci & Demirbas, 2015 Today we return to the theme of distributed transactions, and a paper that won a best paper award from IEEE Big Data in 2015. Panopticon is a centralized lock broker (like Chubby and ZooKeeper) that manages distributed (decentralized) … Continue reading Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter
Petuum: A New Platform for Distributed Machine Learning on Big Data
Petuum: A New Platform for Distributed Machine Learning on Big Data - Xing et al. 2015 How do you perform machine learning with big models (big here could be 100s of billions of parameters!) over big data sets (terabytes or petabytes)? Take for example state of the art image recognition systems that have embraced large-scale … Continue reading Petuum: A New Platform for Distributed Machine Learning on Big Data
A Distributed Systems Seminar Reading List…
I stumbled upon Murat Demirbas' 'Distributed Systems Seminar's Reading List for Spring 2016.' If you're taking part in those seminars, you're in for some very interesting papers! I was pleased to discover I've read (and written up) most of them - but there are a few that I haven't. Given the high calibre of the … Continue reading A Distributed Systems Seminar Reading List…
Experience with Rules-Based Programming for Distributed Concurrent Fault-Tolerant Code
Experience with Rules-Based Programming for Distributed, Concurrent, Fault-Tolerant Code - Stutsman et al. 2015 As we saw in yesterday's paper, the authors of RAMCloud settled on a very effective design pattern for writing distributed, concurrent, fault-tolerant (DCFT) modules within their system. They call this pattern 'rules-based programming' - a collection of (condition,action) pairs that can … Continue reading Experience with Rules-Based Programming for Distributed Concurrent Fault-Tolerant Code
The RAMCloud Storage System
The RAMCloud Storage System - Ousterhout et al. 2015 This paper is a comprehensive overview of RAMCloud, published in the ACM Transactions on Computer Systems in August 2015. It's a long read (55 pages), but there's a ton of great material here. The RAMCloud project started in 2009, so this is therefore an overview of … Continue reading The RAMCloud Storage System
Consensus on Transaction Commit
Consensus on Transaction Commit - Gray & Lamport 2004/5 Last year on The Morning Paper we spent a considerable amount of time looking at consensus protocols. Over the last couple of days we've been looking at the two-phase commit protocol. But isn't two-phase commit just a special purpose version of the consensus problem, where we … Continue reading Consensus on Transaction Commit
FIT: A Distributed Database Performance Trade-off
FIT: A Distributed Database Performance Trade-off - Faleiro & Abadi, 2015 If the CAP FITs... This paper presents the FIT trade-off for distributed transactions: you can have any two of Fairness, (strong) Isolation, and Throughput, but not all three. Which also implies you can have both strong isolation and high throughput! As a consequence of … Continue reading FIT: A Distributed Database Performance Trade-off
Fail at Scale & Controlling Queue Delay
Controlling Queue Delay - Nichols & Van Jacobsen, 2012, and Fail at Scale - Maurer, 2015 Fail at Scale (Maurer) Ben Maurer recently wrote a great article for ACM Queue on how Facebook achieves reliability in the face of rapid change: To keep Facebook reliable in the face of rapid change we study common patterns … Continue reading Fail at Scale & Controlling Queue Delay
Minimizing Faulty Executions of Distributed Systems
Minimizing Faulty Executions of Distributed Systems - Scott et al. Now that we've spent a couple of days looking at test case minimizing for sequential systems, we're ready to tackle Colin Scott et al.'s paper on doing the same for executions of distributed systems. This is the paper that describes the core system behind Colin's … Continue reading Minimizing Faulty Executions of Distributed Systems