A higher order estimate of the optimum checkpoint interval for restart dumps

A higher order estimate of the optimum checkpoint interval for restart dumps - Daly 2004. TL;DR: if you know how long it takes your system to create a checkpoint/snapshot (δ), and you know the expected mean time between failures (M), then set the checkpoint interval to √(2δM) - δ. OK, I grant that today's paper …
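As a rough illustration of the TL;DR formula, here is a minimal Python sketch. The function name `checkpoint_interval` and its parameter names are hypothetical, chosen for this example only. It implements just the simple estimate √(2δM) - δ quoted above, not any refinement the paper may develop, and it assumes δ and M are given in the same time unit. The expression only yields a positive interval when δ < 2M, so the sketch rejects inputs outside that range.

```python
import math


def checkpoint_interval(delta: float, mtbf: float) -> float:
    """Return sqrt(2 * delta * mtbf) - delta, the estimate quoted in the TL;DR.

    delta: time to write one checkpoint (hypothetical parameter name).
    mtbf:  expected mean time between failures, in the same unit as delta.
    """
    if delta <= 0 or mtbf <= 0:
        raise ValueError("delta and mtbf must both be positive")
    if delta >= 2 * mtbf:
        # sqrt(2*delta*mtbf) - delta is zero or negative in this range,
        # so the simple estimate gives no usable interval.
        raise ValueError("estimate is only positive when delta < 2 * mtbf")
    return math.sqrt(2 * delta * mtbf) - delta


if __name__ == "__main__":
    # Assumed example values: 5-minute checkpoints, failures about once a day.
    print(checkpoint_interval(5, 24 * 60))  # 115.0 (minutes)
```

With these assumed numbers, a system that takes 5 minutes to checkpoint and fails about once every 24 hours would checkpoint roughly every 115 minutes.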

Detecting Termination of Distributed Computations Using Markers

Detecting Termination of Distributed Computations Using Markers - Misra 1983. There's an intriguing line in the Distributed GraphLab paper that caught my eye: "Termination is evaluated using distributed consensus algorithm described in [Ref]." Today's choice is the paper by Misra in 1983 that describes this distributed termination detection algorithm. The solution is similar in spirit …

Scaling Concurrent Log-Structured Data Stores

Scaling Concurrent Log-Structured Data Stores - Golan-Gueta et al. 2015. Key-value stores based on log-structured merge trees are everywhere. The original design was intended to mitigate slow disk I/O. Once this is achieved, as we scale to more and more cores the authors find that in-memory contention now becomes the bottleneck (see yesterday's piece on …

Distributed Snapshots: Determining Global States of Distributed Systems

Distributed Snapshots: Determining Global States of Distributed Systems - Chandy & Lamport 1985. What state is your distributed system in? In the absence of a universal clock, is that even a well-formed question? And if you could take a distributed snapshot of system state, would that be useful? Through an algorithm that has simply become …

Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures

Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures - David et al. 2015. Linked lists, hash tables, skip lists, binary search trees... these data structures are core to many programs. This paper studies such search data structures, supporting search, insert, and remove operations. In particular, the authors look at concurrent versions of these …

A Comprehensive study of Convergent and Commutative Replicated Data Types

A comprehensive study of Convergent and Commutative Replicated Data Types - Shapiro et al. 2011. This is the third of five Desert Island Paper choices from Jonas Bonér, and it continues the theme of avoiding coordination overhead in a principled manner whenever you can. As we saw yesterday, there are trade-offs between consistency, failure tolerance, …

A Hitchhiker’s Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers

A Hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers - Rashmi et al. So far this week we've looked at a programming languages paper and a systems paper, so for today I thought it would be fun to look at an algorithm-based paper. HDFS enables horizontally scalable low-cost storage for the …