Gray failure: the Achilles’ heel of cloud-scale systems

June 15, 2017 ~ Adrian Colyer ~ 18 Comments

Gray failure: the Achilles' heel of cloud-scale systems Huang et al., HotOS'17 If you're going to fail, fail properly dammit! All this limping along in degraded mode, doing your best to mask problems, turns out to be one of the key causes of major availability breakdowns and performance anomalies in cloud-scale systems. Today's HotOS'17 paper ... Continue Reading

Hybrids on Steroids: SGX-based high-performance BFT

June 5, 2017 ~ Adrian Colyer ~ 1 Comment

Hybrids on Steroids: SGX-based high performance BFT Behl et al., EuroSys'17 Byzantine fault tolerance (BFT) is the kind of fault-tolerance designed to withstand not just process crashes and network problems, but also active adversaries trying to break the system, as well as storage and memory corruptions and so on. We've taken a look at BFT ... Continue Reading

Online reconstruction of structural information from datacenter logs

May 31, 2017 ~ Adrian Colyer ~ 4 Comments

Online reconstruction of structural information from datacenter logs Chothia et al., EuroSys'17 Today's choice brings together a couple of themes that we've previously looked at on The Morning Paper: recovering system information from log files, and dataflows for stream processing. On log files (and tracing), see for example Dapper, the MysteryMachine, lprof, and Pivot tracing. ... Continue Reading

An empirical study on the correctness of formally verified distributed systems

May 29, 2017 ~ Adrian Colyer ~ 9 Comments

An empirical study on the correctness of formally verified distributed systems Fonseca et al., EuroSys'17 "Is your distributed system bug free?" "I formally verified it!" "Yes, but is your distributed system bug free?" There's a really important discussion running through this paper - what does it take to write bug-free systems software? I have a ... Continue Reading

Efficient memory disaggregation with Infiniswap

May 5, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 3 Comments

Efficient memory disaggregation with Infiniswap Gu et al., NSDI '17 If we move performance numbers onto a human scale (let 1ns of processor time = 1 second of human time) then it's easier to get an intuition - for me at least - of the relative cost of different operations. In this world, it takes ... Continue Reading

CherryPick: Adaptively unearthing the best cloud configurations for big data analytics

May 4, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 14 Comments

CherryPick: Adaptively unearthing the best cloud configurations for big data analytics Alipourfard et al., NSDI'17 For big data analytics jobs, especially recurring jobs, finding a good cloud configuration (number and type of machines, CPU, memory, disk and network options) can make a big difference to overall cost and runtimes. Likewise, a poor choice can seriously ... Continue Reading

vCorfu: A cloud-scale object store on a shared log

May 3, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 2 Comments

vCorfu: A cloud-scale object store on a shared log Wei et al., NSDI'17 vCorfu builds on the idea of a distributed shared log that we looked at yesterday with CORFU, to construct a distributed object store. We show that vCorfu outperforms Cassandra, a popular state-of-the-art NoSQL store, while providing strong consistency (opacity, read-own-writes), efficient transactions, ... Continue Reading

Corfu: A distributed shared log

May 2, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 12 Comments

Corfu: A distributed shared log Balakrishnan et al., ACM TOCS, 2013 (If you experience any difficulty in accessing the pdf in the above link please let me know, it should be open for you on the ACM DL. UPDATE, many readers are still seeing a paywall for the above paper link, here's an alternative open ... Continue Reading

HopFS: Scaling hierarchical file system metadata using NewSQL databases

March 6, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 10 Comments

HopFS: Scaling hierarchical file system metadata using NewSQL databases Niazi et al., FAST 2017 If you're working with big data and Hadoop, this one paper could repay your investment in The Morning Paper many times over (ok, The Morning Paper is free - but you do pay with your time to read it). You know ... Continue Reading

Incremental consistency guarantees for replicated objects

January 13, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 3 Comments

Incremental consistency guarantees for replicated objects Guerraoui et al., OSDI 2016 We know that there's a price to be paid for strong consistency in terms of higher latencies and reduced throughput. We also know that there's a price to be paid for weaker consistency in terms of application correctness and / or programmer difficulty. Furthermore, ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Distributed Systems