Making Sense of Performance in Data Analytics Frameworks - Ousterhout et al. 2015 We all know the causes of poor performance in big data analytics workloads: network I/O, disk I/O, and straggler tasks. Ousterhout et al. set out to try and quantify this, and found that what we think we know isn't necessarily so. Yet …
Category: Distributed Systems
Core distributed systems topics, for example consistency, availability and so on.
ApproxHadoop: Bringing Approximations to MapReduce Frameworks
ApproxHadoop: Bringing Approximations to MapReduce Frameworks - Goiri et al. 2015 Yesterday we saw how including networking concerns in scheduling decisions can increase throughput for MapReduce jobs (and Storm topologies) by ~30%. Today we look at an even more effective strategy for getting the most out of your Hadoop cluster: doing less work! On one …
Cross-layer scheduling in cloud systems
Cross-layer scheduling in cloud systems - Alkaff et al. 2015 This paper was presented last month at the 2015 International Conference on Cloud Engineering, and explores what happens when you coordinate application scheduling with network route allocation via SDN (hence: cross-layer scheduling). With clusters of 30 nodes, the authors demonstrate results that can improve the …
Scalable Atomic Visibility with RAMP Transactions
Scalable Atomic Visibility with RAMP Transactions - Bailis et al. 2014 RAMP transactions came up last week as part of the secret sauce in Coordination avoidance in database systems that contributed to a 25x improvement on the TPC-C benchmark. So what exactly are RAMP transactions and why might we need them? As soon as you …
Lineage-driven Fault Injection
Lineage-driven Fault Injection - Alvaro et al. 2015 (** fixed broken link to SPL paper review **) This is the third of three papers looking at techniques that can help us to build more robust distributed systems. First we saw how the Statecall Policy Language can enforce rules on a single node of a distributed system …
SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems
SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems - Leesatapornwongsa et al. 2014 This is the second of three papers we'll be looking at this week on the theme of verifying correctness of, and catching bugs in, distributed systems. Yesterday we saw the Statecall Policy Language and associated tool chain …
Combining static model checking with dynamic enforcement using the Statecall Policy Language
Combining static model checking with dynamic enforcement using the Statecall Policy Language - Madhavapeddy 2009 We know that getting distributed systems right is hard, and subtle, 'deep' bugs can lurk in both algorithms and implementations. Can we do better than informal reasoning coupled with some unit and integration tests? Evidence suggests we have to do …
Building on Quicksand
Building on Quicksand - Helland & Campbell 2009 Last week we looked at Consistency analysis in Bloom, and Coordination Avoidance in Database Systems. A common theme in both of these is that some collaboration between the application (understanding of application level semantics) and datastore is key to unlocking the next level of performance. We can …
Coordination Avoidance in Database Systems
Coordination Avoidance in Database Systems - Bailis et al. 2014 The very title of this paper speaks to the theme we've been looking at so far this week - how to reduce the amount of coordination needed in a distributed system. (Which seems fitting having just spent the prior two weeks looking at how costly …
A Comprehensive study of Convergent and Commutative Replicated Data Types
A comprehensive study of Convergent and Commutative Replicated Data Types - Shapiro et al. 2011 This is the third of five Desert Island Paper choices from Jonas Bonér, and it continues the theme of avoiding coordination overhead in a principled manner whenever you can. As we saw yesterday, there are trade-offs between consistency, failure tolerance, …