Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions

March 8, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 10 Comments

Ganesan et al., FAST 2017. It's a tough life being the developer of a distributed datastore. Thanks to the wonderful work of Kyle Kingsbury (aka @aphyr) and his efforts on Jepsen.io, awareness of data loss and related issues in ...

HopFS: Scaling hierarchical file system metadata using NewSQL databases

March 6, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 10 Comments

Niazi et al., FAST 2017. If you're working with big data and Hadoop, this one paper could repay your investment in The Morning Paper many times over (ok, The Morning Paper is free - but you do pay with your time to read it). You know ...

Ground: A data context service

January 23, 2017 (updated November 11, 2019) ~ Adrian Colyer ~ 6 Comments

Hellerstein et al., CIDR 2017. An unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage. Put more bluntly, many organisations have only a fuzzy picture ...

Generic attacks on secure outsourced databases

November 16, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 2 Comments

Kellaris et al., CCS 2016. Here’s a really interesting paper that helps to set some boundaries around what we can expect from encrypted databases in the cloud. Independently of the details of any one system (or encryption scheme), the authors look at what data it is possible to recover ...

Scaling Spark in the real world: performance and usability

November 4, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 5 Comments

Armbrust et al., VLDB 2015. A short and easy paper from the Databricks team to end the week. Given the pace of development in the Apache Spark world, a paper published in 2015 about enhancements to Spark will of course be a little dated. But this ...

Replex: A scalable, highly available multi-index data store

October 27, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 2 Comments

Tai et al., USENIX 2016. Today’s choice won a best paper award at USENIX this year. Replex addresses the problem of key-value stores in which you also want to have an efficient query capability by values other than the primary key. … NoSQL databases achieve scalability by ...

Incremental knowledge base construction using DeepDive

October 7, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 14 Comments

Shin et al., VLDB 2015. When I think about the most important CS foundations for the computer systems we build today and will build over the next decade, I think about: distributed systems; database systems / data stores (dealing with data at rest); stream processing (dealing with data in ...

Write-limited sorts and joins for persistent memory

September 30, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 1 Comment

Viglas, VLDB 2014. This is the second of the two research-for-practice papers for this week. Once more the topic is how database storage algorithms can be optimised for NVM, this time examining the asymmetry between reads and writes on NVM. This is premised on Viglas’ assertion that: Writes ...

Let’s talk about storage and recovery methods for non-volatile memory database systems

September 29, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 6 Comments

Arulraj et al., SIGMOD 2015. Update: fixed a bunch of broken links. I can't believe I only just found out about this paper! It's exactly what I've been looking for in terms of an analysis of the impacts of NVM on data storage ...

DBSherlock: A performance diagnostic tool for transactional databases

July 14, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 5 Comments

Yoon et al., SIGMOD ’16. …tens of thousands of concurrent transactions competing for the same resources (e.g. CPU, disk I/O, memory) can create highly non-linear and counter-intuitive effects on database performance. If you’re a DBA responsible for figuring out what’s going on, this presents quite a challenge. ...

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Datastores