Incremental knowledge base construction using DeepDive

October 7, 2016 / July 31, 2017 ~ adriancolyer ~ 16 Comments

Incremental knowledge base construction using DeepDive - Shin et al., VLDB 2015. When I think about the most important CS foundations for the computer systems we build today and will build over the next decade, I think about: distributed systems; database systems / data stores (dealing with data at rest); stream processing (dealing with data in …

Write-limited sorts and joins for persistent memory

September 30, 2016 / July 31, 2017 ~ adriancolyer ~ 1 Comment

Write-limited sorts and joins for persistent memory - Viglas, VLDB 2014. This is the second of the two research-for-practice papers for this week. Once more the topic is how database storage algorithms can be optimised for NVM, this time examining the asymmetry between reads and writes on NVM. This is premised on Viglas’ assertion that: Writes …

Let’s talk about storage and recovery methods for non-volatile memory database systems

September 29, 2016 / July 31, 2017 ~ adriancolyer ~ 6 Comments

Let’s talk about storage and recovery methods for non-volatile memory database systems - Arulraj et al., SIGMOD 2015. Update: fixed a bunch of broken links. I can’t believe I only just found out about this paper! It’s exactly what I’ve been looking for in terms of an analysis of the impacts of NVM on data storage …

DBSherlock: A performance diagnostic tool for transactional databases

July 14, 2016 / July 31, 2017 ~ adriancolyer ~ 5 Comments

DBSherlock: A performance diagnostic tool for transactional databases - Yoon et al., SIGMOD ’16. …tens of thousands of concurrent transactions competing for the same resources (e.g. CPU, disk I/O, memory) can create highly non-linear and counter-intuitive effects on database performance. If you’re a DBA responsible for figuring out what’s going on, this presents quite a challenge. …

Realtime data processing at Facebook

July 11, 2016 / July 31, 2017 ~ adriancolyer ~ 4 Comments

Realtime Data Processing at Facebook - Chen et al., SIGMOD 2016. ‘Realtime Data Processing at Facebook’ provides us with a great high-level overview of the systems Facebook have built to support real-time workloads. At the heart of the paper is a set of five key design decisions for building such systems, together with an explanation of …

Efficiently compiling efficient query plans for modern hardware

May 23, 2016 / July 27, 2017 ~ adriancolyer ~ 4 Comments

Efficiently Compiling Efficient Query Plans for Modern Hardware - Neumann, VLDB 2011. Updated with direct links to Databricks blog post now that it is published. A couple of weeks ago I had a chance to chat with Reynold Xin and Richard Garris from Databricks / Spark at RedisConf, where we were both giving talks. Reynold and …

BTrDB: Optimizing Storage System Design for Timeseries Processing

May 4, 2016 / July 27, 2017 ~ adriancolyer ~ 4 Comments

BTrDB: Optimizing Storage System Design for Timeseries Processing - Anderson & Culler 2016. It turns out you can accomplish quite a lot with 4,709 lines of Go code! How about a full time-series database implementation, robust enough to be run in production for a year where it stored 2.1 trillion data points, and supporting 119M …

Gorilla: A fast, scalable, in-memory time series database

May 3, 2016 / July 27, 2017 ~ adriancolyer ~ 16 Comments

Gorilla: A fast, scalable, in-memory time series database - Pelkonen et al. 2015. Error rates across one of Facebook’s sites were spiking. The problem had first shown up through an automated alert triggered by an in-memory time-series database called Gorilla a few minutes after the problem started. One set of engineers mitigated the immediate issue. …

The RAMCloud Storage System

January 18, 2016 / July 27, 2017 ~ adriancolyer ~ 11 Comments

The RAMCloud Storage System - Ousterhout et al. 2015. This paper is a comprehensive overview of RAMCloud, published in the ACM Transactions on Computer Systems in August 2015. It’s a long read (55 pages), but there’s a ton of great material here. The RAMCloud project started in 2009, so this is an overview of …

Granularity of Locks and Degree of Consistency in a Shared Data Base – Part II

January 6, 2016 / July 27, 2017 ~ adriancolyer ~ 3 Comments

Granularity of Locks and Degree of Consistency in a Shared Data Base - Gray et al. 1975. This is part 3 of a 7-part series on (database) ‘Techniques Everyone Should Know.’ Today we’ll look at the second part of this paper, which introduces the notion of differing degrees of consistency, and how we can …