An empirical guide to the behavior and use of scalable persistent memory

March 18, 2020 / March 15, 2020 ~ adriancolyer

An empirical guide to the behavior and use of scalable persistent memory, Yang et al., FAST'20 We've looked at multiple papers exploring non-volatile main memory and its implications (e.g. most recently 'Efficient lock-free durable sets'). One thing they all had in common is an evaluation using some kind of simulation of the expected behaviour of …

Understanding, detecting and localizing partial failures in large system software

March 16, 2020 / March 15, 2020 ~ adriancolyer ~ 2 Comments

Understanding, detecting and localizing partial failures in large system software, Lou et al., NSDI'20 Partial failures (gray failures) occur when some but not all of the functionalities of a system are broken. On the surface everything can appear to be fine, but under the covers things may be going astray. When a partial failure occurs, …

When correlation (or lack of it) can be causation

March 13, 2020 / March 8, 2020 ~ adriancolyer ~ 1 Comment

Rex: preventing bugs and misconfiguration in large services using correlated change analysis, Mehta et al., NSDI'20 and Check before you change: preventing correlated failures in service updates, Zhai et al., NSDI'20 Today's post is a double header. I've chosen two papers from NSDI'20 that are both about correlation. Rex is a tool widely deployed across …

Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook

March 11, 2020 / March 8, 2020 ~ adriancolyer ~ 2 Comments

Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook, Cao et al., FAST'20 You get good at what you practice. Or in the case of key-value stores, what you benchmark. So if you want to design a system that will offer good real-world performance, it's really useful to have benchmarks that accurately represent real-world workloads. …

Building an elastic query engine on disaggregated storage

March 9, 2020 / March 8, 2020 ~ adriancolyer ~ 1 Comment

Building an elastic query engine on disaggregated storage, Vuppalapati et al., NSDI'20 This paper describes the design decisions behind the Snowflake cloud-based data warehouse. As the saying goes, 'all snowflakes are special' - but what is it exactly that's special about this one? When I think about cloud-native architectures, I think about disaggregation (enabling each resource type …

Millions of tiny databases

March 4, 2020 / March 1, 2020 ~ adriancolyer ~ 14 Comments

Millions of tiny databases, Brooker et al., NSDI'20 This paper is a real joy to read. It takes you through the thinking processes and engineering practices behind the design of a key part of the control plane for AWS Elastic Block Store (EBS): the Physalia database that stores configuration information. In the same spirit as …

Firecracker: lightweight virtualization for serverless applications

March 2, 2020 / March 1, 2020 ~ adriancolyer ~ 3 Comments

Firecracker: lightweight virtualisation for serverless applications, Agache et al., NSDI'20 Finally the NSDI'20 papers have opened up to the public (as of last week), and what a great-looking crop of papers it is. We looked at a couple of papers that had pre-prints available last week; today we'll be looking at one of the …

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure

February 28, 2020 / February 22, 2020 ~ adriancolyer ~ 5 Comments

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure, Li et al., NSDI'20 Modern software systems at scale are incredibly complex, ever-changing environments. Despite all the pre-deployment testing you might employ, this makes it really tough to change them with confidence. Thus it's common to use some form of phased rollout, …

Meaningful availability

February 26, 2020 ~ adriancolyer ~ 13 Comments

Meaningful availability, Hauer et al., NSDI'20 With thanks to Damien Mathieu for the recommendation. This very clearly written paper describes the Google G Suite team's search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements. > A …

AnyLog: a grand unification of the Internet of things

February 24, 2020 / February 22, 2020 ~ adriancolyer ~ 3 Comments

AnyLog: a grand unification of the Internet of Things, Abadi et al., CIDR'20 The Web provides decentralised publishing and direct access to unstructured data (searching / querying that data has turned out to be a pretty centralised affair in practice though). AnyLog wants to do for structured (relational) data what the Web has done for …