Moment-based quantile sketches for efficient high cardinality aggregation queries

October 31, 2018 ~ October 27, 2018 ~ adriancolyer ~ 1 Comment

Gan et al., VLDB'18. Today we’re temporarily pausing our tour through some of the OSDI’18 papers in order to look at a great sketch-based data structure for quantile queries over high-cardinality aggregates. That’s a bit of a mouthful so let’s jump straight into an example of …

Noria: dynamic, partially-stateful data-flow for high-performance web applications

October 29, 2018 ~ October 27, 2018 ~ adriancolyer ~ 3 Comments

Gjengset, Schwarzkopf et al., OSDI'18. I have way more margin notes for this paper than I typically do, and that’s a reflection of my struggle to figure out what kind of thing we’re dealing with here. Noria doesn’t want to fit neatly into any existing box! We’ve …

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

October 26, 2018 ~ October 20, 2018 ~ adriancolyer ~ 11 Comments

Berger et al., OSDI'18. It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity. We have a certain finite cache capacity, and …

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

October 24, 2018 ~ October 20, 2018 ~ adriancolyer ~ 4 Comments

Veeraraghavan et al., OSDI'18. Here’s a really valuable paper detailing four-plus years of experience dealing with datacenter outages at Facebook. Maelstrom is the system Facebook use in production to mitigate and recover from datacenter-level disasters. The high-level idea is simple: drain traffic …

LegoOS: a disseminated, distributed OS for hardware resource disaggregation

October 22, 2018 ~ October 20, 2018 ~ adriancolyer ~ 6 Comments

Shan et al., OSDI'18. One of the interesting trends in hardware is the proliferation and importance of dedicated accelerators as general-purpose CPUs stopped benefiting from Moore’s law. At the same time we’ve seen networking getting faster and faster, causing us to rethink some of the …

Orca: differential bug localization in large-scale services

October 19, 2018 ~ October 11, 2018 ~ adriancolyer ~ 8 Comments

Bhagwan et al., OSDI'18. Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus …

REPT: reverse debugging of failures in deployed software

October 17, 2018 ~ October 11, 2018 ~ adriancolyer ~ 9 Comments

Cui et al., OSDI'18. REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix …

Capturing and enhancing in situ system observability for failure detection

October 15, 2018 ~ October 11, 2018 ~ adriancolyer ~ 2 Comments

Huang et al., OSDI'18. The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is …

Automatic discovery of tactics in spatio-temporal soccer match data

October 12, 2018 ~ October 11, 2018 ~ adriancolyer ~ 2 Comments

Decroos et al., KDD'18. Here’s a fun paper to end the week. Data collection from sporting events is now widespread. This fuels an endless thirst for team and player statistics. In terms of football (which shall refer to the game of soccer throughout this write-up) that …

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding

October 10, 2018 ~ October 7, 2018 ~ adriancolyer ~ 3 Comments

Hundman et al., KDD'18. How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite). Spacecraft are …