Debugging distributed systems with why-across-time provenance

November 12, 2018 ~ adriancolyer ~ 3 Comments

Whittaker et al., SoCC'18

"This value is 17 here, and it shouldn’t be. Why did the get request return 17?" Sometimes the simplest questions can be the hardest to answer. As the opening sentence of this paper states: "Debugging distributed systems is hard." The kind of why questions we’re …

ApproxJoin: approximate distributed joins

November 9, 2018 ~ adriancolyer ~ 1 Comment

Le Quoc et al., SoCC'18. GitHub: https://ApproxJoin.github.io

The join is a fundamental data processing operation and has been heavily optimised in relational databases. When you’re working with large volumes of unstructured data though, say with a data processing framework such as Flink or Spark, joins become distributed and much more expensive. …

ASAP: fast, approximate graph pattern mining at scale

November 7, 2018 ~ adriancolyer ~ 5 Comments

Iyer et al., OSDI'18

I have a real soft spot for approximate computations. In general, we waste a lot of resources on overly accurate analyses when understanding the trends and/or the neighbourhood is quite good enough (do you really need to know it’s 78.763895% vs …

Sharding the shards: managing datastore locality at scale with Akkio

November 5, 2018 ~ adriancolyer ~ 3 Comments

Annamalai et al., OSDI'18

In Harry Potter, the Accio Summoning Charm summons an object to the caster of the spell, sometimes transporting it over a significant distance. In Facebook, Akkio summons data to a datacenter with the goal of improving data access locality for clients. …

The FuzzyLog: a partially ordered shared log

November 2, 2018 ~ adriancolyer ~ 1 Comment

Lockerman et al., OSDI'18

If you want to build a distributed system, then having a distributed shared log as an abstraction to build upon (one that gives you an agreed-upon total order for all events) is such a big help that it’s practically cheating! (See …

Moment-based quantile sketches for efficient high cardinality aggregation queries

October 31, 2018 ~ adriancolyer ~ 1 Comment

Gan et al., VLDB'18

Today we’re temporarily pausing our tour through some of the OSDI’18 papers in order to look at a great sketch-based data structure for quantile queries over high-cardinality aggregates. That’s a bit of a mouthful, so let’s jump straight into an example of …

Noria: dynamic, partially-stateful data-flow for high-performance web applications

October 29, 2018 ~ adriancolyer ~ 3 Comments

Gjengset, Schwarzkopf et al., OSDI'18

I have way more margin notes for this paper than I typically do, and that’s a reflection of my struggle to figure out what kind of thing we’re dealing with here. Noria doesn’t want to fit neatly into any existing box! We’ve …

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

October 26, 2018 ~ adriancolyer ~ 11 Comments

Berger et al., OSDI'18

It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity. We have a certain finite cache capacity, and …

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

October 24, 2018 ~ adriancolyer ~ 4 Comments

Veeraraghavan et al., OSDI'18

Here’s a really valuable paper detailing four-plus years of experience dealing with datacenter outages at Facebook. Maelstrom is the system Facebook use in production to mitigate and recover from datacenter-level disasters. The high-level idea is simple: drain traffic …

LegoOS: a disseminated, distributed OS for hardware resource disaggregation

October 22, 2018 ~ adriancolyer ~ 6 Comments

Shan et al., OSDI'18

One of the interesting trends in hardware is the proliferation and importance of dedicated accelerators as general-purpose CPUs stopped benefitting from Moore’s law. At the same time we’ve seen networking getting faster and faster, causing us to rethink some of the …