An overview of end-to-end entity resolution for big data

December 14, 2020 / December 10, 2020 ~ Adrian Colyer

An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127 The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem. ... Continue Reading

Efficient lock-free durable sets

December 2, 2019 / May 25, 2020 ~ Adrian Colyer

Efficient lock-free durable sets Zuriel et al., OOPSLA'19 Given non-volatile memory (NVRAM), the naive hope for persistence is that it would be a no-op: what happens in memory, stays in memory. Unfortunately, issues very similar to those involved in flushing volatile memory to persistent disk exist here too, just at another level. ... Continue Reading

Mergeable replicated data types – Part II

November 27, 2019 / May 25, 2020 ~ Adrian Colyer

Mergeable replicated data types - part II Kaki et al., OOPSLA'19 Last time out we saw how Mergeable Replicated Data Types (MRDTs) use a bijection between the natural domain of a data type and relational sets to define merge semantics between two concurrently modified versions given their lowest common ancestor (LCA). Today we’re picking ... Continue Reading
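
To make the idea of merging two versions against their lowest common ancestor concrete, here is a minimal sketch of a three-way merge for plain sets. It is an illustration under assumptions, not the authors' implementation: the function name `merge_sets` is hypothetical, and it only covers the simplest case of a set of elements rather than the relational encodings of richer data types.

```python
def merge_sets(lca: set, a: set, b: set) -> set:
    """Three-way merge of two set versions `a` and `b` given their ancestor `lca`.

    Illustrative rule (an assumption for this sketch):
      - keep elements that the ancestor and both versions still agree on,
      - add anything that either version introduced since the ancestor.
    An element removed by either version is therefore dropped.
    """
    unchanged = lca & a & b   # survived in both branches
    added_in_a = a - lca      # new in branch a
    added_in_b = b - lca      # new in branch b
    return unchanged | added_in_a | added_in_b


if __name__ == "__main__":
    ancestor = {1, 2, 3}
    version_a = {1, 2, 4}     # removed 3, added 4
    version_b = {2, 3, 5}     # removed 1, added 5
    print(merge_sets(ancestor, version_a, version_b))  # {2, 4, 5}
```

Both removals and both additions survive the merge, which is the behaviour we would want from two replicas that edited the same set concurrently.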

Mergeable replicated data types – Part I

November 25, 2019 / May 25, 2020 ~ Adrian Colyer

Mergeable replicated data types Kaki et al., OOPSLA'19 This paper was published at OOPSLA, but it’s perhaps in the distributed systems community that I expect the greatest interest. Mergeable Replicated Data Types (MRDTs) are in the same spirit as CRDTs but with the very interesting property that they compose. Furthermore, a principled ... Continue Reading

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees

September 6, 2019 / May 25, 2020 ~ Adrian Colyer

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees Masson et al., VLDB'19 Datadog handles a ton of metrics - some customers have endpoints generating over 10M points per second! For response times (latencies) reporting a simple metric such as ‘average’ is next to useless. Instead we want to understand what’s happening at different ... Continue Reading
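
A small worked example shows why an average tells us so little about latencies. The numbers below are made up for illustration, and the `percentile` helper is a hypothetical nearest-rank implementation over the raw values, not DDSketch itself.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times: 95 fast requests and 5 very slow ones.
latencies_ms = [10] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)

if __name__ == "__main__":
    print(mean_ms)                         # 109.5
    print(percentile(latencies_ms, 0.50))  # 10
    print(percentile(latencies_ms, 0.99))  # 2000
```

No request actually took anything like 109.5 ms: the typical request took 10 ms and the slowest few took 2 seconds, which is what the median and the 99th percentile report.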

Meta-learning neural Bloom filters

July 19, 2019 / May 25, 2020 ~ Adrian Colyer

Meta-learning neural Bloom filters Rae et al., ICML'19 Bloom filters are wonderful things, enabling us to quickly ask whether a given set could possibly contain a certain value. They produce this answer while using minimal space and offering O(1) inserts and lookups. It’s no wonder Bloom filters and their derivatives (the family of approximate set ... Continue Reading
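
As a reminder of how a classic (non-neural) Bloom filter answers "could this set possibly contain the value?", here is a minimal sketch. The class name, the default sizes, and the use of SHA-256 with double hashing are all assumptions chosen for illustration; this is not the learned structure from the paper.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: each item sets/checks k positions in an m-bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("alice")
    print(bf.might_contain("alice"))  # True
```

Inserts and lookups each touch a fixed number of bits regardless of how many items are stored, which is where the O(1) cost comes from. The price is that `might_contain` can return a false positive, but never a false negative.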

Designing far memory data structures: think outside the box

June 26, 2019 / May 25, 2020 ~ Adrian Colyer

Designing far memory data structures: think outside the box Aguilera et al., HotOS'19 Last time out we looked at some of the trade-offs between RInKs and LInKs, and the advantages of local in-memory data structures. There’s another emerging option that we didn’t talk about there: the use of far-memory, memory attached to the network that ... Continue Reading

The data calculator: data structure design and cost synthesis from first principles and learned cost models

January 23, 2019 / May 25, 2020 ~ Adrian Colyer

The Data Calculator: data structure design and cost synthesis from first principles and learned cost models Idreos et al., SIGMOD'18 This paper preceded the work on design continuums that we looked at last time, and takes a more general look at interactive and semi-automated design of data structures. A data structure here is defined as ... Continue Reading

Design continuums and the path toward self-designing key-value stores that know and learn

January 21, 2019 / May 25, 2020 ~ Adrian Colyer

Design continuums and the path toward self-designing key-value stores that know and learn Idreos et al., CIDR'19 We’ve seen systems that help to select the best data structure from a pre-defined set of choices (e.g. ‘Darwinian data structure selection’), systems that synthesise data structure implementations given an abstract specification (‘Generalized data structure synthesis’), systems that ... Continue Reading

Moment-based quantile sketches for efficient high cardinality aggregation queries

October 31, 2018 / May 25, 2020 ~ Adrian Colyer

Moment-based quantile sketches for efficient high cardinality aggregation queries Gan et al., VLDB'18 Today we’re temporarily pausing our tour through some of the OSDI’18 papers in order to look at a great sketch-based data structure for quantile queries over high-cardinality aggregates. That’s a bit of a mouthful so let’s jump straight into an example of ... Continue Reading

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Algorithms and data structures