An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127 The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem. It’s an … Continue reading An overview of end-to-end entity resolution for big data
Tag: Algorithms and data structures
Handy algorithms and data structures – e.g., for approximations
Efficient lock-free durable sets
Efficient lock-free durable sets Zuriel et al., OOPSLA'19 Given non-volatile memory (NVRAM), the naive hope for persistence is that it would be a no-op: what happens in memory, stays in memory. Unfortunately, a very similar set of issues to those concerned with flushing volatile memory to persistent disk exist here too, just at another level. … Continue reading Efficient lock-free durable sets
Mergeable replicated data types – Part II
Mergeable replicated data types - part II Kaki et al., OOPLSA '19 Last time out we saw how Mergeable Replicated Data Types (MRDTs) use a bijection between the natural domain of a data type and relational sets to define merge semantics between two concurrently modified versions given their lowest common ancestor (LCA). Today we’re picking … Continue reading Mergeable replicated data types – Part II
Mergeable replicated data types – Part I
Mergeable replicated data types Kaki et al., OOPSLA'19 This paper was published at OOPSLA, but perhaps it’s amongst the distributed systems community that I expect there to be the greatest interest. Mergeable Replicated Data Types (MRDTs) are in the same spirit as CRDTs but with the very interesting property that they compose. Furthermore, a principled … Continue reading Mergeable replicated data types – Part I
DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees
DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees Masson et al., VLDB'19 Datadog handles a ton of metrics - some customers have endpoints generating over 10M points per second! For response times (latencies) reporting a simple metric such as ‘average’ is next to useless. Instead we want to understand what’s happening at different … Continue reading DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees
Meta-learning neural Bloom filters
Meta-learning neural bloom filters Rae et al., ICML'19 Bloom filters are wonderful things, enabling us to quickly ask whether a given set could possibly contain a certain value. They produce this answer while using minimal space and offering O(1) inserts and lookups. It’s no wonder Bloom filters and their derivatives (the family of approximate set … Continue reading Meta-learning neural Bloom filters
Designing far memory data structures: think outside the box
Designing far memory data structures: think outside the box Aguilera et al., HotOS'19 Last time out we looked at some of the trade-offs between RInKs and LInKs, and the advantages of local in-memory data structures. There’s another emerging option that we didn’t talk about there: the use of far-memory, memory attached to the network that … Continue reading Designing far memory data structures: think outside the box
The data calculator: data structure design and cost synthesis from first principles and learned cost models
The Data Calculator: data structure design and cost synthesis from first principles and learned cost models Idreos et al., SIGMOD'18 This paper preceded the work on data continuums that we looked at last time, and takes a more general look at interactive and semi-automated design of data structures. A data structure here is defined as … Continue reading The data calculator: data structure design and cost synthesis from first principles and learned cost models
Design continuums and the path toward self-designing key-value stores that know and learn
Design continuums and the path toward self-designing key-value stores that know and learn Idreos et al., CIDR'19 We’ve seen systems that help to select the best data structure from a pre-defined set of choices (e.g. ‘Darwinian data structure selection’), systems that synthesise data structure implementations given an abstract specification (‘Generalized data structure synthesis’), systems that … Continue reading Design continuums and the path toward self-designing key-value stores that know and learn
Moment-based quantile sketches for efficient high cardinality aggregation queries
Moment-based quantile sketches for efficient high cardinality aggregation queries Gan et al., VLDB'18 Today we’re temporarily pausing our tour through some of the OSDI’18 papers in order to look at a great sketch-based data structure for quantile queries over high-cardinality aggregates. That’s a bit of a mouthful so let’s jump straight into an example of … Continue reading Moment-based quantile sketches for efficient high cardinality aggregation queries