Explaining outputs in modern data analytics

February 1, 2017 (updated July 31, 2017) ~ adriancolyer ~ 9 Comments

Explaining outputs in modern data analytics, Chothia et al., ETH Zurich Technical Report, 2016. Yesterday we touched on some of the difficulties of explanation in the context of machine learning, and last week we looked at some of the extensions to ExSPAN to track network provenance. Lest you be under any remaining misapprehension that explanation … Continue reading Explaining outputs in modern data analytics

European Union regulations on algorithmic decision making and a “right to explanation”

January 31, 2017 (updated July 31, 2017) ~ adriancolyer ~ 15 Comments

European Union regulations on algorithmic decision-making and a “right to explanation”, Goodman & Flaxman, 2016. In just over a year, the General Data Protection Regulation (GDPR) becomes law in European member states. This paper focuses on just one particular aspect of the new law, Article 22, as it relates to profiling, non-discrimination, and the right … Continue reading European Union regulations on algorithmic decision making and a “right to explanation”

How good are query optimizers, really?

January 30, 2017 (updated July 31, 2017) ~ adriancolyer ~ 2 Comments

How good are query optimizers, really? Leis et al., VLDB 2015. Last week we looked at cardinality estimation using index-based sampling, evaluated using the Join Order Benchmark. Today's choice is the paper that introduces the Join Order Benchmark (JOB) itself. It's a great evaluation paper, and along the way we'll learn a lot about mainstream … Continue reading How good are query optimizers, really?

Cardinality estimation done right: index-based join sampling

January 27, 2017 (updated July 31, 2017) ~ adriancolyer ~ 2 Comments

Cardinality estimation done right: Index-based join sampling, Leis et al., CIDR 2017. Let's finish up our brief look at CIDR 2017 with something closer to the core of database systems research: query optimisation. For good background on this topic a great place to start is Selinger's 1979 … Continue reading Cardinality estimation done right: index-based join sampling

The truth, the whole truth, and nothing but the truth: a pragmatic guide to assessing empirical evaluations

January 26, 2017 (updated July 31, 2017) ~ adriancolyer ~ 6 Comments

The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations, Blackburn et al., ACM Transactions on Programming Languages and Systems, 2016. Yesterday we looked at some of the ways analysts may be fooled into thinking they've found a statistically significant result when in fact they haven't. Today's paper … Continue reading The truth, the whole truth, and nothing but the truth: a pragmatic guide to assessing empirical evaluations

Toward sustainable insights, or why polygamy is bad for you

January 25, 2017 (updated July 31, 2017) ~ adriancolyer ~ 18 Comments

Toward sustainable insights, or why polygamy is bad for you, Binning et al., CIDR 2017. Buckle up! Today we're going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: Statistics Done Wrong, by Alex Reinhart; p-values on Wikipedia; and Misunderstandings of p-values, also on Wikipedia. For my own … Continue reading Toward sustainable insights, or why polygamy is bad for you

Data provenance at internet scale: architecture, experiences, and the road ahead

January 24, 2017 (updated July 31, 2017) ~ adriancolyer ~ 5 Comments

Data provenance at Internet scale: Architecture, experiences, and the road ahead, Chen et al., CIDR 2017. Provenance within the context of a single database has been reasonably well studied. In this paper, though, Chen et al. explore what happens when you try to trace provenance in a distributed setting and at larger scale. The context … Continue reading Data provenance at internet scale: architecture, experiences, and the road ahead

Ground: A data context service

January 23, 2017 (updated July 31, 2017) ~ adriancolyer ~ 6 Comments

Ground: A Data Context Service, Hellerstein et al., CIDR 2017. An unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage. Put more bluntly, many organisations have only a fuzzy picture … Continue reading Ground: A data context service

Dependency-driven analytics: a compass for uncharted data oceans

January 20, 2017 (updated July 31, 2017) ~ adriancolyer ~ 4 Comments

Dependency-driven analytics: a compass for uncharted data oceans, Mavlyutov et al., CIDR 2017. Like yesterday's paper, today's paper considers what to do when you simply have too much data to be able to process it all. Forget data lakes, we're in data ocean territory now. This is a problem Microsoft faced with their large clusters … Continue reading Dependency-driven analytics: a compass for uncharted data oceans

Prioritizing attention in fast data: principles and promise

January 19, 2017 (updated July 31, 2017) ~ adriancolyer ~ 2 Comments

Prioritizing attention in fast data: principles and promise, Bailis et al., CIDR 2017. Today it's two for the price of one as we get a life lesson in addition to a wonderfully thought-provoking piece of research. I'm sure you'd all agree that we're drowning in information, so much content being pumped out all of … Continue reading Prioritizing attention in fast data: principles and promise