The truth, the whole truth, and nothing but the truth: a pragmatic guide to assessing empirical evaluations

January 26, 2017 (updated July 31, 2017) ~ adriancolyer ~ 6 Comments

The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations, Blackburn et al., ACM Transactions on Programming Languages and Systems, 2016. Yesterday we looked at some of the ways analysts may be fooled into thinking they've found a statistically significant result when in fact they haven't. Today's paper …

Toward sustainable insights, or why polygamy is bad for you

January 25, 2017 (updated July 31, 2017) ~ adriancolyer ~ 18 Comments

Toward sustainable insights, or why polygamy is bad for you, Binning et al., CIDR 2017. Buckle up! Today we're going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: Statistics Done Wrong, by Alex Reinhart; p-values on Wikipedia; and Misunderstandings of p-values, also on Wikipedia. For my own …

Data provenance at internet scale: architecture, experiences, and the road ahead

January 24, 2017 (updated July 31, 2017) ~ adriancolyer ~ 5 Comments

Data provenance at Internet scale: Architecture, experiences, and the road ahead, Chen et al., CIDR 2017. Provenance within the context of a single database has been reasonably well studied. In this paper, though, Chen et al. explore what happens when you try to trace provenance in a distributed setting and at larger scale. The context …

Ground: A data context service

January 23, 2017 (updated July 31, 2017) ~ adriancolyer ~ 6 Comments

Ground: A Data Context Service, Hellerstein et al., CIDR 2017. An unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage. Put more bluntly, many organisations have only a fuzzy picture …

Dependency-driven analytics: a compass for uncharted data oceans

January 20, 2017 (updated July 31, 2017) ~ adriancolyer ~ 4 Comments

Dependency-driven analytics: a compass for uncharted data oceans, Mavlyutov et al., CIDR 2017. Like yesterday's paper, today's paper considers what to do when you simply have too much data to be able to process it all. Forget data lakes, we're in data ocean territory now. This is a problem Microsoft faced with their large clusters …

Prioritizing attention in fast data: principles and promise

January 19, 2017 (updated July 31, 2017) ~ adriancolyer ~ 2 Comments

Prioritizing attention in fast data: principles and promise, Bailis et al., CIDR 2017. Today it's two for the price of one as we get a life lesson in addition to a wonderfully thought-provoking piece of research. I'm sure you'd all agree that we're drowning in information - so much content being pumped out all of …

SnappyData: A unified cluster for streaming, transactions, and interactive analytics

January 18, 2017 (updated July 31, 2017) ~ adriancolyer ~ 4 Comments

SnappyData: A unified cluster for streaming, transactions, and interactive analytics, Mozafari et al., CIDR 2017. Update: fixed broken paper link, thanks Zteve. On Monday we looked at Weld, which showed how to combine disparate data processing and analytic frameworks using a common underlying IR. Yesterday we looked at Peloton, which adapts to mixed OLTP and …

Self-driving database management systems

January 17, 2017 (updated July 31, 2017) ~ adriancolyer ~ 3 Comments

Self-driving database management systems, Pavlo et al., CIDR 2017. We've previously seen many papers looking into how distributed and database systems technologies can support machine learning workloads. Today's paper choice explores what happens when you do it the other way round - i.e., embed machine learning into a DBMS in order to continuously optimise its …

Weld: A common runtime for high performance data analytics

January 16, 2017 (updated July 31, 2017) ~ adriancolyer ~ 8 Comments

Weld: A common runtime for high performance data analytics, Palkar et al., CIDR 2017. This is the first in a series of posts looking at papers from CIDR 2017. See yesterday's post for my conference overview. We have a proliferation of data and analytics libraries and frameworks - for example, Spark, TensorFlow, MXNet, NumPy, Pandas, …

Innovation, experience-based insight and vision at CIDR ’17

January 15, 2017 ~ adriancolyer ~ 2 Comments

Last week was CIDR 2017, the biennial Conference on Innovative Data Systems Research. CIDR encourages authors to take a whole system perspective and especially values "innovation, experience-based insight, and vision." That's a very good match with the attributes of papers I like to cover on The Morning Paper. So what innovation, insight, and vision does …