The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations. Blackburn et al., ACM Transactions on Programming Languages and Systems, 2016. Yesterday we looked at some of the ways analysts may be fooled into thinking they've found a statistically significant result when in fact they haven't. Today's paper …
Author: adriancolyer
Toward sustainable insights, or why polygamy is bad for you
Toward sustainable insights, or why polygamy is bad for you. Binning et al., CIDR 2017. Buckle up! Today we're going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: Statistics Done Wrong, by Alex Reinhart; p-values on Wikipedia; and Misunderstandings of p-values, also on Wikipedia. For my own …
Data provenance at internet scale: architecture, experiences, and the road ahead
Data provenance at Internet scale: Architecture, experiences, and the road ahead. Chen et al., CIDR 2017. Provenance within the context of a single database has been reasonably well studied. In this paper, though, Chen et al. explore what happens when you try to trace provenance in a distributed setting and at larger scale. The context …
Ground: A data context service
Ground: A Data Context Service. Hellerstein et al., CIDR 2017. An unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage. Put more bluntly, many organisations have only a fuzzy picture …
Dependency-driven analytics: a compass for uncharted data oceans
Dependency-driven analytics: a compass for uncharted data oceans. Mavlyutov et al., CIDR 2017. Like yesterday's paper, today's paper considers what to do when you simply have too much data to be able to process it all. Forget data lakes, we're in data ocean territory now. This is a problem Microsoft faced with their large clusters …
Prioritizing attention in fast data: principles and promise
Prioritizing attention in fast data: principles and promise. Bailis et al., CIDR 2017. Today it's two for the price of one as we get a life lesson in addition to a wonderfully thought-provoking piece of research. I'm sure you'd all agree that we're drowning in information - so much content being pumped out all of …
SnappyData: A unified cluster for streaming, transactions, and interactive analytics
SnappyData: A unified cluster for streaming, transactions, and interactive analytics. Mozafari et al., CIDR 2017. Update: fixed broken paper link, thanks Zteve. On Monday we looked at Weld, which showed how to combine disparate data processing and analytic frameworks using a common underlying IR. Yesterday we looked at Peloton that adapts to mixed OLTP and …
Self-driving database management systems
Self-driving database management systems. Pavlo et al., CIDR 2017. We've previously seen many papers looking into how distributed and database systems technologies can support machine learning workloads. Today's paper choice explores what happens when you do it the other way round - i.e., embed machine learning into a DBMS in order to continuously optimise its …
Weld: A common runtime for high performance data analytics
Weld: A common runtime for high performance data analytics. Palkar et al., CIDR 2017. This is the first in a series of posts looking at papers from CIDR 2017. See yesterday's post for my conference overview. We have a proliferation of data and analytics libraries and frameworks - for example, Spark, TensorFlow, MXNet, NumPy, Pandas, …
Innovation, experience-based insight and vision at CIDR ’17
Last week was CIDR 2017, the biennial Conference on Innovative Data Systems Research. CIDR encourages authors to take a whole system perspective and especially values "innovation, experience-based insight, and vision." That's a very good match with the attributes of papers I like to cover on The Morning Paper. So what innovation, insight, and vision does …
