ASAP: fast, approximate graph pattern mining at scale Iyer et al., OSDI'18 I have a real soft spot for approximate computations. In general, we waste a lot of resources on overly accurate analyses when understanding the trends and / or the neighbourhood is quite good enough (do you really need to know it’s 78.763895% vs …
Tag: Analytics
Data analytics
Filter before you parse: faster analytics on raw data with Sparser
Filter before you parse: faster analytics on raw data with Sparser Palkar et al., VLDB'18 We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that with a fresh look at the problem the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc. The …
Popular is cheaper: curtailing memory costs in interactive analytics engines
Popular is cheaper: curtailing memory costs in interactive analytics engines Ghosh et al., EuroSys'18 (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) We’re sticking with the optimisation of data analytics today, but at the other end of …
Toward sustainable insights, or why polygamy is bad for you
Toward sustainable insights, or why polygamy is bad for you Binning et al., CIDR 2017 Buckle up! Today we're going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: Statistics Done Wrong, by Alex Reinhart; p-values on Wikipedia; and Misunderstandings of p-values, also on Wikipedia. For my own …
Dependency-driven analytics: a compass for uncharted data oceans
Dependency-driven analytics: a compass for uncharted data oceans Mavlyutov et al., CIDR 2017 Like yesterday's paper, today's paper considers what to do when you simply have too much data to be able to process it all. Forget data lakes, we're in data ocean territory now. This is a problem Microsoft faced with their large clusters …
SnappyData: A unified cluster for streaming, transactions, and interactive analytics
SnappyData: A unified cluster for streaming, transactions, and interactive analytics Mozafari et al., CIDR 2017 Update: fixed broken paper link, thanks Zteve. On Monday we looked at Weld, which showed how to combine disparate data processing and analytic frameworks using a common underlying IR. Yesterday we looked at Peloton, which adapts to mixed OLTP and …
Weld: A common runtime for high performance data analytics
Weld: A common runtime for high performance data analytics Palkar et al., CIDR 2017 This is the first in a series of posts looking at papers from CIDR 2017. See yesterday's post for my conference overview. We have a proliferation of data and analytics libraries and frameworks - for example, Spark, TensorFlow, MXNet, NumPy, Pandas, …
Shasta: Interactive reporting at scale
Shasta: Interactive Reporting At Scale Manoharan et al., SIGMOD 2016 You have vast database schemas with hundreds of tables, applications that need to combine OLTP and OLAP functionality, queries that may join 50 or more tables across disparate data sources, oh, and the user is waiting, so you'd better deliver the results online with low …
Searching and mining trillions of time series subsequences under Dynamic Time Warping
Searching and mining trillions of time series subsequences under dynamic time warping - Rakthanmanon et al., SIGKDD 2012 What an astonishing paper this is! By 2012, Dynamic Time Warping had been shown to be the time series similarity measure that generally performs the best for matching, but because of its computational complexity researchers and practitioners …
Towards parameter-free data mining
Towards Parameter-Free Data Mining - Keogh et al., SIGKDD 2004 Another time series paper today from the Facebook Gorilla references. Keogh et al. describe an incredibly simple and easy-to-implement scheme that does surprisingly well with clustering, anomaly detection, and classification tasks over time series data. As per the title of the paper, it …