ASAP: fast, approximate graph pattern mining at scale

November 7, 2018 / November 8, 2018 ~ adriancolyer ~ 5 Comments

ASAP: fast, approximate graph pattern mining at scale, Iyer et al., OSDI'18. I have a real soft spot for approximate computations. In general, we waste a lot of resources on overly accurate analyses when understanding the trends and/or the neighbourhood is quite good enough (do you really need to know it’s 78.763895% vs …

Filter before you parse: faster analytics on raw data with Sparser

August 20, 2018 / August 18, 2018 ~ adriancolyer ~ 3 Comments

Filter before you parse: faster analytics on raw data with Sparser, Palkar et al., VLDB'18. We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that, with a fresh look at the problem, the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc. The …

Popular is cheaper: curtailing memory costs in interactive analytics engines

June 15, 2018 / June 9, 2018 ~ adriancolyer ~ 2 Comments

Popular is cheaper: curtailing memory costs in interactive analytics engines, Ghosh et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) We’re sticking with the optimisation of data analytics today, but at the other end of …

Toward sustainable insights, or why polygamy is bad for you

January 25, 2017 / July 31, 2017 ~ adriancolyer ~ 18 Comments

Toward sustainable insights, or why polygamy is bad for you, Binning et al., CIDR 2017. Buckle up! Today we're going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: Statistics Done Wrong, by Alex Reinhart; p-values on Wikipedia; and Misunderstandings of p-values, also on Wikipedia. For my own …

Dependency-driven analytics: a compass for uncharted data oceans

January 20, 2017 / July 31, 2017 ~ adriancolyer ~ 4 Comments

Dependency-driven analytics: a compass for uncharted data oceans, Mavlyutov et al., CIDR 2017. Like yesterday's paper, today's paper considers what to do when you simply have too much data to be able to process it all. Forget data lakes, we're in data ocean territory now. This is a problem Microsoft faced with their large clusters …

SnappyData: A unified cluster for streaming, transactions, and interactive analytics

January 18, 2017 / July 31, 2017 ~ adriancolyer ~ 4 Comments

SnappyData: A unified cluster for streaming, transactions, and interactive analytics, Mozafari et al., CIDR 2017. Update: fixed broken paper link, thanks Zteve. On Monday we looked at Weld, which showed how to combine disparate data processing and analytic frameworks using a common underlying IR. Yesterday we looked at Peloton, which adapts to mixed OLTP and …

Weld: A common runtime for high performance data analytics

January 16, 2017 / July 31, 2017 ~ adriancolyer ~ 8 Comments

Weld: A common runtime for high performance data analytics, Palkar et al., CIDR 2017. This is the first in a series of posts looking at papers from CIDR 2017. See yesterday's post for my conference overview. We have a proliferation of data and analytics libraries and frameworks - for example, Spark, TensorFlow, MXNet, NumPy, Pandas, …

Shasta: Interactive reporting at scale

January 10, 2017 / July 31, 2017 ~ adriancolyer ~ 2 Comments

Shasta: Interactive Reporting At Scale, Manoharan et al., SIGMOD 2016. You have vast database schemas with hundreds of tables, applications that need to combine OLTP and OLAP functionality, queries that may join 50 or more tables across disparate data sources, oh, and the user is waiting, so you'd better deliver the results online with low …

Searching and mining trillions of time series subsequences under Dynamic Time Warping

May 11, 2016 / July 27, 2017 ~ adriancolyer ~ 11 Comments

Searching and mining trillions of time series subsequences under dynamic time warping - Rakthanmanon et al., SIGKDD 2012. What an astonishing paper this is! By 2012, Dynamic Time Warping had been shown to be the time series similarity measure that generally performs the best for matching, but because of its computational complexity, researchers and practitioners …

Towards parameter-free data mining

May 10, 2016 / July 27, 2017 ~ adriancolyer ~ 3 Comments

Towards Parameter-Free Data Mining - Keogh et al., SIGKDD 2004. Another time series paper today from the Facebook Gorilla references. Keogh et al. describe an incredibly simple and easy-to-implement scheme that does surprisingly well with clustering, anomaly detection, and classification tasks over time series data. As per the title of the paper, it …