Last week was CIDR 2017, the biennial Conference on Innovative Data Systems Research. CIDR encourages authors to take a whole system perspective and especially values “innovation, experience-based insight, and vision.” That’s a very good match with the attributes of papers I like to cover on The Morning Paper. So what innovation, insight, and vision does CIDR have to offer us to see us through until the next edition in 2019?
In total there were 32 papers presented. The conference organisers themselves divided the agenda up into sessions on stream processing, data integration, engines (2 sessions), polystores, tools, applications, concurrency control, and analytics. But that breakdown doesn’t really tell us much. I didn’t attend CIDR, so I’ve had to pull together what I see as the main stories from the event through reading the papers. If you’re reading this and you did attend, I’m sure readers of The Morning Paper would love to hear a short summary of what the conference was like on the ground too.
The main unifying theme that jumped out at me was heterogeneity. Heterogeneity of workload types, store types, frameworks, processors, you name it. Instead of trying to develop ever more specialised approaches in different niches, many of the papers address how to move forward in a world which accepts heterogeneity is inevitable, indeed even embraces it. Differing requirements point to differing solutions, but how are we to make sense of the resulting menagerie?
First and foremost, we have heterogeneous processing requirements. OLTP and OLAP are often no longer separate islands, but different parts of the same system (HTAP). The pressure on this integration only increases as we move towards systems that give results in ‘user time’ (soft real-time). Modern systems don’t just combine OLTP and OLAP though, as streaming data also becomes an important first-class citizen. We’ll look at ‘SnappyData’, which is representative of the idea.
We also have a broad set of data processing libraries and frameworks (NumPy, Pandas, TensorFlow, Spark, …), and given their different specialisms and rates of innovation, a future in which these all collapse into one also seems unlikely. Applications often end up using a heterogeneous mix of such frameworks, but this leads to some gross inefficiencies as data moves between them. We’ll look at ‘Weld’, which convincingly demonstrates that this doesn’t have to be the case.
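To make the inefficiency concrete, here is a minimal Python sketch. It is not taken from the Weld paper and does not use Weld’s API: the three `lib_*` functions are hypothetical stand-ins for independent libraries, each of which hands back a fully materialised result. The `fused` function shows, under those assumptions, the kind of single-pass computation that becomes possible once something underneath can see across the library boundaries.

```python
# Illustrative only: hypothetical stand-ins for three separate libraries.
# None of these names come from Weld or from any real framework.

def lib_a_scale(values, factor):
    """'Library A': scales every value, returning a new list."""
    return [v * factor for v in values]


def lib_b_keep_above(values, threshold):
    """'Library B': filters values, returning another new list."""
    return [v for v in values if v > threshold]


def lib_c_total(values):
    """'Library C': reduces a list to a single number."""
    return sum(values)


def composed(values, factor, threshold):
    """Typical application code: each call materialises an intermediate."""
    scaled = lib_a_scale(values, factor)        # first full pass + copy
    kept = lib_b_keep_above(scaled, threshold)  # second full pass + copy
    return lib_c_total(kept)                    # third full pass


def fused(values, factor, threshold):
    """The same computation in one pass, with no intermediate lists."""
    total = 0
    for v in values:
        s = v * factor
        if s > threshold:
            total += s
    return total


data = list(range(10))
assert composed(data, 2, 5) == fused(data, 2, 5)
```

Both versions compute the same answer; the difference is that `composed` walks the data three times and allocates two throwaway lists, which is the cost that grows painful when the ‘lists’ are large arrays or distributed datasets.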
Polystores such as ‘BigDAWG’ recognise that data is fundamentally heterogeneous. If we match datastores to the characteristics of the data they store, then we’ll end up with a mix of stores. We’d still like to expose that mix through a single uniform interface on top though.
Thus we can see blends of three different approaches to dealing with heterogeneity: put a uniform interface over the top, build multi-mode engines in the middle, and put a common IR (Weld) underneath. ‘The case for heterogeneous HTAP’ even looks at how to take advantage of heterogeneous hardware (mixed CPUs and GP-GPUs).
‘Adaptive Schema Databases’, ‘Adaptive Concurrency Control’, and ‘Self-Driving Database Systems’ all describe varying ways of adapting to changes in workload and data over time. The Self-Driving Database Systems paper gets bonus points for integrating deep learning to do it!
Discovery, Governance and Provenance
Combine huge volumes of data with a plethora of heterogeneous stores within an enterprise and you have a fundamental problem of simply knowing what data you have (cf. Google’s GOODs system that we looked at last year). How that data gets discovered, managed and tracked, where it is used, and where it came from is another important challenge (one given considerable teeth by the latest rounds of regulation). ‘The Data Civilizer System’, ‘Establishing Common Ground’, and ‘Data Provenance at Internet Scale’ all address aspects of this problem.
Volume and Velocity
Finally, the sheer volume and velocity of data mean we need smarter ways to extract value from it. ‘Dependency-driven analytics’ looks at how Microsoft solved this problem for the petabytes of system log data their clusters produce, and ‘Prioritizing attention’ shows us that when data arrives fast enough, we need to be careful in how we manage the scarce resources of machine and human attention.