Shark: SQL and Rich Analytics at Scale

Shark: SQL and Rich Analytics at Scale, Xin et al., 2013.

Given the Databricks Spark result reported last week, it seems timely to look at a system built on top of Spark, Shark, that ultimately informed the Spark SQL project.

[Shark] leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query.

Shark shows up to a two order-of-magnitude performance improvement over Hadoop for machine learning workloads and over Hive for SQL queries. Compared to traditional MPP systems, which employ coarse-grained recovery mechanisms, Shark provides fine-grained recovery based on Spark’s RDDs. Why does this matter? Because of three challenges facing data analytics:

First, data volumes are expanding dramatically, creating the need to scale out across clusters of hundreds of commodity machines. Second, such high scale increases the incidence of faults and stragglers (slow tasks), complicating parallel database design. Third, the complexity of data analysis has also grown: modern data analysis employs sophisticated statistical methods, such as machine learning algorithms, that go well beyond the roll-up and drill-down capabilities of traditional enterprise data warehouse systems.

Shark runs mostly in memory:

Shark builds on a recently-proposed distributed shared memory abstraction called Resilient Distributed Datasets (RDDs) [33] to perform most computations in memory while offering fine-grained fault tolerance

An analysis of existing workloads showed that in-memory processing can be very effective: 95% of queries could be processed with just 64GB of memory per node, even though the total data under management is in excess of 100PB!
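To make the “unified engine” idea concrete, here is a minimal sketch (not Shark’s actual API) using the plain Spark RDD interface in Scala: the same cached, in-memory dataset serves both a SQL-style aggregation and an iterative scan of the kind a machine learning algorithm would perform. The file path, schema, and thresholds are hypothetical.

```scala
import org.apache.spark.SparkContext

object UnifiedAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "unified-analytics-sketch")

    // Hypothetical page-visit log: userId \t pageId \t latencyMs
    val visits = sc.textFile("hdfs:///logs/visits.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1), f(2).toDouble))
      .cache() // keep the working set in memory across queries, as Shark does for hot tables

    // "SQL-style" aggregation expressed as RDD operators:
    //   SELECT pageId, AVG(latencyMs) FROM visits GROUP BY pageId
    val avgLatency = visits
      .map { case (_, page, lat) => (page, (lat, 1)) }
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
    avgLatency.take(10).foreach(println)

    // Iterative pass over the *same* cached dataset -- the access pattern of
    // many ML algorithms; no export to a separate analytics system is needed.
    var threshold = 1000.0
    for (_ <- 1 to 5) {
      val slow = visits.filter(_._3 > threshold).count()
      println(s"visits slower than $threshold ms: $slow")
      threshold /= 2
    }

    sc.stop()
  }
}
```

Because the RDD is cached once and re-used, both the SQL query and the iterative loop scan memory rather than re-reading from disk, and lost partitions can be recomputed from lineage rather than by restarting the whole job.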

To support SQL queries effectively, the Spark engine is extended with columnar storage and compression, and with the ability to re-optimize queries while they are running. Several other properties of the Spark engine, such as control over data partitioning, are also exploited.
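As a rough illustration of why columnar in-memory storage helps (a sketch of the idea, not Shark’s implementation): rather than caching each row as a JVM object, a partition’s records are transposed into per-column arrays, which are more compact, cheaper to scan when a query touches only a few columns, and amenable to lightweight compression such as run-length encoding. All names below are made up for illustration.

```scala
// Row-oriented view: one object (plus boxed fields) per record.
case class Visit(userId: String, pageId: String, latencyMs: Double)

val rows: Seq[Visit] = Seq(
  Visit("u1", "home", 12.0),
  Visit("u2", "home", 15.0),
  Visit("u3", "cart", 40.0)
)

// Column-oriented view: one array per column for the whole partition.
case class VisitColumns(
  userIds: Array[String],
  pageIds: Array[String],
  latencies: Array[Double]
)

def toColumns(rs: Seq[Visit]): VisitColumns =
  VisitColumns(
    rs.map(_.userId).toArray,
    rs.map(_.pageId).toArray,
    rs.map(_.latencyMs).toArray
  )

// Simple run-length encoding, effective on sorted or low-cardinality columns
// (one of the lightweight compression schemes a columnar store can apply).
def runLengthEncode[A](col: Array[A]): List[(A, Int)] = {
  val out = scala.collection.mutable.ListBuffer.empty[(A, Int)]
  for (x <- col) {
    out.lastOption match {
      case Some((v, n)) if v == x => out(out.length - 1) = (v, n + 1)
      case _                      => out += ((x, 1))
    }
  }
  out.toList
}

// e.g. runLengthEncode(toColumns(rows).pageIds) == List(("home", 2), ("cart", 1))
```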

After publication of this paper, work on Shark was suspended earlier this year in favour of applying the lessons learned to the development of the Spark SQL project. See this Databricks blog post for more details.