Discretized Streams: Fault Tolerant Stream Computing at Scale – Zaharia et al. 2013
This is the Spark Streaming paper, and it sets out very clearly the problem that Discretized Streams were designed to solve: dealing effectively with faults and stragglers when processing streams in large clusters. This is hard to do in the traditional continuous operator model in which long-running operators receive and process each update in turn, maintaining their own internal state along the way.
Specifically, given the continuous operator model, systems perform recovery through two approaches: replication, where there are two copies of each node, or upstream backup, where nodes buffer sent messages and replay them to a new copy of a failed node. Neither approach is attractive in large clusters: replication costs 2× the hardware, while upstream backup takes a long time to recover, as the whole system must wait for a new node to serially rebuild the failed node’s state by rerunning data through an operator. In addition, neither approach handles stragglers: in upstream backup, a straggler must be treated as a failure, incurring a costly recovery step, while replicated systems use synchronization protocols like Flux to coordinate replicas, so a straggler will slow down both replicas.
Discretized Streams (D-Streams) work differently: instead of managing long-lived operators they structure a streaming computation as a series of stateless deterministic batch computations. Unlike traditional batch systems that write intermediate state to disk though, D-Streams keep data in memory using RDDs:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
It’s worth stressing that D-Streams / Spark Streaming targets sub-second latency, but not latency in the few hundred milliseconds or below range. The key design is to provide both sub-second latency and sub-second recovery from faults and stragglers. Examples of use cases well matched to these requirements include site activity statistics, cluster monitoring, and tweet spam detection. In evaluation the authors showed per-node throughput comparable to commercial streaming databases, combined with linear scalability out to 100 nodes processing over 60M records/second.
When a node fails, it recomputes the RDD partitions that were on it by re-running the tasks that built them from the original input data stored reliably in the cluster. The system also periodically checkpoints state RDDs (e.g., by asynchronously replicating every tenth RDD) to prevent infinite recomputation, but this does not need to happen for all data, because recovery is often fast: the lost partitions can be recomputed in parallel on separate nodes. In a similar way, if a node straggles, we can speculatively execute copies of its tasks on other nodes, because they will produce the same result. We note that the parallelism usable for recovery in D-Streams is higher than in upstream backup, even if one ran multiple operators per node. D-Streams expose parallelism across both partitions of an operator and time.
The micro-batching in D-Streams partitions input records based on their arrival time. If instead you want to group records based on an external time, you can either add some delay (slack period) to wait for them al to come in, or handle late arrivals explicity in your application code. On top of the core model, the authors show it is possible to build computations that span several intervals including windowing, aggregation, and state tracking.
The D-Streams model is implemented in Spark Streaming, which led to a number of improvements in the core of Spark itself as well. These optimizations also improved Spark’s performance in the batch case by 2x.
Finally, because D-Streams use the same execution model as batch platforms, they compose seamlessly with batch and interactive queries. We used this capability in Spark Streaming to let users combine these models in powerful ways, and showed how it can add rich features to two real applications.
(Those applications were video distribution monitoring, and crowdsourced traffic estimation).