Making Sense of Performance in Data Analytics Frameworks – Ousterhout et al. 2015
We all know the causes of poor performance in big data analytics workloads: network I/O, disk I/O, and straggler tasks. Ousterhout et al. set out to try and quantify this, and found that what we think we know isn’t necessarily so. Yet another reminder that you really need to measure your system and find out where the bottlenecks are, and not rush ahead into premature optimisations (or worse, expensive equipment orders) in the hope they will help. Of course, that means having the instrumentation in place to capture the information you need in the first place.
These findings should not be taken as the last word on performance of analytics frameworks: our study focuses on a small set of workloads, and represents only one snapshot in time. As data-analytics frameworks evolve, we expect bottlenecks to evolve as well. As a result, the takeaway from this work should be the importance of instrumenting systems for blocked time analysis, so that researchers and practitioners alike can understand how best to focus performance improvements. Looking forward, we argue that systems should be built with performance understandability as a first-class concern.
The authors studied Spark workloads and had to add the needed instrumentation themselves. They perform ‘blocked task analysis’. This enables them to measure the time spent blocked on disk I/O and network I/O, and thus calculate the theoretical maximum improvement from eliminating all such blocking. With respect to the conventional wisdom, Ousterhout et al. find that:
- Network optimization can reduce job completion time by a median of 2% at best.
- Optimizing or eliminating disk accesses can reduce job completion time by a median of 19% at best.
- Optimizing stragglers can reduce job completion time by a median of 10% at best.
… the fact that the prevailing wisdom about performance is so incorrect on the workloads we do consider suggests that there is much more work to be done before our community can claim to understand the performance of data analytics frameworks. To facilitate performing blocked time analysis on a broader set of workloads,we have added almost all of our instrumentation to Spark and made our analysis tools publicly available.We have also published the detailed benchmark traces that we collected so that other researchers may reproduce our analysis or perform their own.
If we’re not normally network bound or disk bound, then tasks must be CPU bound. Why should that be? Too much time spent encoding and decoding compressed and serialized data structures!
We were surprised at the results given the oft-quoted mantra that I/O is often the bottleneck, and also the fact that fundamentally, the computation done in data analytics job is often very simple. For example, queries 1a, 1b, and 1c in the big data benchmark select a filtered subset of a table.Given the simplicity of that computation, we would not have expected the query to be CPU bound. One reason for this result is that today’s frameworks often store compressed data (in increasingly sophisticated formats, e.g. Parquet ), trading CPU time for I/O time. We found that if we instead ran queries on uncompressed data, most queries became I/O bound. A second reason that CPU time is large is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object.
Ousterhout at al. studied the BDBench (big data benchmark), the TPC-DS (decision support) benchmark, and a production workload from a Databricks Spark cluster.
The big data benchmark (BDBench) was developed to evaluate the differences between analytics frameworks and was derived from a benchmark developed by Pavlo et al. The input dataset consists of HTML documents from the Common Crawl document corpus combined with SQL summary tables generated using Intel’s Hadoop benchmark tool. The benchmark consists of four queries including two exploratory SQL queries, one join query, and one page-rank-like query… The TPC-DS benchmark was designed to model multiple users running a variety of decision-support queries including reporting, interactive OLAP, and data mining queries.
The results were also cross-checked as far as possible with aggregate statistics and traces published from Facebook’s Hadoop cluster, Microsoft’s Cosmos cluster, and Google’s MapReduce cluster.
A 20-machine cluster was used for the tests, experiments conducted with larger cluster sizes suggest that the results will equally apply there too:
The results in this paper have focused on one cluster size for each of the benchmarks run. To understand how our results might change in a much larger cluster, we ran the TPC-DS on-disk workload on a cluster with three times as many machines and using three times more input data. Figure 14 compares our key results on the larger cluster to the results from the 20-machine cluster described in the remainder of this paper, and illustrates that the potential improvements from eliminating disk I/O, eliminating network I/O, and eliminating stragglers on the larger cluster is comparable to the corresponding improvements on the smaller cluster.
… none of the workloads we studied could improve by a median of more than 2% as a result of optimizing network performance. We did not use especially high bandwidth machines in getting this result: the m2.4xlarge instances we used have a 1Gbps network link.
Sanity checking against the workloads from Google, Facebook, and Microsoft showed that production workloads are not expected to be dramatically different. The network is less important than disk simply because the amount of data transferred over the network is much less than that transferred to/from disk.
We found this to be true in a cluster with 1Gbps network links; in clusters with 10Gbps or 100Gbps networks, network performance will be even less important.
What then of studies that show improvements to be made from investment in networking? The authors suggest some of those advantages may have come from choosing workloads that are not representative, or from using estimations of the networking costs that led to ‘misleading conclusions about the importance of the network.’
Last week we looked at ‘Cross-Layer Scheduling in Cloud Systems‘ which showed throughput improvements on the order of 30% on a Facebook workload by including network considerations and optimizations in scheduling decisions. Always get measurements from your own workloads to guide optimisations!
Previous work has suggested that reading input data from disk can be a bottleneck in analytics frameworks; for example, Spark describes speedups of 40× for generating a data analytics report as a result of storing input and output data in memory using Spark, compared to storing data on-disk and using Hadoop for computation. PACMan reported reducing average job completion times by 53% as a result of caching data in-memory.
Ousterhout et al. found a median improvement from eliminating disk I/O blocking of 19%, and 2-5%
for in-memory workloads (which still use the disk for shuffling data).
To shed more light on this measurement,we compared resource utilization while tasks were running, and found that CPU utilization is typically close to 100% whereas median disk utilization is at most 25%. One reason for the relatively high use of CPU by the analytics workloads we studied is deserialization and compression; the shift towards more sophisticated serialization and compression formats has decreased the I/O and increased the CPU requirements of analytics frameworks. Because of high CPU times, optimizing hardware performance by using more disks, using flash storage, or storing serialized in-memory data will yield only modest improvements to job completion time; caching deserialized data has the potential for much larger improvements due to eliminating deserialization time.
It is suggested that previous studies gained most of their benefits from this elimination of serialization and compression costs, rather than from eliminating disk I/O blocking.
Martin Thompson echoed similar sentiments in his work on Simple Binary Encoding:
Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I’ve seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be.
As well as studying the potential performance benefits of eliminating stragglers, Ousterhout et al. also study the causes of stragglers.
… straggler causes vary across workloads and even within queries for a particular workload. However, common patterns are that garbage collection can cause most of the stragglers for some queries, and many stragglers can be attributed to long times spent reading to or writing from disk (this is not inconsistent with our earlier results showing a 19% median improvement from eliminating disk: the fact that some straggler tasks are caused by long times spent blocked on disk does not necessarily imply that overall, eliminating time blocked on disk would yield a large improvement in job completion time). Another takeaway is that many stragglers are caused by inherent factors like output size and running before code has been jitted, so cannot be alleviated by straggler mitigation techniques.
The authors were able to reduce occurrences of stragglers by using more recent versions of the Spark framework, and switching EC2 instance file systems from ext3 to ext4. The potential improvement from eliminating stragglers altogether is determined by replacing the progress rate of every task slower than the median with the median progress rate. Given this definition:
…the median improvement from eliminating stragglers is 5-10% for the big data benchmark and TPC-DS workloads, and lower for the production workloads,which had fewer stragglers.
Other studies have shown greater improvements from eliminating stragglers:
Stragglers would have had a much larger impact for the workloads we ran if all of the tasks in each job had run as a single wave; in this case, the median improvement from eliminating stragglers would have been 23-41%.