Detecting Discontinuities in Large-Scale Systems

Detecting Discontinuities in Large-Scale Systems – Malik et al., 2014.

The 7th IEEE/ACM International Conference on Utility and Cloud Computing is coming to London in a couple of weeks' time. Many of the papers don't seem to be online yet, but here's one that is. Malik et al. tackle the problem of long-term forecasting for infrastructure provisioning, and in particular identifying discontinuities in performance data so that models are trained on the most relevant data.

One of the fundamental problems faced by analysts in preparing data for use in forecasting is the timely identification of data discontinuities. A discontinuity is an abrupt change in a time-series pattern of a performance counter that persists but does not recur. Analysts need to identify discontinuities in performance data so that they can a) remove the discontinuities from the data before building a forecast model and b) retrain an existing forecast model on the performance data from the point in time where a discontinuity occurred.

We’re also treated to a good overview of the forecasting process in general.

Practitioners and data scientists spend considerable time (e.g. up to 80%) in preparing data for their forecast algorithms!

Where does all this time go?

The accuracy of forecasting results depends on the quality of the performance data (i.e., performance counters; such as CPU utilization, bandwidth consumption, network traffic and Disk IOPS) fed to the forecasting algorithms, i.e., missing value imputation, calculating and adjusting time stamp drifts of logged performance data across hundreds of VMs, identification and removal of outliers and anomalies and in some cases, scaling and standardizing the data to remove bias among performance counters.
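
As a rough illustration of what some of that preparation can look like, here is a minimal pandas sketch; the counter name, the 5-minute grid, and the interpolation strategy are assumptions for illustration rather than details from the paper:

```python
import numpy as np
import pandas as pd

def prepare_counters(raw: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Align drifted timestamps onto a common grid and impute missing values.

    `raw` is assumed to have a DatetimeIndex and one column per performance
    counter (e.g. cpu_util, disk_iops).
    """
    aligned = raw.resample(freq).mean()                     # snap drifted timestamps to a grid
    imputed = aligned.interpolate(limit_direction="both")   # fill missing counter values
    return imputed

# Example with synthetic data
idx = pd.date_range("2014-01-01", periods=100, freq="5min")
raw = pd.DataFrame({"cpu_util": np.random.rand(100)}, index=idx)
raw.iloc[10:13] = np.nan          # simulate missing counter values
clean = prepare_counters(raw)
```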

One of the fundamental problems faced by analysts in preparing data for long-term forecast is the identification and removal of data discontinuities.

Discontinuities, like anomalies, are abrupt changes in time-series patterns. Unlike anomalies, which are temporary, discontinuities persist. They do not appear instantaneously, but over a brief period called a transition period.

Detecting a discontinuity provides analysts with a reference point to retrain their forecasting models and make necessary adjustments.
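
To make the distinction concrete, here is a minimal numpy sketch contrasting a temporary anomaly with a discontinuity; the levels, lengths, and the 10-sample transition period are entirely synthetic, illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
baseline = 40 + rng.normal(0, 1, t.size)           # steady CPU utilisation (%)

# Temporary anomaly: a short-lived spike that reverts to the baseline.
anomaly = baseline.copy()
anomaly[200:210] += 30

# Discontinuity: a persistent level shift reached over a brief transition period.
discontinuity = baseline.copy()
discontinuity[300:310] += np.linspace(0, 20, 10)   # transition period
discontinuity[310:] += 20                          # new level persists
```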

After cleaning the logs (e.g. dealing with missing or empty counter values), Principal Component Analysis is used to “select the least correlated subset of performance counters that can still explain the maximum variations in the data.” The performance data needs to be normalized for PCA to work well…

To eliminate PCA bias towards those variables with a larger variance, we standardized the performance counters via Unit Variance scaling, i.e., by dividing the observations of each counter variable by the variable’s standard deviation. Scaled performance counter data are then further mean centered to reduce the risk of collinearity.
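A minimal sketch of that preprocessing and counter selection, assuming scikit-learn and synthetic data; the 95% variance threshold, the top-10 cut-off, and the loadings-based selection rule are my assumptions, not details from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# `counters` is an (observations x counters) matrix of performance data;
# the shape is arbitrary here, purely for illustration.
counters = np.random.rand(1000, 40)

# Unit variance scaling + mean centering, as described in the quote above.
scaled = StandardScaler().fit_transform(counters)

# Keep enough principal components to explain most of the variance, then
# pick the counters with the largest absolute loadings on those components.
pca = PCA(n_components=0.95).fit(scaled)
loadings = np.abs(pca.components_).sum(axis=0)
top_counters = np.argsort(loadings)[::-1][:10]   # indices of the most informative counters
```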

PCA was chosen due to its “superior performance in identifying performance counters that are sensitive to minute changes in both workload and environment as compared to many other supervised and unsupervised machine learning techniques.”

The selected principal component performance counters are then fed into an anomaly detector. This phase works by finding changes in the data that cannot be easily represented (approximated) by a quadratic function:

When working with training data, we discover (potential) discontinuities by presuming that discontinuities cannot be well modelled by a low order polynomial function. Given a performance counter time series data {v[t]}, we approximate the series by the quadratic function f(t) = c + bt + at² that minimizes the least squared error (LSE). We presume that series containing sudden dramatic changes, anomalies, or discontinuities will not be fit as well by this approximation and so have a larger LSE.
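
A minimal sketch of that scoring step, assuming numpy; how the series is windowed and what LSE threshold flags a potential discontinuity are left as assumptions, since the paper's exact settings aren't quoted here:

```python
import numpy as np

def lse_of_quadratic_fit(series: np.ndarray) -> float:
    """Fit f(t) = c + b*t + a*t^2 by least squares and return the squared error.

    Series (or windows) containing sudden dramatic changes should fit poorly
    and therefore score a larger LSE.
    """
    t = np.arange(series.size)
    coeffs = np.polyfit(t, series, deg=2)        # least-squares quadratic fit
    residuals = series - np.polyval(coeffs, t)
    return float(np.sum(residuals ** 2))

# A flagging rule (e.g. LSE above a high percentile of window scores) is an
# assumption; the paper's exact thresholding scheme may differ.
```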

From this set of discovered anomalies, discontinuities are then identified by looking at the distribution of the performance counter in question before and after the anomaly transition period. For discontinuities, the change will persist, whereas for ordinary anomalies it will not. The Wilcoxon rank-sum test is used for this comparison.
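
A minimal sketch of that persistence check, assuming scipy; the window size and significance level are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ranksums

def is_discontinuity(series: np.ndarray, start: int, end: int,
                     window: int = 100, alpha: float = 0.01) -> bool:
    """Compare the counter's distribution before and after a flagged
    transition period [start, end). If the rank-sum test rejects equality,
    the change persisted, so we treat it as a discontinuity rather than a
    temporary anomaly.
    """
    before = series[max(0, start - window):start]
    after = series[end:end + window]
    _, p_value = ranksums(before, after)
    return p_value < alpha
```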

A disappointment in the paper (for this reader) is that much of the testing was done based on deliberately injecting anomalies and discontinuities into existing data. As a side-effect though, that means we are treated to a discussion on the most common causes of anomalies and discontinuities IRL:

Causes of temporary anomalies:

  • 80% of the performance anomalies in large software systems are due to software inconsistencies and human errors
  • the most common anomaly occurring in the field is related to transient memory issues (memory spikes)
  • large enterprises report that periodic CPU saturation is one of the fundamental field problems
  • interfering workloads are a major cause of performance degradation in data centers (resulting from competition for resources).

Causes of discontinuities:

  • Increase in workload due to promotions, new products, mergers and acquisitions
  • Change in transaction patterns, where a ‘transaction’ in this context is a sequence of events. Most likely caused by a new version of the software deployed in the data center.
  • Upgrades to infrastructure hardware or software

Finally, the approach was back-tested against 7 years' worth of real production logs for which analysts had already identified the discontinuities. Precision and recall were both at 92% with the optimum algorithm settings for this data set.