The Art of Testing Less Without Sacrificing Quality

The Art of Testing Less Without Sacrificing Quality – Herzig et al. 2015

Why on earth would anyone want to test less? Maybe if you could guarantee the same eventual quality, and save a couple of million dollars along the way…

By nature, system and compliance tests are complex and time-consuming although they rarely find a defect. Large complex software products tend to run on millions of configurations in the field and emulating these configurations requires multiple test infrastructures and procedures that are expensive to run in terms of cost and time. Making tests faster is desirable but usually requires enormous development efforts. Simply removing tests increases the risk of expensive bugs being shipped as part of the final product… At the same time, long running test processes increasingly conflict with the need to deliver software products in shorter periods of time while maintaining or increasing product quality and reliability. Increasing productivity through running less tests is desirable but threatens product quality, as code defects may remain undetected.

So long as you run every test eventually, you’re guaranteed to catch every defect that the full test run would have caught. That means there’s a trade-off you can make between the cost of executing a test now, and the (increased) cost of possibly finding a defect later in the cycle. Herzig et al. present THEO, which estimates both costs and chooses the path with the lower expected cost. The evaluation is done in the context of Microsoft’s Office, Windows, and Dynamics products, which means we also get some really interesting data on these costs at Microsoft.

The results (in simulation) cover more than 26 months of industrial product execution and more than 37 million test executions, and the savings are impressive:

THEO would have reduced the number of test executions by up to 50%, cutting down test time by up to 47%. At the same time, product quality was not sacrificed as the process ensures that all tests are run at least once on all code changes. Removing tests would result in between 0.2% and 13% of defects being caught later in the development process, thus increasing the cost of fixing those defects. Nevertheless simulation shows that THEO produced an overall cost reduction of up to $2 million per development year, per product. Through reducing the overall test time, THEO would also have other impacts on the product development process, such as increasing code velocity and productivity. These improvements are hard to quantify but are likely to increase the cost savings estimated in this paper. The technique and results described in this paper have convinced an increasing number of product teams, within Microsoft, to provide dedicated resources to explore ways to integrate THEO into their actual live production test environments.

Let’s take a look at the model THEO uses.

THEO does not require any additional test case instrumentation, and uses only the test name, time taken for the test to run, and test result as input. This data is already collected by test runners. The test result could be a pass or a fail, and if it’s a fail it could be because of a genuine defect or because of a problem with the test case itself (false positive). By tying failed tests back to raised bug reports, THEO is able to distinguish between these two scenarios. If the bug report results in a fix to non-test code, the test is presumed to have found a code defect; otherwise it is assumed to have been a false alarm. If the failure was not investigated, the outcome is ‘unknown’. Tests run in different execution contexts (and may have different probabilities of finding bugs in different contexts). An execution context is a collection of properties – for the study BuildType, Architecture, Language, and Branch were used.

This is a crucial point as a test may show different execution behaviors for different execution contexts. For example, a test might find more issues on code of one branch than another depending on the type of changes performed on that branch. For example, test cases testing core functionality might find more defects on a branch containing kernel changes than on a branch managing media changes. Thus, our approach will not only differentiate between test cases, but also bind historic defect detection capabilities of a test to its execution context.
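To make the inputs concrete, here is a minimal sketch of how a failed test might be labelled from its bug report, and how an execution context might be represented. The names `ExecutionContext`, `Outcome`, and `classify` are hypothetical, as is the idea of summarising the bug report as a single optional flag; the paper does not prescribe a data model.

```python
from enum import Enum
from typing import NamedTuple, Optional


class ExecutionContext(NamedTuple):
    """The four context properties used in the study (field names are illustrative)."""
    build_type: str
    architecture: str
    language: str
    branch: str


class Outcome(Enum):
    PASSED = "passed"
    DEFECT = "defect"            # failure led to a fix in non-test code
    FALSE_ALARM = "false_alarm"  # failure investigated, no product-code fix
    UNKNOWN = "unknown"          # failure never investigated


def classify(passed: bool, fixed_product_code: Optional[bool]) -> Outcome:
    """Label one test execution.

    fixed_product_code is an assumed summary of the linked bug report:
    True if it ended in a non-test code fix, False if it did not,
    None if the failure was never investigated.
    """
    if passed:
        return Outcome.PASSED
    if fixed_product_code is None:
        return Outcome.UNKNOWN
    return Outcome.DEFECT if fixed_product_code else Outcome.FALSE_ALARM


# Example: the same test is tracked separately per context.
ctx = ExecutionContext("debug", "x64", "en-us", "winmain")
label = classify(passed=False, fixed_product_code=True)
```

Because `ExecutionContext` is a tuple, it can be used directly as part of a dictionary key alongside the test name.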

At the core of the model are estimates of the probability that a given test execution will find a genuine defect (true positive), and that it will raise a false alarm (false positive).

Pdefect(test,context) = #detectedDefects(test,context) / #executions(test,context)

PfalsePositive(test,context) = #falseAlarms(test,context) / #executions(test,context)

All pretty straightforward…

Both probability measurements consider the entire history from the beginning of monitoring until the moment the test is about to be executed. Consequently, probability measures get more stable and more reliable the more historic information we gathered for the corresponding test.
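The two ratios can be maintained incrementally as results arrive. The following is a rough sketch, not the authors' implementation: the class `ProbabilityModel`, its method names, and the string outcome labels are all made up for illustration. Returning `None` when there is no history is also an assumption; a caller would presumably run such a test rather than skip it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class History:
    executions: int = 0
    defects: int = 0
    false_alarms: int = 0


class ProbabilityModel:
    """Per-(test, context) counters covering the whole monitored history."""

    def __init__(self):
        self._history = defaultdict(History)

    def record(self, test, context, outcome):
        """outcome is one of 'pass', 'defect', 'false_alarm' (hypothetical labels)."""
        h = self._history[(test, context)]
        h.executions += 1
        if outcome == "defect":
            h.defects += 1
        elif outcome == "false_alarm":
            h.false_alarms += 1

    def _ratio(self, test, context, field):
        # .get() avoids creating an empty entry for unseen tests.
        h = self._history.get((test, context))
        if h is None or h.executions == 0:
            return None  # no history yet: assumption, the caller decides what to do
        return getattr(h, field) / h.executions

    def p_defect(self, test, context):
        return self._ratio(test, context, "defects")

    def p_false_positive(self, test, context):
        return self._ratio(test, context, "false_alarms")


# Example: four executions of one test in one context.
model = ProbabilityModel()
context = ("debug", "x64", "en-us", "main")
for outcome in ["pass", "defect", "false_alarm", "pass"]:
    model.record("boot_test", context, outcome)
```

With only a handful of executions the estimates swing heavily on each new result, which is why they become more reliable as history accumulates.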

Given these probabilities, all that remains is to estimate the associated costs. If the estimated cost of skipping a test, Costskip, is less than the estimated cost of executing it, Costexec, then THEO will skip the test (with the proviso that every test eventually gets executed).

The cost of executing a test is captured as a function of the machine time spent on test execution and time wasted on investigating false positives:

Costexec = Costmachine + (PfalsePositive * Costinspect)

For the Microsoft development environment, Costmachine was estimated at $0.03/hour, corresponding roughly to the cost of a memory intense Azure image including power and hardware consumption as well as maintenance.

Costinspect at Microsoft is $9.60. “It considers the size of the test inspection teams, the number of inspections performed and the average salary of engineers on the team.”
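As a worked example of the execution cost, the sketch below plugs in the two Microsoft constants quoted above. It assumes that Costmachine for a single run is the hourly rate multiplied by the test's duration; the function name, parameter names, and the sample test (two hours long, 1% false-positive rate) are invented for illustration.

```python
COST_MACHINE_PER_HOUR = 0.03  # dollars per machine hour, as quoted above
COST_INSPECT = 9.60           # dollars per false-positive inspection, as quoted above


def cost_exec(duration_hours, p_false_positive,
              cost_machine_per_hour=COST_MACHINE_PER_HOUR,
              cost_inspect=COST_INSPECT):
    """Expected cost in dollars of running a test once.

    Assumption: machine cost scales linearly with the test's run time.
    """
    machine = cost_machine_per_hour * duration_hours
    inspection = p_false_positive * cost_inspect
    return machine + inspection


# Hypothetical test: runs for 2 hours, raises a false alarm 1% of the time.
# machine = 0.03 * 2 = 0.06, inspection = 0.01 * 9.60 = 0.096
example_cost = cost_exec(duration_hours=2.0, p_false_positive=0.01)
```

Even at a 1% false-alarm rate the inspection term outweighs two hours of machine time, which suggests that flaky tests, rather than slow ones, dominate the cost of execution under these constants.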

The cost of skipping a test is captured as a function of the probability that the test would have found a genuine defect, the (increased) cost of fixing a defect found later in the cycle, the delay before the defect is found, and the number of engineers affected.

Costskip = Pdefect * Costescaped * Timedelay * #Engineers

The constant Costescaped represents the average cost of an escaped defect. This cost depends on the number of people that will be affected by the escaped defect and the time duration the defect remains undetected. We used a value of $4.20 per developer and hour of delay for Costescaped. This value represents the average cost of a bug elapsing within Microsoft. Depending on the time the defect remains undetected and the number of additional engineers affected, elapsing a defect from a development branch into the main trunk branch in Windows can cost tens of thousands of dollars.

The #Engineers is determined by counting the number of engineers whose code changes pass the code branch. Timedelay is the average timespan required to fix historic defects on the corresponding branch.
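Putting the skip cost next to an execution cost gives the core decision. This is a sketch under stated assumptions: Timedelay is taken to be in hours (matching the per-developer, per-hour unit of Costescaped), and the function names and all sample numbers other than the $4.20 constant are hypothetical.

```python
COST_ESCAPED = 4.20  # dollars per developer per hour of delay, as quoted above


def cost_skip(p_defect, time_delay_hours, engineers, cost_escaped=COST_ESCAPED):
    """Expected cost in dollars of not running a test now."""
    return p_defect * cost_escaped * time_delay_hours * engineers


def should_skip(cost_exec_value, cost_skip_value):
    """Skip only when skipping is strictly cheaper than executing."""
    return cost_skip_value < cost_exec_value


# Hypothetical test: finds a defect in 0.1% of runs, on a branch where
# defects take 10 hours to fix on average and 5 engineers are affected.
# 0.001 * 4.20 * 10 * 5 = 0.21
skip_cost = cost_skip(p_defect=0.001, time_delay_hours=10.0, engineers=5)

# Compared against an assumed execution cost of $0.156, the test is run.
decision = should_skip(cost_exec_value=0.156, cost_skip_value=skip_cost)
```

The multiplication by engineers and delay means the same test can be worth skipping on a quiet branch and worth running on a busy one.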

The final piece of the puzzle is ensuring that every test gets executed eventually:

To ensure this happens we use two separate criteria, depending on the development process:

  • Option 1: For single branch development processes, e.g. Microsoft Office, we enforce each test to execute at least every third day. Since all code changes are applied to the same branch, re-execution of each test for each execution context periodically ensures that each code change has to go through the same verification procedures as performed originally.
  • Option 2: For multi-branch development processes, e.g. Microsoft Windows, we enforce execution of each combination of test and execution context on the branch closest to trunk on which the test had been executed originally.
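Option 1 can be layered on top of the cost comparison as a simple guard. The sketch below reads "at least every third day" as "never let three days pass without an execution", which is one possible interpretation; the function names are hypothetical. Option 2 is not sketched because it depends on the branch topology, which the text does not describe.

```python
from datetime import datetime, timedelta

MAX_SKIP_INTERVAL = timedelta(days=3)  # assumed reading of "every third day"


def must_run(last_executed, now, max_interval=MAX_SKIP_INTERVAL):
    """True if the test has never run, or last ran too long ago to skip again."""
    if last_executed is None:
        return True
    return now - last_executed >= max_interval


def decide(cost_exec, cost_skip, last_executed, now):
    """Return 'run' or 'skip' for one test in one execution context."""
    if must_run(last_executed, now):
        return "run"  # the periodic guard overrides the cost model
    return "skip" if cost_skip < cost_exec else "run"


# Example: skipping looks cheaper, but the test last ran four days ago.
now = datetime(2015, 6, 10, 12, 0)
choice = decide(cost_exec=1.0, cost_skip=0.1,
                last_executed=now - timedelta(days=4), now=now)
```

Checking the guard first keeps the guarantee independent of the cost estimates, so a badly calibrated model can delay a test but never drop it.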