Coverage and its Discontents

Coverage and its Discontents – Groce, Alipour, and Gopinath, 2014

Yesterday's discussion of the effectiveness of code coverage seemed to leave us with as many questions as it answered. Today's choice provides an overview of the situation and helps us to focus on some key questions relating to test effectiveness. We're bang up to date with this paper, as it is being presented at the Onward 2014 conference tomorrow!

Everyone wants to know one thing about a test suite: will it detect enough bugs? Unfortunately, in most settings that matter, answering the question directly is impractical or impossible.

Groce et al. set out five questions of interest to researchers and practitioners, of which numbers 1 and 5 resonate most strongly with me:

1. Can I stop testing? Is this suite good enough that I can assume any undetected defects are either very few in number or are so low-probability that they are unlikely to cause trouble? Similarly, will this suite likely detect future defects introduced into this program?

and

5. What tests should I add to this test suite, to improve its ability to detect faults?

The first of these highlights an important question: "will this suite detect likely future defects?" That is, will we catch regressions to the functionality that is already working for users? Many modern systems of interest share more of the characteristics of organisms than of the older fixed-artifact product release model: they are under continuous evolution, change, and replacement – always the same system but never the same parts. Continuous evolution requires confidence that changes are non-breaking (or at least that you can detect and kill bad mutants quickly!).

Some questions not asked in the paper that I'd love to see answers to include:

- What is the impact of test methodology on suite effectiveness (e.g., as per Matthew Sackman's question on Twitter yesterday – does TDD produce better suites)?
- What types of test are most effective (e.g. unit, integration, system, acceptance, …)?
- What metrics are most helpful in guiding test efforts? Coverage may be one, but what about other metrics of code complexity, and data about areas where bugs have been found 'in the wild'? What if I had access to the kind of overview data that e.g. Semmle* collects?
- What testing methods give the best ROI over the long term (i.e. trading off the time spent creating and maintaining tests against their effectiveness)?
- What is the optimum balance for quality assurance between time invested in test suites and other techniques such as code reviews?

…and so on.

* disclaimer: Accel are investors in Semmle.

All of these questions seem to rely on some kind of objective measure of the effectiveness of a test suite – and to my eye it's clear that 'code coverage' is not a good substitute for such a measure in the context of these kinds of questions. For example, a simple code coverage objective may favour extensive unit tests, yet we learned earlier in this series that the interactions between components are a major source of problems.
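To make that point concrete, here's a minimal illustrative sketch (my own, not from the paper) in Python. Each unit reaches 100% line coverage under its own tests, yet the defect only appears when the two units interact – coverage gives no hint that the interaction is untested:

```python
# Hypothetical example (not from the paper): each function is fully
# covered by its own unit test, but the bug lives in their interaction.

def parse_amount(text):
    """Parse a user-supplied amount in pounds, e.g. '12.50' -> 12.5."""
    return float(text)

def charge(pence):
    """Charge an amount expressed as a whole number of pence."""
    assert isinstance(pence, int), "charge() expects an integer number of pence"
    return f"charged {pence}p"

# Unit tests (run with pytest): 100% line coverage of both functions.
def test_parse_amount():
    assert parse_amount("12.50") == 12.5

def test_charge():
    assert charge(1250) == "charged 1250p"

# The integration bug: parse_amount returns pounds as a float, while
# charge expects pence as an int. Neither unit test exercises this
# hand-off, yet the coverage report already reads 100%.
def test_end_to_end():
    charge(parse_amount("12.50"))   # fails: float pounds where int pence expected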

And so:

 The widespread use of code coverage in both practice and research, therefore, raises a question: is code coverage actually useful as a replacement for measuring real fault detection? … Most experienced testers can immediately answer that measuring code coverage is not a completely adequate replacement for measuring fault detection.

The paper goes on to define Strong and Weak Coverage Hypotheses. The strong form says that coverage is a meaningful indicator of fault detection in its own right; the weak form says that coverage may be useful for distinguishing between the effectiveness of suites generated using different approaches (but not necessarily for comparing suites within an approach). In the view of the authors, there is not sufficient evidence to prove either hypothesis.

One of the more interesting parts of the paper (for me) comes at the end, with a discussion of the recent high-profile Apple iOS SSL vulnerability and of Heartbleed. This section has the witty headline:

The Plural of Anecdote is Not Confidence

Though when I thought about that, I realised that Bayes may beg to differ! We like it when things are certain, but pragmatically we always have to deal with probabilities. Bayes' theorem gives us a way of accumulating anecdotes (evidence), and if they all point in the same direction this will, over time, increase our level of confidence in a hypothesis.
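Here's a small illustrative Python sketch of that idea (mine, not from the paper): each 'anecdote' that points the same way nudges the posterior probability of the hypothesis upwards. The likelihood values are invented numbers, chosen only to show the mechanism:

```python
# Illustrative only: accumulate independent pieces of evidence for a
# hypothesis H using Bayes' theorem. The likelihoods are made-up
# numbers, chosen just to show how confidence grows as anecdotes agree.

def update(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) given P(H) and the two likelihoods."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / evidence

belief = 0.5  # start agnostic about the hypothesis
for i in range(5):  # five anecdotes, all pointing the same way
    belief = update(belief, p_e_given_h=0.8, p_e_given_not_h=0.3)
    print(f"after anecdote {i + 1}: P(H | evidence) = {belief:.3f}")
# Confidence climbs steadily towards 1.0 as consistent evidence accumulates.
```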

Anyway, it turns out that code coverage may have been useful in helping to detect the iOS SSL vulnerability, but almost certainly would not have helped in detecting Heartbleed.

This leads to a state of general discontent… Even the traditional 'cautious' use of coverage (never as a sign of a good suite, but only as a pointer to weaknesses in a suite) lacks truly solid supporting evidence. What is to be done, until more evidence is available?

Don’t rely solely on coverage!