Coverage is not strongly correlated with test suite effectiveness – Inozemtseva and Holmes, 2014
Is code coverage a useful metric?
Inozemtseva and Holmes won an ACM Distinguished Paper award at ICSE 2014 for this paper, which asks whether code coverage is a good indicator of test suite effectiveness. How do we know if our test suite is any good?
Testing is an important part of producing high quality software, but its effectiveness depends on the quality of the test suite: some suites are better at detecting faults than others.
Code coverage is an often-tracked metric, but it turns out researchers haven’t been able to agree on whether it is actually useful. The first question the authors address is simply: “is a bigger test suite a likely indicator of a better test suite?”
Our results suggest that for large Java programs, there is a moderate to very high correlation between the effectiveness of a test suite and the number of test methods it contains.
If we assume competent developers who add tests to a suite with the goal of testing things that were not previously tested, then this result just follows common sense. Number of test methods would be a dreadful target to set, though – it’s artificial and easily gamed. Number of test methods may correlate with something that is fundamental to the effectiveness of a test suite, but just having more methods for the sake of it clearly isn’t what we’re after.
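To see why raw test-method count is so easy to game, here is a toy sketch (hypothetical names, and Python rather than the Java of the study): duplicating an existing test method triples the suite’s size without changing what it can detect.

```python
def square(x):
    return x * x

def square_faulty(x):
    return x + x  # seeded fault: agrees with x * x only for x == 0 or x == 2

def suite_small(f):
    """One test method."""
    return [f(3) == 9]

def suite_padded(f):
    """Three test methods -- but all copies of the same check."""
    return [f(3) == 9, f(3) == 9, f(3) == 9]

# Both suites pass on the correct implementation...
assert all(suite_small(square)) and all(suite_padded(square))
# ...and both detect exactly the same seeded fault:
# the padding tripled suite size without adding any effectiveness.
assert not all(suite_small(square_faulty))
assert not all(suite_padded(square_faulty))
```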
What happens if we do the same analysis using coverage metrics?
Our results suggest that for many large Java programs, there is a moderate to high correlation between the effectiveness and the coverage of a test suite, when the influence of suite size is ignored.
If we once more assume competent developers who add tests with the goal of testing things that were not previously tested, then this also seems to follow common sense: new tests that cover previously untested behaviour will also tend to increase coverage. Coverage may correlate with something that is fundamental to the effectiveness of a test suite, but is it a meaningful indicator of test suite effectiveness in its own right?
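The “new tests raise coverage” intuition is easy to make concrete. Below is a minimal statement-coverage tracer built on Python’s `sys.settrace` (a sketch with a hypothetical `clamp` function, not the instrumentation the study used): each test that exercises previously untested behaviour grows the set of executed lines.

```python
import sys

def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def run_with_coverage(fn, *args, covered):
    """Execute fn(*args), recording each line of clamp that runs."""
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_name == "clamp":
            covered.add(frame.f_lineno)
        return tracer  # keep tracing line events inside the new frame
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return covered

covered = set()
run_with_coverage(clamp, 5, 0, 10, covered=covered)   # in-range path
n_after_one = len(covered)
run_with_coverage(clamp, -1, 0, 10, covered=covered)  # below-lo path adds a line
n_after_two = len(covered)
run_with_coverage(clamp, 99, 0, 10, covered=covered)  # above-hi path adds another
n_after_three = len(covered)
```

Each successive test targets an untested path, so coverage rises monotonically; re-running an existing test would leave `covered` unchanged.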
Our results suggest that for large Java programs, the correlation between coverage and effectiveness drops when suite size is controlled for… it is not generally safe to assume that effectiveness is correlated with coverage.
Another interesting result along the way is that it doesn’t seem to matter which type of code coverage you use (statement, branch, etc.) – they all perform about the same in these tests.
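The metrics do measure different things on individual programs, even if they correlate with effectiveness about equally in aggregate. A toy illustration (hypothetical function): an `if` with no `else` lets a single test execute every statement while leaving the fall-through branch unexercised, and a fault hiding on that path survives.

```python
def free_shipping(total):
    eligible = False
    if total >= 50:
        eligible = True
    return eligible

def free_shipping_faulty(total):
    eligible = True   # seeded fault on the fall-through path
    if total >= 50:
        eligible = True
    return eligible

# One test executes all four statements: 100% statement coverage...
assert free_shipping(60) is True
# ...but only the taken branch of the `if` (50% branch coverage),
# so the seeded fault passes the very same test undetected:
assert free_shipping_faulty(60) is True
# A branch-covering suite must add the total < 50 case, which catches it:
assert free_shipping(10) is False
assert free_shipping_faulty(10) is True   # the faulty result the weaker suite never sees
```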
So what’s the bottom line? A reinforcement of something many practitioners instinctively understand:
While coverage measures are useful for identifying under-tested parts of a program, and low coverage may indicate that a test suite is inadequate, high coverage does not indicate that a test suite is effective.
So we can use code coverage reports as a tool to help us identify areas that may warrant further testing (alongside things such as hot-spot analysis of the areas in the code where we are finding lots of bugs), but we can’t use them to tell us whether we have an effective test suite (vs. e.g. one that tests a bunch of logic-free getters and setters!).
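The getters-and-setters point generalises: coverage records what code was *executed*, not what was *checked*. The paper measures effectiveness by generating faulty program versions (mutants) and counting how many a suite detects; a toy version of that idea (hypothetical classes) shows a full-coverage suite that detects nothing.

```python
class Account:
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount
    def withdraw(self, amount):
        self.balance -= amount

class AccountMutant(Account):
    def deposit(self, amount):
        self.balance -= amount   # seeded fault (mutant)

def weak_suite(account_cls):
    """Executes every method (full coverage) but asserts nothing about state."""
    acct = account_cls()
    acct.deposit(10)
    acct.withdraw(3)
    return True   # "passes" no matter what the code did

def strong_suite(account_cls):
    """Identical coverage, but with a meaningful assertion."""
    acct = account_cls()
    acct.deposit(10)
    acct.withdraw(3)
    return acct.balance == 7

# Both suites cover exactly the same code; only one kills the mutant.
assert weak_suite(Account) and weak_suite(AccountMutant)        # mutant survives
assert strong_suite(Account) and not strong_suite(AccountMutant)
```

The two suites are indistinguishable by any coverage metric, yet only one has any fault-detection ability – which is exactly the gap the paper is probing.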
Of course, developers still want to measure the quality of their test suites, meaning they need a metric that does correlate with fault detection ability.
What is that metric? We don’t know. Come back tomorrow for more on this topic…