To type or not to type: quantifying detectable bugs in JavaScript

To type or not to type: quantifying detectable bugs in JavaScript Gao et al., ICSE 2017

This is a terrific piece of work with immediate practical applications for many project teams. Is it worth the extra effort to add static type annotations to a JavaScript project? Should I use Facebook’s Flow or Microsoft’s TypeScript if so? Will they really catch bugs that would otherwise have made it to master?

TL;DR: both Flow and TypeScript are pretty good, and conservatively either of them can prevent about 15% of the bugs that end up in committed code.

“That’s shocking. If you could make a change to the way we do development that would reduce the number of bugs being checked in by 10% or more overnight, that’s a no-brainer. Unless it doubles development time or something, we’d do it.” – engineering manager at Microsoft.

(It’s amusing to me that this quote comes from a manager at the company where they actually develop TypeScript! You’d think if anyone would know about the benefits…. Big companies eh).

Let’s dig in.

The flame wars

Static vs dynamic typing is always one of those topics that attracts passionately held positions. In the static typing camp we have the benefits of catching bugs earlier in the development cycle, eliminating altogether some classes of bugs in deployed systems, improved things like code-assist and other tool support, and enabling compiler optimisations. In the dynamic typing camp we have cleaner looking code, and greater flexibility and code malleability.

JavaScript is dynamically typed.

Three companies have viewed static typing as important enough to invest in static type systems for JavaScript: first Google released Closure, then Microsoft published TypeScript, and most recently Facebook announced Flow.

Shedding some light through empirical data

Inspired by previous studies in other areas, the authors study historical bugs in real world JavaScript projects in GitHub.

The fact that long-running JavaScript projects have extensive version histories, coupled with the existing of static type systems that support gradual typing and can be applied to JavaScript programs with few modifications, enables us to under-approximately quantify the beneficial impact of static type systems on code quality.

In other words, if the developers had been taking advantage of TypeScript or Flow at the time, would the bug have made it past the type checker? If not, it’s reasonable to assume it would never have been committed into the repository in the first place.

Here’s an example of a trivial bug that type annotations can detect:

Through this process, we end up with an under-estimation of the total benefits that might be available through static type checking.

In the Venn diagram below, we see on the left the universe of bugs that can potentially be detected by a statically checked type system. Type checking may help catch some of these faster during development. On the right we see bugs that have made it into public repositories. Only a subset of these have clearly linked fixes / patches. This study looks at the intersection of type-system detectable bugs and those that have public fixes.

We consider public bugs because they are observable in software repository histories. Public bugs are more likely to be errors understanding the specification because they are usually tested and reviewed, and, in the case of field bugs, deployed. Thus, this experiment under-approximates static type systems’ positive impact on software quality, especially when one considers all their other potential benefits on documentation, program performance, code completion, and code navigation.

Finding bugs to study

The goal is to find a corpus of bugs to study that is representative of the overall class, and large enough to support statistical significance. Finding representative bugs is addressed by uniform sampling. The authors sample commits that are linked to a GitHub issue from a snapshot of all publicly available JavaScript projects on GitHub. Each is then manually assessed to determine whether or not it really is an attempt to fix a bug (as opposed to a feature enhancement, refactoring, etc.). For the commits that pass this filter, the parent provides the code containing the bug.

To report results that generalize to the population of public bugs, we used the standard sample size computation to determine the number of bugs needed to achieve a specified confidence interval. On 19/08/2015, there were 3,910,969 closed bug reports in JavaScript projects on GitHub. We use this number to approximate the population. We set the confidence level and confidence interval to be 95% and 5%, respectively. The result shows that a sample of 384 bugs is sufficient for the experiment, which we rounded to 400 for convenience.

At the end of the sampling process, the bug pool contained bugs from 398 different projects (two projects happened to have 2 bugs each in the corpus). Most of these bug fixing commits ended up being quite small: about 48% of them touched only 5 or fewer lines of code, with a median of 6.

Bug assessment process

To figure out how many of these bugs could have been detected by TypeScript and Flow, we need some rules for how far we’re prepared to go in adding type annotations, and long long we’re prepared to spend on it. A preliminary study on a smaller sample of 78 bugs showed that for almost 90% a conclusion could be reached within 10 minutes, so the maximum time an author was allowed to spend annotating a bug was set at 10 minutes.

Each bug is assessed both with TypeScript (2.0) and with Flow (0.30). To reduce learning effects (knowledge gained from annotating with one system speeding annotation with the other), the type system to try first is chosen at random. The process is then to read the bug report and the fix and spend up to the allotted ten minutes adding type annotations. Sometimes the tools can detect the bug without needing any annotations to be added at all. Other times the report will make it clear that the bug is not type related – for example a misunderstanding of the intended application functionality. In this case the bug is marked as type-system undetectable.

We are not experts in type systems nor any project in our corpus. To combat this, we have striven to be conservative: we annotate variables whose types are difficult to infer with any. Then we type check the resulting program. We ignore type errors that we consider unrelated to this goal. We repeat this process until we confirm that b is ts-detectable because ts throws an error within the patched region and the added annotations are consistent (Section II), or we deem b is not ts-detectable, or we exceed the time budget M.

Details of the approach used to gradually add type annotations are provided in section III.D.

Does typing help to detect public bugs?

Of the 400 public bugs we assessed, Flow successfully detects 59 of them, and TypeScript 58. We, however, could not always decide whether a bug is ts-detectable within 10 minutes, leaving 18 unknown. The main obstacles we encountered during the assessment include complicated module dependencies, the lack of annotated interfaces for some modules, tangled fixes that prevented us from isolating the region of interest, and the general difficulty of program comprehension.

The 18 unknown bugs are then investigated to conclusion, at which point the score is 60 each for TypeScript and Flow.

Running the binomial test on the results shows that, at the confidence level of 95%, the true percentage of detectable bugs for Flow and TypeScript falls into [11.5%, 18.5%] with mean 15%.

Flow and TypeScript largely detect the same bugs:

Which is better: Flow or TypeScript?

The bugs that the two systems can detect largely overlap, with just 3 bugs that are only detected by TypeScript, and 3 that are only detectable by Flow.

All three Flow-detectable bugs that TypeScript fails to catch are related to concatenating possibly undefined or null values to a value of type string. Two of the three TypeScript-detectable bugs that Flow fails to detect are due to Flow’s incomplete support for using a string literal as in index. The remaining bug slips through the net due to Flow’s permissive handling of the window object.

Flow has builtin support for popular modules, like Node.js, so when a project used only those modules, Flow worked smoothly. Many projects, however, use unsupported modules. In these cases, we learned to greatly appreciate TypeScript community’s DefinitelyTyped project. Flow would benefit from being able to use DefinitelyTyped; TypeScript would benefit from automatically importing popular DefinitelyTyped definitions.

Yes, but what’s the cost of adding all those annotations?

Another consideration, both when comparing TypeScript and Flow, and when deciding whether to use either, is the cost of adding the annotations in terms of time and ‘token pollution’ in the source code.

… on average Flow requires 1.7 tokens to detect a bug, and TypeScript 2.4. Two factors contribute to this discrepancy; first, Flow implements stronger type inference, mitigating its reliance on type annotations; second, Flow’s syntax for nullable types is more compact.

The authors also measured the time taken to add the required annotations, with Flow coming out on top:

Thanks to Flow’s type inference, in many cases, we do not need to read the bug report and the fix in order to devise and add a consistent type annotation, which leads to the noticeable difference in annotation time.

Over to you

Is a 15% reduction in bugs making it all the way through your development pipeline worth it to you? And if so, will you go with the Flow or settle for TypeScript? Flow’s lighter-weight approach may suit dynamic-typing advocates better, whereas TypeScript’s DefinitelyTyped library and approach may perhaps be more appealing to strong-typing advocates coming from other languages.

Whatever you decide, at least you now have some data points to help make a more informed decision.

the morning paper

a random walk through Computer Science research, by Adrian Colyer
Made delightfully fast by strattic