Moving Fast with Software Verification

Moving Fast with Software Verification – Calcagno et al. 2015

This is a story of transporting ideas from recent theoretical research in reasoning about programs into the fast-moving engineering culture of Facebook. The context is that most of the authors landed at Facebook in September of 2013, when we brought the INFER static analyser with us from the verification startup Monoidics. INFER itself is based on recent academic research in program analysis, which applied a relatively recent development in logics of programs, separation logic. As of this writing INFER is deployed and running continuously to verify select properties of every code modification in Facebook’s mobile apps; these include the main Facebook apps for Android and iOS, Facebook Messenger, Instagram, and other apps which are used by over a billion people in total.

How do you mesh formal verification “proponents of which sometimes even used to argue that programs should be developed only after a prior specifications had been written down,” with a continuous delivery model? This strikes me as very similar to the problem of integrating security into a continuous delivery pipeline too. On the web, Facebook pushes new changes to code twice a day – but mobile platforms are now even more important than the web. With the mobile platforms (iOS and Android), you can’t just push new features and bug fixes the minute they are ready – Facebook can only distribute a new version to the Apple App Store or Google Play but the user controls if/when they update.

This difference has dramatic impact on bug fixes. On the web when a bug is discovered a fix can be shipped to the servers as part of a periodic release or in exceptional cases immediately via a “hotfix”. And on the web mechanisms exist to automatically update the JavaScript client software running in the browser, allowing fixes to quickly and automatically be deployed as soon as they have been developed. On current mobile platforms updates must typically be explicitly authorised by the device owner, so there is no guarantee that a fix will be deployed in a timely manner, if ever, once it is developed.

Furthermore mobile platforms have relatively low fault tolerance so a runtime error can often cause termination of the entire app.

Thus mobile development at Facebook presents a strong dichotomy: on one hand it employs continuous development; on the other hand it could benefit from techniques like formal verification to prevent bugs before apps are shipped.

The first safety properties the INFER team decided to tackle were the implicit safety properties that null pointer exceptions and resource leaks cannot occur in Android-based code, and additionally memory leaks in iOS code. The formal verification is integrated with regression testing as part of the continuous delivery pipeline. The authors identify four (challenging) requirements for verification in such a scenario:

  1. It must be fully automated and integrated
  2. It must scale to millions of lines of code
  3. It must give precise and useful results: “a developer’s time is an importance resource, an imprecise tool providing poor results would be seen as a waste of that resource.”
  4. It must generate results quickly: “it has to report in minutes, before programmers commit or make further changes.”

These requirements are challenging. In our context we are talking about analysis of large Android and iPhone apps (millions of lines of code are involved in the codebases). The analysis must be able to run on thousands of code diffs in a day, and it should report in under 10 minutes on average to fit in with the developer workflow. There are intra-procedural analyses and linters which fit these scaling requirements, and which are routinely deployed at Facebook and other companies with similar scale codebases and workflows. But if an analysis is to detect or exclude bugs involving chains of procedure calls, as one minimally expects of verification techniques, then an inter-procedural analysis is needed, and making inter-procedural analyses scale to this degree while maintaining any degree of accuracy has long been a challenge.

To address these challenges INFER exploits several recent advances in automatic verification:

It’s underlying formalism is separation logic. It implements a compositional, bottom-up variant of the classic RHS inter-procedural analysis algorithm based on procedure summaries. There are two main novelties. First, it uses compact summaries, based on the ideas of footprints and frame inference from separation logic, to avoid the need for huge summaries that explicitly tabulate most of the input-output possibilities. Second, it uses a variation on the notion of abductive inference to discover those summaries.

The overall development process works as follows:

  1. The programmer creates a diff
  2. The diff goes through a phase of peer-reviews: “a loop of interactions aimed at making the code change robust and efficient as well as being understandable, readable, and maintainable by others.”
  3. When the reviewers are satisfied they accept the code-change and the diff is pushed to the main code-base.
  4. Every two weeks a version of the code is frozen into the release candidate. This goes into a testing period where it is available to Facebook’s employees for internal use.
  5. After two weeks of internal use, the release candidate is rolled out to Facebook users.

During phase 2, regression tests are automatically run and before accepting any code change a reviewer requires that all the tests pass. Tests run asynchronously and the results are automatically available in the collaboration tool phabricator used for peer review. INFER is run at phase 2. The process is completely automatic. Once the code is submitted for peer review, an analysis is run asynchronously in one of Facebook’s datacenters and results are reported on phabricator in the form of comments.

To be able to comment on a diff within 10 minutes, INFER uses incremental analysis and a caching system for analysis results. The full Android and iOS code bases are fully analysed nightly as well. This full analysis can take over 4 hours, and produces the cache used when processing diffs.

Ultimately, one of the biggest challenges we faced was a social challenge: to get programmers to react to bugs reported by the tool and fix genuine errors. Programmers need to accumulate trust in the analyser and they should see it as something helping them to build better software rather than something slowing them down

To build this trust, the team started conservatively concentrating on out-of-memory errors and null pointer exceptions and training INFER on the existing database of crashes and bugs to target false positives and negatives. “Having a dedicated static analysis team within Facebook helps tremendously with the social challenge.

INFER is a good start, but the authors believe there is much more the research community can do as a whole:

Finally, although there have been some successes, we should say that from an industrial perspective advanced program analysis techniques are generally under- developed. Simplistic techniques based on context insensitive pattern matching (“linters”) are deployed often and do provide value, and it is highly non-trivial to determine when or where many of the ingenious ideas being proposed in the scientific literature can be deployed practically. Part of the problem, we suggest, is that academic research has focused too much on whole-program analysis, or on specify-first, both of which severely limit the number of use cases. There are of course many other relevant problem areas – error reporting, fix suggestion, precision of abstract domains, to name a few – but we believe that automatic formal verification techniques have the potential for much greater impact if compositional analyses can become better developed and understood.