Node.fz: fuzzing the server-side event-driven architecture

Node.fz: Fuzzing the server-side event-driven architecture Davis et al., EuroSys’17

This paper provides a fascinating look at common causes of concurrency bugs in server-side event driven architecture (EDA) based applications. By far the most popular framework supporting this style is Node.js of course.

The Node.js package ecosystem, npm, is the largest ever, with over 400,000 packages and over 1.75 billion package downloads per week. Node.js has been deployed in industry, including at eBay, PayPal, and LinkedIn, and is also being embraced on IoT platforms including Cylon.js and IBM’s Node-Red.

Concurrency bugs? Hang on a minute, doesn’t Node run a single-threaded event loop in which all callbacks are executed atomically? Yes it does, but to avoid blocking the event loop callbacks use cooperative multitasking and partition their operations into multiple asynchronous steps, leading to callback chains (and sometimes, callback hell!). Each one of those steps is indeed executed atomically. But the order in which the steps are executed? Unless care is taken, that can catch you out!

Composing responses using this callback chain technique has important implications for software correctness: developers must provide guarantees for both ordering and atomicity. The EDA offers no guarantee of the order in which the event loop will process callbacks… Developers must therefore either write commutative callbacks or introduce their own ordering constraints.

An ordering violation (OV) occurs when operation A should always be invoked before operation B, but this order is not enforced. An atomicity violation (AV) occurs when two operations are intended to happen consecutively, but another operation can be interleaved between them and affect the result.

… many concurrency bugs we found in this study are due to a false assumption of atomicity across callback chains.

Long-standing readers of The Morning Paper (or indeed, students of computer science in the 1970’s) may not be surprised to see concurrency bugs that are dual of those which can occur in threaded systems, because Lauer and Needham showed back in 1978 that thread-based and event-base models are duals of each other (equivalent). For background see the following editions of The Morning Paper from December 2014:

The main body of the Node.fz paper is in two parts: the first part is a study of concurrency bugs found in Node.js programs in the wild, and the second part describes a tool called Node.fz which the authors built to help flush them out. Let’s look at each of those in turn.

A study of concurrency bugs

The authors searched across GitHub for bugs mentioning ‘race’ or ‘race-condition’ in JavaScript-based projects, narrowed the results down to server-side projects, and finally selected 12 patched bugs for careful study, based on how well documented they were. The findings are interesting, but at the end of the day they’re based on an analysis of just 12 bugs!

The following table contains all the gory details of the bugs, their causes, and their consequences.

We identified two general patterns: atomicity violations (AV) and ordering violations (OV)… The most frequent type of bug in our bug study was an AV.

Told you so! 😉 : A short note on atomicity and ordering, The Morning Paper, Feb. 2016

A special type of ordering violation the authors call a COV (commutative ordering violation) happens when applications launch multiple asynchronous requests and intend to run a callback only when all of them have completed. The violation occurs when the final callback runs before all of the requests have completed.

The key findings are:

  • Events involved in race conditions stem from diverse sources such as network traffic, timers, user method calls, and the timing of worker pool work processing and “done” events.
  • Race conditions are not only on shared memory (e.g., writes to variables and arrays), but also on system resources (e.g., queries to a database, I/O to a file system).
  • Race conditions may result in severe consequences including server crashes and inconsistent database states.

A common solution to atomicity bugs seems to be to move the intended-to-be-consecutive accesses into the same callback. Ordering violations can be fixed using nested callbacks, or equivalent approaches with promises.

On a final note for the bug study, we want to emphasize that even developers familiar with effective EDA patterns still make mistakes. We do not believe that complex EDA-based software is significantly easier to get right than multi-threaded software, it just relies on a different paradigm.

Flushing out bugs with Node.fz

The race conditions that these bugs rely on can be difficult to trigger.

… non-determinism, arising from the order in which inputs and intermediate events are handled by the event loop and the worker pool, masks the OVs and AVs that cause inter- and intra-callback chain races.

So Node.fz deliberately amplifies Node’s inherent non-determinism, fuzzing (shuffling and delaying) events, in the hope that this will flush bugs out quicker.

  • By shuffling entries in the event queue before executing each callback, Node.fz yields schedules with alternative input and intermediate event arrival orders.
  • By shuffling the entries in the worker pool’s task and done queues, Node.fz produces schedules with alternative worker pool task processing and completion order.

§4.3 in the paper describes the details of the Node.fz implementation. All the changes are in the libuv event library used by node. This is where the core event loop and worker pool reside. Containing changes just to libuv also isolates Node.fz from the many changes to the Node.js source.

Node.fz will only make fuzzing decisions which are legal according the Node.js documentation:

  • there is no upper bound on how late timers can be
  • epoll doesn’t guarantee immediate notification of ready file descriptors (and input arrival times can vary)
  • there is no guarantee about the order of task handling from the worker pool task queue
  • there is no guarantee about the order of the handling of done tasks relative to each other or to other events in the event loop.

An earlier version of Node.fz also shuffled timers, but several test suites in Node.js rely on Node’s current implementation ordering, causing failures under Node.fz so shuffling of timers was removed. With this change, running the Node.js test suite with a Node implementation linked with the Node.fz version of libuv passes all tests bar one (which required increasing the open file descriptor count before it would pass).

Based on the strength of the Node.js test suite, we conclude that Node.fz is a legal, viable alternative to Node.js.

It’s legal and viable, but is it useful?

The two key questions from the evaluation from my perspective are:

  • Does Node.fz improve the reproducibility of the bugs in the study?
  • Does Node.fz uncover novel bugs?

For the tests, the following scheduler parameters were used:

We ran the test case used to reproduce each of the known bugs 100 times for each version of Node.js. We ran 100 tests because this is roughly the number of rounds of testing we ourselves use before declaring our own software “relatively bug free”; a tool that cannot cause a bug to manifest in 100 iterations is probably impractical.

(Let’s not go down the “I ran it a few times and it seems to work so let’s put it in production” rabbit-hole right now!).

Here are the reproduction experiment results:

In the chart above, nodeV is vanilla Node,js, nodeNFZ is the Node.fz version, but with fuzzing turned off, and nodeFZ is the full fuzzing Node.fz.

Overall, the use of even the generic standard parameterization clearly offers a marked improvement in bug reproduction, indicating that it will also increase the rate of novel bug manifestation.

When running the full test suites for the applications containing the study bugs, Node.fz found a total of three new bugs after 50 iterations. That doesnt’ seem like very many…

We believe that we discovered relatively few novel bugs for three reasons. First, the software we studied is relatively mature, so many race conditions have already been addressed. Second, without expertise on each piece of software, we could only report bugs that caused a crash or a test failure; others may have gone unnoticed. Third, manual inspection of the suites suggested that tests are typically unit tests rather than functional or system tests, and we feel that the latter types of tests are more likely to expose race conditions in software.

I emailed the first author to see if Node.fz is available anywhere for you to try it with your own projects. The code is currently being updated to work with the latest version of Node / libuv, but should be available within a couple of weeks. I’ll keep you posted via twitter and an update to this post once it’s there.

Update: Node.fz is now available here https://github.com/VTLeeLab/NodeFz. Pull requests to help port the work to the latest version of Node.js (libuv) are welcome.