Understanding real-world concurrency bugs in Go

Understanding real-world concurrency bugs in Go Tu, Liu et al., ASPLOS’19

The design of a programming (or data) model not only makes certain problems easier (or harder) to solve, but also makes certain classes of bugs easier (or harder) to create, detect, and subsequently fix. Today’s paper choice studies concurrency mechanisms in Go. Before we dive in, it might be interesting to pause for a moment and consider your own beliefs about Go, which may well include some of the following:

  • Go was explicitly designed to make concurrent programming easier and less error-prone
  • Go makes concurrent programming easier and less error-prone
  • Go programs make heavy use of message passing via channels, which is less error-prone than shared memory synchronisation
  • Go programs have fewer concurrency bugs
  • Go’s built-in deadlock and data race detectors will catch any (most?) bugs you do let slip into your code

The first of those statements is true. For the remaining statements, you can use the data from this research to re-evaluate how strongly you want to hold those opinions…

We perform the first systematic study on concurrency bugs in real Go programs. We studied six popular Go software [projects] including Docker, Kubernetes, and gRPC. We analyzed 171 concurrency bugs in total, with more than half of them caused by non-traditional, Go-specific problems. Apart from root causes of these bugs, we also studied their fixes, performed experiments to reproduce them, and evaluated them with two publicly-available Go bug detectors.

The six applications studied were Docker, Kubernetes, etcd, CockroachDB, gRPC, and BoltDB, so that’s a lot of important real-world Go code right there.

The analysis begins by studying how these applications actually make use of Go concurrency primitives, before going on to study concurrency-related bugs from their issue trackers. These bugs are categorised on two main dimensions: the observed behaviour (blocking or non-blocking), and the type of concurrency primitive that is the cause (shared memory or message passing). Let’s begin with a very quick recap of the main concurrency mechanisms in Go.

Concurrency in Go

A major design goal of Go is to improve traditional multi-threaded programming languages and make concurrent programming easier and less error-prone. For this purpose, Go centers its multi-threading design around two principles: 1) making threads (called goroutines) lightweight and easy to create and 2) using explicit messaging (via channels) to communicate across threads.

Goroutines are lightweight user-level threads (‘green’ threads). A goroutine is created by adding the keyword go before a function call, including calls to anonymous functions. Local variables declared before an anonymous function are accessible within it, and potentially shared. Channels are used to send data and state across goroutines, and may be buffered or unbuffered. When using unbuffered channels a goroutine will block on send (receive) until another goroutine is receiving (sending). The select statement allows a goroutine to wait on multiple channel operations; if more than one case is valid, Go selects one at random. Go also has traditional synchronisation primitives including mutexes, condition variables, and atomic variables.
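
As a quick orientation (my own sketch, not code from the paper), here are those primitives together: a goroutine started from an anonymous function, an unbuffered and a buffered channel, and a select whose choice among ready cases is random.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	greeting := "hello" // declared before the anonymous function below, so shared with it

	unbuffered := make(chan string)  // a send blocks until another goroutine receives
	buffered := make(chan string, 1) // one send can complete without a waiting receiver

	// The go keyword before a function call (here an anonymous function) starts a goroutine.
	go func() {
		buffered <- greeting + " via buffered chan"     // returns immediately: buffer has room
		unbuffered <- greeting + " via unbuffered chan" // blocks until main receives below
	}()

	// select waits on several channel operations at once; when more than one
	// case is ready, Go picks among them at random.
	for i := 0; i < 2; i++ {
		select {
		case msg := <-unbuffered:
			fmt.Println(msg)
		case msg := <-buffered:
			fmt.Println(msg)
		case <-time.After(time.Second):
			fmt.Println("timed out")
			return
		}
	}
}
```

Run it a few times and the two lines can appear in either order; that select non-determinism comes back later as a source of bugs.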

How Go concurrency primitives are used in practice

The six applications make relatively heavy use of goroutines, especially with anonymous functions.

An especially interesting comparison can be made in the case of gRPC, which has both a C implementation and a Go implementation. The following table shows the ratio of the number of goroutines created in gRPC-Go compared to gRPC-C, when processing the same number of requests.

In this comparison goroutines tend to have shorter lifetimes than the threads created in the C version, but are created much more frequently. This more frequent use of goroutines is to be expected, as it’s something the Go language encourages.

If we look at the use of concurrency primitives across the board in all of the applications, a perhaps more surprising finding is that shared memory synchronisation operations are still used more often than message passing:

The most frequently used message-passing primitive is chan, responsible for between 18.5% and 43% of all usages. So we have a situation in which traditional shared memory communication is still heavily used, alongside a significant amount of message passing. From a bug perspective that means we have all the exciting bug possibilities that shared memory communication affords, together with all of the bug possibilities message passing affords, and bugs caused by the interaction of the two styles!

Go concurrency bugs

The authors searched the GitHub commit histories of the applications for commits fixing concurrency bugs (3,211 of them). From these, 171 were randomly selected for study.

Bugs are categorised into blocking bugs and non-blocking bugs. A blocking bug occurs when one or more goroutines are unintentionally stuck in their execution and cannot move forward. This definition is broader than deadlock: it includes situations with no circular wait, where goroutines instead wait for resources that no other goroutine will ever supply. 85 of the bugs were blocking, and 86 were non-blocking.

Bugs are also categorised in a second dimension according to whether they relate to shared memory protection (105 bugs) or message passing (66 bugs).

Blocking bugs

Looking at the blocking bugs first of all, 42% of these are related to shared memory, and 58% are related to message passing. Recall as well that shared memory primitives are more widely used than message passing primitives.

Contrary to the common belief that message passing is less error-prone, more blocking bugs in our studied Go applications are caused by wrong message passing than by wrong shared memory protection.

The shared memory based bugs include all the usual suspects, with a few new twists due to Go’s implementation of RWMutex and Wait (see §5.1.1).
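
One of those twists: with sync.RWMutex, a pending write lock blocks later read lock attempts, so taking a read lock recursively can deadlock. The paper has its own examples; the sketch below is mine, with readStats as a made-up helper.

```go
package main

import (
	"sync"
	"time"
)

var mu sync.RWMutex

// readStats is a hypothetical helper that takes its own read lock,
// not knowing that its caller already holds one.
func readStats() {
	mu.RLock() // recursive read lock on the same goroutine
	defer mu.RUnlock()
	// ... read shared state ...
}

func main() {
	go func() {
		time.Sleep(10 * time.Millisecond)
		mu.Lock() // a writer arrives while main still holds its read lock
		defer mu.Unlock()
		// ... update shared state ...
	}()

	mu.RLock()
	defer mu.RUnlock()
	time.Sleep(50 * time.Millisecond) // give the writer time to queue up
	readStats()                       // blocks behind the pending writer: deadlock
}
```

The sync.RWMutex documentation calls this out explicitly: recursive read locking is prohibited, because a blocked Lock call excludes new readers from acquiring the lock.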

For the message-passing related bugs, many are due to a missing send or receive on a channel, or a channel that is never closed.
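
A representative shape for these bugs (a simplified sketch of my own, with made-up names like finishReq and doWork): a child goroutine delivers its result over an unbuffered channel, and the parent stops listening when the timeout case fires first.

```go
package main

import "time"

// Result and doWork are hypothetical stand-ins for illustration.
type Result struct{ value int }

func doWork() *Result {
	time.Sleep(10 * time.Millisecond) // simulate slow work
	return &Result{value: 42}
}

func finishReq(timeout time.Duration) *Result {
	ch := make(chan *Result) // unbuffered: the send below needs a waiting receiver
	go func() {
		ch <- doWork() // blocks forever if the timeout case wins the select below
	}()
	select {
	case result := <-ch:
		return result
	case <-time.After(timeout):
		return nil // the child goroutine is leaked, stuck on its send
	}
}

func main() {
	finishReq(time.Millisecond)
	time.Sleep(50 * time.Millisecond) // the leaked goroutine is still blocked here
}
```

The usual fix is a buffered channel, make(chan *Result, 1), so the child can complete its send and exit even when nobody receives the value.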

All blocking bugs caused by message passing are related to Go’s new message passing semantics like channel. They can be difficult to detect especially when message passing operations are used together with other synchronization primitives. Contrary to common belief, message passing can cause more blocking bugs than shared memory.
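
To make that interaction concrete, here is a toy of my own (not code from the paper) in which a send inside a critical section deadlocks with a goroutine that needs the same mutex before it can reach its receive.

```go
package main

import "sync"

var (
	mu sync.Mutex
	ch = make(chan int) // unbuffered
)

// goroutineA sends on the channel while holding the mutex.
func goroutineA() {
	mu.Lock()
	ch <- 1 // blocks: the receiver below can't run until mu is released
	mu.Unlock()
}

// goroutineB needs the mutex before it ever reaches its receive.
func goroutineB() {
	mu.Lock() // blocks if goroutineA already holds mu
	defer mu.Unlock()
	<-ch
}

func main() {
	go goroutineA()
	goroutineB() // whichever goroutine wins the lock, both end up stuck
}
```

Whichever goroutine takes the lock first, each ends up waiting for something the other will never provide; neither the channel nor the mutex is at fault on its own, it is the combination that deadlocks.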

Investigating the fixes for these bugs showed that once understood, the fixes are fairly simple, and the type of fix is correlated with the bug cause. This suggests that fully-automated or semi-automated tools for fixing blocking bugs in Go may be a promising direction.

Go’s built-in deadlock detector was only able to detect two of the 21 blocking bugs reproduced in the study. That’s not too surprising: the runtime only reports a deadlock when every goroutine is blocked, so partial deadlocks in which other goroutines keep running go unnoticed.

Non-blocking bugs

More non-blocking bugs are caused by misuse of shared memory primitives than message passing. About half of these are caused by ‘traditional’ shared memory bug patterns. There are also several bugs caused by a lack of understanding of Go language features, especially the sharing of local variables declared before an anonymous function used in a goroutine, and the semantics of WaitGroups.
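
Both of those features trip people up in a handful of lines. The sketch below is mine rather than the paper’s, but it packs in two classic mistakes: capturing the loop variable in an anonymous goroutine (a data race before Go 1.22 changed loop-variable scoping), and calling wg.Add inside the goroutine so that wg.Wait can return before any work has been registered.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		go func() {
			wg.Add(1) // bug: Add races with the Wait below, which may not see it
			defer wg.Done()
			// bug (pre Go 1.22): i is shared with the loop, so these goroutines
			// race on it and typically all print its final value
			fmt.Println(i)
		}()
	}

	wg.Wait() // may return immediately, before any goroutine has run
}
```

The idiomatic fix is to call wg.Add(1) before the go statement and to pass the loop variable as an argument: go func(i int) { defer wg.Done(); fmt.Println(i) }(i).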

New programming models and new libraries that Go introduced to ease multi-thread programming can themselves be the cause of more concurrency bugs.

While message-passing based non-blocking bugs were comparatively less common, “the intricate design of message passing in a language can cause these bugs to be especially hard to find when combining with other language-specific features.”
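
The paper’s examples here are application-specific, but one typical instance of this kind of channel misuse (again a toy of my own) is letting more than one goroutine close the same ‘stop’ channel, which panics at runtime.

```go
package main

import "sync"

func main() {
	stop := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Bug: whichever worker gets here second panics with
			// "close of closed channel".
			close(stop)
		}()
	}
	wg.Wait()
}
```

Common fixes are to funnel the close through sync.Once, or to make a single owner goroutine responsible for closing the channel.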

Interestingly, programmers fixing bugs caused by misuse of shared memory primitives showed a preference for using message passing to do so.

Go’s data race detector was able to detect half of the reproduced non-blocking bugs in the study.

Wrapping up

Surprisingly, our study shows that it is as easy to make concurrency bugs with message passing as with shared memory, sometimes even more.

Programmers have to have a clear understanding of:

  • goroutine creation with anonymous functions
  • the usage of buffered vs unbuffered channels
  • the non-determinism of waiting for multiple channel operations using select
  • correct use of special libraries such as time

Although each of these features was designed to ease multi-threaded programming, in reality it is difficult to write correct Go programs with them.

I regret that I didn’t have space in this write-up to include the many illustrative code samples highlighting details of bugs. If you’re actively developing in Go, the full paper is well worth a read to study them.