How bad can it git? Characterizing secret leakage in public GitHub repositories

How bad can it git? Characterizing secret leakage in public GitHub repositories Meli et al., NDSS’19

On the one hand you might say there’s no new news here. We know that developers shouldn’t commit secrets, and we know that secrets leaked to GitHub can be discovered and exploited very quickly. On the other hand, this study goes much deeper, and also provides us with some very actionable information.

…we go far beyond noting that leakage occurs, providing a conservative longitudinal analysis of leakage, as well as analyses of root causes and the limitations of current mitigations.

In my opinion, the best time to catch secrets is before they are ever committed in the first place. A git pre-commit hook using the regular expressions from this paper’s appendix looks like a pretty good investment to me. The pre-commit hook approach is taken by TruffleHog, though as of the time this paper was written, TruffleHog’s secret detection mechanisms were notably inferior (detecting only 25-29%) to those developed in this work (§ VII.D). You might also want to look at git-secrets which does this for AWS keys, and is extensible with additional patterns. For a belt and braces approach, also consider a GitHub integration that alerts you if a key does slip through the net, so that you can immediately revoke it.
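To make the pre-commit idea concrete, here is a minimal sketch of what such a hook could look like. It is my own illustration, not TruffleHog's or git-secrets' implementation: the function names are hypothetical, and the three patterns are a small sample of well-known key formats rather than the full set from the paper's appendix.

```python
import re
import subprocess
import sys

# Illustrative patterns only (assumption: extend with the paper's appendix).
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Google API key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "RSA private key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(diff_text):
    """Return (label, match) pairs found on added lines of a unified diff."""
    findings = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++ "):
            continue
        for label, pattern in SECRET_PATTERNS.items():
            for match in pattern.findall(line):
                findings.append((label, match))
    return findings


def main():
    # "git diff --cached" shows what is staged for the commit being created.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    findings = find_secrets(diff)
    for label, match in findings:
        # Print only a prefix so the hook does not echo the whole secret.
        print(f"possible {label}: {match[:8]}...", file=sys.stderr)
    # A non-zero return value, used as the exit status, makes git abort the commit.
    return 1 if findings else 0
```

To try it, one option is to save the script as .git/hooks/pre-commit, make it executable, add a Python shebang line, and end the file with `sys.exit(main())`.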

Update: Rod Johnson has put together a great blog post describing how to tackle this challenge using Atomist. You can see the ‘secret-beagle’ implementation on GitHub, and even watch the twitch stream of Rod and Jessica Kerr pairing on the implementation! Disclaimer: Accel is an investor in Atomist, and I was an early advisor to the company.

To motivate you to go ahead and do that, here are the headline results from the paper:

Headline results

It’s easy to detect secret leakage in real-time using the GitHub APIs, even in an implementation using just a single API key and staying within the default rate limits. The approach in this paper achieves 99% coverage of target files.
Despite best intentions, lots of keys do get leaked (a median of 1,793 unique keys every day).
Using the search approach outlined in the paper, the median time to discovery for a key leaked to GitHub is 20 seconds, with times ranging from half a second to over 4 minutes. In other words, in the time it takes you to go “oh s*&#! Did I just…?” and take a look, it’s probably already too late.
A very high percentage (89%) of leaked keys genuinely appear to be sensitive. I.e., they are not test keys.
With ‘multi-factor’ secrets (e.g. Google OAuth IDs), when one part of the pair is leaked, there’s an 80% chance the other part is too.
Many leaked secrets remain in GitHub repos for a long time (81% remain after 16 days).
Even rewriting history is not sufficient to hide a key from view once it has been leaked. (You need to revoke it immediately, which clearly you were going to do, right?)

Without protection in place, this is simply too easy a mistake to make. (That is, if you have a process that lets this happen, don’t blame the developer!)

Our data shows that even high-value targets run by experienced developers can leak secrets. In one case, we found what we believe to be AWS credentials for a major website relied upon by millions of college applications in the United States, possibly leaked by a contractor. We also found AWS credentials for the website of a major government agency in a Western European country…

How to detect secrets in GitHub

There are (at least) two good sources of information for secret detection: the GitHub search API and the GitHub public dataset maintained in Google BigQuery. The first phase of the process is to query for candidate files which may contain secrets, using a carefully crafted set of search terms:

Given a set of candidate files, the next thing you’re going to need is a set of regular expressions for popular key formats. The authors examined the key structures of a number of popular systems and services to uncover these. The resulting regular expressions proved to be highly accurate in uncovering secrets, and you’ll find them in the appendix. For example:


The regular expressions can then be used to scan the candidate files from the first phase, with any matches considered “candidate secrets”. These candidate secrets are then passed through a set of filters designed to reduce false positives. For example, the key pattern “AKIAXXXEXAMPLEKEYXXX”, as used perhaps in a test case, would be filtered out at this stage. The set of keys that pass through these filters are considered to be true leaked secrets. The authors manually verified a random sample of keys and from this process estimated that 89.1% of all discovered secrets truly are sensitive.
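The scan-then-filter pipeline described above can be sketched in a few lines. This is an illustration under my own assumptions, not the authors' code: the two patterns are a sample, the input shape (a dict of path to file contents) is hypothetical, and `looks_real` is a placeholder for whatever validity filters you choose.

```python
import re

# Two illustrative patterns; the paper's appendix lists many more.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
}


def candidate_secrets(text):
    """Every regex match in a candidate file is treated as a candidate secret."""
    return {(kind, m) for kind, rx in PATTERNS.items() for m in rx.findall(text)}


def leaked_secrets(files, looks_real):
    """Map each secret that survives the filter to the set of paths it appears in.

    files: dict of path -> file contents (hypothetical input shape).
    looks_real: predicate standing in for the validity filters.
    """
    found = {}
    for path, text in files.items():
        for kind, secret in candidate_secrets(text):
            if looks_real(secret):
                found.setdefault((kind, secret), set()).add(path)
    return found
```

Keying the result on the secret string rather than the file means the same key committed to several files is counted once, which matches the paper's focus on unique secrets.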

GitHub search

The GitHub search API is rate limited. But using a single API key, it is still possible to run all of the candidate file discovery queries once every thirty minutes. The authors found that this is frequent enough to give ~99% coverage of all possible files that could have been discovered with no rate limiting in place.

This result shows that a single user operating legitimately within the API rate limits imposed by GitHub is able to achieve near perfect coverage of all files being committed on GitHub for our sensitive search queries. Of course, a motivated attacker could obtain multiple API keys and achieve full coverage.
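For a feel of what polling the Search API involves, here is a hedged sketch that builds (but does not send) a code search request and works out how far apart requests must be spaced to fit one full pass into a thirty-minute cycle. The helper names, the query string and the request count are hypothetical; the paper's actual search terms and GitHub's current rate limits are not reproduced here.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/code"


def build_search_request(query, token):
    """Build (but do not send) a code search request, newest-indexed results first."""
    params = urllib.parse.urlencode({"q": query, "sort": "indexed", "order": "desc"})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def seconds_between_requests(num_requests, cycle_seconds=30 * 60):
    """Spacing that spreads one full pass of requests evenly over a cycle."""
    return cycle_seconds / num_requests
```

For example, if one pass over all queries and result pages needed 600 requests (an invented figure), `seconds_between_requests(600)` gives a spacing of 3 seconds between requests. Passing the request to `urllib.request.urlopen` would actually send it.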

In another test, the authors deliberately pushed “secrets” to a known repository and queried the API continuously until the string appeared. The experiment was repeated once a minute over a 24-hour period.

We found that the median time to discovery was 20 seconds, with times ranging from half a second to over 4 minutes, and no discernible impact from time-of-day. Importantly, this experiment demonstrates that our Search API approach is able to discover secrets almost immediately, achieving near real-time discovery of secrets.

GitHub Search API collection ran from October 31st 2017 to April 20th 2018. During this time 4.4M candidate files were collected, of which 307K contained just over 400K secrets in total. The search uncovered a median of 1,793 unique secrets per day.

Google BigQuery dataset

Google maintain a BigQuery dataset of public GitHub repos with license files. This was queried on April 4th, 2018, when it was possible to scan 2.3B files from 3.3M repos. 73,799 unique secret strings were uncovered.

Validity Filters

Three validity filters were used to remove false positives:

An entropy filter, which catches secrets with very low entropy
A words filter, which catches secrets containing common dictionary words of length at least 5
A pattern filter looking for repeated characters (e.g. ‘AAAA’), ascending characters (‘ABCD’) and descending characters (‘DCBA’)
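The three filters are simple enough to sketch. The version below is my own approximation, not the paper's implementation: the fixed entropy cutoff, the tiny word list and the run length of 4 are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-in for a real dictionary of common words.
COMMON_WORDS = {"example", "sample", "secret", "password", "dummy"}


def shannon_entropy(s):
    """Shannon entropy of the string, in bits per character."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def contains_word(s, min_len=5):
    """True if s contains a listed word of at least min_len characters."""
    lowered = s.lower()
    return any(w in lowered for w in COMMON_WORDS if len(w) >= min_len)


def has_run(s, step, length=4):
    """True if s has `length` consecutive chars, each `step` code points apart.

    step=0 finds repeats (AAAA), 1 ascending (ABCD), -1 descending (DCBA).
    """
    run = 1
    for prev, cur in zip(s, s[1:]):
        run = run + 1 if ord(cur) - ord(prev) == step else 1
        if run >= length:
            return True
    return False


def passes_filters(s, min_entropy=3.0):
    """True if the candidate survives all three filters (assumed thresholds)."""
    if shannon_entropy(s) < min_entropy:
        return False
    if contains_word(s):
        return False
    return not any(has_run(s, step) for step in (0, 1, -1))
```

With these settings a documentation-style key such as AKIAIOSFODNN7EXAMPLE is rejected by the words filter, while a string of mostly repeated characters fails the entropy check first.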

At least according to these filters though, only a tiny proportion of the secrets detected by the regular expression were actually dummy secrets. 99.29% of strings matched by the regexes passed the filters. If you were implementing this in your own organisation, it would be much simpler to adopt a convention such as ‘any dummy key must contain the word EXAMPLE’ and then just look for that.

What kinds of secrets did the team find?

The authors found leaked examples of every key type they looked for! The following table gives the breakdown, with the most commonly leaked keys being for Google APIs.

I suspect that the probability of leakage is pretty constant across all key types, and the table more reflects the popularity of the different APIs as used in GitHub repos.

History rewriting

It is obvious that adversaries who monitor commits in real time can discover leaked secrets, even if they are naively removed. However, we discovered that even if commit histories are rewritten, secrets can still be recovered…. we discovered we could recover the full contents of deleted commits from GitHub with only the commit’s SHA-1 ID.

The required commit hashes can be recovered with trivial effort via the Events API. Historical data from this API is also available through the GHTorrent project.
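To see why the Events API makes this so easy, here is a small sketch that pulls commit SHAs out of an events listing and builds the URL at which a commit can be requested by SHA. The field names (`before`, `head`, `commits[].sha`) reflect my understanding of the PushEvent payload and should be treated as assumptions, since the payload shape can differ between API versions; the helper names are hypothetical.

```python
ZERO_SHA = "0" * 40  # git's placeholder for "no previous commit"


def commit_shas_from_events(events):
    """Collect commit SHAs mentioned by PushEvent entries in an events listing."""
    shas = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        payload = event.get("payload", {})
        # "before" is the interesting one after a force push: it names the
        # commit that the rewritten history no longer reaches.
        candidates = [payload.get("before"), payload.get("head")]
        candidates += [c.get("sha") for c in payload.get("commits", [])]
        for sha in candidates:
            if sha and sha != ZERO_SHA and sha not in shas:
                shas.append(sha)
    return shas


def commit_api_url(owner, repo, sha):
    """URL of the REST endpoint that returns a single commit by SHA."""
    return f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
```

The events themselves would come from a listing such as the repository events endpoint; the point is that nothing here needs access to the repository's current branch history.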

The last word

This work shows that secret leakage on public repository platforms is rampant and far from a solved problem, placing developers and services at persistent risk of compromise and abuse.

GitHub has a token scanning feature in beta which looks for tokens from a limited set of providers, and notifies the provider (not you) if a key is committed to a public repository.