Let’s briefly look at how the authors collected their data before diving deeper into what the results themselves tell us.
Data gathering methodology
The team crawled the Alexa Top 75K websites (ALEXA) and also a random sample of 75K websites drawn from the
- Manually constructing a catalogue of all released versions of the 72 most popular open source libraries (using popularity statistics from Bower and Wappalyzer).
- Using static and dynamic analysis techniques to cope with the fact that developers often reformat, restructure, or append code, making it difficult to detect library usage in the wild.
- Implementing an in-browser causality tracker to understand why specific libraries are loaded by a given site.
To figure out why a certain library is being loaded, the authors developed a causality tree Chrome extension. Nodes in the tree are snapshots of elements in the DOM at a specific point in time, and edges denote “created by” relationships.
The median causality tree in ALEXA contains 133 nodes, and the median depth is 4 inclusions.
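To make the tree structure concrete, here is a minimal Python sketch. The `Node` class, its fields, and the example URLs are my own illustrative assumptions, not the authors' extension code. Depth is counted here as the number of "created by" edges on the longest path from the root, which is one plausible reading of "depth in inclusions".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A snapshot of a DOM element (hypothetical structure, for illustration)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        """Record that this element created a new element and return the child."""
        child = Node(label)
        self.children.append(child)
        return child


def count_nodes(node: Node) -> int:
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


def depth(node: Node) -> int:
    """Number of 'created by' edges on the longest root-to-leaf path."""
    if not node.children:
        return 0
    return 1 + max(depth(c) for c in node.children)


# Made-up page: an ad loader injects an iframe, which loads its own library copy.
root = Node("document: site.test")
root.add("script: /js/app.js")
ad = root.add("script: ads.test/loader.js")
frame = ad.add("iframe: ads.test/frame.html")
frame.add("script: ads.test/lib.min.js")

print(count_nodes(root), depth(root))
```

Walking up from a library's node to the root answers the "why is this loaded?" question: in this toy page, the library is there because of the ad loader, not because the site author asked for it.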
jQuery remains by far the most popular library, found on 84.5% of ALEXA sites. Note also that SWFObject (Adobe Flash) is still used on 10.7% of ALEXA sites despite being discontinued in 2013.
When externally loaded, scripts are mostly loaded from CDNs (note also the domain parking sites popping up in the long tail of the COM sites):
Overall though, there seems to be a pretty even split between internally hosted and CDN-delivered script libraries:
Distribution of vulnerable libraries
37.8% of ALEXA sites use at least one library version known to the authors to be vulnerable.
Highly-ranked websites tend to be less likely to include vulnerable libraries, but they are also less likely to include any detected library at all. Towards the lower ranks, both curves increase at a similar pace until they stabilise. While only 21 % of the Top 100 websites use a known vulnerable library, this percentage increases to 32.2 % in the Top 1 k before it stabilises in the Top 5 k and remains around the overall average of 37.8 % for all 75 k websites.
37.4% of the COM sites use at least one vulnerable library. Within the ALEXA grouping, financial and government sites are the worst, with 52% and 50% of sites containing vulnerable libraries respectively.
The following table shows the percentage of vulnerable copies in the wild for jQuery, jQ-UI, Angular, Handlebars, and YUI 3.
In ALEXA, 36.7% of jQuery inclusions are known vulnerable, when at most one inclusion of a specific library version is counted per site. Angular has 40.1% vulnerable inclusions, Handlebars has 86.6%, and YUI 3 has 87.3% (it is not maintained any more). These numbers illustrate that inclusions of known vulnerable versions can make up even a majority of all inclusions of a library.
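The "at most one inclusion of a specific library version is counted per site" rule matters for how these percentages come out. A rough sketch of that counting, assuming inclusions are recorded as `(site, library, version)` tuples (my own representation, with a made-up library and made-up data):

```python
from typing import Iterable, Set, Tuple

Inclusion = Tuple[str, str, str]  # (site, library, version), an assumed record shape


def vulnerable_share(inclusions: Iterable[Inclusion],
                     vulnerable: Set[Tuple[str, str]],
                     library: str) -> float:
    """Fraction of a library's inclusions that are known vulnerable,
    counting each (site, version) pair at most once."""
    unique = {(site, ver) for site, lib, ver in inclusions if lib == library}
    if not unique:
        return 0.0
    hits = sum(1 for _, ver in unique if (library, ver) in vulnerable)
    return hits / len(unique)


# Hypothetical observations for an imaginary library called "examplelib".
observed = [
    ("a.test", "examplelib", "1.0.0"),
    ("a.test", "examplelib", "1.0.0"),  # same version again on the same site
    ("a.test", "examplelib", "2.0.0"),
    ("b.test", "examplelib", "1.0.0"),
    ("c.test", "examplelib", "2.0.0"),
]
known_vulnerable = {("examplelib", "1.0.0")}  # illustrative, not real advisory data

share = vulnerable_share(observed, known_vulnerable, "examplelib")
print(f"{share:.0%}")
```

Without the deduplication step, a single site that includes the same old version a dozen times would skew the figure.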
Many libraries, it turns out, are not directly included by the site but are pulled in by other libraries that are. “Library inclusions by ad, widget, or tracker code appear to be more vulnerable than unrelated inclusions.”
Another interesting analysis is the age of the included libraries – the data clearly shows that the majority of web sites use library versions released a long time ago, suggesting that developers rarely update their library dependencies once they have deployed a site. 61.7% of ALEXA sites are at least one patch version behind on one of their included libraries, and the median ALEXA site uses a version released 1,177 days before the newest release of the library. Literally years out of date.
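As a sketch of how such a lag figure could be computed (not necessarily how the authors did it), assume we know each release date and which version each site uses. The library, dates, and sites below are invented:

```python
from datetime import date
from statistics import median
from typing import Dict

# Hypothetical release history of an imaginary library.
releases: Dict[str, date] = {
    "1.0.0": date(2012, 1, 10),
    "1.0.1": date(2012, 3, 1),
    "2.0.0": date(2015, 6, 1),
}


def lag_days(used: str, history: Dict[str, date]) -> int:
    """Days between the release of the version in use and the newest release."""
    newest = max(history.values())
    return (newest - history[used]).days


# Hypothetical sites and the version each one was observed using.
site_versions = {"a.test": "1.0.0", "b.test": "1.0.1", "c.test": "2.0.0"}

lags = {site: lag_days(ver, releases) for site, ver in site_versions.items()}
median_lag = median(lags.values())
print(lags, median_lag)
```

A site on the newest release has a lag of zero, so a median in the thousands of days means at least half the sites are far behind.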
If you like a little non-determinism in your web app (I find it always makes debugging much more exciting ;) ), then another interesting find is that many sites include the same libraries (and multiple versions thereof) many times over!
We discuss some examples using jQuery as a case study. About 20.7 % of the websites including jQuery in ALEXA (17.2 % in COM) do so two or more times. While it may be necessary to include a library multiple times within different documents from different origins, 4.2 % of websites using jQuery in ALEXA include the same version of the library two or more times into the same document (5.1 % in COM), and 10.9 % (5.7 %) include two or more different versions of jQuery into the same document. Since jQuery registers itself as a window-global variable, unless special steps are taken only the last loaded and executed instance can be used by client code. For asynchronously included instances, it may even be difficult to predict which version will prevail in the end.
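The "last one wins" behaviour is easy to model. The following is a Python stand-in for the browser's global object, not browser code: each simulated script simply assigns its version to the same global name, so whichever runs last is the one client code sees.

```python
from typing import Dict, List


def load_scripts(execution_order: List[str]) -> Dict[str, str]:
    """Simulate scripts that each register themselves under one global name."""
    window: Dict[str, str] = {}  # stand-in for the browser's global object
    for version in execution_order:
        window["jQuery"] = version  # a later assignment replaces the earlier one
    return window


# Synchronous includes execute in document order.
in_document_order = load_scripts(["1.7.2", "3.1.0"])

# With asynchronous includes, execution order depends on which download
# finishes first; the reversed order models one other possible outcome.
other_arrival_order = load_scripts(["3.1.0", "1.7.2"])

print(in_document_order["jQuery"], other_arrival_order["jQuery"])
```

Same page, same two script tags, two different surviving versions depending on timing.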
What can be done?
So where does all this leave us?
From a remediation perspective, the picture painted by our data is bleak. We observe that only very small fraction of potentially vulnerable sites (2.8 % in ALEXA, 1.6 % in COM) could become free of vulnerabilities by applying patch-level updates, i.e., an update of the least significant version component, such as from 1.2.3 to 1.2.4, which would generally be expected to be backwards compatible. The vast majority of sites would need to install at least one library with a more recent major or minor version, which might necessitate additional code changes due to incompatibilities.
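To see what "fixable by a patch-level update" means in practice, here is a small sketch. It assumes plain `major.minor.patch` version strings and a known set of vulnerable versions; the function names and the data are mine, for illustration only.

```python
from typing import Iterable, Optional, Set, Tuple

Version = Tuple[int, int, int]


def parse(v: str) -> Version:
    """Parse 'major.minor.patch' into a tuple of ints for numeric comparison."""
    major, minor, patch = (int(p) for p in v.split("."))
    return (major, minor, patch)


def patch_level_fix(used: str,
                    releases: Iterable[str],
                    vulnerable: Set[str]) -> Optional[str]:
    """Return the closest newer, non-vulnerable release that shares the
    major.minor of `used`, or None if only a minor/major upgrade would help."""
    u = parse(used)
    candidates = [
        r for r in releases
        if parse(r)[:2] == u[:2] and parse(r) > u and r not in vulnerable
    ]
    return min(candidates, key=parse) if candidates else None


all_releases = ["1.2.3", "1.2.4", "1.3.0", "2.0.0"]  # hypothetical library

easy = patch_level_fix("1.2.3", all_releases, vulnerable={"1.2.3"})
hard = patch_level_fix("1.2.3", all_releases, vulnerable={"1.2.3", "1.2.4"})
print(easy, hard)
```

In the second case the whole 1.2.x line is affected, so the site has to move to 1.3.0 or 2.0.0, which is exactly the situation the authors say most sites are in.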
Version aliasing could potentially help (specifying only a library prefix, and allowing the CDN to return the latest version), but only a tiny percentage of sites use it (would you trust the developers of those libraries not to break your site, completely outside of your control?). Note that:
Google recently discontinued this service, citing caching issues and “lack of compatibility between even minor versions.”
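For reference, the resolution step behind version aliasing amounts to something like the sketch below: take the requested prefix and return the highest available version that starts with it. This is my own minimal model, not any CDN's actual implementation.

```python
from typing import List, Optional, Tuple


def parse(v: str) -> Tuple[int, ...]:
    """Turn '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))


def resolve_alias(prefix: str, available: List[str]) -> Optional[str]:
    """Return the newest available version matching a prefix such as '1' or '1.2'."""
    want = parse(prefix)
    matches = [v for v in available if parse(v)[:len(want)] == want]
    return max(matches, key=parse) if matches else None


hosted = ["1.2.3", "1.2.10", "1.3.0", "2.0.1"]  # hypothetical hosted versions

print(resolve_alias("1.2", hosted))  # newest 1.2.x
print(resolve_alias("1", hosted))    # newest 1.x.y
```

Note that the `1` alias silently jumps across minor versions, which is where the compatibility worry comes from.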
We need proper dependency management which makes it clear which versions of libraries are being used, coupled with knowledge within the supply chain of vulnerabilities. “This functionality would ideally be integrated into the dependency management system of the platform so that a warning can be shown each time a developer includes a known vulnerable component from the central repository.”
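A toy version of such a warning might look like this. The manifest format and the advisory lookup table are assumptions of mine; a real system would pull advisories from whatever vulnerability database the platform maintains.

```python
from typing import Dict, List, Set


def check_manifest(manifest: Dict[str, str],
                   advisories: Dict[str, Set[str]]) -> List[str]:
    """Return one warning per declared dependency whose exact version
    appears in the advisory table."""
    warnings = []
    for name, version in sorted(manifest.items()):
        if version in advisories.get(name, set()):
            warnings.append(f"WARNING: {name}@{version} has a known vulnerability")
    return warnings


# Hypothetical manifest and advisory data for imaginary libraries.
manifest = {"examplelib": "1.2.3", "otherlib": "4.0.0"}
advisories = {"examplelib": {"1.2.3", "1.2.2"}}

messages = check_manifest(manifest, advisories)
for line in messages:
    print(line)
```

The check itself is trivial; the hard part, as the next point makes clear, is having trustworthy advisory data to feed it.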
Of course, that can only work if we have some way of figuring out which libraries are vulnerable in the first place. The state of the practice here is pretty damning:
Consider jQuery, one of the most widely used libraries:
Since we also know that many libraries are only indirectly loaded by web sites, and are brought in through third-party components such as advertising, tracking, and social media code, even web developers trying to stay on top of the situation may be unaware that they are indirectly introducing vulnerable code into their websites.