DéjàVu: A map of code duplicates on GitHub – Lopes et al., OOPSLA ’17
‘DéjàVu’ drew me in with its attention-grabbing abstract:
That means there’s an 82% chance the file you’re looking at has a duplicate somewhere else in GitHub. My immediate thought is “that can’t possibly be right!” The results seem considerably less dramatic once you understand the dominant cause though.
The motivation for the study was to aid in selecting random samples of code bases to be used as the basis for other studies (it’s common in software engineering research to analyse projects on GitHub). What these results show is that simple random selection is likely to produce samples with high levels of duplication, which may bias the results of research. The clone map produced by the authors can be used to understand the similarity relations in samples of projects, or to curate samples to reduce duplicates. DéjàVu is a publicly available index of code duplication.
I’ll break the rest of this write-up into three sections – first let’s understand what the authors mean when they say “duplicate”, then we’ll look at the duplication findings, and finally we’ll get to the question of what’s causing there to be so much duplication in the first place.
At the outset of this work, we were planning to study different granularities of duplication. As the results came in, the staggering rate of file-level duplication drove us to select three simple levels of similarity. A file hash gives a measure of files that are copied across projects without changes. A token hash captures minor changes in spaces, comments, and ordering. Lastly, SourcererCC captures files with 80% token similarity. This gives an idea of how many files have been edited after cloning.
To create the token hash all comments, white space and terminals are removed from the file, and then tokens are grouped by frequency. This results in strings such as “… (void, 2), (print, 2), (System, 1), …”. The token hash is an MD5 of the tokenized output string.
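To make the difference between the file hash and the token hash concrete, here is a minimal sketch in Python. It is not the authors’ tooling: the regex-based tokenizer, the function names, and the alphabetical ordering of the (token, count) pairs are all assumptions of mine, and the comment stripping only understands C-style comments (it will also misfire on comment markers inside string literals).

```python
import hashlib
import re
from collections import Counter

# Hypothetical, simplified tokenizer: strips C-style comments and keeps
# only identifiers and integer literals. Punctuation and operators are dropped.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+")


def file_hash(source: str) -> str:
    """Hash of the raw file contents: only byte-identical copies match."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def token_string(source: str) -> str:
    """Render the file as '(token, count), ...' pairs.

    Sorting the pairs alphabetically is an assumption made here so that the
    output is deterministic regardless of token order in the file.
    """
    without_comments = COMMENT_RE.sub(" ", source)
    counts = Counter(TOKEN_RE.findall(without_comments))
    return ", ".join(f"({tok}, {n})" for tok, n in sorted(counts.items()))


def token_hash(source: str) -> str:
    """MD5 of the tokenized string: ignores spacing, comments, and ordering."""
    return hashlib.md5(token_string(source).encode("utf-8")).hexdigest()
```

Two files that differ only in whitespace, comments, or statement order get different file hashes but the same token hash, which is the distinction the paper’s middle level of similarity is after.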
In addition to file-level duplication, the authors also look at project overlap – i.e. projects that contain some number of files in common. A statement “A is cloned in B at x%” means that x% of the files in project A can also be found in project B.
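The “A is cloned in B at x%” relation is worth spelling out because it is not symmetric. A small sketch, assuming each project is represented as a collection of per-file hashes (the function name and that representation are my own, not taken from the paper):

```python
from typing import Iterable


def clone_percentage(hashes_a: Iterable[str], hashes_b: Iterable[str]) -> float:
    """Percentage of the files in project A that also appear in project B.

    Files are compared by hash (file hash or token hash, whichever level of
    similarity is being studied). An empty project A yields 0.0 by convention,
    which is an assumption made for this sketch.
    """
    files_a = list(hashes_a)
    if not files_a:
        return 0.0
    in_b = set(hashes_b)
    shared = sum(1 for h in files_a if h in in_b)
    return 100.0 * shared / len(files_a)
```

With a four-file project A and a three-file project B that share two files, A is cloned in B at 50%, while B is cloned in A at roughly 67%. A small project that has been copied wholesale into a large one shows up as 100% in one direction and a much lower figure in the other.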
What the analysis reveals
Looking across the whole corpus, the authors find the following levels of duplication:
There’s one outlier we should probably remove straight away – the empty file with size 0 is duplicated 2.2M times! Another trivial file that is frequently duplicated is a file containing just 1 empty line. The authors redid the analysis, this time excluding small files with fewer than 50 tokens. This leaves us with the following picture when looking at a file level:
When we look at the project level, we find high levels of cloning between projects too:
Why so many duplicates?
The answers can be found in project dependencies, popular frameworks, and in code generation.
node_modules directory, they are ultimately responsible for almost 70% of the entire files. If ever you have felt like you are downloading the universe when running npm install, here’s the data to prove it: including nested dependencies (nesting up to 47 levels deep was discovered, with median 5) the number of unique included projects has median 63, and maximum 1261.
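If you want to get a feel for these numbers on your own projects, one rough way is to look at the committed file paths. The sketch below is an assumption about how nesting depth and included packages could be measured, not a description of how the authors did it: depth is the number of node_modules components in a path, and a package name is whatever directory immediately follows node_modules (so scoped packages such as @scope/name are only counted by their scope).

```python
from pathlib import PurePosixPath
from typing import Iterable, Set


def node_modules_depth(path: str) -> int:
    """Number of nested node_modules directories along a file path."""
    return PurePosixPath(path).parts.count("node_modules")


def included_packages(paths: Iterable[str]) -> Set[str]:
    """Names of the directories found directly under any node_modules folder."""
    names = set()
    for p in paths:
        parts = PurePosixPath(p).parts
        # Skip the final component so parts[i + 1] is always a valid index.
        for i, part in enumerate(parts[:-1]):
            if part == "node_modules":
                names.add(parts[i + 1])
    return names
```

Taking the maximum of node_modules_depth over all paths in a repository gives its deepest nesting level, and the size of the set returned by included_packages gives a count of distinct vendored packages.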
Some of the worst culprits are projects generated by the Angular Full Stack Generator, and by Yeoman.
The presence of external libraries within the projects’ source code shows a form of dependency management that occurs across languages, namely, some dependencies are source-copied to the projects and committed to the projects repositories, independent of being installed through a package manager or not.
Investigating SourcererCC duplicates reveals another kind of cloning. 20 clone pairs were randomly selected for analysis and categorised into (i) intentional copy-paste clones, (ii) unintentional accidental clones, and (iii) auto-generated clones.
It is interesting to note that clones in categories ii) and iii) are both unavoidable and created because of the use of the popular frameworks.
The majority of the clone pairs fall into the auto-generated category – generated for example by Apache Axis, Android, and JAXB in the Java universe, Django in Python (
The source control system upon which GitHub is built, Git, encourages forking projects and independent development of those forks… However, there is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and instead, goes in via copy and paste of files and even entire libraries.
npm shrinkwrap, but this is not without its own challenges. A compromise is to make a dedicated node_modules repository and commit that into git; see, for example, ‘My node_modules are in git again.’ A quick search on GitHub still shows plenty of node_modules commit activity as of the time of writing.