To distribute or not to distribute? Why licensing bugs matter

To distribute or not to distribute? Why licensing bugs matter Vendome et al., ICSE’18

Software licensing can quickly get quite complicated, with over 100 known open source licenses out there, and distributions often including components with a mix of licenses. Unsurprisingly, developers find it hard to determine appropriate licenses for their work, and to interpret the implications of including third-party software under different licenses.

We present a large-scale qualitative study aimed at characterizing licensing bugs, with the goal of understanding the types of licensing bugs developers face, their legal and technical implications, and how such bugs are fixed.

The result is a helpful catalogue of seven different categories of licensing bugs, with 21 sub-categories in total between them. Although the authors are not lawyers (as far as I can tell), it still constitutes a very useful list of things to think about. “Our proposed catalog can serve as a reference for developers and lawyers dealing with potential licensing issues.”

The catalogue is drawn from an open coding exercise based on a statistically significant sample of 1,200 discussions randomly selected from a population of 59,426 discussions across a collection of issue trackers and mailing lists. The mailing lists were Apache’s legal-discuss, Debian’s debian-legal, Fedora’s fedora-legal-list, Gnome’s legal-list and OpenStack’s open-discuss. For issue trackers, the authors searched for issues containing the keyword “license” on all 136 Bugzilla issue trackers in the Bugzilla installation list, as well as the issue trackers of 86,032 GitHub projects (selected to try and make sure these were not toy projects).
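As a rough sanity check on that sample size, the sketch below computes the margin of error for 1,200 items drawn at random from 59,426. The 95% confidence level and the worst-case proportion of 0.5 are my assumptions, not figures quoted in this post.

```python
import math


def margin_of_error(sample_size, population_size, z=1.96, proportion=0.5):
    """Margin of error for a simple random sample, with finite population correction.

    z=1.96 (95% confidence) and proportion=0.5 (worst case) are assumed defaults.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction


moe = margin_of_error(1200, 59426)
print(f"margin of error: {moe:.1%}")
```

Under those assumptions the margin comes out a little under three percentage points, which is a reasonable basis for estimating how often each category occurs.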

Who cares about licensing?

Before diving into the catalogue itself, it’s worth briefly reviewing the different stakeholders involved in licensing issues: there are holders of IP (e.g. trademark holders, patent holders, copyright holders), lawyers, and lawmakers, and then we can also call out:

  • Integrators, that reuse open source software within their own systems
  • Package maintainers, who are responsible for maintaining packages and integrating patches or bug fixes.
  • Distributors – any individual or entity distributing software
  • Developers (in general)
  • Community – either people involved in a specific open source community, or the open source community as a whole.

Catalog overview

The taxonomy is composed of 21 distinct sub-categories organised in 7 distinct high-level categories. Due to space limitations we only discuss a subset of the sub-categories (14). The complete taxonomy description and frequencies of each category can be found in the attached appendix.

That appendix sounds like a useful resource. Unfortunately it’s not included in the only openly hosted version of the paper I could find (on the first author’s personal site, and linked at the top of this post). The descriptions we do get are still very useful though.

It is important to remark that the results discuss the interpretation of developers and/or legal practitioners. Therefore it is possible that the legality of these interpretations or discussions may change (e.g., new interpretations can cause new legal precedents in the U.S.A.), or the enforceability may change in different jurisdictions.

Selected licensing issues explored

Let’s take a brief look inside each of the seven major categories.

Laws and their interpretations

At the base level, there is confusion over what is copyrightable? Software is copyrightable, but higher level designs and ideas may fall out of scope. Disagreements on the scope of copyright can lead to difficulties.

A related issue is understanding what is a derivative work? (That is, a work partially owned by the copyright holder of the original work on which it is based.) “… one of the most important features of open source licenses is that they should allow the creation and redistribution of derivative works.” It’s often unclear whether B should be treated as a derivative work of A, or just something that uses / bundles A. For example, Linus Torvalds asserts that merely using the kernel by making system calls does not constitute creating a derivative work. There is still plenty of disagreement even on this though.

This is all further complicated by the fact that copyright, trademark, and patent laws are national in scope. Thus we often find clauses relating to choice of jurisdiction.

… we observed that clauses related to choice of jurisdiction were a controversial topic within Debian in terms of their impact on software’s freeness. However, the distribution may be impacted by external factors like trade restrictions to a particular country or distribution of what a country considers sensitive material. While organizations or communities may want to facilitate global reuse, the organizations and individuals must comply with these trade laws.

Policies of the ecosystem

This category concerns issues relating to the licensing policies of specific open source communities such as the Apache Software Foundation, Eclipse Foundation, and Debian. These give community guidelines that projects within the foundation are expected to follow. For example, projects at Eclipse under the EPL cannot ship external libraries under the LGPL as part of their distribution. This makes for more complex user installation procedures if users have to assemble the last mile themselves.

The FSF has specific guidelines on whether software with various licenses can be combined/derived alongside software licensed under the FSF licences.

You need to think more broadly than just source code: images, fonts, databases, text files and so on all need consideration…

Since IP clearance/evaluation extends to all bundled artifacts (not only source code and binaries), a non-free image or font could prevent the distribution of the software.

Potential license violations

Some licenses are incompatible with each other, and issues can arise when including dependencies or reusing source code that is incompatible with either the declared license, or with the license of other reused components. Generally such an issue impacts the ability to distribute the software. As a specific example, Apache License 2.0 is incompatible with GPL v2. (See the full list here).
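To make the idea concrete, here is a minimal sketch of a compatibility check over a project and its dependencies. The table holds only the one pair named above, written with SPDX-style identifiers. The dependency names and the find_conflicts helper are hypothetical, and a real table would need legal review.

```python
# Unordered license pairs treated as incompatible. Only the pair named in the
# text is listed; a real table would be far larger and need legal review.
INCOMPATIBLE_PAIRS = {
    frozenset({"Apache-2.0", "GPL-2.0-only"}),
}


def find_conflicts(declared_license, dependency_licenses):
    """Return (component, component) pairs whose licenses are listed as incompatible.

    Checks the declared license against every dependency, and every pair of
    dependencies against each other.
    """
    components = [("<project>", declared_license)] + sorted(dependency_licenses.items())
    conflicts = []
    for i, (name_a, license_a) in enumerate(components):
        for name_b, license_b in components[i + 1:]:
            if frozenset({license_a, license_b}) in INCOMPATIBLE_PAIRS:
                conflicts.append((name_a, name_b))
    return conflicts


deps = {"libfoo": "Apache-2.0", "libbar": "MIT"}  # hypothetical dependencies
conflicts = find_conflicts("GPL-2.0-only", deps)
print(conflicts)
```

Storing each pair as a frozenset keeps the lookup symmetric, so it does not matter which of the two components is the project and which is the dependency.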

Non-source code licensing

When evaluating license compliance, you also need to consider non-source code artefacts, and in particular the need to make the source of those artefacts available. In the GPL, for example, source is defined as “the preferred form of the work for making modifications to it”. So if you distribute a PDF of a document, you would also need to distribute the source that generates that PDF.

Documentation, like source code, is also protected by copyright.

Even documentation shipped in HTML format has been questioned, since HTML is not the preferred form for making changes.

Similar issues occur with other media such as fonts, images, and audio. An MP3 is likely not the preferred form for editing audio for example.
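One way to picture this kind of check is a script that flags shipped files in a typically generated format when no source is recorded for them. This is only a sketch: the format set, the file paths, and the missing_sources helper are illustrative assumptions, and whether a given file really is the “preferred form” remains a judgment call.

```python
from pathlib import PurePosixPath

# Formats the text mentions as unlikely to be the "preferred form" for editing.
# The set is illustrative, not a legal determination.
GENERATED_FORMATS = {".pdf", ".html", ".mp3"}


def missing_sources(shipped_files, source_for):
    """List shipped files in a generated format that have no recorded source."""
    return [
        path
        for path in shipped_files
        if PurePosixPath(path).suffix.lower() in GENERATED_FORMATS
        and path not in source_for
    ]


shipped = ["docs/manual.pdf", "sounds/alert.mp3", "src/main.c"]  # hypothetical paths
sources = {"docs/manual.pdf": "docs/manual.tex"}  # hypothetical source mapping
flagged = missing_sources(shipped, sources)
print(flagged)
```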

Licensing content

A license inconsistency occurs when there is a mismatch between the documented license and the actual source code licensing, e.g. inconsistencies between software licensing and the spec file documenting included licenses.
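A mismatch of this kind can be sketched as a comparison between the declared license and the license tag found in each file header. The example assumes files carry SPDX-License-Identifier tags. The file names and the find_inconsistencies helper are hypothetical, and the pattern only captures a single identifier rather than a full SPDX expression.

```python
import re

# Captures the first token after the tag; stops at whitespace or a closing "*".
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([^\s*]+)")


def find_inconsistencies(declared_license, file_contents):
    """Map each file to its header license when it differs from the declared one.

    Files with no SPDX tag are reported with None.
    """
    mismatches = {}
    for path, text in file_contents.items():
        match = SPDX_TAG.search(text)
        found = match.group(1) if match else None
        if found != declared_license:
            mismatches[path] = found
    return mismatches


files = {  # hypothetical file contents
    "a.py": "# SPDX-License-Identifier: MIT\nprint('hi')\n",
    "b.c": "/* SPDX-License-Identifier: GPL-2.0-only */\nint x;\n",
    "c.txt": "no header here\n",
}
mismatches = find_inconsistencies("MIT", files)
print(mismatches)
```

Reporting untagged files separately (as None) matters in practice, since a missing header is a different problem from a conflicting one.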

Other IP issues

Do you have the rights to use a contribution? This is the arena of CLAs (Contributor License Agreements) and CTAs (Copyright Transfer Agreements). The fundamental difference between the two is that in the former case the author retains the copyright, and grants a license. In the latter case the author transfers the copyright. Without either of these, how can you protect the integrity of your software package?

Projects that require CTAs/CLAs do it to reduce their legal risks… It is important to note that CLAs/CTAs are optional in the sense that an organization is not required to use them. However, it demonstrates that these open source communities would rather reject contributions than increase the legal risk of distributing code that may contain a license violation.

Another thorny area is patents. From the debian-legal mailing list. It’s hard (bordering on impossible?) to know what patents may apply to a piece of software, including patents going through the approval process which may later be granted. A number of licenses include specific clauses relating to patents and their litigation.

You also need to be careful to respect trademarks.

Licensing semantics

The final category includes licensing bugs relating to difficulties and/or confusion over the use of dual licensing or understanding the implications of particular clauses. As an example, developers considering migration to GPL 2.0+ need to consider the “or later” clause. How do you know you will agree with the terms of a future version of the GPL?