Database Protocol

Why did Science Commons create this protocol?

We spent almost two years working on the issue of database licensing for the sciences. Our goal was to provide scientific providers and users of data with legal and technical tools for databases – tools that would accelerate the rate of scientific discovery through advanced use of data.

To achieve this goal, we held a series of meetings, varying in size and scope, with a set of international stakeholders representing both the developed world and the developing world. We worked with diverse scientific communities including the life sciences, biodiversity, and geospatial communities. We asked for, and received, the advice of many members of the international Creative Commons community.

We heard a clear request for guidelines on how scientists and data can interact, legally, in a manner that maximizes the scientific utility of data. We also heard a real desire to look forward to a world in which databases live not under single-use terms, but in a web of federated databases, and to provide education on how current uses of the law might affect that usage. We heard a lot of confusion about how the various international data regimes might interact in a federated data world, and a desire for clarity about that world.

Our conclusion was that the answer is to make databases available in a manner that is legally accurate, simple for scientists to understand, and imposes low costs on both providers and consumers of data. The protocol lays out how to achieve these goals.

Why aren’t you just recommending a license?

At the beginning of our work, we released a FAQ on databases and Creative Commons licenses. This protocol is the evolution of that effort.

As we dug into the field, it became clear that, unlike in the cultural and software spaces, there was already a huge amount of information that met the goals of the project. The human genome, for example, is available in a legally accurate and simple format, with very low costs to its users. We wanted to make sure our work didn’t create a different legal space from data products like the genome (or other major life sciences databases).

We also wanted to fit into existing, community-driven efforts. That’s why we spent months working with Jordan Hatcher and Talis, the people behind the Community Database License, to ensure that the newer version of the License would conform to the protocol. That’s why we worked with the Open Knowledge Foundation, to ensure that the protocol fit into the OKF’s definition of Open Knowledge. Another of our goals from the beginning was to avoid legal barriers to the dissemination and reuse of data. Data integration is already enough of a technical headache that adding legal obstacles can make it essentially impossible.

You used to recommend the use of Creative Commons licenses. Why did you change?

Our FAQ on databases recommended the use of Creative Commons licenses under specific scenarios in order to help data providers and users address copyrightable elements of a database. In practice, however, applying the guidelines of the FAQ proved difficult for both data providers and users, due to uncertainty over the scope and applicability of copyright and other legal protections, as well as their jurisdictional variations. As we did our research, it became clear that using the traditional Creative Commons licenses in any way created a dependence on copyright, and that such a dependence came with negative unintended consequences. As we note in the protocol, even experienced attorneys have trouble knowing precisely where copyright starts and stops in any given database. It is therefore not appropriate to impose this level of burden on scientists. One nightmare scenario is the scientist unwittingly infringing copyright in a data integration project. Another is the scientist who posts a database of genome data, perhaps believing she is protected by a non-commercial clause, and who later sees that genome data in a commercial product. Because the data itself was not copyrightable, extracting and republishing it would very likely be legal. Only a reconstruction of the public domain achieves all the goals of the protocol: simplicity, accuracy, and low costs.

Why do you think it’s bad to use copyright licenses to encourage share-alike and attribution concepts on databases?

We recommend against share-alike and attribution requirements that use copyright for several reasons. Two are noted in the answer above: the problem of category errors, and the problem of unmet expectations. Another is the problem of “attribution stacking”: when you federate a query from 50,000 databases (not now, perhaps, but definitely within the 70-year duration of copyright!), will you be exposed to a lawsuit if you don’t formally attribute all 50,000 owners? In addition, “share-alike” licenses typically impose the condition that some or all derivative products be identically licensed. Such conditions have been known to create significant “license compatibility” problems under existing license schemes that employ them. In the context of data, license compatibility problems will likely create significant barriers to data integration and reuse for both providers and users of data.

But more broadly, copyright licenses aren’t, and shouldn’t be, the answer to all problems of closure and openness. There is an enormous amount of data in the public domain. Using copyright on data, even in pursuit of worthy goals like the propagation of the commons through share-alike, simply carries along the problems of copyrighted works to a place where in many cases those problems don’t exist.

Science itself is a broad area. Discipline by discipline, scientists have worked out norms for data in lieu of using copyright licenses. The Bermuda Rules on the Human Genome Project are a good example: a group of biologists got together and hammered out how genome data would be posted online. Within 24 hours of coming off the machines, the sequences went online. The genome centers that deposited the data got the first right to publish on the chromosome whose genetic sequence they were depositing, and anyone could publish on other aspects. Despite the absence of legal enforcement, these rules worked. Scientific norms like “naming and shaming” and peer review of grant proposals were powerful enough to enforce the rules, and no one was exposed to lawsuits based on copyright.

Contractually constructed restrictions or obligations on data are yet another problem. The HapMap database of human genetic variation began its life with a “click through” agreement that attempted to impose provisions mitigating potential patent problems arising from use of the database. But after more than a year, the HapMap consortium abandoned that license: it was preventing the integration of the database into larger systems, because integrators feared how the restriction might propagate. Contracts, like copyright licenses, can therefore frustrate the goal of “Freedom to Integrate” and the rapid progress of science that data integration promises.

Thus, given the potential for significantly negative unintended consequences of using copyright, the size of the public domain, and the power of norms inside science, we believe that copyright licenses and contractual restrictions are simply the wrong tool, even if those licenses and contracts are used with the best of intentions.

Why are you calling this Open Access Data?

Science Commons was founded on the ideas behind Open Access to the scholarly literature. And there’s a rich debate about “Free” and “Open” in lots of places that we simply wished to avoid in order to minimize confusion. There was no formal definition for Open Access Data and, as we moved through our research, it became clear to us that we wanted to situate a data definition for OA within the broader definition of OA to the literature.

How will you certify implementations as conforming to the protocol?

We are launching with certifications of the Open Data Commons License (which came out of the Talis efforts, and was drafted by Jordan Hatcher) and the CC0 legal tool. We are now developing a certification process for submission of conforming implementations and expect to release that process in 2008.