Remembering Babel: Open Data Sharing & Integration

November 19th, 2009 by Thinh

Since the release of CC0, I’ve been talking to many people about when and how to use it. A group of scientists and science policy experts recently endorsed public domain data sharing, and the use of CC0 to do so, in a letter to Nature. This is a significant affirmation of our approach to data sharing. But a question that inevitably arises in many discussions is: What about data providers that are unable or unwilling to commit their data to the public domain? Will Creative Commons support providing a flexible set of licensing options, intermediate between public domain, on the one hand, and full control (secrecy), on the other?

First, I have to clarify what I mean by “data” in this discussion. “Data” by itself can mean anything, including music, movies, pictures, and other things that are clearly copyrightable. But in this discussion, I will use the term “data” in a narrower and more specific sense: I mean facts, ideas, and concepts that are not copyrightable by themselves. Examples include Einstein’s E=mc^2 equation, the height of Mount Everest, and the coordinates of a particular star. The unprotected status of these data was affirmed in Feist Publications v. Rural Telephone Service, where the U.S. Supreme Court found that originality is a basic Constitutional prerequisite for copyright to exist, or as Justice O’Connor, writing for the majority, said: “It is this bedrock principle of copyright that … No one may claim originality as to facts.” (emphasis added) The U.S. Copyright Act further codifies this principle as a limitation on the scope of copyright protection (at Section 102(b)). Likewise, other countries recognize this limitation in their originality requirements.

This basic limitation on the scope of copyright acknowledges that copyright is inherently a social compromise between the desire to reward authors for creative output and the need to protect a reservoir of facts and ideas available for everyone to draw upon. Without this “commons” of facts and ideas, social discourse and creativity would suffer. As Lawrence Lessig writes, in The Future of Ideas, “Free resources have been crucial to innovation and creativity… without them, creativity is crippled. Thus, and especially in the digital age, the central question becomes not whether government or the market should control a resource, but whether a resource should be controlled at all. Just because control is possible, it doesn’t follow that it is justified. Instead, in a free society, the burden of justification should fall on him who would defend systems of control.”

And yet, over time, copyright control has expanded dramatically in scope and duration, straining this delicate social compromise. Ironically, it is the growth and success of the Internet, with its extraordinary power and freedom, that has spurred renewed interest in extending copyright-like controls even beyond the traditional realms of copyright itself. Databases containing myriad facts and ideas, once considered public domain if shared publicly, are now the subject of efforts to create new systems of control. In Europe, by E.U. Directive, countries have implemented “sui generis” database rights that protect databases and their contents even if they are too unoriginal to merit copyright protection. Other countries grant copyright protection to databases under relaxed copyright standards that demand less than full originality or creativity.

Finally, there are attempts to create systems of control based on contract law (like click-wrap agreements, Web site terms of use, etc.), premised not on the existence of any copyright or statutory right, but merely on voluntary agreement. Contracts can expand copyright-like controls well beyond the boundaries of traditional copyright or even sui generis protections, and indeed have no inherent limits other than the enforceability of the agreement (which can be problematic in itself). Not only do such contracts apply to uncopyrightable data, but they can also impose controls on data already otherwise in the public domain, since the issue is not the status of the data but whether you consented to abide by a contract. A recent example is the Open Data Commons’ Open Database License (ODbL), which is being considered for adoption by the OpenStreetMap community, among others. The Open Data Commons not only has been a strong supporter and advocate for open data sharing, but it has provided important community tools, including the Public Domain Dedication and License (PDDL). But unlike the previously released PDDL, the new ODbL contains attribution and share-alike obligations, among other requirements. Its terms and conditions are imposed on copyright or sui generis database rights, but it also purports to act as a contract in the absence of these protections. As a result, it attempts to impose obligations on data that even copyright and sui generis rights do not reach.

With CC0, Creative Commons has chosen to take a different approach (or rather, to stick with an approach similar to the PDDL). CC0 is a way to give up controls and dedicate data to the public domain (or as close to it as we can legally achieve). As I have explained elsewhere, we were concerned about the practical impact of “attribution stacking” and license compatibility problems for data sharing communities. Attribution stacking can burden large-scale data sharing projects that draw on many sources, and license compatibility problems can shut down data integration efforts altogether.

In science, an area that I focus on, sharing data in the public domain is in fact part of a long and honored tradition. Before the Internet, data was published, if at all, in print journals. The articles themselves may have been copyrightable, but the facts and ideas revealed there were presumed to be in the public domain. Only with the advent of the Internet and digital technology has there been interest in “licensing” the contents of databases, including such facts and ideas. Thus, where there is an established tradition of public domain data sharing that has worked well for a community, and continues to work well, any new system of control must meet a high burden of justification. But based on our experiences with other licensing schemes, we know that such controls carry risks. Even a simple requirement like attribution, when aggregated over thousands or millions of data elements, can become a very serious burden. Scientists should provide attribution (and citation) for valid scientific reasons, and no legal requirement may be flexible enough to replace common sense or professional judgment, an important ingredient in deciding what to attribute and how. In addition, license incompatibility problems, which are especially relevant with share-alike licenses, can prevent databases or data sources from being combined or integrated, or data from being reused. All of this can have a negative impact on the usability of scientific data.

In light of such risks, what could justify departures from the public domain? One argument, made to me eloquently by several data project organizers, is that unless we grant providers the flexibility to impose some controls–rather than none–they will be reluctant or unwilling to grant any access. And even restricted sharing, with some conditions, is better than no sharing at all. Further, they argue that some extremely valuable data sets might fall into this category, because the more valuable the data, the less likely it is that someone would consider simply releasing it into the public domain. And so by not offering a graduated system of controls, like the CC suite of copyright licenses, important opportunities to share are being missed, with serious consequences for those communities and perhaps for all of us. I have to admit that it’s a powerful argument against being too dogmatically attached to the public domain, and if true, it might justify other approaches.

At issue is whether more data would be made available under a more restrictive system than the public domain and to what extent those restrictions impair the value of that data to the community. I don’t think we know the answer fully yet. It’s a question that undoubtedly deserves more research by sociologists and other scholars, based on empirical evidence. But, when in doubt, what should be done? I come back to Lessig’s admonition that, “the burden of justification should fall on him who would defend systems of control.” I think the best that can be said for more restrictive systems of sharing data is “not yet proven.” And that’s why we will continue to advocate public domain and CC0 for data sharing.

8 Responses

  1. Bob Morris, on November 20th, 2009 at 11:14 am

    See also
    Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53

    for related solutions, especially for treatment of the case of non-copyrightable material expressed for the first time in, and extracted from, copyrightable material.

  2. Bob Morris, on November 20th, 2009 at 11:20 am

    Where talking about “facts, ideas, and concepts that are not copyrightable by themselves.” this blog entry seems to contradict the CC0 commentary about which I understand to mean that, like all CC licenses, CC0 can only apply to copyrightable material. How is this reconciled with advocacy of application of CC0 to something that is not copyrightable?

  3. Mike Linksvayer, on December 8th, 2009 at 11:00 pm

    Bob, CC0 applies to the extent it can to neighboring rights as well. The leading paragraph on is pretty long, but worth reading again (emphasis added):

    Using CC0, you can waive all copyrights and related or neighboring rights that you have over your work, such as your moral rights (to the extent waivable), your publicity or privacy rights, rights you have protecting against unfair competition, and database rights and rights protecting the extraction, dissemination and reuse of data.

    Also spelled out in more detail in

  4. Mark, on December 11th, 2009 at 7:25 am

    What license might work best for scientific unit schemas, I wonder. They’re like a database with pieces of code. Review the files and contact me to discuss if interested.

    The stated issue is “whether more data would be made available [shy of] public domain.” My gut says yes. Attribution issues are peanuts.

    A healthy “commons,” almost by definition, needs Affero-style network licensing clauses. Companies monetizing the commons then have incentive to pool data work with competitors and volunteers. No company benefits from duplicating the database work of another. Market competition moves away from databases to company service, hardware/software computational prowess, site entertainment value, educational merit, staff expertise, and other factors.

  5. Jim, on December 13th, 2009 at 9:43 am

    Everything should be public domain after 5 years.

  6. RAHale, on December 17th, 2009 at 12:27 pm

    It is incorrect to assume that because data is appropriate to the public domain it inherently ranks lower in value than “proprietary” data. For example, how valuable would you say it would be to have access to information about the public water supply for the watershed you live in? Does that gain or lose value, depending on transparency? Who are stakeholders that define value?

    “…Further, they argue that some extremely valuable data sets might fall into this category, because the more valuable the data, the less likely it is that someone would consider simply releasing it into the public domain.”

  7. Hungover Guy, on January 27th, 2010 at 4:13 am

    As much as I can understand right now, I think you’re right!

  8. Puneet Kishor, on January 29th, 2010 at 11:01 am

    CC0 is not a license. That is one crucial point to keep in mind when thinking about CC0 along with other CC licenses. The latter are true licenses while CC0 is a waiver. It is an absence of any license. Think of CC0 as a mark of quality. It proclaims that the data it is applied to has no known restrictions. To the extent that there were any copyrightable elements, the rights are waived, and with regard to the elements that are not copyrightable, well, there was nothing there to be waived in the first place.

    Hope that helps.