Voices from the future of science: Rufus Pollock of the Open Knowledge Foundation
August 18th, 2008 by dwentworth
If there’s a single quote that best captures the ethos of open science, it might be the following bon mot from Rufus Pollock, digital rights activist, economist at the University of Cambridge and a founder of the Open Knowledge Foundation: “The best thing to do with your data will be thought of by someone else.”
It’s also a pithy way to convey both the challenge and opportunity for publishers of scientific research and data. How can we best capitalize on the lessons from the rise of the Web and open source software to accelerate scientific research? What’s the optimal way to package data so it can be used in ways no one anticipates?
I talked to Pollock, who’s been a driving force behind efforts to improve sharing and reuse of data, about where we stand in developing a common legal, technical and policy infrastructure to make open science happen, and what he thinks the next steps should be.
I’d say that in terms of applying lessons from open source, the biggest thing to look at is data. Code and data have so many similarities — indeed, in many ways, the distinction between code and data is beginning to blur. The most important similarity is that both lend themselves naturally to being broken down into smaller chunks, which can then be reused and recombined.
This breaking down into smaller, reusable chunks is something we at the Open Knowledge Foundation refer to as “componentization.” You can break down projects, whether they are data sets or software programs, into pieces of a manageable size — after all, the human brain can only handle so much data — and do it in a way that makes it easier to put the pieces back together again. You might call this the Humpty Dumpty principle. And splitting things up means people can work independently on different pieces of a project, while others can work on putting the pieces back together — that’s where “many minds” come in.
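The componentization idea can be sketched in a few lines of Python. Everything here is illustrative — the data, the chunk size, and the function names are invented — but it shows the pattern: split a dataset into manageable pieces, let different people work on the pieces independently, then put Humpty back together.

```python
# Hypothetical sketch of componentization: split a dataset into
# independent chunks, process each separately, then recombine.
records = [{"id": i, "value": i * 10} for i in range(100)]

def split(data, size):
    """Break a dataset into fixed-size components."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def recombine(chunks):
    """Merge independently processed components back into one set."""
    return [row for chunk in chunks for row in chunk]

chunks = split(records, 25)  # four manageable pieces
# imagine each chunk curated by a different contributor
processed = [[{**r, "value": r["value"] + 1} for r in c] for c in chunks]
merged = recombine(processed)
assert len(merged) == 100
```

The point is not the code itself but the shape of the workflow: as long as the pieces have clear boundaries and open terms, the splitting and the reassembly can happen in different hands.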
What’s also crucial here is openness: without openness, you have a real problem putting things together. Everyone ends up owning a different piece of Humpty, and it’s a nightmare getting permission to put him back together (to use jargon from economics, you have an anti-commons problem). Similarly, if a data set starts off closed, it’s harder for different people to come along and begin working on bits of it. It’s not impossible to do componentization under a proprietary regime, but it is a lot harder.
With the ability to recombine information as the goal, it’s critical to be explicit about openness — both about what it is, and about what you intend when you make your work available. In the world of software, the key to making open source work is licensing, and I believe the same is true for science. If you want to enable reuse — whether by humans, or more importantly, by machines operated by humans — you’ve got to make it explicit what can be used, and how. That’s why, when we started the Open Knowledge Foundation back in 2004, one of the first things we focused on was defining what “open” meant. That kind of work, along with the associated licensing efforts, can seem rather boring, but it’s absolutely crucial for putting Humpty back together. Implicit openness is not enough.
So, in terms of open science, one of the main things the Open Knowledge Foundation has been doing is conceptual work — for example, providing an explicit definition of openness for data and knowledge in the form of the open knowledge/data definition, and then explaining to people why it’s important to license their data so it conforms to the definition.
So, to return to the main question, I think one of the strategies we should be taking from open source is its approach to the Humpty Dumpty problem. We should be creating and sharing “packages” of data, using the same principles you see at work in Linux distributions — building a Debian of data, if you like. Debian has currently got something like 18,000 software packages, and these are maintained by hundreds, if not thousands, of people — many of whom have never met. We envision the community being able to do the same thing with scientific and other types of data. This way, we can begin to divide and conquer the complexity inherent in the vast amounts of material being produced — complexity I don’t see us being able to manage any other way.
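The "Debian of data" analogy rests on dependency resolution: a package declares what it builds on, and tooling figures out what to install first. A minimal sketch, with invented package names (no real registry is consulted):

```python
# Hypothetical data packages and their declared dependencies,
# in the spirit of a Debian-style package system for data.
packages = {
    "uk-weather-2008": [],
    "uk-postcodes": [],
    "uk-weather-by-postcode": ["uk-weather-2008", "uk-postcodes"],
}

def install_order(name, pkgs, seen=None):
    """Return a package and its dependencies in install order
    (dependencies first), visiting each package only once."""
    if seen is None:
        seen = []
    for dep in pkgs[name]:
        install_order(dep, pkgs, seen)
    if name not in seen:
        seen.append(name)
    return seen

print(install_order("uk-weather-by-postcode", packages))
# → ['uk-weather-2008', 'uk-postcodes', 'uk-weather-by-postcode']
```

This is exactly the mechanism that lets thousands of Debian maintainers who have never met work on separate packages while the tooling handles reassembly.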
Your Comprehensive Knowledge Archive Network (CKAN) is a registry for open knowledge packages and projects, and people have added more than 100 in the past year. Can you tell us how the project got started? What have the recent updates achieved? And what are your future plans — where do you hope to go next?
If you’ve got an ambitious goal like this one [of radically changing data sharing and production practices], you’ve got to start with a modest approach — asking, “what is the simplest thing we can do that would be useful?” So we began by identifying some of the key things necessary for a knowledge-sharing infrastructure, to figure out what we could contribute. Sometimes what’s needed is conceptual, like our definitions. Sometimes you need a guide for applying concepts, like our principles for open knowledge development. And you need a way to share resources, which is why we started KnowledgeForge, which hosts all kinds of knowledge development projects.
The impetus behind CKAN was to make it easier for people to find open data, as well as to make their data available to others (especially in a way that can be automated). If you use Google to search for data, you’re much more likely to find a page about data than you are to find the data itself. As a scientist, you don’t want to find just one bit of information — you want the whole set. And you don’t want shiny front ends or permission barriers at any point in the process. We’ve been making updates to CKAN so machines can better interact with the data, which makes it so people who want data don’t have to jump as many hurdles to get it. Ultimately, we want people to be able to request data sets and have the software automatically install any additions and updates on their computers.
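The difference between a page *about* data and machine-usable data comes down to a registry that maps package names to direct resource URLs and explicit licenses. The sketch below is hypothetical — the package name, URL, and fields are invented, and CKAN's actual interface differs — but it shows why a script, not just a human with a browser, can then fetch a whole dataset:

```python
# Hypothetical registry entry: a package name mapped to an explicit
# license and direct download URLs, so software can act on it.
REGISTRY = {
    "gene-expression-atlas": {
        "license": "ODC-PDDL",
        "resources": ["http://example.org/data/atlas-v2.csv"],
    },
}

def locate(package):
    """Return the direct data URLs for a registered package."""
    entry = REGISTRY.get(package)
    if entry is None:
        raise KeyError(f"package not registered: {package}")
    return entry["resources"]

urls = locate("gene-expression-atlas")
```

With entries like this, "request a data set and have the software install updates automatically" becomes a loop over `locate()` results rather than a manual hunt through web pages.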
What are the biggest challenges to making open science work? If you had to lay out a 3-point agenda for the next five years, what would the action items be?
I think that, as with nearly everything else, the social and cultural challenges may be the biggest hurdles. One aspect of making it work is ensuring that more people understand exactly what they can gain from sharing. I think it’s like a snowball: you might not get much back, initially, from sharing, but over time, you’d be able to see your data sets plugged in with other data sets, and your peers doing the same thing. The results might encourage you to share more.
As for a 3-point agenda:
1.) Open access is very important. In particular, I’d like to see the funders of science mandate not just open access to publications but also, as part of the process, open access to the data. They are paying for the research, so they can provide the incentive to make the results open. Moreover, it should be easier to get open access to the data; you wouldn’t necessarily have the same kind of struggle with publishers.
2.) I think we need more evangelism/advocacy for open science. We’re seeing big shifts in the way we do science, but we’re still on the cusp of a movement to bring open approaches together in a common infrastructure.
3.) We need to make it easier for people to share and manage large data sets. Open science is already working in some respects; arXiv.org is an extraordinary resource, for instance, but we need a better infrastructure for handling the data itself. I also think that many people are put off sharing because they think they don’t know how to manage data. That causes people to hesitate or give up completely. We need to make the process smoother. Sharing your data should be as frictionless as possible.
What do you see as the most important development in open science over the last year?
Without question, the progress we’re making with data licensing. We have the Science Commons Protocol for Implementing Open Access Data, which conforms to the Open Knowledge Definition, and the very first open data licenses that comply with the protocol: the Open Data Commons Public Domain Dedication and License (ODC-PDDL) and the CC0 public domain waiver. We now need to encourage people to start using these waivers — or any other open license that complies.
When I talk to people about what the open science movement is trying to achieve, the most common response I get is, “Well, won’t Google take care of that?” Do you hear that? What’s your response?
I would ask, “Well, what is ‘that’?” You find that many people believe that if you put something online, it’s automatically open, and Google will do the rest. Google is great, but it can’t handle things like community standards or usage rights. And in any case, I’m deeply skeptical of “one ring to rule them all” solutions. What we need is more along the lines of “small pieces, loosely joined.” Of course organizations like Google could help a lot (or hurt!), and they’re certainly an important part of the ecosystem. But at the Open Knowledge Foundation, we like to say that the revolution will be decentralized. No one person, organization or company is going to do everything. Even Google didn’t make the Web standards or create the web pages and hyperlinks that make search engines work. As it stands, Google may be good for finding bits of Humpty, but not for creating or putting him back together.
Have you read Chris Anderson’s piece, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete? If so, what’s your take on it?
I’ll be politic and say that it’s provocative but ultimately unconvincing. There are reasons why we have theory. Imagine a library where you could have any book you want, but there are no rules for searching, so you have to search every book. The knowledge space is just too vast. In economics, just like in science, you need models to isolate the variables you’re interested in. There may be millions of variables, for instance, to explain why you’re a happy person right now. You had a happy childhood, you just listened to a symphony, etc. And the number of possible explanations (or, more formally, “regressions”) grows exponentially with the variables, so you’re creating a situation that’s computationally hard — problems that, using brute force, would take longer than the lifetime of the universe to solve, even with the fastest supercomputers around.
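The combinatorial point can be made concrete. If every subset of candidate variables is a possible model specification, the search space is 2 to the power of the number of variables — a sketch of how fast that escapes brute force:

```python
# With n candidate explanatory variables, every subset of them is a
# possible regression specification, so the space has 2**n members.
def n_specifications(n_variables):
    return 2 ** n_variables

for n in (10, 50, 300):
    print(n, n_specifications(n))
# already at n = 300 the count (~2 x 10**90) exceeds the commonly
# cited ~10**80 atoms in the observable universe
```

Theory is what prunes this space before any computation starts, which is why more data does not make models dispensable.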
I’d argue that with more data, you need more, not less modeling insight. As the haystack grows, finding the needle by brute force is likely to be a less attractive, not more attractive option. Of course it’s true that more data and more computational power are a massive help in making progress in science or any other area. It’s just that they have to be used intelligently.
On a more personal note, how does being an economist inform your approach/perspective?
Economists study information goods a lot, so I’d say my background has been very influential. Economics 101 tells us that openness is often the most efficient way to do things, especially when there’s the possibility of up-front funding by, for instance, the government. There are clear, massive benefits for society in having a healthy, balanced information commons. Unfortunately, it is often the case that those who benefit from proprietarization have better-paid advocates, better-oiled PR machines, etc.
My hope is that this work that so many of us are doing pro bono, often in our spare time, will slowly increase in impact — and that, at a minimum, we can ensure that all publicly funded scientific research will be open.