Nature on Big Data

September 9th, 2008

There’s plenty to recommend in Nature’s special issue on Big Data (Sept. 3), but readers of this blog might especially appreciate The future of biocuration.

Here’s a glimpse:

Biology, like most scientific disciplines, is in an era of accelerated information accrual and scientists increasingly depend on the availability of each others’ data. Large-scale sequencing centres, high-throughput analytical facilities and individual laboratories produce vast amounts of data such as nucleotide and protein sequences, protein crystal structures, gene-expression measurements, protein and genetic interactions and phenotype studies. By July 2008, more than 18 million articles had been indexed in PubMed and nucleotide sequences from more than 260,000 organisms had been submitted to GenBank [1, 2]. The recently announced project to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (http://www.1000genomes.org) is a tip of the data iceberg.

Such data, produced at great effort and expense, are only as useful as researchers’ ability to locate, integrate and access them.


If you’re interested in exploring further, here are a few pointers to other relevant pieces (but be aware that the articles are available free for only two weeks from the publication date):

  • Community cleverness required (editorial) — “Researchers need to adapt their institutions and practices in response to torrents of new data — and need to complement smart science with smart searching.”
  • The next Google (special report) — “Ten years ago this month, Google’s first employee turned up at the garage where the search engine was originally housed. What technology at a similar early stage today will have changed our world as much by 2018?”
  • Welcome to the petacentre (feature) — “What does it take to store bytes by the tens of thousands of trillions? Cory Doctorow meets the people and machines for which it’s all in a day’s work.”
  • Wikiomics (feature) — “Pioneering biologists are trying to use wiki-type web pages to manage and interpret data, reports Mitch Waldrop. But will the wider research community go along with the experiment?”
  • How does your data grow? (commentary) — “Scientists need to ensure that their results will be managed for the long haul. Maintaining data takes big organization, says Clifford Lynch.”

Toward a global platform for open science

September 8th, 2008

Watching the new video about the NeuroCommons project, I was struck by how many different elements are necessary for making knowledge from one domain in science interoperable — or “remixable” — with knowledge from another. 

In a recent piece on open science, Era of Scientific Secrecy Near End (LiveScience, Sept. 2), Cameron Neylon articulates the vision of a global platform to do just that, built using design principles from open source software:

“Making things more open leads to more innovation and more economic activity, and so the technology that underlies the Web makes it possible to share in a way that was never really possible before, while at same time it also means that [the] kinds of models and results generated are much more rich,” he said.

This is the open source approach to software development, as opposed to commercial closed source approaches, Neylon said. The internals are protected by developers and lawyers, but the platform is available for the public to build on in very creative ways.

“Science was always about mashing up, taking one result and applying it to your [work] in a different way,” Neylon said. “The question is ‘Can we make that as effective [for] samples [of] data and analysis as it [is] for a map and set of addresses for a coffee shop?’ That is the vision.”

That’s a vision Science Commons shares. The past ten years have brought the rise of a robust infrastructure for sharing and remixing cultural content, and thanks to the emergence of innovative tools like Google Maps, more people are grasping the power of open systems for connecting information from disparate sources to make it more useful. Yet we remain in the early stages of building an open infrastructure for science that would make it easy to integrate and make sense of research and data from different sources. 

The NeuroCommons is our effort to jumpstart the process, with the goal of making all scientific research materials — research articles, annotations, data, physical materials — as available and as useable as they can be. If you’re new to the project, we hope you’ll take a look at the video and let us know what you think.

Interested in open science?

September 5th, 2008

If so, there’s a new discussion list you may want to join, brought to you by the good folks at the Open Knowledge Foundation.

Writes OKF’s Jonathan Gray:

As far as we could tell, there wasn’t a general mailing list for people interested in open science. Hence the new list aims to cover this gap, and to strengthen and consolidate the open science community.

We hope it will be a relatively low volume list for relevant announcements, questions and notes. We also hope to get as full as possible representation from the open science community — so please forward this to anyone you think might be interested to join!

How academic health research centers can foster data sharing

September 2nd, 2008

PLoS Medicine today published a new paper that provides useful guidelines for people at academic health centers seeking to support scientific data sharing. The paper, Towards a Data Sharing Culture: Recommendations for Leadership from Academic Health Centers, discusses both the enormous benefits and the obstacles to forging a research culture that fosters data sharing, and outlines practical steps people can take to set the process in motion.

Here’s an excerpt summarizing the paper’s recommendations:

Recommendations for Academic Health Centers to Encourage Data Sharing

  1. Commit to sharing research data as openly as possible, given privacy constraints. Streamline IRB, technology transfer, and information technology policies and procedures accordingly.
  2. Recognize data sharing contributions in hiring and promotion decisions, perhaps as a bonus to a publication’s impact factor. Use concrete metrics when available.
  3. Educate trainees and current investigators on responsible data sharing and reuse practices through class work, mentorship, and professional development. Promote a framework for deciding upon appropriate data sharing mechanisms.
  4. Encourage data sharing practices as part of publication policies. Lobby for explicit and enforceable policies in journal and conference instructions, to both authors and peer reviewers.
  5. Encourage data sharing plans as part of funding policies. Lobby for appropriate data sharing requirements by funders, and recommend that they assess a proposal’s data sharing plan as part of its scientific contribution.
  6. Fund the costs of data sharing, support for repositories, adoption of sharing infrastructure and metrics, and research into best practices through federal grants and AHC funds.
  7. Publish experiences in data sharing to facilitate the exchange of best practices.

The paper, co-authored by Heather Piwowar, Michael Becich, Howard Bilofsky and Rebecca Crowley, was written on behalf of the caBIG Data Sharing and Intellectual Capital Workspace (you can read our previous posts about caBIG by following this link).

Progress on the CC0 public domain waiver

September 2nd, 2008

Over at the main Creative Commons blog, Diane Peters has the scoop on draft 3 of the CC0 public domain waiver, a tool for those who wish to relinquish their rights under copyright to a work, and mark it with machine-readable metadata for harvesting as part of the public domain. It is this type of tool that Science Commons advocates using in our Protocol for Implementing Open Access Data, a method for legally integrating scientific databases regardless of the country of origin. The goal of the protocol, to use Catriona MacCallum’s phrase:  increasing the “Lego factor” for scientific data.

The news, in brief: Creative Commons has added language to the CC0 waiver to ensure that it makes sense and can be useful for people across the globe. Explains Diane:

We remain dedicated to pursuing a Universal CC0, but with some substantial revision to the text. Here are a few of the changes you will see in draft 3 as a result of [the community’s] comments and discussions:

  • Inclusion of a Statement of Purpose that provides context and explanation for issues CC0 attempts to solve while also identifying limitations inherent in such an attempt;
  • Clarifying language about the IP rights affected by CC0 through a new comprehensive definition of “Copyright Related Rights”; and
  • Emphasis on the possible existence of privacy and publicity rights of others with respect to a work, and the need for those to be cleared where appropriate.

Creative Commons plans to take CC0 out of beta in late October or early November, and comments on this draft are due on September 26. If you’d like to check out the waiver or weigh in, visit the newly updated CC0 Wiki and subscribe to the cc-licenses mailing list.

FasterCures follows up on ‘Ten to Watch in 2008’

August 26th, 2008

FasterCures President Greg Simon has published a follow-up to the organization’s announcement in January of the FasterCures Ten to Watch in 2008 list, on which we were honored to appear. The post provides updates on the developments, trends and organizations FasterCures identified in the original list, including “Science 2.0,” highlighting innovators in the “use of online platforms for scientific collaboration.” Among them: CollabRx, a company that creates “virtual biotechs” to put drug development in patients’ hands. CollabRx is one of our partners in the Health Commons, a project to make it easier for anyone to pull together the resources for accelerating drug discovery: research, data, materials and services.

You can read the post, Ten to Watch Mid-Year Review, over at the FasterCures blog.

Access to knowledge in science

August 25th, 2008

The August/September 2008 issue of Intellectual Property Watch Monthly Reporter features the second part of a two-part series on the Access to Knowledge (A2K) movement, which has among its goals “sharing the benefits of scientific advancement” (see the Treaty on Access to Knowledge [PDF]).

The article, Access To Knowledge “Movement” Seeks Strength In Its Diversity (available only to subscribers), cites the Introduction to Science Commons [PDF], co-authored by James Boyle and John Wilbanks, to show how the effort to change the way knowledge is governed has been reaching beyond IP policy. Scientists confront multiple barriers to accessing existing knowledge throughout the research cycle, including problems with securing access to the physical materials needed to verify results. As the IP Watch piece points out, Science Commons seeks to lower these barriers, and to provide solutions that knit together, so that when a researcher finds one piece of the puzzle — for instance, an article containing a piece of relevant data — she can also find the resources she needs to put the knowledge to use.

[Boyle and Wilbanks say that in] scientific research access to knowledge problems start at the earliest stage. Getting access to journals and physical materials needed for research can be difficult and time consuming – particularly problematic for those working on a limited-term grant, they said. This means research institutions “effectively ‘discard’ minds we might need to solve problems because they do not have full access to the research they need.”

Access to knowledge then, is about ease of networking and data transfer as much as it is about IP rights. Wilbanks and Boyle said there needs to be a connection between efforts to “streamline the legal process for clearing materials and efforts to streamline the practical process of actually fabricating and transferring the materials themselves.”

You can read more about our efforts to streamline the materials transfer process here. For a look at how we’re working to bring together all of the resources for accelerating research, check out the NeuroCommons and Health Commons projects.

What’s open science?

August 22nd, 2008

That’s the question many of us have been grappling with in the wake of two unforgettable unconferences: BioBarCamp and SciFoo.

Over at Science in the open, Cameron Neylon writes:

During the introduction [at BioBarCamp] many people expressed an interest in “Open Science”, “Open Data”, or some other open stuff, yet it was already pretty clear that many people meant many different things by this. I think for me the most striking outcome of [a session to define it] was that not only is this a radically new concept for many people but that many people don’t have any background understanding of open source software either which can make the discussion totally impenetrable to them. This, in my view strengthens the need for having some clear brands, or standards, that are easy to point to and easy to sign up to (or not).

This is one of the reasons why Science Commons has published a set of principles for open science, which we prepared for our satellite workshop at ESOF 2008 (you can download the PDF or read them online here). We hope not only to help bring more clarity to the discussion, but also to pave the way for integrating all kinds of open science projects in a shared collaborative infrastructure.

For a taste of the conversation happening elsewhere, here are snippets from relevant posts published in the last few weeks:

Cat Allman @ the Google Open Source Blog: “Certain themes recurred [at SciFoo 2008]. One was the need to do a better job of open sourcing data within the science community, including negative results; such sharing would enable collaboration and prevent scientists from ‘reinventing the wheel.’”

Shirley Wu @ One Big Lab: “At BioBarCamp this past weekend (many thanks to John Cumbers and Attila Csordas for organizing!), the future of science became a recurring theme, with an impromptu discussion on open science the first day and spirited sessions on open science, web 2.0, the data commons, change in science, science ‘worship’, and redefining ‘impact’ and ‘failure’ the second. Each of these topics could be their own blog series, and, in fact, many of them are. Even if people didn’t always agree on the details, it was clear that everyone there (a biased group, inarguably) agreed that change is necessary, and inevitable. The question is, what will that change look like, and how will we get there?”

Cameron Neylon @ Science in the open: “Helen [Berman] made the point strongly that it had taken 37 years to get the [Protein Data Bank] to where it is today; a gold standard international and publically available repository of a specific form of research data supported by a strong set of community accepted, and enforced, rules and conventions. We don’t want to take another 37 years to achieve the widespread adoption of high standards in data availability and open practice in research more generally.”

Chris Patil @ Ouroboros: “Given a suitable set of one-to-one and one-to-many agreements between the stakeholders [in scientific research], then, the benefits of sharing could come to outweigh any conceivable advantage derived from secrecy. Perhaps ‘open science’ could be defined (for the moment) as the quest to design and optimize these agreements, along with the quest to design the best tools and licenses to empower scientists as they move from the status quo into the next system — because (and this is very important) if it is to ever succeed, open science has to work not because of governmental fiat or because a large number of people suddenly start marching in lockstep to an unnatural tune, but because it works better than competing models.”

Our own Kaitlin Thaney, who organized the ESOF satellite workshop, led an impromptu session on open science at BioBarCamp, and we’re eager to continue the conversation. If you’re interested in helping to keep the ball rolling, let us know.

Boston Globe on the open science “insurgency”

August 21st, 2008

The Boston Globe today profiles three local organizations that leverage the Web to accelerate scientific research and discovery:  OpenWetWare (OWW), the Journal of Visualized Experiments (JoVE) and Science Commons. The article, Out in the open: some scientists sharing results, tells the story of Barry Canton, a young researcher here at MIT who has joined the “peaceful insurgency” in scientific research by publishing preprint research and raw data, a practice often called open notebook science.

The article also touches briefly on the broader spectrum of endeavors in open science, including the work we do to help realize the vision of fully automated, permission-free access to the available research, data and materials.

If you’ve read the Globe piece and want to learn more about what Science Commons does, here’s a quick tour of our current projects. We focus on:

  • Scholar’s Copyright and Open Access Data — making scientific research “re-useful” by providing free tools for opening and marking research and data for reuse
  • Biological Materials Transfer Agreement Project — facilitating “one-click” access to research materials by streamlining and automating the materials-transfer process, so scientists can more easily replicate, verify and extend research
  • The NeuroCommons — integrating fragmented information sources in the field of neuroscience, in our “proof of concept” project to help researchers to find, analyze and use research, data and materials from disparate sources
  • The Health Commons — building the legal framework for a permission-free marketplace of drug discovery data, materials and services, to make it easier for anyone to pull together the resources for accelerating disease research

If you have any questions, please let us know — we’d like to hear from you. And if you work in scientific research and want to collaborate with us to lift legal and technical barriers to research, you can click here to learn about your options.

Voices from the future of science: Rufus Pollock of the Open Knowledge Foundation

August 18th, 2008

If there’s a single quote that best captures the ethos of open science, it might be the following bon mot from Rufus Pollock, digital rights activist, economist at the University of Cambridge and a founder of the Open Knowledge Foundation: “The best thing to do with your data will be thought of by someone else.”

It’s also a pithy way to convey both the challenge and opportunity for publishers of scientific research and data. How can we best capitalize on the lessons from the rise of the Web and open source software to accelerate scientific research? What’s the optimal way to package data so it can be used in ways no one anticipates?

I talked to Pollock, who’s been a driving force behind efforts to improve sharing and reuse of data, about where we stand in developing a common legal, technical and policy infrastructure to make open science happen, and what he thinks the next steps should be.

What strategies and concepts can we use from open source to foster open science? Can you give us a big picture description of the role you see the Open Knowledge Foundation playing?

I’d say that in terms of applying lessons from open source, the biggest thing to look at is data. Code and data have so many similarities — indeed, in many ways, the distinction between code and data is beginning to blur. The most important similarity is that both lend themselves naturally to being broken down into smaller chunks, which can then be reused and recombined.

This breaking down into smaller, reusable chunks is something we at the Open Knowledge Foundation refer to as “componentization.” You can break down projects, whether they are data sets or software programs, into pieces of a manageable size — after all, the human brain can only handle so much data — and do it in a way that makes it easier to put the pieces back together again. You might call this the Humpty Dumpty principle. And splitting things up means people can work independently on different pieces of a project, while others can work on putting the pieces back together — that’s where “many minds” come in.

What’s also crucial here is openness: without openness, you have a real problem putting things together. Everyone ends up owning a different piece of Humpty, and it’s a nightmare getting permission to put him back together (to use jargon from economics, you have an anti-commons problem). Similarly, if a data set starts off closed, it’s harder for different people to come along and begin working on bits of it. It’s not impossible to do componentization under a proprietary regime, but it is a lot harder.

With the ability to recombine information as the goal, it’s critical to be explicit about openness — both about what it is, and about what you intend when you make your work available. In the world of software, the key to making open source work is licensing, and I believe the same is true for science. If you want to enable reuse — whether by humans, or more importantly, by machines operated by humans — you’ve got to make it explicit what can be used, and how. That’s why, when we started the Open Knowledge Foundation back in 2004, one of the first things we focused on was defining what “open” meant. That kind of work, along with the associated licensing efforts, can seem rather boring, but it’s absolutely crucial for putting Humpty back together. Implicit openness is not enough.

So, in terms of open science, one of the main things the Open Knowledge Foundation has been doing is conceptual work — for example, providing an explicit definition of openness for data and knowledge in the form of the open knowledge/data definition, and then explaining to people why it’s important to license their data so it conforms to the definition.

So, to return to the main question, I think one of the strategies we should be taking from open source is its approach to the Humpty Dumpty problem. We should be creating and sharing “packages” of data, using the same principles you see at work in Linux distributions — building a Debian of data, if you like. Debian has currently got something like 18,000 software packages, and these are maintained by hundreds, if not thousands, of people — many of whom have never met. We envision the community being able to do the same thing with scientific and other types of data. This way, we can begin to divide and conquer the complexity inherent in the vast amounts of material being produced — complexity I don’t see us being able to manage any other way.
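To make the “Debian of data” idea a little more concrete, here is a minimal sketch of how a registry of data packages might work out what to fetch, and in what order, when someone asks for a single data set. Everything in it is an assumption made for illustration: the package names, the `REGISTRY` dictionary and the `install_order` function are hypothetical and are not part of CKAN or any Open Knowledge Foundation tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry: each data package lists the packages it builds on.
REGISTRY = {
    "gene-names": [],
    "protein-sequences": ["gene-names"],
    "protein-interactions": ["protein-sequences", "gene-names"],
    "brain-expression-atlas": ["gene-names"],
}


def install_order(target, registry):
    """Return the packages needed for `target`, dependencies first."""
    needed = {}
    stack = [target]
    # Walk the dependency graph, collecting only what `target` requires.
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        if name not in registry:
            raise KeyError(f"unknown package: {name}")
        needed[name] = registry[name]
        stack.extend(registry[name])
    # TopologicalSorter yields each package after all of its dependencies.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(install_order("protein-interactions", REGISTRY))
```

Because each package declares only what it depends on, different people can maintain different pieces independently, and the pieces can still be put back together mechanically.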

Your Comprehensive Knowledge Archive Network (CKAN) is a registry for open knowledge packages and projects, and people have added more than 100 in the past year. Can you tell us how the project got started? What have the recent updates achieved? And what are your future plans — where do you hope to go next?

If you’ve got an ambitious goal like this one [of radically changing data sharing and production practices], you’ve got to start with a modest approach — asking, “what is the simplest thing we can do that would be useful?” So we began by identifying some of the key things necessary for a knowledge-sharing infrastructure, to figure out what we could contribute. Sometimes what’s needed is conceptual, like our definitions. Sometimes you need a guide for applying concepts, like our principles for open knowledge development. And you need a way to share resources, which is why we started KnowledgeForge, which hosts all kinds of knowledge development projects.

The impetus behind CKAN was to make it easier for people to find open data, as well as to make their data available to others (especially in a way that can be automated). If you use Google to search for data, you’re much more likely to find a page about data than you are to find the data itself. As a scientist, you don’t want to find just one bit of information — you want the whole set. And you don’t want shiny front ends or permission barriers at any point in the process. We’ve been making updates to CKAN so machines can better interact with the data, which makes it so people who want data don’t have to jump as many hurdles to get it. Ultimately, we want people to be able to request data sets and have the software automatically install any additions and updates on their computers.

What are the biggest challenges to making open science work? If you had to lay out a 3-point agenda for the next five years, what would the action items be?

I think that, as with nearly everything else, the social and cultural challenges may be the biggest hurdle. One aspect of making it work is ensuring that more people understand exactly what they can gain from sharing. I think it’s like a snowball: you might not get much back, initially, from sharing, but over time, you’d be able to see your data sets plugged in with other data sets, and your peers doing the same thing. The results might encourage you to share more.

As for a 3-point agenda:

1.) Open access is very important. In particular, I’d like to see the funders of science mandate not just open access to publications but also, as part of the process, open access to the data. They are paying for the research, so they can provide the incentive to make the results open. Moreover, it should be easier to get open access to the data; you wouldn’t necessarily have the same kind of struggle with publishers.

2.) I think we need more evangelism/advocacy for open science. We’re seeing big shifts in the way we do science, but we’re still on the cusp of a movement to bring open approaches together in a common infrastructure.

3.) We need to make it easier for people to share and manage large data sets. Open science is already working in some respects; arXiv.org is an extraordinary resource, for instance, but we need a better infrastructure for handling the data itself. I also think that many people are put off sharing because they think they don’t know how to manage data. That causes people to hesitate or give up completely. We need to make the process smoother. Sharing your data should be as frictionless as possible.

What do you see as the most important development in open science over the last year?

Without question, the progress we’re making with data licensing. We have the Science Commons Protocol for Implementing Open Access Data, which conforms to the Open Knowledge Definition, and the very first open data licenses that comply with the protocol: the Open Data Commons Public Domain Dedication and License (ODC-PDDL) and the CC0 public domain waiver. We now need to encourage people to start using these waivers — or any other open license that complies.

When I talk to people about what the open science movement is trying to achieve, the most common response I get is, “Well, won’t Google take care of that?” Do you hear that? What’s your response?

I would ask, “Well, what is ‘that’?” You find that many people believe that if you put something online, it’s automatically open, and Google will do the rest. Google is great, but it can’t handle things like community standards or usage rights. And in any case, I’m deeply skeptical of “one ring to rule them all” solutions. What we need is more along the lines of “small pieces, loosely joined.” Of course organizations like Google could help a lot (or hurt!), and they’re certainly an important part of the ecosystem. But at the Open Knowledge Foundation, we like to say that the revolution will be decentralized. No one person, organization or company is going to do everything. Even Google didn’t make the Web standards or create the web pages and hyperlinks that make search engines work. As it stands, Google may be good for finding bits of Humpty, but not for creating or putting him back together.

Have you read Chris Anderson’s piece, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete? If so, what’s your take on it?

I’ll be politic and say that it’s provocative but ultimately unconvincing. There are reasons why we have theory. Imagine a library where you could have any book you want, but there are no rules for searching, so you have to search every book. The knowledge space is just too vast. In economics, just like in science, you need models to isolate the variables you’re interested in. There may be millions of variables, for instance, to explain why you’re a happy person right now. You had a happy childhood, you just listened to a symphony, etc. And the number of possible explanations (or, more formally, “regressions”) grows exponentially with the variables, so you’re creating a situation that’s computationally hard — problems that, using brute force, would take longer than the lifetime of the universe to solve, even with the fastest supercomputers around.

I’d argue that with more data, you need more, not less, modeling insight. As the haystack grows, finding the needle by brute force is likely to be a less attractive, not more attractive, option. Of course it’s true that more data and more computational power are a massive help in making progress in science or any other area. It’s just that they have to be used intelligently.
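The arithmetic behind the brute-force point can be checked in a few lines. This is a rough sketch under stated assumptions, not a calculation from the interview: it counts one candidate regression for every subset of the variables (so n variables give 2^n candidates), assumes a hypothetical machine that fits a billion regressions per second, and takes the age of the universe as roughly 13.8 billion years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate figure, used only for scale


def candidate_models(n_variables):
    """Each variable is either in the regression or out: 2**n subsets."""
    return 2 ** n_variables


def years_to_try_all(n_variables, models_per_second=1e9):
    """Years needed to fit every subset at an assumed (hypothetical) rate."""
    seconds = candidate_models(n_variables) / models_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 30, 100):
        print(n, candidate_models(n), years_to_try_all(n))
```

Under these assumptions, 30 variables take about a second to exhaust, while 100 variables already take far longer than the age of the universe, well short of the millions of variables mentioned above.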

On a more personal note, how does being an economist inform your approach/perspective?

Economists study information goods a lot, so I’d say my background has been very influential. Economics 101 tells us that openness is often the most efficient way to do things, especially when there’s the possibility of up-front funding by, for instance, the government. There are clear, massive benefits for society in having a healthy, balanced information commons. Unfortunately, it is often the case that those who benefit from proprietarization have better-paid advocates, better-oiled PR machines, etc.

My hope is that this work that so many of us are doing pro bono, often in our spare time, will slowly increase in impact — and that, at a minimum, we can ensure that all publicly funded scientific research will be open.

Previous posts in this series: