Background Briefing

let data speak to data

"Upload and share your raw data, and have a high impact factor for your blog - or perish? That day has not yet come, but Web technologies, from the personal publishing tools such as blogs to electronic laboratory notebooks, are pushing the character of the Web from that of a large library towards providing a user-driven collaborative workspace."

Nature, vol. 438, December 2005

The NeuroCommons is a proving ground for the ideas behind Science Commons’ Data Project. It is built on the legal opportunities created by Open Access to the scientific literature and the technical capabilities of the Semantic Web.

Executive Summary

The NeuroCommons project, a collaboration between Science Commons and the Teranode Corporation, is drawing on Open Access scientific knowledge to build a Semantic Web for neuroscience research. The project has three distinct goals:

  • To demonstrate that scientific impact is directly related to the freedom to legally reuse and technically transform scientific information – that Open Access is an essential foundation for innovation.
  • To establish a framework that increases the impact of investment in neuroscience research in a public and clearly measurable manner.
  • To develop a community of neuroscientists, funders of neuroscience research, technologists, physicians, and patients to extend the NeuroCommons work in an open, collaborative, distributed manner.

Background

Today’s life scientist faces a dizzying array of knowledge sources. Peer-reviewed journal articles, online repositories of sequences and pathways, robot-driven data collection, and more must be integrated into experimental design and analysis. Many scientists spend as much time on Google and PubMed as they do at the bench; the difference between success and failure in the lab or clinic can come down to the judicious and timely use of information. But all of this knowledge use is local. The exponential explosion of information in science overwhelms any one individual’s ability to store and model all the relevant science in her head.

The result is a “scalability problem” in the life sciences: while methods for generating information have gone digital, methods for using that information remain stolidly analog. Technology can help. Bandwidth, processing, and storage are cheap. Machines can transmute a string such as “aaattcaggagattacaggta” into a physical molecule of DNA and back again, making genetic information truly fungible, something that can be shared via the Web. Advances in language processing and ontology development allow for the construction of machine-readable, machine-interpretable representations of scientific information. Logic and reasoning engines can crawl across massive data sets and come back with suggestions about causation.
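As a toy illustration of that fungibility, the short Python snippet below treats the sequence quoted above purely as a string and computes the complementary strand that a synthesizer would build and a sequencer would read back. The helper function is hypothetical, not part of any NeuroCommons tooling.

    # Toy sketch: DNA as a plain string, round-tripped to its paired strand.
    COMPLEMENT = str.maketrans("acgt", "tgca")

    def reverse_complement(seq: str) -> str:
        """Return the reverse complement of a DNA sequence string."""
        return seq.translate(COMPLEMENT)[::-1]

    print(reverse_complement("aaattcaggagattacaggta"))  # -> tacctgtaatctcctgaattt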

Unfortunately, it is neither cheap nor easy to seize the moment and apply these technological advances to the human problem. Legal and economic factors have to date muted the impact of new technologies on the life sciences: copyrights and contracts intertwine with software-enforced restrictions on reusing and republishing knowledge in a more usable format. But there is enough information on the Web, in the form of taxpayer-funded databases and openly licensed scientific literature, to demonstrate the utility of a legally open, technically standardized approach to knowledge – and, in so doing, to sow the seeds of a massive change in how scientific knowledge is licensed and reused.

The technological opportunity: Semantic Web

The Semantic Web (SW) is at root a set of common standards for naming and describing the relationships we contemplate and express in text: this gene is active in this disease; this gene is related to that protein. Using these standards lets us republish that kind of knowledge in a format that software – search engines, browsers, statistical analysis tools – can exploit.
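A minimal sketch of what such a machine-readable statement looks like, using Python and the rdflib library. The namespace, gene, disease, and protein names below are hypothetical placeholders, not identifiers from any real ontology:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    # Hypothetical vocabulary; a real deployment would reuse shared ontologies.
    EX = Namespace("http://example.org/neuro/")

    g = Graph()
    g.bind("ex", EX)

    # "This gene is active in this disease, and is related to this protein."
    g.add((EX.GeneX, EX.activeIn, EX.AlzheimersDisease))
    g.add((EX.GeneX, EX.relatedTo, EX.ProteinY))
    g.add((EX.GeneX, RDFS.label, Literal("gene X")))

    print(g.serialize(format="turtle"))

Any SW-aware tool can consume the resulting Turtle without knowing anything about the script that produced it.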

The life sciences represent an ideal test case for the Semantic Web. SW technologies make the most sense where a certain set of conditions holds:

  • A massive amount of data: clinical images, robot-arrayed “gene chips”, machines that can sort materials cell by cell, gene sequencers, and massively high-throughput chemical screens. There are hundreds of public databases, from flies to humans to plants, each potentially able to inform a decision or experimental design.
  • Rapidly changing knowledge: every journal article and every experiment in the lab creates new knowledge about our bodies and the world we live in. This makes it very hard to apply traditional computational approaches, or even to integrate the data. We know what goes into a car – engine, tires, wheels, axles, fenders – and thus we can create a fairly fixed representation of a car for a computer, for model building and more. But we have nothing resembling consensus on questions as fundamental as “what is the role of the non-coding DNA in the human genome?”
  • Distributed knowledge and expertise: the nature of modern life science is specialization. One scientist is an expert on the genetics of Huntington’s Disease (a rare neurodegenerative disease); another is an expert on the impact of protein folding on Alzheimer’s Disease. Both work on the brain, and on many of the same genes and proteins. But they attend different conferences and are hard-pressed to follow the refereed literature outside their own disease. Potential synergies between the researchers go unrealized because their knowledge can’t interoperate without distracting them from the lab.

For this set of problems, SW is a natural fit. Like the Web, SW is intended to scale through decentralization and an emphasis on information reuse. It is a means to capture and network the relationships implicit in high-volume data sets, or in the outputs of sophisticated analytic software. It can relate anything to anything, as long as that anything has a unique name – so data-driven relationships can attach to the descriptions of their related genes and proteins, and to the knowledge about those genes and proteins as described in the scientific literature. Nor does SW require that the picture be complete: if the relationships between one gene and another change as our knowledge changes, the technical burden is no greater than adding another hyperlink between web pages. And the concept of integration around unique names makes it easy to create serendipity between researchers: instead of bumping into a colleague in the hall at the right time, a scientist can see the ecosystem of knowledge around a gene expressed in the brain, whether that knowledge comes from her own work on Alzheimer’s or a distant colleague’s work on Huntington’s. It all gets published to the Semantic Web.
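To make that merging-by-name concrete, here is a sketch in which two independently published fragments mention the same gene URI; simply taking the union of the graphs connects the Alzheimer’s and Huntington’s knowledge. All names are hypothetical placeholders:

    from rdflib import Graph

    # Fragment published by an Alzheimer's lab (hypothetical data).
    alzheimers = Graph().parse(data="""
        @prefix ex: <http://example.org/neuro/> .
        ex:GeneX ex:implicatedIn ex:AlzheimersDisease .
    """, format="turtle")

    # Fragment published by a Huntington's lab (hypothetical data).
    huntingtons = Graph().parse(data="""
        @prefix ex: <http://example.org/neuro/> .
        ex:GeneX ex:implicatedIn ex:HuntingtonsDisease .
    """, format="turtle")

    # Because both fragments name the gene with the same URI, the union
    # graph links the two bodies of knowledge with no extra work.
    merged = alzheimers + huntingtons
    for subject, predicate, obj in merged:
        print(subject, predicate, obj)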

The legal and economic problem

So why do we not already have a vast Semantic Web for the life sciences? Despite the apparent information overload, what restrains the utility of the SW approach is the sparseness of machine-readable scientific knowledge to assist in interpretation. The reasons include technological limits on extracting this information, a lack of consensus around what names to use, and the inherent limitations of natural language processing.

These problems are being solved. The National Institutes of Health has invested in the National Center for Biomedical Ontology, language processing technologies are advancing by leaps and bounds, and public databases are investing in machine readability and open licensing.

But there is a distinct lack of public – free – machine-readable knowledge. Scientific authors transfer their copyrights to journals, motivated by the prospect of citation in a well-known journal, the next grant, or tenure review. Copyright extends for 70 years after the death of the author and can serve as the foundation for contractual restrictions on access to knowledge. Knowledge publishers’ business models are moving from the sale of physical copies of journals to renting access to digital copies. Digital rights management systems enforcing those contracts prevent librarians from running advanced software across the online versions of journal articles, blocking advanced analysis of both the underlying knowledge and scientists’ usage patterns. The laws and technologies guarding such knowledge intertwine to create a set of barriers to open access and reuse.

A path forward: The NeuroCommons Project

The NeuroCommons Project is a joint effort by Science Commons and the Teranode Corporation that aims to demonstrate the power of the Semantic Web approach when built on Open Access information. We have assembled an initial community of neuroinformaticists, practicing neuroscientists, Semantic Web experts, and language experts to ensure our work is accurate and scientifically valid. The first stage is underway:

  • Using automated technologies, extract machine-readable representations of neuroscience-related knowledge from the full-text Open Access literature, free text such as PubMed abstracts, and legally open databases.
  • Assemble those representations into a Semantic Web for neuroscience.
  • Publish the resulting graph freely.
  • Build a standard software implementation to store, update, and manage changes to the graph as knowledge evolves (a query sketch follows this list).
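Once the graph is published, anything that speaks the standards can query it. A minimal sketch using rdflib and SPARQL, with the same hypothetical vocabulary as above; the URL is a placeholder, not an actual NeuroCommons endpoint:

    from rdflib import Graph

    g = Graph()
    # Placeholder location; the published graph would be fetched from
    # wherever the project releases it.
    g.parse("http://example.org/neurocommons-graph.rdf")

    # Ask the graph: which genes are implicated in which diseases?
    results = g.query("""
        PREFIX ex: <http://example.org/neuro/>
        SELECT ?gene ?disease
        WHERE { ?gene ex:implicatedIn ?disease . }
    """)

    for gene, disease in results:
        print(gene, disease)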

Plans for stage two of the project involve the deployment of additional software infrastructure; the development of operational manuals so that interested parties can “port” the entire NeuroCommons approach into new scientific domains without involving Science Commons; new publishing techniques to automatically add knowledge to the NeuroCommons graph; and active community development.

About the project

Teranode provides direct financial support to the NeuroCommons project as well as in-kind donations of software and services. Jonathan Rees, formerly in charge of the curated protein-protein interaction database at Millennium Pharmaceuticals and a veteran of MIT’s Project MAC, leads the project on a day-to-day basis as a Science Commons Fellow. The project is deeply involved with the World Wide Web Consortium’s Health Care and Life Sciences Interest Group as well as MIT’s Computer Science and Artificial Intelligence Laboratory (which hosts Science Commons).