The success of a scientific investigation is determined in part by its ability to locate and make effective use of relevant prior work. Automated literature search is a basic tool used by all scientists, but the computer and the Internet have potential for search and integration far beyond what can be done with keyword-based search. However, a prerequisite for automated exploitation of scientific information is that it be in a consistent format that can be processed meaningfully and accurately by software. We need links among literature, data records, real-world entities, and abstract concepts, with formal definitions of each link’s endpoints and type. Applications need to use common identifiers for endpoints so that mentions of shared entities can be matched. This discipline of links, definitions, and identification is exactly what the framework of the semantic web provides.
In 2007 Science Commons intends to roll out artifacts and demonstrations that show the construction of a semantic web for science – in particular, for neuroscience. Our efforts toward such a “Neurocommons” are in three areas:
- Data integration
- Text mining
- Analytic tools
We are integrating information from a variety of standard sources to establish core RDF content that can be used as a basis for bioinformatics applications. The combined whole is greater than the sum of its parts, since queries can cut across combinations of sources in arbitrary ways.
We’re building this prototype as a demonstration – to show others that an open, integrated knowledge base is within reach. Anyone can build infrastructure out of reusable parts and use it computationally in novel ways.
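To make the idea of queries that cut across sources concrete, here is a minimal sketch in plain Python. It is not the Neurocommons code or its RDF tooling: triples are modeled as simple tuples, and every identifier and predicate (`ex:gene/G1`, `ex:encodes`, `ex:associatedWith`, and so on) is a hypothetical placeholder. The point it illustrates is that once two sources use the same identifier for a protein, a union of their triples can answer a question neither source can answer alone.

```python
# Hypothetical identifiers and predicates, for illustration only.
# Each source is a set of (subject, predicate, object) triples.
gene_source = {
    ("ex:gene/G1", "ex:encodes", "ex:protein/P1"),
    ("ex:gene/G2", "ex:encodes", "ex:protein/P2"),
}
disease_source = {
    ("ex:protein/P1", "ex:associatedWith", "ex:disease/D1"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def genes_for_disease(triples, disease):
    """Join across sources: gene -> protein -> disease."""
    genes = set()
    for protein, _, _ in match(triples, p="ex:associatedWith", o=disease):
        for gene, _, _ in match(triples, p="ex:encodes", o=protein):
            genes.add(gene)
    return sorted(genes)


# Merging is a set union because both sources share the protein identifier.
merged = gene_source | disease_source
print(genes_for_disease(merged, "ex:disease/D1"))
```

Run against `gene_source` alone, the same query finds nothing, because the link from protein to disease lives only in the second source.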
The scientific literature consists mostly of text. Entities discussed in the text, such as proteins and diseases, need to be specifically identified for computational use, as do the entities’ relationships to the text and the text’s assertions about the entities (for example, a particular asserted relationship between a protein and a disease). Manual annotation by an author, editor, or other “curator” may capture the text’s meaning accurately in a formal notation, but it is unlikely to keep pace with the volume of the literature. Automated natural language processing (including entity extraction and text mining) is therefore likely to be the only practical method for opening up the literature for computational use.
As a prototype we have generated RDF annotations for a subset of PubMed abstracts. We will follow on with a broader set of entities and relations, and a broader set of articles drawn from the open access literature.
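As a rough illustration of what annotating an abstract can mean, the sketch below matches a small lexicon against the text and emits one triple per entity found. This is an assumed, deliberately naive dictionary lookup, not the natural language processing pipeline used for the prototype; the lexicon entries, the `ex:mentions` predicate, and the abstract identifier are all hypothetical.

```python
import re

# Hypothetical lexicon mapping surface forms to made-up entity identifiers.
LEXICON = {
    "protein a": "ex:protein/A",
    "disease x": "ex:disease/X",
}


def annotate(abstract_id, text):
    """Emit (abstract, ex:mentions, entity) triples for each lexicon hit."""
    triples = []
    lowered = text.lower()
    for surface, entity in sorted(LEXICON.items()):
        # Word boundaries avoid matching inside longer tokens.
        pattern = r"\b" + re.escape(surface) + r"\b"
        if re.search(pattern, lowered):
            triples.append((abstract_id, "ex:mentions", entity))
    return triples


for triple in annotate("ex:abstract/1", "Protein A is elevated in Disease X."):
    print(triple)
```

A lookup like this records only that an entity is mentioned. Capturing what the text asserts about the entities, such as a relationship between the protein and the disease, requires relation extraction on top of it.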
The application of prior knowledge to experimental data can lead to fresh insights. For example, a set of genes or proteins derived from high-throughput experiments can be statistically scored against sets of related entities derived from the literature. Sets that score well may point to the biological processes at work in the experiment.
To illustrate the value of semantic web practices, we are developing statistical applications that exploit information extracted from RDF data sources, including both conversions of structured information (such as Gene Ontology annotations, GOA) and relationships extracted from literature. The first tools we hope to roll out are activity center analysis for gene array data, and set scoring for profiling of arbitrary gene sets.
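One common way to score an experimental gene set against reference sets is a hypergeometric overlap test, sketched below. The text does not say which statistic the set scoring tool uses, so treat this as an assumed stand-in rather than a description of it. The function names, gene labels, and universe size are hypothetical, all sets are assumed to be drawn from the same gene universe, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_tail(overlap, universe, set_size, sample_size):
    """P(X >= overlap) when sample_size genes are drawn without replacement
    from a universe in which set_size genes belong to the reference set."""
    upper = min(set_size, sample_size)
    total = comb(universe, sample_size)
    favorable = sum(
        comb(set_size, k) * comb(universe - set_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


def score_sets(experimental, reference_sets, universe_size):
    """Rank reference sets by overlap p-value, smallest (best) first."""
    scored = []
    for name, members in reference_sets.items():
        overlap = len(experimental & members)
        p = hypergeom_tail(overlap, universe_size, len(members), len(experimental))
        scored.append((name, overlap, p))
    return sorted(scored, key=lambda row: row[2])


# Hypothetical data: three experimental genes and two literature-derived sets.
experimental = {"g1", "g2", "g3"}
reference_sets = {
    "pathway_a": {"g1", "g2", "g3", "g4"},
    "pathway_b": {"g9"},
}
for name, overlap, p in score_sets(experimental, reference_sets, universe_size=100):
    print(name, overlap, p)
```

A small p-value means the overlap is larger than chance alone would readily produce, which is what marks a reference set as worth a closer look.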