OBD:SPARQL-GO

From NCBO Wiki

Introduction: A SPARQL EndPoint for GO annotations

This page describes a preliminary, experimental pilot project that provides a SPARQL endpoint to the 6 million or so annotations in the Gene Ontology database.

We use D2RQ to map the GO DB schema to RDF and d2r-server as a SPARQL endpoint.

Quick Guide

Before you start: please use the endpoint carefully. It runs against a non-production database, so there are no serious consequences if a killer query grinds it to a halt, but a stalled server prevents others from seeing the demo. Please contact me if the demo is down.

The SPARQL endpoint is here:

A snorql web interface is here:

(Firefox only)

(Note: these may be down intermittently.)

A query for generating GO-annotation tabular format annotations:

(for illustrative purposes only - please execute queries with fewer joins)

 SELECT
   ?gp ?gpname ?role ?pub ?ev ?with ?gpsyn ?type ?sp
 WHERE {
   ?gp rdfs:label ?gpname .
   FILTER regex(?gpname, "abd-A") .
   ?gp oban:has_role ?role .
   ?stmt rdf:type rdf:Statement .
   ?stmt rdf:subject ?gp .
   ?stmt rdf:object ?role .
   ?stmt oban:has_evidence ?ev .
   ?ev rdf:type ?evcode .
   ?stmt oban:has_source ?pub .
   ?gp rdf:type ?type .
   OPTIONAL { ?gp skos:altLabel ?gpsyn } .
   ?gp genomics:part_of_organism ?sp
 }
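As a rough sketch of how the result of a SELECT query like the one above could be flattened into tab-delimited rows: the following assumes the endpoint can return the standard SPARQL JSON results format. The function name, the sample bindings and the column order are made up for illustration and do not come from the GO database.

```python
import json

def bindings_to_rows(results_json, columns):
    """Flatten SPARQL JSON results into tab-delimited lines.

    Unbound variables (e.g. from an OPTIONAL clause) become empty fields.
    """
    data = json.loads(results_json)
    rows = []
    for binding in data["results"]["bindings"]:
        fields = [binding.get(col, {}).get("value", "") for col in columns]
        rows.append("\t".join(fields))
    return rows

# Hypothetical response; the URIs and values are illustrative only.
SAMPLE = json.dumps({
    "head": {"vars": ["gp", "gpname", "role", "gpsyn"]},
    "results": {"bindings": [
        {
            "gp": {"type": "uri", "value": "http://example.org/gp/1"},
            "gpname": {"type": "literal", "value": "abd-A"},
            "role": {"type": "uri", "value": "http://example.org/GO_0000001"},
            # ?gpsyn is left unbound, as happens when OPTIONAL finds nothing
        }
    ]},
})

ROWS = bindings_to_rows(SAMPLE, ["gp", "gpname", "role", "gpsyn"])
```

Each row keeps one field per requested column, so a missing synonym shows up as a trailing empty field rather than shifting the other columns.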

Or this (choose XML for results):

 CONSTRUCT {
   ?gp rdfs:label ?gpname .
   ?gp rdf:type ?type .
   ?gp skos:altLabel ?gpsyn .
   ?gp genomics:part_of_organism ?sp
 }
 WHERE {
   ?gp rdfs:label ?gpname .
   FILTER regex(?gpname, "abd-A") .
   ?gp rdf:type ?type .
   OPTIONAL { ?gp skos:altLabel ?gpsyn } .
   ?gp genomics:part_of_organism ?sp
 }
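Queries like these can also be submitted programmatically. The sketch below builds (but does not send) an HTTP GET request following the SPARQL protocol, where the query text travels in a `query` parameter. The endpoint URL is an assumption based on d2r-server's usual local default, since the real address is not given on this page; the query is a cut-down version with a LIMIT to keep the load small.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: d2r-server's usual local address; the real endpoint URL
# is not given on this page.
ENDPOINT = "http://localhost:2020/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?gp ?gpname WHERE {
  ?gp rdfs:label ?gpname .
  FILTER regex(?gpname, "abd-A")
}
LIMIT 10
"""

def build_request(endpoint, query, accept="application/sparql-results+json"):
    """Return a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": accept})

REQUEST = build_request(ENDPOINT, QUERY)
# To actually run it: urllib.request.urlopen(REQUEST).read()
```

For a CONSTRUCT query the `accept` argument would more likely be `application/rdf+xml`, matching the "choose XML for results" hint above.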

The D2RQ mapping is available here:

Background

This assumes some familiarity with the GO and its 3 ontologies (MF - molecular function, BP - biological process and CC - cellular component):

http://www.geneontology.org

And also with GO annotations to genes and gene products:

http://www.geneontology.org/GO.annotation.shtml
http://www.godatabase.org

We also assume familiarity with RDF, N3 notation and SPARQL.

We start with a brief analysis of the nature of the propositions embodied in GO annotation records; this will form our guide for encoding the annotations as RDF.

GO Annotation: a formal ontological perspective

Gene Ontology terms represent types (aka kinds, universals): they are entities that are repeatable, or instantiated multiple times in nature. "cell nucleus" is the name of a type; there are numerous instances of "cell nucleus" in any portion of my body.

(we must be careful not to confuse the types and instances in reality with their representation in computers)

Gene Product records also represent types - for example, the term "human p53 protein" refers to a type or pattern that is instantiated multiple times in nature (eg trillions of instantiations in any one human instance).

Quick guide: types are those things which are general and repeated, instances are those things which are particular.

GO Functional annotation is a process which results in propositions concerning the relationships between gene product types and GO types. The type of relationship is implicit in a GO annotation record, but it can be assumed to be one of:

- participation: between a gene product and a biological process

- execution: between a gene product and a molecular function

- location: between a gene product and a cellular component

Corresponding definitions of these relations can be found in the OBO Relation ontology. See http://obofoundry.org/ro/

Note that OBO relations are assumed to hold in an ALL-SOME manner when applied to types (see the Relation Ontology reference above). This means that when we say "p53 protein participates_in DNA_repair" we are saying that ALL instances of the type p53 protein participate in SOME DNA_repair process. Obviously there are some instances of the p53 protein which do no such thing. For a deeper analysis of this issue, see [Ref: Hill/Smith; forthcoming].
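The ALL-SOME reading can be made concrete with a small sketch over instance-level data. Everything here is hypothetical (the instance names, the sets and the `all_some` helper); it only illustrates why a single idle p53 molecule is enough to falsify the strict type-level statement.

```python
def all_some(instances_of_a, instances_of_b, related):
    """True if every instance of A is related to at least one instance of B."""
    return all(
        any((a, b) in related for b in instances_of_b)
        for a in instances_of_a
    )

# Hypothetical instance-level data: three p53 molecules, two repair events.
P53 = {"p53_1", "p53_2", "p53_3"}
DNA_REPAIR = {"repair_1", "repair_2"}
PARTICIPATES_IN = {
    ("p53_1", "repair_1"),
    ("p53_2", "repair_1"),
    ("p53_2", "repair_2"),
}

# p53_3 takes part in no repair event, so the strict reading fails.
STRICT = all_some(P53, DNA_REPAIR, PARTICIPATES_IN)
# Restricted to the molecules that do participate, the relation holds.
WITHOUT_IDLE = all_some(P53 - {"p53_3"}, DNA_REPAIR, PARTICIPATES_IN)
```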

In addition, many GO annotation records contain gene IDs rather than gene product IDs. There is an implicit extra level of indirection here: we do not mean that gene X is localised to the ER; we mean the gene product encoded by X is localised to the ER.

RDF Representation

We ignore the legacy of the current GO RDF exports (which were crystallised before even the introduction of DAML+OIL). We provide a new mapping that is better geared towards current semantic web technology.

Namespaces

The namespaces are not final; they are just for illustration. The RDF/OWL properties used come from:

 PREFIX oban: <http://localhost:2020/resource/vocab/>

and are as yet undefined. In future, all relations will be defined in RO.

The GO namespace is set to http://www.purl.org/obo/owl

Representing the links between genes and GO types

In choosing how to represent the propositions in GO annotation files using RDF there is a tension between the 'ontologically correct' representation, and the representation that is most expedient in terms of current semantic web technology.

For example, whilst it is undeniable that genes and gene products are types, the natural representation of a type using owl:Class or rdfs:Class is problematic. In OWL layered on RDF, a class-level relationship between two classes uses at least 3 triples:

 human:p53 a owl:Class ;
           rdfs:subClassOf so:protein ;
           rdfs:subClassOf [
                           a owl:Restriction ;
                           owl:onProperty ro:participates_in ;
                           owl:someValuesFrom go:DNA_repair
                           ] .

Contrast this with a simpler instance-level representation:

 human:p53 a so:protein ;
           ro:participates_in go:DNA_repair .

A simpler encoding will be more efficient and less error-prone.

In addition, using a class-level representation will require some kind of owl entailment in order to get appropriate query results.
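To illustrate the difference in query effort, here is a minimal sketch that stores both encodings as plain tuples and looks up "what participates in DNA repair". It assumes the usual OWL encoding of an existential restriction as a blank node (`_:r1` is a made-up label), and the two helper functions are hypothetical. No entailment is performed, so subclasses of go:DNA_repair would not be found by either function.

```python
RDF_TYPE = "rdf:type"

INSTANCE_LEVEL = {
    ("human:p53", RDF_TYPE, "so:protein"),
    ("human:p53", "ro:participates_in", "go:DNA_repair"),
}

CLASS_LEVEL = {
    ("human:p53", RDF_TYPE, "owl:Class"),
    ("human:p53", "rdfs:subClassOf", "so:protein"),
    ("human:p53", "rdfs:subClassOf", "_:r1"),
    ("_:r1", RDF_TYPE, "owl:Restriction"),
    ("_:r1", "owl:onProperty", "ro:participates_in"),
    ("_:r1", "owl:someValuesFrom", "go:DNA_repair"),
}

def participants_instance_level(triples, process):
    """One triple pattern: ?x ro:participates_in <process>."""
    return {s for (s, p, o) in triples
            if p == "ro:participates_in" and o == process}

def participants_class_level(triples, process):
    """Three-way join through the restriction node."""
    restrictions = {s for (s, p, o) in triples
                    if p == "owl:someValuesFrom" and o == process}
    restrictions = {r for r in restrictions
                    if (r, "owl:onProperty", "ro:participates_in") in triples}
    return {s for (s, p, o) in triples
            if p == "rdfs:subClassOf" and o in restrictions}

FROM_INSTANCES = participants_instance_level(INSTANCE_LEVEL, "go:DNA_repair")
FROM_CLASSES = participants_class_level(CLASS_LEVEL, "go:DNA_repair")
```

Both lookups return the same answer, but the class-level one needs three times as many triples and a join across the restriction node.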

There is a partial assumption with the semantic web that all "data" will be instances, and RDF/RDFS will suffice for most queries, and only "ontologies" will require OWL. However, scientific annotation (whilst in the realm of data and based on scientific experiments on instances) produces propositions concerning types/universals. The GO database has millions of such type-level assertions.

The most expedient solution with RDF would be to abandon ontological accuracy and simply treat genes and gene products as instances/individuals. This is of course the natural solution to those versed in object-oriented modeling or traditional database record-oriented modeling - see also BioPAX and this paper:

 Experience Using OWL DL for the Exchange of Biological Pathway
 Information
 http://www.mindswap.org/2005/OWLWorkshop/sub37.pdf

And

Alan Ruttenberg, Jonathan Rees and Jeremy Zucker. What BioPAX communicates and how to extend OWL to help it (OWLED 2006)


However, if we abandon ontological principles it is difficult to see how interoperation will be possible on the semantic web; without adhering to definitions along principles such as those laid down by the OBO Foundry it is likely the semantic web will end up a giant mish-mash of incompatible representations.

Given the limitations of current tools, we take the expedient approach here as a first pass, in order to demonstrate the mapping. Community consensus will have to be reached on the issues above before this service is declared final/stable.

One compromise may be the use of so-called "representative instances". These are instances with class-like properties. Details still need to be worked out - for example, should "human p53 protein" be treated as the name of a representative instance, or should annotation records refer to an anonymous representative instance of a class named "human p53 protein"?

Genes and Gene Products

Each gene or gene product is represented as an individual of the appropriate Sequence Ontology class.

E.g.

 human:p53 a so:protein .

To find all 'instances' of so:protein whose name matches a pattern, you can execute this query:

 SELECT * WHERE {
   ?gp rdf:type so:protein .
   ?gp rdfs:label ?name .
   FILTER regex(?name, "abc")
 }

To match genes and gene products of any type, and return the type as well:

 SELECT * WHERE {
   ?gp rdf:type ?type .
   ?gp rdfs:label ?name .
   FILTER regex(?name, "abc")
 }

The gene type is a URI of the form:

 http://www.purl.org/obo/owl/SO#gene

The intent is to use classes from the sequence ontology. For simplicity these are represented as SO#gene and SO#protein for now. In future they will map to actual SO URIs; in the interim a trivial bridging ontology consisting of owl:equivalentClass statements could be used.

Gene/Product synonyms

We use skos:altLabel (a placeholder) as the predicate:

 SELECT * WHERE {
   ?gp rdf:type ?type .
   ?gp rdfs:label ?name .
   FILTER regex(?name, "ab") .
   OPTIONAL { ?gp skos:altLabel ?syn }
 }

We could use rdfs:label, but we prefer to distinguish the preferred name from alternate names.

We may consider the OboInOWL synonym model here; or SKOS.

Species

We use a genomics:part_of_organism predicate (will be replaced when appropriate RO relation becomes available); the object is an owl:Class of URI NCBITax:nnnn

The actual taxonomy nodes or graph is not accessible from this sparql endpoint; the idea is that this would be queried/aggregated from elsewhere.

NCBI Taxonomy has the namespace http://www.purl.org/obo/owl/NCBI_Taxon

OWL taken from:

Sequence

The protein sequence is available via oban:has_sequence.

(I believe the test database on which the demo is running lacks sequence data.)

DBXrefs

Database cross-references use the rdfs:seeAlso predicate; you can query by UniProt, MOD ID, ....

Note that the GO database is not very complete in this respect.

Associations

Associations between a gene product and a GO type are represented using a single triple.

Given the preamble concerning the difficulties in ontologically pinning down the relationship between a gene and a GO type, we opt for an expedient solution of using a vague, undefined oban:has_role predicate between the gene/product instance and the GO class.

E.g.:

 human:p53 oban:has_role GO:1234567 .

Note that having a triple between an instance and a class contravenes W3C best practices (does this lead to OWL Full?). However, this decision was taken for reasons of expediency (simplicity of RDF querying). Other representations are possible, but introduce both extra triples and extra questions regarding the suitability of representing genes as instances.

In future we may represent this as

 p53gene a so:gene ;
         encodes [
                 participates_in go:DNA_repair
         ]

i.e. gene X encodes some protein which participates in some DNA repair process. This is ontologically better but harder to query.

Negative Associations

TODO

Represent as a has_role link to a class (in OWL abstract syntax):

 restriction(allValuesFrom complementOf(GO_ID))

Annotation, evidence and provenance

GO associations are propositions concerning the relationship between a gene/gene product type and a BP/MF/CC type.

We employ RDF reification to represent the source/provenance, evidence and annotation detail. Reification is sometimes frowned upon, but we argue its use is appropriate here.

Here we rely on a theory of the process of biomedical annotation, biomedical data and biomedical investigations. More details will be published in a separate document, and the ontology of biomedical annotation will also be published separately.

The basic idea is as follows. We have a clean separation between entities in the bio-domain (eg p53, DNA repair) and entities in the realm of human investigations, experiments and documents.

Annotation is a process which has some kind of agent (human or computational), takes as input some kind of data source (in GO annotation this is typically a record of scientific experimentation such as a journal paper) filtered by evidence (here using the OBO evidence code terminology, but could in theory use ontologies like OBI) and produces propositions.

Propositions are most naturally modeled as RDF statements. The subject is the gene/product, the object is the GO type, the predicate is the relation.

Note that the statement itself is actually asserted as an RDF triple (see associations, above).

 TODO: show example in n3 here
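Pending a real example, the following sketch shows the general shape of such a reified annotation: the asserted association triple plus an rdf:Statement node carrying evidence and source. All identifiers (`_:stmt1`, `_:ev1`, `pub:example_paper`) are hypothetical placeholders, and the use of rdf:predicate is an assumption taken from the standard reification vocabulary rather than something confirmed by the mapping.

```python
def reify(stmt_id, subject, predicate, obj, evidence, source):
    """Return the asserted triple plus its reified description."""
    return [
        (subject, predicate, obj),                # the association itself
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", subject),
        (stmt_id, "rdf:predicate", predicate),    # assumed, see lead-in
        (stmt_id, "rdf:object", obj),
        (stmt_id, "oban:has_evidence", evidence),
        (stmt_id, "oban:has_source", source),
    ]

def to_n3(triples):
    """Serialise prefixed-name triples one per line, N3 style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

# All identifiers below are hypothetical placeholders.
TRIPLES = reify(
    stmt_id="_:stmt1",
    subject="human:p53",
    predicate="oban:has_role",
    obj="go:DNA_repair",
    evidence="_:ev1",
    source="pub:example_paper",
)
N3_TEXT = to_n3(TRIPLES)
```

Printing `N3_TEXT` gives seven lines: the plain has_role triple first, followed by the six triples that describe the statement node.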

Can be retrieved like this:

 SELECT * WHERE {
   ?gp rdf:type ?type .
   ?gp rdfs:label ?name .
   FILTER regex(?name, "abc") .
   ?gp genomics:part_of_organism ?sp .
   ?gp oban:has_role ?role .
   OPTIONAL { ?gp skos:altLabel ?syn } .
   ?role rdfs:label ?term_label .
   OPTIONAL { ?role rdfs:subClassOf ?superclass } .
   ?stmt rdf:type rdf:Statement .
   ?stmt rdf:subject ?gp .
   ?stmt rdf:object ?role .
   ?stmt oban:has_evidence ?ev 
 }

TODO: currently rdf:Statement instances are conflated with the annotation; in future these will be distinct entities, with instances of the annotationProcess class having a 'posits' predicate linking to the rdf:Statement

TODO: use bnodes rather than internal database IDs

Provenance

oban:has_source, between annotation and URI representing publication

TODO: publication instance is untyped; in future this will come from some ontology of documents

Evidence

oban:has_evidence

TODO: use bnodes rather than internal database IDs

Terms, the ontology and reasoning

The GO database also contains representations of GO types and their relationships, as well as the annotations. This RDF mapping does *not* include the terms and their links to each other (this would be available elsewhere - see OboInOWL).

Of course, this dramatically reduces the utility of the SPARQL endpoint when used in isolation - annotation queries should always incorporate the deductive closure of rules that are corollaries of the definitions in the OBO Relation ontology; otherwise queries for "transmembrane receptor" will not return genes annotated to "GPCR".

However, the idea is that the annotation endpoint will be used in conjunction with other services, including a service for querying GO in OWL, and performing the deductive closure (which will require going beyond RDFS to some fragment of OWL entailment to traverse the transitive part_of links).

In theory, an intelligent query mediator can combine these services and perform this query optimally; in practice these services as they exist for the semantic web today are not particularly efficient (TODO: evaluate DARQ in more detail). It's early days and hopefully this situation will improve; SPARQL is a declarative language so in theory many optimisations are possible, even enough to get queries as fast as a dedicated relational database such as the GO Database (http://www.godatabase.org - admittedly not as fast as it could be due to some software engineering inefficiencies on my part).

We are currently experimenting with DARQ as a query mediator, and with using d2rq in conjunction with Sesame. Both these approaches are likely to be slow. InstanceStore may be faster [is there any way to reuse the d2rq mappings with instancestore? will instancestore support SPARQL? Is SPARQL even a good language for OWL?]

Another approach is to also wrap the ontology tables in the GO database - this includes a table for pre-computing the transitive closure of relationships; this could be used to effectively fake the small fragment of OWL entailment required whilst keeping the basic d2rq query engine at simple RDF entailment. (I think we can do this in d2rq by 'overloading' rdf:type...)

e.g. (TODO)

 SELECT * WHERE {
   ?gp oban:has_role ?class .
   ?class oban:transitive_reflexive_part_of ?query_class .
   ?query_class rdfs:label "endoplasmic reticulum"
 }

This finds gene products that are localised to subtypes or parts of the ER, e.g. ER lumen, rough ER. (This may seem counter-intuitive; reflexive part_of means it also traverses the is_a DAG.)
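The pre-computed closure idea can be sketched in a few lines. This is not how the GO database builds its closure table; it is a minimal illustration under the assumption that is_a and part_of edges are simply pooled into one (child, parent) list. The graph fragment, the annotations and the function names are hypothetical.

```python
from collections import defaultdict

def reflexive_transitive_closure(edges):
    """Map each class to itself plus every ancestor reachable via the edges.

    `edges` is an iterable of (child, parent) pairs, mixing is_a and part_of.
    """
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    closure = {}
    for node in nodes:
        seen = {node}              # reflexive: every class reaches itself
        stack = [node]
        while stack:
            for parent in parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        closure[node] = seen
    return closure

# Hypothetical, simplified fragment of the cellular component graph.
EDGES = [
    ("rough ER", "endoplasmic reticulum"),   # is_a
    ("ER lumen", "endoplasmic reticulum"),   # part_of
    ("endoplasmic reticulum", "cytoplasm"),  # part_of
]
# Hypothetical has_role annotations: gene product -> GO class.
HAS_ROLE = {"gp1": "ER lumen", "gp2": "rough ER", "gp3": "cytoplasm"}

CLOSURE = reflexive_transitive_closure(EDGES)

def annotated_to(query_class):
    """Gene products whose annotated class reaches query_class in the closure."""
    return {gp for gp, cls in HAS_ROLE.items() if query_class in CLOSURE[cls]}
```

With the closure stored up front, answering `annotated_to("endoplasmic reticulum")` is a plain lookup, which is the property that lets a simple RDF query engine stand in for a small piece of entailment.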

Note that this is a simpler scheme than OWL; the following query expresses the same thing with OWL semantics:

 SELECT * WHERE {
   ?gp oban:has_role ?role_inst .
   ?role_inst rdf:type ?class .
   ?class rdfs:subClassOf ?r .
   ?r rdf:type owl:Restriction .
   ?r owl:onProperty ro:part_of .
   ?r owl:someValuesFrom ?query_class .
   ?query_class rdfs:label "endoplasmic reticulum"
 }

This may be harder to implement in d2rq.

Note the definition of ro:part_of is reflexive. There may be some clash with owl semantics here, as properties are not reflexive in owl.


Term names

To help debugging of queries, we do map the term names themselves, but not their relationships, or any term metadata such as synonyms.

 SELECT * WHERE {
   ?gp rdf:type ?type .
   ?gp rdfs:label ?name .
   FILTER regex(?name, "ab") .
   ?gp genomics:part_of_organism ?sp .
   ?gp oban:has_role ?role .
   OPTIONAL { ?gp skos:altLabel ?syn } .
   ?role rdfs:label ?term_label .
   OPTIONAL { ?role rdfs:subClassOf ?superclass }
 }

The representation of GO types in OWL (and thus RDF) is a separate project in its own right; see

http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page

Note the service may later discontinue providing ontology detail such as term names; or it may encompass the entire ontology (currently d2rq does not have any entailment rules so there wouldn't be much point; it can in theory be used with Jena performing the entailment but this is reputedly very slow).

Conclusions and future work

Representation of entities

Discussion on RDF mapping here; representing types/universals as instances vs classes. Realism vs RDF practicalities trade-off...

Efficiency issues

Resource Discovery and service metadata

Currently there is no way to query the endpoint to discover what kind of data it serves.

See http://www.openlinksw.com/weblog/oerling/index.vspx?page=&id=1085

GO Database Service

If there is sufficient interest, this may become a permanent stable service of the GO Database or NCBO. This is dependent on there being some kind of throttle capability on the endpoint to alleviate/stop killer queries.

Applications

The first application using this will probably be overlaying GO annotations over gene glyphs in the new AJAX-GBrowse (http://genome.biowiki.org).

  • Tabulator?
  • Disco
