Semantic web support for biostatistics with R/Rswub (W3C SWLS position paper)

VJ Carey (stvjc@channing.harvard.edu)

September 14, 2004

NOTE ADDED May 3 2010

The Rswub package mentioned in this document is no longer maintained, because various Java infrastructure components went stale. Rredland, an RDF processing library, is still available from Bioconductor.

END OF NOTE ADDED May 3 2010

This paper describes software that makes semantic web methodology more accessible to biostatisticians and informaticians interested in statistical data analysis. The tools described are under development in a popular open source statistical computing environment, R.

Overview of R/Rswub

R is an open source dialect of S, a language developed at Bell Labs in the 1970s to support interactive data analysis. The S language received the ACM Software System Award for 1998. Open software contributions, in the form of packages, are archived at cran.r-project.org. R has a strong commitment to platform independence and application interoperability. Bidirectional interfaces to C, Java, Perl, and Python are available.

Bioconductor is an independent project for developing software infrastructure and tools for bioinformatic analysis in R. Approximately 100 R packages are distributed at Bioconductor, and thousands of package downloads are recorded monthly.

Rswub (R semantic web utilities for bioinformatics) is an R package currently in development. A draft version can be obtained upon request. Rswub includes a comprehensive interface to Jena, the Java-based framework from HP for RDF and OWL computations. Support tools for resolving RDQL queries and for conducting OWL model manipulation and inference are included. Rswub also includes an LSID resolver and documentation on how to establish an LSID authority. Rswub will be released at www.bioconductor.org in late 2004.

Four tasks that can be confronted with R/Rswub that are of interest to statisticians are:

Each of these tasks has facets that concern semantic web methodology. The first three are addressed in this document. Tools for fragmentation reduction via semantic web methodology are in development.

Annotation conversion

Eric Jain at the Swiss Institute of Bioinformatics has published RDF serializations of a number of resources related to UNIPROT. Most of the serialized files are too large to manipulate conveniently for demonstration purposes, but the database summary is small enough to inspect interactively. In what follows, teletype text preceded by > or + (continuation) denotes user input to R; plain teletype text denotes R output. This document is generated by R.

> library(Rswub)
> dbs <- readRDFModel("http://www.isb-sib.ch/~ejain/rdf/data/databases.rdf.gz", 
+     asGZURL = TRUE)
> dbs
RDF model (instance of com.hp.hpl.jena.rdf.model.impl.ModelCom )
source: /tmp/RtmpK8rObJ/file66334873 
We can determine the number of statements:
> size(dbs)
[1] 518
It is useful to obtain the decomposition of triples by predicate and by subject. The getSplits method obtains these decompositions as R lists:
> sdbs <- getSplits(dbs)
We can now list the predicates in use:
> names(sdbs$bypred)
[1] "comment"                               
[2] "label"                                 
[3] "seeAlso"                               
[4] "subClassOf"                            
[5] "type"                                  
[6] "urn:lsid:uniprot.org:ontology:abstract"
[7] "urn:lsid:uniprot.org:ontology:implicit"
[8] "urn:lsid:uniprot.org:ontology:pattern" 
Comments are usually informative:
> sdbs$bypred[["comment"]][1:3, ]
                                 subj
5  urn:lsid:uniprot.org:databases:122
9   urn:lsid:uniprot.org:databases:49
12  urn:lsid:uniprot.org:databases:23
                                                           obj
5                         Homologous bacterial genes database.
9                                               Gene Ontology.
12 Repository for 3D biological macromolecular structure data.
We now see that there is a mapping between UNIPROT LSIDs and distributed bioinformatic data resources.
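
The decomposition used above can be mimicked in base R. The following is a minimal sketch, not the Rswub implementation: it assumes the triples are already held in a data frame with columns subj, pred and obj (our naming), and the "label" row is invented for illustration.

```r
# Hypothetical triples: the subjects and comments echo the output above,
# while the "label" statement is invented for illustration.
triples <- data.frame(
  subj = c("urn:lsid:uniprot.org:databases:49",
           "urn:lsid:uniprot.org:databases:49",
           "urn:lsid:uniprot.org:databases:23"),
  pred = c("comment", "label", "comment"),
  obj  = c("Gene Ontology.", "GO",
           "Repository for 3D biological macromolecular structure data."),
  stringsAsFactors = FALSE
)

# One data frame of (subj, obj) pairs per predicate, and one data frame
# of (pred, obj) pairs per subject.
splits <- list(
  bypred = split(triples[, c("subj", "obj")], triples$pred),
  bysubj = split(triples[, c("pred", "obj")], triples$subj)
)

names(splits$bypred)
splits$bypred[["comment"]]
```

Holding each predicate's statements in its own data frame is what makes the indexing idiom sdbs$bypred[["comment"]] natural for R users.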

The purpose of a formally defined "pattern" predicate can be assessed:

> sdbs$bypred[[8]][1:4, 2]
[1] http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=%s                                    
[2] http://www.reactome.org/cgi-bin/search?QUERY_CLASS=DatabaseIdentifier&QUERY=SPTREMBL:%s
[3] http://www.tigr.org/tigr-scripts/CMR2/hmm_report.spl?acc=%s                            
[4] http://yeastgfp.ucsf.edu/getOrf.php?orf=%s                                             
359 Levels: 2-DE database at Universidad Complutense de Madrid. ...
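
The patterns appear to be URL templates in which %s marks the slot for a database identifier. A small base R sketch of filling such a template follows; the helper fillPattern and the identifier "12345" are hypothetical and not part of Rswub.

```r
# Pattern copied from the output above; the identifier is a made-up placeholder.
pattern <- "http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=%s"

# Hypothetical helper: substitute an identifier for the first "%s" slot.
# fixed = TRUE matches "%s" literally, so any other "%" characters in a
# URL are left untouched (sprintf would try to interpret them).
fillPattern <- function(pattern, id) {
  sub("%s", id, pattern, fixed = TRUE)
}

url <- fillPattern(pattern, "12345")
url
```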

LSID resolution

There appear to be very few publicly accessible LSID authorities. One, at the University of Wisconsin, deals with limnology.
> ldoc <- getMetadata("urn:lsid:limnology.wisc.edu:dataset:ntlch01")
The output is an XML document that needs to be stripped of its XML header to be processed by Jena. A function delivers the resulting model directly:
> lmod <- LSID2RDF("urn:lsid:limnology.wisc.edu:dataset:ntlch01")
> slmod <- getSplits(lmod)
> names(slmod$bypred)[1:5]
[1] "http://purl.org/dc/elements/1.1/description"       
[2] "http://purl.org/dc/elements/1.1/title"             
[3] "urn:lsid:limnology.wisc.edu:predicates:begindate"  
[4] "urn:lsid:limnology.wisc.edu:predicates:contact"    
[5] "urn:lsid:limnology.wisc.edu:predicates:containedin"
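
An LSID has the form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below splits an identifier into these parts in base R; parseLSID is a hypothetical helper written for this illustration, not an Rswub function, and it performs only a shallow syntax check.

```r
# Hypothetical helper: split an LSID into its named components.
parseLSID <- function(lsid) {
  parts <- strsplit(lsid, ":", fixed = TRUE)[[1]]
  # Expect "urn", "lsid", authority, namespace, object and, optionally,
  # a revision.
  if (length(parts) < 5 || length(parts) > 6 ||
      tolower(parts[1]) != "urn" || tolower(parts[2]) != "lsid") {
    stop("not a well-formed LSID: ", lsid)
  }
  list(authority = parts[3],
       namespace = parts[4],
       object    = parts[5],
       revision  = if (length(parts) == 6) parts[6] else NA_character_)
}

id <- parseLSID("urn:lsid:limnology.wisc.edu:dataset:ntlch01")
id$authority
id$object
```

The authority component is the part a resolver would use to locate the service that answers metadata requests for the identifier.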

OWL representation of statistical datasets

In too many instances, an Excel spreadsheet is the sole encoding of the outcomes of a complex experiment. Semantically rich data representations can be achieved using RDF/OWL models. Considerable technical support is needed "upstream" for the design and population of RDF/OWL models for observational data.

An example of a problem with "upstream" deployment of a standard for data interoperability emerges in the field of microarray archiving. MAGE-ML archives of microarray data at NCI and EBI ArrayExpress have proven in a number of cases to be semantically broken, despite correctness of the MAGE-ML serialization. By "broken" we mean that the interpretation of a data field cannot be derived from the archive. Examples have been reported to the administrators of these projects. The MAGE-OM object model for microarray data has been specified in DAML+OIL, and this has been translated to OWL. Tools for semantic assessment of MAGE-ML archives are not currently available.

Given an effective RDF/OWL representation of an experimental result, Rswub can be used "downstream" for several purposes. Information on individuals of the model can be transformed to statistical data records. A key benefit of working with an OWL representation is that data semantics can be propagated to guide the choice of statistical analysis procedure. For example, detailed information on detection limits can be preserved through the transformation, driving application of censored data analysis methods. RDF statement filtering and inference are supported through an RDQL interface. Ontological inference (e.g., deriving statements through propagating properties across collections, or amalgamating distinct vocabularies) is supported through access to Jena's OWL Lite reasoner. Rule-based derivations are likewise supported. Space constraints forbid illustration of the current interfaces; however, some manipulations with the MAGE-OM model are:

> m <- buildMGED()
loading MGED ontology...
> m
Ontology model (instance of com.hp.hpl.jena.ontology.impl.OntModelImpl )
source: /home/stvjc/R200a/library/Rswub/XML/MGEDOntology.owl 
There are 214 named classes.
Base namespace: http://mged.sourceforge.net/ontologies/MGEDOntology.owl .
> size(m)
[1] 2916
> listOntClasses(m)[1:5]
[1] "Software"        "CellLine"        "ImageFormat"     "ProtocolPackage"
[5] "SubstrateType"  
We do not illustrate the OWL Lite deductions from the initial document, but they expand the size to 4997 statements.
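
As an aside on the detection-limit example mentioned earlier in this section, the following base R sketch shows how a preserved detection limit could drive a censored data analysis. Everything here is assumed for illustration: the data are simulated, the limit of -0.5 is arbitrary, and a normal model fitted by maximum likelihood stands in for whatever method an analyst would actually choose.

```r
set.seed(1)

# Simulated (hypothetical) assay: true values are N(0, 1); readings below
# the detection limit are reported only as "below limit".
limit    <- -0.5
truth    <- rnorm(200)
detected <- truth >= limit
reading  <- ifelse(detected, truth, limit)

# Negative log-likelihood for left-censored normal data: detected readings
# contribute a density term, censored ones contribute the probability of
# falling below the limit.
negLogLik <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])  # optimise log(sigma) so that sigma stays positive
  -(sum(dnorm(reading[detected], mu, sigma, log = TRUE)) +
      sum(!detected) * pnorm(limit, mu, sigma, log.p = TRUE))
}

fit      <- optim(c(mean(reading), log(sd(reading))), negLogLik)
estimate <- c(mean = fit$par[1], sd = exp(fit$par[2]))

# Naive alternative: treat censored readings as if they equalled the limit.
naive <- mean(reading)

estimate
naive
```

The point of the comparison is that the naive mean ignores the censoring, whereas the likelihood uses the limit explicitly; this is only possible if the limit survives the conversion from the OWL model to data records.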

An RDQL query will be used to determine the instances of the PerturbationalDesign class. We build (not shown) the query into variable bd, and then resolve it:

> bd
[1] "SELECT ?inst WHERE ( ?inst, <rdf:type>, <mged:PerturbationalDesign> ) 
  USING mged FOR <http://mged.sourceforge.net/ontologies/MGEDOntology.owl#>, 
          rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>"
> pdi <- RDQLresolve(model(m), bd, "inst")
> as.character(strip2pound(unlist(getStrings(pdi))))[1:4]
[1] "cellular_modification_design" "disease_state_design"        
[3] "stimulus_or_stress_design"    "growth_condition_design"     
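
The construction of the query string is not shown above. One way it could be assembled in base R is sketched below; buildTypeQuery and afterPound are hypothetical helpers written for this illustration, and afterPound only approximates what strip2pound presumably does (keep the text after the "#").

```r
# Hypothetical helper: assemble an RDQL query selecting the instances of
# a class in a given namespace.
buildTypeQuery <- function(className, prefix = "mged",
    ns = "http://mged.sourceforge.net/ontologies/MGEDOntology.owl#") {
  paste0(
    "SELECT ?inst WHERE ( ?inst, <rdf:type>, <", prefix, ":", className, "> ) ",
    "USING ", prefix, " FOR <", ns, ">, ",
    "rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>")
}

bd2 <- buildTypeQuery("PerturbationalDesign")
cat(bd2, "\n")

# Hypothetical stand-in for strip2pound: drop everything up to and
# including the last "#" of a URI.
afterPound <- function(uri) sub("^.*#", "", uri)

afterPound("http://mged.sourceforge.net/ontologies/MGEDOntology.owl#disease_state_design")
```

Parameterising the class name makes it easy to issue the same query for other MGED classes without editing the string by hand.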

Provenance of statistical methodology implementations and analyses

Data provenance methods have been reviewed at several workshops. The provenance of a statistical data analysis consists of the data provenance plus the provenance of the analysis computations. Formal links between implementations of statistical algorithms and the associated literature are important for supporting method discovery and for automating matches between data analysis problems and software solutions. The accompanying figure sketches an ontology for statistical data resources, literature, and software elements, along with some individual instances of the specified classes. Instances of the StatDataResource and StatSoftware classes are related to instances of the key class, StatAnalysisReport, by the employedBy property. StatSoftware instances implement StatMethodDocument instances. Integration of ontological and inferential infrastructure for literature, data and software provenance specification in R promises to be a challenging but ultimately very useful achievement.

[Figure (ontoviz): An ontology for statistical data, software and literature resources, with instances.]
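
The relationships described in this section can be sketched as a small table of statements in base R. Only the property name employedBy and the notion that software implements a method document come from the text; the property label "implements", the instance names and the helper methodsBehind are hypothetical.

```r
# Hypothetical instances linked by the properties described in the text.
provenance <- data.frame(
  subject  = c("expressionData1", "lmFitRoutine", "lmFitRoutine"),
  property = c("employedBy",      "employedBy",   "implements"),
  object   = c("report1",         "report1",      "linearModelPaper"),
  stringsAsFactors = FALSE
)

# Which method documents underlie a given analysis report? Follow
# employedBy back from the report to the resources it used, then follow
# implements forward from any software among them.
methodsBehind <- function(report, triples) {
  used <- triples$subject[triples$property == "employedBy" &
                            triples$object == report]
  triples$object[triples$property == "implements" &
                   triples$subject %in% used]
}

methodsBehind("report1", provenance)
```

A query of this kind is the simplest form of the method discovery discussed above: starting from a report, it recovers the literature behind the software that produced it.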
 
http://www.biostat.harvard.edu/~carey