Tactical Formalization of Linked Open Data
@micheldumontier::RDA Workshop:23-02-2015
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Linked Open Data provides an incredibly dynamic, rapidly growing set of interlinked resources
Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the representation and interlinking of biological data
• chemicals/drugs/formulations
• genomes/genes/proteins, domains
• interactions, complexes & pathways
• animal models and phenotypes
• disease, genetic markers, treatments
• terminologies & publications
11B triples from 35 datasets
Each dataset is described using its own schema.
Query data, ontologies, and services simultaneously using Federated Queries over Independent SPARQL DBs
Get all protein catabolic processes (and more specific terms) in BioModels. Query against <http://bioportal.bio2rdf.org/sparql>:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?n)
WHERE {
  ?go rdfs:label ?label .
  ?go rdfs:subClassOf ?tgo .
  ?tgo rdfs:label ?tlabel .
  FILTER regex(?tlabel, "^protein catabolic process")
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
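A federated query like this one is sent to a single endpoint, which then contacts the remote SERVICE itself. The following is a minimal sketch of how a client might package the request, assuming the endpoint accepts the standard SPARQL Protocol `query` parameter over HTTP GET. The helper names `build_request_url` and `decode_query` are hypothetical, the query text is a syntactically tidied copy of the one on the slide, and nothing is sent over the network, since the slides do not say whether the endpoints are still online.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint named on the slide; its availability is not assumed here.
ENDPOINT = "http://bioportal.bio2rdf.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?n)
WHERE {
  ?go rdfs:label ?label .
  ?go rdfs:subClassOf ?tgo .
  ?tgo rdfs:label ?tlabel .
  FILTER regex(?tlabel, "^protein catabolic process")
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def decode_query(url: str) -> str:
    """Recover the SPARQL text from a request URL (handy for logging)."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_request_url(ENDPOINT, QUERY)
print(url[:60] + "...")
```

The resulting URL could be fetched with any HTTP client, typically with an `Accept` header asking for SPARQL JSON or XML results.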
Gene Ontology
Despite all the data, it’s still hard to find answers to questions
Because there are many ways to represent the same data, and each dataset represents it differently
multiple models for the same kind of data do emerge, each with their own merit
This lack of coordination makes Linked Open Data somewhat chaotic and unwieldy
Massive Proliferation of Ontologies / Vocabularies could be harnessed to bring order out of chaos
The Semanticscience Integrated Ontology (SIO) is used to ground Bio2RDF and SADI semantic web services
1300+ classes, 201 object properties (inc. inverses), 1 datatype property
@micheldumontier::NIDM:Jan 26,2015
Ontology Design Patterns
Objects, processes and their attributes.
Multi-Stakeholder Efforts to Standardize Representations are Reasonable,
Long Term Strategies for Data Integration
tactical formalization
Turn linked data into whatever you need
STANDARDS
• W3C
• Community standards
• Application standards
• Visualization standards
APPLICATION SPECIFIC
• Drug repurposing
• Verifying annotations in biomodels
• Discovering aberrant pathways
w/ Deborah McGuinness, Jim McCusker @ RPI
ReDrugS uses Nanopublications
• A nanopublication is a structured digital object to associate a statement composed of one or more triples with its evidence/provenance, and digital object metadata.
• http://nanopub.org
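To make the definition concrete, here is a minimal sketch of a nanopublication as three named graphs: assertion, provenance, and publication info. It is an illustration under stated assumptions, not the ReDrugS implementation: the `Nanopublication` class, the `example.org` URIs, and the simplified TriG-like output are all hypothetical, and the head graph and literal values found in real nanopublications are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)


@dataclass
class Nanopublication:
    uri: str
    assertion: List[Triple]                                  # the statement itself
    provenance: List[Triple] = field(default_factory=list)   # evidence for the assertion
    pubinfo: List[Triple] = field(default_factory=list)      # metadata about the nanopub

    def graphs(self) -> Dict[str, List[Triple]]:
        """Map each named-graph URI to its triples."""
        return {
            self.uri + "#assertion": self.assertion,
            self.uri + "#provenance": self.provenance,
            self.uri + "#pubinfo": self.pubinfo,
        }

    def to_trig(self) -> str:
        """Serialize as simplified TriG-like text (URIs only, no prefixes)."""
        lines = []
        for name, triples in self.graphs().items():
            lines.append(f"<{name}> {{")
            for s, p, o in triples:
                lines.append(f"  <{s}> <{p}> <{o}> .")
            lines.append("}")
        return "\n".join(lines)


# Hypothetical example data; the example.org URIs are placeholders.
np_uri = "http://example.org/np/1"
example = Nanopublication(
    uri=np_uri,
    assertion=[("http://example.org/drug/A",
                "http://example.org/vocab/targets",
                "http://example.org/protein/B")],
    provenance=[(np_uri + "#assertion",
                 "http://www.w3.org/ns/prov#wasDerivedFrom",
                 "http://example.org/dataset/source")],
    pubinfo=[(np_uri,
              "http://purl.org/dc/terms/creator",
              "http://example.org/agent/generator")],
)
print(example.to_trig())
```

Keeping the assertion in its own graph is what lets the provenance graph make statements about it as a unit.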
nanopublications are dynamically generated from a SPARQL query
Fit for purpose to a target schema
Here, we make an ontological commitment
we can do more
Have you heard of OWL?
SBML-based biomodels place semantic annotations in an annotation element
<species metaid="_525530" id="GLCi" compartment="cyto">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
Callouts: subject (rdf:about), predicate / qualifier (bqbiol:is), object (rdf:li resources)
The intent is to express that the species represents a substance composed of glucose molecules.
We also know from the SBML model that this substance is located in the cytosol, with an (initial) concentration of 0.09765 M.
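The subject/predicate/object reading of the annotation can be sketched in code. The snippet below is a minimal illustration using Python's standard `xml.etree.ElementTree`; the function name `annotation_triples` is hypothetical, and the XML is the species element from the slide with straight quotes. It only handles the `rdf:Description` / qualifier / `rdf:Bag` shape shown here, not SBML annotations in general.

```python
import xml.etree.ElementTree as ET

SBML_SPECIES = """
<species metaid="_525530" id="GLCi" compartment="cyto">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def annotation_triples(xml_text):
    """Return (subject, predicate, object) tuples from an annotated SBML element."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for desc in root.iter(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")          # e.g. "#_525530"
        for qualifier in desc:                          # e.g. bqbiol:is
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            predicate = qualifier.tag.replace("{", "").replace("}", "")
            for li in qualifier.iter(f"{{{RDF}}}li"):
                triples.append((subject, predicate, li.get(f"{{{RDF}}}resource")))
    return triples


triples = annotation_triples(SBML_SPECIES)
for t in triples:
    print(t)
```

Each `rdf:li` entry yields one triple, so the two resources in the bag become two statements about the same species.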
By converting models into formal representations of knowledge we get to:
o capture the semantics of models and the biological systems they represent
o leverage knowledge explicit in linked terminologies
o validate the accuracy of the annotations / models
o discover biological implications inherent in the models
o query the results of simulations in the context of the biological knowledge
Model verification
After reasoning, we found 27 models to be inconsistent
Reasons:
1. Our representation: functions are sometimes found in the place of physical entities (e.g. entities that secrete insulin). Better to constrain with appropriate relations.
2. SBML abused: e.g. species used as a measure of time.
3. Incorrect annotations: constraints in the ontologies themselves mean that the annotation is simply not possible.
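The kind of inconsistency described above can be illustrated with a toy check: an entity is flagged when its annotations, once expanded to their superclasses, fall under two classes declared disjoint. This is a sketch under stated assumptions and not the OWL reasoning used in the cited work. The class names, the `PARENTS` hierarchy, the `DISJOINT` pair, and the species identifier `s2` are all hypothetical.

```python
def ancestors(cls, parents):
    """Return cls plus every superclass reachable through the parents map."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def inconsistent_annotations(annotations, parents, disjoint):
    """Return {entity: (class_a, class_b)} for entities typed by two disjoint classes."""
    problems = {}
    for entity, classes in annotations.items():
        inferred = set()
        for cls in classes:
            inferred |= ancestors(cls, parents)
        for a, b in disjoint:
            if a in inferred and b in inferred:
                problems[entity] = (a, b)
                break
    return problems


# Hypothetical toy ontology: every species is modelled as a substance,
# and physical entities are declared disjoint from functions.
PARENTS = {
    "substance": ["physical entity"],
    "glucose": ["physical entity"],
    "insulin secretion": ["function"],
}
DISJOINT = [("physical entity", "function")]
ANNOTATIONS = {
    "GLCi": ["substance", "glucose"],          # consistent
    "s2": ["substance", "insulin secretion"],  # a function where a physical entity is expected
}

problems = inconsistent_annotations(ANNOTATIONS, PARENTS, DISJOINT)
print(problems)
```

A full reasoner derives such clashes from the ontology axioms themselves; the explicit disjointness list here stands in for that.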
Integrating systems biology models and biomedical ontologies. BMC Syst Biol. 2011 Aug 11;5:124. doi: 10.1186/1752-0509-5-124.
Identification of drug and disease enriched pathways
• Approach
  – Integrated 3 datasets & 7 terminologies
    • DrugBank, PharmGKB and CTD
    • MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO
  – Formalized into an OWL-EL ontology
    • 4-class top-level ontology
    • logical class axioms
    • 650,000+ classes, 3.2M subClassOf axioms
  – Identified significant associations using enrichment analysis over the fully inferred knowledge base
Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012.
Benefit 1: Enhanced Query Capability
– Use any mapped terminology to query a target resource.
– Use knowledge in target ontologies to formulate more precise questions.
  • ask for drugs that are associated with diseases of the joint: ‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124).
– Learn relationships that are inferred by automated reasoning.
  • alcohol (CHEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236)
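The alcohol/alcoholism example can be sketched as propagation of an association up a class hierarchy. This is a minimal illustration and not the OWL-EL reasoning used in the work: the function names are hypothetical, and the single subclass link (ethanol as a kind of alcohol) is an assumption made here to explain the inference stated on the slide.

```python
def superclasses(cls, parents):
    """Return cls plus all of its superclasses in the parents map."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def inferred_associations(direct, parents):
    """Lift each (chemical, disease) association to every superclass of the chemical."""
    inferred = set()
    for chemical, disease in direct:
        for cls in superclasses(chemical, parents):
            inferred.add((cls, disease))
    return inferred


# Assumed subclass link: ethanol (CHEBI:16236) is a kind of alcohol (CHEBI:30879).
PARENTS = {"CHEBI:16236": ["CHEBI:30879"]}
# Direct association from the slide: ethanol with alcoholism (PA443309).
DIRECT = {("CHEBI:16236", "PA443309")}

associations = inferred_associations(DIRECT, PARENTS)
print(sorted(associations))
```

A query phrased in terms of the broader class then returns the association recorded for the more specific one.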
Benefit 2: Knowledge Discovery through Enrichment Analysis
• OntoFunc: Tool to discover significant associations between sets of objects and ontology categories.
• We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease.
  – Mood disorder (do:3324) associated with Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is used to treat HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia.
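The slides do not state which statistic OntoFunc applies. As an assumption for illustration, the sketch below uses a one-sided hypergeometric test, a common choice for enrichment analysis: given `N` genes in total, `K` of them linked to a disease, and a pathway of `n` genes of which `k` are disease-linked, it computes the probability of seeing at least `k` by chance. The function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N, K of which are in the category."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers: 10 genes in total, 4 linked to the disease,
# a pathway with 3 genes, 2 of which are disease-linked.
p = hypergeom_pvalue(k=2, n=3, K=4, N=10)
print(round(p, 4))
```

With thousands of disease-pathway pairs tested, p-values like this would normally be corrected for multiple testing before calling an association significant.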
http://tiny.cc/hcls-datadesc-ed
61 metadata elements
Editor & Validator underway
Core
• Identifiers
• Title
• Description
• Attribution
• Homepage
• License
• Language
• Keywords
• Concepts and vocabularies used
• Standards
• Publication
Provenance and Change
• Version number
• Source
• Provenance: retrieved from, derived from, created with
• Frequency of change
Availability
• Format
• Download URL
• Landing page
• SPARQL endpoint
13 Content Statistics
• With SPARQL queries
NIH Big Data to Knowledge (BD2K) Center of Excellence
Mark Musen (PI), Michel Dumontier (Co-I), Purvesh Khatri (Co-I), Olivier Gevaert (Co-I)
Goals:
• Tools to facilitate template construction and semi-automated metadata annotation
• Enable dataset discovery & reuse
• Partnership with ImmPort (immunology repository), BioSharing, and the Stanford Library
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier