From biological data to clinical applications: positioning a digital infrastructure for the
future of biomedicine
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics, Department of Biology, School of Computer Science, Institute of Biochemistry, Carleton University
Professeur Associé, Université Laval
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
DERI::Digital Infrastructure for Biomedicine
uncovering a sufficient amount of evidence to support/refute a hypothesis is becoming increasingly difficult
it requires a lot of digging around
continuous growth in research literature
Source:http://www.nlm.nih.gov/bsd/stats/cit_added.html
access to increasing amounts of biomedical data
access to the most effective software to predict, compare and evaluate
ultimately, we answer questions by building sophisticated workflows
What if we could automatically answer a question using available data and services?
The Semantic Web is the new global web of knowledge
It involves standards for publishing, sharing and querying facts, expert knowledge and services
It is a scalable approach to the
discovery of independently formulated and distributed knowledge
Link all the data!!!
something you can search, lookup, link to, query for
and check consistency and veracity of
an emerging linked data network
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
Life Science Data Contributors
• Bio2RDF
• Chem2Bio2RDF
• LODD (HCLS)
• > 40 biological datasets from independent providers
• > 3 billion triples
linked data for the life sciences
An Open Source Project for the Provision of Scalable, Decentralized Data with Global Mirroring
and Customizable Query Resolution
Francois Belleau, Laval University
Marc-Alexandre Nolin, Laval University
Peter Ansell, Queensland University of Technology
Michel Dumontier, Carleton University
Bio2RDF resources are identified using IRIs
• Data providers’ record identifiers are maintained from source
  http://bio2rdf.org/namespace:identifier
• E.g.: DrugBank’s resource IRI for Leucovorin
  http://bio2rdf.org/drugbank:DB00650
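The IRI pattern above can be sketched in a few lines of Python. This is only an illustration of the `namespace:identifier` convention stated on the slide; the function names `bio2rdf_iri` and `split_bio2rdf_iri` are hypothetical helpers, not part of any Bio2RDF tooling.

```python
# Hypothetical helpers illustrating the Bio2RDF IRI convention
# http://bio2rdf.org/namespace:identifier (names are assumptions).
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_iri(namespace: str, identifier: str) -> str:
    """Build a resource IRI, keeping the provider's own record identifier."""
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


def split_bio2rdf_iri(iri: str) -> tuple:
    """Recover (namespace, identifier) from an IRI that follows the pattern."""
    if not iri.startswith(BIO2RDF_BASE):
        raise ValueError(f"not a Bio2RDF IRI: {iri}")
    # Split on the first colon after the base; identifiers may contain colons.
    namespace, _, identifier = iri[len(BIO2RDF_BASE):].partition(":")
    return namespace, identifier


leucovorin = bio2rdf_iri("drugbank", "DB00650")
print(leucovorin)
print(split_bio2rdf_iri(leucovorin))
```

Because the provider's identifier is kept unchanged, the mapping is reversible: the original record can always be recovered from the IRI.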
vocabulary and resource namespaces are used to describe auxiliary resources
• Vocabulary namespaces are used for dataset-specific types and predicates
  http://bio2rdf.org/drugbank_vocabulary:Drug
• Entities arising from n-ary relations are identified in the resource namespace
  http://bio2rdf.org/drugbank_resource:DB00440_DB00650
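As a rough sketch, the two auxiliary namespaces can be generated the same way. The `_vocabulary` and `_resource` suffixes come from the slide; joining the participating identifiers with an underscore is an assumption inferred from the single example `DB00440_DB00650`, and both function names are hypothetical.

```python
# Hypothetical helpers for the auxiliary Bio2RDF namespaces.
BIO2RDF_BASE = "http://bio2rdf.org/"


def vocabulary_iri(dataset: str, term: str) -> str:
    """IRI for a dataset-specific type or predicate."""
    return f"{BIO2RDF_BASE}{dataset}_vocabulary:{term}"


def relation_iri(dataset: str, *identifiers: str) -> str:
    """IRI for an entity arising from an n-ary relation.

    Assumption: participant identifiers are joined with '_',
    as in the drug-drug interaction example on the slide.
    """
    if len(identifiers) < 2:
        raise ValueError("an n-ary relation needs at least two participants")
    return f"{BIO2RDF_BASE}{dataset}_resource:{'_'.join(identifiers)}"


print(vocabulary_iri("drugbank", "Drug"))
print(relation_iri("drugbank", "DB00440", "DB00650"))
```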
Every Bio2RDF dataset now contains provenance metadata
Bio2RDF types include biological, information content & processual entities
CTD: Chemical, Disease, Chemical-Disease Interaction, Chemical-Gene Interaction
Entrez Gene: Gene, Model Organism, Publication
HGNC: Accession Number, Gene, Gene Symbol
iRefIndex: Protein Complex, Protein Interaction
MGI: Gene Marker, Gene Symbol
PharmGKB: Association, Disease, Drug, Gene
SGD: Enzyme, Pathway, Protein, RNA, Reaction, Location, Experiment
Question: Find all proteins that interact with beta amyloid (uniprot:P05067)
SELECT * WHERE {
  ?protein a bio2rdf:Protein .
  ?protein bio2rdf:interacts_with uniprot:P05067 .
}
Heterogeneous biological data on the semantic web is difficult to query
[Figure: which ‘Protein’ type is meant (UniProt Protein, PDB Protein or iRefIndex Protein), and which kind of interaction (physical, pathway or genetic)?]
Uncertainty in what is being said with a simple triple
Imagine a statement between two types, C1 and C2:
  C1 R C2
  nucleus part-of cell
Does it mean:
• For every C1 there is a C2 that is related by R?
• For every C2 there is a C1 that is related by R?
• For some C1, there is a C2 that is related by R, or vice versa?
• Every C1 is a kind of C2, or vice versa?
• C1s and C2s are the same kind?
• There is no C1 that is also a C2?
We need to commit to a particular meaning that can be universally interpreted – this formalization will then hold across datasets.
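One way to make the candidate readings of "C1 R C2" explicit is to write each one in first-order logic. The notation below is an illustrative formalization, not taken from the slides; C1 and C2 are treated as unary predicates and R as a binary relation.

```latex
\begin{align*}
&\text{(1) every } C_1 \text{ has some } C_2:
  && \forall x\,\bigl(C_1(x) \rightarrow \exists y\,(C_2(y) \land R(x,y))\bigr) \\
&\text{(2) every } C_2 \text{ has some } C_1:
  && \forall y\,\bigl(C_2(y) \rightarrow \exists x\,(C_1(x) \land R(x,y))\bigr) \\
&\text{(3) some } C_1 \text{ is related to some } C_2:
  && \exists x\,\exists y\,\bigl(C_1(x) \land C_2(y) \land R(x,y)\bigr) \\
&\text{(4) every } C_1 \text{ is a kind of } C_2:
  && \forall x\,\bigl(C_1(x) \rightarrow C_2(x)\bigr) \\
&\text{(5) } C_1 \text{ and } C_2 \text{ are the same kind:}
  && \forall x\,\bigl(C_1(x) \leftrightarrow C_2(x)\bigr) \\
&\text{(6) no } C_1 \text{ is also a } C_2:
  && \neg\exists x\,\bigl(C_1(x) \land C_2(x)\bigr)
\end{align*}
```

For "nucleus part-of cell", reading (1) says every nucleus is part of some cell, while reading (2) says every cell has some nucleus as a part. These are different claims, which is why a bare triple between two types is ambiguous.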
From linked data to linked knowledge through syntactic and semantic normalization.
RDF-based Linked Data is a great first step, but it’s not enough.
ontology as a strategy to
formally represent and integrate
knowledge
Have you heard of OWL?
The Web Ontology Language (OWL) Has Explicit Semantics
Can therefore be used to capture knowledge in a machine understandable way
SIO provides an OWL ontology for the representation of diverse biomedical knowledge
Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO)
[Figure: dataset + ontology = Knowledge Base. uniprot:P05067 is a uniprot:Protein; refseq:NP_009225.1 is a refseq:Protein; uniprot:Protein and refseq:Protein are each a sio:protein.]
Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. to be presented at Bio-ontologies 2012.
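The idea behind a global schema can be sketched with a toy type hierarchy: once dataset-specific types are declared to be kinds of a shared SIO class, a single query over that class reaches instances from every dataset. The dictionaries and function names below are illustrative assumptions, and the refseq:Protein to sio:protein link is inferred from the figure; a real deployment would rely on an OWL reasoner or SPARQL endpoint rather than this hand-rolled lookup.

```python
# Toy illustration of querying through a global schema.
# Instance -> dataset-specific type (from the figure).
TYPE_OF = {
    "uniprot:P05067": "uniprot:Protein",
    "refseq:NP_009225.1": "refseq:Protein",
}

# Dataset-specific type -> shared ontology class (assumed mappings).
SUBCLASS_OF = {
    "uniprot:Protein": "sio:protein",
    "refseq:Protein": "sio:protein",
}


def all_types(instance: str) -> list:
    """Return the asserted type plus every superclass reachable from it."""
    types = []
    cls = TYPE_OF.get(instance)
    while cls is not None and cls not in types:  # guard against cycles
        types.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return types


def instances_of(target_class: str) -> list:
    """Find every instance whose inferred types include target_class."""
    return sorted(i for i in TYPE_OF if target_class in all_types(i))


print(all_types("uniprot:P05067"))
print(instances_of("sio:protein"))
```

Asking for `uniprot:Protein` returns only the UniProt record, while asking for `sio:protein` returns records from both datasets.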
Use CTD & SGD to find all chemicals and proteins that participate in the same GO process
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT *
FROM <http://bio2rdf.org/ctd>
WHERE {
  ?chemical a sio:SIO_010004 .              # 'chemical entity'
  ?chemical rdfs:label ?chemicalLabel .
  ?chemical sio:SIO_000062 ?process .       # 'is participant in'
  ?process rdfs:label ?processLabel .
  SERVICE <http://sgd.bio2rdf.org/sparql> {
    ?protein a sio:SIO_010043 .             # 'protein'
    ?protein sio:SIO_000062 ?process .
    ?gene sio:SIO_010078 ?protein .         # 'encodes'
    ?gene rdfs:label ?geneLabel .
  }
}
More sophisticated OWL-based Data Integration, Consistency Checking and Discovery
• Checking the consistency of semantic annotations [1]
  – Formalized semantic annotations in SBML models as OWL axioms. Automated reasoning uncovered inconsistencies in 16 models.
    • e.g. alpha-D-glucose phosphate is not the required ATP in an ATP-dependent reaction (GO + ChEBI + disjoint + closure axioms)
• Finding significant biomedical associations [2]
  – found significant associations between genes, drugs, diseases and pathways using DrugBank, PharmGKB, CTD, PID across categories of drugs (ChEBI, ATC, MeSH) and diseases (DO, MeSH)
  – 22,653 pathway-disease type associations (6,304 over; 16,349 under)
    • carcinosarcoma (DOID:4236) and Zidovudine Pathway (PharmGKB:PA165859361)
  – 13,826 pathway-chemical type associations (12,564 over; 1,262 under)
    • drug clopidogrel (CHEBI:37941) with Endothelin signaling pathway (PharmGKB:PA164728163)
1. Integrating systems biology models and biomedical ontologies. BMC Systems Biology. 2011. 5:124
2. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. in press
http://pharmgkb-owl.googlecode.com
Translational Medicine Requires Integration of Patient and Biomedical Data
Integration of patient record data with Linked Open Data through the Translational Medicine Ontology
223 mappings: 60 TMO classes to 201 target classes from over 40 ontologies and 8 datasets
Formalization of the Dubois AD diagnostic criteria for
decision support
# the panel is a textual entity
dubois:panel2 a iao:IAO_0000300 .
dubois:panel2 rdfs:label "Alzheimer Disease diagnostic criteria as reported in panel 2 of dubois et al - pubmed:17616482 [dubois:panel2]" .
# the panel is about alzheimer disease
dubois:panel2 iao:is_about diseasome:74 .
# the panel is from the article
dubois:panel2 ro:part_of <http://bio2rdf.org/pubmed:17616482> .
# the panel is about diagnostic criterion
dubois:panel2 iao:is_about tmo:TMO_0068 .

# inclusion criterion
dubois:10 rdfs:label "Proven AD autosomal dominant mutation within the immediate family [dubois:10]" ;
    a tmo:TMO_0069 ;
    ro:part_of dubois:panel2 ;
    iao:is_about diseasome:74 .

# exclusion criterion
dubois:16 rdfs:label "Major depression [dubois:16]" ;
    a tmo:TMO_0070 ;
    ro:part_of dubois:panel2 ;
    iao:is_about diseasome:74 .
TMKB for pharmaceutical and clinical research, and health care
Pharmaceutical research
• Which existing marketed drugs might potentially be re-purposed for AD because they are known to modulate genes that are implicated in the disease?
  – 57 compounds or classes of compounds that are used to treat 45 diseases, including AD, hyper/hypotension, diabetes and obesity
Clinical research
• Identify an AD clinical trial for a drug with a different mechanism of action (MOA) than the drug that the patient is currently taking
  – Of the 438 drugs linked to AD trials, only 58 are in active trials and only 2 (Doxorubicin and IL-2) have a documented MOA. 78 AD-associated drugs have an established MOA.
Health care
• Have any of my AD patients been treated for other neurological conditions as this might impact their diagnosis?
  – Patient 2 is also being treated for depression.
http://esw.w3.org/topic/HCLSIG/PharmaOntology/Queries
Personal Health Lens
Observation: Patients often look up new/alternative drugs to treat their condition or alleviate side effects.
Opportunity: A patient-centric health care application that identifies contraindications for drugs mentioned on web pages using the patient’s own health data
Components:
• RDFized patient data
• Bio2RDF semantically annotated data
• SADI semantic web services to process the page and retrieve data
• SHARE automatic workflow composition
Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB
The Semantic Automated Discovery and Integration (SADI) framework makes it easy to create Semantic Web Services using OWL classes as service inputs and outputs
http://sadiframework.org
~700 bioinformatic services as of May 29, 2012
SADI enables discovery and access to Semantic Web Services
[Screenshot callouts: sources; contraindication; rationale; uses the patient’s data]
The SADI+SHARE workflow and reasoning were personalized to YOUR medical data
so how do we get at the supporting evidence?
HyQue
HyQue is the Hypothesis query and evaluation system
• A platform for knowledge discovery
• Facilitates hypothesis formulation and evaluation
• Leverages Semantic Web technologies to provide access to facts, expert knowledge and web services
• Conforms to a simplified event-based model
• Supports evaluation against positive and negative findings
• Transparent and reproducible evidence prioritization
• Provenance across all elements of hypothesis testing
  – trace a hypothesis to its evaluation, including the data and rules used
HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
HyQue Architecture
[Architecture diagram; labelled components include Services and Ontologies]
Event-based data model
HyQue events denote a phenomenon involving two objects: ‘agent’ and ‘target’. In addition, we can specify the location of this event (e.g. located in nucleus, or under some genetic background).
Event
  ‘has agent’ agent
  ‘has target’ target
  ‘is located in’ location
  ‘is negated’ boolean
Currently supported events
1. protein-protein binding
2. protein-nucleic acid binding
3. molecular activation
4. molecular inhibition
5. gene induction
6. gene repression
7. transport
HyQue domain rules CALCULATE a quantitative measure of evidence for an event
‘induce’ rule (maximum score: 5):
– Is event negated?
  • If yes, subtract 2
– Is event of type ‘induce’?
  • If yes, add 1; if no, subtract 1
– Is agent of type ‘protein’ or ‘RNA’?
  • If yes, add 1; if type ‘gene’, subtract 1
– Is target of type ‘gene’?
  • If yes, add 1; if no, subtract 1
– Does agent have known ‘transcription factor activity’?
  • If yes, add 1
– Is event located in the ‘nucleus’?
  • If yes, add 1; if no, subtract 1
Ontology terms referenced on the slide: GO:0010628, CHEBI:36080, SO:0000236, GO:0003700, GO:0005634
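The ‘induce’ rule can be read as a small additive scoring function. The sketch below transcribes the six checks into plain Python as an illustration only: HyQue itself expresses its rules in SPIN over RDF, and the `Event` class, its field names and the string type labels here are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for a HyQue event and the facts retrieved about it."""
    event_type: str              # e.g. "induce"
    agent_type: str              # e.g. "protein", "RNA", "gene"
    target_type: str             # e.g. "gene"
    agent_has_tf_activity: bool  # agent has known transcription factor activity
    location: str                # e.g. "nucleus"
    negated: bool = False


def score_induce(event: Event) -> int:
    """Apply the 'induce' rule from the slide; the maximum score is 5."""
    score = 0
    if event.negated:
        score -= 2
    score += 1 if event.event_type == "induce" else -1
    if event.agent_type in ("protein", "RNA"):
        score += 1
    elif event.agent_type == "gene":
        score -= 1
    score += 1 if event.target_type == "gene" else -1
    if event.agent_has_tf_activity:
        score += 1
    score += 1 if event.location == "nucleus" else -1
    return score


# A protein transcription factor inducing a gene in the nucleus.
supported = Event("induce", "protein", "gene", True, "nucleus")
print(score_induce(supported))
```

An event that satisfies every check and is not negated reaches the stated maximum of 5; negating the same event lowers the score by 2.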
Combination of system and domain rules to retrieve and score data, and add new triples
Event - induction
:e1 a go:0010628 ;
    hyque:agent sgd:Gal4p ;
    hyque:target sgd:GAL1 ;
    hyque:is_negated "0" .
SPIN induction rule
Customization of rules/data sources will generate different evidence-based evaluations
Reproducible eScience: LOD for Hypothesis, Rules, Data and Evaluation
A digital infrastructure for the future of biomedicine
• Semantic Web technologies offer a powerful integrative platform across facts, expert knowledge and services
• The ability to publish, link to, retrieve, check the consistency of, and query biomedical knowledge will yield an explosion of health-related applications.
• By formalizing biomedical data, we can integrate molecular to clinical data and gain insight into how living systems respond to chemical agents, with implications for drug discovery and the delivery of health care.
Acknowledgements
Bio2RDF: Peter Ansell, Francois Belleau, Alison Callahan, Jacques Corbeil, Jose Cruz-Toledo, Alex De Leon, Steve Etlinger, James Hogan, Nichealla Keath, Jean Morissette, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault and Paul Roe
HyQue: Alison Callahan
Lab: Glen Newton (NLP), Gordana Lenert (PGx), Dana Klassen @ DERI, Leonid Chepelev @ UoO, Natalia Villanueva-Rosales @ UoTexas, Xueying Chen @ IBM China, Mykola Konyk
OWL-Based Data Integration: Robert Hoehndorf, John Gennari, Sarah Wimalaratne, Bernard de Bono, Daniel Cook, and George Gkoutos
SADI: Christopher Baker, Melanie Courtot, Jose Cruz-Toledo, Steve Etlinger, Nichealla Keath, Artjom Klein, Luke McCarthy, Silvane Paixao, Ben Vandervalk, Natalia Villanueva-Rosales, Mark Wilkinson
W3C HCLS: J Luciano, B Andersson, C Batchelor, O Bodenreider, T Clark, C Denney, C Domarew, T Gambet, L Harland, A Jentzsch, V Kashyap, P Kos, J Kozlovsky, T Lebo, SM Marshall, JP McCusker, DL McGuinness, C Ogbuji, E Pichler, R Powers, E Prud'hommeaux, M Samwald, L Schriml, PJ Tonellato, PL Whetzel, J Zhao, S Stephens, C Denney, J Luciano, J McGurk, Lynn Schriml, and Peter J. Tonellato.
dumontierlab.com [email protected]
Website: http://dumontierlab.com Presentations: http://slideshare.com/micheldumontier