Ontologies and data integration in biomedicine Success stories and challenging issues Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Data Integration in the Life Sciences Evry, France June 26, 2008


Ontologies and data integration in biomedicine
Success stories and challenging issues

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Data Integration in the Life Sciences
Evry, France
June 26, 2008

Why integrate data?

Lister Hill National Center for Biomedical Communications 3

Integration yields nice pictures! [Yildirim, 2007]

Motivation: Translational research

“Bench to Bedside”
Integration of clinical and research activities and results
Supported by research programs
  NIH Roadmap
  Clinical and Translational Science Awards (CTSA)
Requires the effective integration and exchange of information between
  Basic research
  Clinical research

Translational research: NIH Roadmap

Motivation: Translational research

[Figure: Basic Research; Clinical Research and Practice]

Why ontologies?

Terminology and translational research

[Figure: cancer basic research (NCI Thesaurus) and EHR data on cancer patients (SNOMED CT)]

Approaches to data integration

Warehousing
  Sources to be integrated are transformed into a common format and converted to a common vocabulary
Mediation
  Local schema (of the sources)
  Global schema (in reference to which the queries are made)

Ontologies and warehousing

Role
  Provide a conceptualization of the domain
    Help define the schema
    Information model vs. ontology
  Provide value sets for data elements
  Enable standardization and sharing of data
Examples
  Annotations to the Gene Ontology
  Repositories for translational research (CTSA)
  Clinical information systems

Ontologies and mediation

Role
  Reference for defining the global schema
  Map between local and global schemas
Examples
  TAMBIS
  BioMediator
  OntoFusion
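The mediation idea (queries posed against a global schema, answered from sources that keep their local schemas) can be illustrated with a minimal sketch. Everything here is hypothetical: the two sources, their field names, and the `LOCAL_TO_GLOBAL` mappings are invented for illustration and do not reflect how TAMBIS, BioMediator, or OntoFusion are implemented.

```python
from typing import Callable, Dict, List

# Hypothetical local sources, each with its own schema (field names invented).
SOURCE_A: List[Dict[str, str]] = [{"gene_symbol": "ALDH2", "organism": "human"}]
SOURCE_B: List[Dict[str, str]] = [{"locus": "Aldh2", "species": "mouse"}]

# Mappings from each local schema to the global schema (gene, taxon).
# In an ontology-driven mediator these would be defined in reference to the ontology.
LOCAL_TO_GLOBAL: Dict[str, Dict[str, str]] = {
    "A": {"gene_symbol": "gene", "organism": "taxon"},
    "B": {"locus": "gene", "species": "taxon"},
}


def to_global(record: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename the fields of a local record to their global-schema names."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def query(predicate: Callable[[Dict[str, str]], bool]) -> List[Dict[str, str]]:
    """Evaluate a global-schema predicate against every source at query time."""
    results = []
    for name, rows in (("A", SOURCE_A), ("B", SOURCE_B)):
        for row in rows:
            global_row = to_global(row, LOCAL_TO_GLOBAL[name])
            if predicate(global_row):
                results.append({**global_row, "source": name})
    return results


# The query only mentions global-schema fields; sources are never copied or converted.
hits = query(lambda r: r["gene"].upper() == "ALDH2")
```

Unlike warehousing, nothing is transformed ahead of time: the sources stay as they are, and only the mappings need maintenance when a local schema changes.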

Success stories

Gene Ontology
http://www.geneontology.org/

Annotating data

Gene Ontology
  Functional annotation of gene products in several dozen model organisms
Various communities use the same controlled vocabularies
  Enabling comparisons across model organisms
Annotations
  Assigned manually by curators
  Inferred automatically (e.g., from sequence similarity)

GO Annotations for Aldh2 (mouse)
http://www.informatics.jax.org/

GO Annotations for ALD4 (yeast)
http://db.yeastgenome.org/

GO Annotations for ALDH2 (human)
http://www.ebi.ac.uk/GOA/

Integration applications

Based on shared annotations
  Enrichment analysis (within/across species)
  Clustering (co-clustering with gene expression data)
Based on the structure of GO
  Closely related annotations
  Semantic similarity
Based on associations between gene products and annotations
  Leveraging reasoning

[Bodenreider, PSB 2005]

[Sahoo, Medinfo 2007]

[Lord, PSB 2003]
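Enrichment analysis on shared annotations is commonly computed with a hypergeometric test: given a gene list, is a GO term annotated to more of its genes than chance would predict? The sketch below is one standard formulation, not necessarily the method used in the cited studies, and the counts in the example are invented.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       sample: int, annotated_in_sample: int) -> float:
    """Upper-tail hypergeometric probability P(X >= annotated_in_sample).

    population          -- number of genes in the background set
    annotated           -- background genes annotated to the GO term
    sample              -- size of the gene list under study
    annotated_in_sample -- genes in the list annotated to the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Toy example: 20 background genes, 5 annotated to the term,
# a list of 5 genes of which 3 carry the annotation.
p = enrichment_p_value(population=20, annotated=5, sample=5, annotated_in_sample=3)
```

In practice one p-value is computed per GO term, so a multiple-testing correction is needed before interpreting the results.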


Integration: Entrez Gene + GO

[Figure: an Entrez Gene record (gene name, sequence, interactions; links to GO, PubMed and OMIM) integrated with the Gene Ontology, connecting glycosyltransferase to congenital muscular dystrophy]

[Sahoo, Medinfo 2007]

From glycosyltransferase to congenital muscular dystrophy

[Figure: graph linking a gene to its molecular function and its associated phenotype]
EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase)
EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
GO:0008375 (acetylglucosaminyltransferase) isa GO:0016757 (glycosyltransferase), via GO:0008194 and GO:0016758
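The path from glycosyltransferase to congenital muscular dystrophy can be sketched as a query over a small set of triples. The identifiers come from the slide; the exact `isa` edges between the GO terms are an assumption reconstructed from the figure, and the traversal code is illustrative rather than the reasoning machinery used in the cited work.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Identifiers from the slide; the precise isa edges are an assumption.
TRIPLES: List[Triple] = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
]


def ancestors(term: str, triples: List[Triple]) -> Set[str]:
    """Collect every term reachable from `term` by following isa edges."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "isa" and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


def genes_linking(function_class: str, triples: List[Triple]) -> List[Tuple[str, str]]:
    """Return (gene, phenotype) pairs for genes whose function is at or below function_class."""
    pairs = []
    for gene, predicate, function in triples:
        if predicate != "has_molecular_function":
            continue
        if function == function_class or function_class in ancestors(function, triples):
            for subject, predicate2, phenotype in triples:
                if subject == gene and predicate2 == "has_associated_phenotype":
                    pairs.append((gene, phenotype))
    return pairs


# Which genes with any glycosyltransferase activity have an associated phenotype?
links = genes_linking("GO:0016757", TRIPLES)
```

The gene is annotated only with the specific term GO:0008375; the link to the general class GO:0016757 exists only because the `isa` hierarchy is followed.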

Success stories

caBIG
http://cabig.nci.nih.gov/

Cancer Biomedical Informatics Grid

US National Cancer Institute
Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
Data and application services available on the grid
Supported by ontological resources

caBIG services

caArray
  Microarray data repository
caTissue
  Biospecimen repository
caFE (Cancer Function Express)
  Annotations on microarray data
caTRIP (Cancer Translational Research Informatics Platform)
  Integrates data services

Ontological resources

NCI Thesaurus
  Reference terminology for the cancer domain
  ~ 60,000 concepts
  OWL Lite
Cancer Data Standards Repository (caDSR)
  Metadata repository
  Used to bridge across UML models through Common Data Elements
  Links to concepts in ontologies

Success stories

Semantic Web for Health Care and Life Sciences
http://www.w3.org/2001/sw/hcls/

W3C Health Care and Life Sciences IG

Biomedical Semantic Web

Integration
  Data/Information
  E.g., translational research
Hypothesis generation
Knowledge discovery

[Ruttenberg, 2007]

HCLS mashup of biomedical sources

NeuronDB

BAMS

NC Annotations

Homologene

SWAN

Entrez Gene

Gene Ontology

Mammalian Phenotype

PDSPki

BrainPharm

AlzGene

Antibodies

PubChem

MeSH

Reactome

Allen Brain Atlas

Publications

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo


Shared identifiers: Example

GO

HCLS mashup

[Figure: sources in the mashup and the kinds of entities each contributes]
NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
Homologene: Genes, Species, Orthologies, Proofs
SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
PDSPki: Proteins, Chemicals, Neurotransmitters
BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
Antibodies: Genes, Antibodies
PubChem: Name, Structure, Properties, MeSH term
MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)


HCLS mashups

Based on RDF/OWL
Based on shared identifiers
  “Recombinant data” (E. Neumann)
Ontologies used in some cases
Support applications (SWAN, SenseLab, etc.)
Journal of Biomedical Informatics special issue on Semantic Bio-mashups (forthcoming)

Challenging issues

Bridges across ontologies


Trans-namespace integration

Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
ICD 10: Primary adrenocortical insufficiency (E27.1)

(Integrated) concept repositories

Unified Medical Language System
  http://umlsks.nlm.nih.gov
NCBO’s BioPortal
  http://www.bioontology.org/tools/portal/bioportal.html
caDSR
  http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
Open Biomedical Ontologies (OBO)
  http://obofoundry.org/

Integrating subdomains

[Figure: subdomains and their ontologies, integrated through the UMLS]
Biomedical literature: MeSH
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Clinical repositories: SNOMED CT
Anatomy: FMA
Other subdomains

Integrating subdomains

[Figure: the subdomains alone: biomedical literature, genome annotations, model organisms, genetic knowledge bases, clinical repositories, anatomy, other subdomains]

Trans-namespace integration

[Figure: the subdomain ontologies (MeSH, GO, NCBI Taxonomy, OMIM, SNOMED CT, FMA, others) integrated through the UMLS]
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
UMLS: C0001403
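The role of a shared concept identifier can be sketched with a small lookup table. The codes and the UMLS concept identifier C0001403 are taken from the slide; the table layout and the helper functions are hypothetical and far simpler than the actual UMLS distribution files.

```python
from typing import Dict, List, Optional, Tuple

Code = Tuple[str, str]  # (source vocabulary, code in that vocabulary)

# Codes from the slide, both attached to the same UMLS concept.
SOURCE_TO_CUI: Dict[Code, str] = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}


def same_concept(a: Code, b: Code, index: Dict[Code, str] = SOURCE_TO_CUI) -> bool:
    """Two codes denote the same entity if they share a concept identifier."""
    cui_a: Optional[str] = index.get(a)
    cui_b: Optional[str] = index.get(b)
    return cui_a is not None and cui_a == cui_b


def translate(code: Code, target_source: str,
              index: Dict[Code, str] = SOURCE_TO_CUI) -> List[str]:
    """Find codes in target_source that share a concept identifier with `code`."""
    cui = index.get(code)
    if cui is None:
        return []
    return [c for (source, c), value in index.items()
            if value == cui and source == target_source]


# Literature indexed with MeSH can be joined to clinical data coded in SNOMED CT.
snomed_codes = translate(("MeSH", "D000224"), "SNOMED CT")
```

Data annotated in either namespace can then be joined on the concept identifier rather than on the source-specific code.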


Mappings

Created manually (e.g., UMLS)
  Purpose
  Directionality
Created automatically (e.g., BioPortal)
  Lexically: ambiguity, normalization
  Semantically: lack of / incomplete formal definitions
Key to enabling semantic interoperability
Enabling resource for the Semantic Web
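Lexical mapping with normalization can be sketched as follows. The three terms are those of the Addison's disease example earlier in the talk; the `normalize` function is a deliberately crude stand-in (lowercasing, dropping possessives and punctuation, ignoring word order) and does not reproduce the normalization used by the UMLS or BioPortal.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# (source, code, label) triples from the Addison's disease example.
TERMS: List[Tuple[str, str, str]] = [
    ("MeSH", "D000224", "Addison Disease"),
    ("SNOMED CT", "363732003", "Addison's disease"),
    ("ICD 10", "E27.1", "Primary adrenocortical insufficiency"),
]


def normalize(term: str) -> str:
    """Crude normalization: case, possessives, punctuation, and word order."""
    term = term.lower()
    term = re.sub(r"'s\b", "", term)           # drop possessive endings
    term = re.sub(r"[^a-z0-9 ]+", " ", term)   # replace punctuation with spaces
    return " ".join(sorted(term.split()))      # make word order irrelevant


def lexical_mappings(terms: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group codes whose labels normalize to the same string."""
    buckets: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, code, label in terms:
        buckets[normalize(label)].append((source, code))
    return {key: codes for key, codes in buckets.items() if len(codes) > 1}


mappings = lexical_mappings(TERMS)
```

The MeSH and SNOMED CT terms are matched, but the ICD 10 term is not, since it shares no words with the other two. That is a limit of purely lexical mapping, as is ambiguity: identical strings with different meanings would be merged.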

Challenging issues

Permanent identifiers for biomedical entities

Identifying biomedical entities

Multiple identifiers for the same entity in different ontologies
Barrier to data integration in general
  Data annotated to different ontologies cannot “recombine”
  Need for mappings across ontologies
Barrier to data integration in the Semantic Web
  Multiple possible identifiers for the same entity
    Depending on the underlying representational scheme (URI vs. LSID)
    Depending on who creates the URI

Possible solutions

PURL http://purl.org
  One level of indirection between developers and users
  Independence from local constraints at the developer’s end
The institution creating a resource is also responsible for minting URIs
  E.g., URI for genes in Entrez Gene
Guidelines: “URI note”
  W3C Health Care and Life Sciences Interest Group

Challenging issues

Other issues

Availability

Many ontologies are freely available
The UMLS is freely available for research purposes
  Cost-free license required
  Licensing issues can be tricky
SNOMED CT is freely available in member countries of the IHTSDO
Being freely available
  Is a requirement for the Open Biomedical Ontologies (OBO)
  Is a de facto prerequisite for Semantic Web applications

Discoverability

Ontology repositories
  UMLS: 143 source vocabularies (biased towards healthcare applications)
  NCBO BioPortal: ~100 ontologies (biased towards biological applications)
  Limited overlap between the two repositories
Need for discovery services

Formalism

Several major formalisms
  Web Ontology Language (OWL) – NCI Thesaurus
  OBO format – most OBO ontologies
  UMLS Rich Release Format (RRF) – UMLS, RxNorm
Conversion mechanisms
  OBO to OWL
  LexGrid (import/export to LexGrid internal format)
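To give a feel for what an OBO-to-OWL conversion involves, the sketch below reads `[Term]` stanzas in the OBO tag-value format and emits subclass statements as plain tuples. It is a toy under stated assumptions: it handles only the `id`, `name`, and `is_a` tags, the sample stanza and its parent are illustrative, and real converters cover many more constructs and emit actual OWL.

```python
from typing import Dict, List, Tuple

# Illustrative stanza in OBO tag-value format; the is_a target is an assumption.
OBO_TEXT = """\
[Term]
id: GO:0008375
name: acetylglucosaminyltransferase activity
is_a: GO:0008194 ! UDP-glycosyltransferase activity
"""


def parse_obo_terms(text: str) -> List[Dict]:
    """Parse [Term] stanzas, keeping only the id, name, and is_a tags."""
    terms: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"is_a": []}
            terms.append(current)
        elif line.startswith("["):
            current = None  # other stanza types are ignored in this sketch
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop the trailing comment
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


def to_subclass_axioms(terms: List[Dict]) -> List[Tuple[str, str, str]]:
    """Express each is_a link as a (child, rdfs:subClassOf, parent) tuple."""
    return [(term["id"], "rdfs:subClassOf", parent)
            for term in terms if "id" in term
            for parent in term["is_a"]]


terms = parse_obo_terms(OBO_TEXT)
axioms = to_subclass_axioms(terms)
```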

Ontology integration

Post hoc integration, from the bottom up
  UMLS approach
  Integrates ontologies “as is”, including legacy ontologies
  Facilitates the integration of the corresponding datasets
Coordinated development of ontologies
  OBO Foundry approach
  Ensures consistency ab initio
  Excludes legacy ontologies

Quality

Quality assurance in ontologies is still imperfectly defined
  Difficult to define outside a use case or application
Several approaches to evaluating quality
  Collaboratively, by users (Web 2.0 approach)
    Marginal notes enabled by BioPortal
  Centrally, by experts
    OBO Foundry approach
Important factors besides quality
  Governance
  Installed base / Community of practice

Thinking outside the integration box

The Butte approach

Integrating genomic and clinical data

No genomic data available for most patients
No precise clinical data available associated with most genomic data (GWAS excepted)

[Figure, built up over several slides]
Genomic data: upregulated genes; diseases (extracted from text + MeSH terms)
Clinical data: coded discharge summaries; laboratory data

The Butte approach: Methods

Courtesy of David Chen, Butte Lab

The Butte approach: Results

Courtesy of David Chen, Butte Lab

The Butte approach

Extremely rough methods
  No pairing between genomic and clinical data
  Text mining
  Mapping between SNOMED CT and ICD 9-CM through UMLS
  Reuse of ICD 9-CM codes assigned for billing purposes
Extremely preliminary results
  Rediscovery more than discovery
Extremely promising nonetheless

The Butte approach: References

Dudley J, Butte AJ. "Enabling integrative genomic analysis of high-impact human diseases through text mining." Pac Symp Biocomput 2008; 580-91
Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. "Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation." Pac Symp Biocomput 2008; 243-54
Butte AJ. "Medicine. The ultimate model organism." Science 2008; 320(5874): 325-7

Conclusions

Ontologies are enabling resources for data integration
Standardization works
  Grass roots effort (GO)
  Regulatory context (ICD 9-CM)
Bridging across resources is crucial
  Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
Massive amounts of imperfect data integrated with rough methods might still be useful

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact:Web:

[email protected]