Reconciling annotations made to different terminologies
The role of terminology integration resources
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Second workshop on Semantic Enrichment of the Scientific Literature
European Bioinformatics Institute, Hinxton, UK
March 31, 2009
Lister Hill National Center for Biomedical Communications 2
Introduction
• Ontologies* are used as a source of vocabulary
  • Terms form a terminology
    – Concept names
    – Labels
• Lists of terms are a key resource to most named entity recognition (NER) systems
• Entity normalization is performed in reference to a given ontology and resolves terms into identifiers from this ontology
* Ontologies, defined broadly, include thesauri, controlled vocabularies, terminologies, coding systems, etc.
Example
Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.
[Uppal, S., and A. P. Coatesworth. “Neurofibromatosis Type 2.” Int J Clin Pract, 57, no. 8, 2003, pp. 698-703.] PMID:14627181
Example: Annotation with MeSH
Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.
D016518 - Neurofibromatosis 2
D025581 - Neurofibromin 2
The issue
• Many biomedical ontologies can be used
  • As a source of vocabulary
  • As a reference for annotation purposes
• No universal identifiers for biomedical entities
• Many NER systems are tied to particular (sets of) ontologies
• Need for reconciling annotations made by different NER systems to different ontologies
Example: Annotation with OMIM
Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.
#101000 - NEUROFIBROMATOSIS, TYPE II; NF2
*607379 - NEUROFIBROMIN 2; NF2
Example: Reconciling annotations
Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.
• Disease
  – #101000 - NEUROFIBROMATOSIS, TYPE II; NF2 (OMIM)
  – D016518 - Neurofibromatosis 2 (MeSH)
• Gene product
  – *607379 - NEUROFIBROMIN 2; NF2 (OMIM)
  – D025581 - Neurofibromin 2 (MeSH)
Related fields 1
• Ontology alignment
  • Lexical techniques: string match (exact / normalized)
  • Structural techniques: shared relata
  • Semantic techniques: e.g., disjointness between domains
• Mappings
  • SNOMED CT → ICD-9-CM
    – Unidirectional
    – For coding purposes
  • MeSH → ICD-9-CM
    – Unidirectional
    – For MEDLINE queries
Related fields 2
• Terminology integration resources
  • UMLS (http://www.nlm.nih.gov/research/umls/)
    – Automatic pre-processing (normalized lexical match)
    – Manual curation by the UMLS editors
  • RxNorm (http://www.nlm.nih.gov/research/umls/rxnorm/)
    – Narrow domain
    – Model for normalizing drug names
  • NCBO BioPortal (http://www.bioontology.org/tools/portal/bioportal.html)
    – Mappings contributed by users
    – Lexical matches
Approaches to reconciling annotations
• Identifying equivalent concepts
  • Lexical approaches
    – Lexical similarity
  • Semantic approaches
    – Synonymy
  • Explicit mapping relations
• Identifying similar concepts
  • Semantic similarity
• Annotation reconciliation strategy
Exact string match
• Exact string match
  • Many false negatives
    – Missing synonyms
    – Missing lexical variants
  • False positives
    – Ambiguity
Edit distance / Stemming
• Edit distance
  • Allows minor differences
  • Threshold for similarity
  • False negatives
    – Complex lexical variants
  • False positives
    – Ambiguity (same as with exact match)
    – Additional ambiguity (few cases)
      – Ascending Palatine Artery vs. Descending palatine artery
      – Right kidney vs. Kidney
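A minimal sketch of thresholded edit-distance matching, to make the trade-off concrete. The function names, the rescaling to [0, 1], and the 0.85 threshold are illustrative assumptions, not values from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], computed on lowercased strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Accept the pair when the similarity reaches the (assumed) threshold."""
    return similarity(a, b) >= threshold
```

Under these assumptions, "Hodgkin's disease" and "Hodgkins disease" are one edit apart and match, but "Ascending Palatine Artery" and "Descending palatine artery" are only two edits apart and also pass the threshold, which illustrates the additional ambiguity listed on this slide.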
Normalization
• Linguistic model of variation in biomedical terminologies
• Algorithms for managing variation
• Implemented in UMLS Lexical tools (lvg, norm)
[McCray et al., AMIA, 1994]
Normalization: Principles
Input: Hodgkin’s diseases, NOS
  → Remove genitive: Hodgkin diseases, NOS
  → Remove stop words: Hodgkin diseases,
  → Lowercase: hodgkin diseases,
  → Strip punctuation: hodgkin diseases
  → Uninflect: hodgkin disease
  → Sort words: disease hodgkin
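The steps above can be sketched as a small pipeline. This is not the UMLS `norm` program: the stop-word list and the plural-stripping rule below are simplified assumptions standing in for the lexicon-based processing of the real tools, and `normalize` and `uninflect` are hypothetical names.

```python
import re
import string

# Assumed stop-word list for illustration only.
STOP_WORDS = {"nos", "of", "and", "with", "for", "to", "in", "by", "on", "the"}


def uninflect(word: str) -> str:
    """Very naive singularization: drop a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def normalize(term: str) -> str:
    # 1. Remove genitive ('s or ’s at the end of a word).
    term = re.sub(r"['’]s\b", "", term, flags=re.IGNORECASE)
    # 2. Remove stop words (compared without punctuation, case-insensitively).
    words = [w for w in term.split()
             if w.strip(string.punctuation).lower() not in STOP_WORDS]
    # 3. Lowercase.
    term = " ".join(words).lower()
    # 4. Strip punctuation (replaced by spaces so tokens stay separate).
    term = re.sub(r"[^\w\s]", " ", term)
    # 5. Uninflect each word, then 6. sort the words.
    return " ".join(sorted(uninflect(w) for w in term.split()))


print(normalize("Hodgkin’s diseases, NOS"))  # disease hodgkin
```

With this sketch, "3’-5’ Exonuclease" and "5’-3’ Exonuclease" both come out as "3 5 exonuclease", showing how word sorting and punctuation stripping can merge terms that should stay distinct.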
Normalization: Example
Hodgkin Disease
HODGKINS DISEASE
Hodgkin's Disease
Disease, Hodgkin's
Hodgkin's, disease
HODGKIN'S DISEASE
Hodgkin's disease
Hodgkins Disease
Hodgkin's disease NOS
Hodgkin's disease, NOS
Disease, Hodgkins
Diseases, Hodgkins
Hodgkins Diseases
Hodgkins disease
hodgkin's disease
Disease, Hodgkin
  → normalize → disease hodgkin
Normalization
• Normalization
  • Allows controlled flexibility
  • False negatives
    – Derivational morphology
      – Tibial Inter Condylar Eminence vs. Intercondylar eminence of tibia
    – Ellipsis
      – Splenius Cervicis Muscle vs. Splenius Cervicis
  • Some false positives
    – Ambiguity (same as with exact match)
    – Additional ambiguity (few cases)
      – 3’-5’ Exonuclease vs. 5’-3’ Exonuclease, both normalized into 3 5 Exonuclease
Synonymy relations
• Concept-orientation
  • UMLS Metathesaurus
  • WordNet (synsets)
• Sets of names recognized as synonyms
  • In a given source vocabulary
  • By the UMLS Metathesaurus editors
• Assigned the same concept identifier
Synonymy relations: Example
C0027832
Bilateral Acoustic Neurofibromatosis
ACOUSTIC NEURINOMA, BILATERAL
CAN
Neurofibromatosis, Type II
NF2
BANF
NEUROFIBROMATOSIS, CENTRAL TYPE
Neurofibromatosis 2
Neurofibromatosis, Central NF2
Neurofibromatosis, Central, NF 2
Neurofibromatosis, Type 2
Neurofibromatosis Type 2
NF2 (Neurofibromatosis 2)
Neurofibromatosis II
Neurofibromatosis Type II
[…]
Explicit mapping relations
• Mapping relations
  • Created for a given purpose
    – SNOMED CT → ICD-9-CM, for billing purposes
    – MeSH → ICD-9-CM, for querying MEDLINE from ICD codes
  • Recorded in the UMLS Metathesaurus
  • Additional mappings in the BioPortal
• Not synonymy, but probably close enough for the purpose of reconciling annotations
Explicit mapping relations: Examples
[Diagram: two concepts linked by explicit relations]
• Neurofibromatosis 2 (C0027832)
• Schwannoma, Acoustic, Bilateral (C1136043)
• Relations shown between them: alias of (OMIM); narrower than (MeSH)
Semantic similarity
• Based on the notion of a lowest common subsumer
• Edge counting
  • Intuitive and simple
  • Requires density to be homogeneous in the taxonomy
  • Several methods (Hirst & St-Onge, Leacock & Chodorow, Wu & Palmer)
• Information-theoretic metrics
  • Grounded in information theory
  • Compensates for heterogeneity in density in the taxonomy
  • Developed on WordNet, applied to GO, MeSH, UMLS
  • Requires frequencies in a corpus
  • Several methods (Resnik, Lin, Jiang & Conrath)
[Pedersen et al., JBI, 2007]
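To illustrate the lowest-common-subsumer idea, here is a sketch of one edge-based measure, Wu & Palmer, computed as 2 × depth(LCS) / (depth(a) + depth(b)) with the root at depth 1. The taxonomy is a made-up toy tree, not the actual hierarchy of MeSH or any other source, and all function names are hypothetical.

```python
# Toy single-parent is-a taxonomy (child -> parent); labels are illustrative.
PARENT = {
    "neoplasm": "disease",
    "nerve sheath neoplasm": "neoplasm",
    "neurofibromatosis 1": "nerve sheath neoplasm",
    "neurofibromatosis 2": "nerve sheath neoplasm",
    "endocrine disease": "disease",
    "addison disease": "endocrine disease",
}


def path_to_root(concept: str) -> list:
    """Return the concept followed by all of its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept: str) -> int:
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a: str, b: str) -> str:
    """Deepest node that is an ancestor of (or equal to) both concepts."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a: str, b: str) -> float:
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree the two neurofibromatosis types score 0.75 (their subsumer sits at depth 3, both concepts at depth 4), while neurofibromatosis 2 and Addison disease only share the root and score 2/7. The information-theoretic measures would additionally need concept frequencies from a corpus, which this sketch does not attempt.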
A strategy for combining approaches
• Requirements
  • Primum non nocere
    – Most reliable, least aggressive methods first
  • Reuse
    – Existing synonymy relations
    – Existing methods for managing lexical variation
    – Existing mapping relations
    – Existing semantic similarity measures
• Example of strategy for mapping between terminologies [Fung et al., Medinfo, 2007]
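One way to read the "most reliable first" requirement is as a cascade that stops at the first method accepting a pair. The sketch below is an assumption about how such a cascade could look, not the strategy of Fung et al.; the ordering, the tiny synonym table (names taken from the C0027832 example), and the crude normalizer are all illustrative.

```python
from typing import Callable, Optional, Sequence, Tuple

Matcher = Callable[[str, str], bool]

# Toy synonym table (name -> concept identifier); a real system would
# query a terminology integration resource instead.
NAME_TO_CUI = {
    "Neurofibromatosis 2": "C0027832",
    "Neurofibromatosis, Type II": "C0027832",
    "Bilateral Acoustic Neurofibromatosis": "C0027832",
}


def same_concept(a: str, b: str) -> bool:
    """Existing synonymy: both names are known and share a concept identifier."""
    cui_a, cui_b = NAME_TO_CUI.get(a), NAME_TO_CUI.get(b)
    return cui_a is not None and cui_a == cui_b


def crude_normalize(term: str) -> str:
    """Lowercase, replace punctuation by spaces, sort words (simplified)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in term.lower())
    return " ".join(sorted(cleaned.split()))


def normalized_match(a: str, b: str) -> bool:
    return crude_normalize(a) == crude_normalize(b)


# Assumed ordering: curated synonymy before lexical normalization.
STRATEGIES: Sequence[Tuple[str, Matcher]] = (
    ("synonymy", same_concept),
    ("normalized lexical match", normalized_match),
)


def reconcile(a: str, b: str,
              strategies: Sequence[Tuple[str, Matcher]] = STRATEGIES) -> Optional[str]:
    """Return the name of the first strategy that accepts the pair, else None."""
    for name, matcher in strategies:
        if matcher(a, b):
            return name
    return None
```

Returning the name of the method that fired keeps the provenance of each reconciliation, so that matches from more aggressive methods can be reviewed or weighted differently.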
Terminology integration resources
• Unified Medical Language System (UMLS)
  • National Library of Medicine
  • Started in 1986
  • Mostly clinical terminologies
  • Curated manually
• BioPortal
  • National Center for Biomedical Ontology (NCBO)
  • Started in 2006
  • Mostly biological terminologies (OBO)
  • No manual curation
Terminology integration resources
• Unified Medical Language System (UMLS)
  • National Library of Medicine
  • Started in 1986
«[…] the UMLS project is an effort to overcome two significant barriers to effective retrieval of machine-readable information.
• The first is the variety of ways the same concepts are expressed in different machine-readable sources and by different people.
• The second is the distribution of useful information among many disparate databases and systems.»
Source Vocabularies
• 148 source vocabularies
  • 17 languages
• Broad coverage of biomedicine
  • 9M names
  • 1.9M concepts
  • >10M relations
    – Including about 150,000 mapping relations
(2008AB)
UMLS Metathesaurus: Example
Concept: Addison’s Disease (C0001403)
Synonyms:
• ADRENAL INSUFFICIENCY (ADDISON'S DISEASE)
• ADRENOCORTICAL INSUFFICIENCY, PRIMARY FAILURE
• Hypoadrenalisms, Primary
• Melasma addisonii
• Primary adrenal deficiency
• Asthenia pigmentosa
• Bronzed disease
• Insufficiency, adrenal primary
• Primary adrenocortical insufficiency
• Addison's, disease
Names in other languages:
• Maladie d'Addison - French
• Addison-Krankheit - German
• Morbo di Addison - Italian
• Doença de Addison - Portuguese
• АДДИСОНОВА БОЛЕЗНЬ - Russian
• アジソン病 - Japanese
Definition: An adrenal disease characterized by the progressive destruction of the adrenal cortex, resulting in insufficient production of aldosterone and hydrocortisone. Clinical symptoms include anorexia; nausea; weight loss; muscle weakness; and hyperpigmentation of the skin due to increase in circulating levels of ACTH precursor hormone which stimulates melanocytes.
Semantic type: Disease or Syndrome
Sources: SNOMED CT, SNOMED Intl, MeSH, MedDRA, …
UMLS: Integrating subdomains
[Diagram: UMLS at the center, linking one terminology per subdomain]
• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED CT
• Anatomy: FMA
• Other subdomains: …
Trans-namespace integration
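[Diagram: UMLS at the center, linking GO (genome annotations), NCBI Taxonomy (model organisms), OMIM (genetic knowledge bases), FMA (anatomy), MeSH (biomedical literature), SNOMED CT (clinical repositories) and other subdomains; one concept is traced across namespaces]
• MeSH: Addison Disease (D000224)
• SNOMED CT: Addison's disease (363732003)
• UMLS: C0001403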
Challenges: Coverage
• Selecting terminology integration resources
  • UMLS: more clinical, manual curation
  • BioPortal: more biological, no curation
  • Both: no integration between UMLS and BioPortal specific resources
• Need for additional resources
  • Subdomains
    – Genes (e.g., Entrez Gene, GenBank/EMBL/DDBJ)
    – Proteins (e.g., UniProt)
  • Specific constraints
    – Do not normalize (case matters, e.g., NF2 vs. Nf2)
    – Dictionary approach is insufficient for NER (e.g., genes)
Challenges: Integration
• Annotations to a resource not integrated
  • e.g., BRENDA vs. FMA for tissue types
  • Automatic mapping
    – Controlling for semantic type
  • Manual curation in case of ambiguity or failure
• Annotations to DL ontologies in which labels are not the primary focus (e.g., GALEN)
Challenges: Others
• How to handle ambiguity?
  • A given system returns several (alternative) annotations for a given chunk of text
• No universal thresholds
  • Edit distance
  • Semantic similarity
• Lack of services for annotation reconciliation
  • findCui(code, source) → CUI
  • hasMapping(cui1, cui2) → Boolean
  • semSim(cui1, cui2) → semantic similarity score [0-1]
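The three service signatures could be prototyped in memory as follows. This is only a sketch: the class name, the source labels "MeSH" and "SNOMED CT", the symmetric treatment of mappings, and the pluggable similarity function are assumptions, and the data are limited to identifiers that appear in the examples of this talk.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


class InMemoryReconciliationService:
    """Toy stand-in for the missing services; backed by plain dictionaries."""

    def __init__(self,
                 code_to_cui: Dict[Tuple[str, str], str],
                 mappings: Iterable[Tuple[str, str]],
                 similarity: Optional[Callable[[str, str], float]] = None):
        self._code_to_cui = dict(code_to_cui)
        # Assumption: mappings are looked up without regard to direction.
        self._mappings = {frozenset(pair) for pair in mappings}
        self._similarity = similarity

    def find_cui(self, code: str, source: str) -> Optional[str]:
        """findCui(code, source) -> CUI, or None when the code is unknown."""
        return self._code_to_cui.get((source, code))

    def has_mapping(self, cui1: str, cui2: str) -> bool:
        """hasMapping(cui1, cui2) -> Boolean."""
        return frozenset((cui1, cui2)) in self._mappings

    def sem_sim(self, cui1: str, cui2: str) -> float:
        """semSim(cui1, cui2) -> score clamped to [0, 1]."""
        if self._similarity is None:
            raise NotImplementedError("no similarity measure was supplied")
        return max(0.0, min(1.0, self._similarity(cui1, cui2)))


service = InMemoryReconciliationService(
    code_to_cui={
        ("MeSH", "D000224"): "C0001403",         # Addison Disease
        ("SNOMED CT", "363732003"): "C0001403",  # Addison's disease
    },
    mappings=[("C0027832", "C1136043")],  # pair from the mapping example
)


def same_entity(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Two (code, source) annotations agree if they share a CUI or are mapped."""
    cui_a, cui_b = service.find_cui(*a), service.find_cui(*b)
    if cui_a is None or cui_b is None:
        return False
    return cui_a == cui_b or service.has_mapping(cui_a, cui_b)
```

For instance, `same_entity(("D000224", "MeSH"), ("363732003", "SNOMED CT"))` is true because both codes resolve to C0001403.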
Conclusions
• Need for reconciling annotations made to different reference ontologies for integration and comparison purposes
• Terminology integration resources provide
  • Synonymy relations
  • Explicit mapping relations
  • Support for computing semantic similarity
• Many unsolved issues remain
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA