14
Olivier Bodenreider Olivier Bodenreider Lister Hill National Center Lister Hill National Center for Biomedical Communications for Biomedical Communications Bethesda, Maryland Bethesda, Maryland - - USA USA Issues and perspectives for medical text indexing Medical Informatics Europe 2003 St-Malo, France May 6, 2003 - Workshop W20

Issues and perspectives for medical text indexing

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Issues and perspectives for medical text indexing

Olivier BodenreiderOlivier Bodenreider

Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA

Issues and perspectivesfor medical text indexing

Medical Informatics Europe 2003St-Malo, France

May 6, 2003 - Workshop W20

Page 2: Issues and perspectives for medical text indexing

The Indexing Initiative

Page 3: Issues and perspectives for medical text indexing

3

Motivation at NLMMotivation at NLM

◆◆ Increasing volume of biomedical literatureIncreasing volume of biomedical literature●● MEDLINE has grown from about 7 million citations in MEDLINE has grown from about 7 million citations in

1996 to over 12 million now1996 to over 12 million now

●● The number of journals indexed has grown from about The number of journals indexed has grown from about 3,750 in 1996 to 4,600 now3,750 in 1996 to 4,600 now

◆◆ Increasing availability of full textIncreasing availability of full text

◆◆ Limited resourcesLimited resources●● Especially qualified indexersEspecially qualified indexers

Page 4: Issues and perspectives for medical text indexing

4

The IND ProjectThe IND Project

◆◆ ObjectivesObjectives●● Investigate automatic and semiautomatic indexing Investigate automatic and semiautomatic indexing

methodsmethods

●● Producing equal or better retrievalProducing equal or better retrieval

◆◆ Initially, an independent collection of projects Initially, an independent collection of projects addressingaddressing●● Indexing methodsIndexing methods

●● EvaluationEvaluation

●● PolicyPolicy

http://ii.nlm.nih.gov

[Aronson & al., AMIA, 2000]

Page 5: Issues and perspectives for medical text indexing

5

Current statusCurrent status

◆◆ SemiSemi--automatic indexingautomatic indexing●● New citations are indexed every nightNew citations are indexed every night

●● Suggested descriptors integrated in the environment Suggested descriptors integrated in the environment used by the indexersused by the indexers

●● Ongoing evaluationOngoing evaluation

◆◆ Automatic indexingAutomatic indexing●● Collections not otherwise indexedCollections not otherwise indexed

●● Descriptors not displayedDescriptors not displayed

Page 6: Issues and perspectives for medical text indexing

6

OverviewOverview Title + Abstract

Ordered list of MeSH Main Headings

MeSH Main Headings

UMLS concepts

Clustering

Restrict to MeSH

TrigramPhrase

Matching

Rel. Citations

PubMedRelated

Citations

ExtractMeSHdescr.

Phrasex

MetaMap

Noun Phrases

Page 7: Issues and perspectives for medical text indexing

Three issues

Page 8: Issues and perspectives for medical text indexing

8

Three issuesThree issues Title + Abstract

Ordered list of MeSH Main Headings

MeSH Main Headings

UMLS concepts

Clustering

Restrict to MeSH

TrigramPhrase

Matching

Rel. Citations

PubMedRelated

Citations

ExtractMeSHdescr.

Phrasex

MetaMap

Noun Phrases

Word-senseambiguity

Terminologyvs. ontology

Evaluation

Page 9: Issues and perspectives for medical text indexing

9

Word sense ambiguityWord sense ambiguity

◆◆ Inherent to natural language processing (NLP)Inherent to natural language processing (NLP)

◆◆ Active research fieldActive research field

◆◆ Compounded in the biomedical domainCompounded in the biomedical domain●● Acronyms / abbreviationsAcronyms / abbreviations

●● Gene / gene product namesGene / gene product names

●● Terms not fully specifiedTerms not fully specified

Page 10: Issues and perspectives for medical text indexing

10

Terminology vs. ontologyTerminology vs. ontology

◆◆ Hierarchies often taskHierarchies often task--drivendrivenrather than based on principlesrather than based on principles

◆◆ Usually suitable for information retrievalUsually suitable for information retrieval●● Better recallBetter recall

●● Precision may not be crucialPrecision may not be crucial

◆◆ Not necessarily suitable for reasoningNot necessarily suitable for reasoning

Page 11: Issues and perspectives for medical text indexing

11

EvaluationEvaluation

◆◆ IndexIndex--basedbased●● Gold standardGold standard

■■ But no ground truthBut no ground truth

●● Similarity measuresSimilarity measures■■ But multiple perspectives possibleBut multiple perspectives possible

◆◆ RetrievalRetrieval--basedbased●● Requires test collectionsRequires test collections

◆◆ SystemSystem--vs. uservs. user--centeredcentered

Page 12: Issues and perspectives for medical text indexing

Perspectives

Page 13: Issues and perspectives for medical text indexing

13

PerspectivesPerspectives

◆◆ RequirementsRequirements●● Better ontologiesBetter ontologies

●● Better identification of specialized entitiesBetter identification of specialized entities(e.g., gene names)(e.g., gene names)

●● Better wordBetter word--sense disambiguation techniquessense disambiguation techniques

◆◆ Tremendous interestTremendous interest(through data mining and knowledge discovery)(through data mining and knowledge discovery)●● In the medical informatics communityIn the medical informatics community

●● And beyond (KDD cup 02, genomic track at TREC 03)And beyond (KDD cup 02, genomic track at TREC 03)

Page 14: Issues and perspectives for medical text indexing

Contact:Contact: olivierolivier@@nlmnlm..nihnih..govgovWeb:Web: etbsun2.etbsun2.nlmnlm..nihnih..govgov:8000:8000

Olivier BodenreiderOlivier Bodenreider

Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA

MedicalOntologyResearch