Biomedical Knowledge Repository and Semantic Medline
Olivier Bodenreider, M.D., Ph.D.
Thomas C. Rindflesch, Ph.D.
National Institute of Allergy and Infectious Diseases
June 25, 2007
Lister Hill National Center for Biomedical Communications

Context
Provide biomedical information to health care professionals and consumers
Exploit NLM resources
Maintain NLM’s cutting edge
Proposal overview
  Advanced Library Services
  Biomedical Knowledge Repository
Pilot projects

Why additional services?
Biomedical information is growing at an increasingly fast pace
  High-throughput approach to knowledge processing
Information retrieval is the starting point, not the end of the journey for the researcher
  Towards “computable” knowledge
Integration between literature and other resources is insufficient
  Adequate for navigation purposes
  Insufficient for knowledge processing

What additional services?
Refined information retrieval
  Indexing on relations in addition to concepts
  Find articles asserting that IL-13 inhibits COX-2
Multi-document summarization
  Extract and visualize facts from the literature
  Summarize the top 300 papers on panic disorder
Question answering
  Clinical and biological questions
  What drugs interact with imipramine?
Knowledge discovery
  Reasoning with facts from heterogeneous resources
  From MEDLINE and UMLS together
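
Indexing on relations in addition to concepts can be illustrated with a small sketch. The document identifiers and the second predication below are hypothetical placeholders, not data from MEDLINE. A concept-only index returns every document that mentions both concepts, whereas a relation index returns only those asserting the requested relation.

```python
from collections import defaultdict

# Hypothetical predications already extracted from citations.
# The document identifiers are placeholders, not real PMIDs.
predications = [
    ("doc-1", ("IL-13", "inhibits", "COX-2")),
    ("doc-2", ("IL-13", "stimulates", "COX-2")),
    ("doc-3", ("IL-13", "inhibits", "COX-2")),
]

concept_index = defaultdict(set)   # concept -> documents mentioning it
relation_index = defaultdict(set)  # (subject, predicate, object) -> documents asserting it

for doc_id, (subj, pred, obj) in predications:
    concept_index[subj].add(doc_id)
    concept_index[obj].add(doc_id)
    relation_index[(subj, pred, obj)].add(doc_id)

# Concept-based retrieval: documents mentioning both concepts.
by_concepts = concept_index["IL-13"] & concept_index["COX-2"]
# Relation-based retrieval: documents asserting the specific relation.
by_relation = relation_index[("IL-13", "inhibits", "COX-2")]

print(sorted(by_concepts))
print(sorted(by_relation))
```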

Normalized and integrated knowledge
Normalized knowledge
  Common format
  Common identification mechanism
Integrated knowledge
  Single repository
  Seamless environment
  Phenotype and genotype information together
Biomedical Knowledge Repository

Sources of knowledge
Biomedical literature
  Predications extracted from MEDLINE abstracts and publicly available full-text articles using text mining techniques
  Other corpora (e.g., ClinicalTrials.gov)
Terminological knowledge
  UMLS
Structured knowledge bases
  NCBI resources (e.g., Entrez Gene)
  Functional annotations from model organism databases
  …
Contributed knowledge
  The repository is open to collaborators outside NLM

Formalism: Triples
Facts
Assertions
Relations
Semantic predications
RDF triples
[Diagram: concept1 → relationship → concept2]
  Imipramine → treats → Panic Disorder
  APP → has_associated_disease → Alzheimer disease
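
As a minimal sketch of the triple formalism, the two example facts can be held as (subject, predicate, object) records and queried by pattern. The `Triple` class and the `match` helper are illustrative names, not part of the repository.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # concept1
    predicate: str  # relationship
    obj: str        # concept2

# The two example facts from the slide.
triples = [
    Triple("Imipramine", "treats", "Panic Disorder"),
    Triple("APP", "has_associated_disease", "Alzheimer disease"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

treated_by_imipramine = [
    t.obj for t in match(triples, subject="Imipramine", predicate="treats")
]
print(treated_by_imipramine)
```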

Annotated knowledge
Provenance information
  Source (e.g., PMID)
  Extraction mechanism
  Timestamp
Frequency information
  Redundancy
Collaborative annotation
  “Was this information useful?”
  Context of use/usefulness
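
One way to picture annotated knowledge is a triple that carries its provenance, with frequency derived by counting how many sources assert the same fact. The field names, source identifiers, and extraction labels below are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AnnotatedTriple:
    subject: str
    predicate: str
    obj: str
    source: str          # e.g., a PMID; placeholder values are used below
    extracted_by: str    # extraction mechanism
    timestamp: datetime  # when the fact entered the repository

stamp = datetime(2007, 6, 25, tzinfo=timezone.utc)
facts = [
    AnnotatedTriple("Imipramine", "treats", "Panic Disorder", "doc-1", "text mining", stamp),
    AnnotatedTriple("Imipramine", "treats", "Panic Disorder", "doc-2", "text mining", stamp),
    AnnotatedTriple("APP", "has_associated_disease", "Alzheimer disease", "doc-3", "database import", stamp),
]

# Frequency (redundancy): number of sources asserting the same triple.
frequency = Counter((f.subject, f.predicate, f.obj) for f in facts)
print(frequency[("Imipramine", "treats", "Panic Disorder")])
```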

Semantic Web perspective
Common format for knowledge
  Resource Description Framework (RDF)
Common identification scheme
  Uniform Resource Identifier (URI)
Standard tools
  RDF browsers
  RDF “reasoners”
High level of interest for biomedicine in the SW community
  Health Care and Life Sciences Interest Group

Advanced Library Services: Summary
[Diagram: the Biomedical Knowledge Repository is populated from biomedical literature (MEDLINE, CT.gov), terminological knowledge (UMLS), structured knowledge bases (GO, Entrez Gene), and contributed knowledge. It supports information retrieval, question answering, knowledge discovery, and document summarization, with source selection (PubMed, annotations).]

Advanced Library Services: Pilot projects
[Diagram: Populating the repository: Entrez Gene (structured knowledge bases) via XSLT; MEDLINE and CT.gov (biomedical literature) via SemRep. Exploiting the repository: document summarization, with source selection (PubMed).]

Pilot #1
Populating and exploiting the Biomedical Knowledge Repository
Converting Entrez Gene into RDF
With Satya Sahoo (U. Georgia) and Kelly Zeng (LHC)

Overview
[Diagram: XML (file) → RDF (file) → RDF (Oracle); labels: XSLT stylesheet, JAXP, Jena]
124 element tags; 2M genes
106 properties; 410M triples
Names: has_name
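
The tag-to-property mapping can be sketched as follows. The pilot used an XSLT stylesheet; this sketch swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The XML record is a hypothetical, heavily simplified stand-in for Entrez Gene XML, and `ProteinName` is an invented tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record; the real Entrez Gene XML schema differs.
XML = """
<genes>
  <gene id="351">
    <Names>APP</Names>
    <ProteinName>amyloid beta A4 protein</ProteinName>
  </gene>
</genes>
"""

# Element tag -> RDF property name. "Names" -> "has_name" appears on the slide;
# "ProteinName" is an illustrative tag name.
TAG_TO_PROPERTY = {"Names": "has_name", "ProteinName": "has_protein_name"}

def xml_to_triples(xml_text):
    """Emit one (subject, property, value) triple per mapped child element."""
    triples = []
    for gene in ET.fromstring(xml_text).iter("gene"):
        subject = f"geneid-{gene.get('id')}"
        for child in gene:
            prop = TAG_TO_PROPERTY.get(child.tag)
            if prop and child.text:
                triples.append((subject, prop, child.text.strip()))
    return triples

gene_triples = xml_to_triples(XML)
for triple in gene_triples:
    print(triple)
```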

[Figure: APP (GeneID: 351) → has_protein_name → amyloid beta A4 protein]

RDF triple: Gene property
subject: APP (geneid-351)
predicate: eg:has_protein_reference_name_E
object: amyloid beta A4 protein
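
For illustration, this triple could be written out as one N-Triples line. The namespace URI is an assumption: the slide shows only the `eg:` prefix, not what it expands to.

```python
# Hypothetical namespace; the slide shows only the "eg:" prefix, not its URI.
EG = "http://example.org/entrezgene#"

def to_ntriple(subject_id, prop, literal):
    """Serialize one gene property as an N-Triples line with a plain literal object."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{EG}{subject_id}> <{EG}{prop}> "{escaped}" .'

line = to_ntriple("geneid-351", "has_protein_reference_name_E", "amyloid beta A4 protein")
print(line)
```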

RDF graph: Connecting several genes
has_associated_disease
  PARK1 → Parkinson disease
  MAPT → Parkinson disease
  MAPT → Pick disease
  TBP → Parkinson disease
  TBP → Spinocerebellar ataxia
[Graph: PARK1, MAPT, and TBP share the single node Parkinson disease; MAPT is also linked to Pick disease, and TBP to Spinocerebellar ataxia]
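
The way a shared disease node connects genes can be sketched by grouping the has_associated_disease triples by their object. Variable names are illustrative.

```python
from collections import defaultdict

# has_associated_disease triples from the slide, as (gene, disease) pairs.
associations = [
    ("PARK1", "Parkinson disease"),
    ("MAPT", "Parkinson disease"),
    ("MAPT", "Pick disease"),
    ("TBP", "Parkinson disease"),
    ("TBP", "Spinocerebellar ataxia"),
]

genes_by_disease = defaultdict(set)
for gene, disease in associations:
    genes_by_disease[disease].add(gene)

# Diseases that connect more than one gene in the merged graph.
shared = {d: sorted(g) for d, g in genes_by_disease.items() if len(g) > 1}
print(shared)
```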

Future work
Transform additional resources into RDF
  UMLS Metathesaurus
  Other NCBI databases
  Drug knowledge bases
  …
Integrate resources
  Query across resources
[Diagram: APP → has_associated_disease → Alzheimer disease; PARK1 → has_associated_disease → Parkinson disease; Alzheimer disease and Parkinson disease → isa → Neurodegenerative diseases]
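
A query across resources might look like the sketch below: gene-disease triples (as from Entrez Gene) are joined with isa triples (as from a terminology) to find genes associated with any neurodegenerative disease. The function name and the split into two lists are assumptions for illustration.

```python
# Triples from the diagram, split by the kind of resource they would come from.
gene_disease = [
    ("APP", "Alzheimer disease"),
    ("PARK1", "Parkinson disease"),
]
isa = [
    ("Alzheimer disease", "Neurodegenerative diseases"),
    ("Parkinson disease", "Neurodegenerative diseases"),
]

def genes_for_class(disease_class):
    """Genes whose associated disease is a direct child of disease_class."""
    members = {child for child, parent in isa if parent == disease_class}
    return sorted(gene for gene, disease in gene_disease if disease in members)

print(genes_for_class("Neurodegenerative diseases"))
```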

From glycosyltransferase to congenital muscular dystrophy
[Diagram: EG:9215 (LARGE) → has_molecular_function → GO:0008375 (acetylglucosaminyltransferase); EG:9215 (LARGE) → has_associated_phenotype → MIM:608840 (Muscular dystrophy, congenital, type 1D); isa links connect GO:0008375 through GO:0008194 and GO:0016758 to GO:0016757 (glycosyltransferase)]
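
The path from a general molecular function to a phenotype can be sketched as a transitive walk over isa links followed by two lookups. The exact isa edges below are an assumption: the slide lists these identifiers, but the extracted text does not preserve which arrow joins which pair.

```python
# Assumed isa edges (child -> parents); only the identifiers come from the slide.
ISA = {
    "GO:0008375": ["GO:0008194", "GO:0016758"],
    "GO:0008194": ["GO:0016757"],
    "GO:0016758": ["GO:0016757"],
}
HAS_MOLECULAR_FUNCTION = {"EG:9215": ["GO:0008375"]}
HAS_ASSOCIATED_PHENOTYPE = {"EG:9215": ["MIM:608840"]}

def ancestors(term):
    """All terms reachable from term by following isa links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in ISA.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def phenotypes_for_function(go_term):
    """Phenotypes of genes annotated with go_term or any of its descendants."""
    result = set()
    for gene, functions in HAS_MOLECULAR_FUNCTION.items():
        if any(f == go_term or go_term in ancestors(f) for f in functions):
            result.update(HAS_ASSOCIATED_PHENOTYPE.get(gene, []))
    return sorted(result)

print(phenotypes_for_function("GO:0016757"))
```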

Pilot #2
Populating and exploiting the Biomedical Knowledge Repository
Semantic Medline: Multi-document summarization and visualization
With Marcelo Fiszman, M.D., Ph.D. and Halil Kilicoglu, M.S.

Advanced Library Services: Pilot projects
[Diagram: Populating the repository: MEDLINE and CT.gov (biomedical literature) via SemRep. Exploiting the repository: document summarization, with source selection (PubMed).]

Managing retrieval results
[Figure: information retrieval for “HIV” returns 500 citations; summarization in Semantic Medline turns them into a network of relations]

Seamless integration of technologies
Information retrieval
  PubMed - MEDLINE
  Essie - ClinicalTrials.gov
Natural language processing: SemRep
  Represent content of text with semantic predications
Abstraction summarization
  Informative: Overview of most salient information
Visualization
  Indicative: Links to source text and additional information

Semantic Medline: Overview
[Pipeline diagram: Query → PubMed / Essie → MEDLINE / ClinicalTrials.gov → Text → SemRep (with UMLS structured biomedical data) → Semantic predications → Summarize → Salient semantic predications → Visualize → Informative graph]

Document selection
[Pipeline diagram with the query step highlighted: PubMed query “HIV”]

MEDLINE citations
[Pipeline diagram with the text step highlighted]
… Tat activities, which play a role in HIV disease development.
… increased risk of invasive pneumococcal infection observed in HIV-1 infection.

Semantic interpretation
[Pipeline diagram with the SemRep step highlighted: text is mapped to semantic predications using UMLS]
… Tat activities, which play a role in HIV disease development.
  tat genes associated_with HIV Infections
… increased risk of invasive pneumococcal infection observed in HIV-1 infection.
  Pneumococcal Infections co-exists_with HIV Infections

Semantic predications
[Pipeline diagram with the semantic predications step highlighted]
tat genes causes Toxic effect
Pneumococcal Infections co-exists_with HIV Infections
tat genes associated_with HIV Infections
HIV Infections process_of Persons
Disease co-exists_with HIV Infections
…

Summarization
[Pipeline diagram with the Summarize step highlighted: semantic predications are reduced to salient semantic predications]

Abstraction summarization
[Diagram: All predications → Reduction → Salient predications]
Specify a topic
Retain predications on the topic
Eliminate uninformative predications
Retain most frequent predications
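
The reduction steps above can be sketched as three filters. This is a simplification, not the Semantic Medline algorithm: the repetition counts are invented, the stoplist of generic arguments stands in for whatever "uninformative" means in the real system, and `top_n` is an arbitrary cutoff.

```python
from collections import Counter

# Predications from the slides; the repetition counts are invented for illustration.
extracted = (
    [("tat genes", "causes", "Toxic effect")] * 2
    + [("Pneumococcal Infections", "co-exists_with", "HIV Infections")] * 3
    + [("tat genes", "associated_with", "HIV Infections")] * 4
    + [("HIV Infections", "process_of", "Persons")] * 5
    + [("Disease", "co-exists_with", "HIV Infections")] * 1
)

# Assumed stoplist of arguments too generic to be informative.
GENERIC = {"Disease", "Persons"}

def summarize(preds, topic, top_n=2):
    # 1. Retain predications on the topic (topic is subject or object).
    on_topic = [p for p in preds if topic in (p[0], p[2])]
    # 2. Eliminate uninformative predications (generic arguments).
    informative = [p for p in on_topic if p[0] not in GENERIC and p[2] not in GENERIC]
    # 3. Retain the most frequent predications.
    return [p for p, _ in Counter(informative).most_common(top_n)]

salient = summarize(extracted, "HIV Infections")
for predication in salient:
    print(predication)
```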

Salient semantic predications
[Pipeline diagram with the salient semantic predications step highlighted]
tat genes causes Toxic effect
Pneumococcal Infections co-exists_with HIV Infections
tat genes associated_with HIV Infections
HIV Infections process_of Persons
Disease co-exists_with HIV Infections
…

Visualization
[Pipeline diagram with the Visualize step highlighted: salient semantic predications are rendered as an informative graph]

Informative graph
[Pipeline diagram with the informative graph step highlighted]
[Graph: tat gene → causes → Toxic effect; tat gene → associated_with → HIV Infections; Pneumococcal Infections → co-exists_with → HIV Infections]
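
As a sketch of the visualization step, the salient predications in this graph can be written out as Graphviz DOT text, one labeled edge per predication. The choice of DOT is an assumption for illustration; the slides do not say which rendering tool Semantic Medline uses.

```python
# Salient predications shown in the informative graph.
salient = [
    ("tat gene", "causes", "Toxic effect"),
    ("tat gene", "associated_with", "HIV Infections"),
    ("Pneumococcal Infections", "co-exists_with", "HIV Infections"),
]

def to_dot(preds):
    """Render predications as a directed graph in DOT syntax."""
    lines = ["digraph summary {"]
    for subj, pred, obj in preds:
        lines.append(f'  "{subj}" -> "{obj}" [label="{pred}"];')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(salient)
print(dot)
```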

Semantic Medline: Live

Future work
Process all of MEDLINE/PubMed
  With SemRep
Incrementally integrate structured knowledge sources
  Entrez databases
  UMLS
  Genetics Home Reference
Implementation
  Efficiency
  Large amount of data

Summary
Deliver health information
  Biomedical Knowledge Repository
  Advanced Library Services
Exploit
  Current Library resources
  Advanced information technology
Support timely translation
  Of biomedical research
  Into improvements in patient care and public health

Acknowledgments
Caroline Ahlers
Mariana Dimitrov
Marcelo Fiszman
Halil Kilicoglu
François-Michel Lang
Lee Peters
Anna Ripple
Graciela Rosemblat
Satya Sahoo
Kelly Zeng

References
Bodenreider O, Rindflesch TC. Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Technical report. Bethesda, Maryland: Lister Hill National Center for Biomedical Communications, National Library of Medicine; September 14, 2006.
http://lhncbc.nlm.nih.gov/lhc/docs/reports/2006/tr2006001.pdf

Advanced Library Services
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Olivier Bodenreider [email protected]
Thomas C. Rindflesch [email protected]