A Technical Introduction to the Semantic Search Engine SeMedico
Erik Fäßler
Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany
http://www.julielab.de
Talk in the Semesterprojekt "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (development of a search engine for alternative methods to animal testing)
January 12, 2018, Humboldt-Universität zu Berlin

SeMedico System Overview
[Architecture diagram. Components shown: MEDLINE documents (Doc, Doc, Doc); NCBI Gene; JULIE Lab Server with PostgreSQL and the pipeline CR, AE, AE, AE, CO; ElasticSearch; Concept Database; SeMedico Web Application (Java Servlet); Frontend (Tapestry/JavaScript)]

MEDLINE Document Storage I
• MEDLINE comes as (G)ZIP-compressed XML
• 30K documents per file

<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1234567</PMID>
      <Article>
        <Journal>...</Journal>
        <ArticleTitle>...</ArticleTitle>
        <Abstract>...</Abstract>
        <AuthorList>...</AuthorList>
        <MeshHeadings>...</MeshHeadings>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>...</PMID>
      ...
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
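
The file layout of MEDLINE Document Storage I can be read roughly as follows. This is a minimal sketch using only the Python standard library; the function name `iter_citations` and the in-memory sample are illustrative assumptions, not part of SeMedico, which implements this step in Java.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def iter_citations(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (pmid, xml) for each MedlineCitation in a gzipped MEDLINE file."""
    with gzip.open(stream, "rb") as xml_stream:
        # iterparse reports elements as they complete, so citations can be
        # emitted one at a time instead of after reading the whole file.
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == "MedlineCitation":
                pmid = elem.findtext("PMID")
                yield pmid, ET.tostring(elem, encoding="unicode").strip()
                elem.clear()  # drop the subtree that was just emitted


# Tiny in-memory stand-in for one distribution file (assumed content).
SAMPLE = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>1234567</PMID>
<Article><ArticleTitle>First</ArticleTitle></Article></MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID>1729454</PMID>
<Article><ArticleTitle>Second</ArticleTitle></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

rows = list(iter_citations(io.BytesIO(gzip.compress(SAMPLE))))
```

Each yielded pair has the shape of one row of the database table on the next slide.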

MEDLINE Document Storage II
• Import of MEDLINE citations into a database table
• Size of MEDLINE: 27M abstracts

      pmid      xml
1     1234567   <MedlineCitation><PMID>1234567</PMID>...</MedlineCitation>
2     1729454   <MedlineCitation><PMID>1729454</PMID>...</MedlineCitation>
3     1785742   <MedlineCitation><PMID>1785742</PMID>...</MedlineCitation>
4     2264674   <MedlineCitation><PMID>2264674</PMID>...</MedlineCitation>
...   ...       ...
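
A table of this shape could be filled as sketched below. The sketch uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; the table name `medline`, the function name `import_citations`, and the choice to overwrite a citation when its PMID arrives again are all assumptions.

```python
import sqlite3
from typing import Iterable, Tuple


def import_citations(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> int:
    """Store (pmid, xml) pairs, replacing citations whose PMID already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medline (pmid TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO medline (pmid, xml) VALUES (?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM medline").fetchone()[0]


conn = sqlite3.connect(":memory:")  # sqlite3 stands in for PostgreSQL here
total = import_citations(conn, [
    ("1234567", "<MedlineCitation><PMID>1234567</PMID></MedlineCitation>"),
    ("1729454", "<MedlineCitation><PMID>1729454</PMID></MedlineCitation>"),
    # Same PMID again: the earlier row is overwritten (assumed update policy).
    ("1234567", "<MedlineCitation><PMID>1234567</PMID><!-- revised --></MedlineCitation>"),
])
```

Keying the table on the PMID keeps one row per citation even when the same citation is imported more than once.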

From the Database into the Pipeline I
[Diagram: the pmid/xml table from the previous slide, shown together with the system architecture diagram (MEDLINE documents, NCBI Gene, JULIE Lab Server with PostgreSQL and the pipeline CR, AE, AE, AE, CO, ElasticSearch, Concept Database, SeMedico Web Application, Frontend)]

From the Database into the Pipeline II
UIMA Medline DB Reader
• DB concurrency handling
• Parsing of XML
• Populating UIMA CAS instance
  – Title/Abstract
  – Authors
  – Journal info
  – etc.
[Diagram: PostgreSQL (JULIE Lab Server) → CAS (Common Analysis System) → to text analysis components]
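
The reader's "parse XML, populate CAS" step can be pictured as below. This is only a sketch: `SimpleCas` is a hypothetical dataclass standing in for a UIMA CAS, `populate_cas` is an invented name, and the child elements `Journal/Title`, `Abstract/AbstractText`, `ForeName` and `LastName` are assumed from the public PubMed XML layout rather than taken from the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleCas:
    """Hypothetical stand-in for a UIMA CAS: document text plus metadata."""
    pmid: str
    title: str
    abstract: str
    authors: List[str] = field(default_factory=list)
    journal: str = ""


def populate_cas(citation_xml: str) -> SimpleCas:
    """Parse one stored MedlineCitation and copy its fields into a SimpleCas."""
    root = ET.fromstring(citation_xml)
    article = root.find("Article")
    authors = [
        " ".join(filter(None, (a.findtext("ForeName"), a.findtext("LastName"))))
        for a in article.iterfind("AuthorList/Author")
    ]
    return SimpleCas(
        pmid=root.findtext("PMID", default=""),
        title=article.findtext("ArticleTitle", default=""),
        abstract=" ".join(t.text or "" for t in article.iterfind("Abstract/AbstractText")),
        authors=authors,
        journal=article.findtext("Journal/Title", default=""),
    )


# One value of the xml column, with made-up content.
row_xml = (
    "<MedlineCitation><PMID>1234567</PMID><Article>"
    "<Journal><Title>Example Journal</Title></Journal>"
    "<ArticleTitle>An example title</ArticleTitle>"
    "<Abstract><AbstractText>An example abstract.</AbstractText></Abstract>"
    "<AuthorList><Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author></AuthorList>"
    "</Article></MedlineCitation>"
)
cas = populate_cas(row_xml)
```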

SeMedico UIMA JCoRe Pipeline I
[Pipeline diagram, from reader to consumer]
• Sentences, Tokens, Abbreviations, Parts of Speech
• GeNo: Genes/Proteins
  – Recognition
  – Normalization (NCBI Gene)
• Semantic layer
• Molecular Event Extraction (BioSem)
• MeSH Terms (Dictionary)
• Ontology classes (GO, GRO; Dictionary)
• Event Certainty Assessment
  – Scale 1 to 6
  – 1: Negation
  – 6: No doubt
• Species (LINNAEUS)
https://github.com/JULIELab/, Hahn & Matthies et al., LREC 2016
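
The idea of chaining analysis engines, each adding annotations that later ones can build on, can be shown with a toy sketch. Nothing here is JCoRe code: `Annotation`, `Document`, `tokenizer`, `gene_tagger` and `run_pipeline` are invented names, and a dictionary lookup replaces GeNo's actual recognition and normalization. The gene ID for mTOR is the one shown on the Concept IDs slide.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Gene"
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive
    value: str = ""


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def tokenizer(doc: Document) -> None:
    """First engine: add one Token annotation per word."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end(), m.group()))


GENE_DICTIONARY: Dict[str, str] = {"mtor": "ncbigene:2475"}  # toy dictionary


def gene_tagger(doc: Document) -> None:
    """Second engine: relies on the tokens added by the engine before it."""
    for tok in [a for a in doc.annotations if a.kind == "Token"]:
        gene_id = GENE_DICTIONARY.get(tok.value.lower())
        if gene_id:
            doc.annotations.append(Annotation("Gene", tok.begin, tok.end, gene_id))


def run_pipeline(doc: Document, engines: List[Callable[[Document], None]]) -> Document:
    for engine in engines:  # order matters: later engines see earlier results
        engine(doc)
    return doc


doc = run_pipeline(Document("Expression of mTOR was evaluated."), [tokenizer, gene_tagger])
genes = [(a.begin, a.end, a.value) for a in doc.annotations if a.kind == "Gene"]
```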

SeMedico UIMA JCoRe Pipeline II
ElasticSearch CAS Consumer
• Transforms CAS into preanalyzed JSON document
• Transformation configurable via API
• JULIE Lab ES plugin required
[Diagram: from analysis pipeline → CAS (title, abstract, species, genes, events) → transformation API → preanalyzed JSON → http → ElasticSearch]
preanalyzed JSON
{
  "title": {…}, "abstract": {…}, "authors": {…}, "…": {…}
}
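
The final "http" hop amounts to sending one JSON document per citation to the cluster. The sketch below only builds such a request and does not send it. The host, the index name `medline` and the function name are assumptions; the `_doc` path segment follows recent ElasticSearch versions, while clusters from the time of the talk used a type name in that position. The real consumer sends preanalyzed field values, which are replaced by plain strings here.

```python
import json
import urllib.request


def build_index_request(base_url: str, index: str, doc_id: str, document: dict) -> urllib.request.Request:
    """Build (but do not send) the HTTP request that would index one document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://localhost:9200", "medline", "1234567",
    {"title": "An example title", "abstract": "An example abstract."},
)
# urllib.request.urlopen(request) would send it to a running cluster.
```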

Full Texts from PubMed Central
• SeMedico integrates the open access subset of PMC
• Use a specific reader from JCoRe: jcore-pmc-reader
• The rest of the analysis is basically the same
• But:
  Matthies, Franz, & Hahn, Udo (2017). Scholarly information extraction is going to make a quantum leap with PubMed Central (PMC)® — But moving from abstracts to full texts seems harder than expected. in: MedInfo 2017: Precision Healthcare through Informatics – Proceedings of the 16th World Congress on Medical and Health Informatics. Hangzhou, China, 21-25 August 2017, 521-525.

Concept Database I

Name                              Description                                          Number of Concepts
Medical Subject Headings (MeSH)   Biomedical vocabulary, multi-hierarchy               26K
MeSH Supplementary Concepts       Chemicals, proteins etc. connected to MeSH           150K
NCBI Gene                         Gene database                                        650K (in SeMedico)
NCBI Taxonomy                     Taxonomical classification of species                1.1M
Gene Ontology (GO)                Ontology about gene products and related processes   50K
Gene Regulation Ontology (GRO)    Ontology about gene regulation processes             507

Concept Database II
• Concepts are arranged taxonomically
  – Squamous Cell Carcinoma IS-A Carcinoma
• Neo4j is a graph database
  – Terminologies and arbitrary relations between concepts can be modeled explicitly
  – Appropriate query language:
    "Get descendants of concept"
    "Compute shortest path between two concepts"
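
The two example queries are ordinary graph traversals. The sketch below spells them out in plain Python over a toy IS-A table; it is not Neo4j's query language, and apart from the slide's "Squamous Cell Carcinoma IS-A Carcinoma" edge the entries are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# child -> parents (IS-A edges); toy excerpt, only the first edge is from the slide
IS_A: Dict[str, List[str]] = {
    "Squamous Cell Carcinoma": ["Carcinoma"],
    "Adenocarcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasms"],
}


def descendants(concept: str) -> Set[str]:
    """'Get descendants of concept': breadth-first walk down the IS-A edges."""
    children: Dict[str, List[str]] = {}
    for child, parents in IS_A.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    found: Set[str] = set()
    queue = deque([concept])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def shortest_path(start: str, goal: str) -> Optional[List[str]]:
    """'Compute shortest path': breadth-first search ignoring edge direction."""
    neighbours: Dict[str, Set[str]] = {}
    for child, parents in IS_A.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


below_carcinoma = descendants("Carcinoma")
path = shortest_path("Squamous Cell Carcinoma", "Adenocarcinoma")
```

A graph database answers such questions through its query language, so the application does not have to load the terminology into memory first.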

Neo4j Example Graph
[Figure: example graph; node labels include type1, type2, type3, type4 and Tauopathies]

Concept IDs
[Diagram: the Concept Database supplies concept IDs (tid2341, tid914, tid42). The CAS holds the abstract with the annotations species: ncbitax:9606 and genes: mTOR, ncbigene:2475. The transformation API turns it into the JSON below, which goes to ElasticSearch. The SeMedico Web Application (Java Servlet) is shown with the query and facet below.]
JSON
{
  "abstract": {["human", "tid914", "mTOR", "tid42"]}
}
query: "match: tid914"
facet "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"}
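
The step from source identifiers to concept IDs in the index terms can be sketched as a lookup. The mapping below is read off this slide's example (human with ncbitax:9606 appears next to tid914, mTOR with ncbigene:2475 next to tid42); the function name and data layout are assumptions.

```python
from typing import Dict, List, Tuple

# Source identifier -> SeMedico concept ID, as inferred from the slide's example.
CONCEPT_IDS: Dict[str, str] = {
    "ncbitax:9606": "tid914",
    "ncbigene:2475": "tid42",
}


def index_terms(mentions: List[Tuple[str, str]]) -> List[str]:
    """Turn (surface form, source ID) pairs into surface form + concept ID terms."""
    terms: List[str] = []
    for surface, source_id in mentions:
        terms.append(surface)
        tid = CONCEPT_IDS.get(source_id)
        if tid is not None:  # unknown IDs contribute only their surface form
            terms.append(tid)
    return terms


terms = index_terms([("human", "ncbitax:9606"), ("mTOR", "ncbigene:2475")])
```

Because the concept ID is indexed alongside the word, a query for tid914 can match the document regardless of which surface form the text used.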

ElasticSearch I
• Manages the Lucene index
• Seamless index updates, no downtime
• Easy-to-use index distribution model
• Full-text search
• Faceting
• Highlighting

ElasticSearch II
• Lucene generates index terms via "text analysis"
  – Tokenization, case folding, synonym enrichment, stemming
  – ElasticSearch does the same on the document text it receives
• How to integrate UIMA?
• First idea: Create a Lucene UIMA analyzer, but
  – Moves a lot of processing load into the ElasticSearch cluster
  – Requires loading dictionaries and machine learning models
  – That memory is then lost to Lucene and ElasticSearch
  – Overall: Diminishes search performance
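
The four analysis steps named above can be illustrated with a toy analyzer. This is not Lucene code: the suffix list is a crude stand-in for a real stemmer, and the synonym table holds a single entry (mTOR and FRAP, the pair shown on the Concept IDs slide).

```python
import re
from typing import Dict, List

SYNONYMS: Dict[str, List[str]] = {"mtor": ["frap"]}  # toy synonym table
SUFFIXES = ("ed", "ing", "s")  # crude stand-in for a real stemmer


def stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> List[str]:
    terms: List[str] = []
    for token in re.findall(r"\w+", text):      # tokenization
        token = token.lower()                   # case folding
        terms.append(stem(token))               # stemming
        terms.extend(SYNONYMS.get(token, []))   # synonym enrichment
    return terms


terms = analyze("Phosphorylated mTOR was evaluated")
```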

ElasticSearch III
• JULIE Lab ElasticSearch plugin to exactly specify index terms without ES-internal analysis
  – https://github.com/JULIELab/elasticsearch-mapper-preanalyzed
• Employs the JSON format created for the Solr JsonPreAnalyzedParser
  – https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses-JsonPreAnalyzedParser
• The preanalyzed JSON is created by a CAS consumer that is (currently) internal to the JULIE Lab

ElasticSearch IV
Preanalyzed Format
{"v": "1",
 "str": "Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
 "tokens": [
   {"t": "immunohistochemistry", "s": 0, "e": 20, "i": 1},
   {"t": "tid94702", "s": 0, "e": 20, "i": 0},
   {"t": "perform", "s": 21, "e": 30, "i": 1},
   {"t": "evaluat", "s": 34, "e": 42, "i": 1},
   {"t": "event", "s": 34, "e": 42, "i": 0}, …
 ]
}
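
A document in this layout can be produced along the following lines. The key names ("v", "str", "tokens", "t", "s", "e", "i") are taken from the slide; the function name, the word-level tokenization and the lookup table are assumptions, and no stemming is applied, so terms stay as lowercased words.

```python
import json
import re
from typing import Dict, List


def preanalyze(text: str, concept_ids: Dict[str, str]) -> str:
    """Serialize text into the preanalyzed JSON layout shown on the slide."""
    tokens: List[dict] = []
    for m in re.finditer(r"\w+", text):
        term = m.group().lower()
        # Each word advances the token position by one ...
        tokens.append({"t": term, "s": m.start(), "e": m.end(), "i": 1})
        # ... while its concept ID is stacked on the same position (increment 0).
        if term in concept_ids:
            tokens.append({"t": concept_ids[term], "s": m.start(), "e": m.end(), "i": 0})
    return json.dumps({"v": "1", "str": text, "tokens": tokens})


doc = json.loads(
    preanalyze("Immunohistochemistry performed", {"immunohistochemistry": "tid94702"})
)
```

A position increment of 0 puts the concept ID at the same index position as the word it annotates, so either term matches the same span of text.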

ElasticSearch V
Simple Query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "cancer" }}},
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [{ "match": { "events.allarguments": "mtor" }}],
                "filter": { "range": { "events.likelihood": { "lte": 5 }}}
              }
            }
        }}
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}

ElasticSearch VI
Concept Query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "tid52310" }}},
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [{ "match": { "events.allarguments": "tid42" }}],
                "filter": { "range": { "events.likelihood": { "lte": 5 }}}
              }
            }
        }}
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}
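
The simple query and the concept query share one structure and differ only in the two search terms. A small builder makes that explicit; the function name and its parameters are assumptions, while the field names and the query layout are those of the two slides.

```python
from typing import Any, Dict


def event_query(abstract_term: str, event_argument: str, max_likelihood: int = 5) -> Dict[str, Any]:
    """Build the query body: a match on the abstract plus a nested event match."""
    return {
        "query": {"bool": {"must": [
            {"match": {"abstracttext": {"query": abstract_term}}},
            {"nested": {
                "path": "events",
                "inner_hits": {},
                "query": {"bool": {
                    "must": [{"match": {"events.allarguments": event_argument}}],
                    "filter": {"range": {"events.likelihood": {"lte": max_likelihood}}},
                }},
            }},
        ]}},
        "fields": ["abstracttext", "title"],
    }


simple_query = event_query("cancer", "mtor")        # plain words
concept_query = event_query("tid52310", "tid42")    # concept IDs
```

Since concept IDs are indexed as ordinary terms, searching by concept needs no special query type, only different term values.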

References
• Semedico
  – Faessler, Erik, & Hahn, Udo (2017). SEMEDICO: A comprehensive semantic search engine for the life sciences. in: ACL 2017 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Vancouver, British Columbia, Canada, August 1, 2017, 91–96.
• GeNo
  – Wermter, Joachim, & Tomanek, Katrin, & Hahn, Udo (2009). High-performance gene name normalization with GeNo. in: Bioinformatics, 25, 815-821.
• BioSem
  – Bui, Q., Mulligen, E. van, Campos, D., & Kors, J. (2013). A Fast Rule-based Approach for Biomedical Event Extraction. In Proceedings of the BioNLP 2013 Shared Task Workshop (pp. 104–108). Sofia, Bulgaria: Association for Computational Linguistics.
• Certainty Assessment
  – Engelmann, Christine, & Hahn, Udo (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. in: CogSci 2014 - Proceedings of the 36th Annual Cognitive Science Conference. Cognitive Science Meets Artificial Intelligence: Human and Artificial Agents in Interactive Contexts. Québec City, Québec, Canada, July 23-26, 2014, 451-456.
• JCoRe
  – Hahn, Udo, & Matthies, Franz, & Faessler, Erik, & Hellrich, Johannes (2016). UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. in: LREC 2016 – Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, 23-28 May 2016, 2502-2509.

Conclusion
[Architecture diagram as on the system overview slide: MEDLINE documents, NCBI Gene, JULIE Lab Server with PostgreSQL and the pipeline CR, AE, AE, AE, CO, ElasticSearch, Concept Database, SeMedico Web Application (Java Servlet), Frontend (Tapestry/JavaScript)]
http://www.semedico.org/