Indexing UMLS concepts with Apache Lucene Julien Thibault jcv.thibault@gmail.com University of Utah Department of Biomedical Informatics


Page 1: Indexing UMLS concepts with Apache Lucene Julien Thibault jcv.thibault@gmail.com University of Utah Department of Biomedical Informatics

Indexing UMLS concepts with Apache Lucene

Julien Thibault
jcv.thibault@gmail.com

University of Utah
Department of Biomedical Informatics


Outline

• Goals
• Unified Medical Language System (UMLS)
• Apache Lucene
• Get to work!


Goals

• Build a dictionary lookup module for NLP pipelines
  – Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
  – Output: list of concept identifiers (e.g. “C083562”)

• Application examples:
  – Unstructured clinical document coding
  – (Semi)automated literature indexing

• Pre-processing necessary for free text (not covered today):
  – Tokenization
  – Sentence detection
  – Part-of-speech tagging (e.g. to look up only noun phrases)
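The input/output contract of the lookup module can be sketched with a plain in-memory map before any search library is involved. This is only an illustration: the class name `DictionaryLookup`, the normalization rule (trim, collapse white space, lower-case), and the placeholder identifiers are assumptions, not part of the tutorial code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy dictionary: normalized term -> concept identifiers.
class DictionaryLookup {
    private final Map<String, List<String>> conceptsByTerm = new HashMap<>();

    // Assumed normalization: trim, collapse white space, lower-case.
    static String normalize(String term) {
        return term.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Registers one term for one concept identifier.
    void add(String term, String conceptId) {
        conceptsByTerm.computeIfAbsent(normalize(term), k -> new ArrayList<>()).add(conceptId);
    }

    // Returns the identifiers registered for the term, or an empty list.
    List<String> lookup(String term) {
        List<String> ids = conceptsByTerm.get(normalize(term));
        if (ids == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(ids);
    }

    public static void main(String[] args) {
        DictionaryLookup dict = new DictionaryLookup();
        dict.add("Breast Cancer", "ID-1"); // placeholder, not a real CUI
        dict.add("warfarin", "ID-2");      // placeholder, not a real CUI
        System.out.println(dict.lookup("breast   cancer"));
        System.out.println(dict.lookup("aspirin"));
    }
}
```

An exact-match map like this cannot handle partial matches or ranking, which is the gap a search library is meant to fill.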


UMLS

• Unified Medical Language System (NLM)
  – Millions of organized biomedical concepts
  – Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MeSH)
  – A good source of biomedical concepts to index
  – UMLS Terminology Services: https://uts.nlm.nih.gov/home.html

• Content
  – Concepts, synonymous names, relationships
  – Semantic network (high-level classification)
    • Organism, anatomical structure, biologic function, chemical, …

• Distribution
  – Files with concept and relationship description data
  – Loadable into a database for querying
  – Files/columns: http://www.ncbi.nlm.nih.gov/books/NBK9685/


UMLS schema

• 19 files to describe:
  – Concepts
  – Relationships
  – The files themselves (columns and content)

• MRCONSO
  – Concept names and sources

• MRSTY
  – Concept semantic types

• Terminology (source) codes
  – http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html


Concept table (MRCONSO)

• MySQL database
  – mysql -u [user] -h [host] -D [database] -p
  – Replace with provided info (thanks Kristina!!)

• Query example:

    select * from MRCONSO where STR like 'my favorite disease';

CUI       LAT  LUI       SAB       STR                                  …
C0001175  ENG  L0001175  MSH       Acquired Immunodeficiency Syndromes  …
C0001175  ENG  L0001842  SNOMEDCT  AIDS                                 …
C0001175  FRE  L0162173  SNOMEDCT  SIDA                                 …

CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: source; STR: string
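Because one CUI appears on several MRCONSO rows (one per name and source), a lookup dictionary usually needs the rows grouped by concept. The sketch below does that for the three sample rows from the table, keeping only one language. The class `ConsoRows`, its `Row` type, and the method names are hypothetical; a real program would fill the list from the MySQL query instead of hard-coding it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups MRCONSO names by CUI for one language.
class ConsoRows {
    // One MRCONSO row reduced to the columns shown on the slide.
    static final class Row {
        final String cui, lat, lui, sab, str;

        Row(String cui, String lat, String lui, String sab, String str) {
            this.cui = cui;
            this.lat = lat;
            this.lui = lui;
            this.sab = sab;
            this.str = str;
        }
    }

    // Collects the STR values of every row in the given language, keyed by CUI.
    static Map<String, List<String>> namesByCui(List<Row> rows, String language) {
        Map<String, List<String>> names = new LinkedHashMap<>();
        for (Row row : rows) {
            if (row.lat.equals(language)) {
                names.computeIfAbsent(row.cui, k -> new ArrayList<>()).add(row.str);
            }
        }
        return names;
    }

    // The three sample rows from the slide.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("C0001175", "ENG", "L0001175", "MSH", "Acquired Immunodeficiency Syndromes"),
            new Row("C0001175", "ENG", "L0001842", "SNOMEDCT", "AIDS"),
            new Row("C0001175", "FRE", "L0162173", "SNOMEDCT", "SIDA"));
    }

    public static void main(String[] args) {
        // prints {C0001175=[Acquired Immunodeficiency Syndromes, AIDS]}
        System.out.println(namesByCui(sampleRows(), "ENG"));
    }
}
```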


Apache Lucene

• Relational databases are not optimized for string search (e.g. partial matches, phrases)

• Apache Lucene
  – http://lucene.apache.org/
  – High-performance text search engine library
    • Ranked searching (score)
    • Phrase queries, wildcard queries, proximity queries…
  – Java API to:
    • build indexes
    • perform lookups
  – Integrates nicely into UIMA


Apache Lucene index

• Indexes are stored on disk and loaded at runtime

• Documents
  – Index entries with indexable fields
  – The set of fields does not need to be the same for each document
  – Searches target one field at a time and return the whole matching document

• Default match scoring
  – Higher scores = good overlap, non-frequent words, short fields

CUI       LAT  SAB       STR                                  EXTRA
C0001175  -    MSH       Acquired Immunodeficiency Syndromes  -
C0001175  ENG  SNOMEDCT  AIDS                                 genial
C0001175  FRE  SNOMEDCT  SIDA                                 -

Each row is a document and each column is a field; “-” marks a field that the document does not have.
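The document/field model above can be illustrated with a toy inverted index written against the JDK only. It is not how Lucene is implemented and it does no scoring. It only shows documents with differing field sets, a search that targets one field, and a result that returns the whole document. The class `ToyFieldIndex` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: field name -> token -> ids of matching documents.
class ToyFieldIndex {
    private final List<Map<String, String>> documents = new ArrayList<>();
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Stores the document and indexes every white-space token of every field.
    void addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field.getKey(), k -> new HashMap<>())
                        .computeIfAbsent(token, k -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    // Looks up one token in one field and returns the whole matching documents.
    List<Map<String, String>> search(String fieldName, String token) {
        List<Map<String, String>> hits = new ArrayList<>();
        Map<String, Set<Integer>> byToken = postings.get(fieldName);
        if (byToken == null) {
            return hits;
        }
        Set<Integer> ids = byToken.get(token.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return hits;
        }
        for (int id : ids) {
            hits.add(documents.get(id));
        }
        return hits;
    }

    // Convenience builder: doc("CUI", "C0001175", "STR", "AIDS").
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            fields.put(keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        index.addDocument(doc("CUI", "C0001175", "SAB", "MSH",
                "STR", "Acquired Immunodeficiency Syndromes"));
        index.addDocument(doc("CUI", "C0001175", "LAT", "ENG", "SAB", "SNOMEDCT",
                "STR", "AIDS", "EXTRA", "genial"));
        System.out.println(index.search("STR", "aids"));
    }
}
```

Searching the `STR` field for “aids” returns the second document with all of its fields, including `EXTRA`, even though the first document has no such field.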


Apache Lucene Analyzer

• Defines the pre-processing step applied to
  – Strings indexed by Lucene
  – Strings that are looked up in the index

• Components
  – Tokenizer: creates the token stream (e.g. based on white space)
  – Filter: applied to the token stream (e.g. lower case, stop words)

• This is a good place to customize the matching algorithm, but see also:
  – Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
  – CustomScoreQuery (to customize the scoring function)
  – WildcardQuery, FuzzyQuery, RegexpQuery
  – KeywordAnalyzer (no tokenization)
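The tokenizer-plus-filters idea can be sketched without Lucene. The toy pipeline below splits on white space, lower-cases each token, and drops stop words. The class `ToyAnalyzer` and its five-word stop list are assumptions made for illustration and do not reproduce any Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical analyzer: white-space tokenizer, lower-case filter, stop filter.
class ToyAnalyzer {
    // Assumed stop list, far shorter than any real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Turns raw text into the token list that would be indexed or searched.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {       // tokenizer
            String token = raw.toLowerCase(Locale.ROOT);      // lower-case filter
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) { // stop filter
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same analysis runs on indexed strings and on query strings.
        System.out.println(analyze("Cancer of the Breast")); // prints [cancer, breast]
        System.out.println(analyze("cancer BREAST"));        // prints [cancer, breast]
    }
}
```

Running the same analysis at index time and at query time is what lets “Cancer of the Breast” and “cancer BREAST” reduce to the same tokens.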


Building an index

    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.Version;

    // create reference to Lucene index to be stored on disk
    Directory dir = FSDirectory.open(new File(indexPath));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // tokenizer, filter
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc); // get index writer
    …
    Document doc = new Document(); // create new entry (i.e. document)
    Field myfield = new TextField("term", term, Field.Store.YES); // create field
    doc.add(myfield); // add field to document
    …
    writer.addDocument(doc); // add document to index
    …
    writer.close(); // save updated index

http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html

StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.

Field.Store.YES = the original field value is stored in the index, so it can be retrieved from search results (a TextField is indexed and tokenized either way)


Creating index queries

    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // create reference to existing Lucene index stored on disk
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));

    // prepare search
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    // create query on the "term" field
    QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
    Query query = parser.parse("hello*"); // search for terms that start with 'hello'

    // search
    TopDocs results = searcher.search(query, 5); // search for top 5 matches

    // collect results
    ScoreDoc[] hits = results.scoreDocs; // collect matches
    int numTotalHits = results.totalHits; // count number of results
    …
    Document doc = searcher.doc(hits[0].doc); // retrieve first matching entry
    float score = hits[0].score; // retrieve score of first matching entry (a float)
    String term = doc.get("term"); // retrieve value of field "term"

http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/SearchFiles.html


Let's get to work!

• Download necessary files
  – Apache Lucene Core API
    • http://lucene.apache.org/core/mirrors-core-latest-redir.html?
  – MySQL Java connector
    • http://dev.mysql.com/downloads/connector/j/
  – Files for this tutorial

• Create Eclipse project
  – Add necessary JAR files to build path
  – Copy source files to project src folder

• Complete code to:
  – Build index from MySQL query (don’t use all concepts!!)
  – Create search function that returns the CUIs of matching terms


Merci!
[C2986674] Thank you (NCI)

Julien Thibault
jcv.thibault@gmail.com

University of Utah
Department of Biomedical Informatics