Automatic Document Categorisation by User Profile in MEDLINE
Euripides G.M. Petrakis, Angelos Hliaoutakis
Intelligent Systems Laboratory, www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Crete, Greece
ISHIMR 2011, Zurich, Switzerland
Problem Definition
• Medical information systems are designed for experts
  – Domain-specific answers to experts
• Must also serve naive consumers
  – Easy to read and comprehend information
• Investigate methods for the categorization of information by user profile
  – Experts: use complex terms for their searches
  – Consumers: do simple searches using natural language terms
Current Practices
• In MEDLINE of the U.S. NLM, documents are indexed by experts
  – 10-12 MeSH terms per document (pathology, disease, treatment, drugs, etc.)
  – Over 15 million documents: slow!
  – Automate this process
  – No categorization
• MedScape, MedlinePlus, MedHunt rely on the manual categorization of information
  – Slow, does not scale up for large collections
Objectives
• Investigate methods for automatic document indexing in MEDLINE
• The extracted index terms are subsequently used for filtering documents by user profile
• Main idea: categorize terms as either simple terms comprehensible to consumers or more involved terms suitable for experts
Resources
• Automatic indexing in MEDLINE
  – MMTx [U.S. NLM]: MMTx focuses on UMLS rather than MeSH
  – AMTEx [DKE, 2009]: MeSH terms, faster and more accurate than MMTx
• Dictionaries for biomedical and health-related concepts
  – UMLS Metathesaurus, MeSH
• Dictionaries for general English words
  – WordNet, Specialist
MMTx (MetaMap Transfer)
• Developed by U.S. NLM
• Maps text to UMLS Metathesaurus concepts
  – but MEDLINE indexing is based on MeSH
  – MeSH is a subset of Metathesaurus
• Suffers from term overgeneration
  – Unrelated terms added to the final candidate list
  – Topic drift
HIKM’2006
The AMTEx method [DKE 2009]
• Main ideas:
  – Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
  – Extracts general single and multi-word terms (noun phrases)
  – Mainly multi-word terms: “heart disease”, “coronary artery disease”
  – Extracted terms are validated against MeSH
  – Faster than MMTx, with improved precision and merely a fifth of MMTx's term output
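To illustrate the statistical part of the term extraction step, here is a minimal sketch of the C-value score in its commonly published form: a candidate term's frequency is discounted by the average frequency of the longer candidates that contain it, then scaled by the log of its length in words. This is a simplification and not the AMTEx implementation; the NC-value context weighting is omitted, and the function names and frequency counts are assumptions made up for the example.

```python
import math

def _contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(term, freq):
    """Simplified C-value of a candidate term given candidate frequencies."""
    words = term.split()
    # Longer candidates in which this term is nested.
    containers = [t for t in freq if t != term and _contains(t.split(), words)]
    f = freq[term]
    if containers:
        # Discount occurrences explained by the longer candidate terms.
        f -= sum(freq[t] for t in containers) / len(containers)
    return math.log2(len(words)) * f

# Hypothetical candidate frequencies (illustrative only).
freq = {"coronary artery disease": 4, "artery disease": 6, "heart disease": 5}
scores = {t: c_value(t, freq) for t in freq}
```

With these toy counts, "artery disease" is penalized because most of its occurrences are inside "coronary artery disease", which is the behaviour that favours complete multi-word terms.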
Example
Input: Full text article
MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”
MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”
AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”
Term & Document Categorization
New Vocabularies
• Vocabulary of General Terms (VGT): 105,675 general (WordNet) terms
• Vocabulary of Consumer Terms (VCT): 7,165 consumer (MeSH) terms
• Vocabulary of Expert Terms (VET): 16,719 expert (MeSH) terms
VGT = (WordNet) - (MeSH)
VCT = (MeSH) ∩ (WordNet)
VET = (MeSH) - (WordNet)
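The three vocabularies can be sketched as plain set operations. This assumes that general terms are those found only in WordNet, consumer terms are MeSH terms that also appear in WordNet, and expert terms are MeSH terms that WordNet does not contain. The two tiny term sets below are made-up stand-ins, not the real MeSH or WordNet contents.

```python
# Toy stand-ins for the real term lists (assumptions for illustration only).
mesh_terms = {"pain", "headache", "facial neuralgia", "azauridine"}
wordnet_terms = {"pain", "headache", "business", "community"}

vgt = wordnet_terms - mesh_terms   # general terms: in WordNet only
vct = mesh_terms & wordnet_terms   # consumer terms: in both MeSH and WordNet
vet = mesh_terms - wordnet_terms   # expert terms: in MeSH only
```

By construction VCT and VET partition the MeSH terms, so every MeSH index term of a document falls into exactly one of the two profiles.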
Document Categorization
• Documents are represented by vectors of terms extracted by AMTEx, MMTx or assigned by human experts
• The more VET (VCT) terms a document contains, the higher its probability of being suitable for experts (consumers)
  – E.g., a document with VET% = 0.62 has a 62% probability of being suitable for experts
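A minimal sketch of the scoring idea, assuming that VET% (VCT%) is simply the share of a document's distinct index terms that belong to VET (VCT). The slides do not spell out the exact formula, so the function name, the normalization, and the toy vocabularies are assumptions.

```python
def profile_scores(doc_terms, vet, vct):
    """Return (VET share, VCT share) of a document's distinct terms."""
    terms = set(doc_terms)
    if not terms:
        return 0.0, 0.0
    vet_share = len(terms & vet) / len(terms)
    vct_share = len(terms & vct) / len(terms)
    return vet_share, vct_share

# Toy vocabularies and document (illustrative only).
vet = {"azauridine", "facial neuralgia"}
vct = {"pain", "headache"}
doc = ["pain", "azauridine", "facial neuralgia", "headache", "knee"]
vet_share, vct_share = profile_scores(doc, vet, vct)
```

Terms outside both vocabularies (here "knee") count toward the document length but toward neither score, so the two shares need not sum to 1.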
Evaluation
• Precision and Recall measures: a good method has high values of both
• Dataset: OHSUMED, 348,566 MEDLINE abstracts that come with 64 queries and their relevant answers
• Ground truth: the set of MeSH index terms assigned to documents by experts
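For the indexing evaluation, precision and recall can be sketched as set overlap between the extracted terms and the expert-assigned MeSH terms. The function name and the example term sets are assumptions; a real evaluation would also need to normalize case and MeSH subheadings before comparing.

```python
def precision_recall(extracted, ground_truth):
    """Precision and recall of extracted terms against ground-truth index terms."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    hits = len(extracted & ground_truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy example (illustrative term sets, already normalized to lower case).
extracted = {"pain", "knee", "data collection", "science"}
truth = {"knee", "pain", "humans", "aged", "data collection"}
precision, recall = precision_recall(extracted, truth)
```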
Categorization by User Profile
• How good is the method at retrieving answers for consumers and experts?
• We run retrievals for consumers & experts
  – 15 out of the 64 queries contain no expert terms and are suitable for consumers
  – The remaining queries are suitable for experts
  – Documents are represented by document vectors of MeSH, MMTx, or AMTEx terms
  – The retrieval method is the Vector Space Model (VSM)
  – The document similarity score of VSM is multiplied by its respective VET or VCT score
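The re-weighting step can be sketched as a cosine similarity multiplied by the document's profile score. This sketch assumes raw term-frequency vectors for simplicity (a real VSM setup would typically use tf-idf weights), and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(query_terms, doc_terms):
    """Cosine similarity between raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def profile_weighted_score(query_terms, doc_terms, profile_score):
    """VSM similarity scaled by the document's VET (or VCT) score."""
    return cosine(query_terms, doc_terms) * profile_score

# Expert retrieval: a document with VET% = 0.62 keeps 62% of its VSM score.
query = ["knee", "pain"]
doc = ["knee", "pain", "knee", "pain"]
score = profile_weighted_score(query, doc, 0.62)
```

Multiplying rather than filtering keeps every document in the ranking and only demotes those that fit the user profile poorly.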
Consumers Retrieval Task
Experts Retrieval Task
Results
• Consumers retrieval task:
  – Retrievals with the manually assigned MeSH terms perform better
  – MMTx, AMTEx perform equally well
• Experts retrieval task:
  – Retrievals with AMTEx perform better
• The results indicate
  – A tendency of human experts to assign simple terms to documents
  – A selective ability of AMTEx in extracting complex terms suitable for experts
Conclusions & Future Work
• We investigate methods for:
  – Automatic document indexing
  – Categorization by user profile
• AMTEx is very well suited for both problems
• Future work: more elaborate document categorization methods (machine learning, fuzzy)
• More categories
  – According to UMLS SN (pathology, treatment)
  – User categories (e.g., specialty)
Questions and answers
AMTEx Outline
• INPUT: Document Collection
• C/NC value: Multi-word Term Extraction & Term Ranking
• MeSH Term Validation
• Single-word Term Extraction: non-MeSH multi-word terms are broken down & validated against MeSH
• Variant Generation
• Term Expansion (MeSH)
• OUTPUT: MeSH Term Lists
• Resource: MeSH Thesaurus
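The validation and single-word steps of the outline can be sketched as follows. This assumes that a candidate is kept if it is a MeSH term, and that a non-MeSH multi-word candidate is split into single words which are each checked against MeSH. Variant generation and term expansion are left out, and the function name and the toy MeSH set are assumptions.

```python
def validate_against_mesh(candidates, mesh):
    """Keep MeSH terms; break non-MeSH multi-word terms into MeSH single words."""
    validated = []
    for term in candidates:
        if term in mesh:
            validated.append(term)
        else:
            # Non-MeSH candidate: fall back to its individual words.
            validated.extend(w for w in term.split() if w in mesh)
    # Remove duplicates while preserving the ranking order.
    return list(dict.fromkeys(validated))

# Toy MeSH subset and ranked candidate list (illustrative only).
mesh = {"osteoarthritis knee", "knee", "pain", "questionnaires"}
candidates = ["osteoarthritis knee", "chronic knee pain", "postal questionnaires"]
index_terms = validate_against_mesh(candidates, mesh)
```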
AMTEx vs MMTx
• AMTEx: faster than MMTx, with improved precision and merely a fifth of MMTx's term output

Data Set   Method   Number of Terms   Precision   Recall   Time (hours)
OHSUMED    AMTEx    8                 0.125       0.101    7.383
OHSUMED    MMTx     40                0.089       0.336    14.516
PMC        AMTEx    25                0.034       0.062    1.387
PMC        MMTx     72                0.033       0.162    2.727
MeSH: Medical Subject Headings
The NLM medical & biological terms thesaurus:
• Organized in IS-A hierarchies
  – more than 15 taxonomies & more than 22,000 terms
  – a term may appear in multiple taxonomies
• No PART-OF relationships
• Terms organized into synonym sets called entry terms, including stemmed term forms
Fragment of the MeSH IS-A Hierarchy
[Figure: IS-A tree containing the nodes Root, Nervous system diseases, Cranial nerve diseases, Neurologic manifestations, pain, headache, neuralgia, Facial neuralgia]
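An IS-A hierarchy in which a term may sit under more than one parent can be sketched as a mapping from each term to its parents. The node names come from the figure, but the parent links below are an assumption made for illustration, since the extracted slide text preserves only the node labels and not the edges.

```python
# Assumed parent links between the figure's nodes (illustrative only).
parents = {
    "nervous system diseases": ["root"],
    "neurologic manifestations": ["nervous system diseases"],
    "cranial nerve diseases": ["nervous system diseases"],
    "pain": ["neurologic manifestations"],
    "headache": ["pain"],
    "neuralgia": ["pain"],
    "facial neuralgia": ["neuralgia", "cranial nerve diseases"],
}

def ancestors(term, parents):
    """All terms reachable from `term` by following IS-A links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

facial_neuralgia_ancestors = ancestors("facial neuralgia", parents)
```

Storing a list of parents per term, rather than a single parent, is what lets one term appear in several taxonomies at once.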