Medical WordNet: A Proposal
Christiane Fellbaum
Princeton University
and
Berlin-Brandenburg Academy of Sciences
Health Care Providers (HCP or “Experts”)
--Physicians
--Nurses
--Therapists
--on-line medical information systems
Modes of communication
• Live Interaction with Patients
• Virtual Interaction
--On-line medical information
Characteristics of HCP language
• Ignorance/uncertainty as to non-experts’ lexical and conceptual knowledge
• Same word is used with different meanings by the two populations (word-concept mismatch)
• HCP use technical terms
• HCP substitute synonyms from different levels
Characteristics of non-expert language
Idiosyncratic, “unregulated”
--mix of technical and folk terms
--taxonomies are less elaborate, shallower
(fewer intermediate levels of categorial distinctions)
--lay concepts are fuzzy (e.g., flu)
--lay concepts have no (clear) equivalents in medicine
(“Kreislaufprobleme”: “circulatory problems”)
Expert vs. non-expert language in dialogue interaction
• HCP introduce new concepts for which the lay person is unprepared
--go from symptoms to diagnosis, treatments, etc.
• Lay questions are frequently “yes/no”
• Expert replies are usually not “yes/no”
• Often no opportunities for “repair”
Additional problem with on-line information systems
Trivial linguistic features can have potentially significant consequences
Example: MEDLINEplus
• different results depending on query:
• tremor vs. intentional tremor
• tremble vs. trembling
Linguistic (morphological) differences in the query result in semantically different answers
(our) solution
• Make the HCP “bilingual”
• Enable “translation” between consumer health information systems and laymen
Some ground rules for the next 45 mins
Nothing hinges on “concept”
Propose synset:
{concept, universal, idea, type...}
“Truth” applies only to propositions, not entities
WordNet has “unicorn”, “Mickey Mouse”, etc.
A Linguist’s view
• Concepts/universals are expressed by lexemes (words)
• Words are embedded in contexts and partially derive their meanings from contexts
• Truth of propositions depends partially on their lexical make-up
Goals
• Document medical knowledge that can be understood by the average adult health care consumer in the U.S.
• Make existing tools accessible for non-experts
Plan of Attack
• Create lexical database of medical terminology modeled on WordNet, with WN’s potential for NLP
• Lexical (word) information is complemented with definitional sentences, one for experts, one for laymen
• Sentences provide meaningful contexts for terms
• 2 sentential subcorpora: Facts and Beliefs
Some background: WordNet
• Large lexical database for English
• Semantic network? yes
• Thesaurus? yes BUT unlike in Roget’s, WN’s relations are labeled
• Ontology? who knows?
WordNet
• Constructed entirely by hand
• Semantic network of 115,000 synonym sets (“synsets”)
• Example synsets:
{chest, thorax, torso,# body_part,@ (the part of the body below the neck and above the belly; “the victim had a knife stuck in his chest”)}
WordNet synsets
• One or more “cognitively synonymous” lexemes
• Definition (“gloss”)
• Example sentence
• Meronymy, hyponymy relate noun synsets
• Result: semantic network
WordNet synsets
• Where did the makers of WN get their synonyms, meronyms, etc. from?
• Mid-1980s: no corpora were available
• Association norms
• Some psycholinguistic testing (sorting experiments)
• Assumption: speakers’ use of words reflects conceptual organization
WordNet
WordNet’s value for computational linguistics, Natural Language Processing
• Synonyms, related synsets allow searches for semantically related nodes
--E.g., query expansion
• Information retrieval, Q-A systems, data mining,...
• Inferencing
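As a rough illustration of query expansion over synsets, the sketch below looks a search term up in a small hand-made synset list and returns every word that shares a synset with it. The synset contents and the function name `expand_query` are assumptions made for this example; they are not WordNet's actual data or API.

```python
# Toy synsets; the entries are illustrative, not WordNet's actual data.
SYNSETS = [
    {"chest", "thorax"},
    {"tremor", "trembling", "shaking"},
]


def expand_query(term):
    """Return the term plus every word that shares a synset with it."""
    expanded = {term}
    for synset in SYNSETS:
        if term in synset:
            expanded |= synset
    return sorted(expanded)


print(expand_query("thorax"))  # ['chest', 'thorax']
print(expand_query("fever"))   # ['fever'] (no synset found, term kept as-is)
```

A retrieval system could then search for all expanded terms, so that a query for one member of a synset would also reach documents indexed under the others.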
Two problems: Synonymy and Polysemy
WordNet maps lexemes (words) and concepts (meanings)
Words are labels for concepts that speakers find salient
--Identification of the same concept labelled with different words (synonymy); e.g., chest, thorax
--Disambiguation of polysemous words; e.g., weak patient vs. weak solution
Synonymy and Polysemy
• Synonymy: membership in the same synset
• Polysemy: number of synsets of which a given string is a member
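The two definitions above translate almost directly into set operations. The following is a minimal sketch under stated assumptions: the synset identifiers, the extra members ("box", "feeble", "dilute"), and the function names are hypothetical and chosen only to mirror the chest/thorax and weak patient/weak solution examples.

```python
# Hypothetical mini-lexicon: synset id -> member words (not real WordNet data).
SYNSETS = {
    "chest.1": {"chest", "thorax"},
    "chest.2": {"chest", "box"},    # invented second sense for illustration
    "weak.1": {"weak", "feeble"},   # as in "weak patient"
    "weak.2": {"weak", "dilute"},   # as in "weak solution"
}


def are_synonyms(w1, w2):
    """Synonymy: both words are members of at least one common synset."""
    return any(w1 in members and w2 in members for members in SYNSETS.values())


def polysemy(word):
    """Polysemy: the number of synsets the string is a member of."""
    return sum(word in members for members in SYNSETS.values())


print(are_synonyms("chest", "thorax"))  # True
print(are_synonyms("thorax", "box"))    # False
print(polysemy("weak"))                 # 2
```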
WordNet
In addition, related words and concepts can be found via the relations among entire synsets
Hyponymy/hyperonymy (super-/subordination)
--HIV is a kind of virus
--One kind of virus is HIV
Meronymy/holonymy (part-whole)
--occipital bone is part of cranium
--cranium has an occipital bone
WordNet
Different kinds of hyponymy
Types vs. Instances
Kingdom is a type of country
Monaco is an instance of a kingdom
WordNet for medical/bioinformatics?
• Synonymy, polysemy are problems here, too
• Is WN’s way of mapping words and meanings useful?
WordNet for medical/bioinformatics?
• WN was compiled by non-experts
• Medical coverage is sparse and arbitrary
WordNet’s medical coverage
• contains both expert and folk terms (indistinguishable)
• contains archaic terms like unction
• no type vs. role (symptom) distinction
• e.g., tumors are abnormal but not: some tumors are malignant
• No links among entities, properties, processes, states
• domain labels (medicine, drugs,..) are assigned incompletely and inconsistently (no good domain ontology)
• Create lexical database of medical terminology modelled on WordNet (MedWN)
• Info in MedWN can be accessed automatically
• Retain WN’s features to make it usable for NLP
Steps to take
• Review, validate, augment WN’s present medical coverage
• Ensure sufficiently high scientific level so that MedWN can work in tandem with existing terminology banks, ontologies,...
Create subcorpora of sentences
• MedicalFactNet
--sentences rated as correct by medical experts
--sentences express “true” beliefs about medical phenomena
--intelligible to non-experts
Subcorpora of sentences
• MedicalBeliefNet
--sentences rated highly for assent by lay persons
--representative fraction of true and false beliefs about medical phenomena
Constraints on subcorpora
• Complete, grammatical English sentences
• No anaphora (it, then, this): context-free generic sentences
• Statements embed terms in typical, informative contexts
Sources for subcorpora
--sentences generated via WordNet’s relations
--WordNet’s definitions of medical terms
--sentences from online medical services
Sentences from on-line information sources
--fact sheets
--NIAID Health Information Publications
--UK NetDoctor’s Diseases Encyclopedia
Example
NetDoctor text:
Hay fever, otherwise known as seasonal allergic rhinitis, is an allergic reaction to airborne substances such as pollen....
Created sentences:
Hay fever is an allergy.
Hay fever is an allergic reaction.
Hay fever is a reaction to pollen...
Second source of sentences
• Derive propositions from WordNet:
Express labeled arcs as propositions, e.g.
--hyponymy: if x is a hyponym of y, “x is a type of y”
--meronymy: if x is a meronym of y, “x is a part of y”
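One way to picture this derivation step is a template per relation label, filled in from each arc. The sketch below is an assumption about how such a generator might look, not the project's actual procedure; the names `TEMPLATES`, `ARCS`, and `arc_to_sentence` are hypothetical, and the two arcs reuse the HIV/virus and occipital bone/cranium examples from the earlier slides.

```python
# One sentence template per relation label (wording taken from the slide).
TEMPLATES = {
    "hyponym": "{x} is a type of {y}.",
    "meronym": "{x} is a part of {y}.",
}

# Labeled arcs as (source, relation, target) triples.
ARCS = [
    ("HIV", "hyponym", "virus"),
    ("occipital bone", "meronym", "cranium"),
]


def arc_to_sentence(x, relation, y):
    """Fill the template for the arc's relation and capitalize the first letter."""
    sentence = TEMPLATES[relation].format(x=x, y=y)
    return sentence[0].upper() + sentence[1:]


for arc in ARCS:
    print(arc_to_sentence(*arc))
# HIV is a type of virus.
# Occipital bone is a part of cranium.
```

A template this naive ignores articles and number agreement, so its output would still need to be checked by human judges before entering either subcorpus.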
Validation
• Derived sentences are judged by humans
• Likert Scale 1-5
• Participants assign a score for U (understanding) to all sentences
• Sentences judged to be understandable are scored further for
--B (belief) by lay persons
--C (correctness) by experts
Validation
• Statements receiving a B-score of 4 or higher => MedicalBeliefNet
• Statements receiving a C-score of 4 or higher => MedicalFactNet
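The routing rule can be sketched as a small filter. Several details here are assumptions, since the slides do not specify them: that each score is a single number per sentence (for example, a mean over raters), and that "understandable" means a U-score of at least 4. The class `Rated`, the function `route`, and the example scores are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 4  # from the slides: a B- or C-score of 4 or higher


@dataclass
class Rated:
    text: str
    u: float                   # understanding score
    b: Optional[float] = None  # belief score, given by lay persons
    c: Optional[float] = None  # correctness score, given by experts


def route(s, u_cutoff=THRESHOLD):
    """Return the subcorpora a sentence qualifies for.

    Assumption: the U cutoff is not stated in the slides; 4 is a guess.
    """
    nets = set()
    if s.u < u_cutoff:
        return nets  # not understandable: not scored further
    if s.b is not None and s.b >= THRESHOLD:
        nets.add("MedicalBeliefNet")
    if s.c is not None and s.c >= THRESHOLD:
        nets.add("MedicalFactNet")
    return nets


# Made-up scores, for illustration only.
print(route(Rated("Hay fever is an allergy.", u=4.8, b=4.5, c=4.6)))
```

Under this reading a sentence can land in both subcorpora, in one, or in neither, which fits the description of MedicalBeliefNet as holding both true and false beliefs.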
Side effects (beneficial) of corpus
• Basis for new NLP applications in the medical domain
• Basis for exploring individual and group differences with respect to medical knowledge, vocabulary, reasoning, decision-making
• Use in medical training
Future work
• Scale up coverage
• Add relations among events (states, activities) as expressed by verbs
• Current work: explore “function/purpose” relation among verbs (analogous to roles among entities expressed by nouns)
--e.g., to run is to exercise (defeasible)
--to run is to move (not defeasible)