(Word) Sense and Similarity
6 April 2021
cmpu 366 · Computational Linguistics
Word senses and relations
A word sense is a distinct meaning of a word.
How do we know if a word (lemma) has more than one sense?
Linguists often design tests for this purpose.
E.g., zeugma combines distinct senses in an uncomfortable way:
Which flights serve breakfast?
Which flights serve New York?
→ ? Which flights serve breakfast and New York?
There’s also the cross-lingual test: Is the word translated differently in other languages?
E.g., for English → French:
pen → stylo and
pen → enclos
But beware:
interest → intérêt in both senses
river → fleuve or rivière (one sense)
Lexemes that share a form but have distinct, unrelated meanings are homonyms.
The form can be: Phonological (homophones)
Write and right
Piece and peace
Orthographic (homographs)
Bat (wooden stick vs flying mammal)
Bank (financial institution vs riverside)
Homonymy causes problems for NLP!
Text-to-speech
Same orthographic form but different phonological form (pronunciation), e.g., bass (instrument) vs bass (fish).
Information retrieval
Same orthographic form but different meaning
Query: bat care – need a vet or a woodworker?
Machine translation
pen (English) → stylo (French)
pen (English) → enclos (French)
Speech recognition
Turn up the bass or turn up the base?
If two senses are distinct but related, that’s polysemy, e.g., financial bank vs blood bank.
We say bank is polysemous.
A word frequently takes on further related meanings through systematic polysemy or metaphor.
The wing of a bird
→ the wing of a plane
→ the wing of a building
→ the left wing on a soccer (football) team
Often preserved across languages
Same word, different senses: Homonyms have unrelated senses
Polysemes have related but distinct senses
Different word, same sense: Synonyms
Synonyms are words that have the same meaning in some or all contexts, e.g.,
filbert / hazelnut
youth / adolescent
big / large
automobile / car
water / H2O
There are few if any examples of perfect synonymy.
WordNet
WordNet is a lexical database for English. It includes most English nouns, verbs, adjectives, and adverbs.
The electronic format makes it amenable to automatic manipulation, so it’s used in many NLP applications.
“WordNets” generically refer to similar resources in other languages.
E.g., EuroWordNet or BalkaNet.
WordNet is organized in terms of synsets – sets of (roughly) synonymous words or phrases.
Each synset corresponds to a distinct sense or a concept.
wordnetweb.princeton.edu/perl/webwn
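As a concrete illustration (my choice of tooling; the lecture doesn't prescribe one), NLTK's WordNet interface lets you list the synsets, and hence the senses, recorded for a word:

```python
# A minimal sketch (not from the lecture) using NLTK's WordNet interface:
# each synset groups roughly synonymous lemmas and corresponds to one sense.
# Requires the "wordnet" NLTK data package.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bass"):
    print(synset.name(), synset.lemma_names(), "-", synset.definition())
```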
Different inventories can be used to define senses.
Different inventories don't always agree on sense distinctions,
e.g., translation makes some distinctions but not others.
WordNet hierarchies
Hyponymy and hypernymy
A more specific sense is a hyponym of a more general sense, e.g.,
car is a hyponym of vehicle
mango is a hyponym of fruit
The converse is being a hypernym: vehicle is a hypernym of car
fruit is a hypernym of mango
Meronymy is the relation of a part to its whole: leg is a meronym of chair
wheel is a meronym of car
Holonymy is the relation of a whole to a part: car is a holonym of wheel
chair is a holonym of leg
Antonyms are senses that are opposites with respect to one feature of meaning, e.g.,
dark / light
short / long
fast / slow
rise / fall
hot / cold
up / down
in / out
WordNet noun relations
Relation Definition Example
Hypernym From concept to superordinate breakfast1 ⇒ meal1
Hyponym From concept to subtype meal1 ⇒ lunch1
Has-member From group to member faculty2 ⇒ professor1
Member-of From member to group copilot1 ⇒ crew1
Has-part From whole to part table2 ⇒ leg3
Part-of From part to whole course7 ⇒ meal1
Substance meronym From substances to their subparts water1 ⇒ oxygen1
Substance holonym From parts of substances to wholes gin1 ⇒ martini1
Antonym Semantic opposition between lemmas leader1 ⇔ follower1
Derivationally related Lemmas with the same morphological root destruction1 ⇔ destroy1
WordNet verb relations
Relation Definition Example
Hypernym From events to superordinate events fly9 ⇒ travel5
Troponym From events to subordinate event (often via specific manner) walk1 ⇒ stroll1
Entails From verbs (events) to the verbs (events) they entail snore1 ⇒ sleep1
Antonym Semantic opposition between lemmas increase1 ⇔ decrease1
Derivationally related Lemmas with the same morphological root destruction1 ⇔ destroy1
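To make these relations concrete, here is a small sketch (again my own, using NLTK; the specific synset name is an assumption) that walks a few relations from one synset:

```python
# A minimal sketch (not from the lecture) walking WordNet relations in NLTK.
# The synset name "meal.n.01" is an assumption for illustration.
from nltk.corpus import wordnet as wn

meal = wn.synset("meal.n.01")
print("Hypernyms:", meal.hypernyms())          # more general concepts (superordinates)
print("Hyponyms:", meal.hyponyms())            # more specific concepts (subtypes)
print("Part meronyms:", meal.part_meronyms())  # parts of a meal, e.g., a course
```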
WordNet as a graph
Word sense disambiguation
Given
a word in context and
a fixed inventory of potential word senses,
decide which sense of the word is used in this context.
WSD systems need annotated data for evaluation and, typically, for training.
What can we do when humans who annotate senses disagree?
Disagreement is inevitable when annotating based on human judgments, even with trained annotators.
We cannot measure the “correctness” of annotations directly.
Instead, we can measure the reliability of annotation.
Do human annotators make the same decisions consistently?
Assumption: high reliability implies validity.
To quantify the (dis)agreement between human annotators, we can use Cohen’s kappa.
This measures the agreement between two annotators while taking into account the possibility of chance agreement.
κ = (P(a) − P(e)) / (1 − P(e))
where P(a) is the probability of actual agreement and P(e) is the probability of expected (chance) agreement.
Scales for interpreting kappa have been proposed (Landis & Koch, 1977; Green, 1997).
Consider this confusion matrix for sense annotations by humans A and B of the same 250 examples:

            Sense 1   Sense 2   Sense 3   Total
Sense 1        54        28         3       85
Sense 2        31        18        23       72
Sense 3         0        21        72       93
Total          85        67        98      250

Here P(a) = (54 + 18 + 72) / 250 = 0.576, P(e) = 0.339, and κ = (0.576 − 0.339) / (1 − 0.339) ≈ 0.36 – agreement is low.
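A minimal Python sketch (mine, not from the lecture) reproducing this calculation from the confusion matrix above:

```python
# A minimal sketch (not from the lecture) computing Cohen's kappa
# from the 3x3 confusion matrix above.

matrix = [
    [54, 28, 3],
    [31, 18, 23],
    [0, 21, 72],
]

n = sum(sum(row) for row in matrix)                       # 250 examples
p_a = sum(matrix[i][i] for i in range(3)) / n             # observed agreement
row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(3)) for j in range(3)]
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance agreement

kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 3), round(p_e, 3), round(kappa, 2))      # 0.576 0.339 0.36
```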
Most words in English have only one sense. 62% in Longman’s Dictionary of Contemporary English (LDOCE)
79% in WordNet
But the others tend to have several senses: Average of 3.83 in LDOCE
Average of 2.96 in WordNet
And frequently used words are more likely to be ambiguous.
In the British National Corpus, 84% of word instances have more than one sense.
The upper bound on performance for WSD is human performance:
For fine-grained, WordNet-style senses, expect 75–80% agreement.
For coarser-grained inventories, 90% human agreement is possible.
A baseline for word sense disambiguation is to always select the most frequent sense.
For WordNet, this means taking sense 1.
This does surprisingly well!
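A sketch of this baseline (my own; it relies on the fact that WordNet lists a word's senses roughly in order of frequency, so sense 1 comes first):

```python
# A minimal sketch (not from the lecture) of the most-frequent-sense baseline:
# WordNet orders a word's senses roughly by frequency, so pick the first synset.
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=wn.NOUN):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("bass"))
```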
There are two variants of the WSD task: Lexical sample task:
Small, pre-selected set of target words (e.g., line and plant)
Inventory of senses for each word
Use supervised machine learning: Train a classifier for each word
All-words task:
Every word in an entire text
A lexicon with senses for each word
Data sparseness: Can’t train word-specific classifiers
Dictionary methods
Lesk (1986) looked at using dictionaries, treating each definition as a distinct sense and looking at overlap between the words in the definition and the words in the given context.
ash 1. a tree of the olive family
2. the solid residue left when combustible material is burned
Uses to disambiguate:
The cigar burns slowly and creates a stiff ash. → Sense 1: 0, Sense 2: 1
The ash is one of the last trees to come into leaf. → Sense 1: 1, Sense 2: 0
Insufficient information in definitions. Accuracy: 50–70%
Lesk’s algorithm
Simplest implementation: Count overlapping content words between glosses and context.
Lots of variants: Include the examples with the dictionary definitions
Include hypernyms and hyponyms
Give more weight to larger overlaps
Give extra weight to infrequent words (e.g., using IDF)
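Returning to the simplest implementation above, here is a minimal sketch (my own code, not Lesk's) that counts overlapping content words between each gloss and the context, with crude stemming so that burns matches burned:

```python
# A minimal sketch of simplified Lesk (not Lesk's original code): pick the
# sense whose gloss shares the most content words with the context.
import re

STOPWORDS = {"a", "an", "the", "of", "is", "to", "and", "in", "when", "into", "one"}

SENSES = {
    "ash (tree)": "a tree of the olive family",
    "ash (residue)": "the solid residue left when combustible material is burned",
}

def content_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    stemmed = set()
    for w in words:
        if w in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "s"):   # crude stemming so burns ~ burned
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        stemmed.add(w)
    return stemmed

def simplified_lesk(context):
    ctx = content_words(context)
    # Choose the sense whose gloss has the largest overlap with the context.
    return max(SENSES, key=lambda s: len(ctx & content_words(SENSES[s])))

print(simplified_lesk("The cigar burns slowly and creates a stiff ash."))       # ash (residue)
print(simplified_lesk("The ash is one of the last trees to come into leaf."))   # ash (tree)
```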
Supervised machine-learning approaches
Use a corpus of words tagged in context with their sense to train a classifier that can tag words in new text.
We need: the tag set (“sense inventory”)
the training corpus
a set of features extracted from the training corpus
a classifier
“If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words … But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word … The practical question is: ‘What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?’ ” Warren Weaver, Translation, 1949
A simple representation for each observation (that is, each instance of a target word) is as vectors of sets of feature–value pairs representing a window of words around the target.
Two kinds of features in the vectors
Collocational Features about words at specific positions near target word
Often limited to just word identity and POS
Bag-of-words Features about words that occur anywhere in the window (regardless of position)
Typically limited to frequency counts
Example
Two-word window:
An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
Collocational
Position-specific information about the words in the window
guitar and bass player stand
[guitar, NN, and, CC, player, NN, stand, VB]
wordn−2, POSn−2, wordn−1, POSn−1, wordn+1, POSn+1…
In other words, a vector consisting of
[position n word, position n part-of-speech…]
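A small sketch (mine, not the lecture's) of extracting that collocational vector from a POS-tagged sentence, given the index of the target word:

```python
# A minimal sketch (not from the lecture) of collocational feature extraction:
# the words and POS tags at fixed positions around the target word.

def collocational_features(tagged_words, target_index, window=2):
    features = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        i = target_index + offset
        if 0 <= i < len(tagged_words):
            word, pos = tagged_words[i]
            features.extend([word, pos])
    return features

tagged = [("guitar", "NN"), ("and", "CC"), ("bass", "NN"), ("player", "NN"), ("stand", "VB")]
print(collocational_features(tagged, target_index=2))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```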
Bag-of-words
Information about the words that occur within the window
First derive a set of terms to place in the vector
Then note how often each of those terms occurs in a given window
Co-occurrence example
Assume we’ve settled on a possible vocabulary of 12 words for bass sentences:
[fish, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
The vector for guitar and bass player stand
is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
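And a matching sketch (also mine) for the bag-of-words vector over the assumed 12-word vocabulary:

```python
# A minimal sketch (not from the lecture) of the bag-of-words feature vector:
# counts of each vocabulary word within the context window.

VOCAB = ["fish", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def bag_of_words_vector(window_words, vocab=VOCAB):
    window = [w.lower() for w in window_words]
    return [window.count(v) for v in vocab]

print(bag_of_words_vector(["guitar", "and", "bass", "player", "stand"]))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```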
Classifiers
Once we cast WSD as a classification problem, all sorts of techniques are possible, including
naïve Bayes
decision trees
neural networks
support vector machines
nearest-neighbor methods
Applying naïve Bayes to WSD
We choose ŝ = argmax_s P(s) ∏_{w ∈ context} P(w | s).
P(s) is the prior probability of that sense, estimated by counting in a labeled training set.
P(w | s) is the conditional probability of a word given a particular sense
P(w | s) = count(w, s) / count(s)
We can get both of these from a tagged corpus like SemCor.
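A toy sketch (my own, with made-up counts rather than real SemCor statistics) of the resulting classifier, with add-alpha smoothing so unseen words don't zero out a sense:

```python
# A minimal sketch (not from the lecture) of naive Bayes WSD with toy counts;
# a real system would estimate these from a sense-tagged corpus such as SemCor.
import math
from collections import Counter

# Hypothetical training statistics for two senses of "bass".
sense_counts = Counter({"bass(music)": 60, "bass(fish)": 40})
word_given_sense = {
    "bass(music)": Counter({"guitar": 20, "player": 15, "band": 10}),
    "bass(fish)": Counter({"fish": 25, "river": 10, "pound": 5}),
}

def classify(context_words, alpha=1.0, vocab_size=1000):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)                  # log P(s)
        for w in context_words:
            cw = word_given_sense[sense][w]
            # smoothed P(w | s) = (count(w, s) + alpha) / (count(s) + alpha * |V|)
            score += math.log((cw + alpha) / (count + alpha * vocab_size))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(classify(["guitar", "player"]))   # bass(music)
```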
Problems
Given general ML approaches, how many classifiers do I need to perform WSD robustly?
One for each ambiguous word in the language
How do you decide what set of tags/labels/senses to use for a given word?
Depends on the application
The WordNet senses of bass
Tagging with this set of senses is an impossibly hard task that is probably overkill for any realistic application
Word similarity
Abbé Gabriel Girard, 1718
[I do not believe that there is a synonymous word in any language]
Synonymy is a binary relation – two words are either synonymous or not.
Word similarity (or distance) is a looser metric: Two words are more similar if they share more features of meaning.
Actually these are really relations between senses: Instead of saying “bank is like fund”
We say
bank1 is similar to fund3
bank2 is similar to slope5
But we’ll compute similarity over both words and senses.
We often distinguish word similarity from word relatedness
Similar words: near-synonyms
Related words: can be related any way
E.g.,
car and bicycle are similar
car and gasoline are related but not similar
Two classes of similarity algorithms
Thesaurus-based algorithms: Are words “nearby” in WordNet (or Roget's, e.g.)?
Do words have similar glosses (definitions)?
Distributional algorithms: Do words have similar distributional contexts?
Thesaurus-based word similarity
We could use any relation in the thesaurus:
Meronymy
Glosses
Example sentences
In practice, by “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy.
Path-based similarity
Two words are similar if nearby in thesaurus hierarchy, i.e., short path between them.
(Concepts have path 1 to themselves)
Refinements to path-based similarity
pathlen(c1, c2) = 1 + the number of edges in the shortest path in the hypernym graph between the sense nodes c1 and c2
simpath(c1, c2) = 1 / pathlen(c1, c2), which ranges from 0 to 1
wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)
simpath(c1, c2) = 1 / pathlen(c1, c2)
E.g., simpath(nickel, coin) = 1/2 = 0.5
simpath(fund, budget) = 1/2 = 0.5
simpath(nickel, currency) = 1/4 = 0.25
simpath(nickel, money) = 1/6 = 0.17
simpath(coinage, Richter scale) = 1/6 = 0.17
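A short sketch (my choice of tooling; the lecture doesn't prescribe one) using NLTK, whose path_similarity is essentially this 1/pathlen measure; the numbers may differ slightly from the slide depending on the WordNet version:

```python
# A minimal sketch (not from the lecture) of path-based word similarity using
# NLTK's WordNet; path_similarity is essentially 1 / pathlen between synsets.
from nltk.corpus import wordnet as wn

def word_path_similarity(w1, w2, pos=wn.NOUN):
    # wordsim(w1, w2) = max over sense pairs of simpath(c1, c2)
    sims = [s1.path_similarity(s2)
            for s1 in wn.synsets(w1, pos=pos)
            for s2 in wn.synsets(w2, pos=pos)]
    sims = [s for s in sims if s is not None]
    return max(sims) if sims else None

print(word_path_similarity("nickel", "coin"))
print(word_path_similarity("nickel", "money"))
```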
Example: path-based similarity
Problem with basic path-based similarity
It assumes each link represents a uniform distance.
But nickel to money seems to us to be closer than nickel to standard.
Nodes high in the hierarchy are very abstract.
Instead, we want a metric that
represents the cost of each edge independently and
counts words connected only through abstract nodes as less similar.
Information content similarity metrics
Let's define P(c) as the probability that a randomly selected word in a corpus is an instance of concept c.
Formally, there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
For a given concept, each observed noun is either
a member of that concept with probability P(c)
not a member of that concept with probability 1 − P(c)
All words are members of the root node (entity) P(root) = 1
The lower a node in the hierarchy, the lower its probability.
Resnik, 1995
Information content similarity
Train by counting in a corpus:
Each instance of hill counts toward the frequency of natural elevation, geological formation, entity, etc.
Let words(c) be the set of all words that are children of node c
words(geological formation) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
words(natural elevation) = {hill, ridge}
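Concretely (this estimation formula comes from Resnik's papers rather than the slide above), the probability is estimated from corpus counts:
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N
where N is the total number of corpus tokens covered by the hierarchy.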
Information content similarity
WordNet hierarchy augmented with probabilities P(c)
Information content and probability
The self-information of an event is also called its surprisal:
how surprised we are to know it; how much we learn by knowing it
The more surprising something is, the more it tells us when it happens
Measure self-information in bits I(w) = −log2 P(w)
I flip a coin; P(heads) = 0.5
How many bits of information do I learn by flipping it? I(heads) = −log2(0.5) = −log2(1/2) = log2(2) = 1 bit
I flip a biased coin: P(heads) = 0.8. I don't learn as much: I(heads) = −log2(0.8) ≈ 0.32 bits
Information content: definitions
Information content IC(c) = −log P(c)
Lowest common subsumer (or most informative subsumer):
LCS(c1, c2) = the lowest common subsumer
i.e., the most informative (lowest) node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
How do we use information content IC as a similarity metric?
Using information content for similarity
The similarity between two words is related to their common information
The more two words have in common, the more similar they are
Resnik: Measure the common information as the information content of the lowest common subsumer of the two nodes
simresnik(c1, c2) = −log P(LCS(c1, c2))
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
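As an illustration (my own, not from the lecture), NLTK ships precomputed information-content files and a Resnik similarity; the specific synset names below are assumptions:

```python
# A minimal sketch (not from the lecture) of Resnik similarity with NLTK:
# sim_resnik(c1, c2) = IC(LCS(c1, c2)), with IC estimated from the Brown corpus.
# Requires the "wordnet" and "wordnet_ic" NLTK data packages.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")

nickel = wn.synset("nickel.n.02")   # assumed to be the coin sense
coin = wn.synset("coin.n.01")

print(nickel.res_similarity(coin, brown_ic))
```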
Problems with thesaurus-based methods
We don’t have a thesaurus for every language
Even if we do, many words are missing
They rely on hyponym info: Strong for nouns, but lacking for adjectives and even verbs
Alternative: distributional methods for word similarity
Vector semantics!
Acknowledgments
The lecture incorporates material from: Nancy Ide, Vassar College
Daniel Jurafsky, Stanford University
Jonathan May, University of Southern California