(Word) Sense and Similarity
6 April 2021
cmpu 366 · Computational Linguistics
Word senses and relations
A word sense is a distinct meaning of a word.
How do we know if a word (lemma) has more than one sense?
Linguists often design tests for this purpose.
E.g., zeugma combines distinct senses in an uncomfortable way:
Which flights serve breakfast?
Which flights serve New York?
→ ? Which flights serve breakfast and New York?
There’s also the cross-lingual test: Is the word translated differently in other languages?
E.g., for English → French:
pen → stylo and
pen → enclos
But beware:
interest → intérêt in both senses
river → fleuve or rivière (one sense)
Lexemes that share a form but have distinct, unrelated meanings are homonyms.
The form can be: Phonological (homophones)
Write and right
Piece and peace
Orthographic (homographs)
Bat (wooden stick vs flying mammal)
Bank (financial institution vs riverside)
Homonymy causes problems for NLP!
Text-to-speech
Same orthographic form but different phonological form (pronunciation), e.g., bass (instrument) vs bass (fish).
Information retrieval
Same orthographic form but different meaning
Query: bat care – need a vet or a woodworker?
Machine translation
pen (English) → stylo (French)
pen (English) → enclos (French)
Speech recognition
Turn up the bass or turn up the base?
If two senses are distinct but related, that’s polysemy, e.g., financial bank vs blood bank.
We say bank is polysemous.
A word frequently takes on further related meanings through systematic polysemy or metaphor.
The wing of a bird
→ the wing of a plane
→ the wing of a building
→ the left wing on a soccer (football) team
Often preserved across languages
Same word, different senses: Homonyms have unrelated senses
Polysemes have related but distinct senses
Different word, same sense: Synonyms
Synonyms are words that have the same meaning in some or all contexts, e.g.,
filbert / hazelnut
youth / adolescent
big / large
automobile / car
water / H2O
There are few if any examples of perfect synonymy.
WordNet
WordNet is a lexical database for English. It includes most English nouns, verbs, adjectives, and adverbs.
The electronic format makes it amenable to automatic manipulation, so it’s used in many NLP applications.
“WordNets” generically refer to similar resources in other languages.
E.g., EuroWordNet or BalkaNet.
WordNet is organized in terms of synsets – sets of (roughly) synonymous words or phrases.
Each synset corresponds to a distinct sense or a concept.
wordnetweb.princeton.edu/perl/webwn
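As a concrete illustration (my choice of tooling; the lecture doesn't prescribe one), NLTK's WordNet interface lets you list the synsets, and hence the senses, recorded for a word:

```python
# A minimal sketch (not from the lecture) using NLTK's WordNet interface:
# each synset groups roughly synonymous lemmas and corresponds to one sense.
# Requires the "wordnet" NLTK data package.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bass"):
    print(synset.name(), synset.lemma_names(), "-", synset.definition())
```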
Different inventories can be used to define senses.
Different inventories don't always agree on sense distinctions,
e.g., translation makes some distinctions but not others.
WordNet hierarchies
Hyponymy and hypernymy
A more specific sense is a hyponym of a more general sense, e.g.,
car is a hyponym of vehicle
mango is a hyponym of fruit
The converse is being a hypernym: vehicle is a hypernym of car
fruit is a hypernym of mango
Meronymy is the relation of a part to its whole: leg is a meronym of chair
wheel is a meronym of car
Holonymy is the relation of a whole to a part: car is a holonym of wheel
chair is a holonym of leg
Antonyms are senses that are opposites with respect to one feature of meaning, e.g.,
dark / light
short / long
fast / slow
rise / fall
hot / cold
up / down
in / out
WordNet noun relations
Relation Definition Example
Hypernym From concept to superordinate breakfast1 ⇒ meal1
Hyponym From concept to subtype meal1 ⇒ lunch1
Has-member From group to member faculty2 ⇒ professor1
Member-of From member to group copilot1 ⇒ crew1
Has-part From whole to part table2 ⇒ leg3
Part-of From part to whole course7 ⇒ meal1
Substance meronym From substances to their subparts water1 ⇒ oxygen1
Substance holonym From parts of substances to wholes gin1 ⇒ martini1
Antonym Semantic opposition between lemmas leader1 ⇔ follower1
Derivationally related Lemmas with the same morphological root destruction1 ⇔ destroy1
WordNet verb relations
Relation Definition Example
Hypernym From events to superordinate events fly9 ⇒ travel5
Troponym From events to subordinate event (often via specific manner) walk1 ⇒ stroll1
Entails From verbs (events) to the verbs (events) they entail snore1 ⇒ sleep1
Antonym Semantic opposition between lemmas increase1 ⇔ decrease1
Derivationally related Lemmas with the same morphological root destruction1 ⇔ destroy1
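To make these relations concrete, here is a small sketch (again my own, using NLTK; the specific synset name is an assumption) that walks a few relations from one synset:

```python
# A minimal sketch (not from the lecture) walking WordNet relations in NLTK.
# The synset name "meal.n.01" is an assumption for illustration.
from nltk.corpus import wordnet as wn

meal = wn.synset("meal.n.01")
print("Hypernyms:", meal.hypernyms())          # more general concepts (superordinates)
print("Hyponyms:", meal.hyponyms())            # more specific concepts (subtypes)
print("Part meronyms:", meal.part_meronyms())  # parts of a meal, e.g., a course
```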
WordNet as a graph
Word sense disambiguation
Given
a word in context and
a fixed inventory of potential word senses,
decide which sense of the word is used in this context.
WSD systems need annotated data for evaluation and, typically, for training.
What can we do when humans who annotate senses disagree?
Disagreement is inevitable when annotating based on human judgments, even with trained annotators.
We cannot measure the “correctness” of annotations directly.
Instead, we can measure the reliability of annotation.
Do human annotators make the same decisions consistently?
Assumption: high reliability implies validity.
To quantify the (dis)agreement between human annotators, we can use Cohen’s kappa.
This measures the agreement between two annotators while taking into account the possibility of chance agreement.
κ = (P(a) − P(e)) / (1 − P(e))
where P(a) is the probability of actual agreement and P(e) is the probability of expected (chance) agreement.
Scales for interpreting kappa have been proposed (Landis & Koch, 1977; Green, 1997).
Consider this confusion matrix for sense annotations by humans A and B of the same 250 examples:

            Sense 1   Sense 2   Sense 3   Total
Sense 1        54        28         3       85
Sense 2        31        18        23       72
Sense 3         0        21        72       93
Total          85        67        98      250

Here P(a) = (54 + 18 + 72) / 250 = 0.576, P(e) = 0.339, and κ = (0.576 − 0.339) / (1 − 0.339) ≈ 0.36 – agreement is low.
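A minimal Python sketch (mine, not from the lecture) reproducing this calculation from the confusion matrix above:

```python
# A minimal sketch (not from the lecture) computing Cohen's kappa
# from the 3x3 confusion matrix above.

matrix = [
    [54, 28, 3],
    [31, 18, 23],
    [0, 21, 72],
]

n = sum(sum(row) for row in matrix)                       # 250 examples
p_a = sum(matrix[i][i] for i in range(3)) / n             # observed agreement
row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(3)) for j in range(3)]
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance agreement

kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 3), round(p_e, 3), round(kappa, 2))      # 0.576 0.339 0.36
```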
Most words in English have only one sense. 62% in Longman’s Dictionary of Contemporary English (LDOCE)
79% in WordNet
But the others tend to have several senses: Average of 3.83 in LDOCE
Average of 2.96 in WordNet
And frequently used words are more likely to be ambiguous.
In the British National Corpus, 84% of word instances have more than one sense.
The upper bound on performance for WSD is human performance:
For fine-grained, WordNet-style senses, expect 75–80% agreement.
For coarser-grained inventories, 90% human agreement is possible.
A baseline for word sense disambiguation is to always select the most frequent sense.
For WordNet, this means taking sense 1.
This does surprisingly well!
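A sketch of this baseline (my own; it relies on the fact that WordNet lists a word's senses roughly in order of frequency, so sense 1 comes first):

```python
# A minimal sketch (not from the lecture) of the most-frequent-sense baseline:
# WordNet orders a word's senses roughly by frequency, so pick the first synset.
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=wn.NOUN):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("bass"))
```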
There are two variants of the WSD task: Lexical sample task:
Small, pre-selected set of target words (e.g., line and plant)
Inventory of senses for each word
Use supervised machine learning: Train a classifier for each word
All-words task:
Every word in an entire text
A lexicon with senses for each word
Data sparseness: Can’t train word-specific classifiers
Dictionary methods
Lesk (1986) looked at using dictionaries, treating each definition as a distinct sense and looking at overlap between the words in the definition and the words in the given context.
ash 1. a tree of the olive family
2. the solid residue left when combustible material is burned
Uses to disambiguate:
The cigar burns slowly and creates a stiff ash. → Sense 1: 0, Sense 2: 1
The ash is one of the last trees to come into leaf. → Sense 1: 1, Sense 2: 0
Insufficient information in definitions. Accuracy: 50–70%
Lesk’s algorithm
Simplest implementation: Count overlapping content words between glosses and context.
Lots of variants: Include the examples with the dictionary definitions
Include hypernyms and hyponyms
Give more weight to larger overlaps
Give extra weight to infrequent words (e.g., using IDF)
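Returning to the simplest implementation above, here is a minimal sketch (my own code, not Lesk's) that counts overlapping content words between each gloss and the context, with crude stemming so that burns matches burned:

```python
# A minimal sketch of simplified Lesk (not Lesk's original code): pick the
# sense whose gloss shares the most content words with the context.
import re

STOPWORDS = {"a", "an", "the", "of", "is", "to", "and", "in", "when", "into", "one"}

SENSES = {
    "ash (tree)": "a tree of the olive family",
    "ash (residue)": "the solid residue left when combustible material is burned",
}

def content_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    stemmed = set()
    for w in words:
        if w in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "s"):   # crude stemming so burns ~ burned
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        stemmed.add(w)
    return stemmed

def simplified_lesk(context):
    ctx = content_words(context)
    # Choose the sense whose gloss has the largest overlap with the context.
    return max(SENSES, key=lambda s: len(ctx & content_words(SENSES[s])))

print(simplified_lesk("The cigar burns slowly and creates a stiff ash."))       # ash (residue)
print(simplified_lesk("The ash is one of the last trees to come into leaf."))   # ash (tree)
```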
Supervised machine-learning approaches
Use a corpus of words tagged in context with their sense to train a classifier that can tag words in new text.
We need: the tag set (“sense inventory”)
the training corpus
a set of features extracted from the training corpus
a classifier
“If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words … But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word … The practical question is: ‘What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?’ ” Warren Weaver, Translation, 1949
A simple representation for each observation (that is, each instance of a target word) is as vectors of sets of feature–value pairs representing a window of words around the target.
Two kinds of features in the vectors
Collocational Features about words at specific positions near target word
Often limited to just word identity and POS
Bag-of-words Features about words that occur anywhere in the window (regardless of position)
Typically limited to frequency counts
Example
Two-word window:
An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
Collocational
Position-specific information about the words in the window
guitar and bass player stand
[guitar, NN, and, CC, player, NN, stand, VB]
wordn−2, POSn−2, wordn−1, POSn−1, wordn+1, POSn+1…
In other words, a vector consisting of
[position n word, position n part-of-speech…]
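A small sketch (mine, not the lecture's) of extracting that collocational vector from a POS-tagged sentence, given the index of the target word:

```python
# A minimal sketch (not from the lecture) of collocational feature extraction:
# the words and POS tags at fixed positions around the target word.

def collocational_features(tagged_words, target_index, window=2):
    features = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        i = target_index + offset
        if 0 <= i < len(tagged_words):
            word, pos = tagged_words[i]
            features.extend([word, pos])
    return features

tagged = [("guitar", "NN"), ("and", "CC"), ("bass", "NN"), ("player", "NN"), ("stand", "VB")]
print(collocational_features(tagged, target_index=2))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```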
Bag-of-words
Information about the words that occur within the window
First derive a set of terms to place in the vector
Then note how often each of those terms occurs in a given window
Co-occurrence example
Assume we’ve settled on a possible vocabulary of 12 words for bass sentences:
[fish, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
The vector for guitar and bass player stand
is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
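And a matching sketch (also mine) for the bag-of-words vector over the assumed 12-word vocabulary:

```python
# A minimal sketch (not from the lecture) of the bag-of-words feature vector:
# counts of each vocabulary word within the context window.

VOCAB = ["fish", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def bag_of_words_vector(window_words, vocab=VOCAB):
    window = [w.lower() for w in window_words]
    return [window.count(v) for v in vocab]

print(bag_of_words_vector(["guitar", "and", "bass", "player", "stand"]))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```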
Classifiers
Once we cast WSD as a classification problem, all sorts of techniques are possible, including
naïve Bayes
decision trees
neural networks
support vector machines
nearest-neighbor methods
Applying naïve Bayes to WSD
We choose ŝ = argmax_s P(s) ∏_{w ∈ context} P(w | s).
P(s) is the prior probability of that sense, estimated by counting in a labeled training set.
P(w | s) is the conditional probability of a word given a particular sense
P(w | s) = count(w, s) / count(s)
We can get both of these from a tagged corpus like SemCor.
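A toy sketch (my own, with made-up counts rather than real SemCor statistics) of the resulting classifier, with add-alpha smoothing so unseen words don't zero out a sense:

```python
# A minimal sketch (not from the lecture) of naive Bayes WSD with toy counts;
# a real system would estimate these from a sense-tagged corpus such as SemCor.
import math
from collections import Counter

# Hypothetical training statistics for two senses of "bass".
sense_counts = Counter({"bass(music)": 60, "bass(fish)": 40})
word_given_sense = {
    "bass(music)": Counter({"guitar": 20, "player": 15, "band": 10}),
    "bass(fish)": Counter({"fish": 25, "river": 10, "pound": 5}),
}

def classify(context_words, alpha=1.0, vocab_size=1000):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)                  # log P(s)
        for w in context_words:
            cw = word_given_sense[sense][w]
            # smoothed P(w | s) = (count(w, s) + alpha) / (count(s) + alpha * |V|)
            score += math.log((cw + alpha) / (count + alpha * vocab_size))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(classify(["guitar", "player"]))   # bass(music)
```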
Problems
Given general ML approaches, how many classifiers do I need to perform WSD robustly?
One for each ambiguous word in the language
How do you decide what set of tags/labels/senses to use for a given word?
Depends on the application
The WordNet senses of bass
Tagging with this set of senses is an impossibly hard task that is probably overkill for any realistic application
Word similarity
Abbé Gabriel Girard, 1718
[I do not believe that there is a synonymous word in any language]
Synonymy is a binary relation – two words are either synonymous or not.
Word similarity (or distance) is a looser metric: Two words are more similar if they share more features of meaning.
Actually these are really relations between senses: Instead of saying “bank is like fund”
We say
bank1 is similar to fund3
bank2 is similar to slope5
But we’ll compute similarity over both words and senses.
We often distinguish word similarity from word relatedness
Similar words: near-synonyms
Related words: can be related any way
E.g.,
car and bicycle are similar
car and gasoline are related but not similar
Two classes of similarity algorithms
Thesaurus-based algorithms: Are words “nearby” in WordNet (or Roget's, e.g.)?
Do words have similar glosses (definitions)?
Distributional algorithms: Do words have similar distributional contexts?
Thesaurus-based word similarity
We could use any relation in the thesaurus:
Meronymy
Glosses
Example sentences
In practice, by “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy.
Path-based similarity
Two words are similar if nearby in thesaurus hierarchy, i.e., short path between them.
(Concepts have path 1 to themselves)
Refinements to path-based similarity
pathlen(c1, c2) = 1 + the number of edges in the shortest path in the hypernym graph between the sense nodes c1 and c2
simpath(c1, c2) = 1 / pathlen(c1, c2), which ranges from 0 to 1
wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)
simpath(c1, c2) = 1 / pathlen(c1, c2)
E.g., simpath(nickel, coin) = 1/2 = 0.5
simpath(fund, budget) = 1/2 = 0.5
simpath(nickel, currency) = 1/4 = 0.25
simpath(nickel, money) = 1/6 = 0.17
simpath(coinage, Richter scale) = 1/6 = 0.17
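A short sketch (my choice of tooling; the lecture doesn't prescribe one) using NLTK, whose path_similarity is essentially this 1/pathlen measure; the numbers may differ slightly from the slide depending on the WordNet version:

```python
# A minimal sketch (not from the lecture) of path-based word similarity using
# NLTK's WordNet; path_similarity is essentially 1 / pathlen between synsets.
from nltk.corpus import wordnet as wn

def word_path_similarity(w1, w2, pos=wn.NOUN):
    # wordsim(w1, w2) = max over sense pairs of simpath(c1, c2)
    sims = [s1.path_similarity(s2)
            for s1 in wn.synsets(w1, pos=pos)
            for s2 in wn.synsets(w2, pos=pos)]
    sims = [s for s in sims if s is not None]
    return max(sims) if sims else None

print(word_path_similarity("nickel", "coin"))
print(word_path_similarity("nickel", "money"))
```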
Example: path-based similarity
Problem with basic path-based similarity
It assumes each link represents a uniform distance.
But nickel to money seems to us to be closer than nickel to standard.
Nodes high in the hierarchy are very abstract.
Instead, we want a metric that
represents the cost of each edge independently and
counts words connected only through abstract nodes as less similar.
Information content similarity metrics
Let's define P(c) as the probability that a randomly selected word in a corpus is an instance of concept c.
Formally, there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
For a given concept, each observed noun is either
a member of that concept with probability P(c)
not a member of that concept with probability 1 − P(c)
All words are members of the root node (entity) P(root) = 1
The lower a node in the hierarchy, the lower its probability.
Resnik, 1995
Information content similarity
Train by counting in a corpus:
Each instance of hill counts toward the frequency of natural elevation, geological formation, entity, etc.
Let words(c) be the set of all words that are children of node c
words(geological formation) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
words(natural elevation) = {hill, ridge}
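Concretely (this estimation formula comes from Resnik's papers rather than the slide above), the probability is estimated from corpus counts:
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N
where N is the total number of corpus tokens covered by the hierarchy.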
Information content similarity
WordNet hierarchy augmented with probabilities P(c)
Information content and probability
The self-information of an event is also called its surprisal:
how surprised we are to know it; how much we learn by knowing it
The more surprising something is, the more it tells us when it happens
Measure self-information in bits I(w) = −log2 P(w)
I flip a coin; P(heads) = 0.5
How many bits of information do I learn by flipping it? I(heads) = −log2(0.5) = −log2(1/2) = log2(2) = 1 bit
I flip a biased coin: P(heads) = 0.8. I don't learn as much: I(heads) = −log2(0.8) ≈ 0.32 bits
Information content: definitions
Information content IC(c) = −log P(c)
Lowest common subsumer (or most informative subsumer):
LCS(c1, c2) = the lowest common subsumer
i.e., the most informative (lowest) node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
How do we use information content IC as a similarity metric?
Using information content for similarity
The similarity between two words is related to their common information
The more two words have in common, the more similar they are
Resnik: Measure the common information as the information content of the lowest common subsumer of the two nodes
simresnik(c1, c2) = −log P(LCS(c1, c2))
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
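As an illustration (my own, not from the lecture), NLTK ships precomputed information-content files and a Resnik similarity; the specific synset names below are assumptions:

```python
# A minimal sketch (not from the lecture) of Resnik similarity with NLTK:
# sim_resnik(c1, c2) = IC(LCS(c1, c2)), with IC estimated from the Brown corpus.
# Requires the "wordnet" and "wordnet_ic" NLTK data packages.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")

nickel = wn.synset("nickel.n.02")   # assumed to be the coin sense
coin = wn.synset("coin.n.01")

print(nickel.res_similarity(coin, brown_ic))
```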
Problems with thesaurus-based methods
We don’t have a thesaurus for every language
Even if we do, many words are missing
They rely on hyponym info: Strong for nouns, but lacking for adjectives and even verbs
Alternative: distributional methods for word similarity
Vector semantics!
Acknowledgments
The lecture incorporates material from: Nancy Ide, Vassar College
Daniel Jurafsky, Stanford University
Jonathan May, University of Southern California