
Page 1: (Word) Sense and Similarity

(Word) Sense and Similarity

6 April 2021

cmpu 366 · Computational Linguistics

Page 2: (Word) Sense and Similarity
Page 3: (Word) Sense and Similarity

Word senses and relations

Page 4: (Word) Sense and Similarity

A word sense is a distinct meaning of a word.

How do we know if a word (lemma) has more than one sense?

Page 5: (Word) Sense and Similarity

Linguists often design tests for this purpose.

E.g., zeugma combines distinct senses in an uncomfortable way:

Which flights serve breakfast?

Which flights serve New York?

→ ? Which flights serve breakfast and New York?

Page 6: (Word) Sense and Similarity

There’s also the cross-lingual test: Is the word translated differently in other languages?

E.g., for English → French:

pen → stylo and

pen → enclos

But beware:

interest → intérêt in both senses

river → fleuve or rivière (one sense)

Page 7: (Word) Sense and Similarity

Lexemes that share a form but have distinct, unrelated meanings are homonyms.

The form can be: Phonological (homophones)

Write and right

Piece and peace

Orthographic (homographs)

Bat (wooden stick vs flying mammal)

Bank (financial institution vs riverside)

Page 8: (Word) Sense and Similarity

Homonymy causes problems for NLP!

Text-to-speech

Same orthographic form but different phonological form (pronunciation), e.g., bass (instrument) vs bass (fish).

Information retrieval

Same orthographic form but different meaning

Query: bat care – need a vet or a woodworker?

Machine translation

pen (English) → stylo (French)

pen (English) → enclos (French)

Speech recognition

Turn up the bass or turn up the base?

Page 9: (Word) Sense and Similarity

If two senses are distinct but related, that’s polysemy, e.g., financial bank vs blood bank.

We say bank is polysemous.

Page 10: (Word) Sense and Similarity

A word frequently takes on further related meanings through systematic polysemy or metaphor.

The wing of a bird

→ the wing of a plane

→ the wing of a building

→ the left wing on a soccer (football) team

Often preserved across languages

Page 11: (Word) Sense and Similarity

Same word, different senses: Homonyms have unrelated senses

Polysemes have related but distinct senses

Different word, same sense: Synonyms

Page 12: (Word) Sense and Similarity

Synonyms are words that have the same meaning in some or all contexts, e.g.,

filbert / hazelnut

youth / adolescent

big / large

automobile / car

water / H2O

There are few if any examples of perfect synonymy.

Page 13: (Word) Sense and Similarity

WordNet

Page 14: (Word) Sense and Similarity

WordNet is a lexical database for English. It includes most English nouns, verbs, adjectives, and adverbs.

The electronic format makes it amenable to automatic manipulation, so it’s used in many NLP applications.

“WordNet” is also used generically to refer to similar resources in other languages.

E.g., EuroWordNet or BalkaNet.

Page 15: (Word) Sense and Similarity

WordNet is organized in terms of synsets – sets of (roughly) synonymous words or phrases.

Each synset corresponds to a distinct sense or a concept.

Page 16: (Word) Sense and Similarity

wordnetweb.princeton.edu/perl/webwn
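As a small sketch, synsets can also be queried programmatically through NLTK's WordNet interface; the word bass and the slice of four senses here are just for illustration, and the exact output depends on the WordNet version.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Each synset is one sense; its lemmas are the (roughly) synonymous words.
for synset in wn.synsets('bass')[:4]:
    print(synset.name(), '-', synset.definition())
    print('  lemmas:', [lemma.name() for lemma in synset.lemmas()])
```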

Page 17: (Word) Sense and Similarity

Different inventories can be used to define senses.

Different inventories don’t always agree on sense distinctions, e.g., translation makes some distinctions but not others.

Page 18: (Word) Sense and Similarity

WordNet hierarchies

Page 19: (Word) Sense and Similarity

WordNet hierarchies

Page 20: (Word) Sense and Similarity

Hyponymy and hypernymy

A more specific sense is a hyponym of a more general sense, e.g.,

car is a hyponym of vehicle

mango is a hyponym of fruit

The converse is being a hypernym: vehicle is a hypernym of car

fruit is a hypernym of mango

Page 21: (Word) Sense and Similarity

Meronymy is the relation of a part to its whole: leg is a meronym of chair

wheel is a meronym of car

Holonymy is the relation of a whole to a part: car is a holonym of wheel

chair is a holonym of leg
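These relations can be queried directly; here is a small sketch using NLTK's WordNet interface. The specific synsets and the outputs suggested in the comments are illustrative and depend on the WordNet version.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

car = wn.synset('car.n.01')
print(car.hypernyms())           # more general concepts, e.g. [Synset('motor_vehicle.n.01')]
print(car.hyponyms()[:3])        # more specific kinds of car
print(car.part_meronyms()[:3])   # parts of a car
print(wn.synset('wheel.n.01').part_holonyms())  # wholes that a wheel is part of
```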

Page 22: (Word) Sense and Similarity

Antonyms are senses that are opposites with respect to one feature of meaning, e.g.,

dark / light

short / long

fast / slow

rise / fall

hot / cold

up / down

in / out

Page 23: (Word) Sense and Similarity

WordNet noun relations

Relation | Definition | Example
Hypernym | From concept to superordinate | breakfast1 ⇒ meal1
Hyponym | From concept to subtype | meal1 ⇒ lunch1
Has-member | From group to member | faculty2 ⇒ professor1
Member-of | From member to group | copilot1 ⇒ crew1
Has-part | From whole to part | table2 ⇒ leg3
Part-of | From part to whole | course7 ⇒ meal1
Substance meronym | From substances to their subparts | water1 ⇒ oxygen1
Substance holonym | From parts of substances to wholes | gin1 ⇒ martini1
Antonym | Semantic opposition between lemmas | leader1 ⇔ follower1
Derivationally related | Lemmas with the same morphological root | destruction1 ⇔ destroy1

Page 24: (Word) Sense and Similarity

WordNet verb relations

Relation | Definition | Example
Hypernym | From events to superordinate events | fly9 ⇒ travel5
Troponym | From events to subordinate events (often via specific manner) | walk1 ⇒ stroll1
Entails | From verbs (events) to the verbs (events) they entail | snore1 ⇒ sleep1
Antonym | Semantic opposition between lemmas | increase1 ⇔ decrease1
Derivationally related | Lemmas with the same morphological root | destruction1 ⇔ destroy1

Page 25: (Word) Sense and Similarity

WordNet as a graph

Page 26: (Word) Sense and Similarity

Word sense disambiguation

Page 27: (Word) Sense and Similarity

Given

a word in context and

a fixed inventory of potential word senses,

decide which sense of the word is used in this context.

Page 28: (Word) Sense and Similarity

WSD systems need annotated data for evaluation and, typically, for training.

Page 29: (Word) Sense and Similarity

What can we do when humans who annotate senses disagree?

Disagreement is inevitable when annotating based on human judgments, even with trained annotators.

We cannot measure the “correctness” of annotations directly.

Instead, we can measure the reliability of annotation.

Do human annotators make the same decisions consistently?

Assumption: high reliability implies validity.

Page 30: (Word) Sense and Similarity

To quantify the (dis)agreement between human annotators, we can use Cohen’s kappa.

This measures the agreement between two annotators while taking into account the possibility of chance agreement.

κ = (P(a) − P(e)) / (1 − P(e))

where P(a) is the probability of actual (observed) agreement and P(e) is the probability of expected (chance) agreement.

Scales for interpreting kappa have been proposed (Landis & Koch, 1977; Green, 1997).

Page 31: (Word) Sense and Similarity

Consider this confusion matrix for sense annotations by humans A and B of the same 250 examples:

        | Sense 1 | Sense 2 | Sense 3 | Total
Sense 1 |      54 |      28 |       3 |    85
Sense 2 |      31 |      18 |      23 |    72
Sense 3 |       0 |      21 |      72 |    93
Total   |      85 |      67 |      98 |   250

Here P(a) = 0.576 and P(e) = 0.339, so κ = (0.576 − 0.339) / (1 − 0.339) ≈ 0.36 – agreement is low.
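A small sketch of that computation in plain Python, directly from the confusion matrix above:

```python
# Cohen's kappa from the confusion matrix above (rows: annotator A, columns: annotator B).
matrix = [
    [54, 28, 3],
    [31, 18, 23],
    [0, 21, 72],
]
n = sum(sum(row) for row in matrix)                      # 250 examples

p_a = sum(matrix[i][i] for i in range(3)) / n            # observed agreement (diagonal)
row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(3)) for j in range(3)]
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance agreement

kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 3), round(p_e, 3), round(kappa, 2))     # 0.576 0.339 0.36
```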

Page 32: (Word) Sense and Similarity

Most words in English have only one sense:

62% in Longman’s Dictionary of Contemporary English (LDOCE)

79% in WordNet

But the others tend to have several senses:

Average of 3.83 senses in LDOCE

Average of 2.96 senses in WordNet

And frequently used words are more likely to be ambiguous.

In the British National Corpus, 84% of word instances have more than one sense.

Page 33: (Word) Sense and Similarity

The upper bound on performance for WSD is human performance:

For fine-grained, WordNet-style senses, expect 75–80% agreement.

For coarser-grained inventories, 90% human agreement is possible.

Page 34: (Word) Sense and Similarity

A baseline for word sense disambiguation is to always select the most frequent sense.

For WordNet, this means taking sense 1.

This does surprisingly well!
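A minimal sketch of this baseline with NLTK's WordNet interface, where synsets are listed in (roughly) frequency order, so the first synset is sense 1; the word bass and the noun restriction are just illustrative.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def most_frequent_sense(word, pos=None):
    """Most-frequent-sense baseline: return WordNet sense 1, if any."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

sense = most_frequent_sense('bass', pos=wn.NOUN)
print(sense, '-', sense.definition() if sense else 'no senses found')
```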

Page 35: (Word) Sense and Similarity

There are two variants of the WSD task:

Lexical sample task:

Small, pre-selected set of target words (e.g., line and plant)

Inventory of senses for each word

Use supervised machine learning: Train a classifier for each word

All-words task:

Every word in an entire text

A lexicon with senses for each word

Data sparseness: Can’t train word-specific classifiers

Page 36: (Word) Sense and Similarity

Dictionary methods

Page 37: (Word) Sense and Similarity

Lesk (1986) looked at using dictionaries, treating each definition as a distinct sense and looking at overlap between the words in the definition and the words in the given context.

ash 1. a tree of the olive family

2. the solid residue left when combustible material is burned

Uses to disambiguate: The cigar burns slowly and creates a stiff ash.

The ash is one of the last trees to come into leaf.


Page 40: (Word) Sense and Similarity

Counting the overlapping content words between each use of ash and each definition:

The cigar burns slowly and creates a stiff ash. → Sense 1: 0, Sense 2: 1

The ash is one of the last trees to come into leaf. → Sense 1: 1, Sense 2: 0

Insufficient information in definitions. Accuracy: 50–70%

Page 41: (Word) Sense and Similarity

Lesk’s algorithm

Simplest implementation: Count overlapping content words between glosses and context.

Lots of variants: Include the examples with the dictionary definitions

Include hypernyms and hyponyms

Give more weight to larger overlaps

Give extra weight to infrequent words (e.g., using IDF)
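A minimal sketch of the simplest version, under a few assumptions: a tiny hand-picked stopword list, Porter stemming so that burns can match burned, and glosses passed in as plain strings. A real system would use a full stopword list and the dictionary's own entries.

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
STOP = {"a", "an", "the", "of", "to", "and", "is", "one", "when"}

def content_stems(text):
    """Lowercase, drop punctuation and stopwords, and stem what remains."""
    words = (w.strip(".,").lower() for w in text.split())
    return {stemmer.stem(w) for w in words if w and w not in STOP}

def simplified_lesk(context, senses):
    """Return the sense whose gloss shares the most content words with the context."""
    ctx = content_stems(context)
    scores = {label: len(ctx & content_stems(gloss)) for label, gloss in senses.items()}
    return max(scores, key=scores.get), scores

senses = {
    "ash#1": "a tree of the olive family",
    "ash#2": "the solid residue left when combustible material is burned",
}
print(simplified_lesk("The cigar burns slowly and creates a stiff ash.", senses))
# ('ash#2', {'ash#1': 0, 'ash#2': 1}) -- matching the overlap counts above
```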

Page 42: (Word) Sense and Similarity

Supervised machine-learning approaches

Page 43: (Word) Sense and Similarity

Use a corpus of words tagged in context with their sense to train a classifier that can tag words in new text.

We need: the tag set (“sense inventory”)

the training corpus

a set of features extracted from the training corpus

a classifier

Page 44: (Word) Sense and Similarity

“If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words … But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word … The practical question is: ‘What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?’ ” Warren Weaver, Translation, 1949

Page 45: (Word) Sense and Similarity

A simple representation for each observation (that is, each instance of a target word) is as vectors of sets of feature–value pairs representing a window of words around the target.

Page 46: (Word) Sense and Similarity

Two kinds of features in the vectors:

Collocational: features about words at specific positions near the target word

Often limited to just word identity and POS

Bag-of-words: features about words that occur anywhere in the window (regardless of position)

Typically limited to frequency counts

Page 47: (Word) Sense and Similarity

Example

Two-word window:

An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

Page 48: (Word) Sense and Similarity

Collocational

Position-specific information about the words in the window

guitar and bass player stand

[guitar, NN, and, CC, player, NN, stand, VB]

word_{n−2}, POS_{n−2}, word_{n−1}, POS_{n−1}, word_{n+1}, POS_{n+1}, …

In other words, a vector consisting of

[position n word, position n part-of-speech…]

Page 49: (Word) Sense and Similarity

Bag-of-words

Information about the words that occur within the window

First derive a set of terms to place in the vector

Then note how often each of those terms occurs in a given window

Page 50: (Word) Sense and Similarity

Co-occurrence example

Assume we’ve settled on a possible vocabulary of 12 words for bass sentences:

[fish, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

The vector for guitar and bass player stand

is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
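A small sketch putting both feature types together for the running bass example. The POS tags below are hand-assigned for illustration (not from a real tagger), and the vocabulary is the 12-word list from the slide.

```python
# Collocational and bag-of-words features for one instance of "bass".
context = "an electric guitar and bass player stand off to one side".split()
pos = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RP", "TO", "CD", "NN"]
target = context.index("bass")

# Collocational features: words and POS tags at fixed offsets around the target.
offsets = [-2, -1, 1, 2]
colloc = []
for off in offsets:
    colloc += [context[target + off], pos[target + off]]
print(colloc)  # ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']

# Bag-of-words features: counts of a pre-chosen vocabulary anywhere in the window.
vocab = ["fish", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]
window = context  # here the whole sentence serves as the window
bow = [window.count(term) for term in vocab]
print(bow)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```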

Page 51: (Word) Sense and Similarity

Classifiers

Once we cast WSD as a classification problem, all sorts of techniques are possible, including

naïve Bayes

decision trees

neural networks

support vector machines

nearest-neighbor methods

Page 52: (Word) Sense and Similarity

Applying naïve Bayes to WSD

Choose the sense ŝ = argmax_s P(s) ∏ P(w | s), with the product over the words w in the context.

P(s) is the prior probability of that sense: count it in a labeled training set.

P(w | s) is the conditional probability of a word given a particular sense:

P(w | s) = count(w, s) / count(s)

We can get both of these from a tagged corpus like SemCor.
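A toy sketch of this classifier: choose argmax over senses of log P(s) + Σ log P(w | s), with add-one smoothing, trained on a few invented bass examples rather than SemCor.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set: (sense label, context) pairs.
train = [
    ("fish",  "caught a huge bass in the river with my rod"),
    ("fish",  "the bass weighed four pounds and fought hard"),
    ("music", "the bass player tuned his guitar before the show"),
    ("music", "turn up the bass and play the guitar solo"),
]

sense_counts = Counter(sense for sense, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for sense, text in train:
    for w in text.split():
        word_counts[sense][w] += 1
        vocab.add(w)

def classify(context):
    """Score each sense with log P(s) + sum_w log P(w | s), add-one smoothed."""
    best, best_score = None, float("-inf")
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / len(train))
        total = sum(word_counts[sense].values())
        for w in context.split():
            # add-one smoothing so unseen words don't zero out the product
            score += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best

print(classify("guitar and bass player stand"))  # expected: 'music'
```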

Page 53: (Word) Sense and Similarity

Problems

Given general ML approaches, how many classifiers do I need to perform WSD robustly?

One for each ambiguous word in the language

How do you decide what set of tags/labels/senses to use for a given word?

Depends on the application

Page 54: (Word) Sense and Similarity

WordNet bass

Tagging with this set of senses is an impossibly hard task that is probably overkill for any realistic application.

Page 55: (Word) Sense and Similarity

Word similarity

Page 56: (Word) Sense and Similarity

Abbé Gabriel Girard, 1718

[I do not believe that there is a synonymous word in any language]

Page 57: (Word) Sense and Similarity

Synonymy is a binary relation – two words are either synonymous or not.

Word similarity (or distance) is a looser metric: Two words are more similar if they share more features of meaning.

Actually these are really relations between senses: Instead of saying “bank is like fund”

We say

bank1 is similar to fund3

bank2 is similar to slope5

But we’ll compute similarity over both words and senses.

Page 58: (Word) Sense and Similarity

We often distinguish word similarity from word relatedness

Similar words: near-synonyms

Related words: can be related any way

E.g.,

car and bicycle are similar

car and gasoline are related but not similar

Page 59: (Word) Sense and Similarity

Two classes of similarity algorithms

Thesaurus-based algorithms: Are words “nearby” in WordNet (or, e.g., Roget’s)?

Do words have similar glosses (definitions)?

Distributional algorithms: Do words have similar distributional contexts?

Page 60: (Word) Sense and Similarity

Thesaurus-based word similarity

We could use any relation in the thesaurus:

Meronymy

Glosses

Example sentences

In practice, by “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy.

Page 61: (Word) Sense and Similarity

Path-based similarity

Two words are similar if they are nearby in the thesaurus hierarchy, i.e., there is a short path between them.

(A concept has a path of length 1 to itself.)

Page 62: (Word) Sense and Similarity

Refinements to path-based similarity

pathlen(c1, c2) = 1 + the number of edges in the shortest path in the hypernym graph between the sense nodes c1 and c2

simpath(c1, c2) = 1 / pathlen(c1, c2), which ranges from 0 to 1

wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)

Page 63: (Word) Sense and Similarity

Example: path-based similarity

simpath(c1, c2) = 1 / pathlen(c1, c2)

E.g., simpath(nickel, coin) = 1/2 = 0.5

simpath(fund, budget) = 1/2 = 0.5

simpath(nickel, currency) = 1/4 = 0.25

simpath(nickel, money) = 1/6 = 0.17

simpath(coinage, Richter scale) = 1/6 = 0.17
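NLTK exposes this measure as path_similarity, which scores 1 / (1 + number of edges) over the hypernym graph. A small sketch follows; the sense numbers in the synset names are assumptions you would verify with wn.synsets, and the exact values depend on the WordNet version.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

nickel = wn.synset('nickel.n.02')   # the coin sense (sense number assumed)
coin = wn.synset('coin.n.01')
money = wn.synset('money.n.01')

print(nickel.path_similarity(coin))    # short path -> higher similarity
print(nickel.path_similarity(money))   # longer path -> lower similarity
```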

Page 64: (Word) Sense and Similarity

Problem with basic path-based similarity: it assumes each link represents a uniform distance.

But nickel to money seems to us to be closer than nickel to standard.

Nodes high in the hierarchy are very abstract.

Instead, we want a metric that represents the cost of each edge independently and counts words connected only through abstract nodes as less similar.


Page 66: (Word) Sense and Similarity

Information content similarity metrics

Let’s define P(c) as the probability that a randomly selected word in a corpus is an instance of concept c.

Formally, there is a distinct random variable, ranging over words, associated with each concept in the hierarchy.

For a given concept, each observed noun is either

a member of that concept with probability P(c)

not a member of that concept with probability 1 − P(c)

All words are members of the root node (entity), so P(root) = 1.

The lower a node is in the hierarchy, the lower its probability.

Resnik, 1995

Page 67: (Word) Sense and Similarity

Information content similarity

Train by counting in a corpus: each instance of hill counts toward the frequency of natural elevation, geological formation, entity, etc.

Let words(c) be the set of all words that are children of node c:

words(geological formation) = {hill, ridge, grotto, coast, cave, shore, natural elevation}

words(natural elevation) = {hill, ridge}

Then P(c) = Σ count(w) / N, summing over w ∈ words(c), where N is the total number of word tokens in the corpus.
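A toy sketch of that counting, with invented corpus counts (and a couple of unrelated words standing in for the rest of the corpus) just to show that P(c) grows and IC(c) shrinks as we move up the hierarchy:

```python
from math import log2

# Invented word counts (illustrative only, not from a real corpus).
counts = {'hill': 40, 'ridge': 10, 'grotto': 5, 'coast': 30, 'cave': 15,
          'shore': 60, 'banana': 90, 'sprint': 50}
N = sum(counts.values())  # total noun tokens in the toy corpus

# Leaf words under each concept, following the sets on the slide.
words_of = {
    'natural elevation': {'hill', 'ridge'},
    'geological formation': {'hill', 'ridge', 'grotto', 'coast', 'cave', 'shore'},
}

for concept, members in words_of.items():
    p = sum(counts.get(w, 0) for w in members) / N
    print(f"P({concept}) = {p:.3f}, IC = {-log2(p):.2f} bits")
```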

Page 68: (Word) Sense and Similarity

Information content similarity

WordNet hierarchy augmented with probabilities P(c)

Page 69: (Word) Sense and Similarity

Information content and probability

The self-information of an event is also called its surprisal:

how surprised we are to know it; how much we learn by knowing it

The more surprising something is, the more it tells us when it happens

Measure self-information in bits: I(w) = −log2 P(w)

I flip a coin; P(heads) = 0.5.

How many bits of information do I learn by flipping it? I(heads) = −log2(0.5) = −log2(1/2) = log2(2) = 1 bit

I flip a biased coin with P(heads) = 0.8; I don’t learn as much: I(heads) = −log2(0.8) ≈ 0.32 bits

Page 70: (Word) Sense and Similarity

Information content: definitions

Information content IC(c) = −log P(c)

Lowest common subsumer (or most informative subsumer):

LCS(c1, c2) = the lowest common subsumer

i.e., the most informative (lowest) node in the hierarchy that subsumes (is a hypernym of) both c1 and c2

How do we use information content IC as a similarity metric?

Page 71: (Word) Sense and Similarity

Using information content for similarity

The similarity between two words is related to their common information

The more two words have in common, the more similar they are

Resnik: Measure the common information as the information content of the lowest common subsumer of the two nodes

simresnik(c1, c2) = −log P(LCS(c1, c2))

Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.

Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
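NLTK ships this measure as res_similarity, using information-content tables precomputed from a corpus. A small sketch under the usual choices (Brown-corpus IC file; the sense numbers in the synset names are assumptions to check against wn.synsets):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic   # requires: nltk.download('wordnet_ic')

brown_ic = wordnet_ic.ic('ic-brown.dat')   # IC values counted over the Brown corpus

nickel = wn.synset('nickel.n.02')   # coin sense (sense number assumed)
dime = wn.synset('dime.n.01')

# Resnik similarity = information content of the lowest common subsumer (here, coin).
print(nickel.res_similarity(dime, brown_ic))
```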

Page 72: (Word) Sense and Similarity

Problems with thesaurus-based methods

We don’t have a thesaurus for every language

Even if we do, many words are missing

They rely on hyponym info: Strong for nouns, but lacking for adjectives and even verbs

Alternative: distributional methods for word similarity

Vector semantics!

Page 73: (Word) Sense and Similarity

Acknowledgments

The lecture incorporates material from:

Nancy Ide, Vassar College

Daniel Jurafsky, Stanford University

Jonathan May, University of Southern California