CS388: Natural Language Processing Lecture 4: Sequence Models


Eunsol Choi

Parts of this lecture adapted from Greg Durrett, Yejin Choi, Yoav Artzi

Logistics


‣ HW1 due today midnight

‣ HW2 will be released tomorrow, due September 30th

‣ Materials needed to do HW2 will be covered by next Tuesday

Sequence Models


‣ Topics for next three lectures and HW2

‣ We will come back to neural sequence models in a few weeks

Overview

‣ Sequence Modeling Problems in NLP

‣ Generative Model: Hidden Markov Models (HMM)

‣ Discriminative Models: Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF)

‣ Unsupervised Learning: Expectation Maximization

Reading


‣ Collins: HMMs → generative sequence tagging model

‣ Collins: MEMMs → discriminative sequence tagging models

‣ Collins: EM → Expectation Maximization

‣ J&M: Chapter 8 (optional): covers both HMMs and MEMMs

The Structure of Language

‣ Language is tree-structured

I ate the spaghetti with chopsticks
I ate the spaghetti with meatballs

‣ But labeled sequences can provide a shallow analysis

I/PRP ate/VBD the/DT spaghetti/NN with/IN chopsticks/NNS
I/PRP ate/VBD the/DT spaghetti/NN with/IN meatballs/NNS

Sequence Modeling Problems in NLP


‣ Part-of-Speech (POS) Tagging

I/PRP ate/VBD the/DT spaghetti/NN with/IN chopsticks/NNS
I/PRP ate/VBD the/DT spaghetti/NN with/IN meatballs/NNS

‣ Named Entity Recognition (NER): segment text into spans with certain properties (person, organization, …)

[Germany]LOC ’s representative to the [European Union]ORG ’s veterinary committee [Werner Zwingman]PER said on Wednesday consumers should…

Germany/BL ’s/NA representative/NA to/NA the/NA European/BO Union/CO ’s/NA veterinary/NA committee/NA Werner/BP Zwingman/CP said/NA on/NA Wednesday/NA consumers/NA should/NA…

Parts of Speech

Slide credit: Dan Klein

‣ Categorization of words into types

CC     conjunction, coordinating                      and both but either or
CD     numeral, cardinal                              mid-1890 nine-thirty 0.5 one
DT     determiner                                     a all an every no that the
EX     existential there                              there
FW     foreign word                                   gemeinschaft hund ich jeux
IN     preposition or conjunction, subordinating      among whether out on by if
JJ     adjective or numeral, ordinal                  third ill-mannered regrettable
JJR    adjective, comparative                         braver cheaper taller
JJS    adjective, superlative                         bravest cheapest tallest
MD     modal auxiliary                                can may might will would
NN     noun, common, singular or mass                 cabbage thermostat investment subhumanity
NNP    noun, proper, singular                         Motown Cougar Yvette Liverpool
NNPS   noun, proper, plural                           Americans Materials States
NNS    noun, common, plural                           undergraduates bric-a-brac averages
POS    genitive marker                                ' 's
PRP    pronoun, personal                              hers himself it we them
PRP$   pronoun, possessive                            her his mine my our ours their thy your
RB     adverb                                         occasionally maddeningly adventurously
RBR    adverb, comparative                            further gloomier heavier less-perfectly
RBS    adverb, superlative                            best biggest nearest worst
RP     particle                                       aboard away back by on open through
TO     "to" as preposition or infinitive marker       to
UH     interjection                                   huh howdy uh whammo shucks heck
VB     verb, base form                                ask bring fire see take
VBD    verb, past tense                               pleaded swiped registered saw
VBG    verb, present participle or gerund             stirring focusing approaching erasing
VBN    verb, past participle                          dilapidated imitated reunified unsettled
VBP    verb, present tense, not 3rd person singular   twist appear comprise mold postpone
VBZ    verb, present tense, 3rd person singular       bases reconstructs marks uses
WDT    WH-determiner                                  that what whatever which whichever
WP     WH-pronoun                                     that what whatever which who whom
WP$    WH-pronoun, possessive                         whose
WRB    WH-adverb                                      however whenever where why


POS Tagging

The back door = JJ (Adjective)
On my back = NN (Noun)
Win the voters back = RB (Adverb)
Promised to back the bill = VB (Verb)


‣ The POS tagging problem is to determine the POS tag for a particular instance of a word.

‣ Many words have more than one POS, depending on the context.

Sources of Information


‣ Knowledge of neighboring words

‣ Knowledge of word probabilities
  ‣ the, a, an are almost always articles
  ‣ man is frequently a noun, rarely used as a verb

Time flies like an arrow; Fruit flies like a banana

‣ If we choose the most frequent tag, we get over 90% accuracy
‣ About 40% of word tokens are ambiguous

What is this good for?

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction: write regular expressions like (Det) Adj* N+ over the output for phrases, for example:
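To make this concrete, a minimal sketch in Python (the word/TAG string encoding and the example sentence are invented for illustration):

```python
import re

# Hypothetical tagger output, encoded as word/TAG tokens.
tagged = "the/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN a/DT lazy/JJ dog/NN"
tokens = tagged.split()
words = [t.split("/")[0] for t in tokens]
tags = " ".join(t.split("/")[1] for t in tokens) + " "   # "DT JJ JJ NN VBD ... "

# (Det)? Adj* N+ over the tag string: optional determiner, any number of
# adjectives (JJ/JJR/JJS), one or more nouns (NN/NNS/NNP/NNPS).
np_pattern = re.compile(r"(?:DT )?(?:JJ\S* )*(?:NN\S* )+")

for m in np_pattern.finditer(tags):
    start = len(tags[:m.start()].split())    # token index where the match begins
    n = len(m.group(0).split())              # number of tags matched
    print(" ".join(words[start:start + n]))  # -> "the quick brown fox", "a lazy dog"
```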

POS tag sets in different languages

[Petrov et al., 2012]

Universal POS Tag Set

‣ Universal POS tagset (~12 tags); cross-lingual models work well!

Gillick et al. 2016

Today

‣ Sequence Modeling Problems in NLP

‣ Hidden Markov Models (HMM)

‣ Inference (Viterbi)

‣ HMM parameter estimation

Classic Solution: Hidden Markov Models

‣ Input x = (x1, ..., xn), output y = (y1, ..., yn)

Two simplifying assumptions


‣ Independence assumption (each word depends only on its own tag):

P(xi | x, y) = P(xi | yi)

‣ Markov assumption (the future is conditionally independent of the past given the present):

P(yi | y1, y2, ⋯, yi−1) = P(yi | yi−1)

HMM for POS


The/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP loan/NN commitments/NNS …

‣ States = {DT, NNP, NN, ... } are the POS tags

‣ Observations = V are words

‣ Transition distribution q(yi | yi−1) models the tag sequences

‣ Emission distribution e(xi | yi) models words given their POS
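Together, q and e define the joint probability P(x, y) = q(y1 | START) · e(x1 | y1) · q(y2 | y1) · e(x2 | y2) ⋯ q(STOP | yn). A minimal sketch of this computation (the dictionary encoding of q and e and the START/STOP symbols are assumed representations, not the lecture's code):

```python
# q[(prev_tag, tag)]: transition probability, e[(tag, word)]: emission
# probability; "<S>"/"</S>" are assumed boundary symbols.
def hmm_joint_prob(words, tags, q, e, START="<S>", STOP="</S>"):
    p = 1.0
    prev = START
    for word, tag in zip(words, tags):
        p *= q.get((prev, tag), 0.0) * e.get((tag, word), 0.0)
        prev = tag
    return p * q.get((prev, STOP), 0.0)   # end-of-sequence transition
```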

HMM Learning and Inference


‣ Learning:
  ‣ Maximum likelihood estimation of transition q and emission e

‣ Inference:
  ‣ Viterbi: compute argmax_y P(y, x) efficiently

Learning: Maximum Likelihood


‣ Supervised learning for estimating transitions and emissions

‣ Any concerns about the quality of these estimates?

Sparsity again!
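Maximum likelihood here is just relative-frequency counting over tagged sentences. A minimal sketch (the input format of lists of (word, tag) pairs is an assumption; no smoothing, which is exactly where the sparsity concern bites):

```python
from collections import Counter

def estimate_hmm(tagged_sentences, START="<S>", STOP="</S>"):
    """MLE of transition q and emission e by relative-frequency counting."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:            # sent: list of (word, tag) pairs
        prev = START
        tag_count[START] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
        trans[(prev, STOP)] += 1             # count the sentence-final transition
    q = {bigram: c / tag_count[bigram[0]] for bigram, c in trans.items()}
    e = {pair: c / tag_count[pair[0]] for pair, c in emit.items()}
    return q, e
```

Any (tag, word) pair unseen in training gets probability zero, which motivates the word classes on the next slide.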

Learning: Low-Frequency Words


Dealing with Low-Frequency Words: An Example [Bikel et al., 1999] (named-entity recognition)

Word class               Example                  Intuition
twoDigitNum              90                       Two-digit year
fourDigitNum             1990                     Four-digit year
containsDigitAndAlpha    A8956-67                 Product code
containsDigitAndDash     09-96                    Date
containsDigitAndSlash    11/9/89                  Date
containsDigitAndComma    23,000.00                Monetary amount
containsDigitAndPeriod   1.00                     Monetary amount, percentage
othernum                 456789                   Other number
allCaps                  BBN                      Organization
capPeriod                M.                       Person name initial
firstWord                first word of sentence   No useful capitalization information
initCap                  Sally                    Capitalized word
lowercase                can                      Uncapitalized word
other                    ,                        Punctuation marks, all other words


‣ Map infrequent words to the word classes above [Bikel et al., 1999]
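A sketch of that mapping in code (the class inventory follows the table above; the function itself is hypothetical):

```python
import re

def word_class(word, is_first_word=False):
    """Map a rare word to a coarse class, following Bikel et al. (1999)."""
    if re.fullmatch(r"\d{2}", word):   return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):   return "fourDigitNum"
    has_digit = any(c.isdigit() for c in word)
    if has_digit and any(c.isalpha() for c in word): return "containsDigitAndAlpha"
    if has_digit and "-" in word:      return "containsDigitAndDash"
    if has_digit and "/" in word:      return "containsDigitAndSlash"
    if has_digit and "," in word:      return "containsDigitAndComma"
    if has_digit and "." in word:      return "containsDigitAndPeriod"
    if word.isdigit():                 return "othernum"
    if re.fullmatch(r"[A-Z]\.", word): return "capPeriod"
    if word.isupper():                 return "allCaps"
    if is_first_word:                  return "firstWord"
    if word[:1].isupper():             return "initCap"
    if word.islower():                 return "lowercase"
    return "other"
```

During both training and decoding, words below a frequency threshold are replaced by their class before estimating emissions.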

Inference (Decoding)

‣ Inference problem: given input x = (x1, ..., xn), find the most likely output y = (y1, ..., yn):

argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x)

‣ We can list all possible y and then pick the best one!
‣ Any problems?
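The problem: there are |T|^n candidate tag sequences, so exhaustive enumeration is exponential in sentence length. A toy sketch (reusing the hypothetical hmm_joint_prob helper from above):

```python
from itertools import product

def brute_force_decode(words, tagset, q, e):
    # Score all |T|^n tag sequences -- only feasible for toy inputs.
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: hmm_joint_prob(words, tags, q, e))
```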


‣ First solution: Beam Search
  ‣ A beam is a set of partial hypotheses
  ‣ Start with a single empty trajectory
  ‣ At each step, consider all continuations, discard most, keep the top K (sketched below)

‣ But this does not guarantee the optimal answer…
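A minimal beam-search sketch under the same assumed q/e dictionaries; because only the top K partial hypotheses survive each step, the true argmax can be pruned away:

```python
def beam_search(words, tagset, q, e, K=3, START="<S>", STOP="</S>"):
    beam = [([], START, 1.0)]            # (tags so far, last tag, score)
    for word in words:
        cands = [(tags + [t], t, p * q.get((prev, t), 0.0) * e.get((t, word), 0.0))
                 for tags, prev, p in beam for t in tagset]
        beam = sorted(cands, key=lambda c: c[2], reverse=True)[:K]   # keep top K
    # Fold in the end-of-sequence transition before picking the winner.
    return max(beam, key=lambda c: c[2] * q.get((c[1], STOP), 0.0))[0]
```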


The Viterbi Algorithm


‣ Dynamic program for computing the max score π(i, yi) of a sequence of length i ending in tag yi

‣ Now this is an efficient algorithm!


‣ Dynamic program for computing π(i, yi) (for all i)

‣ Iterative computation, for i = 1 … n:
  ‣ Store score: π(i, yi) = max over yi−1 of π(i−1, yi−1) · q(yi | yi−1) · e(xi | yi)
  ‣ Store back-pointer: bp(i, yi) = argmax over yi−1 of the same quantity
(see the code sketch below)
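A compact sketch of the whole algorithm under the same assumed q/e dictionaries; π scores and back-pointers are stored per (position, tag):

```python
def viterbi(words, tagset, q, e, START="<S>", STOP="</S>"):
    n = len(words)
    pi = {(0, START): 1.0}     # pi[(i, tag)]: best score of a length-i prefix
    bp = {}                    # bp[(i, tag)]: best previous tag
    prev_tags = [START]
    for i, word in enumerate(words, start=1):
        for t in tagset:
            scores = {pt: pi.get((i - 1, pt), 0.0)
                          * q.get((pt, t), 0.0) * e.get((t, word), 0.0)
                      for pt in prev_tags}
            best = max(scores, key=scores.get)
            pi[(i, t)], bp[(i, t)] = scores[best], best
        prev_tags = list(tagset)
    # Close with the STOP transition, then follow back-pointers right to left.
    last = max(tagset, key=lambda t: pi.get((n, t), 0.0) * q.get((t, STOP), 0.0))
    tags = [last]
    for i in range(n, 1, -1):
        tags.append(bp[(i, tags[-1])])
    return list(reversed(tags))
```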

The Viterbi Algorithm

Time flies like an arrow; Fruit flies like a banana

[Figure: Viterbi trellis for "Fruit flies like bananas" over the tags {N, V, IN}, with START and STOP states. Each cell holds a score π(i, tag) for position i; the columns are filled in left to right (e.g. values such as 0.03, 0.01, and 0 in the first column, down to values around 0.00003 in the last), each cell keeps a back-pointer to its best predecessor, and the best tag sequence is read off by following back-pointers from STOP.]

‣ Why does this find the max p(·)? What is the runtime?

The Viterbi Algorithm: Runtime


‣ Linear in sentence length
‣ Polynomial in the number of possible tags

‣ Total runtime: O(n |T|²) for sentence length n and tag set T

‣ Would there be any scenarios where we would choose beam search?

Tagsets in Different Languages


‣ Number of transition pairs |T|² for tag sets of different sizes:

294² = 86436
45² = 2025
11² = 121

Trigram HMM Taggers

‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), …

‣ P((VBZ, NN) | (NNP, VBZ)): more context! The noun-verb-noun pattern reflects S-V-O order

‣ Tradeoff between model capacity and data size (sparsity)
‣ Trigrams are a “sweet spot” for POS tagging
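One way to implement this: each Viterbi state becomes a pair of adjacent tags, and the bigram machinery above applies unchanged over pair-states. A tiny sketch of the recoding (representation assumed):

```python
def to_pair_states(tags, START="<S>"):
    """Recode a tag sequence so each state is (previous tag, current tag)."""
    states, prev = [], START
    for t in tags:
        states.append((prev, t))
        prev = t
    return states

# to_pair_states(["NNP", "VBZ", "NN"])
# -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN")]
```

The cost is a much larger state space, which is the capacity/sparsity tradeoff above.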

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

Slide credit: Dan Klein

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+

Can we do better?


‣ HMM is a generative model; estimation relies on counting!
‣ Reminds you of something?

‣ Can we build a discriminative model, incorporating rich features?

Named Entity Recognition (NER)

Barack Obama will travel to Hangzhou today for the G20 meeting .

Barack/B-PER Obama/I-PER will/O travel/O to/O Hangzhou/B-LOC today/O for/O the/O G20/B-ORG meeting/O ./O

‣ BIO tagset: begin, inside, outside

‣ Why might an HMM not do so well here?

‣ Lots of O’s

‣ Sequence of tags — should we use an HMM?

‣ Insufficient features/capacity with multinomials (especially for unks)

Emission Features for NER

‣ [Leicestershire]LOC is a nice place to visit…
‣ I took a vacation to [Boston]LOC
‣ [Apple]ORG released a new version…
‣ According to the [New York Times]ORG…
‣ [Texas]LOC governor [Greg Abbott]PER said…
‣ [Leonardo DiCaprio]PER won an award…

Emission Features for NER

‣ Context features
  ‣ Words before/after

‣ Word features
  ‣ Capitalization
  ‣ Word shape
  ‣ Prefixes/suffixes
  ‣ Lexical indicators

‣ Word clusters

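A sketch of such a feature extractor (the feature names and templates are invented for illustration; real taggers use many more):

```python
def emission_features(words, i):
    """Sparse indicator features for the word at position i."""
    w = words[i]
    shape = "".join("X" if c.isupper() else "x" if c.islower()
                    else "d" if c.isdigit() else c for c in w)
    return {
        f"word={w.lower()}": 1,
        f"prev_word={words[i - 1].lower() if i > 0 else '<S>'}": 1,    # context
        f"next_word={words[i + 1].lower() if i + 1 < len(words) else '</S>'}": 1,
        f"capitalized={w[:1].isupper()}": 1,                           # capitalization
        f"shape={shape}": 1,                                           # word shape
        f"prefix3={w[:3].lower()}": 1,                                 # prefixes/suffixes
        f"suffix3={w[-3:].lower()}": 1,
    }
```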

Maximum Entropy Markov Models (MEMM)


‣ Chain rule: P(y | x) = ∏i P(yi | y1, …, yi−1, x1, …, xn)

‣ Independence assumption: P(y | x) = ∏i P(yi | yi−1, x1, …, xn)

‣ Log-linear model for the sequence tagging problem

‣ Learning: train as a discrete log-linear model p(yi | yi−1, x1, …, xn)

‣ Scoring: p(yi | yi−1, x1, …, xn) ∝ exp(w · f(x, i, yi, yi−1))
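A minimal sketch of that local distribution, assuming the emission_features extractor above and a weight dictionary keyed by (tag, feature):

```python
import math

def memm_local_probs(words, i, prev_tag, tagset, weights):
    """p(y_i | y_{i-1}, x) as a softmax over per-tag linear scores."""
    feats = emission_features(words, i)
    feats[f"prev_tag={prev_tag}"] = 1                # transition feature
    scores = {t: sum(weights.get((t, f), 0.0) * v for f, v in feats.items())
              for t in tagset}
    z = sum(math.exp(s) for s in scores.values())    # local partition function
    return {t: math.exp(scores[t]) / z for t in tagset}
```

Decoding can then reuse Viterbi, with q(yi | yi−1) · e(xi | yi) replaced by this single locally normalized score.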
