LING 581: Advanced Computational Linguistics Lecture Notes, January 12th

Page 1

LING 581: Advanced Computational Linguistics

Lecture Notes

January 12th

Page 2

Course

• Webpage for lecture slides
  – http://dingo.sbs.arizona.edu/~sandiway/ling581-12/

• Meeting information

Page 3

Course Objectives

• Follow-on course to LING/C SC/PSYC 438/538 Computational Linguistics
  – continue with J&M: 25 chapters, a lot of material not covered in 438/538

• And gain project experience
  – dealing with natural language software packages
  – installation, input data formatting
  – operation: project exercises
  – useful “real-world” computational experience
  – abilities gained will be of value to employers

Page 4

Computational Facilities

• Advise using your own laptop/desktop
  – we can also make use of this computer lab
    • but you don’t have installation rights on these computers

• Platforms: Windows is possible, but you really should run some variant of Unix… (your task #1 for this week)
  – Linux (separate bootable partition or via virtualization software)
    • de facto standard for advanced/research software
    • https://www.virtualbox.org/ (free!)
  – Cygwin on Windows
    • http://www.cygwin.com/
    • Linux-like environment for Windows, making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows
  – Mac OS X
    • not quite Linux; some porting issues, especially with C programs; can use VirtualBox

Page 5

Grading

• Completion of all homework tasks and the course project will result in a satisfactory grade (A)

Page 6

Homework Task 1: Install Tregex

Runs in Java

Page 7

Last semester

• 438/538: Language models

Page 8

Language Models and N-grams

• given a word sequence
  – w1 w2 w3 ... wn

• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
  – ...
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2wn-1)

• note
  – it’s not easy to collect (meaningful) statistics on p(wn|wn-1wn-2...w1) for all possible word sequences
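To make the chain rule concrete, here is a minimal sketch of the arithmetic; the conditional probabilities are invented toy numbers, not estimates from any corpus:

```python
# Chain rule: p(w1 w2 w3) = p(w1) * p(w2|w1) * p(w3|w1 w2).
# Toy probabilities, purely to show the arithmetic.

p_w1 = 0.1             # p("the")
p_w2_given_w1 = 0.05   # p("cat" | "the")
p_w3_given_w1w2 = 0.2  # p("sat" | "the cat")

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1w2
print(p_sequence)  # 0.001
```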

Page 9

Language Models and N-grams

• Given a word sequence
  – w1 w2 w3 ... wn

• Bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov Assumption: finite-length history
  – 1st-order Markov Model
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-3wn-2wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2wn-1)
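A minimal sketch of scoring a sentence under the bigram approximation; the probability table and the "<s>" start symbol are toy assumptions for illustration:

```python
# Bigram (1st-order Markov) approximation:
#   p(w1 ... wn) ≈ p(w1) p(w2|w1) ... p(wn|wn-1)
# Toy probability table, not estimated from data; "<s>" marks sentence start
# so every word has a left neighbour.

p_bigram = {("<s>", "i"): 0.25,
            ("i", "want"): 0.33,
            ("want", "food"): 0.05}

def bigram_prob(words):
    """Multiply p(w_i | w_{i-1}) over adjacent pairs in the sentence."""
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob

print(bigram_prob(["<s>", "i", "want", "food"]))  # 0.25 * 0.33 * 0.05 ≈ 0.0041
```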

Page 10

Language Models and N-grams

• Trigram approximation
  – 2nd-order Markov Model
  – just look at the preceding two words only
  – p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1...wn-3wn-2wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2wn-1)

• note
  – p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2wn-1), but harder than p(wn|wn-1)
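The same idea with a two-word history; again the trigram table and the "<s>" padding are toy assumptions for illustration:

```python
# Trigram (2nd-order Markov) approximation: condition each word on the
# preceding two words. Toy probability table; values are invented.

p_trigram = {("<s>", "<s>", "i"): 0.2,
             ("<s>", "i", "want"): 0.3,
             ("i", "want", "food"): 0.1}

def trigram_prob(words):
    """p(w1 ... wn) ≈ product of p(w_i | w_{i-2}, w_{i-1}), with two <s> pads."""
    padded = ["<s>", "<s>"] + words
    prob = 1.0
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        prob *= p_trigram[(w1, w2, w3)]
    return prob

print(trigram_prob(["i", "want", "food"]))  # 0.2 * 0.3 * 0.1 = 0.006
```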

Page 11

Language Models and N-grams

• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1wn) / Σw f(wn-1w)    (w is any word)
  – since Σw f(wn-1w) = f(wn-1), the unigram frequency for wn-1
  – p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (relative frequency)

• Note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
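A minimal sketch of MLE bigram estimation; the tiny corpus and the "<s>" start padding are assumptions made up for illustration:

```python
# MLE bigram estimation: p(wn | wn-1) = f(wn-1 wn) / f(wn-1).
# Tiny invented corpus of tokenized sentences.

from collections import Counter

corpus = [["i", "want", "chinese", "food"],
          ["i", "want", "english", "food"],
          ["i", "spent", "too", "much"]]

unigram_f = Counter()
bigram_f = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent                 # pad with a start symbol
    unigram_f.update(tokens)
    bigram_f.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """Relative-frequency (maximum likelihood) estimate of p(w | prev)."""
    return bigram_f[(prev, w)] / unigram_f[prev]

print(p_mle("want", "i"))    # f(i want)/f(i) = 2/3
print(p_mle("food", "want")) # f(want food)/f(want) = 0/2 = 0.0 -- the zero problem
```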

Page 12

Motivation for smoothing

• Smoothing: avoid zero probability estimates

• Consider the bigram approximation:
  p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

• what happens when any individual probability component is zero?
  – arithmetic multiplication law: 0 × X = 0
  – very brittle!

• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
  – particularly so for larger n-grams
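A small illustration of the brittleness: one unseen bigram zeroes out the whole sentence probability (toy bigram table; the values are invented):

```python
# Zero-probability brittleness: if any bigram in the sentence was never seen,
# the product collapses to 0 (0 * X = 0).

p_bigram = {("<s>", "colorless"): 0.01,
            ("colorless", "green"): 0.02,
            ("green", "ideas"): 0.05}
            # ("ideas", "sleep") is missing: unseen in the toy "corpus"

sentence = ["<s>", "colorless", "green", "ideas", "sleep"]
prob = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    prob *= p_bigram.get((prev, cur), 0.0)   # unseen bigram contributes 0
print(prob)  # 0.0 -- the single unseen bigram wipes out everything else
```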

Page 13

Language Models and N-grams

• Example: [tables of unigram frequencies, wn-1wn bigram frequencies, and the resulting bigram probabilities]
  – the bigram table is a sparse matrix
  – zeros render probabilities unusable
  – (we’ll need to add fudge factors, i.e. do smoothing)

Page 14

Smoothing and N-grams

• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (relative frequency)
    • bigram f(wn-1wn) doesn’t exist in the dataset

• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value

Page 15

Smoothing and N-grams

• Add-One Smoothing (4.5.1 Laplace Smoothing)
  – add 1 to all frequency counts
  – simple, and no more zeros (but there are better methods)

• unigram
  – p(w) = f(w)/N    (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)    (with Add-One)
  – f*(w) = (f(w)+1) * N/(N+V)    (with Add-One)
    • V = number of distinct words in the corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One

• bigram
  – p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (before Add-One)
  – p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)    (after Add-One)
  – f*(wn-1wn) = (f(wn-1wn)+1) * f(wn-1)/(f(wn-1)+V)    (after Add-One)

• must rescale so that total probability mass stays at 1
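A minimal sketch of the Add-One formulas above for bigrams; the counts are toy values standing in for counts from a real training corpus:

```python
# Add-One (Laplace) smoothing for bigrams, following the slide's formulas:
#   p(wn|wn-1)  = (f(wn-1 wn) + 1) / (f(wn-1) + V)
#   f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)

from collections import Counter

unigram_f = Counter({"i": 3, "want": 2, "food": 2})   # toy counts
bigram_f = Counter({("i", "want"): 2})
V = len(unigram_f)          # number of distinct word types

def p_addone(w, prev):
    """Add-One smoothed conditional probability p(w | prev)."""
    return (bigram_f[(prev, w)] + 1) / (unigram_f[prev] + V)

def adjusted_count(prev, w):
    """Reconstituted count f*: the count that would give p_addone under MLE."""
    return (bigram_f[(prev, w)] + 1) * unigram_f[prev] / (unigram_f[prev] + V)

print(p_addone("want", "i"))        # (2+1)/(3+3) = 0.5
print(p_addone("food", "want"))     # (0+1)/(2+3) = 0.2 -- no more zeros
print(adjusted_count("i", "want"))  # (2+1)*3/(3+3) = 1.5
```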

Page 16

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
  – f*(wn-1wn) = (f(wn-1wn)+1) * f(wn-1)/(f(wn-1)+V)

• frequencies (= figures 6.4 and 6.8)

Remarks: perturbation problem
  – add-one causes large changes in some frequencies due to the relative size of V (1616)
  – want to: 786 → 338

Page 17

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
  – f*(wn-1wn) = (f(wn-1wn)+1) * f(wn-1)/(f(wn-1)+V)

• Probabilities (= figures 6.5 and 6.7)

Remarks: perturbation problem
  – similar changes in probabilities

Page 18

Smoothing and N-grams

• let’s illustrate the problem
  – take the bigram case: wn-1wn
  – p(wn|wn-1) = f(wn-1wn)/f(wn-1)
  – suppose there are cases wn-1w01 ... wn-1w0m that don’t occur in the corpus

[diagram: the probability mass f(wn-1) is divided among the observed bigrams f(wn-1wn); the unseen bigrams have f(wn-1w01) = ... = f(wn-1w0m) = 0 and get no mass]

Page 19

Smoothing and N-grams

• add-one
  – “give everyone 1”

[diagram: the probability mass f(wn-1) with every bigram count incremented: f(wn-1wn)+1, and f(wn-1w01) = ... = f(wn-1w0m) = 1]

Page 20

Smoothing and N-grams

• add-one
  – “give everyone 1”

[diagram: as before, with f(wn-1wn)+1 and f(wn-1w01) = ... = f(wn-1w0m) = 1; V = |{wi}| is the number of word types]

• redistribution of probability mass
  – p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)

Page 21

Smoothing and N-grams

• Excel spreadsheet available
  – addone.xls

Page 22

Smoothing and N-grams

• Good-Turing Discounting (4.5.2)
  – Nc = number of things (= n-grams) that occur c times in the corpus
  – N = total number of things seen
  – formula: the smoothed count for things seen c times is c* = (c+1) Nc+1/Nc
  – idea: use the frequency of things seen once to estimate the frequency of things we haven’t seen yet
  – estimate N0 in terms of N1…
  – formula: P*(things with zero frequency) = N1/N
  – smaller impact than Add-One

• Textbook example:
  – fishing in a lake with 8 species: bass, carp, catfish, eel, perch, salmon, trout, whitefish
  – sample data: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
  – P(new, unseen fish) = 3/18
  – P(next fish = trout) = 1/18 (but we have reassigned probability mass…)
  – c*(trout) = 2 × N2/N1 = 2 × (1/3) = 0.67 (discounted from 1)
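A short sketch that reproduces the numbers in the fishing example, using the species counts given on the slide:

```python
# Good-Turing discounting on the textbook fishing example.
# Sample: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (N = 18).
#   c* = (c+1) * N_{c+1} / N_c ;  P*(unseen) = N_1 / N

from collections import Counter

sample = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 + \
         ["trout", "salmon", "eel"]

freq = Counter(sample)          # frequency of each species
N = sum(freq.values())          # 18 fish seen
Nc = Counter(freq.values())     # Nc[c] = number of species seen exactly c times

p_unseen = Nc[1] / N                    # N1/N = 3/18
c_star_trout = (1 + 1) * Nc[2] / Nc[1]  # (c+1) * N2/N1 = 2 * (1/3)

print(p_unseen)      # 0.1666...  (= 3/18)
print(c_star_trout)  # 0.6666...  (trout's count discounted from 1)
```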

Page 23

Language Models and N-grams

• N-gram models
  – they’re technically easy to compute (in the sense that lots of training data are available)
  – but just how good are these n-gram language models?
  – and what can they show us about language?

Page 24

Language Models and N-grams

• approximating Shakespeare
  – generate random sentences using n-grams
  – train on the complete works of Shakespeare

• Unigram (pick random, unconnected words)

• Bigram

Page 25

Language Models and N-grams

• Approximating Shakespeare (section 6.2)
  – generate random sentences using n-grams
  – train on the complete works of Shakespeare

• Trigram

• Quadrigram

Remarks: dataset size problem
  – the training set is small: 884,647 words, 29,066 different words
  – 29,066² = 844,832,356 possible bigrams
  – for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
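A minimal sketch of such a random sentence generator from bigram counts; the two-sentence corpus here is just a stand-in for the tokenized works of Shakespeare:

```python
# Random sentence generation from a bigram model: at each step, sample the
# next word from p(w | previous word) estimated by relative frequency.

import random
from collections import Counter, defaultdict

corpus = [["to", "be", "or", "not", "to", "be"],
          ["the", "lady", "doth", "protest", "too", "much"]]

# For each word, collect the counts of the words that follow it.
successors = defaultdict(Counter)
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        successors[prev][cur] += 1

def generate(max_len=20):
    """Sample a sentence word by word from the bigram distribution."""
    word, out = "<s>", []
    for _ in range(max_len):
        choices = successors[word]
        word = random.choices(list(choices), weights=choices.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```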

Page 26

Language Models and N-grams

• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm

Page 27

Language Models and N-grams

• N-gram models + smoothing
  – one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability

Page 28

Colorless green ideas

• examples
  – (1) colorless green ideas sleep furiously
  – (2) furiously sleep ideas green colorless

• Chomsky (1957):
  – “. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.”

• idea
  – (1) is syntactically valid, (2) is word salad

• Statistical Experiment (Pereira 2002)

Page 29

Colorless green ideas

• examples
  – (1) colorless green ideas sleep furiously
  – (2) furiously sleep ideas green colorless

• Statistical Experiment (Pereira 2002)
  – a bigram language model over word pairs wi-1 wi
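As a rough illustration of the kind of comparison involved (this is not Pereira’s actual aggregate bigram model, just a generic Add-One smoothed bigram scorer; the counts and vocabulary size are invented):

```python
# Score both word orders under a smoothed bigram model and compare.

import math
from collections import Counter

def log_prob(words, unigram_f, bigram_f, V):
    """Sum of log Add-One smoothed bigram probabilities over the sentence."""
    return sum(math.log((bigram_f[(p, w)] + 1) / (unigram_f[p] + V))
               for p, w in zip(words, words[1:]))

# Toy counts standing in for a real training corpus (invented for illustration):
unigram_f = Counter({"<s>": 100, "colorless": 2, "green": 50, "ideas": 40,
                     "sleep": 30, "furiously": 5})
bigram_f = Counter({("green", "ideas"): 3, ("ideas", "sleep"): 2})
V = 10000   # assumed vocabulary size

s1 = ["<s>", "colorless", "green", "ideas", "sleep", "furiously"]
s2 = ["<s>", "furiously", "sleep", "ideas", "green", "colorless"]
print(log_prob(s1, unigram_f, bigram_f, V))  # higher (less negative) ...
print(log_prob(s2, unigram_f, bigram_f, V))  # ... than the scrambled order
```

With counts from a large corpus, sentence (1) comes out far more probable than (2), even though neither occurs verbatim in the training data.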

Page 30

Interesting things to Google

• example
  – colorless green ideas sleep furiously

• first hit

Page 31

Interesting things to Google

• example
  – colorless green ideas sleep furiously

• first hit: compositional semantics
  – a green idea is, according to well-established usage of the word "green", one that is new and untried
  – again, a colorless idea is one without vividness, dull and unexciting
  – so it follows that a colorless green idea is a new, untried idea that is without vividness, dull and unexciting
  – to sleep is, among other things, to be in a state of dormancy or inactivity, or in a state of unconsciousness
  – to sleep furiously may seem a puzzling turn of phrase, but one reflects that the mind in sleep often indeed moves furiously, with ideas and images flickering in and out

Page 32

Interesting things to Google

• example
  – colorless green ideas sleep furiously

• another hit (a story):
  – "So this is our ranking system," said Chomsky. "As you can see, the highest rank is yellow."
  – "And the new ideas?"
  – "The green ones? Oh, the green ones don't get a color until they've had some seasoning. These ones, anyway, are still too angry. Even when they're asleep, they're furious. We've had to kick them out of the dormitories - they're just unmanageable."
  – "So where are they?"
  – "Look," said Chomsky, and pointed out of the window. There below, on the lawn, the colorless green ideas slept, furiously.