LING/C SC 581: Advanced Computational Linguistics
Lecture Notes, Jan 22nd
Today's Topics
• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex
Minimum Edit Distance Homework
• Background:
– … about 20% of the time "Britney Spears" is misspelled when people search for it on Google
• Software for generating misspellings
– If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework
• http://www.google.com/jobs/archive/britney.html
Top six misspellings
• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible)
– e.g. ED(brittany) < ED(britany)
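Below is a minimal sketch of the standard dynamic-programming (Wagner-Fischer) computation of edit distance in Python; the function name and the default costs are illustrative assumptions, and the point of the homework is to adjust the weights (or add operations) so that the ranking comes out right.

# Minimal weighted edit distance sketch (Wagner-Fischer dynamic programming).
# The cost values are hypothetical defaults, not the homework answer.
def edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=2.0):
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of turning source[:i] into target[:j]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,     # delete source[i-1]
                          d[i][j - 1] + ins_cost,     # insert target[j-1]
                          d[i - 1][j - 1] + sub)      # substitute (or match)
    return d[n][m]

# rank candidate misspellings against the intended spelling
for w in ["brittany", "britany"]:
    print(w, edit_distance(w, "britney"))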
Minimum Edit Distance Homework
• Submit your homework in PDF
– how many you got right
– explain your criteria, e.g. the weights you chose
• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
• due by email to me before next Thursday's class
– put your name and 581 at the top of your submission
Part 2
• Corpora: frequency information
• Unlabeled corpus: just words (easy to find)
• Labeled corpus: various kinds, progressively harder to create or obtain
– POS information
– Information about phrases
– Word sense or semantic role labeling
Language Models and N-grams
• given a word sequence
– w1 w2 w3 ... wn
• chain rule
– how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
• note
– it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences
Language Models and N-grams
• Given a word sequence
– w1 w2 w3 ... wn
• Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite length history
– 1st order Markov model
– chain rule: p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
– approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
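As a toy illustration of the bigram factorization, the sketch below multiplies out p(w1) p(w2|w1) ... p(wn|wn-1) for a short sentence; the words and probability values are invented for the example, not estimated from any corpus.

# Toy bigram scoring: p(w1 w2 ... wn) ≈ p(w1) * product of p(wi | wi-1).
# The probabilities below are made up purely for illustration.
p_unigram = {"I": 0.05}
p_bigram = {("I", "want"): 0.02, ("want", "to"): 0.30, ("to", "eat"): 0.10}

def bigram_sentence_prob(words):
    prob = p_unigram[words[0]]              # p(w1)
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]       # p(wn | wn-1)
    return prob

print(bigram_sentence_prob(["I", "want", "to", "eat"]))   # 0.05*0.02*0.30*0.10 = 3e-05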
Language Models and N-grams
• Trigram approximation
– 2nd order Markov model
– just look at the preceding two words only
– chain rule: p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
– approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
• note
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
Language Models and N-grams
• estimating from corpora
– how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w), where w is any word
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1:
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
• Note:
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
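A minimal sketch of MLE bigram estimation on a tiny made-up corpus (the corpus is purely illustrative); the counts are gathered in a single pass and the probability is just the relative frequency above.

# MLE bigram estimation: p(wn|wn-1) = f(wn-1 wn) / f(wn-1).
from collections import Counter

corpus = "I want to eat I want to sleep I want food".split()   # toy data
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))

def p_mle(prev, word):
    return bigram_f[(prev, word)] / unigram_f[prev]

print(p_mle("I", "want"))    # 3/3 = 1.0 in this toy corpus
print(p_mle("want", "to"))   # 2/3 ≈ 0.67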
Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• what happens when any individual probability component is zero?
– Arithmetic multiplication law: 0 × X = 0
– very brittle!
• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
Language Models and N-grams
• Example (figure): unigram frequencies, wn-1 wn bigram frequencies, and the resulting bigram probabilities
– sparse matrix: zeros render probabilities unusable
– (we'll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
• a sparse dataset means zeros are a problem
– zero probabilities are a problem
• p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
• one zero and the whole product is zero
– zero frequencies are a problem
• p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
• the bigram f(wn-1 wn) doesn't exist in the dataset
• smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams
• Add-One Smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
• unigram
– p(w) = f(w)/N    (before Add-One)
• N = size of corpus
– p(w) = (f(w)+1)/(N+V)    (with Add-One)
– f*(w) = (f(w)+1)*N/(N+V)    (with Add-One)
• V = number of distinct words in the corpus
• N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)    (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)    (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)    (after Add-One)
• must rescale so that total probability mass stays at 1
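A minimal sketch of the Add-One formulas on the same kind of toy data (the corpus is invented for illustration); note how an unseen bigram now gets a small non-zero probability.

# Add-One (Laplace) smoothing for bigram probabilities and reconstituted counts.
from collections import Counter

corpus = "I want to eat I want to sleep I want food".split()   # toy data
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))
V = len(unigram_f)                                              # distinct word types

def p_addone(prev, word):
    # (f(wn-1 wn) + 1) / (f(wn-1) + V)
    return (bigram_f[(prev, word)] + 1) / (unigram_f[prev] + V)

def f_star(prev, word):
    # reconstituted count, comparable to the original frequency table
    return (bigram_f[(prev, word)] + 1) * unigram_f[prev] / (unigram_f[prev] + V)

print(p_addone("want", "to"))    # seen bigram: probability is discounted
print(p_addone("eat", "food"))   # unseen bigram: non-zero (1/7 here)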
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
• frequencies (= textbook figures 6.4 and 6.8)
• Remarks: perturbation problem
– add-one causes large changes in some frequencies due to the relative size of V (1616)
– e.g. 'want to': 786 → 338
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
• probabilities (= textbook figures 6.5 and 6.7)
• Remarks: perturbation problem
– similar changes in the probabilities
Smoothing and N-grams
• let’s illustrate the problem– take the bigram case:– wn-1wn
– p(wn|wn-1) = f(wn-1wn)/f(wn-1)
– suppose there are cases– wn-1wzero
1 that don’t occur in the corpus
probability mass
f(wn-1)
f(wn-1wn)
f(wn-1wzero1)=0
f(wn-1wzerom)=0
...
Smoothing and N-grams
• add-one
– "give everyone 1"
• (figure: probability mass) each of the V possible continuations gets one extra count: observed bigrams become f(wn-1 wn)+1, and unseen bigrams f(wn-1 w01) = ... = f(wn-1 w0m) = 1, where V = |{wi}|
• redistribution of probability mass
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
Smoothing and N-grams
• Good-Turing Discounting (4.5.2)
– Nc = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– Formula: the smoothed count c* for things with count c is given by c* = (c+1)Nc+1/Nc
– Idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet
– estimate N0 in terms of N1 ...
– and so on; but if Nc = 0, smooth that first, using something like log(Nc) = a + b log(c)
– Formula: P*(things with zero frequency) = N1/N
– smaller impact than Add-One
• Textbook example:
– Fishing in a lake with 8 species
• bass, carp, catfish, eel, perch, salmon, trout, whitefish
– Sample data (6 out of 8 species):
• 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18
• (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula ...)
– revised count for trout: c*(trout) = 2*N2/N1 = 2(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
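The numbers in the fishing example can be re-derived mechanically; this sketch just plugs the sample counts into the Good-Turing formulas above.

# Good-Turing sanity check for the fishing example.
from collections import Counter

sample = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(sample.values())                 # 18 fish observed
Nc = Counter(sample.values())            # N1 = 3, N2 = 1, N3 = 1, N10 = 1

p_unseen = Nc[1] / N                     # N1/N = 3/18 ≈ 0.17
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # (c+1)*Nc+1/Nc = 2 * 1/3 ≈ 0.67
p_trout = c_star_trout / N               # ≈ 0.037

print(p_unseen, c_star_trout, p_trout)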
Language Models and N-grams
• N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams
• N-gram models
– data is easy to obtain
• any unlabeled corpus will do
– they're technically easy to compute
• count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?
Language Models and N-grams
• Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Unigram (pick random, unconnected words)
• Bigram
Language Models and N-grams
• Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Trigram
• Quadrigram
Remarks: dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066² = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
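A minimal sketch of such a generator for the bigram case; the toy corpus stands in for the Shakespeare data, and the stopping rule (end at '.' or after a length cap) is an assumption made for the example.

# Random sentence generation from bigram counts.
import random
from collections import Counter, defaultdict

corpus = "I want to eat . I want to sleep . I want food .".split()   # toy data

successors = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    successors[prev][cur] += 1            # f(wn-1 wn)

def generate(start, max_len=10):
    words = [start]
    while len(words) < max_len and words[-1] != ".":
        nexts = successors[words[-1]]
        # sample the next word in proportion to its bigram frequency
        words.append(random.choices(list(nexts), weights=list(nexts.values()))[0])
    return " ".join(words)

print(generate("I"))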
Language Models and N-grams
• A limitation:
– produces ungrammatical sequences
• Treebank:
– potential to be a better language model
– Structural information: contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English ...
Language Models and N-grams
• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3
tregex
• I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex
Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/
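For inspecting the bracketed trees programmatically, here is a minimal sketch using NLTK's Tree class; NLTK is an assumption here (it is not required for the course), and the tree string is a constructed example rather than an actual Treebank tree.

# Parse a LISP S-expression tree string and walk its nodes.
from nltk import Tree

s_expr = "(S (NP-SBJ (DT The) (NN cat)) (VP (VBD sat)))"
t = Tree.fromstring(s_expr)
print(t.label())                # S
for subtree in t.subtrees():
    print(subtree.label())      # every constituent label, top-down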
tregex
• Search Example: << dominates, < immediately dominates
tregex
• Help
tregex
• Help: tregex expression syntax is non-standard wrt bracketing
S < VP
S < NP
tregex
• Help: tregex boolean syntax is also non-standard
tregex
• Pattern:
– (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
• Key:
– <,  first child
– $+  immediate left sister
– <-  last child
– =comma  same node
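For illustration, a constructed tree (not taken from the Treebank) that this pattern would match:

(NP (NP (NNP John) (NNP Smith))
    (, ,)
    (NP (DT the) (NN president))
    (, ,))

The outer NP's first child is an NP, which is immediately followed by a comma, then another NP, then a final comma; that final comma is also the outer NP's last child, so both occurrences of =comma refer to the same node.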
tregex
tregex
• Different results from:
– @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
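For illustration, a constructed SBAR in Treebank style (not an actual corpus tree) in which the index on the WH-phrase matches the index on the empty *T* trace, which is what the %index version of the pattern enforces:

(SBAR (WHNP-1 (WP who))
      (S (NP-SBJ (-NONE- *T*-1))
         (VP (VBD left))))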
tregex
Example: WHADVP is also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides
• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages):
tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides
• Parsing guide 1, prsguid1.pdf (318 pages):
prsguid2.pdf: addendum for the Switchboard corpus