LING/C SC 581: Advanced Computational Linguistics
Lecture Notes, Jan 22nd
Today's Topics
• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex
Minimum Edit Distance Homework
• Background:
– … about 20% of the time "Britney Spears" is misspelled when people search for it on Google
• Software for generating misspellings
– If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework
• http://www.google.com/jobs/archive/britney.html
Top six misspellings
• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible)
– e.g. ED(brittany) < ED(britany)
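Below is a minimal sketch of the standard dynamic-programming (Wagner-Fischer) computation of edit distance in Python; the function name and the default costs are illustrative assumptions, and the point of the homework is to adjust the weights (or add operations) so that the ranking comes out right.

# Minimal weighted edit distance sketch (Wagner-Fischer dynamic programming).
# The cost values are hypothetical defaults, not the homework answer.
def edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=2.0):
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of turning source[:i] into target[:j]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,     # delete source[i-1]
                          d[i][j - 1] + ins_cost,     # insert target[j-1]
                          d[i - 1][j - 1] + sub)      # substitute (or match)
    return d[n][m]

# rank candidate misspellings against the intended spelling
for w in ["brittany", "britany"]:
    print(w, edit_distance(w, "britney"))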
Minimum Edit Distance Homework
• Submit your homework in PDF
– how many you got right
– explain your criteria, e.g. the weights you chose
• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
• due by email to me before next Thursday's class
– put your name and 581 at the top of your submission
Part 2
• Corpora: frequency information
• Unlabeled corpus: just words (easy to find)
• Labeled corpus: various kinds, progressively harder to create or obtain
– POS information
– Information about phrases
– Word sense or semantic role labeling
Language Models and N-grams
• given a word sequence
– w1 w2 w3 ... wn
• chain rule
– how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
• note
– it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences
Language Models and N-grams
• Given a word sequence
– w1 w2 w3 ... wn
• Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite length history
– 1st order Markov model
– chain rule: p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
– approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
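As a toy illustration of the bigram factorization, the sketch below multiplies out p(w1) p(w2|w1) ... p(wn|wn-1) for a short sentence; the words and probability values are invented for the example, not estimated from any corpus.

# Toy bigram scoring: p(w1 w2 ... wn) ≈ p(w1) * product of p(wi | wi-1).
# The probabilities below are made up purely for illustration.
p_unigram = {"I": 0.05}
p_bigram = {("I", "want"): 0.02, ("want", "to"): 0.30, ("to", "eat"): 0.10}

def bigram_sentence_prob(words):
    prob = p_unigram[words[0]]              # p(w1)
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]       # p(wn | wn-1)
    return prob

print(bigram_sentence_prob(["I", "want", "to", "eat"]))   # 0.05*0.02*0.30*0.10 = 3e-05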
Language Models and N-grams
• Trigram approximation
– 2nd order Markov model
– just look at the preceding two words only
– chain rule: p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
– approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
• note
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
Language Models and N-grams
• estimating from corpora
– how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w), where w is any word
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1:
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
• Note:
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
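A minimal sketch of MLE bigram estimation on a tiny made-up corpus (the corpus is purely illustrative); the counts are gathered in a single pass and the probability is just the relative frequency above.

# MLE bigram estimation: p(wn|wn-1) = f(wn-1 wn) / f(wn-1).
from collections import Counter

corpus = "I want to eat I want to sleep I want food".split()   # toy data
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))

def p_mle(prev, word):
    return bigram_f[(prev, word)] / unigram_f[prev]

print(p_mle("I", "want"))    # 3/3 = 1.0 in this toy corpus
print(p_mle("want", "to"))   # 2/3 ≈ 0.67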
Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• what happens when any individual probability component is zero?
– Arithmetic multiplication law: 0 × X = 0
– very brittle!
• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
Language Models and N-grams
• Example (figure): unigram frequencies, wn-1 wn bigram frequencies, and the resulting bigram probabilities
– sparse matrix: zeros render probabilities unusable
– (we'll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
• a sparse dataset means zeros are a problem
– zero probabilities are a problem
• p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
• one zero and the whole product is zero
– zero frequencies are a problem
• p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
• the bigram f(wn-1 wn) doesn't exist in the dataset
• smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams
• Add-One Smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
• unigram
– p(w) = f(w)/N    (before Add-One)
• N = size of corpus
– p(w) = (f(w)+1)/(N+V)    (with Add-One)
– f*(w) = (f(w)+1)*N/(N+V)    (with Add-One)
• V = number of distinct words in the corpus
• N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)    (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)    (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)    (after Add-One)
• must rescale so that total probability mass stays at 1
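A minimal sketch of the Add-One formulas on the same kind of toy data (the corpus is invented for illustration); note how an unseen bigram now gets a small non-zero probability.

# Add-One (Laplace) smoothing for bigram probabilities and reconstituted counts.
from collections import Counter

corpus = "I want to eat I want to sleep I want food".split()   # toy data
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))
V = len(unigram_f)                                              # distinct word types

def p_addone(prev, word):
    # (f(wn-1 wn) + 1) / (f(wn-1) + V)
    return (bigram_f[(prev, word)] + 1) / (unigram_f[prev] + V)

def f_star(prev, word):
    # reconstituted count, comparable to the original frequency table
    return (bigram_f[(prev, word)] + 1) * unigram_f[prev] / (unigram_f[prev] + V)

print(p_addone("want", "to"))    # seen bigram: probability is discounted
print(p_addone("eat", "food"))   # unseen bigram: non-zero (1/7 here)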
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
• frequencies (= textbook figures 6.4 and 6.8)
• Remarks: perturbation problem
– add-one causes large changes in some frequencies due to the relative size of V (1616)
– e.g. 'want to': 786 → 338
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
• probabilities (= textbook figures 6.5 and 6.7)
• Remarks: perturbation problem
– similar changes in the probabilities
Smoothing and N-grams
• let’s illustrate the problem– take the bigram case:– wn-1wn
– p(wn|wn-1) = f(wn-1wn)/f(wn-1)
– suppose there are cases– wn-1wzero
1 that don’t occur in the corpus
probability mass
f(wn-1)
f(wn-1wn)
f(wn-1wzero1)=0
f(wn-1wzerom)=0
...
Smoothing and N-grams
• add-one
– "give everyone 1"
• (figure: probability mass) each of the V possible continuations gets one extra count: observed bigrams become f(wn-1 wn)+1, and unseen bigrams f(wn-1 w01) = ... = f(wn-1 w0m) = 1, where V = |{wi}|
• redistribution of probability mass
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
Smoothing and N-grams
• Good-Turing Discounting (4.5.2)
– Nc = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– Formula: the smoothed count c* for things with count c is given by c* = (c+1)Nc+1/Nc
– Idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet
– estimate N0 in terms of N1 ...
– and so on; but if Nc = 0, smooth that first, using something like log(Nc) = a + b log(c)
– Formula: P*(things with zero frequency) = N1/N
– smaller impact than Add-One
• Textbook example:
– Fishing in a lake with 8 species
• bass, carp, catfish, eel, perch, salmon, trout, whitefish
– Sample data (6 out of 8 species):
• 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18
• (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula ...)
– revised count for trout: c*(trout) = 2*N2/N1 = 2(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
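The numbers in the fishing example can be re-derived mechanically; this sketch just plugs the sample counts into the Good-Turing formulas above.

# Good-Turing sanity check for the fishing example.
from collections import Counter

sample = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(sample.values())                 # 18 fish observed
Nc = Counter(sample.values())            # N1 = 3, N2 = 1, N3 = 1, N10 = 1

p_unseen = Nc[1] / N                     # N1/N = 3/18 ≈ 0.17
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # (c+1)*Nc+1/Nc = 2 * 1/3 ≈ 0.67
p_trout = c_star_trout / N               # ≈ 0.037

print(p_unseen, c_star_trout, p_trout)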
Language Models and N-grams
• N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams
• N-gram models
– data is easy to obtain
• any unlabeled corpus will do
– they're technically easy to compute
• count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?
Language Models and N-grams
• Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Unigram (pick random, unconnected words)
• Bigram
Language Models and N-grams
• Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Trigram
• Quadrigram
Remarks: dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066² = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
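A minimal sketch of such a generator for the bigram case; the toy corpus stands in for the Shakespeare data, and the stopping rule (end at '.' or after a length cap) is an assumption made for the example.

# Random sentence generation from bigram counts.
import random
from collections import Counter, defaultdict

corpus = "I want to eat . I want to sleep . I want food .".split()   # toy data

successors = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    successors[prev][cur] += 1            # f(wn-1 wn)

def generate(start, max_len=10):
    words = [start]
    while len(words) < max_len and words[-1] != ".":
        nexts = successors[words[-1]]
        # sample the next word in proportion to its bigram frequency
        words.append(random.choices(list(nexts), weights=list(nexts.values()))[0])
    return " ".join(words)

print(generate("I"))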
Language Models and N-grams
• A limitation:
– produces ungrammatical sequences
• Treebank:
– potential to be a better language model
– Structural information: contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English ...
Language Models and N-grams
• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3
tregex
• I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex
Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/
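For inspecting the bracketed trees programmatically, here is a minimal sketch using NLTK's Tree class; NLTK is an assumption here (it is not required for the course), and the tree string is a constructed example rather than an actual Treebank tree.

# Parse a LISP S-expression tree string and walk its nodes.
from nltk import Tree

s_expr = "(S (NP-SBJ (DT The) (NN cat)) (VP (VBD sat)))"
t = Tree.fromstring(s_expr)
print(t.label())                # S
for subtree in t.subtrees():
    print(subtree.label())      # every constituent label, top-down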
tregex
• Search Example: << dominates, < immediately dominates
tregex
• Help
tregex
• Help: tregex expression syntax is non-standard wrt bracketing
S < VP
S < NP
tregex
• Help: tregex boolean syntax is also non-standard
tregex
• Pattern:
– (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
• Key:
– <,  first child
– $+  immediate left sister
– <-  last child
– =comma  same node
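For illustration, a constructed tree (not taken from the Treebank) that this pattern would match:

(NP (NP (NNP John) (NNP Smith))
    (, ,)
    (NP (DT the) (NN president))
    (, ,))

The outer NP's first child is an NP, which is immediately followed by a comma, then another NP, then a final comma; that final comma is also the outer NP's last child, so both occurrences of =comma refer to the same node.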
tregex
tregex
• Different results from:
– @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
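For illustration, a constructed SBAR in Treebank style (not an actual corpus tree) in which the index on the WH-phrase matches the index on the empty *T* trace, which is what the %index version of the pattern enforces:

(SBAR (WHNP-1 (WP who))
      (S (NP-SBJ (-NONE- *T*-1))
         (VP (VBD left))))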
tregex
Example: WHADVP is also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides
• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages):
tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides
• Parsing guide 1, prsguid1.pdf (318 pages):
prsguid2.pdf: addendum for the Switchboard corpus