
LING 438/538 Computational Linguistics

Sandiway Fong

Lecture 20: 11/2

Today’s Topics

• Conclude the n-gram section – (Chapter 6)

• Start Part-of-Speech (POS) Tagging section – (Chapter 8)

N-grams for Spelling Correction

• Spelling Errors – see Chapter 5

• (Kukich, 1992):

– Non-word detection (easiest)
• graffe (giraffe)

– Isolated-word (context-free) error correction
• graffe (giraffe, …)
• graffed (gaffed, …)
• by definition, cannot correct when the error word is a valid word

– Context-dependent error detection and correction (hardest)
• your an idiot → you’re an idiot
• (Microsoft Word corrects this by default)

N-grams for Spelling Correction

• Context-sensitive spelling correction
– real-word error detection
• when the mistyped word happens to be a real word
• estimated proportion of spelling errors that are real-word errors:
• 15% Peterson (1986)
• 25%–40% Kukich (1992)

– examples (local)
• was conducted mainly be John Black
• leaving in about fifteen minuets
• all with continue smoothly

– examples (global)
• Won’t they heave if next Monday at that time?
• the system has been operating system with all four units on-line

N-grams for Spelling Correction

• Given a word sequence:
– W = w1 … wk … wn

• Suppose wk is misspelled
• Suppose the candidate corrections are wk^1, wk^2, etc.
– the candidates wk^1, wk^2, etc. can be generated via edit-distance operations

• Find the most likely corrected sequence
– w1 … wk^i … wn
– i.e. maximize p(w1 … wk^i … wn)

• Chain Rule
– p(w1 w2 w3 … wn) = p(w1) p(w2|w1) p(w3|w1w2) … p(wn|w1 … wn-2 wn-1)
– (a bigram-based ranking of such candidates is sketched below)
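For concreteness, here is a minimal Python sketch of the idea: candidate corrections produced by edit-distance operations are ranked by the probability the language model assigns to the whole sequence (using the bigram approximation introduced on the next slide). The bigram table and its probabilities are made-up illustrative values, not estimates from a real corpus.

```python
# Minimal sketch: rank spelling-correction candidates by sentence probability.
# The bigram probabilities are made-up illustrative numbers, not real counts.

bigram_p = {
    ("<s>", "leaving"): 0.01, ("leaving", "in"): 0.2, ("in", "about"): 0.05,
    ("about", "fifteen"): 0.1, ("fifteen", "minutes"): 0.3, ("fifteen", "minuets"): 0.0001,
}

def sentence_prob(words, p, unseen=1e-6):
    """p(w1..wn) ~ product of p(wi | wi-1) under the bigram approximation."""
    prob = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        prob *= p.get((prev, cur), unseen)
    return prob

context = "leaving in about fifteen".split()
candidates = ["minutes", "minuets"]     # e.g. produced by edit-distance operations
best = max(candidates, key=lambda w: sentence_prob(context + [w], bigram_p))
print(best)                             # -> "minutes"
```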

N-grams for Spelling Correction

• Use an n-gram language model for P(W)

• Bigram
– chain rule: p(w1 w2 w3 … wn) = p(w1) p(w2|w1) p(w3|w1w2) … p(wn|w1 … wn-1)
– bigram approximation: p(w1 w2 w3 … wn) ≈ p(w1) p(w2|w1) p(w3|w2) … p(wn|wn-1)

• Trigram
– chain rule: p(w1 w2 w3 w4 … wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) … p(wn|w1 … wn-1)
– trigram approximation: p(w1 w2 w3 … wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) … p(wn|wn-2 wn-1)

• Problem:
– we need to estimate n-grams containing the “misspelled” items
– where to find data?
– acute sparse-data problem

• Possible solution (“class-based n-grams”)
– use a part-of-speech n-gram model (more data per n-gram)
– p(c1 c2 c3 … cn) ≈ p(c1) p(c2|c1) p(c3|c2) … p(cn|cn-1) (bigram)
– ci = category label (N, V, A, P, Adv, Conj, etc.)
– (see the sketch below)
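A hedged sketch of the class-based idea: map each word to a part-of-speech category and score the category sequence with a category bigram model. The word-to-category dictionary and the category bigram probabilities below are invented for illustration only.

```python
# Sketch of the class-based idea: score the POS-category sequence instead of
# the word sequence.  Categories and probabilities are illustrative only.

word_to_cat = {"the": "Det", "walk": "N", "walks": "V"}

cat_bigram_p = {("Det", "N"): 0.5, ("Det", "V"): 0.01}

def cat_sequence_prob(words, unseen=1e-6):
    cats = [word_to_cat[w] for w in words]
    prob = 1.0
    for prev, cur in zip(cats, cats[1:]):
        prob *= cat_bigram_p.get((prev, cur), unseen)
    return prob

# "the walk" (Det N) outscores "the walks" (Det V) even if neither word-level
# bigram was ever observed in the training data.
print(cat_sequence_prob(["the", "walk"]) > cat_sequence_prob(["the", "walks"]))  # True
```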

Entropy

• Uncertainty measure (Shannon)
– given a random variable x with r possible outcomes
– entropy = H(x) = Shannon uncertainty
– H(x) = −Σ (i = 1 to r) pi lg pi
– lg = log2 (log base 2)
– pi = probability that the outcome is i

– Biased coin (r = 2, p = 0.8):
• −0.8 lg 0.8 − 0.2 lg 0.2 = 0.258 + 0.464 = 0.722

– Unbiased coin:
• −2 × 0.5 lg 0.5 = 1

• Perplexity
– (average) branching factor
– weighted average number of choices a random variable has to make
– Formula: 2^H
• directly related to the entropy value H

• Examples
– Biased coin: 2^0.722 ≈ 1.65
– Unbiased coin: 2^1 = 2
– (verified numerically in the sketch below)
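The coin figures above can be checked with a few lines of Python; this is just a numerical illustration of H(x) = −Σ pi lg pi and perplexity = 2^H.

```python
import math

def entropy(probs):
    """H(x) = -sum_i p_i * log2 p_i  (terms with p_i = 0 contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for name, dist in [("biased", [0.8, 0.2]), ("unbiased", [0.5, 0.5])]:
    h = entropy(dist)
    print(name, round(h, 3), "bits; perplexity =", round(2 ** h, 2))
# biased   0.722 bits; perplexity = 1.65
# unbiased 1.0   bits; perplexity = 2.0
```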

Entropy and Word Sequences

• given a word sequence:
– W = w1 … wn

• Entropy for word sequences of length n in language L
– H(w1 … wn) = −Σ p(w1 … wn) log p(w1 … wn)
– summed over all sequences of length n in language L

• Entropy rate for word sequences of length n
– (1/n) H(w1 … wn) = −(1/n) Σ p(w1 … wn) log p(w1 … wn)

• Entropy rate of the language
– H(L) = lim (n → ∞) −(1/n) Σ p(w1 … wn) log p(w1 … wn)
– n is the number of words in the sequence

• Shannon-McMillan-Breiman theorem
– H(L) = lim (n → ∞) −(1/n) log p(w1 … wn)
– for sufficiently large n it is possible to take a single sequence
• instead of summing over all possible w1 … wn
• a long sequence will contain many shorter sequences
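A small sketch of a Shannon-McMillan-Breiman style estimate: score one (ideally very long) sequence under a model and divide by its length. The stand-in unigram model and its probabilities are illustrative only; a real estimate would use a far longer sequence and a better model.

```python
import math

def entropy_rate_estimate(words, logprob):
    """Shannon-McMillan-Breiman style estimate: -(1/n) log2 p(w1..wn),
    computed on a single long sequence instead of summing over all sequences."""
    return -logprob(words) / len(words)

# Stand-in model: a unigram distribution (illustrative numbers only).
unigram_p = {"the": 0.07, "dog": 0.001, "barked": 0.0005}

def unigram_logprob(words, unseen=1e-7):
    return sum(math.log2(unigram_p.get(w, unseen)) for w in words)

print(entropy_rate_estimate(["the", "dog", "barked"], unigram_logprob))  # bits per word
```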

Cross-Entropy

• evaluate competing models

• compare two probabilistic models m1 and m2, i.e. approximations of the true distribution p

• Compute the cross-entropy of mi on p:
– H(p, mi) = lim (n → ∞) −(1/n) Σ p(w1 … wn) log mi(w1 … wn)

• Shannon-McMillan-Breiman version:
– H(p, mi) = lim (n → ∞) −(1/n) log mi(w1 … wn)

• H(p) ≤ H(p, mi)
– the true entropy is a lower bound

• the mi with the lowest H(p, mi) is the more accurate model

Perplexity of Language Models

• [see p. 228: section 6.7]
• n-gram models
• corpus:
– 38 million words from the WSJ (Wall Street Journal)

• Compute the perplexity of each model on a test set of 1.5 million words
• Perplexity defined as
– 2^H(p, mi)

• Results:
– unigram 962
– bigram 170
– trigram 109
– the lower the perplexity, the more closely the trained model follows the data
– (a toy version of this comparison is sketched below)
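A toy illustration of this kind of comparison: estimate H(p, mi) on held-out data via the Shannon-McMillan-Breiman formula and report perplexity 2^H for each competing model. The two model tables here are tiny invented stand-ins, not the WSJ-trained n-gram models of the slide.

```python
import math

def cross_entropy(test_words, model_logprob):
    """H(p, m) estimated on held-out data: -(1/n) log2 m(w1..wn)."""
    return -model_logprob(test_words) / len(test_words)

def perplexity(test_words, model_logprob):
    return 2 ** cross_entropy(test_words, model_logprob)

# Two toy competing models m1, m2 (illustrative probabilities only).
m1 = {"colorless": 0.01, "green": 0.02, "ideas": 0.01}
m2 = {"colorless": 0.001, "green": 0.02, "ideas": 0.002}

def make_logprob(table, unseen=1e-6):
    return lambda ws: sum(math.log2(table.get(w, unseen)) for w in ws)

test = ["colorless", "green", "ideas"]
for name, m in [("m1", m1), ("m2", m2)]:
    print(name, round(perplexity(test, make_logprob(m)), 1))
# the model with the lower perplexity (here m1) models the test data more closely
```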

Entropy of English

• Shannon Experiment
– given a (hidden) sequence of characters
– ask a speaker of the language to predict what the next character might be
– record the number of guesses taken to get the right character
– estimate H(English) = −Σ p(guess = c) log p(guess = c)
• where the guess c ranges over all 27 characters (the 26 letters and space)

Entropy of English

• Shannon Experiment– 1.3 bits (recorded)– URL: http://www.math.ucsd.edu/~crypto/java/ENTROPY/– “We should also mention that in a classroom of about 60 students, with

everybody venturing guesses for each next letter, we consistently obtained a value of about 1.6 bits for the estimate of the entropy.”

Entropy of English

• Word-based method– train a very good stochastic model m of English on a large

corpus– Use it to assign a log-probability to a very long sequence

• Shannon-McMillan-Breiman formula:
– H(p, m) = lim (n → ∞) −(1/n) log m(w1 … wn)
– H(English) ≤ H(p, m)

– Result:
• 1.75 bits per character (Brown et al.)
• 583-million-word corpus to train model m
• test sequence was the Brown corpus (1 million words)
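As a rough sketch of how a word-level log-probability is converted into bits per character: divide the total negative log2 probability of the sequence by its number of characters (spaces included). The toy word model below is illustrative only.

```python
import math

# Toy word model (illustrative probabilities only).
toy_p = {"the": 0.07, "cat": 0.002, "sat": 0.001}

def logprob(words, unseen=1e-7):
    return sum(math.log2(toy_p.get(w, unseen)) for w in words)

def bits_per_character(words):
    """-log2 m(w1..wn) divided by the number of characters (spaces included),
    i.e. the per-character cross-entropy of the model on this sequence."""
    return -logprob(words) / len(" ".join(words))

print(round(bits_per_character(["the", "cat", "sat"]), 2))
```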

Next Topic

• Chapter 8: – Word Classes and Part-of-Speech Tagging

Parts-of-Speech

• Divide words into classes based on grammatical function– nouns (open-class: unlimited set)

• referential items (denoting objects/concepts etc.)
– proper nouns: John
– pronouns: he, him, she, her, it
– anaphors: himself, herself (reflexives)
– common nouns: dog, dogs, water
» number: dog (singular), dogs (plural)
» count-mass distinction: many dogs, *many waters
– eventive nouns: dismissal, concert, playback, destruction (deverbal)

• nonreferential items
– it, as in “it is important to study”
– there, as in “there seems to be a problem”
– some languages don’t have these: e.g. Japanese

• open-class
– factoid, email, bush-ism

Parts-of-Speech

• Pronouns:
– it, I, he, you, his, they, this, that, she, her, we, all, which, their, what

figure 8.4

Parts-of-Speech

• Divide words into classes based on grammatical function– verbs (closed-class: fixed set)

• auxiliaries– be (passive, progressive)– have (pluperfect tense)– do (what did John buy?, Did Mary win?)– modals: can, could, would, will, may

• Irregular: – is, was, were, does, did

figure 8.5

Parts-of-Speech

• Divide words into classes based on grammatical function– verbs (open-class: unlimited set)

• Intransitive– unaccusatives: arrive (achievement)– unergatives: run, jog (activities)

• Transitive– actions: hit (semelfactive: hit the ball for an hour)– actions: eat, destroy (accomplishment)– psych verbs: frighten (x frightens y), fear (y fears x)

• Ditransitive– put (x put y on z, *x put y)– give (x gave y z, *x gave y, x gave z to y)– load (x loaded y (on z), x loaded z (with y))

– Open-class: • reaganize, email, fax

Parts-of-Speech

• Divide words into classes based on grammatical function– adjectives (open-class: unlimited set)

• modify nouns
• black, white, open, closed, sick, well
• attributive: black (black car, the car is black), main (main street, *the street is main), atomic
• predicative: afraid (*afraid child, the child is afraid)
• stage-level: drunk (there is a man drunk in the pub)
• individual-level: clever, short, tall (*there is a man tall in the bar)
• object-taking: proud (proud of him, *well of him)
• intersective: red (red car: the intersection of the set of red things and the set of cars)
• non-intersective: former (former architect), atomic (atomic scientist)
• comparative, superlative: blacker, blackest, *opener, *openest

– open-class:• hackable, spammable

Parts-of-Speech

• Divide words into classes based on grammatical function– adverbs (open-class: unlimited set)

• modify verbs (and also adjectives and other adverbs)
• manner: slowly (moved slowly)
• degree: slightly, more (more clearly), very (very bad), almost
• sentential: unfortunately, suddenly
• question: how
• temporal: when, soon, yesterday (noun?)
• location: sideways, here (John is here)

– open-class:• spam-wise

Parts-of-Speech

• Divide words into classes based on grammatical function
– prepositions (closed-class: fixed set)
– come before an object and assign it a semantic function (from Mars, *Mars from)

• head-final languages: postpositions (Japanese: amerika-kara)

– location: on, in, by– temporal: by, until

figure 8.1

Parts-of-Speech

• Divide words into classes based on grammatical function
– particles (closed-class: fixed set)
– resemble a preposition or adverb; often combine with a verb to form a phrasal verb
– went on, finish up
– throw sleep off (throw off sleep)

• single-word particles (Quirk, 1985):

figure 8.2

Parts-of-Speech

• Divide words into classes based on grammatical function– conjunctions (closed-class: fixed set)– used to join two phrases, clauses or sentences– coordinating conjunctions: and, or, but– subordinating conjunctions: that (complementizer)

figure 8.3

Part-of-Speech (POS) Tagging

• Idea:– assign the right part-of-speech tag, e.g. noun, verb,

conjunction, to a word– useful for shallow parsing – or as first stage of a deeper/more sophisticated system

• Question:– Is it a hard task?

• i.e. can’t we look the words up in a dictionary?

• Answer:
– Yes: ambiguity.
– No: POS taggers typically claim 95%+ accuracy

Part-of-Speech (POS) Tagging

• example– walk: noun, verb

• the walk I took … : noun
• I walk 2 miles every day : verb

– as a shallow parsing tool: • can we do this without fully parsing the sentence?

• example– still: noun, adjective, adverb, verb

• the still of the night, a glass still
• still waters
• stand still
• still struggling
• Still, I didn’t give way
• still your fear of the dark (transitive)
• the bubbling waters stilled (intransitive)

POS Tagging

• Task:– assign the right part-of-speech tag, e.g. noun, verb,

conjunction, to a word in context

• POS taggers
– need to be fast in order to process large corpora
• should take no more than time linear in the size of the corpus

– full parsing is slow
• e.g. context-free parsing takes O(n³) time, where n is the length of the sentence

– POS taggers try to assign the correct tag without actually parsing the sentence

POS Tagging

• Components:– Dictionary of words

• Exhaustive list of closed-class items
– Examples:
» the, a, an: determiner
» from, to, of, by: preposition
» and, or: coordinating conjunction

• Large set of open class (e.g. noun, verbs, adjectives) items with frequency information

POS Tagging

• Components:– Mechanism to assign tags

• Context-free: by frequency
• Context-dependent: bigram, trigram, HMM, hand-coded rules
– Example:
» the walk … : Det Noun, *Det Verb

– Mechanism to handle unknown words (not in the dictionary)
• Capitalization
• Morphology: -ed, -tion
• (a sketch combining these components follows below)
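A minimal sketch of how these components might fit together, assuming a hypothetical frequency lexicon and toy heuristics; it is not a real tagger, just the shape of one.

```python
# Sketch of the components above: a frequency dictionary, a most-frequent-tag
# assignment, and simple heuristics for unknown words.  The lexicon entries and
# counts are illustrative only.

lexicon = {
    "the":  {"DT": 1000},
    "walk": {"NN": 60, "VB": 40},     # the noun reading is more frequent here
    "dogs": {"NNS": 80, "VBZ": 5},
}

def tag_word(word):
    if word in lexicon:
        tags = lexicon[word]
        return max(tags, key=tags.get)   # context-free: pick the most frequent tag
    # unknown-word heuristics
    if word[0].isupper():
        return "NNP"                     # capitalization -> proper noun
    if word.endswith("ed"):
        return "VBD"                     # morphology: -ed -> past-tense verb
    if word.endswith("tion"):
        return "NN"                      # morphology: -tion -> noun
    return "NN"                          # default guess

print([tag_word(w) for w in ["the", "walk", "emailed", "globalization", "Fong"]])
# ['DT', 'NN', 'VBD', 'NN', 'NNP']
```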

How Hard is Tagging?

• Brown Corpus (Francis & Kucera, 1982):– 1 million words– 39K distinct words– 35K words with only 1 tag– 4K with multiple tags (DeRose, 1988)

figure 8.7

How Hard is Tagging?

• Easy task to do well on:– naïve algorithm

• assign tag by (unigram) frequency

– 90% accuracy (Charniak et al., 1993)
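A sketch of this naive baseline on a tiny made-up tagged corpus: assign every token its most frequent tag and measure token accuracy. The corpus here is invented; on real corpora such as the Brown Corpus this baseline is the ~90% figure cited above.

```python
from collections import Counter, defaultdict

# Naive baseline: tag every token with its most frequent tag in a (tiny,
# made-up) tagged corpus, then measure how many tokens that gets right.

tagged_corpus = [("the", "DT"), ("walk", "NN"), ("was", "VBD"),
                 ("long", "JJ"), ("I", "PRP"), ("walk", "VBP"),
                 ("the", "DT"), ("dog", "NN"), ("walk", "NN")]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

correct = sum(most_frequent[w] == t for w, t in tagged_corpus)
print(f"baseline accuracy: {correct / len(tagged_corpus):.0%}")   # 89% here
```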

Penn TreeBank Tagset

• 48-tag simplification of the Brown Corpus tagset
• Examples:

1. CC Coordinating conjunction

3. DT Determiner

7. JJ Adjective

11. MD Modal

12. NN Noun (singular, mass)

13. NNS Noun (plural)

27. VB Verb (base form)

28. VBD Verb (past)

Penn TreeBank Tagset
www.ldc.upenn.edu/doc/treebank2/cl93.html

1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition/subordinating conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol (mathematical or scientific)

Penn TreeBank Tagset
www.ldc.upenn.edu/doc/treebank2/cl93.html

25 TO to
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund/present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd ps. sing. present
32 VBZ Verb, 3rd ps. sing. present
33 WDT wh-determiner
34 WP wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB wh-adverb
37 # Pound sign
38 $ Dollar sign
39 . Sentence-final punctuation
40 , Comma
41 : Colon, semi-colon
42 ( Left bracket character
43 ) Right bracket character
44 " Straight double quote
45 ` Left open single quote
46 `` Left open double quote
47 ' Right close single quote
48 '' Right close double quote

Penn TreeBank Tagset

• How many tags?– Tag criterion

• Distinctness with respect to grammatical behavior?

– Make tagging easier?

• Punctuation tags
– Penn Treebank numbers 37–48

• Trivial computational task

Penn TreeBank Tagset

• Simplifications:– Tag TO:

• infinitival marker, preposition• I want to win• I went to the store

– Tag IN:• preposition: that, when, although • I know that I should have stopped, although…• I stopped when I saw Bill

Penn TreeBank Tagset

• Simplifications:– Tag DT:

• determiner: any, some, these, those• any man• these *man/men

– Tag VBP: • verb, present: am, are, walk• Am I here?• *Walked I here?/Did I walk here?

Hard to Tag Items

• Syntactic Function– Example:

• resultative

• I saw the man tired from running

• Examples (from the Brown Corpus Manual)
– Hyphenation:
• long-range, high-energy
• shirt-sleeved
• signal-to-noise

– Foreign words:
• mens sana in corpore sano