LING 388: Language and Computers Sandiway Fong Lecture 26: 12/1

LING 388: Language and Computers

Sandiway Fong

Lecture 26: 12/1

Administrivia

• Homework #4
  – returned on Monday (11/29)

• Homework #5
  – goes out Monday (12/6)
  – due Monday (12/13), exam period

Last Time

• the shape of uncertainty

• N-grams
  – unigrams, bigrams, trigrams, quadrigrams
  – conditional probability
  – chain rule

• p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1 w_2) ... p(w_n | w_1 ... w_{n-2} w_{n-1})
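As a small illustration of the chain rule, the sketch below spells out the factors for a short word sequence. The function name chain_rule_factors and the example phrase are illustrative choices, not part of the lecture.

```python
def chain_rule_factors(words):
    """Return the chain-rule factors p(w_i | w_1 ... w_{i-1}) as strings."""
    factors = []
    for i, word in enumerate(words):
        history = " ".join(words[:i])
        # The first word has no history, so its factor is just p(w_1).
        factors.append(f"p({word} | {history})" if history else f"p({word})")
    return factors


sentence = ["the", "price", "of", "gasoline"]
print(" * ".join(chain_rule_factors(sentence)))
# p(the) * p(price | the) * p(of | the price) * p(gasoline | the price of)
```

Each later factor conditions on a longer history, which is what the bigram and trigram approximations cut short.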

N-grams

• Chain rule
  – bigram/trigram approximation
    • 1st/2nd order Markov Model
    • just look at the preceding one or two words only
    • bigram: p(w_n | w_{n-1})
    • trigram: p(w_n | w_{n-2} w_{n-1})
  – Relative frequency/Maximum Likelihood Estimate (MLE)
    • bigram: f(w_{n-1} w_n) / f(w_{n-1})
    • trigram: f(w_{n-2} w_{n-1} w_n) / f(w_{n-2} w_{n-1})
    • f = frequency count
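The relative-frequency estimate can be sketched in a few lines of Python. This is a minimal sketch, not the course's code: the function name ngram_mle and the two toy sentences are assumptions made for illustration.

```python
from collections import Counter


def ngram_mle(tokens, n):
    """Estimate p(w_n | history) = f(history w_n) / f(history) for n-grams."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # f(history): how often each (n-1)-word prefix starts an n-gram.
    histories = Counter()
    for gram, count in ngrams.items():
        histories[gram[:-1]] += count
    return {gram: count / histories[gram[:-1]] for gram, count in ngrams.items()}


# Toy text for illustration only.
tokens = "StarT the price of gasoline fell . StarT the price of oil rose .".split()

bigram = ngram_mle(tokens, 2)
print(bigram[("the", "price")])    # f(the price) / f(the) = 2/2
print(bigram[("of", "gasoline")])  # f(of gasoline) / f(of) = 1/2

trigram = ngram_mle(tokens, 3)
print(trigram[("price", "of", "oil")])  # f(price of oil) / f(price of) = 1/2
```

Counting the history only where it starts an n-gram keeps the estimates for each history summing to 1.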

N-grams and Corpora

• Corpus
  – New York Times (AP) article
  – November 29th 2004
  – Gas Prices Decline to an Average $1.95 Per Gallon

N-grams and Corpora

• Contents
  • note: blank lines separate paragraphs, new sentences do not always begin with a new line

• The retail price of gasoline fell for the fourth straight week to average $1.95 per gallon nationwide, the Energy Department reported Monday.

• The government survey said the average price nationwide of regular-grade unleaded gasoline declined 0.3 cent last week to $1.945. Prices are 45.5 cents, or 31 percent, higher than a year ago.

• Pump prices are highest on the West Coast, averaging $2.157 per gallon, and cheapest on the Gulf Coast, averaging $1.841 per gallon. In the Midwest, gas averages $1.884 per gallon.

• One of the key factors behind the high price of gasoline is the soaring cost of oil -- the result of strong demand, and tight supplies of heating oil.

• The price of light crude for December delivery rose 32 cents to settle at $49.76 a barrel on the New York Mercantile Exchange. Oil is roughly 66 percent more expensive than a year ago.

• In other Nymex trading, November unleaded gasoline climbed less than a penny to $1.3029 per gallon.

• The nation's supply of commercially available gasoline is 1.7 percent higher than last year at 202.7 million barrels, according to the Energy Department. Crude inventories stand at 292.4 million barrels, or 2 percent above last year at this time.

N-grams and Corpora

• Contents
• The retail price of gasoline fell for the fourth straight week to average $1.95 per gallon nationwide, the Energy Department reported Monday.
• The government survey said the average price nationwide of regular-grade unleaded gasoline declined 0.3 cent last week to $1.945.
• Prices are 45.5 cents, or 31 percent, higher than a year ago.
• Pump prices are highest on the West Coast, averaging $2.157 per gallon, and cheapest on the Gulf Coast, averaging $1.841 per gallon.
• In the Midwest, gas averages $1.884 per gallon.
• One of the key factors behind the high price of gasoline is the soaring cost of oil -- the result of strong demand, and tight supplies of heating oil.
• The price of light crude for December delivery rose 32 cents to settle at $49.76 a barrel on the New York Mercantile Exchange.
• Oil is roughly 66 percent more expensive than a year ago.
• In other Nymex trading, November unleaded gasoline climbed less than a penny to $1.3029 per gallon.
• The nation's supply of commercially available gasoline is 1.7 percent higher than last year at 202.7 million barrels, according to the Energy Department.
• Crude inventories stand at 292.4 million barrels, or 2 percent above last year at this time.

N-grams and Corpora

• Contents (StarT = dummy word for start of sentence)
• StarT the retail price of gasoline fell for the fourth straight week to average $1.95 per gallon nationwide, the Energy Department reported Monday.
• StarT the government survey said the average price nationwide of regular-grade unleaded gasoline declined 0.3 cent last week to $1.945.
• StarT prices are 45.5 cents, or 31 percent, higher than a year ago.
• StarT pump prices are highest on the West Coast, averaging $2.157 per gallon, and cheapest on the Gulf Coast, averaging $1.841 per gallon.
• StarT in the Midwest, gas averages $1.884 per gallon.
• StarT one of the key factors behind the high price of gasoline is the soaring cost of oil -- the result of strong demand, and tight supplies of heating oil.
• StarT the price of light crude for December delivery rose 32 cents to settle at $49.76 a barrel on the New York Mercantile Exchange.
• StarT Oil is roughly 66 percent more expensive than a year ago.
• StarT in other Nymex trading, November unleaded gasoline climbed less than a penny to $1.3029 per gallon.
• StarT the nation's supply of commercially available gasoline is 1.7 percent higher than last year at 202.7 million barrels, according to the Energy Department.
• StarT crude inventories stand at 292.4 million barrels, or 2 percent above last year at this time.
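The start-of-sentence marking shown on this slide could be done with a helper along the following lines. It is a sketch under two assumptions: the text has already been split into one sentence per string, and the first letter is lowercased as in most of the lines above. The name mark_sentence_start is hypothetical.

```python
def mark_sentence_start(sentence, marker="StarT"):
    """Prefix a sentence with a dummy start token and lowercase its first letter."""
    sentence = sentence.strip()
    return f"{marker} {sentence[:1].lower()}{sentence[1:]}"


sentences = [
    "In the Midwest, gas averages $1.884 per gallon.",
    "Prices are 45.5 cents, or 31 percent, higher than a year ago.",
]
for s in sentences:
    print(mark_sentence_start(s))
# StarT in the Midwest, gas averages $1.884 per gallon.
# StarT prices are 45.5 cents, or 31 percent, higher than a year ago.
```

With the marker in place, a bigram such as StarT<>the records how often a word begins a sentence.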

N-grams and Corpora

• Ted Pedersen’s Perl-based Ngram Statistics Package (NSP)
  – http://www.d.umn.edu/~tpederse/nsp.html

• Computes N-gram statistics
  – Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE] ...]
    • Counts up the frequency of all n-grams occurring in SOURCE.
    • Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
  – OPTIONS:
    • --ngram N
      – Creates n-grams of N tokens each. N = 2 by default.
    • --newLine
      – Prevents n-grams from spanning across the new-line character.
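To make the effect of the two options concrete, here is a rough Python sketch of n-gram counting. It is not NSP's implementation. The tokenization (runs of word characters, or a single punctuation mark) is an assumption chosen because it agrees with output such as 1<>.<> in the listings for this corpus, and the names count_ngrams and new_line are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenization: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[.,;:?!]")


def count_ngrams(text, n=2, new_line=True):
    """Count n-grams; with new_line=True no n-gram spans a line break."""
    chunks = text.splitlines() if new_line else [text]
    counts = Counter()
    for chunk in chunks:
        tokens = TOKEN.findall(chunk)
        for i in range(len(tokens) - n + 1):
            counts["<>".join(tokens[i:i + n])] += 1
    return counts


text = "StarT gas averages $1.884 per gallon.\nStarT prices are higher."
for ngram, freq in count_ngrams(text, n=2).most_common(3):
    print(f"{ngram}<>{freq}")
```

With new_line=False the bigram .<>StarT would also be counted, because the end of one line then runs into the start of the next.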

N-grams and Corpora

• perl count.pl --ngram 1 --newline nyt.ug nyt2.txt

• Unigram stats
  • 259
  • .<>23
  • the<>16
  • StarT<>11
  • ,<>11
  • of<>9
  • 1<>6
  • to<>5
  • gallon<>5
  • per<>5
  • gasoline<>5
  • price<>4
  • at<>4
  • a<>4

N-grams and Corpora

• Bigram stats
  • 247
  • 1<>.<>6 6 23
  • per<>gallon<>5 5 5
  • StarT<>the<>4 11 16
  • gallon<>.<>3 5 23
  • than<>a<>3 4 4
  • on<>the<>3 3 16
  • price<>of<>3 4 9
  • week<>to<>2 2 5
  • Coast<>,<>2 2 11
  • ,<>averaging<>2 11 2
  • higher<>than<>2 2 4
  • unleaded<>gasoline<>2 2 5
  • last<>year<>2 3 4
  • year<>at<>2 4 4

247 = # n-grams
w1 <> w2 <> f1 f2 f3 (<> = separator)
  f1 = bigram freq
  f2 = freq of w1 occurring in a bigram as the 1st word
  f3 = freq of w2 occurring in a bigram as the 2nd word
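Lines in this format are easy to take apart programmatically. The following sketch assumes the layout described above (tokens separated by <>, then space-separated frequencies); parse_count_line is a hypothetical helper, not part of NSP.

```python
def parse_count_line(line):
    """Split 'w1<>w2<>f1 f2 f3' into its tokens and its list of frequencies."""
    # Everything before the last '<>' is a token; the last field holds the counts.
    *tokens, freq_field = line.strip().split("<>")
    freqs = [int(value) for value in freq_field.split()]
    return tokens, freqs


tokens, freqs = parse_count_line("StarT<>the<>4 11 16")
print(tokens, freqs)  # ['StarT', 'the'] [4, 11, 16]

# f1 / f2 is the relative-frequency estimate of p(the | StarT).
print(freqs[0] / freqs[1])
```

The same helper works for unigram and trigram lines, since it does not fix the number of tokens.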

N-grams and Corpora

• Trigram stats
  • 235
  • per<>gallon<>.<>3 5 5 23 5 3 3
  • year<>ago<>.<>2 4 2 23 2 2 2
  • million<>barrels<>,<>2 2 2 11 2 2 2
  • to<>1<>.<>2 4 6 23 2 2 6
  • a<>year<>ago<>2 4 4 2 2 2 2
  • than<>a<>year<>2 4 4 4 3 3 2
  • last<>year<>at<>2 3 4 4 2 2 2
  • price<>of<>gasoline<>2 4 9 5 3 2 2
  • Coast<>,<>averaging<>2 2 11 2 2 2 2
  • 202<>.<>7<>1 1 12 2 1 1 2
  • and<>tight<>supplies<>1 2 1 1 1 1 1

w1 <> w2 <> w3 <> f1 f2 f3 f4 f5 f6 f7
  f1 = trigram freq
  f2 = freq of w1 occurring in a trigram as the 1st word
  f3 = freq of w2 occurring in a trigram as the 2nd word
  f4 = freq of w3 occurring in a trigram as the 3rd word
  ...

N-grams and Corpora

• Probability calculations
  – Relative frequency/Maximum Likelihood Estimate (MLE)
    • bigram: f(w_{n-1} w_n) / f(w_{n-1})
    • trigram: f(w_{n-2} w_{n-1} w_n) / f(w_{n-2} w_{n-1})
    • f = frequency count
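Plugging in counts from the bigram and trigram listings for the gas-price article gives a few worked values. The sketch below copies a handful of frequencies by hand; the dictionary and function names are illustrative, and using the bigram frequency as the trigram denominator is an assumption that holds for these examples.

```python
from fractions import Fraction

# Frequencies copied from the bigram and trigram listings for the article.
bigram_freq = {("StarT", "the"): 4, ("per", "gallon"): 5, ("price", "of"): 3}
first_word_freq = {"StarT": 11, "per": 5, "price": 4}
trigram_freq = {("per", "gallon", "."): 3, ("price", "of", "gasoline"): 2}


def bigram_prob(prev, word):
    """p(word | prev) = f(prev word) / f(prev)."""
    return Fraction(bigram_freq[(prev, word)], first_word_freq[prev])


def trigram_prob(w1, w2, w3):
    """p(w3 | w1 w2) = f(w1 w2 w3) / f(w1 w2)."""
    return Fraction(trigram_freq[(w1, w2, w3)], bigram_freq[(w1, w2)])


print(bigram_prob("StarT", "the"))               # 4/11
print(bigram_prob("per", "gallon"))              # 1
print(trigram_prob("per", "gallon", "."))        # 3/5
print(trigram_prob("price", "of", "gasoline"))   # 2/3
```

In this article "per" is always followed by "gallon", so the estimate is 1, which shows how sharply a small corpus can skew MLE values.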

Language Models and N-grams

• Approximating Shakespeare
  – generate random sentences using n-grams
  – train on complete Works of Shakespeare

• Unigram (pick random, unconnected words)

• Bigram
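A bigram random-sentence generator can be sketched as follows. This is a minimal illustration, not the program used to produce the Shakespeare examples: the function names are hypothetical, and the training text is a single well-known line rather than the complete works.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1
    return followers


def generate(followers, start, length, rng):
    """Sample up to `length` words, each drawn in proportion to bigram counts."""
    words = [start]
    while len(words) < length and followers[words[-1]]:
        options = followers[words[-1]]
        words.append(rng.choices(list(options), weights=list(options.values()))[0])
    return words


tokens = "to be or not to be that is the question".split()
model = train_bigrams(tokens)
print(" ".join(generate(model, "to", 8, random.Random(0))))
```

Generation stops early if the last word was never seen with a follower, as happens with "question" here. A unigram generator would instead ignore the previous word and draw every word from the overall word frequencies.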

Language Models and N-grams

• Approximating Shakespeare
  – generate random sentences using n-grams
  – train on complete Works of Shakespeare

• Trigram

• Quadrigram

Remarks: dataset size problem

training set is small
  884,647 words
  29,066 different words

29,066^2 = 844,832,356 possible bigrams

for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
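The arithmetic behind the dataset size remark can be checked directly. The sketch below uses only the two figures from the slide. The upper bound on observed bigrams (one per adjacent word pair) is an added observation, not a figure from the lecture.

```python
tokens = 884_647      # words in the training set (from the slide)
vocabulary = 29_066   # different words (from the slide)

possible_bigrams = vocabulary ** 2
print(possible_bigrams)  # 844832356

# A text of N words has N - 1 adjacent pairs, so it can contain
# at most N - 1 distinct bigrams.
max_observed = tokens - 1
print(f"{max_observed / possible_bigrams:.2%} of possible bigrams, at most")
```

So even in the best case only about 0.1% of the possible bigrams can appear in the training text, and the shortfall grows with n.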

Language Models and N-grams

• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm

Language Models and N-grams

• Fall ‘04 LING 538 student project
  • (Tom Meservy & Kelly Fadel)
  • Java code for generating random sentences
    – trained on raw text
      • Shakespeare, Confucius, Bible, and Beatles lyrics
    – Demo ...

java ling538.Processor Lyrics.txt
.DS_Store
Lyrics.txt
Calculating probabilities

Probability Sequence
~0.70180374~0.22762895~0.7904473~0.33419436~0.9269424~0.41349262~0.54698783~0.9075054~0.772837~0.44430703~0.3082463~0.36489922~0.834998~0.30963254~0.3173852~0.0041531324~0.07958037~0.72606564~0.8516759~0.8360898~0.8416606~0.37221277~0.5591977~0.91353667~0.5408827~0.48573828~0.9263356~0.054902554~0.06846905~0.33292705~0.7778059~0.79034966~0.45557338~0.61566806~0.8953923~9.255409E-4~0.9976824~0.8329899~0.8512819~0.41009158~0.8826722~4.104972E-4~0.42533338~0.11680031~0.09122926~0.31464487~0.71688455~0.47669286~0.83831835~

Unigram Sentence
~you ~the ~love ~i ~my ~you ~and ~to ~i ~i ~a ~my ~and ~i ~my ~my ~lies ~now ~i ~i ~i ~i ~and ~the ~you ~to ~to ~you ~need ~back ~my ~i ~i ~me ~the ~i ~bed, ~you ~i ~i ~and ~i ~spain" ~and ~with ~see ~my ~i ~me ~i

Bigram Sentence
~you ~know ~she ~was ~waiting ~for ~your ~eyes ~now ~you're ~no ~time ~ago ~your ~emotional ~rescue ~i ~knew ~her ~in ~the ~world ~you ~can ~i ~don't ~come ~on ~back ~your ~emotional ~rescue ~yeah, ~oh ~yeah, ~yeah, ~why ~you ~know ~i ~can ~i ~awoke ~i ~really ~knows ~turn ~you ~can ~i

Trigram Sentence
~you ~want to ~change the ~harlem shuffle. ~hello little ~help from ~me to ~make it ~won't be ~belong, yeah, ~i said ~so she's ~an angel ~sent to ~me i ~gotta do ~what he ~left it ~be let ~me down ~i'm really ~got a ~loser and ~i'll be ~your man ~that's the ~day i ~want you ~oh, did ~she understand ~his song ~that was ~waiting for ~me dream ~would take ~a chance ~with me, ~hold me, ~martha my ~name you ~know what ~can i ~notice i ~don't know ~you made ~up you're ~gonna say ~you love ~you and ~i will

Quadrigram Sentence
~you ~know, you know ~of thee, ah ~i got a ~chance with me ~mine, i me ~too much and ~i like you ~i want you ~know, you know ~what it's like ~i please you ~love me too ~much it's all ~his nowhere plans ~and schemes, lost ~themselves instead please ~believe me when ~i tell you ~i want you ~i want you ~i want you ~love me too ~fast you gotta ~carry that weight ~a long time ~it's getting better ~all the time ~that was so ~i must be ~the way things ~to tell her ~majesty's a pretty ~nice girl but ~like you heard ~the word is ~good spread the ~sky with diamonds ~lucy in the ~sky with diamonds ~pictures yourself in ~love with you ~i didn't catch ~will you walk ~out and make ~you tell me, ~oh every little ~help from my ~car yes i'm ~in love with
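The demo prints a sequence of random numbers between 0 and 1 before each set of sentences. The students' Java code is not shown, so how those numbers are used is an assumption here. One common approach is to treat each number as a point on the cumulative probability scale and pick the word whose interval contains it. The sketch below illustrates that idea with a made-up unigram distribution; pick_word and the probabilities are hypothetical.

```python
from itertools import accumulate


def pick_word(distribution, r):
    """Return the first word whose cumulative probability exceeds r."""
    words = list(distribution)
    cumulative = list(accumulate(distribution.values()))
    for word, upper in zip(words, cumulative):
        if r < upper:
            return word
    return words[-1]  # guard against floating-point rounding


# Hypothetical unigram distribution, not taken from the lyrics corpus.
unigram = {"i": 0.4, "you": 0.3, "my": 0.2, "love": 0.1}
for r in (0.70180374, 0.22762895, 0.7904473):
    print(r, pick_word(unigram, r))
```

For bigrams and longer n-grams the same lookup would be applied to the distribution of words that follow the current history.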

Next Time

• Lab Class – Location: SBSRI Computer Lab.