Corpora and Language Modeling - MIT Media Lab, web.media.mit.edu/~havasi/MAS.S60/PNLP2.pdf


Corpora and Language Modeling

Ngrams, information, and monkeys on keyboards

Rob Speer Catherine Havasi

MAS.S60

Corpus Linguistics

• A corpus is a body of existing text
• In descriptive linguistics, it provides evidence
• Ideally*, a corpus should contain documents selected for variety
• In natural language processing, it provides training data

Plain text corpora

• Project Gutenberg
• British National Corpus
• Presidential inaugural addresses
• The Universal Declaration of Human Rights (translated into >300 languages)
• CHILDES (conversations between parents and children)
• Wikipedia
• Google Books
• The entire Web

Annotated corpora

• Brown corpus (has part-of-speech tags)
• Penn Treebank (complete parse trees of sentences, mostly from the WSJ)
• SemCor (distinguishes word senses)
• Indian POS-tagged corpus (in Bangla, Hindi, Marathi, and Telugu)

What can you do with corpora?

• Examine trends and statistics of language use.
• Train or test an NLP system.
• Build a lexical resource (by hand or automatically).
• Understand which pairs of words go together and which contain unusual amounts of information.

Distribution

• Brown corpus
  – “the” is 7%
  – “to” and “of” are 3% each
  – “rabbit” is 0.0011%

Making the Dictionary

•  Remember the concordance tool from last class?

Concordance Toolkits

Examining trends

Examining trends

Application: text prediction

Application: machine translation

What information do you use?

• My landlord called, asking if I had paid the ____
  – Grammar?
  – N-grams?
  – Semantic relatedness?

The Unreasonable Effectiveness of Data

•  Peter Norvig: it’s better to get more data than better representations

•  “An informal, incomplete grammar of the English language runs over 1,700 pages.”

•  “For many tasks, words and word combinations provide all the representational machinery we need to learn from text.”

Existing data isn’t everything

Loading corpora in NLTK

• nltk.download()
• nltk.corpus.gutenberg
  – gives you an NLTK CorpusReader object
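
A minimal sketch of those two calls, assuming NLTK is installed and the NLTK data server is reachable. The helper name `load_gutenberg_words` and the default text `austen-emma.txt` are illustrative choices, not part of the slides.

```python
def load_gutenberg_words(fileid="austen-emma.txt"):
    """Return the word tokens of one Project Gutenberg text via NLTK."""
    # Imported lazily so that defining this helper does not require NLTK.
    import nltk
    from nltk.corpus import gutenberg

    # Fetch the Gutenberg sample if it is not already on disk.
    nltk.download("gutenberg", quiet=True)

    # gutenberg is a corpus reader: fileids() lists the available texts,
    # and words(fileid) returns the tokens of one of them.
    if fileid not in gutenberg.fileids():
        raise ValueError(f"unknown fileid: {fileid}")
    return gutenberg.words(fileid)
```

Calling `load_gutenberg_words()` should return a sequence of tokens that can be passed straight to a counter or frequency distribution.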

N-grams

• Given a corpus of text, the n-grams are the sequences of n consecutive words that are in the corpus.

• Can be with words or letters
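
As a rough illustration of the definition, the sketch below slides a window of width n over any sequence. The function name `ngrams` and the sample inputs are assumptions made for this example.

```python
def ngrams(items, n):
    """Return the list of n-grams (as tuples) over a sequence of items."""
    # Slide a window of width n over the sequence, one position at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the cat sat on the mat".split()
word_bigrams = ngrams(words, 2)

# The same function works on letters, since a string is a sequence too.
letter_trigrams = ngrams("cat sat", 3)
```

Because the function only relies on slicing, the same code covers both the word case and the letter case.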

N-grams: Unigram

“The cat that sat on the sofa also sat on the mat.”

The   3
sat   2
on    2
cat   1
that  1
sofa  1
also  1
mat   1
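
These counts can be reproduced with the standard library. This is a sketch that assumes a very simple tokenizer (lowercase, alphabetic runs only); a real corpus would need more careful tokenization.

```python
import re
from collections import Counter

sentence = "The cat that sat on the sofa also sat on the mat."

# Lowercase and keep alphabetic runs only, so "The" and "the" are one type.
tokens = re.findall(r"[a-z]+", sentence.lower())

# Counter maps each unigram type to the number of times it occurs.
unigram_counts = Counter(tokens)
```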

How much information is in a word?

• An event that happens with a probability of 1 in 2^n carries n bits of information
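
Put another way, an event of probability p carries -log2(p) bits. The sketch below applies that to a coin flip, a 1-in-256 event, and the 7% figure quoted earlier for “the” in the Brown corpus; the function name is illustrative.

```python
import math

def information_bits(p):
    """Bits of information carried by an event of probability p."""
    return -math.log2(p)

bits_coin = information_bits(1 / 2)       # 1 in 2^1
bits_byte = information_bits(1 / 2 ** 8)  # 1 in 2^8
bits_the = information_bits(0.07)         # "the" at about 7% of tokens
```

Frequent words carry few bits: at 7%, “the” comes out at a little under 4 bits.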

Maximum Likelihood

• An infinite number of probability distributions could have produced your observations
• Our training data is a sample from an unknown distribution
• Each sample has a probability of occurring under a given distribution
• The distribution proportional to the observed counts is the most likely one given the sample
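
In code, the maximum likelihood estimate is just relative frequency. This is a minimal sketch on a toy token list; the function name `mle_distribution` is an assumption for illustration.

```python
from collections import Counter

def mle_distribution(tokens):
    """Maximum likelihood estimate: probability = relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

tokens = "the cat that sat on the sofa also sat on the mat".split()
p = mle_distribution(tokens)
```

Note that any word missing from the sample has no entry at all, which amounts to probability zero.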

Estimating information in a corpus

• Make a FreqDist for English out of the Brown corpus

• How much information is in the word “the”?

• What is the average number of bits per word?

•  Repeat for another language
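
One way to approach the exercise, sketched on a toy token list rather than the Brown corpus: weight each word's information by its relative frequency and sum. The function name is illustrative, and a plain `Counter` stands in for NLTK's `FreqDist` here.

```python
import math
from collections import Counter

def average_bits_per_word(tokens):
    """Average information per token under the relative-frequency estimate."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Each type contributes p * -log2(p), where p is its relative frequency.
    return sum((c / total) * -math.log2(c / total) for c in counts.values())

tokens = "the cat that sat on the sofa also sat on the mat".split()
avg_bits = average_bits_per_word(tokens)
```

Swapping in the tokens of a real corpus gives the average number of bits per word for that corpus.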

But there are always more words!

•  A new word shouldn’t have zero probability.

•  The MLE is not a realistic language model, because it cannot handle new words.

• What probability should a new word have?

Zipf’s Law

Witten-Bell estimation

• Estimate the probability of an event we haven’t seen yet, based on the number of event types we’ve seen so far
• Decrease other probabilities accordingly
• A special case of Good-Turing smoothing
• Implemented in NLTK as WittenBellProbDist

Witten-Bell estimation

• Let i range over all unigram types
• N = total tokens, T = total types
• The chance of a new event is T / (N + T)

What’s the probability of a specific new event?

• We don’t know anything about the events we haven’t seen, so we assume they’re uniformly distributed

•  So assume there’s some finite number:

• Guess the total number of events, (T + Z), where Z is the number of events we haven’t seen
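
A sketch of the resulting estimate, under stated assumptions: seen words are discounted to count / (N + T), which is the usual Witten-Bell form but is not spelled out in the text above, and the reserved mass T / (N + T) is split evenly over Z unseen types. The value Z = 100 is an arbitrary guess for illustration.

```python
from collections import Counter

def witten_bell(tokens, unseen_types):
    """Return (probabilities of seen words, probability of each unseen word).

    unseen_types is Z, a guess at how many event types we have not seen.
    """
    counts = Counter(tokens)
    n_tokens = sum(counts.values())   # N
    n_types = len(counts)             # T
    # Seen words are discounted from c/N to c/(N + T).
    seen = {w: c / (n_tokens + n_types) for w, c in counts.items()}
    # The reserved mass T/(N + T) is split evenly among the Z unseen types.
    each_unseen = n_types / (unseen_types * (n_tokens + n_types))
    return seen, each_unseen

tokens = "the cat that sat on the sofa also sat on the mat".split()
seen, each_unseen = witten_bell(tokens, unseen_types=100)
```

With N = 12 and T = 8, the seen words keep 12/20 of the probability mass and the remaining 8/20 is shared among the unseen types.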

Information in sequences of words

• Words don’t actually convey information independently

•  There is less information in each word than the unigrams would suggest

• When hearing a sentence, you can often guess the next _____

N-grams: Bigrams

“The cat that sat on the sofa also sat on the mat.”

sat on     2
on the     2
the cat    1
cat that   1
that sat   1
the sofa   1
sofa also  1
also sat   1
the mat    1

And then there are trigrams, 4-grams, ...

Sliding Windows

“The cat that sat on the sofa also sat on the mat.”

“The cat that sat on the sofa also sat on the mat.”

“The cat that sat on the sofa also sat on the mat.”

Pointwise mutual information

•  Is “vice president” a significant phrase, or is it simply a coincidence when the words “vice” and “president” are near each other?
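
Pointwise mutual information compares how often the pair occurs with how often it would occur if the two words were independent: PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ). The sketch below uses unsmoothed relative frequencies on a made-up token list; the function name and the sample text are assumptions.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2), in bits."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    # An unseen pair makes the ratio 0, and math.log2(0) raises ValueError.
    return math.log2(p_pair / (p_w1 * p_w2))

tokens = "the vice president met the vice chair and the president spoke".split()
score = pmi(tokens, "vice", "president")
```

A positive score means the pair shows up more often than chance would predict; a score near zero suggests coincidence.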

What if p(W1, W2) = 0?

• Once again we have an unrealistic model with probabilities that could be 0

•  If the phrase “unrealistic model” isn’t in the corpus, does it have infinite information?

• We need to smooth again

Laplace smoothing

• The “add one” principle
• Any event in your model that never happens, happens once instead

•  Simple and often good enough

Laplace smoothing on bigrams
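
A minimal sketch of add-one smoothing for bigrams, assuming the set of possible events is every ordered pair of words in the known vocabulary. The function name and that choice of event space are illustrative.

```python
from collections import Counter

def laplace_prob(counts, event, n_possible_events):
    """Add-one estimate: (count + 1) / (total + number of possible events)."""
    total = sum(counts.values())
    return (counts[event] + 1) / (total + n_possible_events)

tokens = "the cat that sat on the sofa also sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Assumption: any ordered pair of known words is a possible bigram.
n_possible = len(set(tokens)) ** 2

p_seen = laplace_prob(bigram_counts, ("sat", "on"), n_possible)
p_unseen = laplace_prob(bigram_counts, ("mat", "sat"), n_possible)
```

The unseen pair now gets a small nonzero probability, at the cost of shifting a lot of mass away from the pairs that were actually observed.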

Witten-Bell works here too

• Let i range over all bigrams
• N = total tokens, T = total types

You’ve got Spam!

The Turing Test

• Alan Turing
• “Can machines think?”
• “Imitation Game”
• Turing is no longer asking whether a machine can think, but whether a machine can act like it is thinking.

The Loebner Prize

• Annual “Turing Test” since 1991
• Silver (text) and Gold (text + visual) never won
• Bronze: “Most human-like”

Alice

How does Alice work?

• Heuristic pattern matching rules.
• Think ELIZA, but more rules
• Reacts to words in the input

What wins the Loebner Prize?

• Spelling and grammar
• Hiding mathematical knowledge
• React like a human (timing)
• Pretend to pretend to be a robot (Elbot)

MegaHal

How does MegaHal Work?

• Learns how you talk
• Imitates natural language in general
• ... with a healthy dose of you
• Imitation

What is MegaHal trained on?

• Conversational sentences designed to help it win the Loebner Prize
• Facts about the world
• References to Hitchhiker's Guide
• TMBG lyrics

Markov chains

• Markov chains are structures where future states depend only on the current state

• Given the present state, the future and past states are independent

•  “Forget where you were. You are here now. Decide where to go next”.

Generating Text with Markov

• Make a probability distribution of what comes next, given the last n-1 words

• Iterate
• Maximum likelihood estimate is fine here
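
The procedure can be sketched as follows, assuming a toy token list and a fixed random seed. The names `build_model` and `generate` are illustrative; storing every observed follower in a list means a uniform choice from that list already samples the maximum likelihood distribution.

```python
import random
from collections import defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to the list of words that followed it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, start, length, rng):
    """Extend `start` (a tuple of n-1 words) by up to `length` words."""
    output = list(start)
    context = tuple(start)
    for _ in range(length):
        followers = model.get(context)
        if not followers:   # dead end: this context was never continued
            break
        # Choosing uniformly from the observed followers samples each word
        # in proportion to its count, i.e. the MLE distribution.
        output.append(rng.choice(followers))
        # The new context is the last n-1 words generated so far.
        context = tuple(output[len(output) - len(start):])
    return output

tokens = "the cat that sat on the sofa also sat on the mat".split()
model = build_model(tokens, n=2)
text = generate(model, start=("the",), length=8, rng=random.Random(0))
```

Every adjacent pair in the output was seen in the training text, which is why larger n produces more sensible text that is also closer to copying.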

Markov Beyond Bigrams

• The higher our n, the more sensible our text.
  – Text plagiarism vs. generation?

• Our space becomes sparser

Making MegaHal

• Filter stopwords
• Pick a word they said (smartly)
• Forward and backward Markov Chain from that word
  – Backward?
• Some small fixes
  – my -> your
  – why -> because
• Do this several times
• Pick with a heuristic
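
A loose sketch of these steps, not MegaHal's actual implementation. Everything here is an assumption for illustration: the tiny stopword list, picking the keyword at random rather than "smartly", bigram chains in both directions, and "longest candidate wins" as a stand-in heuristic.

```python
import random
from collections import defaultdict

STOPWORDS = {"the", "a", "is", "on"}          # toy stopword list
SWAPS = {"my": "your", "why": "because"}      # the small fixes listed above

def build_chains(tokens):
    """Bigram chains in both directions: next words and previous words."""
    forward, backward = defaultdict(list), defaultdict(list)
    for left, right in zip(tokens, tokens[1:]):
        forward[left].append(right)
        backward[right].append(left)
    return forward, backward

def walk(chain, word, steps, rng):
    """Follow a chain from `word` for up to `steps` steps."""
    path = []
    for _ in range(steps):
        options = chain.get(word)
        if not options:
            break
        word = rng.choice(options)
        path.append(word)
    return path

def reply(user_input, forward, backward, rng, candidates=5):
    """Build several replies around a keyword and keep the longest one."""
    words = [w for w in user_input.lower().split() if w not in STOPWORDS]
    words = [SWAPS.get(w, w) for w in words]
    keywords = [w for w in words if w in forward or w in backward]
    if not keywords:
        return None
    best = None
    for _ in range(candidates):
        keyword = rng.choice(keywords)
        # The backward walk yields predecessors nearest-first, so reverse it.
        before = walk(backward, keyword, 4, rng)[::-1]
        after = walk(forward, keyword, 4, rng)
        candidate = before + [keyword] + after
        if best is None or len(candidate) > len(best):
            best = candidate
    return " ".join(best)

training = "the cat sat on the mat because the sofa was full".split()
forward, backward = build_chains(training)
answer = reply("why is my cat sad", forward, backward, random.Random(1))
```

The backward chain is what lets the keyword land in the middle of the reply instead of always starting it.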

Assignment

•  Load a sufficiently large new text corpus in NLTK

•  Discover words that happen significantly more frequently in your corpus than the Brown corpus

•  Discover its two-word collocations with the most pointwise mutual information

• Use your code from this class to generate text from it