
Natural Language Processing (CSEP 517): Introduction & Language Models

Noah Smith
© 2017 University of Washington
[email protected]

March 27, 2017

What is NLP?

NL ∈ {Mandarin Chinese, English, Spanish, Hindi, . . . , Lushootseed}

Automation of:

- analysis (NL → R)
- generation (R → NL)
- acquisition of R from knowledge and data

What is R?

[Diagram: NL and R linked by analysis (NL → R) and generation (R → NL).]

What does it mean to “know” a language?


Levels of Linguistic Knowledge

[Diagram: levels of linguistic knowledge, from "shallower" to "deeper": phonetics and phonology (speech) / orthography (text), morphology (lexemes), syntax, semantics, pragmatics, discourse.]

Orthography

ลกศษยวดกระทงยงยอปดถนนทางขนไปนมสการพระบาทเขาคชฌกฏ หวดปะทะกบเจาถนทออกมาเผชญหนาเพราะเดอดรอนสญจรไมได ผวจ.เรงทกฝายเจรจา กอนทชอเสยงของจงหวดจะเสยหายไปมากกวาน พรอมเสนอหยดจดงาน 15 วน....


Morphology

uygarlaştıramadıklarımızdanmışsınızcasına
"(behaving) as if you are among those whom we could not civilize"

TIFGOSH ET HA-YELED BA-GAN
"you will meet the boy in the park"

unfriend, Obamacare, Manfuckinghattan

The Challenges of “Words”

- Segmenting text into words (e.g., the Thai example)
- Morphological variation (e.g., the Turkish and Hebrew examples)
- Words with multiple meanings: bank, mean
- Domain-specific meanings: latex
- Multiword expressions: make a decision, take out, make up, bad hombres

Example: Part-of-Speech Tagging

ikr smh he asked fir yo last name
so he can add u on fb lololol

Glosses: ikr = "I know, right"; smh = "shake my head"; fir = "for"; yo = "your"; u = "you"; fb = "Facebook"; lololol = "laugh out loud"

Tags:
ikr/!  smh/G  he/O  asked/V  fir/P  yo/D  last/A  name/N
so/P   he/O   can/V  add/V   u/O    on/P  fb/∧    lololol/!

(! = interjection, G = acronym, O = pronoun, V = verb, P = preposition, D = determiner, A = adjective, N = noun, ∧ = proper noun)

Syntax

Two possible structures for "natural language processing":

[NP [NP [Adj. natural] [Noun language]] [Noun processing]]

vs.

[NP [Adj. natural] [NP [Noun language] [Noun processing]]]

Morphology + Syntax

A ship-shipping ship, shipping shipping-ships.

Syntax + Semantics

We saw the woman with the telescope wrapped in paper.

- Who has the telescope?
- Who or what is wrapped in paper?
- An event of perception, or an assault?

Semantics

Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!

– Groucho Marx

Can R be “Meaning”?

Depends on the application!

- Giving commands to a robot
- Querying a database
- Reasoning about relatively closed, grounded worlds

Harder to formalize:

- Analyzing opinions
- Talking about politics or policy
- Ideas in science

Why NLP is Hard

1. Mappings across levels are complex.
   - A string may have many possible interpretations in different contexts, and resolving ambiguity correctly may rely on knowing a lot about the world.
   - Richness: any meaning may be expressed many ways, and there are immeasurably many meanings.
   - Linguistic diversity across languages, dialects, genres, styles, . . .
2. Appropriateness of a representation depends on the application.
3. Any R is a theorized construct, not directly observable.
4. There are many sources of variation and noise in linguistic input.

Desiderata for NLP Methods
(ordered arbitrarily)

1. Sensitivity to a wide range of the phenomena and constraints in human language
2. Generality across different languages, genres, styles, and modalities
3. Computational efficiency at construction time and runtime
4. Strong formal guarantees (e.g., convergence, statistical efficiency, consistency, etc.)
5. High accuracy when judged against expert annotations and/or task-specific performance

NLP ?= Machine Learning

- To be successful, a machine learner needs bias/assumptions; for NLP, that might be linguistic theory/representations.
- R is not directly observable.
- Early connections to information theory (1940s)
- Symbolic, probabilistic, and connectionist ML have all seen NLP as a source of inspiring applications.

NLP ?= Linguistics

- NLP must contend with NL data as found in the world
- NLP ≈ computational linguistics
- Linguistics has begun to use tools originating in NLP!

Fields with Connections to NLP

- Machine learning
- Linguistics (including psycho-, socio-, descriptive, and theoretical)
- Cognitive science
- Information theory
- Logic
- Theory of computation
- Data science
- Political science
- Psychology
- Economics
- Education

The Engineering Side

- Application tasks are difficult to define formally; they are always evolving.
- Objective evaluations of performance are always up for debate.
- Different applications require different R.
- People who succeed in NLP for long periods of time are foxes, not hedgehogs.

Today’s Applications

- Conversational agents
- Information extraction and question answering
- Machine translation
- Opinion and sentiment analysis
- Social media analysis
- Rich visual understanding
- Essay evaluation
- Mining legal, medical, or scholarly literature

Factors Changing the NLP Landscape
(Hirschberg and Manning, 2015)

- Increases in computing power
- The rise of the web, then the social web
- Advances in machine learning
- Advances in understanding of language in social context

Administrivia


Course Website

http://courses.cs.washington.edu/courses/csep517/17sp/


Your Instructors

Noah (instructor):

- UW CSE professor since 2015, teaching NLP since 2006, studying NLP since 1998, first NLP program in 1991
- Research interests: machine learning for structured problems in NLP, NLP for social science

George (TA):

- Computer Science Ph.D. student
- Research interests: machine learning for multilingual NLP

Outline of CSEP 517

1. Probabilistic language models, which define probability distributions over text passages (about 2 weeks)
2. Text classifiers, which infer attributes of a piece of text by "reading" it (about 1 week)
3. Sequence models (about 1 week)
4. Parsers (about 2 weeks)
5. Semantics (about 2 weeks)
6. Machine translation (about 1 week)

Readings

- Main reference text: Jurafsky and Martin (2008), with some chapters from the new edition (Jurafsky and Martin, forthcoming) when available
- Course notes from the instructor and others
- Research articles

Lecture slides will include references for deeper reading on some topics.

Evaluation

- Approximately five assignments (A1–5), completed individually (50%)
  - Some pencil and paper, mostly programming
  - Graded mostly on your writeup (so please take written communication seriously!)
- Quizzes (20%), given roughly weekly, online
- An exam (30%), to take place at the end of the quarter

To-Do List

- Entrance survey: due Wednesday
- Online quiz: due Friday
- Print, sign, and return the academic integrity statement
- Read: Jurafsky and Martin (2008, ch. 1), Hirschberg and Manning (2015), and Smith (2017); optionally, Jurafsky and Martin (2016) and Collins (2011) §2
- A1, out today, due April 7

Very Quick Review of Probability

- Event space (e.g., 𝒳, 𝒴) — in this class, usually discrete
- Random variables (e.g., X, Y)
- Typical statement: "random variable X takes value x ∈ 𝒳 with probability p(X = x), or, in shorthand, p(x)"
- Joint probability: p(X = x, Y = y)
- Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
- Always true: p(X = x, Y = y) = p(X = x | Y = y) · p(Y = y) = p(Y = y | X = x) · p(X = x)
- Sometimes true: p(X = x, Y = y) = p(X = x) · p(Y = y)
- The difference between true and estimated probability distributions

Language Models: Definitions

- V is a finite set of (discrete) symbols (here, "words" or possibly characters); V = |V| is its size
- V† is the (infinite) set of sequences of symbols from V whose final symbol is the special stop symbol
- p : V† → R, such that:
  - For any x ∈ V†, p(x) ≥ 0
  - ∑_{x ∈ V†} p(X = x) = 1
  (I.e., p is a proper probability distribution.)

Language modeling: estimate p from examples, x_{1:n} = ⟨x_1, x_2, . . . , x_n⟩.

Immediate Objections

1. Why would we want to do this?

2. Are the nonnegativity and sum-to-one constraints really necessary?

3. Is “finite V” realistic?


Motivation: Noisy Channel Models

A pattern for modeling a pair of random variables, D and O:

source −→ D −→ channel −→ O

- D is the plaintext, the true message, the missing information, the output
- O is the ciphertext, the garbled message, the observable evidence, the input
- Decoding: select d given O = o.

d* = argmax_d p(d | o)
   = argmax_d p(o | d) · p(d) / p(o)
   = argmax_d p(o | d) · p(d)

where p(o | d) is the channel model and p(d) is the source model.
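A minimal decoding sketch in this pattern, in Python. The candidate list, the two scoring functions, and the spelling-correction numbers below are illustrative assumptions, not anything from the slides:

    # Noisy-channel decoding: pick d maximizing p(o | d) * p(d), in log space.
    def decode(o, candidates, channel_logprob, source_logprob):
        # channel_logprob(o, d) ~ log p(o | d); source_logprob(d) ~ log p(d)
        return max(candidates, key=lambda d: channel_logprob(o, d) + source_logprob(d))

    # Toy spelling-correction example with made-up log-probabilities.
    candidates = ["the", "ten", "teh"]
    channel = {"the": -1.0, "ten": -3.0, "teh": -0.1}   # log p("teh" | d), hypothetical
    source  = {"the": -2.0, "ten": -9.0, "teh": -20.0}  # log p(d), hypothetical
    best = decode("teh", candidates, lambda o, d: channel[d], lambda d: source[d])
    print(best)  # "the": the source (language model) overrides the channel's preference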

Noisy Channel Example: Speech Recognition

source −→ sequence in V† −→ channel −→ acoustics

- Acoustic model defines p(sounds | d) (channel)
- Language model defines p(d) (source)

Noisy Channel Example: Speech Recognition
(Credit: Luke Zettlemoyer)

word sequence                                    log p(acoustics | word sequence)
the station signs are in deep in english                                   -14732
the stations signs are in deep in english                                  -14735
the station signs are in deep into english                                 -14739
the station 's signs are in deep in english                                -14740
the station signs are in deep in the english                               -14741
the station signs are indeed in english                                    -14757
the station 's signs are indeed in english                                 -14760
the station signs are indians in english                                   -14790
the station signs are indian in english                                    -14799
the stations signs are indians in english                                  -14807
the stations signs are indians and english                                 -14815

Noisy Channel Example: Machine Translation

Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."

– Warren Weaver, 1955

Noisy Channel Examples

- Speech recognition
- Machine translation
- Optical character recognition
- Spelling and grammar correction


Evaluation: Perplexity

Intuitively, language models should assign high probability to real language they have not seen before.

For out-of-sample ("held-out" or "test") data x_{1:m}:

- Probability of x_{1:m} is ∏_{i=1}^{m} p(x_i)
- Log-probability of x_{1:m} is ∑_{i=1}^{m} log₂ p(x_i)
- Average log-probability per word of x_{1:m} is
  l = (1/M) ∑_{i=1}^{m} log₂ p(x_i),
  where M = ∑_{i=1}^{m} |x_i| (total number of words in the corpus)
- Perplexity (relative to x_{1:m}) is 2^{−l}

Lower is better.
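A small sketch of this computation in Python; the `logprob_fn` interface and the sentence representation are assumptions for illustration:

    import math

    def perplexity(logprob_fn, test_sentences):
        # logprob_fn(sentence) is assumed to return log2 p(sentence) for a whole
        # sequence (including its stop symbol); each sentence is a list of tokens.
        total_log2 = sum(logprob_fn(x) for x in test_sentences)
        M = sum(len(x) for x in test_sentences)   # total number of words
        l = total_log2 / M                        # average log2-probability per word
        return 2.0 ** (-l)

    # Sanity check: a uniform model over a vocabulary of size 1000 gives
    # perplexity 1000 (the "branching factor" idea on the next slide).
    V = 1000
    uniform_logprob = lambda x: len(x) * math.log2(1.0 / V)
    print(perplexity(uniform_logprob, [["a"] * 7, ["b"] * 13]))  # -> 1000.0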

Understanding Perplexity

2^{−(1/M) ∑_{i=1}^{m} log₂ p(x_i)}

It's a branching factor!

- Assign probability of 1 to the test data ⇒ perplexity = 1
- Assign probability of 1/|V| to every word ⇒ perplexity = |V|
- Assign probability of 0 to anything ⇒ perplexity = ∞
- This motivates a stricter constraint than we had before:
  - For any x ∈ V†, p(x) > 0

Perplexity

- Perplexity on conventionally accepted test sets is often reported in papers.
- Generally, I won't discuss perplexity numbers much, because:
  - Perplexity is only an intermediate measure of performance.
  - Understanding the models is more important than remembering how well they perform on particular train/test sets.
- If you're curious, look up numbers in the literature; always take them with a grain of salt!


Is "finite V" realistic?

No

[Slide: "no" written in many variant spellings and forms.]

The Language Modeling Problem

Input: x_{1:n} ("training data")
Output: p : V† → R₊

p should be a "useful" measure of plausibility (not grammaticality).

A Trivial Language Model

p(x) = |{i : x_i = x}| / n = c_{x_{1:n}}(x) / n

What if x is not in the training data?
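A sketch of this trivial model in Python, showing the problem the question points at; the sentence representation is an assumption:

    from collections import Counter

    def train_trivial(training_sentences):
        # Relative frequency of whole sentences; tuples so they can be dict keys.
        counts = Counter(map(tuple, training_sentences))
        n = len(training_sentences)
        return lambda x: counts[tuple(x)] / n

    p = train_trivial([["hello", "world"], ["hello", "world"], ["good", "morning"]])
    print(p(["hello", "world"]))   # 0.666...
    print(p(["good", "evening"]))  # 0.0 -- any unseen sentence gets probability zero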

Using the Chain Rule

p(X = x) = p(X_1 = x_1 | X_0 = x_0)
         · p(X_2 = x_2 | X_{0:1} = x_{0:1})
         · p(X_3 = x_3 | X_{0:2} = x_{0:2})
         · . . .
         · p(X_ℓ = x_ℓ | X_{0:ℓ-1} = x_{0:ℓ-1})
         = ∏_{j=1}^{ℓ} p(X_j = x_j | X_{0:j-1} = x_{0:j-1})

(Here ℓ = |x|; x_0 is a fixed boundary symbol and the final symbol x_ℓ is the stop symbol.)

Unigram Model

p(X = x) = ∏_{j=1}^{ℓ} p(X_j = x_j | X_{0:j-1} = x_{0:j-1})
         = ∏_{j=1}^{ℓ} p_θ(X_j = x_j)     (independence assumption)
         = ∏_{j=1}^{ℓ} θ_{x_j}

Maximum likelihood estimate:

∀v ∈ V,  θ_v = |{(i, j) : [x_i]_j = v}| / N = c_{x_{1:n}}(v) / N

where N = ∑_{i=1}^{n} |x_i|.

Also known as "relative frequency estimation."
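A sketch of unigram relative-frequency estimation in Python; the "<stop>" spelling of the stop symbol is an illustrative choice:

    from collections import Counter

    STOP = "<stop>"   # illustrative spelling of the stop symbol

    def train_unigram(training_sentences):
        counts = Counter()
        for sent in training_sentences:
            counts.update(sent + [STOP])   # count every token, including stop
        N = sum(counts.values())
        return {v: c / N for v, c in counts.items()}   # theta_v = c(v) / N

    def unigram_sentence_prob(theta, sentence):
        p = 1.0
        for w in sentence + [STOP]:
            p *= theta.get(w, 0.0)   # unseen words get zero probability (OOV problem)
        return p

    theta = train_unigram([["i", "want", "ice", "cream"], ["i", "want", "cake"]])
    print(unigram_sentence_prob(theta, ["i", "want", "cake"]))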

Unigram Models: Assessment

Pros:

- Easy to understand
- Cheap
- Good enough for information retrieval (maybe)

Cons:

- "Bag of words" assumption is linguistically inaccurate
  - p(the the the the) > p(I want ice cream)
- Data sparseness; high variance in the estimator
- "Out of vocabulary" problem

Markov Models ≡ n-gram Models

p(X = x) = ∏_{j=1}^{ℓ} p(X_j = x_j | X_{0:j-1} = x_{0:j-1})
         = ∏_{j=1}^{ℓ} p_θ(X_j = x_j | X_{j-n+1:j-1} = x_{j-n+1:j-1})     (Markov assumption)

(n − 1)th-order Markov assumption ≡ n-gram model

- Unigram model is the n = 1 case
- For a long time, trigram models (n = 3) were widely used
- 5-gram models (n = 5) are not uncommon now in MT

Estimating n-Gram Models

unigram:
  p_θ(x) = ∏_{j=1}^{ℓ} θ_{x_j}
  Parameters: θ_v, ∀v ∈ V
  MLE: c(v) / N

bigram:
  p_θ(x) = ∏_{j=1}^{ℓ} θ_{x_j | x_{j-1}}
  Parameters: θ_{v | v′}, ∀v ∈ V, v′ ∈ V ∪ {○}
  MLE: c(v′v) / ∑_{u ∈ V} c(v′u)

trigram:
  p_θ(x) = ∏_{j=1}^{ℓ} θ_{x_j | x_{j-2} x_{j-1}}
  Parameters: θ_{v | v″v′}, ∀v ∈ V, v′, v″ ∈ V ∪ {○}
  MLE: c(v″v′v) / ∑_{u ∈ V} c(v″v′u)

General case:
  p_θ(x) = ∏_{j=1}^{ℓ} θ_{x_j | x_{j-n+1:j-1}}
  Parameters: θ_{v | h}, ∀v ∈ V, h ∈ (V ∪ {○})^{n-1}
  MLE: c(hv) / ∑_{u ∈ V} c(hu)

(Here ○ is the boundary symbol used to pad positions before the start of the sequence.)
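A sketch of the bigram case of the MLE above, in Python; "<s>" and "<stop>" are illustrative spellings of the boundary and stop symbols:

    from collections import Counter, defaultdict

    def train_bigram(training_sentences):
        counts = defaultdict(Counter)   # counts[history][word]
        for sent in training_sentences:
            padded = ["<s>"] + sent + ["<stop>"]
            for prev, cur in zip(padded, padded[1:]):
                counts[prev][cur] += 1
        # theta_{v | v'} = c(v'v) / sum_u c(v'u)
        return {h: {v: c / sum(ws.values()) for v, c in ws.items()}
                for h, ws in counts.items()}

    theta = train_bigram([["i", "want", "ice", "cream"], ["i", "want", "cake"]])
    print(theta["want"])   # {'ice': 0.5, 'cake': 0.5}
    print(theta["<s>"])    # {'i': 1.0}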

The Problem with MLE

- The curse of dimensionality: the number of parameters grows exponentially in n
- Data sparseness: most n-grams will never be observed, even if they are linguistically plausible
- No one actually uses the MLE!

Smoothing

A few years ago, I'd have spent a whole lecture on this!

- Simple method: add λ > 0 to every count (including zero counts) before normalizing (sketched below)
- What makes it hard: ensuring that the probabilities over all sequences sum to one
  - Otherwise, perplexity calculations break
- Longstanding champion: modified Kneser-Ney smoothing (Chen and Goodman, 1998)
- Stupid backoff: reasonable, easy solution when you don't care about perplexity (Brants et al., 2007)
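A sketch of the add-λ method from the first bullet, for a bigram model, in Python; the vocabulary handling and symbol spellings are illustrative assumptions:

    from collections import Counter, defaultdict

    def add_lambda_bigram(training_sentences, vocab, lam=0.1):
        # vocab must contain every training token plus "<stop>" so that each
        # history's distribution sums to one.
        counts = defaultdict(Counter)
        for sent in training_sentences:
            padded = ["<s>"] + sent + ["<stop>"]
            for prev, cur in zip(padded, padded[1:]):
                counts[prev][cur] += 1
        V = len(vocab)
        def prob(word, history):
            c_hist = sum(counts[history].values())
            return (counts[history][word] + lam) / (c_hist + lam * V)
        return prob

    vocab = {"i", "want", "ice", "cream", "cake", "<stop>"}
    p = add_lambda_bigram([["i", "want", "ice", "cream"], ["i", "want", "cake"]], vocab)
    print(p("cake", "want"))    # seen bigram: relatively high probability
    print(p("cream", "want"))   # unseen bigram: small but nonzero probability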

Interpolation

If p and q are both language models, then so is

αp + (1 − α)q

for any α ∈ [0, 1].

- This idea underlies many smoothing methods
- Often a new model q only beats a reigning champion p when interpolated with it
- How to pick the "hyperparameter" α? (see the sketch below)
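One common answer to the last question is to choose α by grid search on held-out data. A sketch under that assumption; the `(word, history)` interface for the two models is an assumption, not from the slides:

    import math

    def interpolate(p1, p2, alpha):
        # If p1 and p2 are language models, so is their mixture.
        return lambda word, history: alpha * p1(word, history) + (1.0 - alpha) * p2(word, history)

    def heldout_loglik(p, events):
        total = 0.0
        for w, h in events:             # events: (word, history) pairs from held-out text
            pr = p(w, h)
            if pr <= 0.0:
                return float("-inf")    # one zero ruins the whole corpus likelihood
            total += math.log(pr)
        return total

    def pick_alpha(p1, p2, heldout_events, grid=tuple(k / 10 for k in range(11))):
        events = list(heldout_events)
        return max(grid, key=lambda a: heldout_loglik(interpolate(p1, p2, a), events))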

Algorithms To Know

- Score a sentence x
- Train from a corpus x_{1:n}
- Sample a sentence given θ (sketched below)
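A sketch of sampling from a bigram model with parameters θ, in Python; it assumes the dictionary-of-dictionaries representation produced by the earlier training sketch, with "<s>" and "<stop>" boundary symbols:

    import random

    def sample_sentence(theta, max_len=50):
        # Assumes every non-stop word observed in training also appears as a history.
        sentence, history = [], "<s>"
        for _ in range(max_len):
            words, probs = zip(*theta[history].items())
            word = random.choices(words, weights=probs)[0]
            if word == "<stop>":
                break
            sentence.append(word)
            history = word
        return sentence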

n-gram Models: Assessment

Pros:

- Easy to understand
- Cheap (with modern hardware; Lin and Dyer, 2010)
- Good enough for machine translation, speech recognition, . . .

Cons:

- Markov assumption is linguistically inaccurate
  - (But not as bad as unigram models!)
- Data sparseness; high variance in the estimator
- "Out of vocabulary" problem

Dealing with Out-of-Vocabulary Terms

- Define a special OOV or "unknown" symbol unk. Transform some (or all) rare words in the training data to unk (sketched below).
  - You cannot fairly compare two language models that apply different unk treatments!
- Build a language model at the character level.
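A sketch of the unk treatment from the first bullet, in Python; the count threshold and the "<unk>" spelling are illustrative choices:

    from collections import Counter

    def unkify(training_sentences, min_count=2):
        # Replace rare training words with a single "<unk>" symbol so the model
        # reserves probability mass for words it will not have seen.
        counts = Counter(w for sent in training_sentences for w in sent)
        vocab = {w for w, c in counts.items() if c >= min_count}
        replaced = [[w if w in vocab else "<unk>" for w in sent]
                    for sent in training_sentences]
        return replaced, vocab | {"<unk>"}

    sents = [["i", "want", "ice", "cream"], ["i", "want", "cake"]]
    replaced, vocab = unkify(sents)
    print(replaced)   # words seen only once become "<unk>"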

What's wrong with n-grams?

Data sparseness: most histories and most words will be seen only rarely (if at all).

Next central idea: teach histories and words how to share.

Log-Linear Models: Definitions

We define a conditional log-linear model p(Y | X) as:

- Y is the set of events/outputs (for language modeling, V)
- X is the set of contexts/inputs (for n-gram language modeling, V^{n-1})
- φ : X × Y → R^d is a feature vector function
- w ∈ R^d are the model parameters

p_w(Y = y | X = x) = exp(w · φ(x, y)) / ∑_{y′ ∈ Y} exp(w · φ(x, y′))
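A sketch of this conditional probability in Python; the feature function and weights in the toy example are made up for illustration, not from the slides:

    import math

    def loglinear_prob(w, phi, x, y, Y):
        # p_w(y | x) = exp(w . phi(x, y)) / sum_{y'} exp(w . phi(x, y'))
        def score(y_prime):
            return sum(w.get(f, 0.0) * v for f, v in phi(x, y_prime).items())
        scores = {y_prime: score(y_prime) for y_prime in Y}
        m = max(scores.values())                  # subtract max for numerical stability
        Z = sum(math.exp(s - m) for s in scores.values())
        return math.exp(scores[y] - m) / Z

    # Toy example: history x = ("the",), candidate next words Y.
    phi = lambda x, y: {"bigram:" + x[-1] + "_" + y: 1.0, "unigram:" + y: 1.0}
    w = {"bigram:the_cat": 1.5, "unigram:cat": 0.5, "unigram:dog": 0.2}
    print(loglinear_prob(w, phi, ("the",), "cat", ["cat", "dog", "a"]))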

References I

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proc. of EMNLP-CoNLL, 2007.

Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.

Michael Collins. Log-linear models, MEMMs, and CRFs, 2011. URL http://www.cs.columbia.edu/~mcollins/crf.pdf.

Julia Hirschberg and Christopher D. Manning. Advances in natural language processing. Science, 349(6245):261–266, 2015. URL https://www.sciencemag.org/content/349/6245/261.full.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, second edition, 2008.

Daniel Jurafsky and James H. Martin. N-grams (draft chapter), 2016. URL https://web.stanford.edu/~jurafsky/slp3/4.pdf.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, third edition, forthcoming. URL https://web.stanford.edu/~jurafsky/slp3/.

Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan and Claypool, 2010.

Noah A. Smith. Probabilistic language models 1.0, 2017. URL http://homes.cs.washington.edu/~nasmith/papers/plm.17.pdf.