Sentence processing (Emily Morgan, LSA 2019 Summer Institute, UC Davis)


Page 1: Sentence processing - Linguistic Society

Sentence processing
Emily Morgan

LSA 2019 Summer Institute
UC Davis

Page 2: Sentence processing - Linguistic Society

Some sentences are harder to process than others.

• More predictable words are:
  • read faster
  • more likely to be skipped over in reading
  • less likely to evoke regressions in reading

• They also evoke distinctive neural responses, measured using e.g. Event-Related Potentials (ERPs)

2

He bought her a pearl necklace for her… birthday / collection

Page 3: Sentence processing - Linguistic Society

ERP responses to more/less predictable words

3

N400

(Kutas & Hillyard, 1984)

Page 4: Sentence processing - Linguistic Society

Processing difficulty arises from syntactic structure as well as word choice
• The complex houses married children and their families.
• The warehouse fires a dozen employees every year.
• The old man the boat.
• The prime number few.
• The horse raced past the barn fell.
(These are all garden path sentences, of varying severity.)

4

Page 5: Sentence processing - Linguistic Society

• This is the cat that the dog chased.
• This is the rat that the cat killed. (Which cat?)
• This is the rat that the cat that the dog chased killed.
• This is the cheese that the rat ate. (Which rat?)
• This is the cheese that the rat that the cat that the dog chased killed ate.

5

Page 6: Sentence processing - Linguistic Society

Sentences are full of ambiguities
One morning I shot an elephant in my pajamas.

How he got in my pajamas I’ll never know.

(Groucho Marx)

(Ford et al., 1982)

The woman discussed the dogs on the beach.

• What does on the beach modify?
  • dogs (90%); discussed (10%)

The woman kept the dogs on the beach.

• What does on the beach modify?
  • kept (95%); dogs (5%)

6

Page 7: Sentence processing - Linguistic Society

Sentence processing is incremental
• i.e. Comprehenders don’t wait until they have the full sentence to process it

7

The boy will eat…
vs.
The boy will move…

Page 8: Sentence processing - Linguistic Society

Theoretical questions
• Why are some sentences easier/harder to process than others?
• How does a comprehender rapidly disambiguate between possible interpretations?
• (Noting that both processing difficulty and disambiguation occur incrementally/with incomplete input)

8

Page 9: Sentence processing - Linguistic Society

Possible factors influencing both processing difficulty and ambiguity resolution
• Memory constraints

• This is the cheese that the rat that the cat that the dog chased killed ate.

• Whoᵢ did you hope that the candidate said that he admired ___ᵢ?

• Expectations
  • i.e. how predictable is a word/grammatical structure, given the context (preceding words, real-world context, etc.)
  • He gave her a pearl necklace for her birthday/collection.

In order to test/disentangle these possibilities, we need to be able to model the predictability of words and syntactic structures in sentences

9

Page 10: Sentence processing - Linguistic Society

Outline
• Different models of sentence probability
  • n-grams
  • Probabilistic Context-Free Grammars
  • Recurrent Neural Networks
• Applying these models to psycholinguistic questions
  • How does a comprehender rapidly disambiguate between possible interpretations?
  • Why are some sentences easier/harder to process than others?

10

Page 11: Sentence processing - Linguistic Society

Introduction to N-grams

Slides from Dan Jurafsky

(Stanford University Natural Language Processing group)

Language Modeling

Page 12: Sentence processing - Linguistic Society

Dan Jurafsky

Probabilistic Language Models

• Today’s goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
• Spell Correction:
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition:
  • P(I saw a van) >> P(eyes awe of an)

• + Summarization, question-answering, etc., etc.!!

Why?

Page 13: Sentence processing - Linguistic Society

Dan Jurafsky

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)

• A model that computes either of these, P(W) or P(wn|w1,w2…wn−1), is called a language model.

• Better: the grammar. But language model or LM is standard.

Page 14: Sentence processing - Linguistic Society

Dan Jurafsky

How to compute P(W)

• How to compute this joint probability:

• P(its, water, is, so, transparent, that)

• Intuition: let’s rely on the Chain Rule of Probability

Page 15: Sentence processing - Linguistic Society

Dan Jurafsky

Reminder: The Chain Rule

• Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A). Rewriting: P(A,B) = P(A)P(B|A)

• More variables:P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

• The Chain Rule in GeneralP(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

Page 16: Sentence processing - Linguistic Society

Dan Jurafsky

The Chain Rule applied to compute joint probability of words in sentence

P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)

$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})$

Page 17: Sentence processing - Linguistic Society

Dan Jurafsky

How to estimate these probabilities

• Could we just count and divide?

• No! Too many possible sentences!
• We’ll never see enough data for estimating these

$P(\text{the} \mid \text{its water is so transparent that}) = \frac{\text{Count(its water is so transparent that the)}}{\text{Count(its water is so transparent that)}}$

Page 18: Sentence processing - Linguistic Society

Dan Jurafsky

Markov Assumption
• Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)

Andrei Markov

Page 19: Sentence processing - Linguistic Society

Dan Jurafsky

Markov Assumption

• In other words, we approximate each component in the product

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$

$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$

Page 20: Sentence processing - Linguistic Society

Dan Jurafsky

Simplest case: Unigram model

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a unigram model

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$

Page 21: Sentence processing - Linguistic Society

Dan Jurafsky

Condition on the previous word:

Bigram model

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november

P(wi |w1w2…wi−1) ≈ P(wi |wi−1)

Page 22: Sentence processing - Linguistic Society

Dan Jurafsky

N-gram models

• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:

“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”

• But we can often get away with N-gram models

Page 23: Sentence processing - Linguistic Society

Introduction to N-grams

Language Modeling

Page 24: Sentence processing - Linguistic Society

Estimating N-gram Probabilities

Language Modeling

Page 25: Sentence processing - Linguistic Society

Dan Jurafsky

Estimating bigram probabilities

• Relative frequency estimation

$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$

$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

Page 26: Sentence processing - Linguistic Society

Dan Jurafsky

An example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
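To make the relative-frequency estimates concrete, here is a minimal Python sketch (not from the original slides; the corpus is the one above and the function names are my own) that counts bigrams, estimates P(wᵢ | wᵢ₋₁), and scores a sentence under the bigram (Markov) approximation.

```python
from collections import defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count bigrams and history (previous-word) occurrences.
bigram_counts = defaultdict(int)
history_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        history_counts[prev] += 1

def bigram_prob(prev, curr):
    """Relative-frequency estimate: P(curr | prev) = c(prev, curr) / c(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / history_counts[prev]

def sentence_prob(sentence):
    """Bigram (Markov) approximation to the chain rule: product of P(w_i | w_{i-1})."""
    p = 1.0
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, curr)
    return p

print(bigram_prob("<s>", "I"))             # 2/3
print(bigram_prob("I", "am"))              # 2/3
print(sentence_prob("<s> I am Sam </s>"))  # 2/3 * 2/3 * 1/2 * 1/2 ≈ 0.111
```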

Page 27: Sentence processing - Linguistic Society

Dan Jurafsky

More examples: Berkeley Restaurant Project sentences

• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day

Page 28: Sentence processing - Linguistic Society

Dan Jurafsky

Raw bigram counts

• Out of 9222 sentences

Page 29: Sentence processing - Linguistic Society

Dan Jurafsky

Raw bigram probabilities

• Normalize by unigrams:

• Result:

Page 30: Sentence processing - Linguistic Society

Dan Jurafsky

Bigram estimates of sentence probabilities

P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)

= .000031

Page 31: Sentence processing - Linguistic Society

Dan Jurafsky

What kinds of knowledge?

• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25

Page 32: Sentence processing - Linguistic Society

Dan Jurafsky

Google N-Gram Release, August 2006

Page 33: Sentence processing - Linguistic Society

Dan Jurafsky

Google N-Gram Release

• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 34: Sentence processing - Linguistic Society

Dan Jurafsky

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 35: Sentence processing - Linguistic Society

Estimating N-gram Probabilities

Language Modeling

Page 36: Sentence processing - Linguistic Society

N-grams as language models for computational psycholinguistics

• Advantages
  • Relatively easy to calculate
  • Do a surprisingly good job (given how simple they are) at predicting empirical behavior (as we’ll see later)
• Disadvantages
  • Can’t capture long-distance dependencies
  • Don’t represent any underlying linguistic structure, e.g. syntax

38

Page 37: Sentence processing - Linguistic Society

Grammars
• A grammar is a structured set of production rules
• Most commonly used for syntactic description, but also used in semantics, phonology, etc.
  • e.g. Context-Free Grammars/Phrase Structure Rules
• A grammar licenses a derivation if all the derivation’s rules are present in the grammar

39

[figure: two example derivations, one licensed by the grammar (OK) and one not (X)]

Page 38: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)
• Formally, a Context-Free Grammar (CFG) consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; N → dog; etc.

40

Page 39: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)
• A CFG derivation starts with the start symbol (e.g. S) and recursively expands the non-terminal categories using rules in the grammar
• The resulting tree is called the derivation tree

41

Page 40: Sentence processing - Linguistic Society

CFG example
S → NP VP
NP → Det N
NP → NP PP
PP → P NP
VP → V
Det → the
N → dog
N → cat
P → near
V → growled

Here is a derivation and the resulting derivation tree:
[derivation tree figure]

42
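To make the notion of a derivation concrete, here is a small Python sketch (my own illustration, not from the slides; the rule numbering and helper names are invented) that performs a leftmost derivation with the grammar above, applying an explicit sequence of rules and printing each intermediate string. The derived sentence is the same one used in the PCFG example below.

```python
# The example CFG; non-terminals are uppercase labels, terminals are words.
grammar_rules = {
    1: ("S",   ["NP", "VP"]),
    2: ("NP",  ["Det", "N"]),
    3: ("NP",  ["NP", "PP"]),
    4: ("PP",  ["P", "NP"]),
    5: ("VP",  ["V"]),
    6: ("Det", ["the"]),
    7: ("N",   ["dog"]),
    8: ("N",   ["cat"]),
    9: ("P",   ["near"]),
    10: ("V",  ["growled"]),
}

def is_nonterminal(symbol):
    return symbol[0].isupper()

def leftmost_derive(start, rule_sequence):
    """Apply rules in order, always rewriting the leftmost non-terminal."""
    string = [start]
    print(" ".join(string))
    for rule_id in rule_sequence:
        lhs, rhs = grammar_rules[rule_id]
        i = next(j for j, s in enumerate(string) if is_nonterminal(s))
        assert string[i] == lhs, f"rule {rule_id} does not rewrite {string[i]}"
        string = string[:i] + rhs + string[i + 1:]
        print(" ".join(string))
    return string

# S => NP VP => NP PP VP => Det N PP VP => ... => the dog near the cat growled
leftmost_derive("S", [1, 3, 2, 6, 7, 4, 9, 2, 6, 8, 5, 10])
```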

Page 41: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)

43

• CFGs can tell us about which trees and sentences are/are not licensed by the grammar
• But they don’t tell us anything about which trees and sentences are more probable
• So we augment them with probabilities
→ Probabilistic Context-Free Grammars (PCFGs)

Page 42: Sentence processing - Linguistic Society

Probabilistic Context-Free Grammars (PCFGs)
• Formally, a PCFG consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; etc.

• probabilities for each rule such that for each non-terminal X, the sum of the probabilities of all rules with X on the left-hand side = 1, i.e.:

$\sum_{(X \to Y_1 \ldots Y_n)\,\in\,\mathrm{Rules}} P(X \to Y_1 \ldots Y_n) = 1$ (for each non-terminal X)

44

Page 43: Sentence processing - Linguistic Society

Example PCFG
1    S → NP VP
0.8  NP → Det N
0.2  NP → NP PP
1    PP → P NP
1    VP → V
1    Det → the
0.5  N → dog
0.5  N → cat
1    P → near
1    V → growled

[derivation tree for “the dog near the cat growled”, with each rule application annotated with its probability]

P(T) = 1 × 0.2 × 0.8 × 1 × 0.5 × 1 × 1 × 0.8 × 1 × 0.5 × 1 × 1 = 0.032

45

Page 44: Sentence processing - Linguistic Society


46

P(T) for the same tree, expanded rule by rule:
P(T) = P(S → NP VP) × P(NP → NP PP) × P(NP → Det N) × P(Det → the) × P(N → dog) × P(PP → P NP) × P(P → near) × P(NP → Det N) × P(Det → the) × P(N → cat) × P(VP → V) × P(V → growled)
= 1 × 0.2 × 0.8 × 1 × 0.5 × 1 × 1 × 0.8 × 1 × 0.5 × 1 × 1 = 0.032
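A minimal Python sketch of the same computation (mine, not from the slides): list the rules used in the derivation tree and multiply their probabilities.

```python
# Rule probabilities from the example PCFG above.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")):  1.0,
    ("VP", ("V",)):       1.0,
    ("Det", ("the",)):    1.0,
    ("N", ("dog",)):      0.5,
    ("N", ("cat",)):      0.5,
    ("P", ("near",)):     1.0,
    ("V", ("growled",)):  1.0,
}

# Rules used in the derivation tree for "the dog near the cat growled",
# read off the tree top-down and left to right.
derivation = [
    ("S",  ("NP", "VP")),
    ("NP", ("NP", "PP")),
    ("NP", ("Det", "N")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("PP", ("P", "NP")),
    ("P", ("near",)),
    ("NP", ("Det", "N")),
    ("Det", ("the",)),
    ("N", ("cat",)),
    ("VP", ("V",)),
    ("V", ("growled",)),
]

p_tree = 1.0
for rule in derivation:
    p_tree *= rule_prob[rule]

print(p_tree)  # 0.2 * 0.8 * 0.5 * 0.8 * 0.5 = 0.032
```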


Page 46: Sentence processing - Linguistic Society

Estimating PCFG probabilities
• Relative frequency estimation
• We need a syntactically annotated dataset, aka a Treebank
• Fortunately, these exist for English and various other languages
• But constructing these datasets is much more difficult/time-consuming than simply collecting a corpus of unannotated text

$P(\mathrm{LHS} \to \mathrm{RHS}) = \frac{\mathrm{count}(\mathrm{LHS} \to \mathrm{RHS})}{\mathrm{count}(\mathrm{LHS})}$

48
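Here is a small Python sketch of relative-frequency estimation (my own illustration; the two-tree "treebank" and the nested-tuple tree encoding are invented for the example): count every local LHS → RHS configuration in the trees, then divide by the counts of each LHS. Real treebanks are vastly larger, but the counting logic is the same.

```python
from collections import defaultdict

# Each tree is a nested tuple: (label, child1, child2, ...); leaves are strings.
toy_treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "dog")), ("VP", ("V", "growled"))),
    ("S", ("NP", ("Det", "the"), ("N", "cat")), ("VP", ("V", "growled"))),
]

rule_counts = defaultdict(int)  # counts of LHS -> RHS
lhs_counts = defaultdict(int)   # counts of LHS

def count_rules(tree):
    """Recursively count every local LHS -> RHS configuration in a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child)

for tree in toy_treebank:
    count_rules(tree)

# P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
for (lhs, rhs), c in sorted(rule_counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {c / lhs_counts[lhs]:.2f}")
```

With these two trees, N → dog and N → cat each come out at 0.5, and every other rule at 1.0.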

Page 47: Sentence processing - Linguistic Society

An example

49

Page 48: Sentence processing - Linguistic Society

Recurrent Neural Networks (RNNs)
• A type of connectionist model
• Rely on emergent representations
  • Unlike symbolic models (including n-grams and PCFGs), where we define the symbols and the rules,
  • RNNs are trained on huge amounts of data and develop their own representations that maximize their fit to the training data

• Today’s state-of-the-art language models
  • Used in machine translation, speech-to-text, etc.

50
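For concreteness, here is a minimal PyTorch sketch of a recurrent language model (a generic illustration, not any specific architecture discussed in the course; the class name, dimensions, and toy vocabulary size are arbitrary): embed each word, run a recurrent layer (here an LSTM, a standard RNN variant) over the sequence, and map each hidden state to a distribution over the vocabulary for the next word.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predicts a distribution over the next word at every position."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):
        # word_ids: (batch, sequence_length) integer word indices
        embedded = self.embed(word_ids)        # (batch, seq, embed_dim)
        hidden_states, _ = self.rnn(embedded)  # (batch, seq, hidden_dim)
        logits = self.out(hidden_states)       # (batch, seq, vocab_size)
        return logits

# Toy usage: next-word distributions for a batch of two 5-word "sentences".
model = RNNLanguageModel(vocab_size=1000)
word_ids = torch.randint(0, 1000, (2, 5))
logits = model(word_ids)
next_word_probs = torch.softmax(logits, dim=-1)  # P(next word | words so far)
print(next_word_probs.shape)  # torch.Size([2, 5, 1000])
```

Training would minimize cross-entropy between the predicted distribution at each position and the word that actually comes next; the learned embeddings and hidden states are the emergent representations referred to above.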

Page 49: Sentence processing - Linguistic Society

Recurrent Neural Networks (RNNs)
• They’re a black box: we don’t understand their internal representations or why they work
  • This makes them harder to use for computational psycholinguistics
• If we understood them better, maybe they would tell us something about human language processing
  • We’ll return to this in the last week of the course

51

Page 50: Sentence processing - Linguistic Society

So far: Language models
• n-grams
• Probabilistic Context-Free Grammars (PCFGs)
• (Recurrent Neural Networks; RNNs)

How can we use these models to investigate:
• How do comprehenders rapidly disambiguate ambiguous sentences?
• What makes words and sentences easier/more difficult to process?

52
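As a pointer toward the next step (my illustration, not a slide): given any of these language models, we can compute how predictable each word is in its context. A common move in the psycholinguistic literature is to convert that probability to surprisal (negative log probability) and use it as a word-by-word predictor of reading times. The sketch below does this with the bigram estimates from the earlier "I am Sam" example.

```python
import math

# Bigram probabilities as estimated (by relative frequency) from the
# "I am Sam" corpus in the earlier n-gram sketch.
example_bigram_probs = {
    ("<s>", "I"): 2 / 3,
    ("I", "am"): 2 / 3,
    ("am", "Sam"): 1 / 2,
    ("Sam", "</s>"): 1 / 2,
}

def word_by_word_predictability(sentence, bigram_probs):
    """Return (word, P(word | previous word), surprisal in bits) for each word."""
    tokens = sentence.split()
    rows = []
    for prev, curr in zip(tokens, tokens[1:]):
        p = bigram_probs.get((prev, curr), 0.0)
        surprisal = -math.log2(p) if p > 0 else float("inf")
        rows.append((curr, p, surprisal))
    return rows

for word, p, s in word_by_word_predictability("<s> I am Sam </s>", example_bigram_probs):
    print(f"{word:>6}  P = {p:.3f}  surprisal = {s:.2f} bits")
```

Less predictable words get higher surprisal; the question the slides raise is how well such model-derived predictability tracks the reading-time and ERP effects described at the start.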