Sentence processing (Emily Morgan, LSA 2019 Summer Institute, UC Davis)


Page 1: Sentence processing - Linguistic Society

Sentence processing
Emily Morgan

LSA 2019 Summer Institute
UC Davis

Page 2: Sentence processing - Linguistic Society

Some sentences are harder to process than others.

• More predictable words are:
  • read faster
  • more likely to be skipped over in reading
  • less likely to evoke regressions in reading

• They also evoke distinctive neural responses, measured using e.g. Event-Related Potentials (ERPs)

2

He bought her a pearl necklace for her… birthday / collection

Page 3: Sentence processing - Linguistic Society

ERP responses to more/less predictable words

3

N400

(Kutas & Hillyard, 1984)

Page 4: Sentence processing - Linguistic Society

Processing difficulty arises from syntactic structure as well as word choice
• The complex houses married children and their families.
• The warehouse fires a dozen employees every year.
• The old man the boat.
• The prime number few.
• The horse raced past the barn fell.
(These are all garden path sentences, of varying severity.)

4

Page 5: Sentence processing - Linguistic Society

• This is the cat that the dog chased.
• This is the rat that the cat killed. (Which cat?)
• This is the rat that the cat that the dog chased killed.
• This is the cheese that the rat ate. (Which rat?)
• This is the cheese that the rat that the cat that the dog chased killed ate.

5

Page 6: Sentence processing - Linguistic Society

Sentences are full of ambiguities
One morning I shot an elephant in my pajamas.

How he got in my pajamas I’ll never know.

(Groucho Marx)

(Ford et al., 1982)

The woman discussed the dogs on the beach.

• What does on the beach modify?
  • dogs (90%); discussed (10%)

The woman kept the dogs on the beach.

• What does on the beach modify?
  • kept (95%); dogs (5%)

6

Page 7: Sentence processing - Linguistic Society

Sentence processing is incremental
• i.e. Comprehenders don’t wait until they have the full sentence to process it

7

The boy will eat…
vs.
The boy will move…

Page 8: Sentence processing - Linguistic Society

Theoretical questions
• Why are some sentences easier/harder to process than others?
• How does a comprehender rapidly disambiguate between possible interpretations?
• (Noting that both processing difficulty and disambiguation occur incrementally/with incomplete input)

8

Page 9: Sentence processing - Linguistic Society

Possible factors influencing both processing difficulty and ambiguity resolution
• Memory constraints

• This is the cheese that the rat that the cat that the dog chased killed ate.

• Whoᵢ did you hope that the candidate said that he admired ___ᵢ?

• Expectations
  • i.e. how predictable is a word/grammatical structure, given the context (preceding words, real-world context, etc.)
  • He gave her a pearl necklace for her birthday/collection.

In order to test/disentangle these possibilities, we need to be able to model the predictability of words and syntactic structures in sentences

9

Page 10: Sentence processing - Linguistic Society

Outline
• Different models of sentence probability
  • n-grams
  • Probabilistic Context-Free Grammars
  • Recurrent Neural Networks
• Applying these models to psycholinguistic questions
  • How does a comprehender rapidly disambiguate between possible interpretations?
  • Why are some sentences easier/harder to process than others?

10

Page 11: Sentence processing - Linguistic Society

Introduction to N-grams

Slides from Dan Jurafsky

(Stanford University Natural Language Processing group)

Language Modeling

Page 12: Sentence processing - Linguistic Society

Dan Jurafsky

Probabilistic Language Models

• Today’s goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
• Spell Correction:
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition:
  • P(I saw a van) >> P(eyes awe of an)

• + Summarization, question-answering, etc., etc.!!

Why?

Page 13: Sentence processing - Linguistic Society

Dan Jurafsky

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)

• A model that computes either of these, P(W) or P(wn|w1,w2…wn−1), is called a language model.

• Better: the grammar. But language model or LM is standard.

Page 14: Sentence processing - Linguistic Society

Dan Jurafsky

How to compute P(W)

• How to compute this joint probability:

• P(its, water, is, so, transparent, that)

• Intuition: let’s rely on the Chain Rule of Probability

Page 15: Sentence processing - Linguistic Society

Dan Jurafsky

Reminder: The Chain Rule

• Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A). Rewriting: P(A,B) = P(A)P(B|A)

• More variables:P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

• The Chain Rule in GeneralP(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

Page 16: Sentence processing - Linguistic Society

Dan Jurafsky

The Chain Rule applied to compute joint probability of words in sentence

P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)

$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})$

Page 17: Sentence processing - Linguistic Society

Dan Jurafsky

How to estimate these probabilities

• Could we just count and divide?

• No! Too many possible sentences!
• We’ll never see enough data for estimating these

$P(\text{the} \mid \text{its water is so transparent that}) = \frac{\text{Count(its water is so transparent that the)}}{\text{Count(its water is so transparent that)}}$

Page 18: Sentence processing - Linguistic Society

Dan Jurafsky

Markov Assumption
• Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)

Andrei Markov

Page 19: Sentence processing - Linguistic Society

Dan Jurafsky

Markov Assumption

• In other words, we approximate each component in the product

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$

$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$

Page 20: Sentence processing - Linguistic Society

Dan Jurafsky

Simplest case: Unigram model

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a unigram model

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$

Page 21: Sentence processing - Linguistic Society

Dan Jurafsky

Condition on the previous word:

Bigram model

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november

P(wi |w1w2…wi−1) ≈ P(wi |wi−1)

Page 22: Sentence processing - Linguistic Society

Dan Jurafsky

N-gram models

• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:

“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”

• But we can often get away with N-gram models

Page 23: Sentence processing - Linguistic Society

Introduction to N-grams

Language Modeling

Page 24: Sentence processing - Linguistic Society

Estimating N-gram Probabilities

Language Modeling

Page 25: Sentence processing - Linguistic Society

Dan Jurafsky

Estimating bigram probabilities

• Relative frequency estimation

$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$

$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

Page 26: Sentence processing - Linguistic Society

Dan Jurafsky

An example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
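To make the relative-frequency estimates concrete, here is a minimal Python sketch (not from the original slides; the corpus is the one above and the function names are my own) that counts bigrams, estimates P(wᵢ | wᵢ₋₁), and scores a sentence under the bigram (Markov) approximation.

```python
from collections import defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count bigrams and history (previous-word) occurrences.
bigram_counts = defaultdict(int)
history_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        history_counts[prev] += 1

def bigram_prob(prev, curr):
    """Relative-frequency estimate: P(curr | prev) = c(prev, curr) / c(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / history_counts[prev]

def sentence_prob(sentence):
    """Bigram (Markov) approximation to the chain rule: product of P(w_i | w_{i-1})."""
    p = 1.0
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, curr)
    return p

print(bigram_prob("<s>", "I"))             # 2/3
print(bigram_prob("I", "am"))              # 2/3
print(sentence_prob("<s> I am Sam </s>"))  # 2/3 * 2/3 * 1/2 * 1/2 ≈ 0.111
```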

Page 27: Sentence processing - Linguistic Society

Dan Jurafsky

More examples: Berkeley Restaurant Project sentences

• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day

Page 28: Sentence processing - Linguistic Society

Dan Jurafsky

Raw bigram counts

• Out of 9222 sentences

Page 29: Sentence processing - Linguistic Society

Dan Jurafsky

Raw bigram probabilities

• Normalize by unigrams:

• Result:

Page 30: Sentence processing - Linguistic Society

Dan Jurafsky

Bigram estimates of sentence probabilities

P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)

= .000031

Page 31: Sentence processing - Linguistic Society

Dan Jurafsky

What kinds of knowledge?

• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25

Page 32: Sentence processing - Linguistic Society

Dan Jurafsky

Google N-Gram Release, August 2006

Page 33: Sentence processing - Linguistic Society

Dan Jurafsky

Google N-Gram Release

• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 34: Sentence processing - Linguistic Society

Dan Jurafsky

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 35: Sentence processing - Linguistic Society

Estimating N-gram Probabilities

Language Modeling

Page 36: Sentence processing - Linguistic Society

N-grams as language models for computational psycholinguistics

• Advantages
  • Relatively easy to calculate
  • Do a surprisingly good job (given how simple they are) at predicting empirical behavior (as we’ll see later)
• Disadvantages
  • Can’t capture long-distance dependencies
  • Don’t represent any underlying linguistic structure, e.g. syntax

38

Page 37: Sentence processing - Linguistic Society

Grammars
• A grammar is a structured set of production rules
• Most commonly used for syntactic description, but also used in semantics, phonology, etc.
  • e.g. Context-Free Grammars/Phrase Structure Rules
• A grammar licenses a derivation if all the derivation’s rules are present in the grammar

39

[figure: two example derivations, one licensed by the grammar (OK) and one not (X)]

Page 38: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)
• Formally, a Context-Free Grammar (CFG) consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; N → dog; etc.

40

Page 39: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)
• A CFG derivation starts with the start symbol (e.g. S) and recursively expands the non-terminal categories using rules in the grammar
• The resulting tree is called the derivation tree

41

Page 40: Sentence processing - Linguistic Society

CFG example
S → NP VP
NP → Det N
NP → NP PP
PP → P NP
VP → V
Det → the
N → dog
N → cat
P → near
V → growled

Here is a derivation and the resulting derivation tree:
[derivation tree figure]

42
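To make the notion of a derivation concrete, here is a small Python sketch (my own illustration, not from the slides; the rule numbering and helper names are invented) that performs a leftmost derivation with the grammar above, applying an explicit sequence of rules and printing each intermediate string. The derived sentence is the same one used in the PCFG example below.

```python
# The example CFG; non-terminals are uppercase labels, terminals are words.
grammar_rules = {
    1: ("S",   ["NP", "VP"]),
    2: ("NP",  ["Det", "N"]),
    3: ("NP",  ["NP", "PP"]),
    4: ("PP",  ["P", "NP"]),
    5: ("VP",  ["V"]),
    6: ("Det", ["the"]),
    7: ("N",   ["dog"]),
    8: ("N",   ["cat"]),
    9: ("P",   ["near"]),
    10: ("V",  ["growled"]),
}

def is_nonterminal(symbol):
    return symbol[0].isupper()

def leftmost_derive(start, rule_sequence):
    """Apply rules in order, always rewriting the leftmost non-terminal."""
    string = [start]
    print(" ".join(string))
    for rule_id in rule_sequence:
        lhs, rhs = grammar_rules[rule_id]
        i = next(j for j, s in enumerate(string) if is_nonterminal(s))
        assert string[i] == lhs, f"rule {rule_id} does not rewrite {string[i]}"
        string = string[:i] + rhs + string[i + 1:]
        print(" ".join(string))
    return string

# S => NP VP => NP PP VP => Det N PP VP => ... => the dog near the cat growled
leftmost_derive("S", [1, 3, 2, 6, 7, 4, 9, 2, 6, 8, 5, 10])
```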

Page 41: Sentence processing - Linguistic Society

Context-Free Grammars (CFGs)

43

• CFGs can tell us about which trees and sentences are/are not licensed by the grammar
• But they don’t tell us anything about which trees and sentences are more probable
• So we augment them with probabilities
→ Probabilistic Context-Free Grammars (PCFGs)

Page 42: Sentence processing - Linguistic Society

Probabilistic Context-Free Grammars (PCFGs)
• Formally, a PCFG consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; etc.

• probabilities for each rule such that for each non-terminal X, the sum of the probabilities of all rules with X on the left-hand side = 1, i.e.:

$\sum_{(X \to Y_1 \ldots Y_n)\,\in\,\mathrm{Rules}} P(X \to Y_1 \ldots Y_n) = 1$ (for each non-terminal X)

44

Page 43: Sentence processing - Linguistic Society

Example PCFG
1    S → NP VP
0.8  NP → Det N
0.2  NP → NP PP
1    PP → P NP
1    VP → V
1    Det → the
0.5  N → dog
0.5  N → cat
1    P → near
1    V → growled

[derivation tree for “the dog near the cat growled”, with each rule application annotated with its probability]

P(T) = 1 × 0.2 × 0.8 × 1 × 0.5 × 1 × 1 × 0.8 × 1 × 0.5 × 1 × 1 = 0.032

45

Page 44: Sentence processing - Linguistic Society


46

P(T) for the same tree, expanded rule by rule:
P(T) = P(S → NP VP) × P(NP → NP PP) × P(NP → Det N) × P(Det → the) × P(N → dog) × P(PP → P NP) × P(P → near) × P(NP → Det N) × P(Det → the) × P(N → cat) × P(VP → V) × P(V → growled)
= 1 × 0.2 × 0.8 × 1 × 0.5 × 1 × 1 × 0.8 × 1 × 0.5 × 1 × 1 = 0.032
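A minimal Python sketch of the same computation (mine, not from the slides): list the rules used in the derivation tree and multiply their probabilities.

```python
# Rule probabilities from the example PCFG above.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")):  1.0,
    ("VP", ("V",)):       1.0,
    ("Det", ("the",)):    1.0,
    ("N", ("dog",)):      0.5,
    ("N", ("cat",)):      0.5,
    ("P", ("near",)):     1.0,
    ("V", ("growled",)):  1.0,
}

# Rules used in the derivation tree for "the dog near the cat growled",
# read off the tree top-down and left to right.
derivation = [
    ("S",  ("NP", "VP")),
    ("NP", ("NP", "PP")),
    ("NP", ("Det", "N")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("PP", ("P", "NP")),
    ("P", ("near",)),
    ("NP", ("Det", "N")),
    ("Det", ("the",)),
    ("N", ("cat",)),
    ("VP", ("V",)),
    ("V", ("growled",)),
]

p_tree = 1.0
for rule in derivation:
    p_tree *= rule_prob[rule]

print(p_tree)  # 0.2 * 0.8 * 0.5 * 0.8 * 0.5 = 0.032
```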


Page 46: Sentence processing - Linguistic Society

Estimating PCFG probabilities
• Relative frequency estimation
• We need a syntactically annotated dataset, aka a Treebank
• Fortunately, these exist for English and various other languages
• But constructing these datasets is much more difficult/time-consuming than simply collecting a corpus of unannotated text

$P(\mathrm{LHS} \to \mathrm{RHS}) = \frac{\mathrm{count}(\mathrm{LHS} \to \mathrm{RHS})}{\mathrm{count}(\mathrm{LHS})}$

48
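Here is a small Python sketch of relative-frequency estimation (my own illustration; the two-tree "treebank" and the nested-tuple tree encoding are invented for the example): count every local LHS → RHS configuration in the trees, then divide by the counts of each LHS. Real treebanks are vastly larger, but the counting logic is the same.

```python
from collections import defaultdict

# Each tree is a nested tuple: (label, child1, child2, ...); leaves are strings.
toy_treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "dog")), ("VP", ("V", "growled"))),
    ("S", ("NP", ("Det", "the"), ("N", "cat")), ("VP", ("V", "growled"))),
]

rule_counts = defaultdict(int)  # counts of LHS -> RHS
lhs_counts = defaultdict(int)   # counts of LHS

def count_rules(tree):
    """Recursively count every local LHS -> RHS configuration in a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child)

for tree in toy_treebank:
    count_rules(tree)

# P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
for (lhs, rhs), c in sorted(rule_counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {c / lhs_counts[lhs]:.2f}")
```

With these two trees, N → dog and N → cat each come out at 0.5, and every other rule at 1.0.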

Page 47: Sentence processing - Linguistic Society

An example

49

Page 48: Sentence processing - Linguistic Society

Recurrent Neural Networks (RNNs)
• A type of connectionist model
• Rely on emergent representations
  • Unlike symbolic models (including n-grams and PCFGs), where we define the symbols and the rules,
  • RNNs are trained on huge amounts of data and develop their own representations that maximize their fit to the training data

• Today’s state-of-the-art language models
  • Used in machine translation, speech-to-text, etc.

50
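For concreteness, here is a minimal PyTorch sketch of a recurrent language model (a generic illustration, not any specific architecture discussed in the course; the class name, dimensions, and toy vocabulary size are arbitrary): embed each word, run a recurrent layer (here an LSTM, a standard RNN variant) over the sequence, and map each hidden state to a distribution over the vocabulary for the next word.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predicts a distribution over the next word at every position."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):
        # word_ids: (batch, sequence_length) integer word indices
        embedded = self.embed(word_ids)        # (batch, seq, embed_dim)
        hidden_states, _ = self.rnn(embedded)  # (batch, seq, hidden_dim)
        logits = self.out(hidden_states)       # (batch, seq, vocab_size)
        return logits

# Toy usage: next-word distributions for a batch of two 5-word "sentences".
model = RNNLanguageModel(vocab_size=1000)
word_ids = torch.randint(0, 1000, (2, 5))
logits = model(word_ids)
next_word_probs = torch.softmax(logits, dim=-1)  # P(next word | words so far)
print(next_word_probs.shape)  # torch.Size([2, 5, 1000])
```

Training would minimize cross-entropy between the predicted distribution at each position and the word that actually comes next; the learned embeddings and hidden states are the emergent representations referred to above.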

Page 49: Sentence processing - Linguistic Society

Recurrent Neural Networks (RNNs)
• They’re a black box: we don’t understand their internal representations or why they work
  • This makes them harder to use for computational psycholinguistics
• If we understood them better, maybe they would tell us something about human language processing
  • We’ll return to this in the last week of the course

51

Page 50: Sentence processing - Linguistic Society

So far: Language models
• n-grams
• Probabilistic Context-Free Grammars (PCFGs)
• (Recurrent Neural Networks; RNNs)

How can we use these models to investigate:
• How do comprehenders rapidly disambiguate ambiguous sentences?
• What makes words and sentences easier/more difficult to process?

52
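As a pointer toward the next step (my illustration, not a slide): given any of these language models, we can compute how predictable each word is in its context. A common move in the psycholinguistic literature is to convert that probability to surprisal (negative log probability) and use it as a word-by-word predictor of reading times. The sketch below does this with the bigram estimates from the earlier "I am Sam" example.

```python
import math

# Bigram probabilities as estimated (by relative frequency) from the
# "I am Sam" corpus in the earlier n-gram sketch.
example_bigram_probs = {
    ("<s>", "I"): 2 / 3,
    ("I", "am"): 2 / 3,
    ("am", "Sam"): 1 / 2,
    ("Sam", "</s>"): 1 / 2,
}

def word_by_word_predictability(sentence, bigram_probs):
    """Return (word, P(word | previous word), surprisal in bits) for each word."""
    tokens = sentence.split()
    rows = []
    for prev, curr in zip(tokens, tokens[1:]):
        p = bigram_probs.get((prev, curr), 0.0)
        surprisal = -math.log2(p) if p > 0 else float("inf")
        rows.append((curr, p, surprisal))
    return rows

for word, p, s in word_by_word_predictability("<s> I am Sam </s>", example_bigram_probs):
    print(f"{word:>6}  P = {p:.3f}  surprisal = {s:.2f} bits")
```

Less predictable words get higher surprisal; the question the slides raise is how well such model-derived predictability tracks the reading-time and ERP effects described at the start.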