Learning Language from Distributional Evidence

Christopher Manning, Depts of CS and Linguistics

Stanford University

Workshop on Where Does Syntax Come From?, MIT, Oct 2007


There’s a lot to agree on! [C. D. Yang. 2004. UG, statistics or both? Trends in CogSci 8(10)]

Both endowment (priors/biases) and learning from experience contribute to language acquisition

“To be effective, a learning algorithm … must have an appropriate representation of the relevant … data.”

Languages, and hence models of language, have intricate structure, which must be modeled.


More points of agreement

Learning language structure requires priors or biases to work at all, and especially to work quickly

Yang is “in favor of probabilistic learning mechanisms that may well be domain-general”

I am too. Probabilistic methods (and especially Bayesian prior + likelihood methods) are perfect for this! Probabilistic models can achieve and explain:

gradual learning and robustness in acquisition
non-homogeneous grammars of individuals
gradual language change over time
[and also other stuff, like online processing]
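
One way to read “Bayesian prior + likelihood” concretely (notation mine, not from the talk): the learner weighs how well a candidate grammar fits the experienced data against how strongly the endowment favours that grammar in advance.

```latex
\underbrace{P(\mathrm{grammar} \mid \mathrm{data})}_{\text{posterior}}
\;\propto\;
\underbrace{P(\mathrm{data} \mid \mathrm{grammar})}_{\text{likelihood (experience)}}
\;\times\;
\underbrace{P(\mathrm{grammar})}_{\text{prior (endowment / bias)}}
```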


The disagreements are important, but two levels down

In his discussion of Saffran, Aslin, and Newport (1998), Yang contrasts “statistical learning” (of syllable bigram transition probabilities) with the use of UG principles such as one primary stress per word.

I would place both parts in the same probabilistic model
Stress is a cue for word learning, just as syllable transition probabilities are a cue (and stress is very subtle in speech!)
Probability theory is an effective means of combining multiple, often noisy, sources of information with prior beliefs
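
A minimal sketch of what “both parts in the same probabilistic model” could look like: a naive-Bayes-style combination of a syllable-transition cue and a stress cue when positing a word boundary. The cue names and all probabilities below are hypothetical, purely for illustration; this is not Yang’s or Saffran et al.’s actual model.

```python
# Combine two noisy cues to a word boundary in a single naive-Bayes-style posterior.
# All numbers are hypothetical placeholders.

PRIOR_BOUNDARY = 0.3
CUES = {
    # cue name: (P(cue observed | boundary), P(cue observed | no boundary))
    "low_transition_prob": (0.8, 0.25),
    "stress_on_next_syllable": (0.7, 0.2),
}

def p_boundary(observed_cues):
    """Posterior probability of a word boundary given which cues were observed."""
    num = PRIOR_BOUNDARY
    den = 1 - PRIOR_BOUNDARY
    for cue, (p_if_boundary, p_if_not) in CUES.items():
        if cue in observed_cues:
            num *= p_if_boundary
            den *= p_if_not
        else:
            num *= 1 - p_if_boundary
            den *= 1 - p_if_not
    return num / (num + den)

print(p_boundary({"low_transition_prob", "stress_on_next_syllable"}))  # both cues -> high
print(p_boundary(set()))                                               # neither -> low
```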

Yang keeps probabilities outside of the grammar, by suggesting that the child maintains a probability distribution over a collection of competing grammars

I would place the probabilities inside the grammar. It is more economical, explanatory, and effective.


The central questions

1. What representations are appropriate for human languages?

2. What biases are required to learn languages successfully? Linguistically informed biases – but perhaps fairly general ones are enough

3. How much of language structure can be acquired from the linguistic input? This gives a lower bound on how much is innate.


1. A mistaken meme: language as a homogeneous, discrete system

Joos (1950: 701–702): “Ordinary mathematical techniques fall mostly into two classes, the continuous (e.g., the infinitesimal calculus) and the discrete or discontinuous (e.g., finite group theory). Now it will turn out that the mathematics called ‘linguistics’ belongs to the second class. It does not even make any compromise with continuity as statistics does, or infinite-group theory. Linguistics is a quantum mechanics in the most extreme sense. All continuities, all possibilities of infinitesimal gradation, are shoved outside of linguistics in one direction or the other.”

[cf. Chambers 1995]


The quest for homogeneity

Bloch (1948: 7): “The totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker is an idiolect. … The phrase ‘with one other speaker’ is intended to exclude the possibility that an idiolect might embrace more than one style of speaking.”

Sapir (1921: 147): “Everyone knows that language is variable.”


Variation is everywhere

The definition of an idiolect fails, as variation occurs even internal to the usage of a speaker in one style.

As least as: “Black voters also turned out at least as well as they did in 1996, if not better in some regions, including the South, according to exit polls. Gore was doing as least as well among black voters as President Clinton did that year.” (Associated Press, 2000)


Linguistic Facts vs. Linguistic Theories

Weinreich, Labov and Herzog (1968) see 20th century linguistics as having gone astray by mistakenly searching for homogeneity in language, on the misguided assumption that only homogeneous systems can be structured

Probability theory provides a method for describing structure in variable systems!


The need for probability models inside syntax

The motivation comes from two sides:
Categorical linguistic theories claim too much: they place a hard categorical boundary of grammaticality where really there is a fuzzy edge, determined by many conflicting constraints and issues of conventionality vs. human creativity
Categorical linguistic theories explain too little: they say nothing at all about the soft constraints which explain how people choose to say things – something that language educators, computational NLP people, and historical linguists and sociolinguists dealing with real language usually want to know about


Clausal argument subcategorization frames

Problem: in context, language is used more flexibly than categorical constraints suggest:

E.g., most subcategorization frame ‘facts’ are wrong

Pollard and Sag (1994) inter alia [regard vs. consider]:
*We regard Kim to be an acceptable candidate
We regard Kim as an acceptable candidate

The New York Times:
As 70 to 80 percent of the cost of blood tests, like prescriptions, is paid for by the state, neither physicians nor patients regard expense to be a consideration.

Conservatives argue that the Bible regards homosexuality to be a sin.

And the same pattern repeats for many verbs and frames….


Probability mass functions: subcategorization of regard

[Figure: bar chart of the empirical distribution (0%–80%) over the frames __ NP as NP, __ NP NP, __ NP as AdjP, __ NP as PP, __ NP as VP[ing], __ NP VP[inf]]
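
A probability mass function like this is just normalized corpus counts; a minimal sketch, with hypothetical counts standing in for the real data behind the chart:

```python
# Estimate P(frame | verb) for 'regard' from corpus counts.
# The counts are hypothetical placeholders, not the actual data plotted above.
from collections import Counter

frame_counts = Counter({
    "__ NP as NP": 420,
    "__ NP NP": 30,
    "__ NP as AdjP": 50,
    "__ NP as PP": 20,
    "__ NP as VP[ing]": 40,
    "__ NP VP[inf]": 15,   # the 'ungrammatical' regard NP to-VP frame still occurs
})

total = sum(frame_counts.values())
pmf = {frame: count / total for frame, count in frame_counts.items()}

for frame, p in sorted(pmf.items(), key=lambda kv: -kv[1]):
    print(f"P({frame} | regard) = {p:.3f}")
```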


Bresnan and Nikitina (2003) on the Dative Alternation

Pinker (1981), Krifka (2001): verbs of instantaneous force allow dative alternation but not “verbs of continuous imparting of force” like push

“As Player A pushed him the chips, all hell broke loose at the table.”

www.cardplayer.com/?sec=afeature&art id=165

Pinker (1981), Levin (1993), Krifka (2001): verbs of instrument of communication allow dative shift but not verbs of manner of speaking

“Hi baby.” Wade says as he stretches. You just mumble him an answer. You were comfy on that soft leather couch. Besides …

www.nsyncbitches.com/thunder/fic/break.htm

In context such usages are unremarkable! It’s just the productivity and context-dependence of language.
Examples are rare → because these are gradient constraints.

Here data is gathered from a really huge corpus: the web.


The disappearing hard constraints of categorical grammars

We see the same thing over and over. Another example is constraints on Heavy NP Shift [cf. Wasow 2002]
You start with a strong categorical theory, which is mostly right.
People point out exceptions and counterexamples.
You either weaken it in the face of counterexamples, or you exclude the examples from consideration.
Either way you end up without an interesting theory.
There’s little point in aiming for explanatory adequacy when the descriptive adequacy of the representations used [as opposed to particular descriptions] just isn’t there.

There is insight in the probability distributions!


Explaining more: What do people say?

What people say has two parts:
Contingent facts about the world: people have been talking a lot about Iraq lately
The way speakers choose to express ideas using the resources of the language: people don’t often put that-clauses pre-verbally:

It appears almost certain that we will have to take a loss

That we will have to take a loss appears almost certain

The latter is properly part of people’s Knowledge of Language. Part of syntax.


Variation is part of competence [Labov 1972: 125]

“The variable rules themselves require at so many points the recognition of grammatical categories, of distinctions between grammatical boundaries, and are so closely interwoven with basic categorical rules, that it is hard to see what would be gained by extracting a grain of performance from this complex system. It is evident that [both the categorical and the variable rules proposed] are a part of the speaker’s knowledge of language.”


What do people say?

Simply delimiting a set of grammatical sentences provides only a very weak description of a language, and of the ways people choose to express ideas in it

Probability densities over sentences and sentence structures can give a much richer view of language structure and use

In particular, we find that (apparently) categorical constraints in one language often reappear as the same soft generalizations and tendencies in other languages [Givón 1979; Bresnan, Dingare, and Manning 2001]

Linguistic theory should be able to uniformly capture these constraints, rather than only recognizing them when they are categorical


Explaining more: what determines ditransitive vs. NP PP for dative verb

[Bresnan, Cueni, Nikitina, and Baayen 2005]
Build a mixed-effects [logistic regression] model over a corpus of examples
The model is able to pull apart the correlations between various predictive variables

Explanatory variables:
Discourse accessibility, definiteness, pronominality, animacy (Thompson 1990, Collins 1995)
Differential length in words of recipient and theme (Arnold et al. 2000, Wasow 2002, Szmrecsanyi 2004b)
Structural parallelism in dialogue (Weiner and Labov 1983, Bock 1986, Szmrecsanyi 2004a)
Number, person (Aissen 1999, 2003; Haspelmath 2004; Bresnan and Nikitina 2003)
Concreteness of theme
Broad semantic class of verb (transfer, prevent, communicate, …)


Explaining more: what determines ditransitive vs. NP PP for dative verb

[Bresnan, Cueni, Nikitina, and Baayen 2005] What does one learn?

Almost all the predictive variables have independent significant effects
Only a couple fail to: e.g., number of recipient
This shows that reductionist theories of the phenomenon that try to reduce things to one or two factors are wrong
The first object NP is preferred to be: given, animate, definite, pronominal, shorter
The model can predict whether to use a double object or NP PP construction correctly 94% of the time; it captures much of what is going on in this choice
These factors exceed in importance differences from individual variation
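
A much-simplified sketch of this kind of model. Bresnan et al. (2005) fit a mixed-effects logistic regression (with random effects, e.g. for verb) over annotated corpus examples; the toy below is a plain fixed-effects logistic regression with hand-made feature values, just to show the shape of the prediction problem (scikit-learn used for convenience).

```python
# Sketch: predict double-object (1) vs. NP-PP (0) realization from the kinds of
# predictors listed above. Toy, hand-made feature values; not the actual study.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [recipient_pronoun, recipient_animate, recipient_definite,
#            len(theme) - len(recipient), parallel_structure_in_context]
X = np.array([
    [1, 1, 1,  3, 1],   # e.g. "gave her the book she had asked about"
    [1, 1, 1,  1, 0],
    [0, 0, 1, -2, 0],   # e.g. "gave the idea to a committee"
    [0, 1, 0, -4, 0],
    [1, 1, 1,  2, 1],
    [0, 0, 0, -1, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = double object, 0 = NP PP

model = LogisticRegression().fit(X, y)
print(model.coef_)                                    # direction/strength of each predictor
print(model.predict_proba([[1, 1, 1, 2, 0]])[0, 1])   # P(double object) for a new case
```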


Statistical parsing models also give insight into processing [Levy 2004]

A realistic model of human sentence processing must explain:
Robustness to arbitrary input, accurate disambiguation
Inference on the basis of incomplete input [Tanenhaus et al. 1995, Altmann and Kamide 1999, Kaiser and Trueswell 2004]
Sentence processing difficulty is differential and localized

On the traditional view, resource limitations, especially memory, drive processing difficulty
Locality-driven processing [Gibson 1998, 2000]: multiple and/or more distant dependencies are harder to process

Processing: the reporter who attacked the senator (easy) vs. the reporter who the senator attacked (hard)


Expectation-driven processing [Hale 2001, Levy 2005]

Alternative paradigm: expectation-based models of syntactic processing
Expectations are weighted averages over probabilities
Structures we expect are easy to process
Modern computational linguistics techniques of statistical parsing give a precise psycholinguistic model

The model matches empirical results of many recent experiments better than traditional memory-limitation models
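
The standard quantitative form of this idea in the Hale/Levy line of work is surprisal: the processing difficulty at word w_i is taken to be proportional to how unexpected that word is given the preceding words, with the conditional probability computed from a probabilistic grammar (e.g., a PCFG) by summing over analyses of the prefix.

```latex
\mathrm{difficulty}(w_i) \;\propto\; \mathrm{surprisal}(w_i) \;=\; -\log P(w_i \mid w_1 \ldots w_{i-1})
```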


Example: Verb-final domains. Locality predictions and empirical results

[Konieczny 2000] looked at reading times at German final verbs

Locality-based models (Gibson 1998) predict difficulty for longer clauses

But Konieczny found that final verbs were read faster in longer clauses

Er hat die Gruppe geführt (‘He led the group’): locality prediction easy; result slow
Er hat die Gruppe auf den Berg geführt (‘He led the group to the mountain’): locality prediction hard; result fast
Er hat die Gruppe auf den sehr schönen Berg geführt (‘He led the group to the very beautiful mountain’): locality prediction hard; result fastest


Deriving Konieczny’s results

Once we’ve seen a PP goal we’re unlikely to see another
So the expectation of seeing anything else (in particular the final verb) goes up
Seeing more = having more information
More information = more accurate expectations

Rigorously tested: the conditional word probabilities p_i(w) are computed using a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)

[Figure: incremental parse of Er hat die Gruppe auf den Berg geführt; after Er hat, the NP die Gruppe, and the PP auf den Berg, the open predictions (PP-goal? PP-loc? ADVP? Verb?) concentrate on the final verb geführt]


Predictions from the Hale 2001/Levy 2004 model

[Figure: reading time at the final verb (roughly 450–520 ms) alongside negative log probability under the model (roughly 14.8–16.2) and locality-based difficulty (ordinal 1–3), for Er hat die Gruppe (auf den (sehr schönen) Berg) geführt in the No PP, Short PP, and Long PP conditions]

Locality-based models (e.g., Gibson 1998, 2000) would violate monotonicity


2. & 3. Learning sentence structure from distributional evidence

Start with raw language, learn syntactic structure

Some have argued that learning syntax from positive data alone is impossible:
Gold, 1967: non-identifiability in the limit
Chomsky, 1980: the poverty of the stimulus

Many others have felt it should be possible:
Lari and Young, 1990; Carroll and Charniak, 1992; Alex Clark, 2001; Mark Paskin, 2001

… but it is a hard problem


Language learning idea 1: Lexical affinity models

Words select other words on syntactic grounds

Link up pairs with high mutual information
[Yuret, 1998]: greedy linkage
[Paskin, 2001]: iterative re-estimation with EM

Evaluation: compare linked pairs (undirected) to a gold standard

congress narrowly passed the amended bill

Method: Accuracy
Random: 41.7
Paskin, 2001: 39.7
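
A minimal sketch of the lexical-affinity idea: estimate pointwise mutual information from co-occurrence counts and rank candidate (undirected) links within a sentence. The toy corpus is hypothetical, and Yuret’s and Paskin’s actual systems differ in detail.

```python
# Score word pairs by pointwise mutual information (PMI) estimated from a tiny,
# hypothetical corpus, then rank candidate links within one sentence.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "congress narrowly passed the amended bill".split(),
    "congress passed the bill".split(),
    "the senate narrowly passed the budget".split(),
]

word_counts = Counter(w for sent in corpus for w in sent)
pair_counts = Counter(frozenset(p) for sent in corpus
                      for p in combinations(sent, 2) if len(set(p)) == 2)
n_words = sum(word_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    p_xy = pair_counts[frozenset((w1, w2))] / n_pairs
    if p_xy == 0:
        return float("-inf")
    p_x, p_y = word_counts[w1] / n_words, word_counts[w2] / n_words
    return math.log(p_xy / (p_x * p_y))

sentence = "congress narrowly passed the amended bill".split()
ranked = sorted(combinations(sentence, 2), key=lambda p: -pmi(*p))
print(ranked[:5])   # highest-affinity candidate links (undirected)
```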


Idea: Word Classes

Mutual information between words does not necessarily indicate syntactic selection.
Individual words like brushbacks are entwined with semantic facts about the world.
Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics.
We could build dependency models over word classes. [cf. Carroll and Charniak, 1992]

NOUN ADVERB VERB DET PARTICIPLE NOUN
congress narrowly passed the amended bill

expect brushbacks but no beanballs


Problems: Word Class Models

Issues:
Too simple a model – it doesn’t work much better even when trained supervised
No representation of valence/distance of arguments

NOUN NOUN VERB
stock prices fell

Method: Accuracy
Random: 41.7
Carroll and Charniak, 92: 44.7
Adjacent Words: 53.2

congress narrowly passed the amended bill


Bias: Using better dependency representations [Klein and Manning 2004]

Local factor for attaching argument a to head h (c(·) = word class, d = distance/valence):
Paskin 01: P(a | h)
Carroll & Charniak 92: P(c(a) | c(h))
Klein/Manning (DMV): P(c(a) | c(h), d)

Accuracy:
Adjacent Words: 55.9
Klein/Manning (DMV): 63.6


Idea: Can we learn phrase structure constituency via distributional clustering?

the president said that the downturn was over

president: the __ of; the __ said
governor: the __ of; the __ appointed
said: sources __; president __ that
reported: sources __

Clusters: {president, governor}, {said, reported}, {the, a}

[Finch and Chater 92, Schütze 93, many others]
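
A minimal sketch of this kind of distributional clustering: represent each word by counts of its immediate left and right neighbours and cluster the resulting vectors. The toy corpus and the choice of k-means are illustrative, not the method of any particular cited paper.

```python
# Cluster words by their left/right neighbour distributions.
from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "the president said that the downturn was over".split(),
    "the governor said that the budget was approved".split(),
    "the president appointed a commission".split(),
    "a governor reported that the deficit was growing".split(),
    "sources said the president reported progress".split(),
]

contexts = defaultdict(Counter)
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    for i in range(1, len(padded) - 1):
        contexts[padded[i]]["L=" + padded[i - 1]] += 1
        contexts[padded[i]]["R=" + padded[i + 1]] += 1

words = sorted(contexts)
features = sorted({f for c in contexts.values() for f in c})
X = np.array([[contexts[w][f] for f in features] for w in words], dtype=float)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
clusters = defaultdict(list)
for word, label in zip(words, labels):
    clusters[label].append(word)
print(dict(clusters))   # e.g., {president, governor} and {said, reported} tend to group
```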


Distributional learning

There is much debate in the child language acquisition literature about distributional learning and whether kids can successfully use it:
[Maratsos & Chalkley 1980] suggest it
[Pinker 1984] suggests it’s impossible (too many correlations; the properties needed are too abstract)
[Redington et al. 1998] say it succeeds because there are dominant cues relevant to language
[Mintz et al. 2002] look at the distributional structure of the input

Speaking in terms of engineering, let me just tell you that it works really well!
It’s one of the most successful techniques that we have – my group uses it everywhere for NLP.


Idea: Distributional Syntax? [Klein and Manning NIPS 2003]

Can we use distributional clustering for learning syntax?

factory payrolls fell in september, parsed as (S (NP factory payrolls) (VP fell (PP in september)))

Span: fell in september / Context: payrolls __
Span: payrolls fell in / Context: factory __ sept


Problem: Identifying Constituents

Distributional classes are easy to find… but figuring out which are constituents is hard.

[Figure: sequence clusters plotted by principal components 1 and 2, with constituents (+) and non-constituents (−) falling into the same distributional classes. Examples: NP-like: the final vote, two decades, most people (+) vs. the final, the initial, two of the (−); PP-like: in the end, on time, for now (+) vs. of the, with a, without many (−); VP-like: decided to, took most of, go with]


A Nested Distributional Model: Constituent-Context Model (CCM)

P(S|T) is a product, over every span of the sentence, of a yield term and a context term, each conditioned on whether the bracketing T makes that span a constituent (+) or a distituent (−).

For factory payrolls fell in september, the constituent spans of the parse contribute factors such as:
P(factory payrolls fell in september | +) P(__ | +)
P(factory payrolls | +) P(__ fell | +)
P(fell in september | +) P(payrolls __ | +)
P(in september | +) P(fell __ | +)
and all remaining spans contribute corresponding factors conditioned on −.
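
A minimal sketch of the statistics the CCM is built on: the (yield, context) pair of every span, split by whether a given bracketing treats the span as a constituent. The bracketing here is supplied by hand purely for illustration; in the model itself it is latent and estimated with EM.

```python
# Enumerate the (yield, context) pair of every multi-word span, tagged '+' if the
# span is a constituent under a given bracketing and '-' otherwise.
from collections import defaultdict

sentence = "factory payrolls fell in september".split()
# Constituent spans (i, j) under (S (NP factory payrolls) (VP fell (PP in september)))
constituents = {(0, 5), (0, 2), (2, 5), (3, 5)}

stats = defaultdict(list)
n = len(sentence)
for i in range(n):
    for j in range(i + 2, n + 1):          # spans of length >= 2
        label = "+" if (i, j) in constituents else "-"
        span_yield = " ".join(sentence[i:j])
        left = sentence[i - 1] if i > 0 else ""
        right = sentence[j] if j < n else ""
        stats[label].append((span_yield, f"{left} __ {right}"))

for label, pairs in stats.items():
    print(label)
    for y, c in pairs:
        print(f"  yield: {y:35s} context: {c}")
```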


Initialization: A little UG?
Tree Uniform
Split Uniform


Results: Constituency

van Zaanen, 00: 35.6
Right-Branch: 70.0
Our Model (CCM): 81.6

[Figure: an example Treebank parse alongside the corresponding CCM parse]


Combining the two models [Klein and Manning ACL 2004]

Constituency evaluation:
Random: 39.4
CCM: 81.0
CCM + DMV: 88.0
(Supervised PCFG constituency recall is at 92.8)

Dependency evaluation:
Random: 45.6
DMV: 62.7
CCM + DMV: 64.7

Qualitative improvements: subject-verb groups gone, modifier placement improved


Beyond surface syntax… [Levy and Manning, ACL 2004]

Useful features for recovering where a dislocated element originates:
Syntactic category, parent, grandparent (subj vs obj extraction; VP finiteness)
Syntactic path (Gildea & Jurafsky 2002): <SBAR,S,VP,S,VP>
Presence of daughters (NP under S)
Head words (wanted vs to vs eat)

Plus: feature conjunctions, specialized features for expletive subject dislocations, passivizations, passing featural information properly through coordinations, etc., etc.

cf. Campbell (2004 ACL) – a lot of linguistic knowledge


Evaluation on dependency metric: gold-standard input trees

[Figure: accuracy (F1), on a scale of roughly 70–100, for Overall, NP, S, VP, ADJP, SBAR, and ADVP, comparing Levy & Manning (gold input trees) against context-free gold-tree dependencies]


Might we also learn a linking to argument structure distributionally? Why it might be possible:

Instances of open:
She opened the door with a key.
Fortunately, the corkscrew opened the bottle.
The bottle opened easily.
The door opened as they approached.
This key opens the door of the cottage.
She opened the bottle very carefully.
He opened the bottle with a corkscrew.
He opened the door.

Word distributions:
agent: she, he
patient: door, bottle
instrument: key, corkscrew

Allowed linkings:
{ agent=subject, patient=object }
{ patient=subject }
{ agent=subject, patient=object, instrument=obl_with }
{ instrument=subject, patient=object }


Probabilistic Model Learning [Grenager and Manning 2006 EMNLP]

[Figure: graphical model for an instance of give. The verb v generates a linking l, e.g. { 0=subj, 1=obj2, 2=obj1 }, realized as an ordered list of syntactic positions o, e.g. [ 0=subj, M=np, 1=obj2, 2=obj1 ]; each position then generates a relation r and a head word w, here subj/0: plunge, np/M: today, obj1/2: them, obj2/1: test]

Given a set of observed verb instances, what are the most likely model parameters?
Use unsupervised learning in a structured probabilistic graphical model

A good application for EM!
M-step: trivial computation
E-step: we compute conditional distributions over possible role vectors for each instance
And we repeat
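
A minimal EM sketch in this spirit: a heavily simplified stand-in, not the Grenager and Manning model (no ordering model; candidate linkings are given by hand; data is toy). The latent variable for each instance is which linking produced it; the E-step computes a posterior over linkings, and the M-step renormalizes the expected counts.

```python
# EM for a toy linking model: P(linking) and P(word | role) for one verb.
from collections import defaultdict

# Candidate linkings for a ditransitive-like verb: syntactic position -> role id
LINKINGS = [
    {"subj": 0, "obj1": 2, "obj2": 1},   # double object: gave [them]_2 [a test]_1
    {"subj": 0, "obj1": 1, "to": 2},     # NP + to-PP:   gave [a test]_1 to [them]_2
    {"subj": 0, "obj1": 1},              # simple transitive
]

# Toy observed instances: syntactic position -> head word
DATA = [
    {"subj": "he", "obj1": "them", "obj2": "test"},
    {"subj": "they", "obj1": "answer", "to": "him"},
    {"subj": "she", "obj1": "book"},
    {"subj": "he", "obj1": "it", "obj2": "chance"},
]

def em(data, linkings, iters=20, smooth=0.1):
    vocab = sorted({w for inst in data for w in inst.values()})
    roles = sorted({r for l in linkings for r in l.values()})
    p_link = [1.0 / len(linkings)] * len(linkings)
    p_word = {r: {w: 1.0 / len(vocab) for w in vocab} for r in roles}

    for _ in range(iters):
        link_counts = [smooth] * len(linkings)
        word_counts = {r: defaultdict(lambda: smooth) for r in roles}

        # E-step: posterior over linkings (i.e. over role vectors) for each instance
        for inst in data:
            scores = []
            for linking in linkings:
                if set(linking) != set(inst):   # linking must cover the observed positions
                    scores.append(0.0)
                    continue
                s = p_link[linkings.index(linking)]
                for pos, word in inst.items():
                    s *= p_word[linking[pos]][word]
                scores.append(s)
            z = sum(scores) or 1.0
            for k, linking in enumerate(linkings):
                post = scores[k] / z
                link_counts[k] += post
                if post > 0.0:
                    for pos, word in inst.items():
                        word_counts[linking[pos]][word] += post

        # M-step: trivially renormalize the expected counts
        total = sum(link_counts)
        p_link = [c / total for c in link_counts]
        for r in roles:
            z = sum(word_counts[r][w] for w in vocab)
            p_word[r] = {w: word_counts[r][w] / z for w in vocab}
    return p_link, p_word

p_link, p_word = em(DATA, LINKINGS)
for linking, p in zip(LINKINGS, p_link):
    print(f"P({linking}) = {p:.2f}")
```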


Semantic role induction results

Verb: give
Linkings: {0=subj, 1=obj2, 2=obj1} 0.46; {0=subj, 1=obj1, 2=to} …; {0=subj, 1=obj1} 0.19; …
Roles: 0: it, he, bill, they, that, …; 1: power, right, stake, …; 2: them, it, him, dept., …

Verb: pay
Linkings: {0=subj, 1=obj1} 0.32; {0=subj, 1=obj1, 2=for} …; {0=subj} 0.21; {0=subj, 1=obj1, 2=to} …; {0=subj, 1=obj2, 2=obj1} …
Roles: 0: it, they, company, he, …; 1: $, bill, price, tax, …; 2: stake, gov., share, amt., …

The model achieves some traction, but it’s hard
Learning becomes harder with greater abstraction
This is the right research frontier to explore!


Conclusions

Probabilistic models give precise descriptions of a variable, uncertain world

There are many phenomena in syntax that cry out for non-categorical or probabilistic representations of language

Probabilistic models can – and should – be used over rich linguistic representations

They support effective learning and processing
Language learning does require biases or priors
But a lot more can be learned from modest amounts of input than people have thought
There’s not much evidence of a poverty of the stimulus preventing such models from being used in acquisition.