An introduction to computational psycholinguistics:
Modeling human sentence processing
Shravan Vasishth
University of Potsdam, Germany
http://www.ling.uni-potsdam.de/∼vasishth
September 2005, Bochum
Neural structure
1
A model of the neuron
2
Activation functions for translating net input to activation
3
A model of layered neural connections
4
Five assumptions
• Neurons integrate information
• Neurons pass information about the level of their input.
• Brain structure is layered.
• The influence of one neuron on another depends on the strength of the connection
between them.
• Learning is achieved by changing the strengths of the connections between neurons.
5
The computations
Net input to unit i from units j = 1 . . . n, each with activation a_j, and with the weight of the connection from j to i being w_ij:

Netinput_i = Σ_{j=1}^{n} a_j w_ij (1)
Activation a_i of unit i, with f an activation function from inputs to activation values:

a_i = f(netinput_i) (2)
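These two computations can be sketched in a few lines of Python. The logistic function here is just one common choice for f; the slides leave f unspecified, and the example activations and weights are illustrative:

```python
import math

def netinput(a, w):
    # Net input to unit i: sum over incoming units j of a_j * w_ij
    return sum(a_j * w_ij for a_j, w_ij in zip(a, w))

def logistic(x):
    # One common activation function f mapping net input to activation
    return 1.0 / (1.0 + math.exp(-x))

a = [0.5, 1.0]    # activations a_j of the sending units (illustrative)
w = [0.2, -0.4]   # weights w_ij into unit i (illustrative)
a_i = logistic(netinput(a, w))
```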
6
Learning by weight change
Netinput_i = Σ_{j=1}^{n} a_j w_ij (3)

a_i = f(netinput_i) (4)
• Notice that the activity of i, ai, is a function of the weights wij and the activations
aj. So changing wij will change ai.
• In order for this simple network to do something useful, for a given set of input
activations aj, it should output some particular value for ai. Example: computing the
logical AND function.
7
The AND network: the single-layered perceptron
Assume a threshold activation function: if netinput_i is greater than 0, output a 1. The bias is −1.5.

Netinput = 0 × 1 + 0 × 1 − 1.5 = −1.5 (5)
Netinput = 0 × 1 + 1 × 1 − 1.5 = −0.5 (6)
Netinput = 1 × 1 + 0 × 1 − 1.5 = −0.5 (7)
Netinput = 1 × 1 + 1 × 1 − 1.5 = +0.5 (8)
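A minimal Python sketch of this AND unit, with weights 1, bias −1.5, and the threshold activation function:

```python
def and_unit(x1, x2, w1=1.0, w2=1.0, bias=-1.5):
    # Threshold activation: output 1 if net input exceeds 0, else 0
    net = x1 * w1 + x2 * w2 + bias
    return 1 if net > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_unit(x1, x2))
```

Only the (1, 1) input pushes the net input above threshold, reproducing equations (5)-(8).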
8
How do we decide what the weights are?
Let the weights be w_j = 0.5. Now the same network fails to compute AND.

Netinput = 0 × 0.5 + 0 × 0.5 − 1.5 = −1.5 (9)
Netinput = 0 × 0.5 + 1 × 0.5 − 1.5 = −1 (10)
Netinput = 1 × 0.5 + 0 × 0.5 − 1.5 = −1 (11)
Netinput = 1 × 0.5 + 1 × 0.5 − 1.5 = −0.5 (12)
9
The Delta rule for changing weights to get the desired output
We can repeatedly cycle through the simple network and adjust the weights so that
we achieved the desired ai. Here’s a rule for doing this:
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (13)
ε: learning rate parameter (determines how large the change will be on each learning
trial)
This is, in effect, a process of learning.
10
How the delta rule fixes the weights in the AND network
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (14)

Let a_i(desired) = 1; ε = 0.5. Consider now the activations we get:
Netinput = 0 × 0.5 + 0 × 0.5 − 1.5 = −1.5 (15)
Netinput = 0 × 0.5 + 1 × 0.5 − 1.5 = −1 (16)
Netinput = 1 × 0.5 + 0 × 0.5 − 1.5 = −1 (17)
Netinput = 1 × 0.5 + 1 × 0.5 − 1.5 = −0.5 ⇐ (18)

We don't need to mess with the first three, since there we already have a desired value (a net input less than zero). Look at the last one. Say the desired a_i = 1.
11
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (19)
= [1 − (−0.5)] × 1 × 0.5 (20)
= 0.75 (21)

We just performed what's called a training sweep.
Sweep: the presentation of a single input pattern causing activation to propagate
through the network and the appropriate weight adjustments to be carried out.
Epoch: One cycle of showing all the inputs in turn.
Now if we recompute the netinput with the incremented weights, our network starts
to behave as intended:
Netinput = 1 × 1.25 + 1 × 1.25 − 1.5 = 1 (22)
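The whole procedure can be sketched in Python. Following the worked example on these slides, the "obtained" activation in the delta rule is taken to be the net input, and a weight is changed only when the thresholded output is wrong (this is a sketch of the idea, not tlearn's exact procedure):

```python
# AND training data: ((x1, x2), desired output)
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.5, 0.5]      # initial weights, as on the previous slide
bias = -1.5
eps = 0.5           # learning rate

for epoch in range(5):                     # a few epochs suffice here
    for (x1, x2), desired in patterns:     # one sweep per pattern
        net = x1 * w[0] + x2 * w[1] + bias
        obtained = 1 if net > 0 else 0
        if obtained != desired:
            # Delta rule: change each weight by (desired - obtained) * a_j * eps
            for j, a_j in enumerate((x1, x2)):
                w[j] += (desired - net) * a_j * eps

print(w)  # -> [1.25, 1.25], the corrected weights from equation (22)
```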
12
Rationale for the delta rule
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (23)

• If the obtained activity is too low, then [a_i(desired) − a_i(obtained)] > 0. This increases the weight.
• If the obtained activity is too high, then [a_i(desired) − a_i(obtained)] < 0. This decreases the weight.
• For any input unit j, the greater its activation a_j, the greater its influence on the weight change. The delta rule concentrates the weight change on units with high activity, because these are the most influential in determining the (incorrect) output.
There are other rules one can use. This is just an example of one of them.
13
Let’s do some simulation with tlearn
DEMO: Steps
• Create a network with 2 input nodes and 1 output node (plus a bias node with a fixed
output)
• Create a data file, and a teacher: the data file is the input and the teacher is the output
you want the network to learn to produce.
Input Output
0 0 0
1 0 0
0 1 0
1 1 1
• Creating the network
14
The AND network’s configuration
NODES:#define the nodes
nodes = 1 # number of units (excluding input units)
inputs = 2 # number of input nodes
outputs = 1 # number of output nodes
output node is 1 #always start counting output nodes from 1
CONNECTIONS:
groups = 0 # how many groups of connections must have same value?
1 from i1-i2 #connections
1 from 0 #bias node is always numbered 0, it outputs 1
SPECIAL:
selected = 1 # units selected for printing out the output of
weight_limit = 1.0 # causes initial weights to be +/-0.5
15
Training the network
• Set the training sweeps, the learning rate (how fast weights change), the momentum (how
similar the weight change is from one cycle to the next; this helps avoid getting stuck in local minima),
the random seed (for the initial random weights), and the training method (random or sequential).
• We can compute the error in any one case: desired − actual.
• How to evaluate (quantify) the performance of the network as a whole? Note that the
network will give four different actual activations in response to the four input pairs.
We need some notion of average error.
• Suggestions?
16
Average error: Root mean square
Root mean square error = √( Σ_k (t_k − o_k)² / n )

where t_k is the target output and o_k the obtained output for pattern k, and n is the number of patterns.
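A sketch of this error measure in Python, applied to the four AND patterns (a network that always answers 0 is used as the illustrative comparison):

```python
import math

def rms_error(targets, outputs):
    # Square root of the mean squared difference t_k - o_k
    n = len(targets)
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n)

# AND targets vs. the outputs of a network that always answers 0
print(rms_error([0, 0, 0, 1], [0, 0, 0, 0]))  # -> 0.5
```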
17
Exercise: Learning OR
• Build a network that can recognize logical OR and then XOR.
• Are these two networks also able to learn using the procedure we used for AND?
Readings for tomorrow: Elman 1990, 1991, 1993. Just skim them.
18
How to make the network predict what will come next?
Key issue: if we want the network to predict, we need to have a notion of time, of
now and now+1.
Any suggestions?
19
Simple recurrent networks
• Idea: Use recurrent connections to provide the network with a dynamic memory.
• The hidden node activation at time step t − 1 will be fed right back to the hidden
nodes at time t (the regular input nodes will also be providing input to the hidden
nodes). Context nodes serve to tell the network what came earlier in time.
[Figure: SRN architecture. Input x(t) and the context units (a copy of hidden z(t−1)) both feed the hidden layer z(t), which feeds the output y(t).]
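One time step of this architecture can be sketched in Python. The layer sizes and random weights here are illustrative; the essential point is that the hidden state z(t−1) is copied back in as context:

```python
import math, random

random.seed(1)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Weighted sums: one net input per receiving unit
    return [sum(w * x for w, x in zip(row, v)) for row in W]

n_in, n_hid, n_out = 3, 4, 3   # illustrative sizes
rand = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
W_ih = rand(n_hid, n_in)    # input   -> hidden
W_ch = rand(n_hid, n_hid)   # context -> hidden (the recurrent part)
W_ho = rand(n_out, n_hid)   # hidden  -> output

def srn_step(x_t, context):
    # The hidden activation depends on the current input AND the copied
    # hidden state from the previous time step (the context units)
    z_t = [logistic(h + c)
           for h, c in zip(matvec(W_ih, x_t), matvec(W_ch, context))]
    y_t = [logistic(o) for o in matvec(W_ho, z_t)]
    return y_t, z_t            # z_t becomes the next step's context

context = [0.0] * n_hid
for x_t in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    y_t, context = srn_step(x_t, context)
```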
20
Let’s take up the demos
Your printouts contain copies of Chapters 8 and 12 of Exercises in rethinking
innateness, Plunkett and Elman.
21
Elman 1990
22
Christiansen and Chater on recursion
• Chomsky showed that natural language grammars exhibit recursion, and that this rules
out finite state machines as models of language.
• According to Chomsky, this entails that language is innate: the child's language
exposure involves so few recursive structures that recursion could not possibly be learned
from experience.
• C(hristiansen+Chater): if connectionist models can reflect the limits on our ability to
process recursion, they constitute a performance model
• C notes a broader issue: symbolic rules apply without limit (infinitely), but in
real life we observe (through experiments) limits on processing ability. The reason for
this boundedness of processing falls out of the hardware's (wetware's) architecture.
• C proceeds to demonstrate that human constraints on processing recursion fall out
from the architecture of simple recurrent networks
23
Three kinds of recursion (acc. to Chomsky)
1. Counting recursion: a^n b^n
2. Cross-serial embeddings: a^n b^m c^n d^m
3. Center embeddings: a^n b^m c^m d^n
4. (Baseline: right-branching)
24
Benchmark: n-gram models
• In order to compare their results with an alternative frequency-based method of
computing predictability, they looked at the predictions made by 2- and 3-gram models.
• Diplomarbeit topic: try to find a better probabilistic parsing measure of predicting the
next word, compared to the SRN baseline. In John Hale’s work we will see an example
(though the goal of that work is different from the present discussion).
25
Three distinct languages
All languages contain only nouns (Ns) and verbs (Vs), both singular and plural:
• L1: a_N a_N b_V b_V (ignores agreement)
• L2: a_N b_V b_V a_N (respects agreement)
• L3: a_N b_N a_V b_V (respects agreement)
• Each language also had right-branching structures: a_N a_V b_N b_V (respects agreement)
26
Method
27
Method
Trained the network to make grammaticality judgements:
• 16 words in vocabulary: 4 of each of sing.Ns, plu.Ns, sing.Vs, plu.Vs
• Corpora: 5,000 variable length sentences
• SRN architecture: 17 input and output nodes (one node for End of Sentence marker),
2-100 Hidden Units
• Test corpora: 500 novel sentences
• Teaching the network: Compare outputs with estimates of true conditional probabilities
given prior context.
28
Determining the intended output of the network
Let c_i be a category (sing.N, plu.N, sing.V, plu.V).

P(c_p | c_1, c_2, . . . , c_{p−1}) = freq(c_1, c_2, . . . , c_{p−1}, c_p) / freq(c_1, c_2, . . . , c_{p−1}) (24)

Let w_n be a word of category c_p, and C_p the number of items in that category.

P(w_n | c_1, c_2, . . . , c_{p−1}) = freq(c_1, c_2, . . . , c_{p−1}, c_p) / (freq(c_1, c_2, . . . , c_{p−1}) × C_p) (25)
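A sketch of how the estimate in (24) could be computed from a corpus of category sequences (`conditional_prob` is a hypothetical helper, not C&C's code); dividing by the category size C_p, as in (25), then spreads that probability evenly over the words of the category:

```python
def conditional_prob(corpus, context, next_cat):
    # P(c_p | c_1 ... c_{p-1}) = freq(context + next_cat) / freq(context),
    # where freq counts occurrences of a category subsequence in the corpus
    def freq(seq):
        return sum(1 for s in corpus
                   for i in range(len(s) - len(seq) + 1)
                   if tuple(s[i:i + len(seq)]) == seq)
    ctx = tuple(context)
    return freq(ctx + (next_cat,)) / freq(ctx)

# Toy corpus of category sequences (illustrative)
corpus = [["singN", "singV"], ["pluN", "pluV"], ["singN", "singV"]]
print(conditional_prob(corpus, ["singN"], "singV"))  # -> 1.0
```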
29
Determining performance error
Squared error:

Σ_{j∈W} (out_j − P(w_n = j))² (26)

W is the set of words in the language (including the EOS marker).

Σ_{j∈W} (out_j − P(w_n = j))² / |W| (27)
Mean square error is used as a measure of average error over the corpus.
30
Computing grammaticality judgments
The Grammatical Prediction Error (GPE) for predicting a particular word:

GPE = 1 − hits / (hits + false alarms + misses) (28)
31
Computing hits, false alarms, and misses (simplified version)

Let G be the set of activated units that are grammatical (predictions), and U the set of ungrammatical ones.

hits H = Σ_{i∈G} u_i (29)

false alarms F = Σ_{i∈U} u_i (30)

Let m_i be the amount of activation by which a unit i fell short of a target activation.

misses M = Σ_{i∈G} m_i (31)
32
So, the bigger the m_i for any unit, the bigger the miss M.

t_i = f_i / Σ_{j∈G} f_j (32)

m_i = 0 if t_i − u_i ≤ 0, and t_i − u_i otherwise (33)
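The definitions in (28)-(31) and (33) can be sketched as a small Python function; this is a simplified reading, matching the simplified version above, with illustrative activations:

```python
def gpe(u, t, grammatical):
    # u: output activation u_i per unit; t: target activation t_i for
    # grammatical units; grammatical: the set G of units that are
    # grammatical continuations of the context
    hits = sum(u_i for i, u_i in u.items() if i in grammatical)
    false_alarms = sum(u_i for i, u_i in u.items() if i not in grammatical)
    misses = sum(max(0.0, t[i] - u[i]) for i in grammatical)  # eq. (33)
    return 1.0 - hits / (hits + false_alarms + misses)        # eq. (28)

u = {"singV": 0.5, "pluV": 0.3, "singN": 0.2}   # illustrative activations
t = {"singV": 0.6, "pluV": 0.4}                 # illustrative targets
print(round(gpe(u, t, grammatical={"singV", "pluV"}), 3))  # -> 0.333
```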
33
Results
• Above 10-15 units the performance is not affected.
• Counting recursion easiest, then cross-dependencies, and then center embeddings.
• By comparison, the bi-gram and tri-gram models found center embeddings easier than
cross-dependencies.
• Strange result: SRNs with at least 10 HUs had a lower MSE for center embeddings
than right branchings—once the first verb is encountered, the next is more predictable
in center embeddings.
• Above 15 hidden units or so, the number of HUs does not matter.
34
Results: modeling Bach et al. 1986
35
Results: Modeling Right branching embedding
36
Results: Modeling grammatical but unacceptable embeddings
Aside 37
Reconsidering the grammaticality/ungrammaticality distinction
The ACT-R model predicts associative retrieval interference at the first verb:
(1) a. [NP1 [NP2 [NP3 VP3] VP2] VP1]
# [The patient [who the nurse [who the clinic had hired] admitted] met Jack.]
b. [NP1 [NP2 [NP3 VP3] . . . ] VP1]
?[The patient [who the nurse [who the clinic had hired] . . . ] met Jack.]
(2) a. [NP1+ [NP2+ [NP3+ VP3] VP2] VP1]
The carpenter who the craftsman that the peasant had carried to the bus-stop had hurt
yesterday supervised the apprentice.
b. [NP1+ [NP2- [NP3+ VP3] VP2] VP1]
The carpenter who the pillar that the peasant had carried to the bus-stop had hurt yesterday supervised the apprentice.
Aside 38
Self-paced reading results: English (Suckow et al. 2005)
[Figure: Mean reading times (msecs, roughly 400-1000) by sentence position (Det N1 who Det N2 that Det N3 had V3 to the NP had V2 Adv V1 the NP) for four conditions: Grammatical/Ungrammatical × Similar/Dissimilar NPs. Regions are annotated for encoding interference, storage and retrieval interference, and (un-)grammaticality.]
Aside 39
Self-paced reading results: German (Suckow et al. 2005)
[Figure: Mean reading times (msecs, roughly 1000-2000) by sentence position (NP1 who NP2 who NP3 V3 V2 V1 NP) for four conditions: Grammatical/Ungrammatical × Similar/Dissimilar NPs. Regions are annotated for encoding interference, storage and retrieval interference, and (un-)grammaticality.]
40
Summary of some possible research projects for you
1. Making “experience” more realistic: The corpora do not reflect frequencies of occurrence of center embeddings and/or center-embedding types. The technology for computing these is now available (e.g., Korthals, 2001; Kurz, 2000). Modify the training corpus frequencies to reflect our best possible estimate of these frequencies and then examine the performance of the networks.
2. Probabilistic models of processing versus SRNs: C compare their model’s performance against simple n-gram models. Replicate their experimental results, and try to find a measure using probabilistic methods that can outperform their system or do as well.
3. Extending empirical coverage: modeling Suckow et al.’s data: Without messing with the architecture, carry out simulations to model the interference effects and the verb reading pattern asymmetries for English and German found by (Suckow, Vasishth, & Lewis, 2005). The poster handout of this paper is available from my home page. The simulations should be able to account for all the data covered by C as well as the new data (i.e., regression testing is necessary for a meaningful step forward).
4. The Chinese and German relative clause puzzle: (Konieczny & Muller, 2004) showed that word order may not be able to explain the German S/O relative clause facts: in German, the verb is final in both S and O relatives, and yet the ORs are harder. In Chinese (Hsiao & Gibson, 2003), SRs have the form [t_i V O] S_i while ORs are [S V t_i] O_i (which is also the canonical order in Chinese). In Chinese, SRs are harder than ORs. An explanation for this is the locality of the head’s arguments. SRNs should be able to explain this in terms of experience (word order). How can these two opposing results be reconciled in an SRN architecture?
5. Extending the coverage with Japanese double embeddings: Will the network perform well with 5-clause NPs and 2 verbs (ditransitive plus transitive)? We’d probably have to add HUs. What does this mean? Relevant empirical work: (Lewis & Nakayama, 2001) and references cited there.
41
Reassessing working memory: MacDonald and Christiansen 2002
Background:
• Just and Carpenter (1992) developed a working memory capacity theory of
comprehension which posits a linguistic working memory functionally separated from
the representation of linguistic knowledge.
• Waters and Caplan (1996) argued against J&C and in favor of two differentiated working
memories for language. One is dedicated to “obligatory” unconscious psycholinguistic
processes, and another to controlled, verbally mediated tasks.
• MC present a model where processing capacity emerges as a function of the architecture
and experience. Capacity is not a primitive but a dependent variable.
“In all cases, the choice of processing architecture has a direct effect on the claims
for the role of working memory in language comprehension” (p. 36)
42
The Just and Carpenter capacity theory
• Declarative knowledge and production rules are stored in permanent memory.
• A separate working memory space is used to process and store current input and partial
products of ongoing computations. (cf. the ACT-R architecture).
• In one processing cycle, several productions can fire in parallel. (cf. ACT-R architecture).
• Activation decay occurs in working memory. Working memory capacity and aphasia are
modeled by a maximum activation parameter.
• Note that there is a clear distinction between working memory resources and knowledge
of language.
• “It is not clear how one could readily incorporate a comprehensive role for experience
into the 3CAPS framework.”
43
The connectionist approach
• The network is the memory.
• Others have modeled capacity by varying the number of HUs, the amount of training
the network gets, the efficiency of information passing between units, and the amount of
“noise” in the input signal.
• In this paper, MC use experience as the factor determining processing ability.
44
An empirical starting point
(3) a. The reporter that the senator attacked admitted the error.
b. The reporter that attacked the senator admitted the error.
Main results (Just and Carpenter 1992):
Daneman and Carpenter’s span task: read some sentences aloud, and then try to recall
the last word of each sentence. The number of sentences is gradually increased. We
differentiate between high- and low-span subjects below.
• Averaging over RC type, shorter RTs for high-span subjects at main verb.
• Averaging over span types, shorter RT at the main verb in subject RCs.
• Span × RC type: in SRCs no effect of span, but in ORCs high-spans have less difficulty
at the main verb.
45
MC’s hypotheses
• High-span participants read more of all kinds of sentences (not just ORs) than low
spans.
• SRCs have a canonical order, ORCs do not (recall Elman simulations and papers; recall
also the Chinese and German conundrum I discussed earlier).
On to the modeling. But first: What kind of network would you use?
46
The network
• Create a probabilistic CFG with a 30-word vocabulary, with Subject-verb agreement,
present/past tense, intransitive, transitive verbs, S/O RCs with multiple embeddings
allowed (also complex agreement patterns are allowed).
• “Crucially, subject- and object-relative constructions occurred with equal frequency
(2.5% of the sentences in the training set).”
• Training corpus: S/O RCs randomly interleaved with simple IV, TV sentences. 10,000
words in each corpus (10 separate ones created, one for each subject-simulation), mean
length 4.5 words, 3-27 words per sentence.
• One epoch was a pass through the entire corpus. To model experience, the results of
training after 1, 2, and 3 epochs were reported.
Diplomarbeit topic: How would the network do if (a) the corpus had no RCs at all,
or (b) a lot more RCs than 2.5%, or (c) the percentages of RCs and other constructions
actually reflected a large text corpus’ structures? (c) will be an important demonstration.
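The corpus-generation step above can be sketched with a toy probabilistic CFG sampler in Python. The rules, vocabulary, and probabilities here are illustrative, not MC's actual grammar:

```python
import random

random.seed(0)

# Toy PCFG: each nonterminal maps to (right-hand side, probability) pairs
grammar = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["N"], 0.95), (["N", "RC"], 0.05)],          # RCs kept rare
    "RC": [(["that", "NP", "V"], 0.5), (["that", "V", "NP"], 0.5)],
    "VP": [(["V"], 0.5), (["V", "NP"], 0.5)],
    "N":  [(["dog"], 0.5), (["cat"], 0.5)],
    "V":  [(["sees"], 0.5), (["chases"], 0.5)],
}

def expand(symbol):
    # Recursively rewrite a symbol until only terminals remain
    if symbol not in grammar:
        return [symbol]
    rules, probs = zip(*grammar[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [word for s in rhs for word in expand(s)]

corpus = [expand("S") for _ in range(1000)]
```

Adjusting the 0.05 probability on the RC rule is exactly the kind of manipulation the Diplomarbeit topic above calls for.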
47
Results for the RC experiment
• Averaging over RC type, lower error for high-span subjects at main verb.
• Averaging over epochs, lower error at the main verb in subject RCs.
• Epoch × RC type: in SRCs no effect of epoch, but in ORCs 3-epoch runs have lower
error at the main verb.
48
Conclusions
• “First, capacity is not some primitive, independent property of networks or humans in
our account but is instead strictly emergent from other architectural and experiential
factors.”
• “Second, capacity is not independent of knowledge, so that one cannot manipulate
factors underlying the capacity of a network (e.g. hidden unit layer size, activation
function, weight decay, connectivity pattern, training) without also affecting the
knowledge embedded within that network.”
Instead of capacity enabling skill, capacity is skill.
[This discussion is only an excerpt from the paper]
49
Diplomarbeit ideas
• Modeling individual differences: How would the network do if (a) the corpus had no
RCs at all, or (b) a lot more RCs than 2.5%, or (c) the percentages of RCs and other
constructions actually reflected a large text corpus’ structures? Surely (c) is the acid
test.
• Reconsidering capacity: Oberauer (No Date) defined a battery of tests that gives
a composite score of working memory. Nobody has ever checked whether this composite
score matches the high/low-capacity results of King and Just, and whether SRNs could
encode the claim that a composite capacity score of spatial, numerical, and verbal tasks
can be reflected in the network’s skill level.
• Distance and interference effects: How do SRNs deal with those? (Konieczny,
2000), (Lewis & Vasishth, 2005), (Suckow et al., 2005), (Van Dyke & Lewis, 2003),
(Vasishth & Lewis, 2005). . .
50
Closing remarks
• Connectionist modeling offers us many important insights, about the role of experience
and about subtle emergent properties of architectures.
• But as you may have found while doing the demos, it’s sometimes nontrivial to replicate
a result. This brings up the question of free parameters in the model. What are the
free parameters in connectionist models? We know that architectures like ACT-R have
many (without even including productions and data structures as free parameters).
• Connectionist models don’t yet have the coverage symbolic models do, but that’s a job
for people like you. As we saw, there are many challenging problems and the knowledge
we have from probabilistic methods is heavily underexplored at the moment.
• There’s an ongoing debate about symbolic versus connectionist models. Recall Newell’s
1973 paper and think about whether this debate is contentful. Issues: parallelism,
symbolic, localist data structures versus distributed ones, content-addressability of
information. . . can you think of a symbolic model that has the properties of
connectionist models? What does this say about this binary opposition?
51
References
Hsiao, F., & Gibson, E. (2003). Processing relative clauses in Chinese. Cognition, 90,
3–27.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual
differences in working memory. Psychological Review, 99(1), 122–149.
Konieczny, L. (2000). Locality and parsing complexity. Journal of Psycholinguistic
Research, 29(6), 627–645.
Konieczny, L., & Muller, D. (2004). Word order does not account for the advantage
of subject-extracted over object-extracted relative clauses. In AMLaP proceedings
(p. 34). Aix en Provence.
Korthals, C. (2001). Self embedded relative clauses in a corpus of German newspaper
texts. In K. Striegnitz (Ed.), Proceedings of the Sixth ESSLLI Student Session (pp.
179–190). Finland: University of Helsinki.
52
Kurz, D. (2000). A statistical account of word order variation in German. In Proceedings of
the COLING Workshop on Linguistically Interpreted Corpora. Luxembourg: COLING.
Lewis, R. L., & Nakayama, M. (2001). Syntactic and positional similarity effects in the
processing of Japanese embeddings. In M. Nakayama (Ed.), Sentence Processing in
East Asian Languages (pp. 85–113). Stanford, CA.
Lewis, R. L., & Vasishth, S. (2005). An activation-based model of sentence processing as
skilled memory retrieval. Cognitive Science, 29, 1-45.
Oberauer, K. (No Date). Working memory capacity—facets of a cognitive ability construct.
(Unpublished manuscript)
Suckow, K., Vasishth, S., & Lewis, R. L. (2005). Interference and memory overload during
parsing. In Proceedings of AMLaP 2005. Ghent, Belgium.
Van Dyke, J., & Lewis, R. L. (2003). Distinguishing effects of structure and decay on
attachment and repair: A cue-based parsing account of recovery from misanalyzed
ambiguities. Journal of Memory and Language, 49, 285–316.
Vasishth, S., & Lewis, R. L. (2005). Argument-head distance and processing complexity:
Explaining both locality and anti-locality effects. (Submitted to Language)
53
Waters, G. S., & Caplan, D. (1996). Processing resource capacity and the comprehension
of garden path sentences. Memory and Cognition, 24(3), 342–355.
54