An introduction to computational psycholinguistics:
Modeling human sentence processing
Shravan Vasishth
University of Potsdam, Germany
http://www.ling.uni-potsdam.de/∼vasishth
September 2005, Bochum
Neural structure
1
A model of the neuron
2
Activation functions for translating net input to activation
3
A model of layered neural connections
4
Five assumptions
• Neurons integrate information
• Neurons pass information about the level of their input.
• Brain structure is layered.
• The influence of one neuron on another depends on the strength of the connection
between them.
• Learning is achieved by changing the strengths of the connections between neurons.
5
The computations
Net input to unit i from units j = 1 . . . n, each with activation a_j, and with the weight of the connection from j to i being w_ij:

Netinput_i = Σ_{j=1}^{n} a_j w_ij (1)
Activation a_i of unit i, with f an activation function from inputs to activation values:

a_i = f(netinput_i) (2)
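These two computations can be sketched in a few lines of Python. The logistic function here is just one common choice for f; the slides leave f unspecified, and the example activations and weights are illustrative:

```python
import math

def netinput(a, w):
    # Net input to unit i: sum over incoming units j of a_j * w_ij
    return sum(a_j * w_ij for a_j, w_ij in zip(a, w))

def logistic(x):
    # One common activation function f mapping net input to activation
    return 1.0 / (1.0 + math.exp(-x))

a = [0.5, 1.0]    # activations a_j of the sending units (illustrative)
w = [0.2, -0.4]   # weights w_ij into unit i (illustrative)
a_i = logistic(netinput(a, w))
```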
6
Learning by weight change
Netinput_i = Σ_{j=1}^{n} a_j w_ij (3)

a_i = f(netinput_i) (4)
• Notice that the activity of i, ai, is a function of the weights wij and the activations
aj. So changing wij will change ai.
• In order for this simple network to do something useful, for a given set of input
activations aj, it should output some particular value for ai. Example: computing the
logical AND function.
7
The AND network: the single-layered perceptron
Assume a threshold activation function: if netinput_i is greater than 0, output a 1. The bias is −1.5.

Netinput = 0 × 1 + 0 × 1 − 1.5 = −1.5 (5)
Netinput = 0 × 1 + 1 × 1 − 1.5 = −0.5 (6)
Netinput = 1 × 1 + 0 × 1 − 1.5 = −0.5 (7)
Netinput = 1 × 1 + 1 × 1 − 1.5 = +0.5 (8)
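A minimal Python sketch of this AND unit, with weights 1, bias −1.5, and the threshold activation function:

```python
def and_unit(x1, x2, w1=1.0, w2=1.0, bias=-1.5):
    # Threshold activation: output 1 if net input exceeds 0, else 0
    net = x1 * w1 + x2 * w2 + bias
    return 1 if net > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_unit(x1, x2))
```

Only the (1, 1) input pushes the net input above threshold, reproducing equations (5)-(8).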
8
How do we decide what the weights are?
Let the weights be w_j = 0.5. Now the same network fails to compute AND.

Netinput = 0 × 0.5 + 0 × 0.5 − 1.5 = −1.5 (9)
Netinput = 0 × 0.5 + 1 × 0.5 − 1.5 = −1 (10)
Netinput = 1 × 0.5 + 0 × 0.5 − 1.5 = −1 (11)
Netinput = 1 × 0.5 + 1 × 0.5 − 1.5 = −0.5 (12)
9
The Delta rule for changing weights to get the desired output
We can repeatedly cycle through the simple network and adjust the weights so that
we achieved the desired ai. Here’s a rule for doing this:
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (13)
ε: learning rate parameter (determines how large the change will be on each learning
trial)
This is, in effect, a process of learning.
10
How the delta rule fixes the weights in the AND network
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (14)

Let a_i(desired) = 1; ε = 0.5. Consider now the activations we get:
Netinput = 0 × 0.5 + 0 × 0.5 − 1.5 = −1.5 (15)
Netinput = 0 × 0.5 + 1 × 0.5 − 1.5 = −1 (16)
Netinput = 1 × 0.5 + 0 × 0.5 − 1.5 = −1 (17)
Netinput = 1 × 0.5 + 1 × 0.5 − 1.5 = −0.5 ⇐ (18)

We don't need to mess with the first three, since there we already have a desired value (a net input less than zero). Look at the last one. Say the desired a_i = 1.
11
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (19)
= [1 − (−0.5)] × 1 × 0.5 (20)
= 0.75 (21)

We just performed what's called a training sweep.
Sweep: the presentation of a single input pattern causing activation to propagate
through the network and the appropriate weight adjustments to be carried out.
Epoch: One cycle of showing all the inputs in turn.
Now if we recompute the netinput with the incremented weights, our network starts
to behave as intended:
Netinput = 1 × 1.25 + 1 × 1.25 − 1.5 = 1 (22)
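The whole procedure can be sketched in Python. Following the worked example on these slides, the "obtained" activation in the delta rule is taken to be the net input, and a weight is changed only when the thresholded output is wrong (this is a sketch of the idea, not tlearn's exact procedure):

```python
# AND training data: ((x1, x2), desired output)
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.5, 0.5]      # initial weights, as on the previous slide
bias = -1.5
eps = 0.5           # learning rate

for epoch in range(5):                     # a few epochs suffice here
    for (x1, x2), desired in patterns:     # one sweep per pattern
        net = x1 * w[0] + x2 * w[1] + bias
        obtained = 1 if net > 0 else 0
        if obtained != desired:
            # Delta rule: change each weight by (desired - obtained) * a_j * eps
            for j, a_j in enumerate((x1, x2)):
                w[j] += (desired - net) * a_j * eps

print(w)  # -> [1.25, 1.25], the corrected weights from equation (22)
```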
12
Rationale for the delta rule
∆w_ij = [a_i(desired) − a_i(obtained)] a_j ε (23)

• If the obtained activity is too low, then [a_i(desired) − a_i(obtained)] > 0. This increases the weight.
• If the obtained activity is too high, then [a_i(desired) − a_i(obtained)] < 0. This decreases the weight.
• For any input unit j, the greater its activation a_j, the greater its influence on the weight change. The delta rule concentrates the weight change on units with high activity, because these are the most influential in determining the (incorrect) output.
There are other rules one can use. This is just an example of one of them.
13
Let’s do some simulation with tlearn
DEMO: Steps
• Create a network with 2 input nodes and 1 output node (plus a bias node with a fixed
output)
• Create a data file, and a teacher: the data file is the input and the teacher is the output
you want the network to learn to produce.
Input Output
0 0 0
1 0 0
0 1 0
1 1 1
• Creating the network
14
The AND network’s configuration
NODES:#define the nodes
nodes = 1 # number of units (excluding input units)
inputs = 2 # number of input nodes
outputs = 1 # number of output nodes
output node is 1 #always start counting output nodes from 1
CONNECTIONS:
groups = 0 # how many groups of connections must have same value?
1 from i1-i2 #connections
1 from 0 #bias node is always numbered 0, it outputs 1
SPECIAL:
selected = 1 # units selected for printing out the output of
weight_limit = 1.0 # causes initial weights to be +/-0.5
15
Training the network
• Set the training sweeps, the learning rate (how fast weights change), the momentum (how
similar the weight change is from one cycle to the next; this helps avoid getting stuck in local minima),
the random seed (for the initial random weights), and the training method (random or sequential).
• We can compute the error in any one case: desired − actual.
• How to evaluate (quantify) the performance of the network as a whole? Note that the
network will give four different actual activations in response to the four input pairs.
We need some notion of average error.
• Suggestions?
16
Average error: Root mean square
Root mean square error = √( Σ_k (t_k − o_k)² / n )

where t_k is the target output and o_k the obtained output for pattern k, and n is the number of patterns.
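A sketch of this error measure in Python, applied to the four AND patterns (a network that always answers 0 is used as the illustrative comparison):

```python
import math

def rms_error(targets, outputs):
    # Square root of the mean squared difference t_k - o_k
    n = len(targets)
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n)

# AND targets vs. the outputs of a network that always answers 0
print(rms_error([0, 0, 0, 1], [0, 0, 0, 0]))  # -> 0.5
```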
17
Exercise: Learning OR
• Build a network that can recognize logical OR and then XOR.
• Are these two networks also able to learn using the procedure we used for AND?
Readings for tomorrow: Elman 1990, 1991, 1993. Just skim them.
18
How to make the network predict what will come next?
Key issue: if we want the network to predict, we need to have a notion of time, of
now and now+1.
Any suggestions?
19
Simple recurrent networks
• Idea: Use recurrent connections to provide the network with a dynamic memory.
• The hidden node activation at time step t − 1 will be fed right back to the hidden
nodes at time t (the regular input nodes will also be providing input to the hidden
nodes). Context nodes serve to tell the network what came earlier in time.
[Figure: SRN architecture. Input x(t) and the context units (a copy of hidden z(t−1)) both feed the hidden layer z(t), which feeds the output y(t).]
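One time step of this architecture can be sketched in Python. The layer sizes and random weights here are illustrative; the essential point is that the hidden state z(t−1) is copied back in as context:

```python
import math, random

random.seed(1)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Weighted sums: one net input per receiving unit
    return [sum(w * x for w, x in zip(row, v)) for row in W]

n_in, n_hid, n_out = 3, 4, 3   # illustrative sizes
rand = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
W_ih = rand(n_hid, n_in)    # input   -> hidden
W_ch = rand(n_hid, n_hid)   # context -> hidden (the recurrent part)
W_ho = rand(n_out, n_hid)   # hidden  -> output

def srn_step(x_t, context):
    # The hidden activation depends on the current input AND the copied
    # hidden state from the previous time step (the context units)
    z_t = [logistic(h + c)
           for h, c in zip(matvec(W_ih, x_t), matvec(W_ch, context))]
    y_t = [logistic(o) for o in matvec(W_ho, z_t)]
    return y_t, z_t            # z_t becomes the next step's context

context = [0.0] * n_hid
for x_t in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    y_t, context = srn_step(x_t, context)
```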
20
Let’s take up the demos
Your printouts contain copies of Chapters 8 and 12 of Exercises in rethinking
innateness, Plunkett and Elman.
21
Elman 1990
22
Christiansen and Chater on recursion
• Chomsky showed that natural language grammars exhibit recursion, and that this rules
out finite state machines as models of language.
• According to Chomsky, this entails that language is innate: the child's language
exposure involves so few recursive structures that recursion could not possibly be learned
from experience.
• C(hristiansen+Chater): if connectionist models can reflect the limits on our ability to
process recursion, they constitute a performance model
• C notes a broader issue: symbolic rules apply without limit (infinitely), but in
real life we observe (through experiments) limits on processing ability. The reason for
this boundedness of processing falls out of the hardware's (wetware's) architecture.
• C proceeds to demonstrate that human constraints on processing recursion fall out
from the architecture of simple recurrent networks
23
Three kinds of recursion (acc. to Chomsky)
1. Counting recursion: a^n b^n
2. Cross-serial embeddings: a^n b^m c^n d^m
3. Center embeddings: a^n b^m c^m d^n
4. (Baseline: right-branching)
24
Benchmark: n-gram models
• In order to compare their results with an alternative frequency-based method of
computing predictability, they looked at the predictions made by 2- and 3-gram models.
• Diplomarbeit topic: try to find a better probabilistic parsing measure of predicting the
next word, compared to the SRN baseline. In John Hale’s work we will see an example
(though the goal of that work is different from the present discussion).
25
Three distinct languages
All languages contain only nouns (Ns) and verbs (Vs), both singular and plural:
• L1: a_N a_N b_V b_V (ignores agreement)
• L2: a_N b_V b_V a_N (respects agreement)
• L3: a_N b_N a_V b_V (respects agreement)
• Each language also had right-branching structures: a_N a_V b_N b_V (respects agreement)
26
Method
27
Method
Trained the network to make grammaticality judgements:
• 16 words in vocabulary: 4 of each of sing.Ns, plu.Ns, sing.Vs, plu.Vs
• Corpora: 5,000 variable length sentences
• SRN architecture: 17 input and output nodes (one node for End of Sentence marker),
2-100 Hidden Units
• Test corpora: 500 novel sentences
• Teaching the network: Compare outputs with estimates of true conditional probabilities
given prior context.
28
Determining the intended output of the network
Let c_i be a category (sing.N, plu.N, sing.V, plu.V).

P(c_p | c_1, c_2, . . . , c_{p−1}) = freq(c_1, c_2, . . . , c_{p−1}, c_p) / freq(c_1, c_2, . . . , c_{p−1}) (24)

Let w_n be a word of category c_p, and C_p the number of items in that category.

P(w_n | c_1, c_2, . . . , c_{p−1}) = freq(c_1, c_2, . . . , c_{p−1}, c_p) / (freq(c_1, c_2, . . . , c_{p−1}) × C_p) (25)
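A sketch of how the estimate in (24) could be computed from a corpus of category sequences (`conditional_prob` is a hypothetical helper, not C&C's code); dividing by the category size C_p, as in (25), then spreads that probability evenly over the words of the category:

```python
def conditional_prob(corpus, context, next_cat):
    # P(c_p | c_1 ... c_{p-1}) = freq(context + next_cat) / freq(context),
    # where freq counts occurrences of a category subsequence in the corpus
    def freq(seq):
        return sum(1 for s in corpus
                   for i in range(len(s) - len(seq) + 1)
                   if tuple(s[i:i + len(seq)]) == seq)
    ctx = tuple(context)
    return freq(ctx + (next_cat,)) / freq(ctx)

# Toy corpus of category sequences (illustrative)
corpus = [["singN", "singV"], ["pluN", "pluV"], ["singN", "singV"]]
print(conditional_prob(corpus, ["singN"], "singV"))  # -> 1.0
```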
29
Determining performance error
Squared error:

Σ_{j∈W} (out_j − P(w_n = j))² (26)

W is the set of words in the language (including the EOS marker).

Σ_{j∈W} (out_j − P(w_n = j))² / |W| (27)
Mean square error is used as a measure of average error over the corpus.
30
Computing grammaticality judgments
The Grammatical Prediction Error (GPE) for predicting a particular word:

GPE = 1 − hits / (hits + false alarms + misses) (28)
31
Computing hits, false alarms, and misses (simplified version)

Let G be the set of activated units that are grammatical (predictions), and U the set of ungrammatical ones.

hits H = Σ_{i∈G} u_i (29)

false alarms F = Σ_{i∈U} u_i (30)

Let m_i be the amount of activation by which a unit i fell short of a target activation.

misses M = Σ_{i∈G} m_i (31)
32
So, the bigger the m_i for any unit, the bigger the miss M.

t_i = f_i / Σ_{j∈G} f_j (32)

m_i = 0 if t_i − u_i ≤ 0, and t_i − u_i otherwise (33)
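The definitions in (28)-(31) and (33) can be sketched as a small Python function; this is a simplified reading, matching the simplified version above, with illustrative activations:

```python
def gpe(u, t, grammatical):
    # u: output activation u_i per unit; t: target activation t_i for
    # grammatical units; grammatical: the set G of units that are
    # grammatical continuations of the context
    hits = sum(u_i for i, u_i in u.items() if i in grammatical)
    false_alarms = sum(u_i for i, u_i in u.items() if i not in grammatical)
    misses = sum(max(0.0, t[i] - u[i]) for i in grammatical)  # eq. (33)
    return 1.0 - hits / (hits + false_alarms + misses)        # eq. (28)

u = {"singV": 0.5, "pluV": 0.3, "singN": 0.2}   # illustrative activations
t = {"singV": 0.6, "pluV": 0.4}                 # illustrative targets
print(round(gpe(u, t, grammatical={"singV", "pluV"}), 3))  # -> 0.333
```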
33
Results
• Above 10-15 units the performance is not affected.
• Counting recursion easiest, then cross-dependencies, and then center embeddings.
• By comparison, the bi-gram and tri-gram models found center embeddings easier than
cross-dependencies.
• Strange result: SRNs with at least 10 HUs had a lower MSE for center embeddings
than right branchings—once the first verb is encountered, the next is more predictable
in center embeddings.
• Above 15 hidden units or so, the number of HUs does not matter.
34
Results: modeling Bach et al. 1986
35
Results: Modeling Right branching embedding
36
Results: Modeling grammatical but unacceptable embeddings
Aside 37
Reconsidering the grammaticality/ungrammaticality distinction
The ACT-R model predicts associative retrieval interference at the first verb:
(1) a. [NP1 [NP2 [NP3 VP3] VP2] VP1]
# [The patient [who the nurse [who the clinic had hired] admitted] met Jack.]
b. [NP1 [NP2 [NP3 VP3] . . . ] VP1]
?[The patient [who the nurse [who the clinic had hired] . . . ] met Jack.]
(2) a. [NP1+ [NP2+ [NP3+ VP3] VP2] VP1]
The carpenter who the craftsman that the peasant had carried to the bus-stop had hurt
yesterday supervised the apprentice.
b. [NP1+ [NP2- [NP3+ VP3] VP2] VP1]
The carpenter who the pillar that the peasant had carried to the bus-stop had hurt yesterday supervised the apprentice.
Aside 38
Self-paced reading results: English (Suckow et al. 2005)
[Figure: Mean reading times (msecs, roughly 400-1000) by sentence position (Det N1 who Det N2 that Det N3 had V3 to the NP had V2 Adv V1 the NP) for four conditions: Grammatical/Ungrammatical × Similar/Dissimilar NPs. Regions are annotated for encoding interference, storage and retrieval interference, and (un-)grammaticality.]
Aside 39
Self-paced reading results: German (Suckow et al. 2005)
[Figure: Mean reading times (msecs, roughly 1000-2000) by sentence position (NP1 who NP2 who NP3 V3 V2 V1 NP) for four conditions: Grammatical/Ungrammatical × Similar/Dissimilar NPs. Regions are annotated for encoding interference, storage and retrieval interference, and (un-)grammaticality.]
40
Summary of some possible research projects for you
1. Making “experience” more realistic: The corpora do not reflect frequencies of occurrence of center embeddings and/or center-embedding types. The technology for computing these is now available (e.g., Korthals, 2001; Kurz, 2000). Modify the training corpus frequencies to reflect our best possible estimate of these frequencies and then examine the performance of the networks.
2. Probabilistic models of processing versus SRNs: C compare their model’s performance against simple n-gram models. Replicate their experimental results, and try to find a measure using probabilistic methods that can outperform their system or do as well.
3. Extending empirical coverage: modeling Suckow et al.’s data: Without messing with the architecture, carry out simulations to model the interference effects and the verb reading pattern asymmetries for English and German found by (Suckow, Vasishth, & Lewis, 2005). The poster handout of this paper is available from my home page. The simulations should be able to account for all the data covered by C as well as the new data (i.e., regression testing is necessary for a meaningful step forward).
4. The Chinese and German relative clause puzzle: (Konieczny & Muller, 2004) showed that word order may not be able to explain the German S/O relative clause facts: in German, the verb is final in both S and O relatives, and yet the ORs are harder. In Chinese (Hsiao & Gibson, 2003), SRs have the form [t_i V O] S_i while ORs are [S V t_i] O_i (which is also the canonical order in Chinese). In Chinese, SRs are harder than ORs. An explanation for this is the locality of the head’s arguments. SRNs should be able to explain this in terms of experience (word order). How can these two opposing results be reconciled in an SRN architecture?
5. Extending the coverage with Japanese double embeddings: Will the network perform well with 5-clause NPs and 2 verbs (ditransitive plus transitive)? We’d probably have to add HUs. What does this mean? Relevant empirical work: (Lewis & Nakayama, 2001) and references cited there.
41
Reassessing working memory: MacDonald and Christiansen 2002
Background:
• Just and Carpenter (1992) developed a working memory capacity theory of
comprehension which posits a linguistic working memory functionally separated from
the representation of linguistic knowledge.
• Waters and Caplan (1996) argued against J&C and in favor of two differentiated working
memories for language. One is dedicated to “obligatory” unconscious psycholinguistic
processes, and another to controlled, verbally mediated tasks.
• MC present a model where processing capacity emerges as a function of the architecture
and experience. Capacity is not a primitive but a dependent variable.
“In all cases, the choice of processing architecture has a direct effect on the claims
for the role of working memory in language comprehension” (p. 36)
42
The Just and Carpenter capacity theory
• Declarative knowledge and production rules are stored in permanent memory.
• A separate working memory space is used to process and store current input and partial
products of ongoing computations. (cf. the ACT-R architecture).
• In one processing cycle, several productions can fire in parallel. (cf. ACT-R architecture).
• Activation decay occurs in working memory. Working memory capacity and aphasia are
modeled by a maximum activation parameter.
• Note that there is a clear distinction between working memory resources and knowledge
of language.
• “It is not clear how one could readily incorporate a comprehensive role for experience
into the 3CAPS framework.”
43
The connectionist approach
• The network is the memory.
• Others have modeled capacity by varying the number of HUs, the amount of training
the network gets, the efficiency of information passing between units, and the amount of
“noise” in the input signal.
• In this paper, MC use experience as the factor determining processing ability.
44
An empirical starting point
(3) a. The reporter that the senator attacked admitted the error.
b. The reporter that attacked the senator admitted the error.
Main results (Just and Carpenter 1992):
Daneman and Carpenter’s span task: read some sentences aloud, and then try to recall
the last word of each sentence. The number of sentences is gradually increased. We
differentiate between high- and low-span subjects below.
• Averaging over RC type, shorter RTs for high-span subjects at main verb.
• Averaging over span types, shorter RT at the main verb in subject RCs.
• Span × RC type: in SRCs no effect of span, but in ORCs high-spans have less difficulty
at the main verb.
45
MC’s hypotheses
• High-span participants read more of all kinds of sentences (not just ORs) than low
spans.
• SRCs have a canonical order, ORCs do not (recall Elman simulations and papers; recall
also the Chinese and German conundrum I discussed earlier).
On to the modeling. But first: What kind of network would you use?
46
The network
• Create a probabilistic CFG with a 30-word vocabulary, with Subject-verb agreement,
present/past tense, intransitive, transitive verbs, S/O RCs with multiple embeddings
allowed (also complex agreement patterns are allowed).
• “Crucially, subject- and object-relative constructions occurred with equal frequency
(2.5% of the sentences in the training set).”
• Training corpus: S/O RCs randomly interleaved with simple IV, TV sentences. 10,000
words in each corpus (10 separate ones created, one for each subject-simulation), mean
length 4.5 words, 3-27 words per sentence.
• One epoch was a pass through the entire corpus. To model experience, the results of
training after 1, 2, and 3 epochs were reported.
Diplomarbeit topic: How would the network do if (a) the corpus had no RCs at all,
or (b) a lot more RCs than 2.5%, or (c) the percentages of RCs and other constructions
actually reflected a large text corpus’ structures? (c) will be an important demonstration.
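The corpus-generation step above can be sketched with a toy probabilistic CFG sampler in Python. The rules, vocabulary, and probabilities here are illustrative, not MC's actual grammar:

```python
import random

random.seed(0)

# Toy PCFG: each nonterminal maps to (right-hand side, probability) pairs
grammar = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["N"], 0.95), (["N", "RC"], 0.05)],          # RCs kept rare
    "RC": [(["that", "NP", "V"], 0.5), (["that", "V", "NP"], 0.5)],
    "VP": [(["V"], 0.5), (["V", "NP"], 0.5)],
    "N":  [(["dog"], 0.5), (["cat"], 0.5)],
    "V":  [(["sees"], 0.5), (["chases"], 0.5)],
}

def expand(symbol):
    # Recursively rewrite a symbol until only terminals remain
    if symbol not in grammar:
        return [symbol]
    rules, probs = zip(*grammar[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [word for s in rhs for word in expand(s)]

corpus = [expand("S") for _ in range(1000)]
```

Adjusting the 0.05 probability on the RC rule is exactly the kind of manipulation the Diplomarbeit topic above calls for.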
47
Results for the RC experiment
• Averaging over RC type, lower error for high-span subjects at main verb.
• Averaging over epochs, lower error at the main verb in subject RCs.
• Epoch × RC type: in SRCs no effect of epoch, but in ORCs 3-epoch runs have lower
error at the main verb.
48
Conclusions
• “First, capacity is not some primitive, independent property of networks or humans in
our account but is instead strictly emergent from other architectural and experiential
factors.”
• “Second, capacity is not independent of knowledge, so that one cannot manipulate
factors underlying the capacity of a network (e.g. hidden unit layer size, activation
function, weight decay, connectivity pattern, training) without also affecting the
knowledge embedded within that network.”
Instead of capacity enabling skill, capacity is skill.
[This discussion is only an excerpt from the paper]
49
Diplomarbeit ideas
• Modeling individual differences: How would the network do if (a) the corpus had no
RCs at all, or (b) a lot more RCs than 2.5%, or (c) the percentages of RCs and other
constructions actually reflected a large text corpus’ structures? Surely (c) is the acid
test.
• Reconsidering capacity: Oberauer (No Date) defined a battery of tests that gives
a composite score of working memory. Nobody has ever checked whether this composite
score matches the high/low-capacity results of King and Just, and whether SRNs could
encode the claim that a composite capacity score of spatial, numerical, and verbal tasks
can be reflected in the network’s skill level.
• Distance and interference effects: How do SRNs deal with those? (Konieczny,
2000), (Lewis & Vasishth, 2005), (Suckow et al., 2005), (Van Dyke & Lewis, 2003),
(Vasishth & Lewis, 2005). . .
50
Closing remarks
• Connectionist modeling offers us many important insights, about the role of experience
and about subtle emergent properties of architectures.
• But as you may have found while doing the demos, it’s sometimes nontrivial to replicate
a result. This brings up the question of free parameters in the model. What are the
free parameters in connectionist models? We know that architectures like ACT-R have
many (without even including productions and data structures as free parameters).
• Connectionist models don’t yet have the coverage symbolic models do, but that’s a job
for people like you. As we saw, there are many challenging problems and the knowledge
we have from probabilistic methods is heavily underexplored at the moment.
• There’s an ongoing debate about symbolic versus connectionist models. Recall Newell’s
1973 paper and think about whether this debate is contentful. Issues: parallelism,
symbolic, localist data structures versus distributed ones, content-addressability of
information. . . can you think of a symbolic model that has the properties of
connectionist models? What does this say about this binary opposition?
51
References
Hsiao, F., & Gibson, E. (2003). Processing relative clauses in Chinese. Cognition, 90,
3–27.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual
differences in working memory. Psychological Review, 99(1), 122–149.
Konieczny, L. (2000). Locality and parsing complexity. Journal of Psycholinguistic
Research, 29(6), 627–645.
Konieczny, L., & Muller, D. (2004). Word order does not account for the advantage
of subject-extracted over object-extracted relative clauses. In AMLaP proceedings
(p. 34). Aix en Provence.
Korthals, C. (2001). Self embedded relative clauses in a corpus of German newspaper
texts. In K. Striegnitz (Ed.), Proceedings of the Sixth ESSLLI Student Session (pp.
179–190). Finland: University of Helsinki.
52
Kurz, D. (2000). A statistical account of word order variation in German. In Proceedings of
the COLING Workshop on Linguistically Interpreted Corpora. Luxembourg: COLING.
Lewis, R. L., & Nakayama, M. (2001). Syntactic and positional similarity effects in the
processing of Japanese embeddings. In M. Nakayama (Ed.), Sentence Processing in
East Asian Languages (pp. 85–113). Stanford, CA.
Lewis, R. L., & Vasishth, S. (2005). An activation-based model of sentence processing as
skilled memory retrieval. Cognitive Science, 29, 1-45.
Oberauer, K. (No Date). Working memory capacity—facets of a cognitive ability construct.
(Unpublished manuscript)
Suckow, K., Vasishth, S., & Lewis, R. L. (2005). Interference and memory overload during
parsing. In Proceedings of AMLaP 2005. Ghent, Belgium.
Van Dyke, J., & Lewis, R. L. (2003). Distinguishing effects of structure and decay on
attachment and repair: A cue-based parsing account of recovery from misanalyzed
ambiguities. Journal of Memory and Language, 49, 285–316.
Vasishth, S., & Lewis, R. L. (2005). Argument-head distance and processing complexity:
Explaining both locality and anti-locality effects. (Submitted to Language)
53
Waters, G. S., & Caplan, D. (1996). Processing resource capacity and the comprehension
of garden path sentences. Memory and Cognition, 24(3), 342–355.
54