Methods and Techniques in Neuroscience
An introduction to neural network models
Alan Pickering Autumn Term 2005
Outline: Part 1
Neural learning mechanisms
  Learning through connections
  Hebbian learning
What is connectionism (some definitions / distinctions)?
  Connectionism
  Computational modelling
  Neural net modelling
Building a simple model
  Terminology & structure
  How information passes through a model - exercise 1
  How a model learns (supervised & unsupervised learning) - exercise 2
Neural learning
Originally, it was thought that new associations/memories were formed by the growth of new nerve cells in the brain
Santiago Ramon y Cajal (1852-1934)
Proposed that learning might occur through the strengthening of existing connections between nerve cells
Donald O. Hebb (1904-1985)
Formulated Cajal’s ideas into a hypothetical biological mechanism dubbed ‘Hebbian learning’
Hebbian Learning
Learned associations through the strengthening of connections...
[Figure: connections linking the US, UR and CS]
Hebbian learning: When two joining cells fire simultaneously, the connection between them strengthens (Hebb, 1949)
Discovered at a biomolecular level by Lømo (1966) (long-term potentiation).
What is connectionism?
Starting definition:
“Connectionism is the study of how learning can occur through the strengthening and/or weakening of connections between representations of pieces of information and/or behavioural responses”
We can relate this definition straight back to classical and operant conditioning (connections strengthened between stimulus and response, or between CS and CR)
Connectionist modelling is usually concerned with more complex associations and larger numbers of connections. Computers are used to store and process all of this information
A connectionist model is a computer simulation of learning
What is connectionism?
Biological plausibility:
Some (but not all) connectionists appeal to the physical similarity of connectionist models to networks of neurons in the brain (neural net modelling)
Neurons: Nodes
Synapses: Connections (weights)
Structure and terminology of connectionist models
Input nodes (units)
Hidden nodes (units)
Output nodes (units)
Output layer
Input layer
How a connectionist model works
[Figure: three input nodes (j1, j2, j3) connected to one output node (u); w = weights, a = activation, o = output]
Node:              j1    j2    j3
Activation (aj):   .8    .7    .1
Outputs (outj):    1     1     0
Weights (wju):     +0.8  -0.5  +0.2
Activation (au)?
What input arrives at the output node (node u)?
Iu = ∑j outj * wju
From j1: 1*0.8 = 0.8
From j2: 1*-0.5 = -0.5
From j3: 0*0.2 = 0.0
Sum of inputs = 0.3
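The weighted sum in this worked example can be checked with a few lines of Python. This is only a minimal sketch: the function name net_input is an illustrative choice, not something taken from the slides, while the outputs (1, 1, 0) and weights (+0.8, -0.5, +0.2) are the ones given in the example.

```python
def net_input(outputs, weights):
    """Return I_u = sum over j of out_j * w_ju for one receiving node u."""
    return sum(out_j * w_ju for out_j, w_ju in zip(outputs, weights))


# Values from the worked example: outputs of j1, j2, j3 and their weights to u
outputs = [1, 1, 0]
weights = [0.8, -0.5, 0.2]

i_u = net_input(outputs, weights)
print(round(i_u, 2))  # rounded to hide floating-point noise
```

The same function can be reused for Exercise 1 by swapping in that slide's outputs and weights.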
Exercise 1
[Figure: three input nodes (j1, j2, j3) connected to one output node (u); w = weights, a = activation, o = output]
Node:              j1    j2    j3
Activation (aj):   .8    .7    .6
Outputs (outj):    1     1     1
Weights (wju):     +0.8  -0.7  +0.4
Activation (au)?
What input arrives at the output node (node u)?
Iu = ∑j outj * wju
From j1: 1*0.8 = ???
From j2: 1*-0.7 = ???
From j3: 1*0.4 = ???
Sum of inputs = ???
Activations and Outputs 1
[Figure: input nodes j1, j2, j3 with activations (aj) .8 .7 .1, connected to node u]
Activation = membrane potential of cell
Activation = 0 = resting state (-70mV)
+ve activation = cell being depolarised
-ve activation = cell being hyperpolarised (inhibited)
Cells fire (send an action potential) when sufficiently depolarised
Output = mean firing rate of cell
Output of each node j is a function of its activation, outj = f(aj)
Activations and Outputs 2
How to convert activations into outputs?
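Threshold
Gain (degree of nonlinearity)
[Figure: nodes j1, j2, j3 and k1 feeding node u through an excitatory input (Excu) and an inhibitory input (Inhu). Activation (aj): .8 .7 .1; activation of k1: .6; outputs (outj): 1 1 0; output of k1: 1; weights (wji): +0.8, +0.5, +0.2 and -0.7]
[Figure: An example output function. Output (0 to 1) plotted against activation (0 to 1) for high, mid and low gain, with the threshold marked]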
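The slides show the output function only as a plot, so the exact formula is not known. One common choice with both a threshold and a gain parameter is the logistic (sigmoid) function, sketched below as an assumption rather than as the function actually used in the lecture. The default threshold of 0.5 and the gain values are hypothetical.

```python
import math


def output(activation, threshold=0.5, gain=10.0):
    """Logistic output function (an assumed form, not taken from the slides).

    threshold: activation at which the output is exactly 0.5
    gain: steepness of the curve; higher gain is closer to a step function
    """
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))


# Compare low, mid and high gain at a few activation levels
for gain in (2.0, 10.0, 50.0):
    row = [round(output(a, gain=gain), 3) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(gain, row)
```

With a high gain the node behaves almost like an all-or-none threshold unit; with a low gain the output changes gradually with activation.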
Activations and Outputs 3
Creating an activation equation, one term at a time:
dau(t)/dt = ce * Excu(t)
dau(t)/dt = ce * Excu(t) * (Max - au(t))
dau(t)/dt = ce * Excu(t) * (Max - au(t)) + ci * Inhu(t) * (au(t) - Min)
dau(t)/dt = ce * Excu(t) * (Max - au(t)) + ci * Inhu(t) * (au(t) - Min) - cd * au(t)
[Figure: the network of j and k nodes feeding node u, with the activation au of node u marked]
Simulated Activations and Outputs
[Figure: three panels plotted against timestep (0 to 2000): activation of u node (-0.5 to 1), output of j nodes (0 to 1), output of k nodes (0 to 1)]
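The full activation equation can be simulated by taking small steps in time (Euler integration). The sketch below is an illustration under stated assumptions, not the simulation behind the slide's plot: the constants ce, ci, cd, the step size dt and the input values are all hypothetical, Max = 1 and Min = -0.5 are read off the plot's axis, and Inhu is assumed to be negative (a positive output passed through a negative weight), so that the inhibitory term pulls activation down towards Min.

```python
def simulate_activation(exc, inh, steps=2000, dt=0.01,
                        ce=1.0, ci=1.0, cd=0.5,
                        a_max=1.0, a_min=-0.5, a0=0.0):
    """Euler simulation of
    da/dt = ce*Exc*(Max - a) + ci*Inh*(a - Min) - cd*a
    with constant excitatory (exc) and inhibitory (inh) input."""
    a = a0
    trace = []
    for _ in range(steps):
        da = ce * exc * (a_max - a) + ci * inh * (a - a_min) - cd * a
        a += dt * da
        trace.append(a)
    return trace


# Hypothetical inputs: excitation only, then excitation plus inhibition
excited = simulate_activation(exc=1.3, inh=0.0)
inhibited = simulate_activation(exc=1.3, inh=-0.7)
print(round(excited[-1], 3), round(inhibited[-1], 3))
```

The (Max - a) and (a - Min) factors keep the activation between its floor and ceiling, and the decay term returns it towards zero when input stops.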
How does a connectionist model learn?
Following Cajal and Hebb, connectionist models learn through changes in the strength of connections:
Weight changes
There are two types of learning:
Unsupervised learning
In unsupervised learning the weight changes are made automatically and in relation to the degrees of association between incoming activations (e.g. classical conditioning: associations strengthen through temporal contiguity)
Supervised learning
In supervised learning the weight changes are made in proportion to the error at the output. In order to calculate the error at the output a teaching pattern is required for comparison, hence “supervised” (e.g. learning to talk or spell)
Unsupervised learning: 1
The most commonly used unsupervised learning rule is the “Hebb rule” (cf. Hebbian learning):
∆w = k au aj
∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation
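Unsupervised learning: 2
Example: Classical conditioning
∆w = k au aj
∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation
[Figure: a food node and a tone node, each connected to one output node, shown before, during and after conditioning]
Before conditioning: the food weight is 1.0 and the tone weight is 0.0. Food alone (food = 1, tone = 0) gives an output of 1; tone alone (food = 0, tone = 1) gives an output of 0.
During conditioning: food and tone are presented together (food = 1, tone = 1) and the output is 1. The weights become 1.0 (food) and 0.6 (tone).
After conditioning: tone alone (food = 0, tone = 1) gives an output of 0.6.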
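The conditioning example can be traced in code. This is a minimal sketch under two assumptions that the slides do not state: the output node's activation is simply the weighted sum of its inputs, and only the tone (CS) connection is modified by the Hebb rule, with the food (US) connection treated as fixed at 1.0. The function names are illustrative.

```python
K = 0.6  # learning constant from the slide's example


def respond(inputs, weights):
    """Activation of the output node: weighted sum of its inputs (assumed)."""
    return sum(a * w for a, w in zip(inputs, weights))


def hebb_update(weight, a_post, a_pre, k=K):
    """Hebb rule: delta_w = k * a_u * a_j, added to the current weight."""
    return weight + k * a_post * a_pre


w_food, w_tone = 1.0, 0.0  # weights before conditioning

# Before conditioning: tone alone produces no response
before = respond([0, 1], [w_food, w_tone])

# During conditioning: food and tone together drive the output node,
# and the tone connection is strengthened
a_post = respond([1, 1], [w_food, w_tone])
w_tone = hebb_update(w_tone, a_post, a_pre=1)

# After conditioning: tone alone now produces a response
after = respond([0, 1], [w_food, w_tone])
print(before, w_tone, after)
```

Repeating the conditioning trial would keep increasing the tone weight, since the plain Hebb rule has no upper bound.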
Supervised learning: 1
In supervised learning the weight changes are made in proportion to the error at the output
In order to calculate the error at the output a teaching pattern is required for comparison (hence supervised) (e.g. learning to read or spell)
The most commonly used supervised learning rule is the “delta rule” (Rosenblatt, 1966; Rumelhart & McClelland, 1986):
∆wju = k up aj
∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation
Example: Learning to read
In order to read we need to learn how orthography (how a word looks) maps onto phonology (how it sounds)
Imagine a network is learning how to pronounce ‘hit’. Say that ‘hit’ has the orthography 1 1 0 and the phonology 0 1 1...
Supervised learning: 2
∆wju = k up aj
∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation
Example: Learning to read
‘Hit’ has:
Orthography 1 1 0
Phonology 0 1 1
[Figure: network with input 1 1 0, output 0 1 0 and teaching pattern 0 1 1]
First - the error is calculated at each output
Second - the blame is apportioned (to active units contributing to the error)
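The two steps for ‘hit’ (find the error, then apportion blame) can be sketched as follows. The figure is read here as showing input 1 1 0, actual output 0 1 0 and teaching pattern 0 1 1. The error is assumed to be teaching value minus actual output, and k = 0.5 is a hypothetical constant; neither is specified on the slides.

```python
def delta_rule_changes(inputs, outputs, targets, k=0.5):
    """Return weight changes changes[j][u] = k * error_u * a_j.

    error_u is assumed to be (teaching value - actual output) at output u.
    """
    errors = [t - o for t, o in zip(targets, outputs)]
    return [[k * err * a_j for err in errors] for a_j in inputs]


# 'hit': orthography (input), actual output, phonology (teaching pattern)
changes = delta_rule_changes(inputs=[1, 1, 0],
                             outputs=[0, 1, 0],
                             targets=[0, 1, 1])

for j, row in enumerate(changes, start=1):
    print(f"from input {j}:", row)
```

Only connections from active inputs to the output unit that was wrong receive a non-zero change; every other weight is left alone.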
Exercise 2
Example: Learning to read
∆wju = k up aj
∆w = weight change
k = a constant
up = error at output
aj = pre-synaptic activation
‘Cat’ has:
Orthography 0 1 0
Phonology 1 1 0
[Figure: network with input nodes j1, j2, j3 and output nodes u1, u2, u3. Input: 0 1 0; output: 1 1 1; teaching pattern: 1 1 0]
Which connection/s will be altered?
Remember:
1st find the error
2nd apportion blame
3rd alter weights coming from blamed input node/s to errorful output node/s
Neural network models: Part 1 summary
Recap on neural learning mechanisms
Learning through connections: Cajal & Hebb both suggested that learning in the human brain may occur through changes in the strength of connections between neurons
Hebbian learning: Hebb formulated a mechanism by which this associative learning might occur: synchronous pre- and post-synaptic firing increases the strength of the connection
What is connectionism (some definitions / distinctions)?
Connectionism deals with the ways in which statistical structure in the environment can be learned by the strengthening and/or weakening of connections between representations of items of information
Connectionism involves computer simulations of learning
Neural net modelling: Connectionist models gain credibility because they resemble networks of neurons in the brain
Neural network models: Part 1 Summary
Building a connectionist model
Structure
They are a network of ‘nodes’ and ‘weighted connections’. Information passes from node to node through the ‘weighted connections’, usually in one direction.
Terminology
Nodes can be thought of as neurons, and weighted connections as synapses between neurons
The flow of information through the model
The weighted connections control whether or not information passes from one layer of nodes to the next (∑j outj * wju)
Unsupervised learning
As in Hebbian learning, unsupervised models use correlations of pre- and post-synaptic firing to change strengths of connections (∆wju = k au aj)
Supervised learning
Supervised learning relies on the error at the output (provided by a teaching pattern) to determine the changes in connection strength (∆wju = k up aj)
Outline: Part 2
Why connectionism? (the case for parallel distributed processing)
  Parallel processing
  Distributed processing (representations)
An example of how connectionist models can help us understand learning:
  Learning of inflectional morphology in early childhood
  Pinker & Prince (1988)
  Rumelhart & McClelland (1986)
Parallel distributed processing (PDP)
Rumelhart & McClelland (1986) made a very strong case for the use of connectionist models, highlighting the qualities of parallel and distributed processing
Parallel Processing
Parallel processing can be contrasted with serial processing
Processing several different pieces of information at the same time, rather than one after the other:
Example: face processing
Distributed Processing
Distributed processing can be contrasted with localist processing
Representations of information are distributed across the whole neural network, rather than occupying specific locations
Example: Karl Lashley and the ‘Engram’ (location of memory)
Parallel processing
Example: Face processing
When looking at this face we recognise it not by looking at individual features one at a time (the eyes, the nose, the smile, the grimace), but by processing these features and their spatial configuration in parallel
[Figure: eye nodes, nose nodes, smile nodes and a grimace node linked to ‘Margaret Thatcher’ and ‘Tony Blair’]
Distributed processing
Example: Lashley and the search for the ‘Engram’
Karl Spencer Lashley (1890-1958)
Pioneering researcher in the biological foundations of memory in the rat
Lashley and colleagues lesioned different parts of rats’ brains to test whether any one lesion removed the part responsible for memory. They found no single locus (engram) which appeared to be solely responsible for memory
Lashley concluded that memory was distributed throughout the cortex, rather than localised in one specific place
A connectionist model of the acquisition of inflectional morphology
Inflectional morphology
The way in which we change words to convey:
  Plurality
  Past tense
Examples:
  Cat + -s --> Cats (Plural)
  Play + -ed --> Played (Past tense) - *we focus on this*
In English:
  90% of verbs have a regular morphology
  10% an irregular morphology
Acquisition of inflectional morphology
Past tense morphology
Regular Morphology (90% verbs)
  talk => talked
  ram => rammed
  pit => pitted
Irregular Morphology (10% verbs)
  hit => hit ‘no change’
  come => came ‘vowel change’
  sleep => slept ‘vowel change’
  go => went ‘arbitrary’
How do children learn which words require a regular ending and which are irregular?
Acquisition of inflectional morphology
U-shaped Development
Initially, children’s early inflections are correct
Later they start making over-regularisation errors:
  Hitted
  Sleeped
  Goed
Later still children recover from these errors
Phase 1: Rote learning -> initial error free performance
Phase 2: Rule extraction -> over-regularisation errors
Phase 3: Rule + rote -> recovery from errors
Acquisition of inflectional morphology
Dual-route model (Pinker & Prince, 1988)?
‘Rule’ route deals with regular verbs
Exceptions route deals with irregulars
[Figure: the input stem feeds an exceptions route and a rule route, which produce the output inflection]
Errors in the middle of development occur due to overuse of the ‘rule’ system
But do we really need two routes to explain this pattern of learning?
Acquisition of inflectional morphology
Connectionist model (McClelland & Rumelhart, 1986)?
A single route connectionist model for learning the past tense
No rule route for producing the regular ending
The network ‘learned’ to associate regular & irregular English verb stems with their past tense forms.
[Figure: Wickelfeature representation of stem mapped onto Wickelfeature representation of past tense]
One single network learns to produce past tense for all the verbs it is taught
Acquisition of inflectional morphology
Connectionist model (McClelland & Rumelhart, 1986)?
Like the children the network:
Made over-regularisation errors
Demonstrated u-shaped development in its performance on irregular verbs
[Figure: % correct past tense (50 to 100) against training epochs (0 to 200), with separate curves for irregulars and regulars; the vocabulary discontinuity is marked]
Acquisition of inflectional morphology
Connectionist model of the acquisition of inflectional morphology (McClelland & Rumelhart, 1986)
This model demonstrates that behaviour which looks as if it is driven by a knowledge of rules can in fact be driven by a distributed representation of statistical structure of the input (in this case a distributed representation of language)
This explanation of how it is possible that ‘rule-like behaviour’ can occur in the absence of any actual representation of the rule is an important contribution of connectionist models
Neural network models: Part 2 Summary
Why connectionism? (the case for parallel distributed processing)
Parallel processing
Connectionist models process information in parallel, rather than serially; this has intuitive appeal when we consider how we take information in
Distributed processing (representations)
Connectionist models represent learned information in the distributed connections across the network, rather than in single locations or ‘rules’; this helps explain why it is difficult to find a single location for memory in the brain
An example of how connectionist models can help us understand learning:
Rumelhart & McClelland’s (1986) model
This models the acquisition of inflectional morphology and demonstrates how networks can show ‘rule-like’ behaviour, in the absence of any representation of a rule.
Outline: Part 3
A history of neural network models
Single layer networks
  Rosenblatt’s perceptron (1958)
  Minsky & Papert’s (1969) criticism of perceptrons
Multi layer networks
  McClelland & Rumelhart’s (1986) ‘Backpropagation of error’ learning rule
One-layer vs multi-layer networks
One-layer networks were the first connectionist models to emerge in the 1950s
Frank Rosenblatt’s (1958) ‘Perceptron’
In these networks learning occurs through changes in the weights of only one layer
However, networks which only change one layer of weights have some important limitations
These were pointed out by Minsky & Papert (1969)…
Minsky & Papert (1969)
Minsky & Papert (1969) made the point that single-layer networks cannot solve ‘non-linearly separable problems’
Maths Example: The XOR problem
[Figure: single layer network with two inputs and one output]
Task: to learn to solve XOR
input   sum   output
1 1     2     0
1 0     1     1
0 1     1     1
0 0     0     0
If you think about the sums of the inputs then we can see why this isn’t linearly separable:
When the sum of the inputs increases to 2, the desired output goes back down to 0
Single layer networks cannot solve this kind of problem
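One way to get a feel for this limitation is to search for weights by brute force. The sketch below tries every combination of two weights and a threshold on a coarse grid for a single threshold unit. The grid, the firing rule (fire if the weighted sum exceeds the threshold) and the function names are illustrative assumptions. A grid search is a demonstration rather than a proof, but it finds a setting for OR and none for XOR.

```python
from itertools import product

XOR_PATTERNS = [((1, 1), 0), ((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]
OR_PATTERNS = [((1, 1), 1), ((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]

# Candidate values from -2.0 to 2.0 in steps of 0.25
GRID = [i / 4 for i in range(-8, 9)]


def unit_output(inputs, w1, w2, threshold):
    """Single threshold unit: fires (1) if the weighted sum exceeds the threshold."""
    return 1 if inputs[0] * w1 + inputs[1] * w2 > threshold else 0


def find_solution(patterns, grid=GRID):
    """Return the first (w1, w2, threshold) that gets every pattern right, or None."""
    for w1, w2, threshold in product(grid, repeat=3):
        if all(unit_output(x, w1, w2, threshold) == t for x, t in patterns):
            return (w1, w2, threshold)
    return None


print("OR :", find_solution(OR_PATTERNS))
print("XOR:", find_solution(XOR_PATTERNS))
```

No finer grid would help with XOR: a single unit divides the input space with one straight line, and no single line separates (1,0) and (0,1) from (0,0) and (1,1).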
Minsky & Papert (1969)
The inability to solve non-linear problems is a problem for any model of human learning because humans can solve non-linear problems.
Psychological Example 1: Learning to eat the right amount
We all have to learn to eat enough to stay fit, but not so much as to make us sick
This is like solving the XOR problem (we can learn to eat some of the food but not all of it)
Psychological Example 2: Connected and unconnected figures
[Figure: four figures labelled a, b, c and d]
How might a single layer network try to solve this? (1 = connected, 0 = unconnected)
All figures have three horizontal lines so have to work this out on the basis of the presence of vertical lines (at particular locations)
The net might start by discriminating between two connected and unconnected figures (e.g. c and d) by locating a vertical line (e.g. in the bottom left)
But this also leads the net to discriminate between figures which we want to group together (e.g. c and b)
A solution: Multi-layered networks
Rosenblatt and Minsky & Papert agreed that this problem would be solved if you could train networks with more than one layer:
Networks with ‘hidden units’ can redescribe the input into a format that can be separated linearly
“Hidden units allow the network to treat physically similar inputs as different, as the need arises”
Multi-layered networks solve XOR
A multilayered network can solve the XOR problem:
[Figure: network with two inputs, two hidden units and one output. Weights from inputs to hidden units: +1 -1 -1 +1; weights from hidden units to output: +1 +1]
You can set up a two layered network like the example in the figure (with appropriate weights) to solve the XOR problem
Task: to learn to solve XOR
input   hidden   output
1 1     0 0      0
1 0     1 0      1
0 1     0 1      1
0 0     0 0      0
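The hand-set XOR network can be written out directly. The sketch below uses the weights shown on the slide (+1, -1, -1, +1 into the hidden units and +1, +1 into the output), but two details are assumptions: which weight belongs to which connection, and the rule that a unit outputs 1 when its summed input is greater than 0.

```python
def step(net):
    """Threshold unit (assumed rule): output 1 if the summed input is above 0."""
    return 1 if net > 0 else 0


# W_INPUT_HIDDEN[i][h] is the weight from input i to hidden unit h (assumed layout)
W_INPUT_HIDDEN = [[+1, -1],
                  [-1, +1]]
# W_HIDDEN_OUTPUT[h] is the weight from hidden unit h to the output
W_HIDDEN_OUTPUT = [+1, +1]


def xor_net(inputs):
    """Return (hidden pattern, output) for a two-element binary input."""
    hidden = [step(sum(inputs[i] * W_INPUT_HIDDEN[i][h] for i in range(2)))
              for h in range(2)]
    output = step(sum(hidden[h] * W_HIDDEN_OUTPUT[h] for h in range(2)))
    return hidden, output


for pattern in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    hidden, output = xor_net(pattern)
    print(pattern, hidden, output)
```

Each hidden unit detects "this input on and the other off", so the hidden layer redescribes the input in a form the output unit can separate with a single threshold.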
But the real problem envisaged by Minsky & Papert was how to train the weights on a network with two layers……..???
Training multi-layered networks
When a network learns through the delta rule, a teaching pattern is there to correct the weights leading into the output by measuring the difference between the teaching pattern and the output:
[Figure: network in which the output pattern is compared with the teaching pattern]
BUT!! - there is no teaching pattern for the hidden layer which can be used to change the weights from the input!
The inability to find a learning rule which would change all weights in a network stifled connectionist research until a solution was found in 1986...
Back-propagation of error
McClelland & Rumelhart (1986) came up with a learning rule to solve this: ‘back-prop’
Backprop learning:
Backprop is an extension of the delta (supervised learning) rule
Error at the output is used to assign blame to particular hidden units
Then this blame is converted to error, and this is then used to calculate weight changes to the weights from input to hidden units
[Figure: network with input, hidden and output layers, with the teaching pattern shown at the output]
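A single back-propagation step can be sketched as follows. The slides describe the idea but give no formula, so the details here follow the standard textbook form and should be read as assumptions: sigmoid units, error defined as teaching value minus output, and blame for each hidden unit computed as the output error passed back through that unit's outgoing weight. The network size, starting weights and constant k are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_ih, w_ho):
    """Forward pass; w_ih[i][h] is input-to-hidden, w_ho[h] is hidden-to-output."""
    n_hidden = len(w_ho)
    hidden = [sigmoid(sum(inputs[i] * w_ih[i][h] for i in range(len(inputs))))
              for h in range(n_hidden)]
    out = sigmoid(sum(hidden[h] * w_ho[h] for h in range(n_hidden)))
    return hidden, out


def backprop_step(inputs, target, w_ih, w_ho, k=0.5):
    """One weight update; returns new copies of both weight layers."""
    n_hidden = len(w_ho)
    hidden, out = forward(inputs, w_ih, w_ho)
    # Error at the output, scaled by the slope of the sigmoid
    delta_out = (target - out) * out * (1 - out)
    # Blame for each hidden unit: output error passed back through its weight
    delta_hidden = [delta_out * w_ho[h] * hidden[h] * (1 - hidden[h])
                    for h in range(n_hidden)]
    # Delta-rule style updates for both layers of weights
    new_w_ho = [w_ho[h] + k * delta_out * hidden[h] for h in range(n_hidden)]
    new_w_ih = [[w_ih[i][h] + k * delta_hidden[h] * inputs[i]
                 for h in range(n_hidden)] for i in range(len(inputs))]
    return new_w_ih, new_w_ho


# Hypothetical starting weights for a 2-input, 2-hidden, 1-output network
w_ih = [[0.5, -0.5], [0.3, 0.8]]
w_ho = [0.4, -0.6]
inputs, target = [1, 0], 1

error_before = abs(target - forward(inputs, w_ih, w_ho)[1])
w_ih, w_ho = backprop_step(inputs, target, w_ih, w_ho)
error_after = abs(target - forward(inputs, w_ih, w_ho)[1])
print(round(error_before, 4), round(error_after, 4))
```

Training a full network repeats this step over all patterns many times; the sketch only shows that one step moves the output towards the teaching value.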
Summary: Part 3
Single layer networks
Rosenblatt’s perceptron (1958)
The first connectionist models were known as ‘perceptrons’, and learned through changes to a single layer of weights
Minsky & Papert’s (1969) criticism of perceptrons
Single layer networks cannot be set up to solve non-linearly separable problems (e.g. the XOR problem)
This is a problem because humans can solve non-linearly separable problems (e.g. M & P’s connected figures discrimination)
Summary: Part 3
A history of connectionist models
Multi-layer networks
Multi-layered networks can solve non-linearly separable problems by redescribing the input in a linearly separable way at a set of intermediary nodes called ‘hidden units’
Minsky & Papert (1969) knew that multilayered networks could provide a solution to non-linearly separable problems, but were not very optimistic about finding a way of training both layers of weights, until...
McClelland & Rumelhart (1986) came up with a learning rule which could change the weights in multiple layers of weights: ‘Back-propagation of error’ (BP)
BP works as an extension of normal ‘delta rule’ supervised learning, changing the weights arriving at the hidden units in relation to the blame assigned to these units
Outline: Part 4
Some applications
Modelling human memory
  McClelland’s (1981) Jets and Sharks model
Modelling double dissociations in acquired dyslexia
  Plaut & Shallice (1993)
Parallel distributed processing in memory
McClelland’s (1981) ‘Jets and sharks’ model of memory
A simulation of how humans might store information about people
Imagine the Jets and Sharks are two rival gangs in your town
You know a lot about the gang members:
  How old they are (20s, 30s, 40s)
  How well educated they are (Junior high, High school, College)
  What their marital status is (single, married)
  What their job is (pusher, bookie, burglar)
How do you access this information?
Jets and Sharks
A computer (or conventional database) might store the information indexed to name. The name links all the information about that person together:
Name    Education   Marital status   Gang    Job
John    HS          single           jet     burglar
Terry   JH          married          shark   pusher
Fred    JH          married          shark   burglar
Name indexing is good for answering questions like “is Fred a burglar?”
But bad at answering “who is a burglar?”
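The contrast between the two kinds of question can be seen in a name-indexed store. The sketch below holds the three records from the slide in a Python dictionary keyed by name; the field names and function names are illustrative choices. Looking up one person is a direct access, whereas finding everyone with a given job means checking each record in turn.

```python
# Records from the slide, indexed by name (field names are illustrative)
people = {
    "John": {"education": "HS", "marital_status": "single",
             "gang": "jet", "job": "burglar"},
    "Terry": {"education": "JH", "marital_status": "married",
              "gang": "shark", "job": "pusher"},
    "Fred": {"education": "JH", "marital_status": "married",
             "gang": "shark", "job": "burglar"},
}


def is_job(name, job):
    """'Is Fred a burglar?': one direct lookup by name."""
    return people[name]["job"] == job


def who_has_job(job):
    """'Who is a burglar?': every record has to be searched serially."""
    return [name for name, record in people.items() if record["job"] == job]


print(is_job("Fred", "burglar"))
print(who_has_job("burglar"))
```

A conventional database could add a second index on job, but each new kind of question would need its own index, which is the contrast with a content-addressable network.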
Jets and Sharks
McClelland set up the following connectionist database:
Links between names and professions / education / marital status etc. are made by excitatory connections (green)
Within areas of knowledge, categorical items have inhibitory connections (red) - these inhibitory connections help the network give a concrete answer (i.e. jet or shark, but not both)
Jets and Sharks
Content addressability:
By putting activation into the network at the burglar node, we get information about who is a burglar (al, jim, john, doug, lance, george) - this is known as ‘content addressability’
There is also information about what age the burglars mostly are, whether the burglars are mostly jets or sharks etc…
Jets and Sharks
Typicality effects:
McClelland’s network also nicely models an aspect of human memory called ‘typicality’:
If we ask the net to tell us the name of a pusher, it is more likely to retrieve some pushers than others (Fred & Nick, but not Ol)
This is because Ol is not a typical pusher (and does not benefit from the excitation coming from the activated typical pusher nodes)
Parallel processing in memory
McClelland’s (1981) ‘Jets and sharks’ model of memory shows how a database of information can be set up from which information about several different attributes (e.g. marital status, name, gang etc.) can be retrieved in parallel
Furthermore, more than one address in memory (e.g. several names) can be accessed at once (in parallel) by activation of an attribute (e.g. jets) - this is an aspect of human memory called ‘content addressability’, and contrasts with a memory system in which items are searched one-by-one (serially)
The memory in the network is distributed across all of the connection weights…(a distributed database)..
Connectionist models of double dissociations
Double dissociations are situations in neuropsychology in which you find that one brain damaged patient has a deficit in cognitive function A, but not B, whereas another patient has a deficit in B but not A
This has been traditionally interpreted as indicating that A and B are cognitive functions which are independent (and located in different parts of the brain)
Example: Dissociation between conditioned and expected fear
A double dissociation in acquired dyslexia
People can occasionally acquire dyslexia (reading difficulty) after brain injury
Different types of acquired dyslexia have been identified:
Difficulty reading concrete words (e.g. tack) vs. difficulty reading abstract words (e.g. tact)
While most patients show a superiority for concrete words, some demonstrate better performance with abstract words (Warrington, 1981)
This has been described as a double dissociation, and researchers have suggested separable semantic memory stores for concrete and abstract words (at different locations in the brain)
Plaut & Shallice (1993) constructed a connectionist simulation to determine whether we do need to posit two separate stores on the basis of this double dissociation…..
A double dissociation in acquired dyslexia
After training, Plaut & Shallice’s (1993) model was able to correctly read concrete and abstract words
The next step was to lesion the model in several different ways (cut some connections), to determine what deficits would occur
[Figure: network linking orthography, semantics and phonology]
Plaut & Shallice found that if you lesion several models in different locations you can model the double dissociation with a single system
A double dissociation in acquired dyslexia
Plaut & Shallice’s (1993) finding is very significant as it shows that double dissociations do not necessarily mean that there are two separate systems involved.
In this case both concrete and abstract words are represented in a distributed fashion across the entire system rather than in separate localised stores
Summary: Part 4
What connectionist models can do
Modelling human memory
McClelland’s (1981) Jets and Sharks model is a distributed memory database (rather than a serially accessed database)
It neatly models content addressability and typicality - two aspects of human memory which a serially accessed memory store cannot model
Modelling double dissociations in acquired dyslexia
Plaut & Shallice (1993) show that clinical double dissociations of ability to read concrete and abstract words can be modelled in a single route network, in which information about both is processed in parallel, and distributed across the net