Methods and Techniques in Neuroscience An introduction to neural network models Alan Pickering Autumn Term 2005

Outline: Part 1

Neural learning mechanisms
  Learning through connections
  Hebbian learning

What is connectionism (some definitions / distinctions)?
  Connectionism
  Computational modelling
  Neural net modelling

Building a simple model
  Terminology & structure
  How information passes through a model - exercise 1
  How a model learns (supervised & unsupervised learning) - exercise 2

Neural learning

Originally, it was thought that new associations/memories were formed by the growth of new nerve cells in the brain

Santiago Ramon y Cajal (1852-1934)

Proposed that learning might occur through the strengthening of existing connections between nerve cells

Donald O. Hebb (1904-1985)

Formulated Cajal’s ideas into a hypothetical biological mechanism dubbed ‘Hebbian learning’

Hebbian Learning

Learned associations through the strengthening of connections…

[Figure: diagram linking the unconditioned stimulus (US), the conditioned stimulus (CS) and the unconditioned response (UR)]

Hebbian learning:
When two connected cells fire simultaneously, the connection between them strengthens (Hebb, 1949)
Discovered at a biomolecular level by Lomo (1966) (long-term potentiation).

What is connectionism?

Starting definition:

“Connectionism is the study of how learning can occur through the strengthening and/or weakening of connections between representations of pieces of information and/or behavioural responses”

We can relate this definition straight back to classical and operant conditioning (connections strengthened between stimulus and response, or between CS and CR)

Connectionist modelling is usually concerned with more complex associations and larger numbers of connections. Computers are used to store and process all of this information

A connectionist model is a computer simulation of learning

What is connectionism?

Biological plausibility:
Some (but not all) connectionists appeal to the physical similarity of connectionist models to networks of neurons in the brain (neural net modelling)

Neurons: Nodes
Synapses: Connections (weights)

Structure and terminology of connectionist models

Input nodes (units)

Hidden nodes (units)

Output nodes (units)

Output layer

Input layer

How a connectionist model works

[Figure: input nodes j1, j2 and j3 each connect to a single output node u. w = weights, a = activation, o = output]

Node              j1     j2     j3
Activation (aj)   .8     .7     .1
Output (outj)     1      1      0
Weight (wju)      +0.8   -0.5   +0.2

What input arrives at the output node (node u)?

Iu = ∑j outj * wju

From j1: 1 * 0.8 = 0.8
From j2: 1 * -0.5 = -0.5
From j3: 0 * 0.2 = 0.0
Sum of inputs = 0.3

Activation (au) = ?
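The weighted sum on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `net_input` is a hypothetical label, not something from the lecture.

```python
def net_input(outputs, weights):
    """Weighted sum of the sending nodes' outputs: Iu = sum over j of outj * wju."""
    return sum(out_j * w_ju for out_j, w_ju in zip(outputs, weights))

# Values from the slide: outputs of j1, j2, j3 and their weights onto node u
outputs = [1, 1, 0]
weights = [0.8, -0.5, 0.2]

# Contribution of each input node, then the total arriving at node u
contributions = [out_j * w_ju for out_j, w_ju in zip(outputs, weights)]
total = net_input(outputs, weights)

print(contributions)
print(round(total, 2))
```

Note that j3 contributes nothing because its output is 0, whatever its weight.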

Exercise 1

[Figure: input nodes j1, j2 and j3 each connect to a single output node u. w = weights, a = activation, o = output]

Outputs (outj): 1 1 1

What input arrives at the output node (node u)?

Iu = ∑j outj * wju

From j1: 1 * 0.8 = ???
From j2: 1 * -0.5 = ???
From j3: 1 * 0.2 = ???
Sum of inputs = ???

Activation (au) = ?

Weights (wju): +0.8 -0.7 +0.4

Activations and Outputs 1


Activation = membrane potential of cell

Activation = 0 = resting state (-70mV)

+ve activation = cell being depolarised

-ve activation = cell being hyperpolarised (inhibited)

Cells fire (send an action potential) when sufficiently depolarised

Output = mean firing rate of cell

Output of node j is a function of its activation: outj = f(aj)


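How to convert activations into outputs?

Threshold
Gain (degree of nonlinearity)

[Figure: node u receives excitatory input (Excu) from nodes j1, j2 and j3 and inhibitory input (Inhu) from node k1.
 j nodes: activations (aj) .8 .7 .1; outputs (outj) 1 1 0; weights (wji) +0.8 +0.5 +0.2.
 k1: activation .6; output 1; weight -0.7]

[Figure: An example output function. Output (0 to 1) plotted against activation (0 to 1) for high, mid and low gain, with the threshold marked]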
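One common way to get an output function with a threshold and an adjustable gain is a logistic (sigmoid) curve. The slides do not say which function was plotted, so the logistic form, the name `output_function` and the parameter values below are all assumptions made for illustration.

```python
import math

def output_function(activation, threshold=0.5, gain=10.0):
    """Assumed logistic output function: output is 0.5 at the threshold,
    and a higher gain makes the curve steeper (more step-like)."""
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))

# Compare low, mid and high gain at three activation levels
for gain in (2.0, 10.0, 50.0):
    outs = [round(output_function(a, gain=gain), 3) for a in (0.1, 0.5, 0.9)]
    print(gain, outs)
```

With a very high gain the function approaches an all-or-none threshold; with a low gain the output changes almost linearly with activation.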


Creating an activation equation

The equation is built up one term at a time:

dau(t)/dt = ce * Excu(t)

dau(t)/dt = ce * Excu(t) * (Max - au(t))

dau(t)/dt = ce * Excu(t) * (Max - au(t)) + ci * Inhu(t) * (au(t) - Min)

dau(t)/dt = ce * Excu(t) * (Max - au(t)) + ci * Inhu(t) * (au(t) - Min) - cd * au(t)
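The full activation equation can be simulated by taking small time steps (Euler integration). The sketch below is not the simulation used for the lecture: the constants ce, ci, cd, the Max and Min values and the step size dt are assumed, and the inhibitory input is treated as a negative number (a negative weight times a positive output), which is what makes the second term pull activation down.

```python
def simulate_activation(exc, inh, steps=2000, dt=0.01,
                        ce=1.0, ci=1.0, cd=0.1, a_max=1.0, a_min=-1.0):
    """Euler integration of
    da/dt = ce*Exc*(Max - a) + ci*Inh*(a - Min) - cd*a
    with constant inputs. All parameter values are illustrative assumptions."""
    a = 0.0                      # start at resting state
    trace = []
    for _ in range(steps):
        da = ce * exc * (a_max - a) + ci * inh * (a - a_min) - cd * a
        a += dt * da             # small step in the direction of da/dt
        trace.append(a)
    return trace

# Inputs read off the example network: Exc = 1*0.8 + 1*0.5 + 0*0.2, Inh = 1*-0.7
trace = simulate_activation(exc=1.3, inh=-0.7)
print(round(trace[0], 4), round(trace[-1], 4))
```

The (Max - a) and (a - Min) factors shrink each input's effect as activation nears its ceiling or floor, so the activation settles at an intermediate value instead of growing without limit.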

Simulated Activations and Outputs

[Figure: node u (activation au) with excitatory input Excu from the j nodes and inhibitory input Inhu from node k1, as on the previous slide]

[Figure: three panels plotted against timestep (0 to 2000): activation of u node (-0.5 to 1), output of j nodes (0 to 1), output of k nodes (0 to 1)]

How does a connectionist model learn?

Following Cajal and Hebb, connectionist models learn through changes in the strength of connections:

Weight changes

There are two types of learning:

Unsupervised learning
In unsupervised learning the weight changes are made automatically and in relation to the degrees of association between incoming activations (e.g. classical conditioning -- associations strengthen through temporal contiguity)

Supervised learning
In supervised learning the weight changes are made in proportion to the error at the output. In order to calculate the error at the output a teaching pattern is required for comparison, hence “supervised” (e.g. learning to talk or spell)

Unsupervised learning: 1

The most commonly used unsupervised learning rule is the “Hebb rule” (c.f. Hebbian learning):

∆w = k au aj

∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation

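Unsupervised learning: 2

Example: Classical conditioning

∆w = k au aj

∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation

[Figure: a food node and a tone node each connect to one output node, shown at four stages]

Stage                              Input (food, tone)   Weights (food, tone)   Output
Before conditioning (food alone)   1, 0                 1.0, 0.0               1
Before conditioning (tone alone)   0, 1                 1.0, 0.0               0
During conditioning                1, 1                 1.0, 0.6               1
After conditioning (tone alone)    0, 1                 1.0, 0.6               .6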
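The conditioning example can be traced step by step in Python. This is a minimal sketch under two assumptions that go beyond the slide: the output node's activation is taken to be the plain weighted sum of its inputs, and only the tone weight is updated (the food weight stays at 1.0, as in the figure). The function names are hypothetical.

```python
def response(food, tone, w_food, w_tone):
    """Activation of the output node, assumed here to be a plain weighted sum."""
    return food * w_food + tone * w_tone

def hebb_update(weight, pre, post, k=0.6):
    """Hebb rule: delta_w = k * au * aj, added to the current weight."""
    return weight + k * post * pre

w_food, w_tone = 1.0, 0.0                          # weights before conditioning

before = response(0, 1, w_food, w_tone)            # tone alone: no response

# One conditioning trial: food and tone presented together
post = response(1, 1, w_food, w_tone)              # output node is driven by food
w_tone = hebb_update(w_tone, pre=1, post=post)     # tone weight grows by k * 1 * 1

after = response(0, 1, w_food, w_tone)             # tone alone now gives a response
print(before, w_tone, after)
```

The tone weight changes only because the tone node (pre-synaptic) and the output node (post-synaptic) are active at the same time.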

Supervised learning: 1

In supervised learning the weight changes are made in proportion to the error at the output

In order to calculate the error at the output a teaching pattern is required for comparison (hence supervised) (e.g. learning to read or spell)

The most commonly used supervised learning rule is the “delta rule” (Rosenblatt, 1966; Rumelhart & McClelland, 1986):

∆wju = k up aj

∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation

Example: Learning to read

In order to read we need to learn how orthography (how a word looks) maps onto phonology (how it sounds)

Imagine a network is learning how to pronounce ‘hit’. Say that ‘hit’ has the orthography 1 1 0 and the phonology 0 1 1 …

Supervised learning: 2

Example: Learning to read

∆wju = k up aj

∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation

‘Hit’ has:
Orthography 1 1 0
Phonology 0 1 1

[Figure: network of input nodes connected to output nodes]
Input (orthography):          1 1 0
Output produced:              0 1 0
Teaching pattern (phonology): 0 1 1

First, the error is calculated at each output

Second, the blame is apportioned (to active units contributing to the error)
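The two steps above (find the error, then apportion blame to active inputs) can be written out for the ‘hit’ example. This is an illustrative sketch only: the value k = 0.5 and the function name `delta_rule_changes` are assumptions, and the output pattern 0 1 0 is taken as given instead of being computed from weights.

```python
def delta_rule_changes(inputs, outputs, teaching, k=0.5):
    """Delta rule: change for the weight from input j to output u is
    k * (teaching_u - output_u) * input_j.
    Returns one row per input node, one column per output node."""
    errors = [t - o for t, o in zip(teaching, outputs)]
    return [[k * err * a_j for err in errors] for a_j in inputs]

inputs = [1, 1, 0]      # orthography of 'hit'
outputs = [0, 1, 0]     # what the network currently produces
teaching = [0, 1, 1]    # phonology of 'hit'

changes = delta_rule_changes(inputs, outputs, teaching)
for j, row in enumerate(changes, start=1):
    print(f"from j{j}:", row)
```

Only the weights from the active inputs (j1 and j2) to the output that was wrong (the third) receive a non-zero change; the inactive input j3 gets no blame.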

Exercise 2

Example: Learning to read

∆wju = k up aj

∆w = weight change
k = a constant
up = error at output
aj = pre-synaptic activation

‘Cat’ has:
Orthography 0 1 0
Phonology 1 1 0

[Figure: input nodes j1, j2, j3 connected to output nodes u1, u2, u3]
Input (orthography):          0 1 0
Output produced:              1 1 1
Teaching pattern (phonology): 1 1 0

Which connection/s will be altered?

Remember:
1st find the error
2nd apportion blame
3rd alter weights coming from blamed input node/s to errorful output node/s

Neural network models: Part 1 summary

Recap on neural learning mechanisms

Learning through connections: Cajal & Hebb both suggested that learning in the human brain may occur through changes in the strength of connections between neurons

Hebbian learning: Hebb formulated a mechanism by which this associative learning might occur; synchronous pre- and post-synaptic firing increases the strength of the connection

What is connectionism (some definitions / distinctions)?

Connectionism deals with the ways in which statistical structure in the environment can be learned by the strengthening and/or weakening of connections between representations of items of information

Connectionism involves computer simulations of learning

Neural net modelling: Connectionist models gain credibility because they resemble networks of neurons in the brain

Neural network models: Part 1 Summary

Building a connectionist model

Structure
They are a network of ‘nodes’ and ‘weighted connections’. Information passes from node to node through the ‘weighted connections’, usually in one direction.

Terminology
Nodes can be thought of as neurons, and weighted connections as synapses between neurons

The flow of information through the model
The weighted connections control whether or not information passes from one layer of nodes to the next (∑j outj * wju)

Unsupervised learning
As in Hebbian learning, unsupervised models use correlations of pre- and post-synaptic firing to change strengths of connections (∆wju = k au aj)

Supervised learning
Supervised learning relies on the error at the output (provided by a teaching pattern) to determine the changes in connection strength (∆wju = k up aj)

Outline: Part 2

Why connectionism? (the case for parallel distributed processing)

  Parallel processing
  Distributed processing (representations)

An example of how connectionist models can help us understand learning:

  Learning of inflectional morphology in early childhood
  Pinker & Prince (1988)
  Rumelhart & McClelland (1986)

Parallel distributed processing (PDP)

Rumelhart & McClelland (1986) made a very strong case for the use of connectionist models, highlighting the qualities of parallel and distributed processing

Parallel Processing
Parallel processing can be contrasted with serial processing

Processing several different pieces of information at the same time, rather than one after the other

Example: face processing

Distributed Processing
Distributed processing can be contrasted with localist processing

Representations of information are distributed across the whole neural network, rather than occupying specific locations

Example: Karl Lashley and the ‘Engram’ (location of memory)

Parallel processing

Example: Face processing

When looking at this face we recognise it not by looking at individual features one at a time (the eyes, the nose, the smile, the grimace), but by processing these features and their spatial configuration in parallel

[Figure: face-recognition network with eye nodes, nose nodes, smile nodes and a grimace node, and the faces of Margaret Thatcher and Tony Blair]

Distributed processing

Example: Lashley and the search for the ‘Engram’

Karl Spencer Lashley (1890-1958)

Pioneering researcher in the biological foundations of memory in the rat

Lashley and colleagues lesioned the brains of rats in order to test whether or not they had removed the part responsible for memory. They found no single locus (engram) which appeared to be solely responsible for memory

Lashley concluded that memory was distributed throughout the cortex, rather than localised in one specific place

A connectionist model of the acquisition of inflectional morphology

Inflectional morphology
The way in which we change words to convey:
  Plurality
  Past tense

Examples:
Cat + -s --> Cats (Plural)
Play + -ed --> Played (Past tense) - *we focus on this*

In English:
90% of verbs have a regular morphology
10% have an irregular morphology

Acquisition of inflectional morphology

Past tense morphology

Regular Morphology (90% verbs)
talk => talked
ram => rammed
pit => pitted

Irregular Morphology (10% verbs)
hit => hit ‘no change’
come => came ‘vowel change’
sleep => slept ‘vowel change’
go => went ‘arbitrary’

How do children learn which words require a regular ending and which are irregular?


U-shaped Development

Initially, children’s early inflections are correct

Later they start making errors:
  Hitted
  Sleeped
  Goed

Over-regularisation errors

Later still, children recover from these errors

Phase 1: Rote learning -> initial error free performance
Phase 2: Rule extraction -> over-regularisation errors
Phase 3: Rule + rote -> recovery from errors


Dual-route model (Pinker & Prince, 1988)

‘Rule’ route deals with regular verbs
Exceptions route deals with irregulars

[Figure: the input stem passes through the exceptions route and the rule route to produce the output inflection]

Errors in the middle of development occur due to overuse of the ‘rule’ system

But do we really need two routes to explain this pattern of learning?


Connectionist model (Rumelhart & McClelland, 1986)

A single route connectionist model for learning the past tense

No rule route for producing the regular ending

The network ‘learned’ to associate regular & irregular English verb stems with their past tense forms.

Wickelfeature Representation of Stem

Wickelfeature Representation of Past Tense

One single network learns to produce past tense for all the verbs it is taught


Connectionist model (Rumelhart & McClelland, 1986)

Like the children, the network:

Made over-regularisation errors

Demonstrated u-shaped development in its performance on irregular verbs

[Figure: % correct past tense (50 to 100) plotted against training epochs (0 to 200) for regular and irregular verbs, with the vocabulary discontinuity marked]


Connectionist model of the acquisition of inflectional morphology (Rumelhart & McClelland, 1986)

This model demonstrates that behaviour which looks as if it is driven by a knowledge of rules can in fact be driven by a distributed representation of statistical structure of the input (in this case a distributed representation of language)

This explanation of how it is possible that ‘rule-like behaviour’ can occur in the absence of any actual representation of the rule is an important contribution of connectionist models

Neural network models: Part 2 Summary

Why connectionism? (the case for parallel distributed processing)

Parallel processing
Connectionist models process information in parallel, rather than serially; this has intuitive appeal when we consider how we take information in

Distributed processing (representations)
Connectionist models represent learned information in the distributed connections across the network, rather than in single locations or ‘rules’; this helps explain why it is difficult to find a single location for memory in the brain

An example of how connectionist models can help us understand learning:

Rumelhart & McClelland’s (1986) model
This models the acquisition of inflectional morphology and demonstrates how networks can show ‘rule-like’ behaviour in the absence of any representation of a rule.

Outline: Part 3

A history of neural network models

Single layer networks
  Rosenblatt’s perceptron (1958)
  Minsky & Papert’s (1969) criticism of perceptrons

Multi layer networks
  McClelland & Rumelhart’s (1986) ‘Backpropagation of error’ learning rule

One-layer vs multi-layer networks

One-layer networks were the first connectionist models to emerge in the 1950s

Frank Rosenblatt’s (1958) ‘Perceptron’

In these networks learning occurs through changes in the weights of only one layer

However, networks which only change one layer of weights have some important limitations

These were pointed out by Minsky & Papert (1969)…

Minsky & Papert (1969)

Minsky & Papert (1969) made the point that single-layer networks cannot solve ‘non-linearly separable problems’

Maths Example: The XOR problem

[Figure: two input nodes connected to one output node]

Task: to learn to solve XOR

input   output   sum of inputs
1 1     0        2
1 0     1        1
0 1     1        1
0 0     0        0

If you think about the sums of the inputs, you can see why this isn’t linearly separable:

When the sum of the inputs increases to 2, the desired output goes back down to 0

Single layer networks cannot solve this kind of problem
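The point can be illustrated by searching over weights for a single threshold unit. The sketch below is an illustration and not a proof: it assumes a unit that outputs 1 when its weighted input exceeds a threshold, and it only tries weights and thresholds on a coarse grid from -2 to 2. OR is included as a linearly separable comparison task.

```python
from itertools import product

def threshold_unit(x1, x2, w1, w2, theta):
    """Single-layer unit: fires (1) if the weighted sum of inputs exceeds theta."""
    return 1 if x1 * w1 + x2 * w2 > theta else 0

def solvable(truth_table, grid):
    """True if some (w1, w2, theta) on the grid reproduces the whole truth table."""
    for w1, w2, theta in product(grid, repeat=3):
        if all(threshold_unit(x1, x2, w1, w2, theta) == target
               for (x1, x2), target in truth_table.items()):
            return True
    return False

grid = [i / 4 for i in range(-8, 9)]          # -2.0 to 2.0 in steps of 0.25

XOR = {(1, 1): 0, (1, 0): 1, (0, 1): 1, (0, 0): 0}
OR = {(1, 1): 1, (1, 0): 1, (0, 1): 1, (0, 0): 0}

print("OR solvable: ", solvable(OR, grid))
print("XOR solvable:", solvable(XOR, grid))
```

For XOR no setting works: each input on its own must push the sum over the threshold, but both together must not, and a single weighted sum cannot do both.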

Minsky & Papert (1969)

The inability to solve non-linear problems is a problem for any model of human learning, because humans can solve non-linear problems.

Psychological Example 1: Learning to eat the right amount

We all have to learn to eat enough to stay fit, but not so much as to make us sick

This is like solving the XOR problem (we can learn to eat some of the food but not all of it)

Minsky & Papert (1969)

Psychological Example 2: Connected and unconnected figures

[Figure: four figures labelled a, b, c and d]

How might a single layer network try to solve this? (1 = connected, 0 = unconnected)

All figures have three horizontal lines, so the network has to work this out on the basis of the presence of vertical lines (at particular locations)

The net might start by discriminating between two connected and unconnected figures (e.g. c and d) by locating a vertical line (e.g. in the bottom left)

But this also leads the net to discriminate between figures which we want to group together (e.g. c and b)

A solution: Multi-layered networks

Rosenblatt and Minsky & Papert agreed that this problem would be solved if you could train networks with more than one layer:

Networks with ‘hidden units’ can redescribe the input into a format that can be separated linearly

“Hidden units allow the network to treat physically similar inputs as different, as the need arises”

Multi-layered networks solve XOR

A multilayered network can solve the XOR problem:

[Figure: two input nodes, two hidden nodes and one output node. Input-to-hidden weights: +1 -1 -1 +1. Hidden-to-output weights: +1 +1]

You can set up a two layered network like the one in the figure (with appropriate weights) to solve the XOR problem

Task: to learn to solve XOR

input   hidden   output
1 1     0 0      0
1 0     1 0      1
0 1     0 1      1
0 0     0 0      0

But the real problem envisaged by Minsky & Papert was how to train the weights on a network with two layers…?
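The hand-wired network can be checked directly. This is a sketch under assumptions: the assignment of the four input-to-hidden weights to particular connections is a guess from the figure residue (each hidden node gets +1 from one input and -1 from the other), and every unit is assumed to fire when its net input is greater than 0.

```python
def step(net, threshold=0.0):
    """Assumed all-or-none output: 1 if net input exceeds the threshold."""
    return 1 if net > threshold else 0

# Assumed layout: W_IH[i][h] is the weight from input i to hidden node h
W_IH = [[+1, -1],    # input 1 -> hidden 1, hidden 2
        [-1, +1]]    # input 2 -> hidden 1, hidden 2
W_HO = [+1, +1]      # hidden 1, hidden 2 -> output

def xor_net(x1, x2):
    """Pass the inputs through the hidden layer, then to the output node."""
    hidden = [step(x1 * W_IH[0][h] + x2 * W_IH[1][h]) for h in range(2)]
    output = step(hidden[0] * W_HO[0] + hidden[1] * W_HO[1])
    return hidden, output

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    hidden, output = xor_net(x1, x2)
    print(x1, x2, hidden, output)
```

Each hidden node detects "this input on, the other off", so the hidden layer redescribes the four input patterns in a form the output node can separate with a single weighted sum.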

Training multi-layered networks

When a network learns through the delta rule, a teaching pattern is there to correct the weights leading into the output by measuring the difference between the teaching pattern and the output:

[Figure: network with input 1 1 0 and output 0 1 0, compared against the teaching pattern 0 1 1]

BUT: there is no teaching pattern for the hidden layer that could be used to change the weights from the input

The inability to find a learning rule which would change all weights in a network stifled connectionist research until a solution was found in 1986…

Back-propagation of error

McClelland & Rumelhart (1986) came up with a learning rule to solve this: ‘back-prop’

Backprop learning:
Backprop is an extension of the delta (supervised learning) rule

Error at the output is used to assign blame to particular hidden units
Then this blame is converted to error, and this is then used to calculate weight changes to the weights from input to hidden units

[Figure: network with input 1 1 0, hidden units 0 1, output 0 1 0 and teaching pattern 0 1 1]
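The blame-assignment idea can be sketched as a single backprop step on a small network. Everything specific here is an assumption for illustration: sigmoid units, a learning constant k = 0.5, two hidden units, and arbitrary starting weights. It is not the lecture's own simulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, w_ho):
    """Input -> hidden -> output, with sigmoid units (an assumed output function)."""
    n_hid, n_out = len(w_ih[0]), len(w_ho[0])
    hidden = [sigmoid(sum(x[j] * w_ih[j][h] for j in range(len(x)))) for h in range(n_hid)]
    output = [sigmoid(sum(hidden[h] * w_ho[h][u] for h in range(n_hid))) for u in range(n_out)]
    return hidden, output

def backprop_step(x, teaching, w_ih, w_ho, k=0.5):
    hidden, output = forward(x, w_ih, w_ho)
    n_hid, n_out = len(hidden), len(output)
    # 1. Error at each output unit (scaled by the slope of the sigmoid)
    err_out = [(teaching[u] - output[u]) * output[u] * (1 - output[u]) for u in range(n_out)]
    # 2. Blame for each hidden unit: output errors passed back along the weights
    blame = [sum(err_out[u] * w_ho[h][u] for u in range(n_out)) for h in range(n_hid)]
    # 3. Blame converted to an error term for each hidden unit
    err_hid = [blame[h] * hidden[h] * (1 - hidden[h]) for h in range(n_hid)]
    # 4. Delta-rule style changes for both layers of weights
    new_w_ho = [[w_ho[h][u] + k * err_out[u] * hidden[h] for u in range(n_out)] for h in range(n_hid)]
    new_w_ih = [[w_ih[j][h] + k * err_hid[h] * x[j] for h in range(n_hid)] for j in range(len(x))]
    return new_w_ih, new_w_ho

def total_error(teaching, output):
    return 0.5 * sum((t - o) ** 2 for t, o in zip(teaching, output))

x, teaching = [1, 1, 0], [0, 1, 1]
w_ih = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1]]        # arbitrary starting weights
w_ho = [[0.1, -0.2, 0.3], [-0.1, 0.2, 0.1]]

err_before = total_error(teaching, forward(x, w_ih, w_ho)[1])
new_w_ih, new_w_ho = backprop_step(x, teaching, w_ih, w_ho)
err_after = total_error(teaching, forward(x, new_w_ih, new_w_ho)[1])
print(err_before, err_after)
```

As with the delta rule, weights leaving an inactive input (here the third input) are left unchanged, because that input could not have contributed to the error.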

Summary: Part 3

Single layer networks

Rosenblatt’s perceptron (1958)
The first connectionist models were known as ‘perceptrons’, and learned through changes to a single layer of weights

Minsky & Papert’s (1969) criticism of perceptrons
Single layer networks cannot be set up to solve non-linearly separable problems (e.g. the XOR problem)

This is a problem because humans can solve non-linearly separable problems (e.g. M & P’s connected figures discrimination)

Summary: Part 3

A history of connectionist models

Multi-layer networks
Multi-layered networks can solve non-linearly separable problems by redescribing the input in a linearly separable way at a set of intermediary nodes called ‘hidden units’

Minsky & Papert (1969) knew that multilayered networks could provide a solution to non-linearly separable problems, but were not very optimistic about finding a way of training both layers of weights, until…

McClelland & Rumelhart (1986)
…came up with a learning rule which could change the weights in multiple layers of weights: ‘Back-propagation of error’ (BP)

BP works as an extension of normal ‘delta rule’ supervised learning, but changes the weights arriving at the hidden units in relation to the blame assigned to these units

Outline: Part 4

Some applications

Modelling human memory
  McClelland’s (1981) Jets and Sharks model

Modelling double dissociations in acquired dyslexia
  Plaut & Shallice (1993)

Parallel distributed processing in memory

McClelland’s (1981) ‘Jets and sharks’ model of memory

A simulation of how humans might store information about people

Imagine the Jets and Sharks are two rival gangs in your town

You know a lot about the gang members:
  How old they are (20s, 30s, 40s)
  How well educated they are (Junior high, High school, College)
  What their marital status is (single, married)
  What their job is (pusher, bookie, burglar)

How do you access this information?

Jets and Sharks

A computer (or conventional database) might store the information indexed to name. The name links all the information about that person together:

Name    Education   Marital status   Gang    Job
John    HS          single           jet     burglar
Terry   JH          married          shark   pusher
Fred    JH          married          shark   burglar

Name indexing is good for answering questions like “is Fred a burglar?”

But bad at answering “who is a burglar?”
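The difference between the two kinds of question shows up clearly in a name-indexed store. The sketch below uses a Python dictionary keyed by name with the three records from the slide; the field names and function names are hypothetical labels.

```python
# Name-indexed records (field names are illustrative assumptions)
records = {
    "John":  {"education": "HS", "marital": "single",  "gang": "jet",   "job": "burglar"},
    "Terry": {"education": "JH", "marital": "married", "gang": "shark", "job": "pusher"},
    "Fred":  {"education": "JH", "marital": "married", "gang": "shark", "job": "burglar"},
}

def is_job(name, job):
    """'Is Fred a burglar?' - one direct lookup by name."""
    return records[name]["job"] == job

def who_has_job(job):
    """'Who is a burglar?' - every record has to be checked in turn."""
    return [name for name, person in records.items() if person["job"] == job]

print(is_job("Fred", "burglar"))
print(who_has_job("burglar"))
```

The first question costs one lookup however many people are stored; the second requires a serial search through every record.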

Jets and Sharks

McClelland set up the following connectionist database:

Links between names and professions / education / marital status etc. are made by excitatory connections (green)

Within areas of knowledge, categorical items have inhibitory connections (red) - these inhibitory connections help the network give a concrete answer (i.e. jet or shark, but not both)

Jets and Sharks

Content addressability:

By putting activation into the network at the burglar node, we get information about who is a burglar (Al, Jim, John, Doug, Lance, George) - this is known as ‘content addressability’

There is also information about what age the burglars mostly are, whether the burglars are mostly jets or sharks etc…
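A much-reduced sketch of this idea: activation is put in at one attribute node, spreads along excitatory links to the people who have that attribute, and from them on to their other attributes. This is not McClelland's model. It uses only three illustrative people (John, Terry and Fred), a single pass of activation, and no inhibitory connections; the name `probe` is hypothetical.

```python
from collections import Counter

# Each person node has excitatory links to its attribute nodes (illustrative data)
PEOPLE = {
    "John":  ["HS", "single", "jet", "burglar"],
    "Terry": ["JH", "married", "shark", "pusher"],
    "Fred":  ["JH", "married", "shark", "burglar"],
}

def probe(attribute):
    """Activate one attribute node and let activation spread one step each way."""
    # Step 1: attribute -> every person linked to it
    names = [name for name, attrs in PEOPLE.items() if attribute in attrs]
    # Step 2: those people -> their other attributes (activation adds up)
    spread = Counter(a for name in names for a in PEOPLE[name] if a != attribute)
    return names, spread

names, spread = probe("burglar")
print(names)
print(dict(spread))
```

The probe returns both who the burglars are and how strongly each other attribute is supported among them, without any attribute having been used as an index.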

Jets and Sharks

Typicality effects:

McClelland’s network also nicely models an aspect of human memory called ‘typicality’:

If we ask the net to tell us the name of a pusher, it is more likely to retrieve some pushers than others (Fred & Nick, but not Ol)

This is because Ol is not a typical pusher (and does not benefit from the excitation coming from the activated typical pusher nodes)

Parallel processing in memory

McClelland’s (1981) ‘Jets and sharks’ model of memory shows how a database of information can be set up from which information about several different attributes (e.g. marital status, name, gang etc.) can be retrieved in parallel

Furthermore, more than one address in memory (e.g. several names) can be accessed at once (in parallel) by activation of an attribute (e.g. jets) - this is an aspect of human memory called ‘content addressability’, and contrasts with a memory system in which items are searched one-by-one (serially)

The memory in the network is distributed across all of the connection weights…(a distributed database)..

Connectionist models of double dissociations

Double dissociations are situations in neuropsychology in which you find that one brain damaged patient has a deficit in cognitive function A, but not B, whereas another patient has a deficit in B but not A

This has been traditionally interpreted as indicating that A and B are cognitive functions which are independent (and located in different parts of the brain)

Example: Dissociation between conditioned and expected fear

A double dissociation in acquired dyslexia

People can occasionally acquire dyslexia (reading difficulty) after brain injury

Different types of acquired dyslexia have been identified:

Difficulty reading concrete words (e.g. tack) vs. difficulty reading abstract words (e.g. tact)

While most patients show a superiority for concrete words, some demonstrate better performance with abstract words (Warrington, 1981)

This has been described as a double dissociation, and researchers have suggested separable semantic memory stores for concrete and abstract words (at different locations in the brain)

Plaut & Shallice (1993) constructed a connectionist simulation to determine whether we do need to posit two separate stores on the basis of this double dissociation…..


After training, Plaut & Shallice’s (1993) model was able to correctly read concrete and abstract words

The next step was to lesion the model in several different ways (cut some connections), to determine what deficits would occur

[Figure: network diagram with layers labelled ‘Orthography & semantics’ and ‘phonology’]

Plaut & Shallice found that if you lesion several models in different locations you can model the double dissociation with a single system


Plaut & Shallice’s (1993) finding is very significant as it shows that double dissociations do not necessarily mean that there are two separate systems involved.

In this case both concrete and abstract words are represented in a distributed fashion across the entire system rather than in separate localised stores

Summary: Part 4

What connectionist models can do

Modelling human memory
McClelland’s (1981) Jets and Sharks model is a distributed memory database (rather than a serially accessed database)

It neatly models content addressability and typicality - two aspects of human memory which a serially accessed memory store cannot model

Modelling double dissociations in acquired dyslexia
Plaut & Shallice (1993) show that clinical double dissociations of ability to read concrete and abstract words can be modelled in a single route network, in which information about both is processed in parallel, and distributed across the net
