8/18/2019 09 Artificial Neural Networks and Classification
1/43
Artificial Neural Networks and Classification

An artificial neural network is a simple brain-like device that can learn by adjusting connections between its neurons
The brain as a computer
The brain’s architecture
Human (and animal) brains have a 'computer' architecture which consists of a complex web of about 10^11 highly inter-connected processing units called neurons
Processing involves signals being sent from neuron to neuron by complicated electrochemical reactions in a highly parallel manner
The neuron

A neuron is a nerve cell consisting of
- a cell body (soma) containing a nucleus
- a number of fibres called dendrites branching out from the body
- a single long fibre called the axon, a centimetre or longer
The axon branches and connects to the dendrites of other neurons
- the connecting junction is called the synapse
- each neuron connects to between a dozen and 100,000 other neurons
A real neuron
Signal propagation

Chemical transmitter substances are released from the synapses and enter the dendrites
This raises or lowers the electrical potential of the cell body
- synapses that raise potential are called excitatory; those that lower it are inhibitory
When a threshold is reached, an electrical pulse, the action potential, is sent down the axon (firing)
This spreads into the axon's branches, reaching synapses and releasing transmitters into the cell bodies of other neurons
Brain versus computer

Storage capacity
- the brain has more neurons than a computer has bits
Speed
- the brain is much slower than a computer: a neuron has a firing speed of 10^-3 secs compared to a computer switching speed of 10^-11 secs
- the brain relies on massive parallelism for performance: you can recognise your mother in 0.1 secs
The brain is more suited to intelligence processing and learning
- it is good at forming associations; this seems to be the basis of learning
- it is more fault tolerant: neurons die all the time and computation continues
- task performance exhibits graceful degradation, in contrast to the brittleness of computers
Artificial neural networks
What is an artificial neural network?

An artificial neural network (ANN) is a grossly oversimplified version of the brain's architecture
- It has far fewer 'neurons': several hundred or thousand
- It has a much simpler internal structure
- The firing mechanism is less complex
- The signals consist of real numbers passed from one neuron to another
How does a network behave?

Most ANNs can be regarded as input-output devices
- numerical input is propagated through the network from neuron to neuron till it reaches the output
The connections between neurons have numerical weights which are used to combine the signals reaching a neuron
Learning involves establishing the weight values (strengths) to achieve a particular goal
- In theory the strengths could be programmed rather than learnt, but for the most part this would be impossibly tedious
Designing a network

Creating an ANN requires the following to be specified:
Network topology
- the number of units
- the pattern of interconnectivity amongst them
- the mathematical type of the weights
Transfer function
- this combines the inputs impinging on the unit and produces the unit activation level, which then becomes the output signal
Representation for examples
Learning law
- this states how weights are to be modified to achieve the learning goal
Network topology - neurons and layers

Specifies how many nodes (neurons) there are and how they are connected
- in a fully connected network each node is connected to every other
Often networks are organised in layers (slabs) with no connections between nodes in a layer - only across
- The first layer is the input layer; the last, the output layer
- Layers between the input and output layers are called hidden
The input units typically do not carry out internal computation, ie do not have transfer functions; they merely pass on their signal values
The output units send their signal directly to the outside world
Network topology - weights

Weights are usually real-valued
At the start of learning, their values are often set randomly
If there is a connection from a to b then a has influence over the activation value of b
Excitatory influence
- high activation in unit a contributes to high activation in unit b
- is modelled by a positive weight
Inhibitory influence
- high activation in unit a contributes to low activation in unit b
- is modelled by a negative weight
Network topology - flow of computation

Although connections are uni-directional, some networks have pairs of units connected in both directions
- there is a connection from unit a to unit b and one back from unit b to unit a
Networks in which there is no looping back of connections are called feed-forward
- signals are 'fed forward' from input through to output
Networks in which outputs are eventually fed back into the network as inputs are called recurrent
Examples of feed-forward topologies

[Figure: a single layer network with a 6-node input layer and a 2-node output layer; a two layer network with 1 hidden layer, consisting of a 4-node input layer, a 4-node hidden layer and a 1-node output layer]
The transfer function - combining input signals

The input signals to a neuron must be combined into a single value, the activation level to be output
Usually this transfer takes place in two stages
- first the inputs are combined
- and then passed through another function to produce the output
The most common method of combination is the weighted sum

  sum = w1 x1 + ... + wn xn

Here xi is the signal and wi is the weight on connection i, and n is the number of input signals
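In Python, the weighted sum can be sketched as follows (the function name is illustrative, not from the slides):

```python
# Weighted-sum combination of a neuron's input signals:
# sum = w1*x1 + ... + wn*xn
def weighted_sum(weights, signals):
    assert len(weights) == len(signals)
    return sum(w * x for w, x in zip(weights, signals))
```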
The transfer function - the activation level

The weighted sum is passed through an activation function to produce the output signal (activation level) y′
Commonly used functions are:
Linear
- The output is just the weighted sum
Linear threshold (step function)
- The weighted sum is thresholded at a value c: if it is less than c, then y′ = 0, otherwise y′ = 1
Sigmoid response (logistic) function
- a continuous version of the step function which produces graceful degradation around the 'step' at c

  y′ = 1 / (1 + e^-(sum - c))
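The three activation functions can be sketched as follows, applied to the weighted sum s with threshold c (function names are illustrative):

```python
import math

def linear(s):
    return s                      # output is just the weighted sum

def step(s, c):
    return 0 if s < c else 1      # linear threshold at c

def sigmoid(s, c):
    # logistic function: a smooth version of the step at c
    return 1.0 / (1.0 + math.exp(-(s - c)))
```

Note that the sigmoid outputs exactly 0.5 when the weighted sum equals the threshold c, and approaches 0 or 1 as the sum moves away from c.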
Activation function graphs

[Figure: graphs of the sigmoid, step and linear activation functions, each marked with the threshold c; the sigmoid and step rise from 0 to 1 around c]
Example

w1 = 0.3
w…
Learning with ANNs
What tasks can a network learn?

Networks can be trained for the following tasks:
- Classification
- Pattern association
  - eg English verbs mapped to their past tense
- Content addressable/associative memory
  - eg can recall/restore a whole image when provided with a part of it
These all involve mappings
- The mapping of input to output is determined by the settings of all the weights in the network (the weight vector) - this is what is learnt
- The network node configuration together with the weight vector is the knowledge structure
Learning laws

Learning provides a means of finding the weight settings to implement a mapping
- This is only possible if the network is capable of representing the mapping
- The more complex the mapping, the larger the network that will be required, including a greater number of hidden layers
Initially, weights are set at random and altered in response to the training data
A regime for weight alteration to achieve the required mapping is called a learning law
Even if a network can represent a mapping, a particular learning law may not be able to learn it
Representation of training examples

Unlike decision trees, which handle both discrete and continuous (numeric) attributes, ANNs can handle only the latter
All discrete attributes must be converted (encoded) to be numeric
- This also applies to the class
Several ways are available and the choice affects the success of learning
Description attributes

It is desirable for all attributes to have values in the same range
- This is usually taken to be 0 to 1
- Achieved for numeric attributes using normalisation:
  value → (value − min value) / (max value − min value)
For discrete attributes can use:
- 1-out-of-N encoding (distributed)
  - N binary (0-1) units used to represent the N values of the attribute, one for each
- local encoding
  - values mapped to numbers in range 0 to 1
  - more suited to ordered values
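The three encodings above can be sketched as follows (function names are illustrative; local encoding is assumed to space the ordered values evenly over 0 to 1):

```python
def normalise(value, min_v, max_v):
    # value -> (value - min) / (max - min), mapping the range to [0, 1]
    return (value - min_v) / (max_v - min_v)

def one_out_of_n(value, values):
    # distributed 1-out-of-N encoding: one binary unit per possible value
    return [1 if v == value else 0 for v in values]

def local_encoding(value, ordered_values):
    # map the i-th of N ordered values to i/(N-1), a number in [0, 1]
    i = ordered_values.index(value)
    return i / (len(ordered_values) - 1)
```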
Class attribute
1-out-of-N or local encoding can be used for the class
The network output after learning is usually only approximate
- eg in a binary class problem with classes represented by 0 and 1, the network might output 0.8 and this would be taken as '1'
Using 1-out-of-N encoding allows for a probabilistic interpretation, eg
- classes for car domain: unacc, acc, good, vgood
- can be represented with four binary units
- eg acc → (0, 1, 0, 0)
- Output of (0…
Network configuration

Encoding of training examples affects network size
Input layer will have:
- one unit for each numeric attribute
- one for each locally encoded discrete attribute
- 1 for each binary discrete attribute
- k for each distributed encoding of a discrete attribute, where the attribute has k > 2 values
Usually have a small number of hidden layers (one or two)
Pyramid structure

Hidden layers are used to reduce the dimensionality of the input
A network has a pyramid structure if:
- the first hidden layer has fewer nodes than the input layer
- each hidden layer has fewer than its predecessor
- the output layer has fewest
The pyramid structure facilitates learning
- In classification, each hidden layer appears to partially classify the examples until the actual classes are reached in the output layer
The learning process

Classification learning uses a feedback mechanism
An example is fed through the network using the existing weights
The output value is O; the correct output value, ie the class in the example, is T (target)
If O ≠ T, some or all of the weights are changed slightly
The extent of the change usually depends on T − O, called the error
The delta rule
A weight, wi, on a connection carrying signal, xi, can be modified by adding an amount Δwi proportional to the error:

  Δwi = η (T − O) xi

where η is the learning rate
- η is a positive constant, usually set at about 0.1 and gradually decreased during learning
The update formula for wi is then:

  wi ← wi + Δwi
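The delta rule update can be sketched as follows (the function name is illustrative):

```python
# Delta rule: w_i <- w_i + eta * (T - O) * x_i, applied to every weight
def delta_rule_update(weights, signals, target, output, eta=0.1):
    error = target - output          # T - O
    return [w + eta * error * x for w, x in zip(weights, signals)]
```

Note that a weight only moves when its input signal is non-zero, and the direction of the move is set by the sign of the error.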
Trainin$ epochs
For each example in the training set:
- the description attribute values are fed as input to the network and propagated through to the output
- each weight is updated
This constitutes one epoch or cycle of learning
The process is repeated till it is decided to stop
- Many thousands of epochs may be necessary
The final set of weights represents the learned mapping
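An epoch-based training loop for a single-layer, step-activation network using the delta rule can be sketched as follows (a sketch; names and the fixed epoch count are illustrative):

```python
def train_epochs(examples, weights, threshold=0.5, eta=0.1, epochs=100):
    for _ in range(epochs):                      # one pass = one epoch/cycle
        for signals, target in examples:
            s = sum(w * x for w, x in zip(weights, signals))
            output = 1 if s >= threshold else 0  # step activation
            error = target - output              # T - O
            weights = [w + eta * error * x for w, x in zip(weights, signals)]
    return weights
```

For instance, starting from zero weights this learns the (linearly separable) boolean OR mapping within a few epochs.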
Worked example - golf domain

Conversion of attributes:

Attribute      Values
Outlook        sunny, overcast, rain
Temperature    −50 to 100 °F
Humidity       low, normal, high
Windy          true, false
Class          yes, no

Encoded attributes:

Attribute      Encoding
Outlook        sunny → (1, 0, 0); overcast → (0, 1, 0); rain → (0, 0, 1)
Temperature    0 to 1: T ← (T + 50) / 150
Humidity       low → (1, 0, 0); normal → (0, 1, 0); high → (0, 0, 1)
Windy          true → 1; false → 0
Play golf      yes → 1; no → 0
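The conversion above can be sketched as a single encoding function (names are illustrative, and the −50 to 100 °F range is taken from the table above):

```python
OUTLOOKS = ['sunny', 'overcast', 'rain']
HUMIDITIES = ['low', 'normal', 'high']

def encode_golf_example(outlook, temp_f, humidity, windy):
    vec = [1 if v == outlook else 0 for v in OUTLOOKS]      # 1-out-of-3
    vec.append((temp_f + 50) / 150)                         # normalise -50..100 F
    vec += [1 if v == humidity else 0 for v in HUMIDITIES]  # 1-out-of-3
    vec.append(1 if windy else 0)                           # binary attribute
    return vec
```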
Network configuration

Use a single-layer network (no hidden units) with a step function to illustrate the delta rule
Initialise weights as shown
Set η = 0.1

[Figure: eight input units (Sunny, Overcast, Rain, Temperature, Low, Normal, High, Windy) plus a bias input fixed at −1, connected by weights w0 … w8 to a single output unit]

w0 = 0.2 (bias)   w1 = −0.5   w2 = 0.3
w3 = −0.4         w4 = 0.2    w5 = 0.1
w6 = −0.1         w7 = −0.3   w8 = 0.4
Feeding a training example

First example is (sunny, …
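A forward pass through this single-layer network can be sketched as follows. The bias weight is paired with a constant input of −1 as in the diagram; the weight values below are reconstructed from a damaged source and should be treated as illustrative:

```python
# One forward pass through the single-layer step network;
# weights[0] is the bias weight, paired with a constant input of -1.
def feed(weights, inputs, threshold=0.0):
    s = weights[0] * -1 + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if s >= threshold else 0

weights = [0.2, -0.5, 0.3, -0.4, 0.2, 0.1, -0.1, -0.3, 0.4]  # w0 (bias) .. w8
sunny_example = [1, 0, 0, 0.5, 0, 0, 1, 0]  # (sunny, 25 F, high humidity, not windy)
```

If the output disagrees with the example's class, every weight is then nudged by the delta rule before the next example is fed in.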
The backpropagation algorithm
Learning in multi-layered networks

Networks with one or more hidden layers are necessary to represent complex mappings
In such a network the basic delta learning law is insufficient
- It only defines how to update weights in output units (uses T − O)
To update hidden node weights, we have to define their error
- This is achieved by the backpropagation algorithm
The backpropagation process

Inputs are fed through the network in the usual way
- this is the forward pass
Output layer weights are adjusted based on errors
- … then weights in the previous layer are adjusted …
- … and so on back to the first layer
- this is the backwards pass (or backpropagation)
Errors determined in a layer are used to determine those in the previous layer
Illustrating the error contribution

A hidden node is partially 'credited' for errors in the next layer
- these errors are created in the forward pass

[Figure: a hidden node connected by weights w1 … wk to k nodes in the next layer, which have errors error1 … errork]

error contribution = w1 · error1 + … + wk · errork
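The error contribution above can be sketched directly (the function name is illustrative):

```python
# A hidden node's error contribution from the k nodes it feeds into:
# contribution = w1*error1 + ... + wk*errork
def hidden_error_contribution(outgoing_weights, next_layer_errors):
    return sum(w * e for w, e in zip(outgoing_weights, next_layer_errors))
```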
The backpropagation algorithm

A backpropagation network is a multi-layered feed-forward network using the sigmoid response activation function

Backpropagation algorithm:
1. Initialise all network weights to small random numbers (e.g. between −0.05 and 0.05)
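The full loop can be sketched for one hidden layer and a single sigmoid output unit. This is a minimal sketch assuming the standard pattern (forward pass, output error term, backpropagated hidden error terms, weight updates); all names are illustrative:

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_backprop(examples, n_in, n_hidden, eta=0.5, epochs=3000, seed=1):
    rng = random.Random(seed)
    # 1. initialise all weights to small random numbers
    w_h = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:
            # forward pass (each unit gets an extra constant bias input of 1)
            xb = x + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_h]
            hb = h + [1.0]
            o = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            # backward pass: output error term, then hidden error terms
            d_o = o * (1 - o) * (t - o)
            d_h = [h[j] * (1 - h[j]) * w_o[j] * d_o for j in range(n_hidden)]
            # weight updates
            w_o = [w + eta * d_o * v for w, v in zip(w_o, hb)]
            w_h = [[w + eta * d_h[j] * v for w, v in zip(w_h[j], xb)]
                   for j in range(n_hidden)]
    return w_h, w_o

def predict(w_h, w_o, x):
    xb = x + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_h] + [1.0]
    return sigmoid(sum(w * v for w, v in zip(w_o, hb)))
```

After training on a simple mapping such as boolean AND, the output for positive examples ends up clearly higher than for negative ones.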
Termination conditions
Many thousands of iterations (epochs or cycles) may be necessary to learn a classification mapping
- The more complex the mapping to be learnt, the more cycles will be required
Several termination conditions are used:
- stop after a given number of epochs
- stop when the error on the training examples (or on a separate validation set) falls below some agreed level
Stopping too soon results in underfitting, too late in overfitting
Backpropagation as a search

Learning is a search for a network weight vector to implement the required mapping
The search is hill-climbing, or rather descending, called steepest gradient descent
- The heuristic used is the total of the (T − O)² errors
Problems with the search

The size of step is controlled by the learning rate parameter
- This must be tuned for individual problems
- If the step is too large, search becomes inefficient
The error surface tends to have extensive flat areas
- troughs with very little slope
- It can be difficult to reduce error in such regions
- Weights have to move large distances and it can be hard to determine the right direction
- High numerical accuracy is required
The trained network
After learning, backpropagation may be used as a classifier
- Descriptions of new examples are fed into the network and the class is read from the output layer
- For 1-out-of-N output representations, exact values of 0 and 1 will not usually be obtained
Sensitivity analysis (using test data) determines which attributes are most important for classification
- An attribute is regarded as important if small changes in its value affect the classification
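Sensitivity analysis can be sketched as a simple perturbation test: nudge one input at a time and measure how much the network output moves (names and the perturbation size are illustrative):

```python
# Score each attribute by how much a small change in its value
# changes the network's output, relative to the size of the change.
def sensitivity(predict_fn, example, delta=0.01):
    base = predict_fn(example)
    scores = []
    for i in range(len(example)):
        perturbed = list(example)
        perturbed[i] += delta        # nudge attribute i only
        scores.append(abs(predict_fn(perturbed) - base) / delta)
    return scores
```

Attributes with high scores are the ones the trained network is most sensitive to, and hence the ones regarded as important.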
Backpropagation versus ID3

These two algorithms are the giants of classification learning
Which is better?
- the jury is still out
There are major differences:
- ID3 favours discrete attributes, Backprop favours continuous (but each handles both types)
- Backprop handles noise well. By using pruning, so does ID3
- Backprop is much slower than ID3 and may get stuck
- ID3 tells us which attributes are important. Backprop does this (to some extent) with sensitivity analysis
- Backprop's learned knowledge structure (weight vector) is not understandable, whereas an ID3 tree can be comprehended (although this is difficult if the tree is large)