7/16/2015 Neural networks and deep learning
http://neuralnetworksanddeeplearning.com/chap5.html
CHAPTER 5

Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep.

You're dumbfounded, and tell your boss: "The customer is crazy!"

Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.
So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
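The pairwise parity construction described above is easy to sketch in code (an illustrative sketch, not from the book): each round XORs adjacent pairs of bits, so the "circuit" has depth logarithmic in the number of inputs, while the result matches the flat XOR of all the bits.

```python
from functools import reduce

def parity_shallow(bits):
    """Parity as one flat XOR over all the bits (the 'shallow' view)."""
    return reduce(lambda x, y: x ^ y, bits)

def parity_deep(bits):
    """Parity by repeatedly XOR-ing pairs, as in the text: each round
    halves the number of values, giving logarithmic circuit depth."""
    values = list(bits)
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(values[i] ^ values[i + 1])
        if len(values) % 2 == 1:  # odd leftover carries into the next round
            paired.append(values[-1])
        values = paired
    return values[0]
```

Both functions compute the same Boolean function; the difference is purely in how the work is layered, which is exactly the point of the circuit-depth argument.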
By Michael Nielsen / Jul 2015
*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers).

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful.

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.
How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm: stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.
The vanishing gradient problem

So, what goes wrong when we try to train a deep network?
*For certain problems and network architectures this is proved in On the number of response regions of deep feedforward networks with piecewise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).
To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.
Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])
This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', …, '9').
Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.
*I introduced the MNIST problem and data here and here.
*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.
Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.

This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.
To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*See this later problem to understand how to build a hidden layer that does nothing.

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.
The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?
To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ^1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ^2‖ measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.
With these definitions, and in the same configuration as was plotted above, we find ‖δ^1‖ = 0.07… and ‖δ^2‖ = 0.31…. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.
What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
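As a rough illustration (not from the book) of using vector lengths as speed-of-learning measures, here is a minimal numpy sketch; the gradient vectors delta1 and delta2 are hypothetical stand-ins, constructed to have lengths 0.07 and 0.31 rather than coming from a trained network.

```python
import numpy as np

def layer_speeds(bias_gradients):
    """Given per-layer bias-gradient vectors (the delta^l of the text),
    return the length of each as a rough global speed of learning."""
    return [float(np.linalg.norm(delta)) for delta in bias_gradients]

# Hypothetical gradients for two 30-neuron hidden layers; every entry is
# equal, so the vector length is entry * sqrt(30):
delta1 = np.full(30, 0.07 / np.sqrt(30))  # small entries: slow layer
delta2 = np.full(30, 0.31 / np.sqrt(30))  # larger entries: fast layer
speeds = layer_speeds([delta1, delta2])
```

In a real experiment the delta vectors would come from backpropagation; the point here is only that comparing `np.linalg.norm` across layers gives a single number per layer to compare.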
We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:
To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.

In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.
What about more complex networks? Here's the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:
Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.
*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).
One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?
Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.
What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:
Here, w_1, w_2, … are the weights, b_1, b_2, … are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j−1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.
We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.
I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):

∂C/∂b_1 = σ′(z_1) × w_2 σ′(z_2) × w_3 σ′(z_3) × w_4 σ′(z_4) × ∂C/∂a_4

The structure in the expression is as follows: there is a σ′(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.
You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.
Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

∂C/∂b_1 ≈ ΔC / Δb_1

This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.
To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

Δa_1 ≈ (∂σ(w_1 a_0 + b_1) / ∂b_1) Δb_1 = σ′(z_1) Δb_1
That σ′(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

Δz_2 ≈ (∂z_2 / ∂a_1) Δa_1 = w_2 Δa_1
Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

Δz_2 ≈ σ′(z_1) w_2 Δb_1
Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.
We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ′(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

ΔC ≈ σ′(z_1) w_2 σ′(z_2) … σ′(z_4) (∂C/∂a_4) Δb_1
Dividing by Δb_1 we do indeed get the desired expression for the gradient:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) … σ′(z_4) ∂C/∂a_4
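The product formula can be checked numerically. Below is a small sketch (not from the book) of the one-neuron-per-layer chain; it assumes a quadratic cost C = (a_4 − y)²/2, so that ∂C/∂a_4 = a_4 − y, and compares the product expression against a finite-difference estimate of ∂C/∂b_1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def forward(ws, bs, a0):
    """Feed a0 through the one-neuron-per-layer chain; return the
    weighted inputs z_j and the activations a_j."""
    zs, activations = [], [a0]
    a = a0
    for w, b in zip(ws, bs):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    return zs, activations

def chain_gradient(ws, bs, a0, y):
    """dC/db_1 from the product formula, assuming the quadratic cost
    C = (a_4 - y)^2 / 2, so that dC/da_4 = a_4 - y."""
    zs, acts = forward(ws, bs, a0)
    grad = sigmoid_prime(zs[0])
    for j in range(1, len(ws)):
        grad *= ws[j] * sigmoid_prime(zs[j])
    return grad * (acts[-1] - y)

def numerical_gradient(ws, bs, a0, y, eps=1e-6):
    """Central finite difference of the cost with respect to b_1."""
    def cost(b1):
        _, acts = forward(ws, [b1] + bs[1:], a0)
        return 0.5 * (acts[-1] - y) ** 2
    return (cost(bs[0] + eps) - cost(bs[0] - eps)) / (2 * eps)
```

The weights, biases, input, and target in any test of this sketch are arbitrary illustrative values; with each σ′ factor at most 1/4, the resulting gradient is already tiny even in a four-layer chain.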
Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
Excepting the very last term, this expression is a product of terms of the form w_j σ′(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ′:
[Figure: plot of the derivative of the sigmoid function, σ′(z), for z from −4 to 4. The curve peaks at a value of 0.25 at z = 0 and falls toward 0 in both directions.]
The derivative reaches a maximum at σ′(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ′(z_j) will usually satisfy |w_j σ′(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
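This shrinking-product behaviour is easy to see numerically. The sketch below (illustrative, not from the book) evaluates |w σ′(0)|^n, the most favourable case for the gradient since σ′ is at its maximum there; even so, with |w| < 1 the product decays exponentially in the number of terms.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The derivative of the sigmoid peaks at sigma'(0) = 1/4:
peak = sigmoid_prime(0.0)

def product_magnitude(w, n_terms):
    """|w * sigma'(0)| ** n_terms: even evaluating every sigma' at its
    maximum value of 1/4, a product of many terms with |w| < 1 shrinks
    exponentially with depth."""
    return abs(w * sigmoid_prime(0.0)) ** n_terms
```

For example, with w = 0.9 (a typical magnitude under Gaussian initialization), ten such terms already multiply out to well under one millionth.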
To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
∂C/∂b_3 = σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ′(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.
Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ′(z_j) in the product will no longer satisfy |w_j σ′(z_j)| < 1/4. Indeed, if the terms get large enough (greater than 1) then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.
The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.
There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ′(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ′(z_j) = 1/4). So, for instance, we want z_2 = w_2 a_1 + b_2 = 0. We can achieve this by setting b_2 = −100 a_1. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ′(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
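These numbers are easy to verify in a few lines (an illustrative sketch, not from the book): with every weight equal to 100 and each bias chosen as b_j = −100 a_{j−1}, every weighted input is exactly zero, so every term w σ′(z) comes out to exactly 25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w = 100.0
a = 0.5                   # input activation a_0 (any value works)
terms = []
for _ in range(4):        # four layers in the toy chain
    b = -w * a            # bias chosen so the weighted input is zero
    z = w * a + b         # z = 0 exactly, hence sigma'(z) = 1/4
    terms.append(w * sigmoid_prime(z))
    a = sigmoid(z)        # = 1/2, fed into the next layer
product = float(np.prod(terms))
```

Each term is 25, so the product over four layers is 25^4 = 390625: the gradient grows, rather than shrinks, as we move backward through the layers.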
The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.
Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ′(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?
The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ′(z)|. To avoid the vanishing gradient problem we need |w σ′(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ′(z) term also depends on w: σ′(z) = σ′(wa + b), where a is the input activation. So when we make w large we need to be careful that we're not simultaneously making σ′(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ′ you can see that this puts us off in the "wings" of the function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.
Problems

Consider the product |w σ′(wa + b)|. Suppose |w σ′(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w| (1 + √(1 − 4/|w|)) / 2 − 1 )

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.
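Part (3) can be checked with a quick numerical scan (a sketch, not from the book); `width_bound` below implements the expression from part (2), which is real-valued for |w| ≥ 4.

```python
import numpy as np

def width_bound(w):
    """Width of the interval of input activations a over which
    |w sigma'(wa + b)| >= 1 can hold, per the bound in part (2).
    Valid for |w| >= 4, where the square root is real."""
    w = np.asarray(w, dtype=float)
    return (2.0 / w) * np.log(w * (1.0 + np.sqrt(1.0 - 4.0 / w)) / 2.0 - 1.0)

# Scan |w| over a grid just above 4 and find where the bound peaks:
ws = np.linspace(4.001, 30.0, 200000)
vals = width_bound(ws)
best_w = float(ws[np.argmax(vals)])
peak_width = float(vals.max())
```

The scan confirms the claim: the bound peaks around |w| ≈ 6.9, where the admissible range of input activations is only about 0.45 wide.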
Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.
Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?
In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

δ^l = Σ′(z^l) (w^{l+1})^T Σ′(z^{l+1}) (w^{l+2})^T … Σ′(z^L) ∇_a C

Here, Σ′(z^l) is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ′(z^j). What's more, the matrices Σ′(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ′(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
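A small simulation (illustrative, not from the book) shows the same shrinking effect in the matrix form. Here the diagonal σ′ matrices have randomly chosen entries in [0, 1/4] rather than values from a trained network, and the weight matrices are modest random Gaussians; the recursion applied is δ^l = Σ′(z^l) (w^{l+1})^T δ^{l+1}.

```python
import numpy as np

n_layers, n = 5, 30
rng = np.random.default_rng(1)

# Diagonal sigma' matrices: no entry can exceed 1/4.
sigma_primes = [np.diag(rng.uniform(0.0, 0.25, size=n))
                for _ in range(n_layers)]
# Weight matrices with entries of scale 1/sqrt(n), so they are not large.
weights = [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
           for _ in range(n_layers)]

# Start from a gradient vector at the last layer and apply
# delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1} layer by layer:
delta = rng.normal(size=n)
norms = [float(np.linalg.norm(delta))]
for S, w in zip(reversed(sigma_primes), reversed(weights)):
    delta = S @ (w.T @ delta)
    norms.append(float(np.linalg.norm(delta)))
```

Printing `norms` shows the gradient length collapsing by orders of magnitude as we move toward earlier layers: each (w^j)^T step roughly preserves length at this scale, while each Σ′ step shrinks it by a factor of 1/4 or more.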
Other obstacles to deep learning

In this chapter we've focused on vanishing gradients and, more generally, unstable gradients as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important, fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.
As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.
As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.
These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.

Last update: Fri Jul 10 08:53:05 2015