7/16/2015 Neural networks and deep learning
http://neuralnetworksanddeeplearning.com/chap5.html
CHAPTER 5

Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep.

You're dumbfounded, and tell your boss: "The customer is crazy!"

Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.
So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
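The pairwise parity construction described above is easy to sketch in code (an illustrative sketch, not from the book): each round XORs adjacent pairs of bits, so the "circuit" has depth logarithmic in the number of inputs, while the result matches the flat XOR of all the bits.

```python
from functools import reduce

def parity_shallow(bits):
    """Parity as one flat XOR over all the bits (the 'shallow' view)."""
    return reduce(lambda x, y: x ^ y, bits)

def parity_deep(bits):
    """Parity by repeatedly XOR-ing pairs, as in the text: each round
    halves the number of values, giving logarithmic circuit depth."""
    values = list(bits)
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(values[i] ^ values[i + 1])
        if len(values) % 2 == 1:  # odd leftover carries into the next round
            paired.append(values[-1])
        values = paired
    return values[0]
```

Both functions compute the same Boolean function; the difference is purely in how the work is layered, which is exactly the point of the circuit-depth argument.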
By Michael Nielsen / Jul 2015
*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers).

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful.

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.
How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm: stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.
The vanishing gradient problem

So, what goes wrong when we try to train a deep network?
*For certain problems and network architectures this is proved in On the number of response regions of deep feedforward networks with piecewise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).
To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.
Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])
This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', …, '9').
Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.
*I introduced the MNIST problem and data here and here.
*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.
Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.

This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.
To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*See this later problem to understand how to build a hidden layer that does nothing.

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.
The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?
To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ^1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ^2‖ measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.
With these definitions, and in the same configuration as was plotted above, we find ‖δ^1‖ = 0.07… and ‖δ^2‖ = 0.31…. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.
What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
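As a rough illustration (not from the book) of using vector lengths as speed-of-learning measures, here is a minimal numpy sketch; the gradient vectors delta1 and delta2 are hypothetical stand-ins, constructed to have lengths 0.07 and 0.31 rather than coming from a trained network.

```python
import numpy as np

def layer_speeds(bias_gradients):
    """Given per-layer bias-gradient vectors (the delta^l of the text),
    return the length of each as a rough global speed of learning."""
    return [float(np.linalg.norm(delta)) for delta in bias_gradients]

# Hypothetical gradients for two 30-neuron hidden layers; every entry is
# equal, so the vector length is entry * sqrt(30):
delta1 = np.full(30, 0.07 / np.sqrt(30))  # small entries: slow layer
delta2 = np.full(30, 0.31 / np.sqrt(30))  # larger entries: fast layer
speeds = layer_speeds([delta1, delta2])
```

In a real experiment the delta vectors would come from backpropagation; the point here is only that comparing `np.linalg.norm` across layers gives a single number per layer to compare.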
We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:
To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.

In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.
What about more complex networks? Here's the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:
Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.
*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).
One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?
Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.
What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:
Here, w_1, w_2, … are the weights, b_1, b_2, … are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j−1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.
We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.
I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):

∂C/∂b_1 = σ′(z_1) × w_2 σ′(z_2) × w_3 σ′(z_3) × w_4 σ′(z_4) × ∂C/∂a_4

The structure in the expression is as follows: there is a σ′(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.
You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.
Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

∂C/∂b_1 ≈ ΔC / Δb_1

This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.
To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

Δa_1 ≈ (∂σ(w_1 a_0 + b_1) / ∂b_1) Δb_1 = σ′(z_1) Δb_1
That σ′(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

Δz_2 ≈ (∂z_2 / ∂a_1) Δa_1 = w_2 Δa_1
Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

Δz_2 ≈ σ′(z_1) w_2 Δb_1
Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.
We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ′(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

ΔC ≈ σ′(z_1) w_2 σ′(z_2) … σ′(z_4) (∂C/∂a_4) Δb_1
Dividing by Δb_1 we do indeed get the desired expression for the gradient:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) … σ′(z_4) ∂C/∂a_4
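The product formula can be checked numerically. Below is a small sketch (not from the book) of the one-neuron-per-layer chain; it assumes a quadratic cost C = (a_4 − y)²/2, so that ∂C/∂a_4 = a_4 − y, and compares the product expression against a finite-difference estimate of ∂C/∂b_1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def forward(ws, bs, a0):
    """Feed a0 through the one-neuron-per-layer chain; return the
    weighted inputs z_j and the activations a_j."""
    zs, activations = [], [a0]
    a = a0
    for w, b in zip(ws, bs):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    return zs, activations

def chain_gradient(ws, bs, a0, y):
    """dC/db_1 from the product formula, assuming the quadratic cost
    C = (a_4 - y)^2 / 2, so that dC/da_4 = a_4 - y."""
    zs, acts = forward(ws, bs, a0)
    grad = sigmoid_prime(zs[0])
    for j in range(1, len(ws)):
        grad *= ws[j] * sigmoid_prime(zs[j])
    return grad * (acts[-1] - y)

def numerical_gradient(ws, bs, a0, y, eps=1e-6):
    """Central finite difference of the cost with respect to b_1."""
    def cost(b1):
        _, acts = forward(ws, [b1] + bs[1:], a0)
        return 0.5 * (acts[-1] - y) ** 2
    return (cost(bs[0] + eps) - cost(bs[0] - eps)) / (2 * eps)
```

The weights, biases, input, and target in any test of this sketch are arbitrary illustrative values; with each σ′ factor at most 1/4, the resulting gradient is already tiny even in a four-layer chain.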
Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
Excepting the very last term, this expression is a product of terms of the form w_j σ′(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ′:
[Figure: plot of the derivative of the sigmoid function, σ′(z), for z from −4 to 4. The curve peaks at a value of 0.25 at z = 0 and falls toward 0 in both directions.]
The derivative reaches a maximum at σ′(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ′(z_j) will usually satisfy |w_j σ′(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
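This shrinking-product behaviour is easy to see numerically. The sketch below (illustrative, not from the book) evaluates |w σ′(0)|^n, the most favourable case for the gradient since σ′ is at its maximum there; even so, with |w| < 1 the product decays exponentially in the number of terms.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The derivative of the sigmoid peaks at sigma'(0) = 1/4:
peak = sigmoid_prime(0.0)

def product_magnitude(w, n_terms):
    """|w * sigma'(0)| ** n_terms: even evaluating every sigma' at its
    maximum value of 1/4, a product of many terms with |w| < 1 shrinks
    exponentially with depth."""
    return abs(w * sigmoid_prime(0.0)) ** n_terms
```

For example, with w = 0.9 (a typical magnitude under Gaussian initialization), ten such terms already multiply out to well under one millionth.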
To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
∂C/∂b_3 = σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ′(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.
Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ′(z_j) in the product will no longer satisfy |w_j σ′(z_j)| < 1/4. Indeed, if the terms get large enough (greater than 1) then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.
The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.
There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ′(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ′(z_j) = 1/4). So, for instance, we want z_2 = w_2 a_1 + b_2 = 0. We can achieve this by setting b_2 = −100 a_1. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ′(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
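These numbers are easy to verify in a few lines (an illustrative sketch, not from the book): with every weight equal to 100 and each bias chosen as b_j = −100 a_{j−1}, every weighted input is exactly zero, so every term w σ′(z) comes out to exactly 25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w = 100.0
a = 0.5                   # input activation a_0 (any value works)
terms = []
for _ in range(4):        # four layers in the toy chain
    b = -w * a            # bias chosen so the weighted input is zero
    z = w * a + b         # z = 0 exactly, hence sigma'(z) = 1/4
    terms.append(w * sigmoid_prime(z))
    a = sigmoid(z)        # = 1/2, fed into the next layer
product = float(np.prod(terms))
```

Each term is 25, so the product over four layers is 25^4 = 390625: the gradient grows, rather than shrinks, as we move backward through the layers.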
The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.
Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ′(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?
The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ′(z)|. To avoid the vanishing gradient problem we need |w σ′(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ′(z) term also depends on w: σ′(z) = σ′(wa + b), where a is the input activation. So when we make w large we need to be careful that we're not simultaneously making σ′(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ′ you can see that this puts us off in the "wings" of the function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.
Problems

Consider the product |w σ′(wa + b)|. Suppose |w σ′(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w| (1 + √(1 − 4/|w|)) / 2 − 1 )

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.
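Part (3) can be checked with a quick numerical scan (a sketch, not from the book); `width_bound` below implements the expression from part (2), which is real-valued for |w| ≥ 4.

```python
import numpy as np

def width_bound(w):
    """Width of the interval of input activations a over which
    |w sigma'(wa + b)| >= 1 can hold, per the bound in part (2).
    Valid for |w| >= 4, where the square root is real."""
    w = np.asarray(w, dtype=float)
    return (2.0 / w) * np.log(w * (1.0 + np.sqrt(1.0 - 4.0 / w)) / 2.0 - 1.0)

# Scan |w| over a grid just above 4 and find where the bound peaks:
ws = np.linspace(4.001, 30.0, 200000)
vals = width_bound(ws)
best_w = float(ws[np.argmax(vals)])
peak_width = float(vals.max())
```

The scan confirms the claim: the bound peaks around |w| ≈ 6.9, where the admissible range of input activations is only about 0.45 wide.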
Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.
Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?
In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

δ^l = Σ′(z^l) (w^{l+1})^T Σ′(z^{l+1}) (w^{l+2})^T … Σ′(z^L) ∇_a C

Here, Σ′(z^l) is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ′(z^j). What's more, the matrices Σ′(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ′(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
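A small simulation (illustrative, not from the book) shows the same shrinking effect in the matrix form. Here the diagonal σ′ matrices have randomly chosen entries in [0, 1/4] rather than values from a trained network, and the weight matrices are modest random Gaussians; the recursion applied is δ^l = Σ′(z^l) (w^{l+1})^T δ^{l+1}.

```python
import numpy as np

n_layers, n = 5, 30
rng = np.random.default_rng(1)

# Diagonal sigma' matrices: no entry can exceed 1/4.
sigma_primes = [np.diag(rng.uniform(0.0, 0.25, size=n))
                for _ in range(n_layers)]
# Weight matrices with entries of scale 1/sqrt(n), so they are not large.
weights = [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
           for _ in range(n_layers)]

# Start from a gradient vector at the last layer and apply
# delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1} layer by layer:
delta = rng.normal(size=n)
norms = [float(np.linalg.norm(delta))]
for S, w in zip(reversed(sigma_primes), reversed(weights)):
    delta = S @ (w.T @ delta)
    norms.append(float(np.linalg.norm(delta)))
```

Printing `norms` shows the gradient length collapsing by orders of magnitude as we move toward earlier layers: each (w^j)^T step roughly preserves length at this scale, while each Σ′ step shrinks it by a factor of 1/4 or more.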
Other obstacles to deep learning

In this chapter we've focused on vanishing gradients and, more generally, unstable gradients as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important, fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.
As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.
As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.
These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.

Last update: Fri Jul 10 08:53:05 2015