Title: AIFH, Volume 3: Deep Learning and Neural Networks
Author: Jeff Heaton
Published: December 31, 2015
Copyright: Copyright 2015 by Heaton Research, Inc., All Rights Reserved.
File Created: Sun Nov 08 15:28:13 CST 2015
ISBN: 978-1505714340
Price: 9.99 USD
Do not make illegal copies of this ebook
This eBook is copyrighted material, and public distribution is prohibited. If you did not receive this ebook from Heaton Research (http://www.heatonresearch.com), or an authorized bookseller, please contact Heaton Research, Inc. to purchase a licensed copy. DRM-free copies of our books can be purchased from:
http://www.heatonresearch.com/book
If you purchased this book, thank you! Your purchase of this book supports the Encog Machine Learning Framework. http://www.encog.org
Publisher: Heaton Research, Inc.
Artificial Intelligence for Humans, Volume 3: Neural Networks and Deep Learning
December, 2015
Author: Jeff Heaton
Editor: Tracy Heaton
ISBN: 978-1505714340
Edition: 1.0
Copyright © 2015 by Heaton Research Inc., 1734 Clarkson Rd. #107, Chesterfield, MO 63017-4976. World rights reserved. The author(s) created reusable code in this publication expressly for reuse by readers. Heaton Research, Inc. grants readers permission to reuse the code found in this publication or downloaded from our website so long as (author(s)) are attributed in any application containing the reusable code and the source code itself is never redistributed, posted online by electronic transmission, sold or commercially exploited as a stand-alone product. Aside from this specific exception concerning reusable code, no part of this publication may be stored in a retrieval system, transmitted, or reproduced in any way, including, but not limited to photocopy, photograph, magnetic, or other record, without prior agreement and written permission of the publisher.
Heaton Research, Encog, the Encog Logo and the Heaton Research logo are all trademarks of Heaton Research, Inc., in the United States and/or other countries.
TRADEMARKS: Heaton Research has attempted throughout this book to distinguish proprietary trademarks from descriptive terms by following the capitalization style used by the manufacturer.
The author and publisher have made their best efforts to prepare this book, so the content is based upon the final release of software whenever possible. Portions of the manuscript may be based upon pre-release versions supplied by software manufacturer(s). The author and the publisher make no representation or warranties of any kind with regard to the completeness or accuracy of the contents herein and accept no liability of any kind including but not limited to performance, merchantability, fitness for any particular purpose, or any losses or damages of any kind caused or alleged to be caused directly or indirectly from this book.
SOFTWARE LICENSE AGREEMENT: TERMS AND CONDITIONS
The media and/or any online materials accompanying this book that are available now or in the future contain programs and/or text files (the “Software”) to be used in connection with the book. Heaton Research, Inc. hereby grants to you a license to use and distribute software programs that make use of the compiled binary form of this book’s source code. You may not redistribute the source code contained in this book, without the written permission of Heaton Research, Inc. Your purchase, acceptance, or use of the Software will constitute your acceptance of such terms.
The Software compilation is the property of Heaton Research, Inc. unless otherwise indicated and is protected by copyright to Heaton Research, Inc. or other copyright owner(s) as indicated in the media files (the “Owner(s)”). You are hereby granted a license to use and distribute the Software for your personal, noncommercial use only. You may not reproduce, sell, distribute, publish, circulate, or commercially exploit the Software, or any portion thereof, without the written consent of Heaton Research, Inc. and the specific copyright owner(s) of any component software included on this media.
In the event that the Software or components include specific license requirements or end-user agreements, statements of condition, disclaimers, limitations or warranties (“End-User License”), those End-User Licenses supersede the terms and conditions herein as to that particular Software component. Your purchase, acceptance, or use of the Software will constitute your acceptance of such End-User Licenses.
By purchase, use or acceptance of the Software you further agree to comply with all export laws and regulations of the United States as such laws and regulations may exist from time to time.
SOFTWARE SUPPORT
Components of the supplemental Software and any offers associated with them may be supported by the specific Owner(s) of that material but they are not supported by Heaton Research, Inc. Information regarding any available support may be obtained from the Owner(s) using the information provided in the appropriate README files or listed elsewhere on the media.
Should the manufacturer(s) or other Owner(s) cease to offer support or decline to honor any offer, Heaton Research, Inc. bears no responsibility. This notice concerning support for the Software is provided for your information only. Heaton Research, Inc. is not the agent or principal of the Owner(s), and Heaton Research, Inc. is in no way responsible for providing any support for the Software, nor is it liable or responsible for any support provided, or not provided, by the Owner(s).
WARRANTY
Heaton Research, Inc. warrants the enclosed media to be free of physical defects for a period of ninety (90) days after purchase. The Software is not available from Heaton Research, Inc. in any other form or media than that enclosed herein or posted to www.heatonresearch.com. If you discover a defect in the media during this warranty period, you may obtain a replacement of identical format at no charge by sending the defective media, postage prepaid, with proof of purchase to:
Heaton Research, Inc.
Customer Support Department
1734 Clarkson Rd #107
Chesterfield, MO 63017-4976
Web: www.heatonresearch.com
E-Mail: [email protected]
DISCLAIMER
Heaton Research, Inc. makes no warranty or representation, either expressed or implied, with respect to the Software or its contents, quality, performance, merchantability, or fitness for a particular purpose. In no event will Heaton Research, Inc., its distributors, or dealers be liable to you or any other party for direct, indirect, special, incidental, consequential, or other damages arising out of the use of or inability to use the Software or its contents even if advised of the possibility of such damage. In the event that the Software includes an online update feature, Heaton Research, Inc. further disclaims any obligation to provide this feature for any specific duration other than the initial posting.
The exclusion of implied warranties is not permitted by some states. Therefore, the above exclusion may not apply to you. This warranty provides you with specific legal rights; there may be other rights that you may have that vary from state to state. The pricing of the book with the Software by Heaton Research, Inc. reflects the allocation of risk and limitations on liability contained in this agreement of Terms and Conditions.
SHAREWARE DISTRIBUTION
This Software may use various programs and libraries that are distributed as shareware. Copyright laws apply to both shareware and ordinary commercial software, and the copyright Owner(s) retains all rights. If you try a shareware program and continue using it, you are expected to register it. Individual programs differ on details of trial periods, registration, and payment. Please observe the requirements stated in appropriate files.
This book is dedicated to my mom Mary,
thank you for all the love
and encouragement over the years.
Introduction
Series Introduction
Example Computer Languages
Prerequisite Knowledge
Fundamental Algorithms
Other Resources
Structure of this Book
This book is the third in a series covering select topics in artificial intelligence (AI), a large field of study that encompasses many sub-disciplines. In this introduction, we will provide some background information for readers who might not have read Volume 1 or 2. It is not necessary to read Volume 1 or 2 before this book. We introduce needed information from both volumes in the following sections.
Series Introduction
This series of books introduces the reader to a variety of popular topics in artificial intelligence. By no means are these volumes intended to be an exhaustive AI resource. However, each book presents a specific area of AI to familiarize the reader with some of the latest techniques in this field of computer science.
In this series, we teach artificial intelligence concepts in a mathematically gentle manner, which is why we named the series Artificial Intelligence for Humans. As a result, we always follow the theories with real-world programming examples and pseudocode instead of relying solely on mathematical formulas. Still, we make these assumptions:
The reader is proficient in at least one programming language.
The reader has a basic understanding of college algebra.
The reader does not necessarily have much experience with formulas from calculus, linear algebra, differential equations, and statistics. We will introduce these formulas when necessary.
Finally, the book’s examples have been ported to a number of programming languages. Readers can adapt the examples to the language that fits their particular programming needs.
Programming Languages
Although the book’s text stays at the pseudocode level, we provide example packs for Java, C# and Python. The Scala programming language has a community-supplied port, and readers are also working on porting the examples to additional languages. So, your favorite language might have been ported since this printing. Check the book’s GitHub repository for more information. We highly encourage readers of the books to help port to other languages. If you would like to get involved, Appendix A has more information to get you started.
Online Labs
Many of the examples from this series use JavaScript and are available to run online, using HTML5. Mobile devices must also have HTML5 capability to run the programs. You can find all online lab materials at the following website:
http://www.aifh.org
These online labs allow you to experiment with the examples even as you read the e-book from a mobile device.
Code Repositories
All of the code for this project is released under the Apache Open Source License v2 and can be found at the following GitHub repository:
https://github.com/jeffheaton/aifh
If you find something broken, misspelled, or otherwise botched as you work with the examples, you can fork the project and push a commit revision to GitHub. You will also receive credit among the growing number of contributors. Refer to Appendix A for more information on contributing code.
Books Planned for the Series
The following volumes are planned for this series:
Volume 0: Introduction to the Math of AI
Volume 1: Fundamental Algorithms
Volume 2: Nature-Inspired Algorithms
Volume 3: Deep Learning and Neural Networks
We will produce Volumes 1, 2, and 3 in order. Volume 0 is a planned prequel that we will create near the end of the series. While all the books will include the required mathematical formulas to implement the programs, the prequel will recap and expand on all the concepts from the earlier volumes. We also intend to produce more books on AI after the publication of Volume 3.
In general, you can read the books in any order. Each book’s introduction will provide some background material from previous volumes. This organization allows you to jump quickly to the volume that contains your area of interest. If you want to supplement your knowledge at a later point, you can read the previous volume.
Other Resources
Many other resources on the Internet will be very useful as you read through this series of books.
The first resource is Khan Academy, a nonprofit, educational website that provides videos to demonstrate many areas of mathematics. If you need additional review on any mathematical concept in this book, Khan Academy probably has a video on that information.
http://www.khanacademy.org/
The second resource is the Neural Network FAQ. This text-only resource has a great deal of information on neural networks and other AI topics.
http://www.faqs.org/faqs/ai-faq/neural-nets/
Although the information in this book is not necessarily tied to Encog, the Encog home page has a fair amount of general information on machine learning.
http://www.encog.org
Neural Networks Introduction
Neural networks have been around since the 1940s, and, as a result, they have quite a bit of history. This book will cover the historic aspects of neural networks because you need to know some of the terminology. A good example of this historic progress is the activation function, which scales values passing through neurons in the neural network. Along with threshold activation functions, researchers introduced neural networks, and this advancement gave way to sigmoidal activation functions, then to hyperbolic tangent functions and now to the rectified linear unit (ReLU). While most current literature suggests using the ReLU activation function exclusively, you need to understand sigmoidal and hyperbolic tangent to see the benefits of ReLU.
Whenever possible, we will indicate which architectural component of a neural network to use. We will always identify the architectural components now accepted as the recommended choice over older classical components. We will bring many of these architectural elements together and provide you with some concrete recommendations for structuring your neural networks in Chapter 14, “Architecting Neural Networks.”
Neural networks have risen from the ashes of discredit several times in their history. McCulloch, W. and Pitts, W. (1943) first introduced the idea of a neural network. However, they had no method to train these neural networks. Programmers had to craft by hand the weight matrices of these early networks. Because this process was tedious, neural networks fell into disuse for the first time.
Rosenblatt, F. (1958) provided a much-needed training algorithm called backpropagation, which automatically creates the weight matrices of neural networks. In fact, backpropagation can train networks with many layers of neurons that simulate the architecture of animal brains. However, backpropagation is slow, and, as the layers increase, it becomes even slower. It appeared as if the addition of computational power in the 1980s and early 1990s helped neural networks perform tasks, but the hardware and training algorithms of this era could not effectively train neural networks with many layers, and, for the second time, neural networks fell into disuse.
The third rise of neural networks occurred when Hinton (2006) provided a radical new way to train deep neural networks. The recent advances in high-speed graphics processing units (GPUs) allowed programmers to train neural networks with three or more layers and led to a resurgence in this technology as programmers realized the benefits of deep neural networks.
In order to establish the foundation for the rest of the book, we begin with an analysis of classic neural networks, which are still useful for a variety of tasks. Our analysis includes concepts, such as self-organizing maps (SOMs), Hopfield neural networks, and Boltzmann machines. We also introduce the feedforward neural network and show several ways to train it.
A feedforward neural network with many layers becomes a deep neural network. The book contains methods, such as GPU support, to train deep networks. We also explore technologies related to deep learning, such as dropout, regularization, and convolution. Finally, we demonstrate these techniques through several real-world examples of deep learning, such as predictive modeling and image recognition.
If you would like to read in greater detail about the three phases of neural network technology, the following article presents a great overview:
http://chronicle.com/article/The-Believers/190147/
The Kickstarter Campaign
In 2013, we launched this series of books after a successful Kickstarter campaign. Figure 1 shows the home page of the Kickstarter project for Volume 3:
Figure 1: The Kickstarter Campaign
You can visit the original Kickstarter at the following link:
https://goo.gl/zW4dht
We would like to thank all of the Kickstarter backers of the project. Without your support, this series might not exist. We would like to extend a huge thank you to those who backed at the $250 and beyond level:
Figure 2: Gold Level Backers
It will be great discussing your projects with you. Thank you again for your support.
We would also like to extend a special thanks to those backers who supported the book at the $100 and higher levels. They are listed here in the order that they backed:
Figure 3: Silver Level Backers
A special thank you to my wife, Tracy Heaton, who edited the previous two volumes.
There have been three volumes so far; the repeat backers have been very valuable to this campaign! It is amazing to me how many repeat backers there are!
Thank you, everyone—you are the best!
http://www.heatonresearch.com/ThankYou/
Figure 4: Repeat Backers 1/4
Figure 5: Repeat Backers 2/4
Figure 6: Repeat Backers 3/4
Figure 7: Repeat Backers 4/4
Background Information
You can read Artificial Intelligence for Humans in any order. However, this book does expand on some topics introduced in Volumes 1 and 2. The goal of this section is to help you understand what a neural network is and how to use it. Most people, even non-programmers, have heard of neural networks. Many science fiction stories have plots that are based on ideas related to neural networks. As a result, sci-fi writers have created an influential but somewhat inaccurate view of the neural network.
Most laypeople consider neural networks to be a type of artificial brain. According to this view, neural networks could power robots or carry on intelligent conversations with human beings. However, this notion is a closer definition of artificial intelligence (AI) than of neural networks. Although AI seeks to create truly intelligent machines, the current state of computers is far below this goal. Human intelligence still trumps computer intelligence.
Neural networks are a small part of AI. As they currently exist, neural networks carry out minuscule, highly specific tasks. Unlike the human brain, computer-based neural networks are not general-purpose computational devices. Furthermore, the term neural network can create confusion because the brain is a network of neurons just as AI uses neural networks. To avoid this problem, we must make an important distinction.
We should really call the human brain a biological neural network (BNN). Most texts do not bother to make the distinction between a BNN and artificial neural networks (ANNs). Our book follows this pattern. When we refer to neural networks, we’re dealing with ANNs. We are not talking about BNNs when we use the term neural network.
Biological neural networks and artificial neural networks share some very basic similarities. For instance, biological neural networks have inspired the mathematical constructs of artificial neural networks. Biological plausibility describes various artificial neural network algorithms. This term defines how close an artificial neural network algorithm is to a biological neural network.
As previously mentioned, programmers design neural networks to execute one small task. A full application will likely use neural networks to accomplish certain parts of the application. However, the entire application will not be implemented as a neural network. It may consist of several neural networks of which each has a specific task.
Pattern recognition is a task that neural networks can easily accomplish. For this task, you can communicate a pattern to a neural network, and it communicates a pattern back to you. At the highest level, a typical neural network can perform only this function. Although some network architectures might achieve more, the vast majority of neural networks work this way. Figure 8 illustrates a neural network at this level:
Figure 8: A Typical Neural Network
As you can see, the above neural network accepts a pattern and returns a pattern. Neural networks operate synchronously and will only produce output when presented with input. This behavior is not like that of a human brain, which does not operate synchronously. The human brain responds to input, but it will produce output anytime it feels like it!
Neural Network Structure
Neural networks consist of layers of similar neurons. Most have at least an input layer and an output layer. The program presents the input pattern to the input layer. Then the output pattern is returned from the output layer. What happens between the input and output layers is a black box. By black box, we mean that you do not know exactly why a neural network outputs what it does. At this point, we are not yet concerned with the internal structure of the neural network, or the black box. Many different architectures define the interaction between the input and output layer. Later, we will examine some of these architectures.
The input and output patterns are both arrays of floating-point numbers. Consider the arrays in the following ways:
Neural Network Input: [-0.245, 0.283, 0.0]
Neural Network Output: [0.782, 0.543]
The above neural network has three neurons in the input layer and two neurons in the output layer. The number of neurons in the input and output layers does not change, even if you restructure the interior of the neural network.
To utilize the neural network, you must express your problem so that the input of the problem is an array of floating-point numbers. Likewise, the solution to the problem must be an array of floating-point numbers. Ultimately, this expression is the only process that neural networks can perform. In other words, they take one array and transform it into a second. Neural networks do not loop, call subroutines, or perform any of the other tasks you might think of with traditional programming. Neural networks simply recognize patterns.
You might think of a neural network as a hash table in traditional programming that maps keys to values. It acts somewhat like a dictionary. You can consider the following as a type of hash table:
“hear” -> “to perceive or apprehend by the ear”
“run” -> “to go faster than a walk”
“write” -> “to form (as characters or symbols) on a surface with an instrument (as a pen)”
This table creates a mapping between words and their definitions. Programming languages usually call this a hash map or a dictionary. This hash table uses a key of type string to reference another value that is also of type string. If you’ve not worked with hash tables before, they simply map one value to another, and they are a form of indexing. In other words, the dictionary returns a value when you provide it with a key. Most neural networks function in this manner. One neural network called bidirectional associative memory (BAM) allows you to provide the value and receive the key.
Programming hash tables contain keys and values. Think of the pattern sent to the input layer of the neural network as the key to the hash table. Likewise, think of the value returned from the hash table as the pattern that is returned from the output layer of the neural network. Although the comparison between a hash table and a neural network is appropriate to help you understand the concept, you need to realize that the neural network is much more than a hash table.
What would happen with the previous hash table if you were to provide a word that is not a key in the map? To answer the question, we will pass in the key of “wrote.” For this example, a hash table would return null. It would indicate in some way that it could not find the specified key. However, neural networks do not return null; they find the closest match. Not only do they find the closest match, they will modify the output to estimate the missing value. So if you passed in “wrote” to the above neural network, you would likely receive what you would have expected for “write.” You would likely get the output from one of the other keys because not enough data exist for the neural network to modify the response. The limited number of samples (in this case, there are three) causes this result.
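The hash-table analogy can be sketched in a few lines of Python. The standard-library `difflib` module stands in here for the network's closest-match behavior; this is only an illustration of the idea, not how a neural network actually computes, and the function names are ours:

```python
import difflib

# A hash table (dictionary) maps keys to values exactly.
definitions = {
    "hear": "to perceive or apprehend by the ear",
    "run": "to go faster than a walk",
    "write": "to form (as characters or symbols) on a surface "
             "with an instrument (as a pen)",
}

def lookup_exact(word):
    # An ordinary hash table returns null (None) for a missing key.
    return definitions.get(word)

def lookup_closest(word):
    # A neural network instead returns something close to what it
    # learned for the nearest key it was trained on.
    matches = difflib.get_close_matches(word, definitions.keys(), n=1)
    return definitions[matches[0]] if matches else None

print(lookup_exact("wrote"))    # None: the exact lookup fails
print(lookup_closest("wrote"))  # the definition stored for "write"
```

The exact lookup fails on “wrote,” while the closest-match lookup falls back to “write,” mirroring the behavior described above.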
The above mapping raises an important point about neural networks. As previously stated, neural networks accept an array of floating-point numbers and return another array. This behavior provokes the question about how to put string, or textual, values into the above neural network. Although a solution exists, dealing with numeric data rather than strings is much easier for the neural network.
In fact, this question reveals one of the most difficult aspects of neural network programming. How do you translate your problem into a fixed-length array of floating-point numbers? In the examples that follow, you will see the complexity of neural networks.
A Simple Example
In computer programming, it is customary to provide a “Hello World” application that simply displays the text “Hello World.” If you have previously read about neural networks, you have no doubt seen examples with the exclusive or (XOR) operator, which is one of the “Hello World” applications of neural network programming. Later in this section, we will describe more complex scenarios than XOR, but it is a great introduction. We shall begin by looking at the XOR operator as though it were a hash table. If you are not familiar with the XOR operator, it works similarly to the AND/OR operators. For an AND to be true, both sides must be true. For an OR to be true, either side must be true. For an XOR to be true, both of the sides must be different from each other. The following truth table represents an XOR:
False XOR False = False
True XOR False = True
False XOR True = True
True XOR True = False
To continue the hash table example, you would represent the above truth table as follows:
[0.0, 0.0] -> [0.0]
[1.0, 0.0] -> [1.0]
[0.0, 1.0] -> [1.0]
[1.0, 1.0] -> [0.0]
These mappings show input and the ideal expected output for the neural network.
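The truth-table-to-array mapping above can be written out directly. This is a sketch of the training data only, assuming no particular framework; the variable names are ours:

```python
# XOR truth table expressed as neural network training data: each input
# is an array of floating-point numbers, as is each ideal output.
xor_inputs = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
xor_ideals = [
    [0.0],
    [1.0],
    [1.0],
    [0.0],
]

# Sanity-check the mapping against Python's ^ (exclusive or) on booleans.
for (a, b), (ideal,) in zip(xor_inputs, xor_ideals):
    assert (bool(a) ^ bool(b)) == bool(ideal)
```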
Training: Supervised and Unsupervised
When you specify the ideal output, you are using supervised training. If you did not provide ideal outputs, you would be using unsupervised training. Supervised training teaches the neural network to produce the ideal output. Unsupervised training usually teaches the neural network to place the input data into a number of groups defined by the output neuron count.
Both supervised and unsupervised training are iterative processes. For supervised training, each training iteration calculates how close the actual output is to the ideal output and expresses this closeness as an error percent. Each iteration modifies the internal weight matrices of the neural network to decrease the error rate to an acceptably low level.
Unsupervised training is also an iterative process. However, calculating the error is not as easy. Because you have no expected output, you cannot measure how far the unsupervised neural network is from your ideal output; there is no ideal output. As a result, you will just iterate for a fixed number of iterations and try to use the network. If the neural network needs more training, the program provides it.
Another important aspect of the above training data is that you can take it in any order. The result of two zeros with XOR applied (0 XOR 0) is going to be 0, regardless of which case you used. This characteristic is not true of all neural networks. For the XOR operator, we would probably use a type of neural network called a feedforward neural network in which the order of the training set does not matter. Later in this book, we will examine recurrent neural networks that do consider the order of the training data. Order is an essential component of a simple recurrent neural network.
Previously, you saw that the simple XOR operator utilized training data. Now we will analyze a situation with more complex training data.
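To make the iterative error measurement concrete, here is a minimal sketch of how the closeness of actual to ideal outputs might be computed each iteration. We use mean squared error, one common choice; the book's own examples may measure error differently, and the numbers below are hypothetical:

```python
def iteration_error(actual, ideal):
    # Mean squared error between the network's actual outputs and the
    # ideal outputs: one common way to express "how close" per iteration.
    assert len(actual) == len(ideal)
    return sum((a - i) ** 2 for a, i in zip(actual, ideal)) / len(actual)

ideal = [0.0, 1.0, 1.0, 0.0]        # the XOR ideal outputs
untrained = [0.5, 0.5, 0.5, 0.5]    # a hypothetical untrained network
trained = [0.05, 0.93, 0.94, 0.04]  # a hypothetical trained network

print(iteration_error(untrained, ideal))  # 0.25
print(iteration_error(trained, ideal))    # roughly 0.003, much closer to 0
```

Each training iteration would adjust the weight matrices so that this error shrinks toward an acceptably low level.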
Miles per Gallon
In general, neural network problems involve a set of data that you use to predict values for later sets of data. These later sets of data result after you’ve already trained your neural network. The power of a neural network is to predict outcomes for entirely new data sets based on knowledge learned from past data sets. Consider a car database that contains the following fields:
Car Weight
Engine Displacement
Cylinder Count
Horse Power
Hybrid or Gasoline
Miles per Gallon
Although we are oversimplifying the data, this example demonstrates how to format data. Assuming you have collected some data for these fields, you should be able to construct a neural network that can predict one field value, based on the other field values. For this example, we will try to predict miles per gallon.
As previously demonstrated, we will need to define this problem in terms of an input array of floating-point numbers mapped to an output array of floating-point numbers. However, the problem has one additional requirement. The numeric range of each of these array elements should be between 0 and 1 or between -1 and 1. Meeting this range requirement is called normalization. Normalization takes real-world data and turns it into a form that the neural network can process.
First, we determine how to normalize the above data. Consider the neural network format. We have six total fields. We want to use five of these fields to predict the sixth. Consequently, the neural network would have five input neurons and one output neuron.
Your network would resemble the following:
Input Neuron 1: Car Weight
Input Neuron 2: Engine Displacement
Input Neuron 3: Cylinder Count
Input Neuron 4: Horse Power
Input Neuron 5: Hybrid or Gasoline
Output Neuron 1: Miles per Gallon
We also need to normalize the data. To accomplish this normalization, we must think of reasonable ranges for each of these values. We will then transform input data into a number between 0 and 1 that represents an actual value’s position within that range. Consider the reasonable ranges for the following values:
Car Weight: 100-5,000 lbs.
Engine Displacement: 0.1 to 10 liters
Cylinder Count: 2-12
Horse Power: 1-1,000
Hybrid or Gasoline: true or false
Miles per Gallon: 1-500
Given today’s cars, these ranges may be on the large end. However, this characteristic will allow minimal restructuring of the neural network in the future. We also want to avoid having too much data at the extreme ends of the range.
To illustrate normalization, we will consider the problem of normalizing a weight of 2,000 pounds. This weight is 1,900 into the range (2,000 - 100). The size of the range is 4,900 pounds (5,000 - 100). The position within the range is about 0.39 (1,900 / 4,900). Therefore, we would feed the value 0.39 to the input neuron in order to represent this weight. This process satisfies the range requirement of 0 to 1 for an input neuron.
The hybrid or regular value is a true/false. To represent this value, we will use 1 for hybrid and 0 for regular. We simply normalize a true/false into these two values.
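The range normalization just described is a one-line formula. A minimal Python sketch, with function names of our own choosing:

```python
def normalize(value, low, high):
    # Map a real-world value into the 0 to 1 range an input neuron expects.
    return (value - low) / (high - low)

def denormalize(norm, low, high):
    # Reverse the mapping, for example to interpret an output neuron.
    return norm * (high - low) + low

# The car-weight example: 2,000 lbs within the range 100 to 5,000 lbs.
n = normalize(2000, 100, 5000)
print(round(n, 2))  # 0.39, i.e. 1,900 / 4,900

# The hybrid/gasoline flag normalizes to 1.0 (hybrid) or 0.0 (regular).
def normalize_flag(flag):
    return 1.0 if flag else 0.0
```

The same formula, with `low = -1` style adjustments, would serve for a -1 to 1 range instead.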
Now that you’ve seen some of the uses for neural networks, it is time to determine how to select the appropriate neural network for your specific problem. The next section provides a roadmap to the various neural networks that are available.
A Neural Network Roadmap
This volume contains a wide array of neural network types. We will present these neural networks along with examples that showcase each neural network in a specific problem domain. Not all neural networks are designed to tackle every problem domain. As a neural network programmer, you need to know which neural network to use for a specific problem.
This section provides a high-level roadmap to the rest of the book that will guide your reading to the areas of the book that align with your interests. Figure 9 shows a grid of the neural network types in this volume and their applicable problem domains:
Figure 9: Neural Network Types & Problem Domains
The problem domains listed above are the following:
Clust – Unsupervised clustering problems
Regis – Regression problems; the network must output a number based on input.
Classif – Classification problems; the network must classify data points into predefined classes.
Predict – The network must predict events in time, such as signals for finance applications.
Robot – Robotics, using sensors and motor control
Vision – Computer vision (CV) problems require the computer to understand images.
Optim – Optimization problems require that the network find the best ordering or set of values to achieve an objective.
The number of check marks gives the applicability of each of the neural network types to that particular problem. If there are no checks, you cannot apply that network type to that problem domain.
All neural networks share some common characteristics. Neurons, weights, activation functions, and layers are the building blocks of neural networks. In the first chapter of this book, we will introduce these concepts and present the basic characteristics that most neural networks share.
Data Sets Used in this Book
This book contains several data sets that allow us to show the application of neural networks to real data. We chose several data sets in order to cover topics such as regression, classification, time-series, and computer vision.
MNIST Handwritten Digits
Several examples use the MNIST handwritten digits data set. The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that programmers use for training various image processing systems. This classic data set is often presented in conjunction with neural networks. This data set is essentially the “Hello World” program of neural networks. You can obtain it from the following URL:
http://yann.lecun.com/exdb/mnist/
The data set is stored in a special binary format, which is also described at the above URL. The example programs provided for this chapter are capable of reading this format.
This data set contains many handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. Labels on both sets indicate what each digit is supposed to be. MNIST is a highly studied data set that programmers frequently use as a benchmark for new machine learning algorithms and techniques. Furthermore, researchers have published many scientific papers about their attempts to achieve the lowest error rate. In one study, the researchers managed to achieve an error rate on the MNIST database of 0.23 percent while using a hierarchical system of convolutional neural networks (Schmidhuber, 2012).
We show a small sampling of the data set in Figure 10:
Figure 10: MNIST Digits
We can use this data set for classification neural networks. The networks learn to look at an image and classify it into the appropriate place among the ten digits. Even though this data set is image-based, you can think of it as a traditional data set. These images are 28 pixels by 28 pixels, resulting in a total of 784 pixels. Despite the impressive images, we begin the book by using regular neural networks that treat each image as the input to a 784-input-neuron neural network. You would use exactly the same type of neural network to handle any classification problem that has a large number of inputs. Such problems are high dimensional. Later in the book, we will see how to use neural networks that were specifically designed for image recognition. These neural networks will perform considerably better on the MNIST digits than the more traditional neural networks.
The MNIST data set is stored in a proprietary binary format that is described at the above URL. We provide a decoder in the book’s examples.
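The binary layout is documented at the MNIST URL above: an image file begins with a big-endian header of four 32-bit integers (the magic number 2051, the image count, the row count, and the column count), followed by one unsigned byte per pixel. Under that assumption, a minimal decoder might look like the following sketch; the function name is ours, not the book's, and the label file is similar but uses magic number 2049 with no row/column fields:

```python
import struct

def read_idx_images(data):
    # Decode the MNIST image-file layout: a big-endian header of magic
    # number (2051), image count, rows, and columns, then one unsigned
    # byte per pixel, image by image.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 2051, "not an MNIST image file"
    size = rows * cols
    pixels = data[16:]
    return [list(pixels[i * size:(i + 1) * size]) for i in range(count)]

# A synthetic one-image, 2x2-pixel file for illustration.
sample = struct.pack(">IIII", 2051, 1, 2, 2) + bytes([0, 255, 128, 64])
print(read_idx_images(sample))  # [[0, 255, 128, 64]]
```

For real MNIST files, each decoded image would be a list of 784 pixel values, ready to feed the 784-input-neuron network described above.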
Iris Data Set
Because AI frequently uses the iris data set (Fisher, 1936), you will see it several times in this book. Sir Ronald Fisher (1936) collected these data as an example of discriminant analysis. This data set has become very popular in machine learning even today. The following URL contains the iris data set:
https://archive.ics.uci.edu/ml/datasets/Iris
The iris data set contains measurements and species information for 150 iris flowers, and the data are essentially represented as a spreadsheet with the following columns or features:
Sepal length
Sepal width
Petal length
Petal width
Iris species
Petals refer to the innermost petals of the iris, and sepals refer to the outermost petals of the iris flower. Even though the data set seems to have a vector of length 5, the species feature must be handled differently than the other four. In other words, vectors typically contain only numbers. The first four features are inherently numerical; the species feature is not.
One of the primary applications of this data set is to create a program that will act as a classifier. That is, it will consider the flower’s features as inputs (sepal length, petal width, etc.) and ultimately determine the species. This classification would be trivial for a complete and known data set, but our goal is to see whether the model can correctly identify the species using data from unknown irises.
A simple numeric encoding would translate the iris species to a single dimension. Instead, we must use higher-dimensional encodings, such as one-of-n or equilateral, so that the species encodings are equidistant from each other. If we are classifying irises, we do not want our encoding process to create any biases.
Thinking of the iris features as dimensions in a higher-dimensional space makes a great deal of sense. Consider the individual samples (the rows in the iris dataset) as points in this search space. Points closer together likely share similarities. Let's take a look at these similarities by studying the following three rows from the iris dataset:
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolour
6.3,3.3,6.0,2.5,Iris-virginica
The first line has 5.1 as the sepal length, 3.5 as the sepal width, 1.4 as the petal length, and 0.2 as the petal width. If we use one-of-n encoding to the range 0 to 1, the above three rows would encode to the following three vectors:
[5.1,3.5,1.4,0.2,1,0,0]
[7.0,3.2,4.7,1.4,0,1,0]
[6.3,3.3,6.0,2.5,0,0,1]
Chapter 4, "Feedforward Neural Networks," will cover one-of-n encoding.
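As a preview of that encoding, the species column can be converted into the extra vector elements shown above with a few lines of Python (the species ordering below is taken from the three sample rows):

```python
# One-of-n encoding for the iris species, in the order used above.
SPECIES = ["Iris-setosa", "Iris-versicolour", "Iris-virginica"]

def one_of_n(species):
    # Produce a vector with a 1 in the position of the species, 0 elsewhere.
    encoded = [0] * len(SPECIES)
    encoded[SPECIES.index(species)] = 1
    return encoded

# The first sample row, with its species encoded as three extra elements.
row = [5.1, 3.5, 1.4, 0.2]
print(row + one_of_n("Iris-setosa"))   # [5.1, 3.5, 1.4, 0.2, 1, 0, 0]
```

The same call with "Iris-virginica" yields the [0, 0, 1] tail of the third vector.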
Auto MPG Data Set
The auto miles per gallon (MPG) dataset is commonly used for regression problems. The dataset contains attributes of several cars. Using these attributes, we can train neural networks to predict the fuel efficiency of the car. The UCI Machine Learning Repository provides this dataset, and you can download it from the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
We took these data from the StatLib library, which is maintained at Carnegie Mellon University. The data were used in the 1983 American Statistical Association Exposition, and no values are missing. Quinlan (1993), the author of the study, used this dataset to describe fuel consumption: "The data concern city-cycle fuel consumption in miles per gallon, to be predicted in terms of three multi-valued discrete and five continuous attributes" (Quinlan, 1993).
The dataset contains the following attributes:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Sunspots Data Set
Sunspots are temporary phenomena on the surface of the sun that appear visibly as dark spots compared to surrounding regions. Intense magnetic activity causes sunspots. Although they occur at temperatures of roughly 3,000-4,500 K (2,727-4,227 °C), the contrast with the surrounding material at about 5,780 K leaves them clearly visible as dark spots. Sunspots appear and disappear with regularity, making them a good dataset for time series prediction.
Figure 11 shows sunspot activity over time:

Figure 11: Sunspots Activity
The sunspot data file contains information similar to the following:

YEAR MON  SSN   DEV
1749   1  58.0  24.1
1749   2  62.6  25.1
1749   3  70.0  26.6
1749   4  55.7  23.6
1749   5  85.0  29.4
1749   6  83.5  29.2
1749   7  94.8  31.1
1749   8  66.3  25.9
1749   9  75.9  27.7
The above data provide the year, month, sunspot count, and standard deviation of the sunspots observed. Many world organizations track sunspots. The following URL contains a table of sunspot readings:
http://solarscience.msfc.nasa.gov/greenwch/spot_num.txt
XOR Operator
The exclusive or (XOR) operator is a Boolean operator. Programmers frequently use the truth table for XOR as an ultra-simple sort of "Hello World" training set for machine learning. We refer to this table as the XOR dataset. This operator is related to the parity operator, which extends XOR to three or more inputs. The two-input XOR operator has the following truth table:
0 XOR 0 = 0
1 XOR 0 = 1
0 XOR 1 = 1
1 XOR 1 = 0
We utilize the XOR operator for cases in which we would like to train or evaluate the neural network by hand.
Kaggle Otto Group Challenge
In this book, we will also utilize the Kaggle Otto Group Challenge dataset. Kaggle is a platform that fosters competition among data scientists on new datasets. We use this dataset to classify products into several groups based on unknown attributes. Additionally, we will employ a deep neural network to tackle this problem. We will also discuss advanced ensemble techniques that you can use to compete in Kaggle. We will describe this dataset in greater detail in Chapter 16.
We will begin this book with an overview of features that are common to most neural networks. These features include neurons, layers, activation functions, and connections. For the remainder of the book, we will expand on these topics as we introduce more neural network architectures.
Chapter 1: Neural Network Basics
Neurons and Layers
Neuron Types
Activation Functions
Logic Gates
This book is about neural networks and how to train, query, structure, and interpret them. We present many neural network architectures as well as the plethora of algorithms that can train these neural networks. Training is the process in which a neural network is adapted to make predictions from data. In this chapter, we will introduce the basic concepts that are most relevant to the neural network types featured in the book.
Deep learning, a relatively new set of training techniques for multilayered neural networks, is also a primary topic. It encompasses several algorithms that can train complex types of neural networks. With the development of deep learning, we now have effective methods to train neural networks with many layers.
This chapter will include a discussion of the commonalities among the different neural networks. Additionally, you will learn how neurons form weighted connections, how these neurons create layers, and how activation functions affect the output of a layer. We begin with neurons and layers.
Neurons and Layers
Most neural network structures use some type of neuron. Many different kinds of neural networks exist, and programmers introduce experimental neural network structures all the time. Consequently, it is not possible to cover every neural network architecture. However, there are some commonalities among neural network implementations. An algorithm that is called a neural network will typically be composed of individual, interconnected units, even though these units may or may not be called neurons. In fact, the name for a neural network processing unit varies among the literature sources. It could be called a node, neuron, or unit.
Figure 1.1 shows the abstract structure of a single artificial neuron:

Figure 1.1: An Artificial Neuron
The artificial neuron receives input from one or more sources that may be other neurons or data fed into the network from a computer program. This input is usually floating-point or binary. Often binary input is encoded to floating-point by representing true or false as 1 or 0. Sometimes the program also depicts the binary input using a bipolar system with true as 1 and false as -1.
An artificial neuron multiplies each of these inputs by a weight. Then it adds these multiplications and passes this sum to an activation function. Some neural networks do not use an activation function. Equation 1.1 summarizes the calculated output of a neuron:
Equation 1.1: Neuron Output

    f(x, w) = φ( Σ_i (x_i * w_i) )
In the above equation, the variables x and w represent the input and weights of the neuron. The variable i corresponds to the number of weights and inputs. You must always have the same number of weights as inputs. Each weight is multiplied by its respective input, and the products of these multiplications are fed into an activation function that is denoted by the Greek letter φ (phi). This process results in a single output from the neuron.
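Equation 1.1 translates directly into code. The sketch below assumes the sigmoid function (introduced later in this chapter) for φ, though any activation function could be substituted:

```python
import math

def sigmoid(x):
    # One common choice for the activation function phi.
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, activation=sigmoid):
    # Multiply each input by its weight, sum the products,
    # and pass the sum through the activation function.
    assert len(inputs) == len(weights)  # same number of weights as inputs
    total = sum(x * w for x, w in zip(inputs, weights))
    return activation(total)

print(neuron_output([0.5, 0.75, 0.2], [0.1, 0.2, 0.3]))
```

Swapping in a different `activation` argument reproduces any of the neuron variants discussed in this chapter.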
Figure 1.1 shows the structure with just one building block. You can chain together many artificial neurons to build an artificial neural network (ANN). Think of the artificial neurons as building blocks for which the input and output circles are the connectors. Figure 1.2 shows an artificial neural network composed of three neurons:

Figure 1.2: Simple Artificial Neural Network (ANN)
The above diagram shows three interconnected neurons. This representation is essentially Figure 1.1, minus a few inputs, repeated three times and then connected. It also has a total of four inputs and a single output. The outputs of neurons N1 and N2 feed N3 to produce the output O. To calculate the output for Figure 1.2, we perform Equation 1.1 three times. The first two times calculate N1 and N2, and the third calculation uses the output of N1 and N2 to calculate N3.
Neural network diagrams do not typically show the level of detail seen in Figure 1.2. To simplify the diagram, we can omit the activation functions and intermediate outputs. This process results in Figure 1.3:

Figure 1.3: Simplified View of ANN
Looking at Figure 1.3, you can see two additional components of neural networks. First, consider the inputs and outputs that are shown as abstract dotted-line circles. The input and output could be parts of a larger neural network. However, the input and output are often a special type of neuron that accepts data from the computer program using the neural network, and the output neurons return a result back to the program. This type of neuron is called an input neuron. We will discuss these neurons in the next section.
Figure 1.3 also shows the neurons arranged in layers. The input neurons are the first layer, the N1 and N2 neurons create the second layer, the third layer contains N3, and the fourth layer has O. While most neural networks arrange neurons into layers, this is not always the case. Stanley (2002) introduced a neural network architecture called NeuroEvolution of Augmenting Topologies (NEAT). NEAT neural networks can have a very jumbled, non-layered architecture.
The neurons that form a layer share several characteristics. First, every neuron in a layer has the same activation function. However, the layers themselves might have different activation functions. Second, layers are fully connected to the next layer. In other words, every neuron in one layer has a connection to every neuron in the next layer. Figure 1.3 is not fully connected. Several layers are missing connections. For example, I1 and N2 do not connect. Figure 1.4 is a new version of Figure 1.3 that is fully connected and has an additional layer.
Figure 1.4: Fully Connected Network
In Figure 1.4, you see a fully connected, multilayered neural network. Networks such as this one will always have an input and output layer. The number of hidden layers determines the name of the network architecture. The network in Figure 1.4 is a two-hidden-layer network. Most networks will have between zero and two hidden layers. Unless you have implemented deep learning strategies, networks with more than two hidden layers are rare.
You might also notice that the arrows always point downward or forward from the input to the output. This type of neural network is called a feedforward neural network. Later in this book, we will see recurrent neural networks that form inverted loops among the neurons.
Types of Neurons
In the last section, we briefly introduced the idea that different types of neurons exist. Now we will explain all the neuron types described in the book. Not every neural network will use every type of neuron. It is also possible for a single neuron to fill the role of several different neuron types.
Input and Output Neurons
Nearly every neural network has input and output neurons. The input neurons accept data from the program for the network. The output neurons provide processed data from the network back to the program. The program will group these input and output neurons into separate layers called the input and output layers. However, for some network structures, the neurons can act as both input and output. The Hopfield neural network, which we will discuss in Chapter 3, "Hopfield & Boltzmann Machines," is an example of a neural network architecture in which neurons are both input and output.
The program normally represents the input to a neural network as an array or vector. The number of elements contained in the vector must be equal to the number of input neurons. For example, a neural network with three input neurons might accept the following input vector:
[0.5,0.75,0.2]
Neural networks typically accept floating-point vectors as their input. Likewise, neural networks will output a vector with length equal to the number of output neurons. The output will often be a single value from a single output neuron. To be consistent, we will represent the output of a single-output-neuron network as a single-element vector.
Notice that input neurons do not have activation functions. As demonstrated by Figure 1.1, input neurons are little more than placeholders. The input is simply weighted and summed. Furthermore, the size of the input and output vectors for the neural network will be the same if the neural network has neurons that are both input and output.
Hidden Neurons
Hidden neurons have two important characteristics. First, hidden neurons only receive input from other neurons, such as input or other hidden neurons. Second, hidden neurons only output to other neurons, such as output or other hidden neurons. Hidden neurons help the neural network understand the input, and they form the output. However, they are not directly connected to the incoming data or to the eventual output. Hidden neurons are often grouped into fully connected hidden layers.
A common question for programmers concerns the number of hidden neurons in a network. Since the answer to this question is complex, more than one section of the book will include a relevant discussion of the number of hidden neurons. Prior to deep learning, it was generally suggested that anything more than a single hidden layer is excessive (Hornik, 1991). Researchers have proven that a single-hidden-layer neural network can function as a universal approximator. In other words, this network should be able to learn to produce (or approximate) any output from any input, as long as it has enough hidden neurons in a single layer.
Another reason why researchers used to scoff at the idea of additional hidden layers is that these layers would impede the training of the neural network. Training refers to the process that determines good weight values. Before researchers introduced deep learning techniques, we simply did not have an efficient way to train a deep network, which is a neural network with a large number of hidden layers. Although a single-hidden-layer neural network can theoretically learn anything, deep learning facilitates a more complex representation of patterns in the data.
Bias Neurons
Programmers add bias neurons to neural networks to help them learn patterns. Bias neurons function like an input neuron that always produces the value of 1. Because the bias neurons have a constant output of 1, they are not connected to the previous layer. The value of 1, which is called the bias activation, can be set to values other than 1. However, 1 is the most common bias activation. Not all neural networks have bias neurons. Figure 1.5 shows a single-hidden-layer neural network with bias neurons:
Figure 1.5: Network with Bias Neurons
The above network contains three bias neurons. Every layer, except for the output layer, contains a single bias neuron. Bias neurons allow the output of an activation function to be shifted. We will see exactly how this shifting occurs later in the chapter when we discuss activation functions.
Context Neurons
Context neurons are used in recurrent neural networks. This type of neuron allows the neural network to maintain state. As a result, a given input may not always produce exactly the same output. This inconsistency is similar to the workings of biological brains. Consider how context factors into your response when you hear a loud horn. If you hear the noise while you are crossing the street, you might startle, stop walking, and look in the direction of the horn. If you hear the horn while you are watching an action-adventure film in a movie theater, you don't respond in the same way. Therefore, prior inputs give you the context for processing the audio input of a horn.
Time series is one application of context neurons. You might need to train a neural network to learn input signals to perform speech recognition or to predict trends in security prices. Context neurons are one way for neural networks to deal with time series data. Figure 1.6 shows how context neurons might be arranged in a neural network:
Figure 1.6: Context Neurons
This neural network has a single input and output neuron. Between the input and output layers are two hidden neurons and two context neurons. Other than the two context neurons, this network is the same as previous networks in the chapter.
Each context neuron holds a value that starts at 0 and always receives a copy of either hidden 1 or hidden 2 from the previous use of the network. The two dashed lines in Figure 1.6 mean that the context neuron is a direct copy with no other weighting. The other lines indicate that the output is weighted by one of the six weight values listed above. Equation 1.1 still calculates the output in the same way. The value of the output neuron would be the sum of all four inputs, multiplied by their weights, and applied to the activation function.
A type of neural network called a simple recurrent neural network (SRN) uses context neurons. Jordan and Elman networks are the two most common types of SRN. Figure 1.6 shows an Elman SRN. Chapter 13, "Time Series and Recurrent Networks," includes a discussion of both types of SRN.
Other Neuron Types
The individual units that comprise a neural network are not always called neurons. Researchers will sometimes refer to these neurons as nodes, units, or summations. In later chapters of the book, we will explore deep learning that utilizes Boltzmann machines to fill the role of neurons. Regardless of the type of unit, neural networks are almost always constructed of weighted connections between these units.
Activation Functions
In neural network programming, activation or transfer functions establish bounds for the output of neurons. Neural networks can use many different activation functions. We will discuss the most common activation functions in this section.
Choosing an activation function for your neural network is an important consideration because it can affect how you must format input data. In this chapter, we will guide you on the selection of an activation function. Chapter 14, "Architecting Neural Networks," will also contain additional details on the selection process.
Linear Activation Function
The most basic activation function is the linear function because it does not change the neuron output at all. Equation 1.2 shows how the program typically implements a linear activation function:
Equation 1.2: Linear Activation Function

    φ(x) = x
As you can observe, this activation function simply returns the value that the neuron inputs passed to it. Figure 1.7 shows the graph for a linear activation function:

Figure 1.7: Linear Activation Function
Regression neural networks, those that learn to provide numeric values, will usually use a linear activation function on their output layer. Classification neural networks, those that determine an appropriate class for their input, will usually utilize a softmax activation function for their output layer.
Step Activation Function
The step or threshold activation function is another simple activation function. Neural networks were originally called perceptrons. McCulloch & Pitts (1943) introduced an early artificial neuron that used a step activation function like the one in Equation 1.3:
Equation 1.3: Step Activation Function

    φ(x) = 1, if x >= 0.5; otherwise φ(x) = 0
Equation 1.3 outputs a value of 1.0 for incoming values of 0.5 or higher and 0 for all other values. Step functions are often called threshold functions because they only return 1 (true) for values that are above the specified threshold, as seen in Figure 1.8:

Figure 1.8: Step Activation Function
Sigmoid Activation Function
The sigmoid or logistic activation function is a very common choice for feedforward neural networks that need to output only positive numbers. Despite its widespread use, the hyperbolic tangent or the rectified linear unit (ReLU) activation function is usually a more suitable choice. We introduce the ReLU activation function later in this chapter. Equation 1.4 shows the sigmoid activation function:
Equation 1.4: Sigmoid Activation Function

    φ(x) = 1 / (1 + e^-x)
Use the sigmoid function to ensure that values stay within a relatively small range, as seen in Figure 1.9:

Figure 1.9: Sigmoid Activation Function
As you can see from the above graph, all values, whether above or below 0, are compressed into the approximate range between 0 and 1.
Hyperbolic Tangent Activation Function
The hyperbolic tangent function is also a very common activation function for neural networks that must output values in the range between -1 and 1. This activation function is simply the hyperbolic tangent (tanh) function, as shown in Equation 1.5:
Equation 1.5: Hyperbolic Tangent Activation Function

    φ(x) = tanh(x)
The graph of the hyperbolic tangent function has a similar shape to the sigmoid activation function, as seen in Figure 1.10:

Figure 1.10: Hyperbolic Tangent Activation Function
The hyperbolic tangent function has several advantages over the sigmoid activation function. These involve the derivatives used in the training of the neural network, and they will be covered in Chapter 6, "Backpropagation Training."
Rectified Linear Units (ReLU)
Introduced in 2000 by Teh & Hinton, the rectified linear unit (ReLU) has seen very rapid adoption over the past few years. Prior to the ReLU activation function, the hyperbolic tangent was generally accepted as the activation function of choice. Most current research now recommends the ReLU due to superior training results. As a result, most neural networks should utilize the ReLU on hidden layers and either softmax or linear on the output layer. Equation 1.6 shows the very simple ReLU function:
Equation 1.6: Rectified Linear Unit (ReLU)

    φ(x) = max(0, x)
We will now examine why ReLU typically performs better than other activation functions for hidden layers. Part of the increased performance is due to the fact that the ReLU activation function is a linear, non-saturating function. Unlike the sigmoid/logistic or the hyperbolic tangent activation functions, the ReLU does not saturate to -1, 0, or 1. A saturating activation function moves towards and eventually attains a value. The hyperbolic tangent function, for example, saturates to -1 as x decreases and to 1 as x increases. Figure 1.11 shows the graph of the ReLU activation function:
Figure 1.11: ReLU Activation Function
Most current research states that the hidden layers of your neural network should use the ReLU activation. The reasons for the superiority of the ReLU over the hyperbolic tangent and sigmoid will be demonstrated in Chapter 6, "Backpropagation Training."
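Each of the activation functions presented so far reduces to a single line of code. A short sketch collecting them for comparison; the final line illustrates the saturation behavior that the ReLU avoids:

```python
import math

def linear(x):
    return x                          # Equation 1.2: output unchanged

def step(x):
    return 1.0 if x >= 0.5 else 0.0   # Equation 1.3: threshold at 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) # Equation 1.4: output in (0, 1)

def tanh(x):
    return math.tanh(x)               # Equation 1.5: output in (-1, 1)

def relu(x):
    return max(0.0, x)                # Equation 1.6: non-saturating for x > 0

# tanh saturates toward 1 for large inputs; ReLU keeps growing linearly.
print(round(tanh(10.0), 6), relu(10.0))   # 1.0 10.0
```

Feeding increasingly large inputs makes the saturation contrast clear: tanh flattens out while ReLU passes the value straight through.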
Softmax Activation Function
The final activation function that we will examine is the softmax activation function. Along with the linear activation function, softmax is usually found in the output layer of a neural network. The softmax function is used on classification neural networks. The neuron that has the highest value claims the input as a member of its class. The softmax activation function is a preferable method because it forces the output of the neural network to represent the probability that the input falls into each of the classes. Without the softmax, the neurons' outputs are simply numeric values, with the highest indicating the winning class.
To see how the softmax activation function is used, we will look at a common neural network classification problem. The iris dataset contains four measurements for 150 different iris flowers. Each of these flowers belongs to one of three species of iris. When you provide the measurements of a flower, the softmax function allows the neural network to give you the probability that these measurements belong to each of the three species. For example, the neural network might tell you that there is an 80% chance that the iris is setosa, a 15% probability that it is virginica, and only a 5% probability of versicolour. Because these are probabilities, they must add up to 100%. There could not be an 80% probability of setosa, a 75% probability of virginica, and a 20% probability of versicolour; this type of result would be nonsensical.
To classify input data into one of three iris species, you will need one output neuron for each of the three species. The output neurons do not inherently specify the probability of each of the three species. Therefore, it is desirable to provide probabilities that sum to 100%. The neural network will tell you the probability of a flower being each of the three species. To get the probability, use the softmax function in Equation 1.7:
Equation 1.7: The Softmax Function

    φ_i = e^(z_i) / Σ_j e^(z_j)
In the above equation, i represents the index of the output neuron (o) being calculated, and j represents the indexes of all neurons in the group/level. The variable z designates the array of output neurons. It's important to note that the softmax activation is calculated differently than the other activation functions in this chapter. When softmax is the activation function, the output of a single neuron is dependent on the other output neurons.
In Equation 1.7, you can observe that the output of the other output neurons is contained in the variable z; none of the other activation functions in this chapter utilize z. Listing 1.1 implements softmax in pseudocode:
Listing 1.1: The Softmax Function
import math

def softmax(neuron_output):
    # Sum the exponentials of every neuron's output.
    exp_sum = 0
    for v in neuron_output:
        exp_sum = exp_sum + math.exp(v)
    # Divide each neuron's exponential by the sum so the results total 1.0.
    proba = []
    for i in range(len(neuron_output)):
        proba.append(math.exp(neuron_output[i]) / exp_sum)
    return proba
To see the softmax function in operation, refer to the following URL:
http://www.heatonresearch.com/aifh/vol3/softmax.html
Consider a trained neural network that classifies data into three categories, such as the three iris species. In this case, you would use one output neuron for each of the target classes. Consider if the neural network were to output the following:
Neuron 1: setosa: 0.9
Neuron 2: versicolour: 0.2
Neuron 3: virginica: 0.4
From the above output, we can clearly see that the neural network considers the data to represent a setosa iris. However, these numbers are not probabilities. The 0.9 value does not represent a 90% likelihood of the data representing a setosa. These values sum to 1.5. In order for them to be treated as probabilities, they must sum to 1.0. The output vector for this neural network is the following:
[0.9,0.2,0.4]
If this vector is provided to the softmax function, the following vector is returned:
[0.47548495534876745,0.2361188410001125,0.28839620365112]
The above three values do sum to 1.0 and can be treated as probabilities. The likelihood of the data representing a setosa iris is 48% because the first value in the vector rounds to 0.48 (48%). You can calculate this value in the following manner:
sum = exp(0.9) + exp(0.2) + exp(0.4) = 5.17283056695839
j0 = exp(0.9) / sum = 0.47548495534876745
j1 = exp(0.2) / sum = 0.2361188410001125
j2 = exp(0.4) / sum = 0.28839620365112
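The arithmetic above can be checked with a few lines of Python:

```python
import math

outputs = [0.9, 0.2, 0.4]
# Sum of exponentials, then each exponential divided by that sum.
exp_sum = sum(math.exp(v) for v in outputs)
proba = [math.exp(v) / exp_sum for v in outputs]

print(round(exp_sum, 11))            # 5.17283056696
print([round(p, 4) for p in proba])  # [0.4755, 0.2361, 0.2884]
```

The resulting values match the hand calculation and sum to exactly 1.0.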
What Role Does Bias Play?
The activation functions seen in the previous section specify the output of a single neuron. Together, the weight and bias of a neuron shape the output of the activation function to produce the desired output. To see how this process occurs, consider Equation 1.8. It represents a single-input sigmoid activation neural network.
Equation 1.8: Single-Input Neural Network

    f(x, w, b) = 1 / (1 + e^-(w*x + b))
The x variable represents the single input to the neural network. The w and b variables specify the weight and bias of the neural network. The above equation is a combination of Equation 1.1, which specifies a neural network, and Equation 1.4, which designates the sigmoid activation function.
The weights of the neuron allow you to adjust the slope or shape of the activation function. Figure 1.12 shows the effect on the output of the sigmoid activation function if the weight is varied:
Figure 1.12: Adjusting Neuron Weight
The above diagram shows several sigmoid curves using the following parameters:
f(x, 0.5, 0.0)
f(x, 1.0, 0.0)
f(x, 1.5, 0.0)
f(x, 2.0, 0.0)
To produce the curves, we did not use bias, which is evident in the third parameter of 0 in each case. Using four weight values yields four different sigmoid curves in Figure 1.12. No matter the weight, we always get the same value of 0.5 when x is 0 because all of the curves hit the same point when x is 0. We might need the neural network to produce values other than 0.5 when the input is near 0.
Bias does shift the sigmoid curve, which allows values other than 0.5 when x is near 0. Figure 1.13 shows the effect of using a weight of 1.0 with several different biases:
Figure 1.13: Adjusting Neuron Bias
The above diagram shows several sigmoid curves with the following parameters:
f(x, 1.0, 1.0)
f(x, 1.0, 0.5)
f(x, 1.0, 1.5)
f(x, 1.0, 2.0)
We used a weight of 1.0 for these curves in all cases. When we utilized several different biases, the sigmoid curves shifted to the left or right. Because all the curves merge together at the top right or bottom left, it is not a complete shift.
When we put bias and weights together, they produce a curve that creates the necessary output from a neuron. The above curves are the output from only one neuron. In a complete network, the output from many different neurons will combine to produce complex output patterns.
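Both effects are easy to check numerically with Equation 1.8, using the same f(x, w, b) parameter order as the curves above:

```python
import math

def f(x, w, b):
    # Single-input sigmoid neuron: weight w scales x, bias b shifts the curve.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# With no bias, every weight produces 0.5 at x = 0.
print([f(0.0, w, 0.0) for w in (0.5, 1.0, 1.5, 2.0)])   # [0.5, 0.5, 0.5, 0.5]

# Adding a bias shifts the curve, so the output at x = 0 moves away from 0.5.
print(round(f(0.0, 1.0, 1.0), 4))   # 0.7311
```

The first line reproduces the common crossing point of Figure 1.12; the second shows the shift that bias provides in Figure 1.13.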
Logic with Neural Networks
As a computer programmer, you are probably familiar with logical programming. You can use the programming operators AND, OR, and NOT to govern how a program makes decisions. These logical operators often define the actual meaning of the weights and biases in a neural network. Consider the following truth table:
0 AND 0 = 0
1 AND 0 = 0
0 AND 1 = 0
1 AND 1 = 1
0 OR 0 = 0
1 OR 0 = 1
0 OR 1 = 1
1 OR 1 = 1
NOT 0 = 1
NOT 1 = 0
The truth table specifies that if both sides of the AND operator are true, the final output is also true. In all other cases, the result of the AND is false. This definition fits the English word "and" quite well. If you want a house with a nice view AND a large backyard, then both requirements must be fulfilled for you to choose a house. If you want a house that has a nice view OR a large backyard, then only one needs to be present.
These logical statements can become more complex. Consider if you want a house that has a nice view and a large backyard. However, you would also be satisfied with a house that has a small backyard yet is near a park. You can express this idea in the following way:
([nice view] AND [large yard]) OR ((NOT [large yard]) AND [park])
You can express the previous statement with the following logical operators:

([nice view] ∧ [large yard]) ∨ (¬[large yard] ∧ [park])

In the above statement, the OR looks like a letter "v," the AND looks like an upside-down "v," and the NOT looks like half of a box.
We can use neural networks to represent the basic logical operators of AND, OR, and NOT, as seen in Figure 1.14:

Figure 1.14: Basic Logic Operators
The above diagram shows the weights and bias weight for each of the three fundamental logical operators. You can easily calculate the output for any of these operators using Equation 1.1. Consider the AND operator with two true (1) inputs:
(1 * 1) + (1 * 1) + (-1.5) = 0.5
We are using a step activation function. Because 0.5 is greater than or equal to 0.5, the output is 1, or true. We can also evaluate the expression where one of the inputs is false:
(1 * 1) + (0 * 1) + (-1.5) = -0.5
Because of the step activation function, this output is 0, or false.
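The AND calculation above generalizes to all three operators. In the sketch below, the AND weights and bias come from the calculation just shown; the OR and NOT values are the same ones used in the XOR walkthrough later in this chapter (weights of 1, 1 with a bias of -0.5 for OR, and a weight of -1 with a bias of 0.5 for NOT):

```python
def step(x):
    # Step activation from Equation 1.3: fires at the 0.5 threshold.
    return 1 if x >= 0.5 else 0

def and_gate(a, b):
    return step(a * 1 + b * 1 - 1.5)

def or_gate(a, b):
    return step(a * 1 + b * 1 - 0.5)

def not_gate(a):
    return step(a * -1 + 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, "AND", b, "=", and_gate(a, b))
        print(a, "OR", b, "=", or_gate(a, b))
print("NOT 0 =", not_gate(0))
print("NOT 1 =", not_gate(1))
```

Running this loop reproduces the full truth table shown earlier in this section.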
We can build more complex logical structures from these neurons. Consider the exclusive or (XOR) operator that has the following truth table:
0 XOR 0 = 0
1 XOR 0 = 1
0 XOR 1 = 1
1 XOR 1 = 0
The XOR operator specifies that one, but not both, of the inputs can be true. For example, one of the two cars will win the race, but not both of them will win. The XOR operator can be written with the basic AND, OR, and NOT operators as follows:
Equation 1.9: The Exclusive Or Operator

    p ⊕ q = (p ∨ q) ∧ ¬(p ∧ q)
The plus with a circle is the symbol for the XOR operator, and p and q are the two inputs to evaluate. The above expression makes sense if you think of the XOR operator as meaning p or q, but not both p and q. Figure 1.15 shows a neural network that can represent an XOR operator:
Figure 1.15: XOR Neural Network
Calculating the above neural network would require several steps. First, you must calculate the values for every node that is directly connected to the inputs. In the case of the above neural network, there are two such nodes. We will show an example of calculating the XOR with the inputs [0,1]. We begin by calculating the two topmost, unlabeled (hidden) nodes:
(0 * 1) + (1 * 1) - 0.5 = 0.5 = True
(0 * 1) + (1 * 1) - 1.5 = -0.5 = False
Next we calculate the lower, unlabeled (hidden) node:
(0 * -1) + 0.5 = 0.5 = True
Finally, we calculate O1:
(1 * 1) + (1 * 1) - 1.5 = 0.5 = True
As you can see, you can manually wire the connections in a neural network to produce the desired output. However, manually creating neural networks is very tedious. The rest of the book will include several algorithms that can automatically determine the weight and bias values.
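The hand calculation above can be automated. The sketch below wires the network from the same weights used in the walkthrough: an OR node (bias -0.5), an AND node (bias -1.5), a NOT node (weight -1, bias 0.5), and a final AND node combining them:

```python
def step(x):
    # Step activation from Equation 1.3: fires at the 0.5 threshold.
    return 1 if x >= 0.5 else 0

def xor_network(i1, i2):
    or_node = step(i1 * 1 + i2 * 1 - 0.5)    # topmost hidden node (OR)
    and_node = step(i1 * 1 + i2 * 1 - 1.5)   # second hidden node (AND)
    not_node = step(and_node * -1 + 0.5)     # lower hidden node (NOT of AND)
    # Output O1: (i1 OR i2) AND NOT (i1 AND i2), as in Equation 1.9.
    return step(or_node * 1 + not_node * 1 - 1.5)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, "XOR", i2, "=", xor_network(i1, i2))
```

The loop reproduces the XOR truth table, confirming that the hand-wired weights implement Equation 1.9.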
Chapter Summary
In this chapter, we showed that a neural network is comprised of neurons, layers, and activation functions. Fundamentally, the neurons in a neural network might be input, hidden, or output in nature. Input and output neurons pass information into and out of the neural network. Hidden neurons occur between the input and output neurons and help process information.
Activation functions scale the output of a neuron. We also introduced several activation functions. The two most common activation functions are the sigmoid and hyperbolic tangent. The sigmoid function is appropriate for networks in which only positive output is needed. The hyperbolic tangent function supports both positive and negative output.
A neural network can build logical statements, and we demonstrated the weights needed to generate the AND, OR, and NOT operators. Using these three basic operators, you can build more complex logical expressions. We presented an example of building an XOR operator.
Now that we've seen the basic structure of a neural network, we will explore in the next two chapters several classic neural networks so that you can use this abstract structure. Classic neural network structures include the self-organizing map, the Hopfield neural network, and the Boltzmann machine. These classical neural networks form the foundation of other architectures that we present in the book.
Chapter 2: Self-Organizing Maps
Self-Organizing Maps
Neighborhood Functions
Unsupervised Training
Dimensionality
Now that you have explored the abstract nature of a neural network introduced in the previous chapter, you will learn about several classic neural network types. This chapter covers one of the earliest types of neural networks that is still useful today. Because neurons can be connected in various ways, many different neural network architectures exist and build on the fundamental ideas from Chapter 1, "Neural Network Basics." We begin our examination of classic neural networks with the self-organizing map (SOM).
The SOM is used to classify input data into one of several groups. Training data is provided to the SOM, as well as the number of groups into which you wish to classify these data. While training, the SOM will arrange these data into groups. Data that have the most similar characteristics will be grouped together. This process is very similar to clustering algorithms, such as k-means. However, unlike k-means, which only groups an initial set of data, the SOM can continue classifying new data beyond the initial dataset that was used for training. Unlike most of the neural networks in this book, the SOM is unsupervised; you do not tell it what groups you expect the training data to fall into. The SOM simply figures out the groups itself, based on your training data, and then it classifies any future data into similar groups. Future classification is performed using what the SOM learned from the training data.
Self-Organizing Maps
Kohonen (1988) introduced the self-organizing map (SOM), a neural network consisting of an input layer and an output layer. The two-layer SOM, also known as the Kohonen neural network, functions by mapping data from the input layer to the output layer. As the program presents patterns to the input layer, the output neuron that contains the weights most similar to the input is considered the winner. The program calculates this similarity by comparing the Euclidean distance between the input pattern and the weight vector of each output neuron. The shortest Euclidean distance wins. Calculating Euclidean distance is the focus of the next section.
Unlike the feedforward neural network discussed in Chapter 1, there are no bias values in the SOM. It just has weights from the input layer to the output layer. Additionally, it uses only a linear activation function. Figure 2.1 shows the SOM:
Figure 2.1: Self-Organizing Map
The SOM pictured above shows how the program maps three input neurons to nine output neurons arranged in a three-by-three grid. The output neurons of the SOM are often arranged into a grid, cube, or other higher-dimensional construct. Because the ordering of the output neurons in most neural networks typically conveys no meaning at all, this arrangement is very different. For example, the close proximity of output neurons #1 and #2 in most neural networks is not significant. However, for the SOM, the closeness of one output neuron to another is important. Computer vision applications make use of the closeness of neurons to identify images more accurately. Convolutional neural networks (CNNs), which will be examined in Chapter 10, "Convolutional Neural Networks," group neurons into overlapping regions based on how close the input neurons are to each other. When recognizing images, it is very important to consider which pixels are near each other. The program recognizes patterns such as edges, solid regions, and lines by looking at pixels near each other.
Common structures for the output neurons of SOMs include the following:
One-Dimensional: Output neurons are arranged in a line.
Two-Dimensional: Output neurons are arranged in a grid.
Three-Dimensional: Output neurons are arranged in a cube.
We will now see how to structure a simple SOM that learns to recognize colors that are given as RGB vectors. The individual red, green, and blue values can range between -1 and +1. Black, or the absence of color, designates -1, and +1 expresses the full intensity of red, green or blue. These three color components comprise the neural network input.
The output will be a 2,500-neuron grid arranged into 50 rows by 50 columns. This SOM will organize similar colors near each other in this output grid. Figure 2.2 shows this output:
Figure 2.2: The Output Grid
Although the above figure may not be as clear in the black and white editions of this book as it is in the color e-book editions, you can see similar colors grouped near each other. A single, color-based SOM is a very simple example that allows you to visualize the grouping capabilities of the SOM.
How are SOMs trained? The training process will update the weight matrix, which is 3 by 2,500. To start, the program initializes the weight matrix to random values. Then it randomly chooses 15 training colors.
The training will progress through a series of iterations. Unlike other neural network types, the training for SOM networks involves a fixed number of iterations. To train the color-based SOM, we will use 1,000 iterations.
Each iteration will choose one random color sample from the training set, a collection of RGB color vectors that each consist of three numbers. Likewise, the weights between each of the 2,500 output neurons and the three input neurons form a vector of three numbers. As training progresses, the program will calculate the Euclidean distance between each weight vector and the current training pattern. The Euclidean distance determines the difference between two vectors of the same size. In this case, both vectors are three numbers that represent an RGB color. We compare the color from the training data to the three weights of each neuron. Equation 2.1 shows the Euclidean distance calculation:
Equation 2.1: The Euclidean Distance between Training Data and Output Neuron
In the above equation, the variable p represents the training pattern. The variable w corresponds to the weight vector. By squaring the differences between each vector component and taking the square root of the resulting sum, we calculate the Euclidean distance. This calculation measures the difference between each weight vector and the input training pattern.
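The distance calculation and the search for the winning output neuron can be sketched in a few lines of Python. This is a minimal illustration with our own function names, not the book's example code:

```python
import math

def euclidean_distance(p, w):
    # Square the difference of each component, sum, and take the square root.
    return math.sqrt(sum((pi - wi) ** 2 for pi, wi in zip(p, w)))

def find_winner(weights, pattern):
    # Return the index of the output neuron whose weight vector has the
    # shortest Euclidean distance to the pattern.
    distances = [euclidean_distance(pattern, w) for w in weights]
    return distances.index(min(distances))
```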
The program calculates the Euclidean distance for every output neuron, and the one with the shortest distance is called the best matching unit (BMU). This neuron will learn the most from the current training pattern. The neighbors of the BMU will learn less. To perform this training, the program loops over every neuron and determines the extent to which it should be trained. Neurons that are closer to the BMU will receive more training. Equation 2.2 can make this determination:
Equation 2.2: SOM Learning Function
In the above equation, the variable t, also known as the iteration number, represents time. The purpose of the equation is to calculate the resulting weight vector Wv(t+1). You will determine the next weight by adding to the current weight, which is Wv(t). The end goal is to calculate how different the current weight is from the input vector, and this is done by the term D(t) - Wv(t). Training the SOM is the process of making a neuron's weights more similar to the training element. We do not want to simply assign the training element to the output neuron's weights, making them identical. Rather, we calculate the difference between the training element and the neuron's weights and scale this difference by multiplying it by two ratios. The first ratio, represented by θ (theta), is the neighborhood function. The second ratio, represented by α (alpha), is a monotonically decreasing learning rate. In other words, as the training progresses, the learning rate falls and never rises.
The neighborhood function considers how close each output neuron is to the BMU. For neurons that are nearer, the neighborhood function will return a value that approaches 1. For distant neighbors, the neighborhood function will approach 0. This range between 0 and 1 controls how near and far neighbors are trained. Nearer neighbors will receive more of the training adjustment to their weights. In the next section, we will analyze how the neighborhood function determines the training adjustments. In addition to the neighborhood function, the learning rate also scales how much the program will adjust the output neuron.
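Putting the learning function together with the neighborhood function and learning rate gives a single training step. The sketch below assumes theta is a neighborhood function of a neuron index and the BMU index; the names are hypothetical, not from the book's code:

```python
def som_train_step(weights, pattern, bmu, theta, alpha):
    # Wv(t+1) = Wv(t) + theta(v, bmu) * alpha * (D(t) - Wv(t))
    for v in range(len(weights)):
        influence = theta(v, bmu) * alpha
        for i in range(len(weights[v])):
            # Move each weight a fraction of the way toward the pattern.
            weights[v][i] += influence * (pattern[i] - weights[v][i])
    return weights
```

With the neighborhood fixed at 1 and a learning rate of 0.5, a single step moves a weight vector halfway toward the pattern.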
Understanding Neighborhood Functions
The neighborhood function determines the degree to which each output neuron should receive a training adjustment from the current training pattern. The function usually returns a value of 1 for the BMU. This value indicates that the BMU should receive the most training. Those neurons farther from the BMU will receive less training. The neighborhood function determines this weighting.
If the output neurons are arranged in only one dimension, you should use a simple one-dimensional neighborhood function, which will treat the output as one long array of numbers. For instance, a one-dimensional network might have 100 output neurons that form a long, single-dimensional array of 100 values.
A two-dimensional SOM might take these same 100 values and represent them as a grid, perhaps of 10 rows and 10 columns. The actual structure remains the same; the neural network has 100 output neurons. The only difference is the neighborhood function. The first would utilize a one-dimensional neighborhood function; the second would use a two-dimensional neighborhood function. The function must consider this additional dimension and factor it into the distance returned.
It is also possible to have three, four, and even more dimensions for the neighborhood function. Typically, neighborhood functions are expressed in vector form so that the number of dimensions does not matter. To represent the dimensions, the Euclidean norm (represented by two vertical bars) of all inputs is taken, as seen in Equation 2.3:
Equation 2.3: Euclidean Norm
For the above equation, the variable p represents the dimensional inputs. The variable w represents the weights. A single dimension has only a single value for p. Calculating the Euclidean norm for [2-0] is simply √((2-0)²), which is 2.
Calculating the Euclidean norm for [2-0, 3-0] is only slightly more complex: √((2-0)² + (3-0)²) = √13, or approximately 3.61.
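Both norm calculations can be checked with a short helper; this is a sketch, not library code:

```python
import math

def euclidean_norm(v):
    # ||v||: the square root of the sum of the squared components.
    return math.sqrt(sum(x * x for x in v))
```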
The most popular choice for SOMs is the two-dimensional neighborhood function. One-dimensional neighborhood functions are also common. However, neighborhood functions with three or more dimensions are more unusual. Choosing the number of dimensions really comes down to the programmer deciding how many ways an output neuron can be close to another. This decision should not be taken lightly because each additional dimension significantly affects the amount of memory and processing power needed. This additional processing is why most programmers choose two or three dimensions for the SOM application.
It can be difficult to understand why you might have more than three dimensions. The following analogy illustrates the limitations of three dimensions. While at the grocery store, John noticed a package of dried apples. As he turned his head to the left or right, traveling in the first dimension, he saw other brands of dried apples. If he looked up or down, traveling in the second dimension, he saw other types of dried fruit. The third dimension, depth, simply gives him more of exactly the same dried apples. He reached behind the front item and found additional stock. However, there is no fourth dimension, which could have been useful to allow fresh apples to be located near the dried apples. Because the supermarket only had three dimensions, this type of link is not possible. Programmers do not have this limitation, and they must decide if the extra processing time is necessary for the benefits of additional dimensions.
The Gaussian function is a popular choice for a neighborhood function. Equation 2.4 uses the Euclidean norm to calculate the Gaussian function for any number of dimensions:
Equation 2.4: The Vector Form of the Gaussian Function
The variable x represents the input to the Gaussian function, c represents the center of the Gaussian function, and w represents the widths. The variables x, w and c are all vectors with multiple dimensions. Figure 2.3 shows the graph of the one-dimensional Gaussian function:
Figure 2.3: A Single-Dimensional Gaussian Function
This figure illustrates why the Gaussian function is a popular choice for a neighborhood function. Programmers frequently use the Gaussian function to show the normal distribution, or bell curve. If the current output neuron is the BMU, then its distance (x-axis) will be 0. As a result, the training percent (y-axis) is 1.0 (100%). As the distance increases either positively or negatively, the training percentage decreases. Once the distance is large enough, the training percent approaches 0.
If the input vector to the Gaussian function has two dimensions, the graph appears as Figure 2.4:
Figure 2.4: A Two-Dimensional Gaussian Function
How does the algorithm use Gaussian constants with a neural network? The center (c) of a neighborhood function is always 0, which centers the function on the origin. If the algorithm moved the center from the origin, a neuron other than the BMU would receive the full learning. It is unlikely you would ever want to move the center from the origin. For a multi-dimensional Gaussian, set all centers to 0 in order to position the curve at the origin.
The only remaining Gaussian parameter is the width. You should set this parameter to something slightly less than the entire width of the grid or array. As training progresses, the width gradually decreases. Just like the learning rate, the width should decrease monotonically.
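A one-dimensional version of the Gaussian neighborhood, paired with a monotonically decreasing width, might look like the following sketch. The exponential decay schedule is one common choice, not a requirement from the book:

```python
import math

def gaussian_neighborhood(distance, width):
    # One-dimensional Gaussian centered on the origin:
    # exp(-(distance^2) / (2 * width^2)).
    return math.exp(-(distance ** 2) / (2 * width ** 2))

def width_at(t, total_iterations, initial_width):
    # The width decays monotonically as training progresses.
    return initial_width * math.exp(-t / total_iterations)
```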
Mexican Hat Neighborhood Function
Though it is the most popular, the Gaussian function is not the only neighborhood function available. The Ricker wave, or Mexican hat function, is another popular neighborhood function. Just like the Gaussian neighborhood function, the vector length of the x dimensions is the basis for the Mexican hat function, as seen in Equation 2.5:
Equation 2.5: Vector Form of Mexican Hat Function
Much the same as the Gaussian, the programmer can use the Mexican hat function in one or more dimensions. Figure 2.5 shows the Mexican hat function with one dimension:
Figure 2.5: A One-Dimensional Mexican Hat Function
You must be aware that the Mexican hat function penalizes neighbors that are between 2 and 4, or -2 and -4, units from the center. If your model seeks to penalize near misses, the Mexican hat function is a good choice.
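A one-dimensional Ricker wave can be sketched as follows. The normalization constant is omitted so that the function returns 1 at the center, matching its use as a training percentage; with the assumed sigma of 1, the negative lobe begins at distance 1, while the book's figure uses a wider setting:

```python
import math

def mexican_hat(distance, sigma=1.0):
    # Ricker wave (un-normalized): (1 - d^2/sigma^2) * exp(-d^2 / (2*sigma^2)).
    r = (distance / sigma) ** 2
    return (1.0 - r) * math.exp(-r / 2.0)
```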
You can also use the Mexican hat function in two or more dimensions. Figure 2.6 shows a two-dimensional Mexican hat function:
Figure 2.6: A Two-Dimensional Mexican Hat Function
Just like the one-dimensional version, the above Mexican hat penalizes near misses. The only difference is that the two-dimensional Mexican hat function utilizes a two-dimensional vector, which looks more like a Mexican sombrero than the one-dimensional variant. Although it is possible to use more than two dimensions, these variants are hard to visualize because we perceive space in three dimensions.
Calculating SOM Error
Supervised training typically reports an error measurement that decreases as training progresses. Unsupervised models, such as the SOM network, cannot directly calculate an error because there is no expected output. However, an estimation of the error can be calculated for the SOM (Masters, 1993).
We define the error as the longest Euclidean distance of all BMUs in a training iteration. Each training set element has its own BMU. As learning progresses, this longest distance should decrease, so the value also indicates the success of the SOM training.
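This error estimate follows directly from the definition; the helper names below are ours:

```python
import math

def som_error(weights, training_set):
    # The longest of the best-matching-unit distances across the
    # training set (Masters, 1993).
    def dist(p, w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, w)))
    return max(min(dist(p, w) for w in weights) for p in training_set)
```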
Chapter Summary
In the first two chapters, we explained several classic neural network types. Since Pitts (1943) introduced the neural network, many different neural network types have been invented. We have focused primarily on the classic neural network types that still have relevance and that establish the foundation for other architectures that we will cover in later chapters of the book.
This chapter focused on the self-organizing map (SOM), an unsupervised neural network type that can cluster data. The SOM has an input neuron count equal to the number of attributes for the data to be clustered. An output neuron count specifies the number of groups into which the data should be clustered. The SOM neural network is trained in an unsupervised manner. In other words, only the data points are provided to the neural network; the expected outputs are not provided. The SOM network learns to cluster the data points, especially the data points similar to the ones with which it trained.
In the next chapter, we will examine two more classic neural network types: the Hopfield neural network and the Boltzmann machine. These neural network types are similar in that they both use an energy function during their training process. The energy function measures the amount of energy in the network. As training progresses, the energy should decrease as the network learns.
Chapter 3: Hopfield & Boltzmann Machines
Hopfield Networks
Energy Functions
Hebbian Learning
Associative Memory
Optimization
Boltzmann Machines
This chapter will introduce the Hopfield network as well as the Boltzmann machine. Though neither of these classic neural networks is used extensively in modern AI applications, both are foundational to more modern algorithms. The Boltzmann machine forms the foundation of the deep belief neural network (DBNN), which is one of the fundamental algorithms of deep learning. Hopfield networks are a very simple type of neural network that utilizes many of the same features that the more complex feedforward neural networks employ.
Hopfield Neural Networks
The Hopfield neural network (Hopfield, 1982) is perhaps the simplest type of neural network because it is a fully connected, single-layer, auto-associative network. In other words, it has a single layer in which each neuron is connected to every other neuron. Additionally, the term auto-associative means that the neural network will return the entire pattern if it recognizes a pattern. As a result, the network will fill in the gaps of incomplete or distorted patterns.
Figure 3.1 shows a Hopfield neural network with just four neurons. While a four-neuron network is handy because it is small enough to visualize, it can recognize only a few patterns.
Figure 3.1: A Hopfield Neural Network with 12 Connections
Because every neuron in a Hopfield neural network is connected to every other neuron, you might assume that a four-neuron network would contain a four-by-four matrix, or 16 connections. However, 16 connections would require that every neuron be connected to itself as well as to every other neuron. In a Hopfield neural network, 16 connections do not occur; the actual number of connections is 12.
These connections are weighted and stored in a matrix. A four-by-four matrix would store the network pictured above. In fact, the diagonal of this matrix would contain 0s because there are no self-connections. All neural network examples in this book will use some form of matrix to store their weights.
Each neuron in a Hopfield network has a state of either true (1) or false (-1). These states are initially the input to the Hopfield network and ultimately become the output of the network. To determine whether a Hopfield neuron's state is -1 or 1, use Equation 3.1:
Equation 3.1: Hopfield Neuron State
The above equation calculates the state (s) of neuron i. The state of a given neuron greatly depends on the states of the other neurons. The equation multiplies and sums the weight (w) and state (s) of the other neurons (j). Essentially, the state of the current neuron (i) is +1 if this sum is greater than the threshold (θ, theta). Otherwise, it is -1. The threshold value is usually 0.
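This state calculation translates to a few lines of Python; the sketch below uses our own names:

```python
def hopfield_state(weights, state, i, threshold=0.0):
    # Neuron i becomes +1 when the weighted sum of the other neurons'
    # states exceeds the threshold; otherwise it becomes -1.
    total = sum(weights[i][j] * state[j]
                for j in range(len(state)) if j != i)
    return 1 if total > threshold else -1
```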
Because the state of a single neuron depends on the states of the remaining neurons, the order in which the equation calculates the neurons is very important. Programmers frequently employ the following two strategies to calculate the states for all neurons in a Hopfield network:
Asynchronous: This strategy updates only one neuron at a time. It picks this neuron at random.
Synchronous: It updates all neurons at the same time. This method is less realistic since biological organisms lack a global clock that synchronizes the neurons.
You should typically run a Hopfield network until the values of all neurons stabilize. Despite the fact that each neuron is dependent on the states of the others, the network will usually converge to a stable state.
It is important to have some indication of how close the network is to converging to a stable state. You can calculate an energy value for Hopfield networks. This value decreases as the Hopfield network moves to a more stable state. To evaluate the stability of the network, you can use the energy function. Equation 3.2 shows the energy calculation function:
Equation 3.2: Hopfield Energy Function
Boltzmann machines, discussed later in the chapter, also utilize this energy function. Boltzmann machines share many similarities with Hopfield neural networks. When the threshold is 0, the second term of Equation 3.2 drops out. Listing 3.1 contains the code to implement Equation 3.2:
Listing 3.1: Hopfield Energy
def energy(weights, state, threshold):
    neuron_count = len(state)
    # First term
    a = 0
    for i in range(neuron_count):
        for j in range(neuron_count):
            a = a + weights[i][j] * state[i] * state[j]
    a = a * -0.5
    # Second term
    b = 0
    for i in range(neuron_count):
        b = b + state[i] * threshold[i]
    # Result
    return a + b
Training a Hopfield Network
You can train Hopfield networks to arrange their weights in a way that allows the network to converge to desired patterns, also known as the training set.
These desired training patterns are a list of patterns with a Boolean value for each of the neurons that comprise the Hopfield network. The following data might represent a four-pattern training set for a Hopfield network with eight neurons:
11000000
00001100
10000001
00011000
The above data are completely arbitrary; however, they do represent actual patterns to train the Hopfield network. Once trained, a pattern similar to the one listed below should find equilibrium with a pattern close to the training set:
11100000
Therefore, the state of the Hopfield machine should change to the following pattern:
11000000
You can train Hopfield networks with either Hebbian (Hopfield, 1982) or Storkey (Storkey, 1999) learning. The Hebbian process for learning is biologically plausible, and it is often expressed as, “cells that fire together, wire together.” In other words, two neurons will become connected if they frequently react to the same input stimulus. Equation 3.3 summarizes this behavior mathematically:
Equation 3.3: Hopfield Hebbian Learning
The constant n represents the number of training set elements, and ε (epsilon) represents the individual training elements. The weight matrix will be square and will contain rows and columns equal to the number of neurons. The diagonal will always be 0 because a neuron is not connected to itself. The other locations in the matrix will contain values specifying how often two values in the training pattern are either +1 or -1. Listing 3.2 contains the code to implement Equation 3.3:
Listing 3.2: Hopfield Hebbian Training
def add_pattern(weights, pattern, n):
    neuron_count = len(pattern)
    for i in range(neuron_count):
        for j in range(neuron_count):
            if i == j:
                weights[i][j] = 0
            else:
                weights[i][j] = weights[i][j] \
                    + ((pattern[i] * pattern[j]) / n)
We apply the add_pattern method to add each of the training elements. The parameter weights specifies the weight matrix, and the parameter pattern specifies each individual training element. The variable n designates the number of elements in the training set.
It is possible that the equation and the code are not sufficient to show how the weights are generated from input patterns. To help you visualize this process, we provide an online Javascript application at the following URL:
http://www.heatonresearch.com/aifh/vol3/hopfield.html
Consider the following data to train a Hopfield network:
[1,0,0,1]
[0,1,1,0]
The previous data should produce a weight matrix like Figure 3.2:
Figure 3.2: Hopfield Matrix
To calculate the above matrix, divide 1 by the number of training set elements. The result is 1/2, or 0.5. The value 0.5 is placed into every row and column that has a 1 in the training set. For example, the first training element has a 1 in neurons #0 and #3, resulting in a 0.5 being added to row 0, column 3 and row 3, column 0. The same process continues for the other training set element.
Another common training technique for Hopfield neural networks is the Storkey training algorithm. Hopfield neural networks trained with Storkey have a greater capacity for patterns than the Hebbian method just described. The Storkey algorithm is more complex than the Hebbian algorithm.
The first step in the Storkey algorithm is to calculate a value called the local field. Equation 3.4 calculates this value:
Equation 3.4: Hopfield Storkey Local Field
We calculate the local field value (h) for each weight element (i & j). Just as before, we use the weights (w) and training set elements (ε, epsilon). Listing 3.3 provides the code to calculate the local field:
Listing 3.3: Calculate Storkey Local Field
def calculate_local_field(weights, i, j, pattern):
    total = 0
    for k in range(len(pattern)):
        if k != i:
            total = total + weights[i][k] * pattern[k]
    return total
Equation 3.5 uses the local field value to calculate the needed weight change (ΔW):
Equation 3.5: Hopfield Storkey Learning
Listing 3.4 calculates the values of the weight deltas:
Listing 3.4: Storkey Learning
def add_pattern(weights, pattern):
    n = len(pattern)
    sum_matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t1 = (pattern[i] * pattern[j]) / n
            t2 = (pattern[i]
                  * calculate_local_field(weights, j, i, pattern)) / n
            t3 = (pattern[j]
                  * calculate_local_field(weights, i, j, pattern)) / n
            d = t1 - t2 - t3
            sum_matrix[i][j] = sum_matrix[i][j] + d
    return sum_matrix
Once you calculate the weight deltas, you can add them to the existing weight matrix. If there is no existing weight matrix, simply allow the delta weight matrix to become the weight matrix.
Hopfield-Tank Networks
In the last section, you learned that Hopfield networks can recall patterns. They can also optimize problems such as the traveling salesman problem (TSP). Hopfield and Tank (1984) introduced a special variant, the Hopfield-Tank network, to find solutions to optimization problems.
The structure of a Hopfield-Tank network is somewhat different than a standard Hopfield network. The neurons in a regular Hopfield neural network can hold only two discrete values. However, a Hopfield-Tank neuron can have any number in the range 0 to 1. Standard Hopfield networks possess discrete values; Hopfield-Tank networks keep continuous values over a range. Another important difference is that Hopfield-Tank networks use sigmoid activation functions.
To utilize a Hopfield-Tank network, you must create a specialized energy function to express the parameters of each problem to solve. However, producing such an energy function can be a time-consuming task. Hopfield & Tank (2008) demonstrated how to construct an energy function for the traveling salesman problem (TSP). Other optimization methods, such as simulated annealing and Nelder-Mead, do not require the creation of a complex energy function. These general-purpose optimization algorithms typically perform better than the older Hopfield-Tank optimization algorithms.
Because other algorithms are typically better choices for optimization, this book does not cover the optimization Hopfield-Tank network. Nelder-Mead and simulated annealing were demonstrated in Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms. Chapter 6, “Backpropagation Training,” will have a review of stochastic gradient descent (SGD), which is one of the best training algorithms for feedforward neural networks.
Boltzmann Machines
Hinton & Sejnowski (1985) first introduced Boltzmann machines, but this neural network type has not enjoyed widespread use until recently. A special type of Boltzmann machine, the restricted Boltzmann machine (RBM), is one of the foundational technologies of deep learning and the deep belief neural network (DBNN). In this chapter, we will introduce classic Boltzmann machines. Chapter 9, “Deep Learning,” will include deep learning and the restricted Boltzmann machine.
A Boltzmann machine is essentially a fully connected, two-layer neural network. We refer to these layers as the visible and hidden layers. The visible layer is analogous to the input layer in feedforward neural networks. Despite the fact that a Boltzmann machine has a hidden layer, it functions more as an output layer. This difference in the meaning of hidden layer is often a source of confusion between Boltzmann machines and feedforward neural networks. The Boltzmann machine has no hidden layer between the input and output layers. Figure 3.3 shows the very simple structure of a Boltzmann machine:
Figure 3.3: Boltzmann Machine
The above Boltzmann machine has three hidden neurons and four visible neurons. A Boltzmann machine is fully connected because every neuron has a connection to every other neuron. However, no neuron is connected to itself. This connectivity is what differentiates a Boltzmann machine from a restricted Boltzmann machine (RBM), as seen in Figure 3.4:
Figure 3.4: Restricted Boltzmann Machine (RBM)
The above RBM is not fully connected. All hidden neurons are connected to each visible neuron. However, there are no connections among the hidden neurons, nor are there connections among the visible neurons.
Like the Hopfield neural network, a Boltzmann machine's neurons acquire only binary states, either 0 or 1. While there is some research on continuous Boltzmann machines capable of assigning decimal numbers to the neurons, nearly all research on the Boltzmann machine centers on binary units. Therefore, this book will not include information on continuous Boltzmann machines.
Boltzmann machines are also called a generative model. In other words, a Boltzmann machine does not generate constant output. The values presented to the visible neurons of a Boltzmann machine, when considered with the weights, specify a probability that the hidden neurons will assume a value of 1, as opposed to 0.
Although Boltzmann machines and Hopfield neural networks have some characteristics in common, there are several important differences:
Hopfield networks suffer from recognizing certain false patterns.
Boltzmann machines can store a greater capacity of patterns than Hopfield networks.
Hopfield networks require the input patterns to be uncorrelated.
Boltzmann machines can be stacked to form layers.
Boltzmann Machine Probability
When the program queries the value of one of the Boltzmann machine's hidden neurons, it will randomly produce a 0 or 1. Equation 3.6 obtains the calculated probability that the neuron takes the value 1:
Equation 3.6: Probability of Neuron Being One (on)
The above equation will calculate a number between 0 and 1 that represents a probability. For example, if the value 0.75 were generated, the neuron would return a 1 in 75% of the cases. Once it calculates the probability, the program can produce the output by generating a random number between 0 and 1 and returning 1 if the random number is below the probability.
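The probability and the sampling step can be sketched as follows. The logistic form 1 / (1 + exp(-ΔE/T)) is the standard Boltzmann probability; the function names are our own:

```python
import math
import random

def probability_on(delta_e, temperature):
    # Probability that the neuron takes the value 1, given the energy
    # delta and the system temperature.
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

def sample_neuron(delta_e, temperature, rng=random.random):
    # Draw a uniform random number; the neuron is on (1) when the draw
    # falls below the calculated probability.
    return 1 if rng() < probability_on(delta_e, temperature) else 0
```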
The above equation returns the probability of neuron i being on; this probability is calculated from the energy delta (ΔE) at i. The equation also uses the value T, which represents the temperature of the system. The value θ (theta) is the neuron's bias value.
The change in energy is calculated using Equation 3.7:
Equation 3.7: Calculating the Energy Change for a Neuron
This value is the energy difference between 1 (on) and 0 (off) for neuron i. It is calculated using θ (theta), which represents the bias.
Although the values of the individual neurons are stochastic (random), they will typically fall into equilibrium. To reach this equilibrium, you can repeatedly calculate the network. Each time, a unit is chosen while Equation 3.6 sets its state. After running for an adequate period of time at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy.
In other words, the log probabilities of global states become linear in their energies. This relationship is true when the machine is at thermal equilibrium, which means that the probability distribution of global states has converged. If we start running the network from a high temperature and gradually decrease it until we reach a thermal equilibrium at a low temperature, then we may converge to a distribution where the energy level fluctuates around the global minimum. We call this process simulated annealing.
Applying the Boltzmann Machine
Most research around Boltzmann machines has moved to the restricted Boltzmann machine (RBM) that we will explain in Chapter 9, “Deep Learning.” In this section, we will focus on the older, unrestricted form of the Boltzmann machine, which has been applied to both optimization and recognition problems. We will demonstrate an example of each type, beginning with an optimization problem.
Traveling Salesman Problem
The traveling salesman problem (TSP) is a classic computer science problem that is difficult to solve with traditional programming techniques. Artificial intelligence can be applied to find potential solutions to the TSP. The program must determine the order in which to visit a fixed set of cities that minimizes the total distance covered. The traveling salesman is called a combinatorial problem. If you are already familiar with the TSP, or you have read about it in a previous volume in this series, you can skip this section.
The TSP involves determining the shortest route for a traveling salesman who must visit a certain number of cities. Although he can begin and end in any city, he may visit each city only once. The TSP has several variants, some of which allow multiple visits to cities or assign different values to cities. The TSP in this chapter simply seeks the shortest possible route to visit each city one time. Figure 3.5 shows the TSP problem used here, as well as a potential shortest route:
Figure 3.5: The Traveling Salesman
Finding the shortest route may seem like an easy task for a normal iterative program. However, as the number of cities increases, the number of possible combinations increases drastically. If the problem has one or two cities, only one or two routes are possible. If it includes three cities, the possible routes increase to six. The following list shows how quickly the number of paths grows:
1 city has 1 path
2 cities have 2 paths
3 cities have 6 paths
4 cities have 24 paths
5 cities have 120 paths
6 cities have 720 paths
7 cities have 5,040 paths
8 cities have 40,320 paths
9 cities have 362,880 paths
10 cities have 3,628,800 paths
11 cities have 39,916,800 paths
12 cities have 479,001,600 paths
13 cities have 6,227,020,800 paths
...
50 cities have 3.041*10^64 paths
In the above table, the formula to calculate the total paths is the factorial of the number of cities, n, computed with the factorial operator (!). The factorial of some arbitrary value n is given by n * (n - 1) * (n - 2) * ... * 3 * 2 * 1. These values become incredibly large when a program must do a brute-force search. The traveling salesman problem is an example of a non-deterministic polynomial time (NP) hard problem. Informally, NP-hard is defined as any problem that lacks an efficient way to find a correct solution. The TSP fits this definition for more than 10 cities. You can find a formal definition of NP-hard in Computers and Intractability: A Guide to the Theory of NP-Completeness (Garey, 1979).
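The table above is simply the factorial function, and a couple of lines confirm its entries:

```python
import math

def route_count(cities):
    # n! = n * (n - 1) * ... * 2 * 1 possible orderings of n cities.
    return math.factorial(cities)
```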
Dynamic programming is another common approach to the traveling salesman problem, as seen in the xkcd.com comic in Figure 3.6:
Figure 3.6: The Traveling Salesman (from xkcd.com)
Although this book does not include a full discussion of dynamic programming, understanding its essential function is valuable. Dynamic programming breaks a large problem, such as the TSP, into smaller problems. You can reuse work for many of the smaller problems, thereby decreasing the number of iterations required by a brute-force solution.
Unlike brute-force solutions and dynamic programming, a genetic algorithm is not guaranteed to find the best solution. Although it will find a good solution, the score might not be the best. A genetic algorithm can produce an acceptable solution for the 50-city problem in a matter of minutes.
Optimization Problems
To use the Boltzmann machine for an optimization problem, it is necessary to represent a TSP solution in such a way that it fits onto the binary neurons of the Boltzmann machine. Hopfield (1984) devised an encoding for the TSP that both Boltzmann and Hopfield neural networks commonly use to represent this combinatorial problem.
The algorithm arranges the neurons of the Hopfield or Boltzmann machine on a square grid with the number of rows and columns equal to the number of cities. Each column represents a city, and each row corresponds to a segment in the journey. The number of segments in the journey is equal to the number of cities, resulting in a square grid. Each row in the matrix should have exactly one column with a value of 1. This value designates the destination city for each of the trip segments. Consider the city path shown in Figure 3.7:
Figure 3.7: Four Cities to Visit
Because the problem includes four cities, the solution requires a four-by-four grid. The first city visited is City #0. Therefore, the program marks 1 in the first column of the first row. Likewise, visiting City #3 second produces a 1 in the final column of the second row. Figure 3.8 shows the complete path:
Figure 3.8: Encoding of Four Cities
Of course, Boltzmann machines do not arrange neurons in a grid. To represent the above path as a vector of values for the neurons, the rows are simply placed sequentially. That is, the matrix is flattened in a row-wise manner, resulting in the following vector:
[1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0]
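The row-wise flattening can be sketched as a helper that converts a visiting order into the neuron vector; the tour [0, 3, 1, 2], as implied by the vector above, reproduces it exactly:

```python
def encode_tour(tour):
    # Segment k contributes one row of the grid, with a single 1 in the
    # column of the city visited at that step; the rows are laid out
    # sequentially in the output vector.
    n = len(tour)
    vector = [0] * (n * n)
    for segment, city in enumerate(tour):
        vector[segment * n + city] = 1
    return vector
```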
To create a Boltzmann machine that can provide a solution to the TSP, the program must align the weights and biases in such a way that allows the states of the Boltzmann machine neurons to stabilize at a point that minimizes the total distance between cities. Keep in mind that the above grid can also find itself in many invalid states. Therefore, a valid grid must have the following:
A single 1 value per row.
A single 1 value per column.
As a result, the program needs to construct the weights so that the Boltzmann machine will not reach equilibrium in an invalid state. Listing 3.5 shows the pseudocode that will generate this weight matrix:
Listing 3.5: Boltzmann Weights for TSP
gamma = 7
# Source
for source_tour in range(NUM_CITIES):
    for source_city in range(NUM_CITIES):
        source_index = source_tour * NUM_CITIES + source_city
        # Target
        for target_tour in range(NUM_CITIES):
            for target_city in range(NUM_CITIES):
                target_index = target_tour * NUM_CITIES + target_city
                # Calculate the weight
                weight = 0
                # Diagonal weight is 0
                if source_index != target_index:
                    # Determine the next and previous segments in the tour.
                    # Wrap between 0 and the last segment.
                    prev_target_tour = (target_tour - 1) % NUM_CITIES
                    next_target_tour = (target_tour + 1) % NUM_CITIES
                    # If same tour segment or city, then -gamma
                    if (source_tour == target_tour) \
                            or (source_city == target_city):
                        weight = -gamma
                    # If next or previous segment, -distance
                    elif (source_tour == prev_target_tour) \
                            or (source_tour == next_target_tour):
                        weight = -distance(source_city, target_city)
                    # Otherwise 0
                set_weight(source_index, target_index, weight)
        # All biases are -gamma/2
        set_bias(source_index, -gamma / 2)
Figure 3.9 displays part of the created weight matrix for four cities:
Figure 3.9: Boltzmann Machine Weights for TSP (4 cities)
Depending on your viewing device, you might have difficulty reading the above grid. Therefore, you can generate it for any number of cities with the Javascript utility at the following URL:
http://www.heatonresearch.com/aifh/vol3/boltzmann_tsp_grid.html
Essentially, the weights have the following specifications:
Matrix diagonal is assigned to 0. Shown as “\” in Figure 3.9.
Same source and target position, set to -γ (gamma). Shown as -g in Figure 3.9.
Same source and target city, set to -γ (gamma). Shown as -g in Figure 3.9.
Source and target next/previous cities, set to -distance. Shown as d(x,y) in Figure 3.9.
Otherwise, set to 0.
The matrix is symmetric: the weight from neuron i to neuron j equals the weight from neuron j to neuron i.
Boltzmann Machine Training
The previous section showed the use of hard-coded weights to construct a Boltzmann machine that was capable of finding solutions to the TSP. The program constructed these weights through its knowledge of the problem. Manually setting the weights is a necessary and difficult step for applying Boltzmann machines to optimization problems. However, this book will not include information about constructing weight matrices for general optimization problems because Nelder-Mead and simulated annealing are more often used as general-purpose algorithms.
Chapter Summary
In this chapter, we explained several classic neural network types. Since McCulloch and Pitts (1943) introduced the neural network, many different neural network types have been invented. We have focused primarily on the classic neural network types that still have relevance and that establish the foundation for other architectures that we will cover in later chapters of the book.
The self-organizing map (SOM) is an unsupervised neural network type that can cluster data. The SOM has an input neuron count equal to the number of attributes for the data to be clustered. An output neuron count specifies the number of groups into which the data should be clustered.
The Hopfield neural network is a simple neural network type that can recognize patterns and optimize problems. You must create a special energy function for each type of optimization problem that requires the Hopfield neural network. Because of this quality, programmers often choose algorithms like Nelder-Mead or simulated annealing instead of the Hopfield neural network for optimization problems.
The Boltzmann machine is a neural network architecture that shares many characteristics with the Hopfield neural network. However, unlike the Hopfield network, Boltzmann machines can be stacked. This stacking ability allows the Boltzmann machine to play a central role in the implementation of the deep belief neural network (DBNN), the basis of deep learning.
In the next chapter, we will examine the feedforward neural network, which remains one of the most popular neural network types. The chapter will focus on classic feedforward neural networks that use sigmoid and hyperbolic tangent activation functions. New training algorithms, layer types, activation functions and other innovations allow the classic feedforward neural network to be used with deep learning.
Chapter 4: Feedforward Neural Networks
Classification
Regression
Network Layers
Normalization
In this chapter, we shall examine one of the most common neural network architectures, the feedforward neural network. Because of its versatility, the feedforward neural network architecture is very popular. Therefore, we will explore how to train it and how it processes a pattern.
The term feedforward describes how this neural network processes and recalls patterns. In a feedforward neural network, each layer of the neural network contains connections to the next layer. For example, these connections extend forward from the input to the hidden layer, but no connections move backward. This arrangement differs from the Hopfield neural network featured in the previous chapter. The Hopfield neural network was fully connected, and its connections were both forward and backward. We will analyze the structure of a feedforward neural network and the way it recalls a pattern later in the chapter.
We can train feedforward neural networks with a variety of techniques from the broad category of backpropagation algorithms, a form of supervised training that we will discuss in greater detail in the next chapter. In this chapter, we will focus on applying optimization algorithms to train the weights of a neural network. If you need more information about optimization algorithms, Volumes 1 and 2 of Artificial Intelligence for Humans contain sections on this subject. Although we can employ several optimization algorithms to train the weights, we will primarily direct our attention to simulated annealing.
Optimization algorithms adjust a vector of numbers to achieve a good score from an objective function. The objective function scores the neural network based on how closely its output matches the expected output. This score allows any optimization algorithm to train neural networks.
A feedforward neural network is similar to the types of neural networks that we have already examined. Just like other types of neural networks, the feedforward neural network begins with an input layer that may connect to a hidden layer or to the output layer. If it connects to a hidden layer, the hidden layer can subsequently connect to another hidden layer or to the output layer. Any number of hidden layers can exist.
Feedforward Neural Network Structure
In Chapter 1, "Neural Network Basics," we discussed that neural networks could have multiple hidden layers and analyzed the purposes of these layers. In this chapter, we will focus more on the structure of the input and output neurons, beginning with the structure of the output layer. The type of problem dictates the structure of the output layer. A classification neural network will have an output neuron for each class, whereas a regression neural network will have one output neuron.
Single-Output Neural Networks for Regression
Though feedforward neural networks can have more than one output neuron, we will begin by looking at a single-output neuron network in a regression problem. A regression network is capable of predicting a single numeric value. Figure 4.1 illustrates a single-output feedforward neural network:
Figure 4.1: Single-Output Feedforward Network
This neural network will output a single numeric value. We can use this type of neural network in the following ways:
- Regression – Compute a number based on the inputs. (e.g., How many miles per gallon (MPG) will a specific type of car achieve?)
- Binary Classification – Decide between two options, based on the inputs. (e.g., Given its characteristics, is a tumor cancerous?)
We provide a regression example for this chapter that utilizes data about various car models and predicts the miles per gallon that the car will achieve. You can find this data set at the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
A small sampling of this data is shown here:
mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
18,8,307,130,3504,12,70,1,"chevrolet chevelle malibu"
15,8,350,165,3693,11,70,1,"buick skylark 320"
18,8,318,150,3436,11,70,1,"plymouth satellite"
16,8,304,150,3433,12,70,1,"amc rebel sst"
For a regression problem, the neural network would use columns such as cylinders, displacement, horsepower, and weight to predict the MPG. These values are all fields in the above listing that specify qualities of each car. In this case, the target is MPG; however, we could also utilize MPG, cylinders, horsepower, weight, and acceleration to predict displacement.
To make the neural network perform regression on multiple values, you might apply multiple output neurons. For example, cylinders, displacement, and horsepower can predict both MPG and weight. Although a multi-output neural network is capable of performing regression on two variables, we don't recommend this technique. You will usually achieve better results with separate neural networks for each regression outcome that you are trying to predict.
Calculating the Output
In Chapter 1, "Neural Network Basics," we explored how to calculate the individual neurons that comprise a neural network. As a brief review, the output of an individual neuron is simply the weighted sum of its inputs and a bias. This summation is passed to an activation function. Equation 4.1 summarizes the calculated output of a neuron:
Equation 4.1: Neuron Output
The neuron multiplies the input vector (x) by the weights (w) and passes the result into an activation function (φ, phi). The bias value is the last value in the weight vector (w), and it is added by concatenating a 1 value to the input. For example, consider a neuron that has two inputs and a bias. If the inputs were 0.1 and 0.2, the input vector would appear as follows:
[0.1, 0.2, 1.0]
In this example, we append the value 1.0 to support the bias weight. We can then calculate the output with the following weight vector:
[0.01, 0.02, 0.3]
The values 0.01 and 0.02 are the weights for the two inputs to the neuron. The value 0.3 is the bias. The weighted sum is calculated as follows:
(0.1 * 0.01) + (0.2 * 0.02) + (1.0 * 0.3) = 0.305
The value 0.305 is then passed to an activation function.
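The calculation above can be sketched in a few lines. The sigmoid activation used here is an assumption for illustration; activation functions are covered separately:

```python
import math

def neuron_output(inputs, weights):
    """Append 1.0 to the inputs for the bias weight, take the
    weighted sum, and pass it through a sigmoid activation."""
    augmented = list(inputs) + [1.0]
    weighted_sum = sum(x * w for x, w in zip(augmented, weights))
    return weighted_sum, 1.0 / (1.0 + math.exp(-weighted_sum))

# The example from the text: inputs 0.1 and 0.2, weights 0.01 and
# 0.02, and a bias weight of 0.3.
weighted_sum, activated = neuron_output([0.1, 0.2], [0.01, 0.02, 0.3])
print(weighted_sum)  # -> approximately 0.305
```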
Calculating an entire neural network is essentially a matter of following this same procedure for each neuron in the network. This process allows you to work your way from the input neurons to the output. You can implement this process by creating objects for each connection in the network or by aligning these connection values into matrices.
Object-oriented programming allows you to define an object for each neuron and its weights. This approach can produce very readable code, but it has two significant problems:
- The weights are stored across many objects.
- Performance suffers because it takes many function calls and memory accesses to piece all the weights together.
It is valuable to create the weights of the neural network as a single vector. A variety of different optimization algorithms can adjust a vector to perfect a scoring function. Artificial Intelligence for Humans, Volumes 1 & 2 include a discussion of these optimization functions. Later in this chapter, we will see how simulated annealing optimizes the weight vector for the neural network.
To construct a weight vector, we will first look at a network that has the following attributes:
- Input Layer: 2 neurons, 1 bias
- Hidden Layer: 2 neurons, 1 bias
- Output Layer: 1 neuron
These characteristics give this network a total of 7 neurons.
You can number these neurons for the vector in the following manner:
Neuron 0: Output 1
Neuron 1: Hidden 1
Neuron 2: Hidden 2
Neuron 3: Bias 2 (set to 1, usually)
Neuron 4: Input 1
Neuron 5: Input 2
Neuron 6: Bias 1 (set to 1, usually)
Graphically, you can see the network as Figure 4.2:
Figure 4.2: Simple Neural Network
You can create several additional vectors to define the structure of the network. These vectors hold index values to allow the quick navigation of the weight vector. These vectors are listed here:
layerFeedCounts: [1, 2, 2]
layerCounts: [1, 3, 3]
layerIndex: [0, 1, 4]
layerOutput: [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
weightIndex: [0, 3, 9]
Each vector stores the values for the output layer first and works its way to the input layer. The layerFeedCounts vector holds the count of non-bias neurons in each layer, while the layerCounts vector holds the total neuron count of each layer, including bias neurons. The layerOutput vector holds the current value of each neuron. Initially, all neurons start with 0.0 except for the bias neurons, which start at 1.0. The layerIndex vector holds indexes to where each layer begins in the layerOutput vector. The weightIndex holds indexes to the location of each layer in the weight vector.
The weights are stored in their own vector and structured as follows:
Weight 0: H1 -> O1
Weight 1: H2 -> O1
Weight 2: B2 -> O1
Weight 3: I1 -> H1
Weight 4: I2 -> H1
Weight 5: B1 -> H1
Weight 6: I1 -> H2
Weight 7: I2 -> H2
Weight 8: B1 -> H2
Once the vectors have been arranged, calculating the output of the neural network is relatively easy. Listing 4.1 can accomplish this calculation:
Listing 4.1: Calculate Feedforward Output
def compute(net, input):
    sourceIndex = len(net.layerOutput) \
        - net.layerCounts[len(net.layerCounts) - 1]
    # Copy the input into the layerOutput vector
    array_copy(input, 0, net.layerOutput, sourceIndex, net.inputCount)
    # Calculate each layer, working from the input toward the output
    for i in reversed(range(1, len(net.layerIndex))):
        compute_layer(net, i)
    # Create result
    result = vector(net.outputCount)
    array_copy(net.layerOutput, 0, result, 0, net.outputCount)
    return result

def compute_layer(net, currentLayer):
    inputIndex = net.layerIndex[currentLayer]
    outputIndex = net.layerIndex[currentLayer - 1]
    inputSize = net.layerCounts[currentLayer]
    outputSize = net.layerFeedCounts[currentLayer - 1]
    index = net.weightIndex[currentLayer - 1]
    limit_x = outputIndex + outputSize
    limit_y = inputIndex + inputSize
    # Weight values
    for x in range(outputIndex, limit_x):
        sum = 0
        for y in range(inputIndex, limit_y):
            sum += net.weights[index] * net.layerOutput[y]
            index = index + 1
        net.layerSums[x] = sum
        net.layerOutput[x] = sum
    net.activationFunctions[currentLayer - 1] \
        .activation_function(
            net.layerOutput, outputIndex, outputSize)
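A self-contained version of Listing 4.1, specialized to the 2-2-1 network of Figure 4.2, may make the vector layout clearer. The nine weight values and the sigmoid activation are arbitrary assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Structure vectors for the 2-2-1 network, output layer first.
layer_feed_counts = [1, 2, 2]
layer_counts = [1, 3, 3]
layer_index = [0, 1, 4]
weight_index = [0, 3, 9]
# Neuron values; bias neurons (indexes 3 and 6) are fixed at 1.0.
layer_output = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
# Nine weights in the order H1->O1 ... B1->H2; values are arbitrary.
weights = [0.1, -0.2, 0.3, 0.4, 0.5, -0.6, 0.7, -0.8, 0.9]

def compute(inputs):
    # Copy the inputs into the input-layer slots (indexes 4 and 5).
    layer_output[4:6] = inputs
    # Work from the input layer toward the output layer.
    for layer in reversed(range(1, len(layer_index))):
        in_start = layer_index[layer]
        out_start = layer_index[layer - 1]
        in_size = layer_counts[layer]
        out_size = layer_feed_counts[layer - 1]
        index = weight_index[layer - 1]
        for x in range(out_start, out_start + out_size):
            total = 0.0
            for y in range(in_start, in_start + in_size):
                total += weights[index] * layer_output[y]
                index += 1
            layer_output[x] = sigmoid(total)
    return layer_output[0]

print(compute([0.1, 0.2]))  # a single value between 0 and 1
```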
Initializing Weights
The weights of a neural network determine the output for the neural network. The process of training can adjust these weights so the neural network produces useful output. Most neural network training algorithms begin by initializing the weights to a random state. Training then progresses through a series of iterations that continuously improve the weights to produce better output.
The random weights of a neural network impact how well that neural network can be trained. If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons. If you add a new layer, and the network's performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:
- How consistently does this algorithm provide good weights?
- How much of an advantage do the weights of the algorithm provide?
One of the most common, yet least effective, approaches to weight initialization is to set the weights to random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If you want to ensure that you get the same set of random weights each time, you should use a seed. The seed specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000.
Not all seeds are created equal. One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others. In fact, the weights can be so bad that training is impossible. If you find that you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
Because weight initialization is a problem, there has been considerable research around it. Over the years we have studied this research and added six different weight initialization routines to the Encog project. From our research, the Xavier weight initialization algorithm, introduced by Glorot & Bengio (2010), produces good weights with reasonable consistency. This relatively simple algorithm uses normally distributed random numbers.
To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate. In fact, normally distributed random numbers are centered on a mean (μ, mu) that is typically 0. If 0 is the center (mean), then you will get an equal number of random numbers above and below 0. The next question is how far these random numbers will venture from 0. In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer. However, the reality is that you will more likely see random numbers that are between 0 and three standard deviations from the center.
The standard deviation σ (sigma) parameter specifies the size of this standard deviation. For example, if you specified a standard deviation of 10, then you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected. Figure 4.3 shows the normal distribution:
Figure 4.3: The Normal Distribution
The above figure illustrates that the center, which in this case is 0, has a probability density of about 0.4. Additionally, the probability decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviation is, you are able to control the range of random numbers that you will receive.
Most programming languages have the capability of generating normally distributed random numbers. In general, the Box-Muller algorithm is the basis for this functionality. The examples in this volume will either use the built-in normal random number generator or the Box-Muller algorithm to transform regular, uniformly distributed random numbers into a normal distribution. Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms contains an explanation of the Box-Muller algorithm, but you do not necessarily need to understand it in order to grasp the ideas in this book.
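For languages without a built-in normal generator, the transform can be sketched as follows. This is a minimal form of Box-Muller that discards the second value the transform could produce:

```python
import math
import random

def box_muller(mu=0.0, sigma=1.0):
    """One normally distributed random number built from two
    uniformly distributed random numbers."""
    u1 = random.random()
    while u1 == 0.0:  # guard against log(0)
        u1 = random.random()
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z

random.seed(1234)  # seed so the run is repeatable
samples = [box_muller(0.0, 10.0) for _ in range(100000)]
mean = sum(samples) / len(samples)
# With sigma = 10, nearly all samples fall between -30 and +30.
inside = sum(1 for s in samples if -30.0 <= s <= 30.0) / len(samples)
```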
The Xavier weight initialization sets all of the weights to normally distributed random numbers. These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights. Specifically, Equation 4.2 can determine the standard deviation:
Equation 4.2: Standard Deviation for Xavier Algorithm
The above equation shows how to obtain the variance for all of the weights. The square root of the variance is the standard deviation. Most random number generators accept a standard deviation rather than a variance. As a result, you usually need to take the square root of the above equation. Figure 4.4 shows how one layer might be initialized:
Figure 4.4: Xavier Initialization of a Layer
This process is completed for each layer in the neural network.
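As a sketch, assuming Equation 4.2 gives the variance as 2 divided by the sum of the layer's input and output counts, Xavier initialization reduces to a few lines:

```python
import math
import random

def xavier_weights(n_in, n_out, rng=random):
    """Initialize one layer of weights with the Xavier scheme:
    zero-mean normal random numbers whose variance is
    2 / (n_in + n_out)."""
    sigma = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, sigma) for _ in range(n_out)]
            for _ in range(n_in)]

# Initialize a hypothetical 13-input, 20-neuron hidden layer.
layer = xavier_weights(13, 20)
```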
Radial-Basis Function Networks
Radial-basis function (RBF) networks are a type of feedforward neural network introduced by Broomhead and Lowe (1988). These networks can be used for both classification and regression. Though they can solve a variety of problems, RBF networks seem to be losing popularity. By their very definition, RBF networks cannot be used in conjunction with deep learning.
The RBF network utilizes a parameter vector, a model that specifies weights and coefficients, in order to allow the input to generate the correct output. By adjusting a random parameter vector, the RBF network produces output consistent with the iris data set. The process of adjusting the parameter vector to produce the desired output is called training, and many different methods exist for training an RBF network. The parameter vector also represents the network's long-term memory.
In the next section, we will briefly review RBFs and describe the exact makeup of these vectors.
Radial-Basis Functions
Because many AI algorithms utilize radial-basis functions, they are a very important concept to understand. A radial-basis function is symmetric with respect to its center, which is usually somewhere along the x-axis. The RBF reaches its maximum value, or peak, at the center. A typical setting for the peak in RBF networks is 1, while the center varies according to the problem.
RBFs can have many dimensions. Regardless of the number of dimensions in the vector passed to the RBF, its output will always be a single scalar value.
RBFs are quite common in AI. We will start with the most prevalent, the Gaussian function. Figure 4.5 shows a graph of a 1D Gaussian function centered at 0:
Figure 4.5: Gaussian Function
You might recognize the above curve as a normal distribution or a bell curve, which is a radial-basis function. RBFs, such as the Gaussian function, can selectively scale numeric values. Consider Figure 4.5 above. If you applied this function to scale numeric values, the result would have maximum intensity at the center. As you moved from the center, the intensity would diminish in either the positive or negative direction.
Before we can look at the equation for the Gaussian RBF, we must consider how to process multiple dimensions. RBFs accept multi-dimensional input and return a single value by calculating the distance between the input and the center vector. This distance is called r. The RBF center and the input to the RBF must always have the same number of dimensions for the calculation to occur. Once we calculate r, we can determine the individual RBF. All of the RBFs use this calculated r.
Equation 4.3 shows how to calculate r:
Equation 4.3: Calculating r
The double vertical bars that you see in the above equation signify that the function describes a distance, or a norm. In certain cases, these distances can vary; however, RBFs typically utilize Euclidean distance. As a result, the examples that we provide in this book always apply the Euclidean distance. Therefore, r is simply the Euclidean distance between the center and the x vector. In each of the RBFs in this section, we will use this value r. Equation 4.4 shows the equation for a Gaussian RBF:
Equation 4.4: Gaussian RBF
Once you've calculated r, determining the RBF is fairly easy. The Greek letter φ, which you see at the left of the equation, always represents the RBF. The constant e in Equation 4.4 represents Euler's number, or the natural base, and is approximately 2.71828.
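Equations 4.3 and 4.4 can be combined into a short sketch; the form exp(-r²) used here is one common way to write the Gaussian RBF:

```python
import math

def euclidean_r(x, center):
    """r: the Euclidean distance between the input vector and the
    RBF center (Equation 4.3)."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))

def gaussian_rbf(x, center):
    """Gaussian RBF, phi(r) = exp(-r^2); the peak value of 1
    occurs when the input equals the center."""
    r = euclidean_r(x, center)
    return math.exp(-r * r)

print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))  # -> 1.0 (at the center)
```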
Radial-Basis Function Networks
RBF networks provide a weighted summation of one or more radial-basis functions; each of these functions receives the weighted input attributes in order to predict the output. Consider the RBF network as a long equation that contains the parameter vector. Equation 4.5 shows the equation needed to calculate the output of this network:
Equation 4.5: The RBF Network
Note that the double vertical bars in the above equation signify that you must calculate the distance. Because these symbols do not specify which distance algorithm to use, you can select the algorithm. In the above equation, x is the input vector of attributes; c is the vector center of the RBF; p is the chosen RBF (Gaussian, for example); a is the vector coefficient (or weight) for each RBF; and b specifies the vector coefficient to weight the input attributes.
In our example, we will apply an RBF network to the iris data set. Figure 4.6 provides a graphic representation of this application:
Figure 4.6: The RBF Network for the Iris Data
The above network contains four inputs (the length and width of petals and sepals) that indicate the features that describe each iris species. The above diagram assumes that we are using one-of-n encoding for the three different iris species. Using equilateral encoding for only two outputs is also possible. To keep things simple, we will use one-of-n and arbitrarily choose three RBFs. Even though additional RBFs allow the model to learn more complex data sets, they require more time to process.
Arrows represent all coefficients from the equation. In Equation 4.5, b represents the arrows between the input attributes and the RBFs. Similarly, a represents the arrows between the RBFs and the summation. Notice also the bias box, which is a synthetic function that always returns a value of 1. Because the bias function's output is constant, it does not require inputs. The weights from the bias to the summation specify the y-intercept for the equation. In short, bias is not always bad. This case demonstrates that bias is an important component of the RBF network. Bias nodes are also very common in neural networks.
Because multiple summations exist, you can see the development of a classification problem. The highest summation specifies the predicted class. For a regression problem, the model would instead output a single numeric value.
You will also notice that Figure 4.6 contains a bias node in the place of an additional RBF. Unlike the RBF, the bias node does not accept any input. It always outputs a constant value of 1. Of course, this constant value of 1 is multiplied by a coefficient value, which causes the coefficient to be added directly to the output, regardless of the input. Bias nodes are very useful because they allow the RBF layer to output values even when the input is 0 or otherwise low.
The long-term memory vector for the RBF network has several different components:
- Input coefficients
- Output/summation coefficients
- RBF width scalars (same width in all dimensions)
- RBF center vectors
The RBF network will store all of these components as a single vector that will become its long-term memory. Then an optimization algorithm can set the vector to values that will produce the correct iris species for the features presented. This book contains several optimization algorithms that can train an RBF network.
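Putting the pieces together, a single output of the network can be sketched as below. This is a simplified reading of Equation 4.5 in which each width scalar multiplies the distance r; all parameter values here are illustrative placeholders, not trained long-term memory:

```python
import math

def rbf_network(x, centers, widths, out_coef, bias_coef):
    """Weighted sum of Gaussian RBFs plus a bias term. Each RBF
    sees the distance from the input to its center, scaled by a
    width scalar."""
    total = bias_coef  # the bias node always contributes its coefficient
    for center, width, a in zip(centers, widths, out_coef):
        r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
        total += a * math.exp(-(width * r) ** 2)
    return total

# Two RBFs over a 2D input; all values are made-up placeholders.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 1.0]
out_coef = [0.5, -0.25]
y = rbf_network([0.0, 0.0], centers, widths, out_coef, bias_coef=0.1)
```

For a classifier such as the iris example, one such summation would be computed per class, and the largest would pick the predicted species.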
In conclusion, this introduction provided a basic overview of vectors, distance, and RBF networks. Since this discussion included only the prerequisite material to understand Volume 3, refer to Volumes 1 and 2 for a more thorough explanation of these topics.
Normalizing Data
Normalization was briefly mentioned previously in this book. In this section, we will see exactly how it is performed. Data are not usually presented to the neural network in exactly the same raw form as you found them. Usually data are scaled to a specific range in a process called normalization. There are many different ways to normalize data. For a full summary, refer to Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms. This chapter will present a few normalization methods most useful for neural networks.
One-of-N Encoding
If you have a categorical value, such as the species of an iris, the make of an automobile, or the digit label in the MNIST data set, you should use one-of-n encoding. This type of encoding is sometimes referred to as one-hot encoding. To encode in this way, you would use one output neuron for each class in the problem. Recall the MNIST data set from the book's introduction, where you have images of the digits between 0 and 9. This problem is most commonly encoded as ten output neurons with a softmax activation function that gives the probability of the input being each of these digits. Using one-of-n encoding, the ten digits might be encoded as follows:
0 -> [1,0,0,0,0,0,0,0,0,0]
1 -> [0,1,0,0,0,0,0,0,0,0]
2 -> [0,0,1,0,0,0,0,0,0,0]
3 -> [0,0,0,1,0,0,0,0,0,0]
4 -> [0,0,0,0,1,0,0,0,0,0]
5 -> [0,0,0,0,0,1,0,0,0,0]
6 -> [0,0,0,0,0,0,1,0,0,0]
7 -> [0,0,0,0,0,0,0,1,0,0]
8 -> [0,0,0,0,0,0,0,0,1,0]
9 -> [0,0,0,0,0,0,0,0,0,1]
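The table above reduces to a one-line encoder. The off/on parameters are a convenience of this sketch, so the same function can produce the -1/+1 variant used with hyperbolic tangent outputs:

```python
def one_of_n(index, num_classes, off=0.0, on=1.0):
    """Encode a class index as a one-of-n (one-hot) vector."""
    return [on if i == index else off for i in range(num_classes)]

# Digit 2 from the table above:
print(one_of_n(2, 10))
```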
One-of-n encoding should always be used when the classes have no ordering. Another example of this type of encoding is the make of an automobile. Usually the list of automakers is unordered unless there is some meaning you wish to convey by this ordering. For example, you might order the automakers by the number of years in business. However, this ordering should only be used if the number of years in business has meaning to your problem. If there is truly no order, then one-of-n should always be used.
Because you can easily order the digits, you might wonder why we use one-of-n encoding for them. However, the order of the digits does not help the program recognize them. The fact that "1" and "2" are numerically next to each other does nothing to help the program recognize the image. Therefore, we should not use a single output neuron that simply outputs the digit recognized. The digits 0-9 are categories, not actual numeric values. Encoding categories with a single numeric value is detrimental to the neural network's decision process.
Both the input and output can use one-of-n encoding. The above listing used 0's and 1's. If you are using the rectified linear unit (ReLU) and softmax activation functions, this type of encoding is normal. However, if you are working with a hyperbolic tangent activation function, you should utilize a value of -1 for the 0's to match the hyperbolic tangent's range of -1 to 1.
If you have an extremely large number of classes, one-of-n encoding can become cumbersome because you must have a neuron for every class. In such cases, you have several options. First, you might find a way to order your categories. With this ordering, your categories can now be encoded as a numeric value, which would be the current category's position within the ordered list.
Another approach to dealing with an extremely large number of categories is term frequency-inverse document frequency (TF-IDF) encoding, because each class essentially becomes the probability of that class's occurrence relative to the others. In this way, TF-IDF allows the program to map a large number of classes to a single neuron. A complete discussion of TF-IDF is beyond the scope of this book; however, it is built into many machine learning frameworks for languages such as R and Python.
Range Normalization
If you have a real number or an ordered list of categories, you might choose range normalization because it simply maps the input data's range into the range of your activation function. Sigmoid, ReLU and softmax use a range between 0 and 1, whereas hyperbolic tangent uses a range between -1 and 1.
You can normalize a number with Equation 4.6:
Equation 4.6: Normalize to a Range
To perform the normalization, you need the high and low values of the data to be normalized, given by dl and dh in the equation above. Similarly, you need the high and low values to normalize into (usually 0 and 1), given by nl and nh.
Sometimes you will need to undo the normalization performed on a number and return it to a denormalized state. Equation 4.7 performs this operation:
Equation 4.7: Denormalize from a Range
A very simple way to think of range normalization is percentages. Consider the following analogy. You see an advertisement stating that you will receive a $10 (USD) reduction on a product, and you have to decide if this deal is worthwhile. If you are buying a t-shirt, this offer is probably a good deal; however, if you are buying a car, $10 does not really matter. Furthermore, you need to be familiar with the current value of US dollars in order to make your decision. The situation changes if you learn that the merchant had offered a 10% discount. The value is now more meaningful: no matter if you are buying a t-shirt, a car or even a house, the 10% discount has clear ramifications on the problem because it transcends currencies. In other words, the percentage is a type of normalization. Just like in the analogy, normalizing to a range helps the neural network evaluate all inputs with equal significance.
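Equations 4.6 and 4.7 translate directly into code. This sketch assumes the standard form of range normalization, with dl/dh as the data range and nl/nh as the target range:

```python
def normalize_range(x, dl, dh, nl=0.0, nh=1.0):
    """Equation 4.6: map x from the data range [dl, dh] into the
    normalized range [nl, nh]."""
    return ((x - dl) * (nh - nl)) / (dh - dl) + nl

def denormalize_range(x, dl, dh, nl=0.0, nh=1.0):
    """Equation 4.7: undo the mapping above."""
    return ((x - nl) * (dh - dl)) / (nh - nl) + dl

# Map a value of 5 from the range [0, 10] into [0, 1], and back.
print(normalize_range(5.0, 0.0, 10.0))    # -> 0.5
print(denormalize_range(0.5, 0.0, 10.0))  # -> 5.0
```

Passing `nl=-1.0, nh=1.0` targets the hyperbolic tangent's range instead.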
Z-Score Normalization
Z-score normalization is the most common normalization for either a real number or an ordered list. For nearly all applications, z-score normalization should be used in place of range normalization. This normalization type is based on the statistical concept of z-scores, the same technique used for grading exams on a curve. Z-scores provide even more information than percentages.
Consider the following example. Student A scored 85% of the points on her exam. Student B scored 75% of the points on his exam. Which student earned the better grade? If the professor is simply reporting the percentage of correct points, then student A earned a better score. However, you might change your answer if you learned that the average (mean) score for student A's very easy exam was 95%. Similarly, you might reconsider your position if you discovered that student B's class had an average score of 65%. Student B performed above average on his exam. Even though student A earned a better score, she performed below average. To truly report a curved score (a z-score), you must have the mean score and the standard deviation. Equation 4.8 shows the calculation of a mean:
Equation 4.8: Calculate the Arithmetic Mean
You can calculate the mean (μ, mu) by adding all of the scores and dividing by the number of scores. This process is the same as taking an average. Now that you have the average, you need the standard deviation. If you had a mean score of 50 points, then everyone taking the exam varied from the mean by some amount. The average amount that students varied from the mean is essentially the standard deviation. Equation 4.9 shows the calculation of the standard deviation (σ, sigma):
Equation 4.9: Standard Deviation
Essentially, the process of taking a standard deviation is squaring each score's difference from the mean, summing these values, dividing by the number of scores, and taking the square root of this total. Now that you have the standard deviation, you can calculate the z-score with Equation 4.10:
Equation 4.10: Z-Score
Listing 4.2 shows the pseudocode needed to calculate a z-score:
Listing 4.2: Calculate a Z-Score
from math import sqrt

# Data to score:
data = [5, 10, 3, 20, 4]
# Sum the values
sum = 0
for d in data:
    sum = sum + d
# Calculate the mean
mean = float(sum) / len(data)
print("Mean: " + str(mean))
# Calculate the variance
variance = 0
for d in data:
    variance = variance + ((mean - d) ** 2)
variance = variance / len(data)
print("Variance: " + str(variance))
# Calculate the standard deviation
sdev = sqrt(variance)
print("Standard Deviation: " + str(sdev))
# Calculate the z-scores
zscore = []
for d in data:
    zscore.append((d - mean) / sdev)
print("Z-Scores: " + str(zscore))
The above code will result in the following output:
Mean: 8.4
Variance: 39.440000000000005
Standard Deviation: 6.280127387243033
Z-Scores: [-0.5413902920037097, 0.2547719021193927, -0.8598551696529507,
1.8470962903655976, -0.7006227308283302]
The z-score is a numeric value where 0 represents a score that is exactly the mean. A positive z-score is above average; a negative z-score is below average. To help visualize z-scores, consider the following mapping between z-scores and letter grades:
< -2.0 = D+
-2.0 = C-
-1.5 = C
-1.0 = C+
-0.5 = B-
0.0 = B
+0.5 = B+
+1.0 = A-
+1.5 = A
+2.0 = A+
We took the mapping listed above from an undergraduate syllabus. There is a great deal of variation in z-score to letter grade mappings. Most professors will set the 0.0 z-score to either a C or a B, depending on whether the professor/university considers a C or a B to represent an average grade. The above professor considered B to be average. The z-score works well for neural network input because it is centered at 0 and will very rarely go above +3 or below -3.
Complex Normalization
The input to a neural network is commonly called its feature vector. The process of creating a feature vector is critical to mapping your raw data to a form that the neural network can comprehend. The process of mapping the raw data to a feature vector is called encoding. To see this mapping at work, consider the auto MPG data set:
1. mpg: numeric
2. cylinders: numeric, 3 unique
3. displacement: numeric
4. horsepower: numeric
5. weight: numeric
6. acceleration: numeric
7. model year: numeric, 3 unique
8. origin: numeric, 3 unique
9. car name: string (unique for each instance)
To encode the above data, we will use MPG as the output and treat the data set as regression. The MPG feature will be z-score encoded, and it falls within the range of the linear activation function that we will use on the output.
We will discard the car name. Cylinders and model year are both one-of-n encoded, and the remaining fields will be z-score encoded. The following feature vector results:
Input Feature Vector:
Feature 1: cylinders-2, -1 no, +1 yes
Feature 2: cylinders-4, -1 no, +1 yes
Feature 3: cylinders-8, -1 no, +1 yes
Feature 4: displacement z-score
Feature 5: horsepower z-score
Feature 6: weight z-score
Feature 7: acceleration z-score
Feature 8: model year-1977, -1 no, +1 yes
Feature 9: model year-1978, -1 no, +1 yes
Feature 10: model year-1979, -1 no, +1 yes
Feature 11: origin-1, -1 no, +1 yes
Feature 12: origin-2, -1 no, +1 yes
Feature 13: origin-3, -1 no, +1 yes
Output:
mpg z-score
As you can see, the feature vector has grown from the nine raw fields to thirteen features plus an output. A neural network for these data would have thirteen input neurons and a single output. Assuming a single hidden layer of twenty neurons with the ReLU activation, this network would look like Figure 4.7:
Figure 4.7: Simple Regression Neural Network
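The feature vector above can be sketched as an encoding function. The (mean, sdev) statistics below are illustrative placeholders; real values would be computed from the training data, and a record whose cylinder or year value falls outside the listed categories would encode those features as all -1:

```python
def encode_car(cylinders, displacement, horsepower, weight,
               acceleration, model_year, origin, stats):
    """Build the 13-value feature vector described above."""
    def z(name, value):
        mean, sdev = stats[name]
        return (value - mean) / sdev

    def one_of(value, choices):
        # -1 no, +1 yes for each listed category
        return [1.0 if value == c else -1.0 for c in choices]

    features = []
    features += one_of(cylinders, [2, 4, 8])          # features 1-3
    features.append(z("displacement", displacement))  # feature 4
    features.append(z("horsepower", horsepower))      # feature 5
    features.append(z("weight", weight))              # feature 6
    features.append(z("acceleration", acceleration))  # feature 7
    features += one_of(model_year, [77, 78, 79])      # features 8-10
    features += one_of(origin, [1, 2, 3])             # features 11-13
    return features

# Placeholder statistics; real (mean, sdev) pairs would come from
# the training data.
stats = {"displacement": (194.0, 104.0), "horsepower": (104.0, 38.0),
         "weight": (2977.0, 849.0), "acceleration": (15.5, 2.8)}
vec = encode_car(8, 307, 130, 3504, 12.0, 78, 1, stats)
```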
Chapter Summary
Feedforward neural networks are one of the most common algorithms in artificial intelligence. In this chapter, we introduced the multilayer feedforward neural network and the radial-basis function (RBF) neural network. Both of these neural network types can be applied to classification and regression.
Feedforward networks have well-defined layers. The input layer accepts the input from the computer program. The output layer returns the processing result of the neural network to the calling program. Between these layers are hidden neurons that help the neural network to recognize a pattern presented at the input layer and produce the correct result at the output layer.
RBF neural networks use a series of radial-basis functions for their hidden layer. In addition to the weights, it is also possible to change the widths and centers of these RBFs. Though both an RBF and a feedforward network can approximate any function, they go about the process in different ways.
So far, we've seen only how to calculate the values for neural networks. Training is the process by which we adjust the weights of a neural network so that it outputs the values that we desire. To train neural networks, we also need a way to evaluate them. The next chapter introduces both training and validation of neural networks.
Chapter 5: Training & Evaluation
Mean Squared Error
Sensitivity & Specificity
ROC Curve
Simulated Annealing
So far we've seen how to calculate a neural network based on its weights; however, we have not seen where these weight values actually come from. Training is the process where a neural network's weights are adjusted to produce the desired output. Training uses evaluation, which is the process where the output of the neural network is compared against the expected output.
This chapter will cover evaluation and introduce training. Because neural networks can be trained and evaluated in many different ways, we need a consistent method to judge them. An objective function evaluates a neural network and returns a score. Training adjusts the neural network in ways that might achieve better results. Typically, the objective function seeks lower scores, and the process of attempting to achieve lower scores is called minimization. You might also establish maximization problems, in which the objective function seeks higher scores. Most training algorithms can be used for either minimization or maximization problems.
You can optimize the weights of a neural network with any continuous optimization algorithm, such as simulated annealing, particle swarm optimization, genetic algorithms, hill climbing, Nelder-Mead, or random walk. In this chapter, we will introduce simulated annealing as a simple training algorithm. However, in addition to optimization algorithms, you can train neural networks with backpropagation. Chapter 6, "Backpropagation Training," and Chapter 7, "Other Propagation Training," will introduce backpropagation and several algorithms based upon it.
Evaluating Classification
Classification is the process by which a neural network attempts to classify the input into one or more classes. The simplest way of evaluating a classification network is to track the percentage of training set items that were classified incorrectly. We typically score human exams in this manner. For example, you might have taken multiple-choice exams in school in which you had to shade in a bubble for choices A, B, C, or D. If you chose the wrong letter on one question of a 10-question exam, you would earn a 90%. In the same way, we can grade computers; however, most classification algorithms do not simply choose A, B, C, or D. Computers typically report a classification as their percent confidence in each class. Figure 5.1 shows how a computer and a human might both respond to question #1 on an exam:
Figure 5.1: Human Exam versus Computer Classification
As you can see, the human test taker marked the first question as “B.” However, the computer test taker had an 80% (0.8) confidence in “B” and was also somewhat sure with 10% (0.1) on “A.” The computer then distributed the remaining points to the other two choices. In the simplest sense, the machine would get 80% of the score for this question if the correct answer were “B.” The machine would get only 5% (0.05) of the points if the correct answer were “D.”
Binary Classification
Binary classification occurs when a neural network must choose between two options, which might be true/false, yes/no, correct/incorrect, or buy/sell. To see how to use binary classification, we will consider a classification system for a credit card company. This classification system must decide how to respond to a new potential customer. The system will either issue a credit card or decline a credit card.
When you have only two classes to consider, the objective function’s score is based on the number of false positive predictions versus the number of false negatives. False negatives and false positives are both types of errors, and it is important to understand the difference. For the previous example, issuing a credit card would be the positive. A false positive occurs when a credit card is issued to someone who will become a bad credit risk. A false negative happens when a credit card is declined to someone who would have been a good risk.
Because only two options exist, we can choose the mistake that is the more serious type of error, a false positive or a false negative. For most banks issuing credit cards, a false positive is worse than a false negative. Declining a potentially good credit card holder is better than accepting a credit card holder who would cause the bank to undertake expensive collection activities.
A classification problem seeks to assign the input into one or more categories. A binary classification employs a single-output neural network to classify into two categories. Consider the auto MPG dataset that is available from the University of California at Irvine (UCI) machine learning repository at the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
For the auto MPG dataset, we might create classifications for cars built inside of the United States. The field named origin provides information on the location of the car assembly. Thus, the single output neuron would give a number that indicates the probability that the car was built in the USA.
To perform this prediction, you need to change the origin field to hold values between 1 and the low end of the range of the activation function. For example, the low end of the range for the sigmoid function is 0; for the hyperbolic tangent, it is -1. The neural network will output a value that indicates the probability of a car being made in the USA or elsewhere. Values closer to 1 indicate a higher probability of the car originating in the USA; values closer to 0 or -1 indicate a car originating from outside the USA.
You must choose a cutoff value that differentiates these predictions into either USA or non-USA. If USA is 1.0 and non-USA is 0.0, we could simply choose 0.5 as the cutoff value. Consequently, a car with an output of 0.6 would be USA, and 0.4 would be non-USA.
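A short Python sketch can make the cutoff idea concrete. The outputs and labels below are made-up illustration values, not actual auto MPG results:

```python
# Hypothetical illustration: applying a 0.5 cutoff to network outputs and
# tallying the four outcome types. The outputs and labels are made up.

def classify(outputs, cutoff=0.5):
    """Convert raw network outputs into USA (True) / non-USA (False)."""
    return [o >= cutoff for o in outputs]

def confusion_counts(predicted, actual):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, tn, fp, fn

outputs = [0.9, 0.6, 0.45, 0.2]       # network confidence of "USA"
actual = [True, True, True, False]    # true origin of each car
predicted = classify(outputs)
print(confusion_counts(predicted, actual))  # (2, 1, 0, 1)
```

Note how the USA car scored at 0.45 falls below the cutoff and becomes the single false negative in the counts.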
Invariably, this neural network will produce errors as it classifies cars. A USA-made car might yield an output of 0.45; however, because this output is below the cutoff value, the neural network would not put the car in the correct category. Because we designed this neural network to classify USA-made cars, this error would be called a false negative. In other words, the neural network indicated that the car was non-USA, creating a negative result; because the car was actually from the USA, the negative classification was false. This error is also known as a type-2 error.
Similarly, the network might falsely classify a non-USA car as USA. This error is a false positive, or a type-1 error. Neural networks prone to producing false positives are characterized as more sensitive. Similarly, neural networks that produce more false negatives are labeled as more specific. Figure 5.2 summarizes these relationships between true/false, positives/negatives, type-1 & type-2 errors, and sensitivity/specificity:
Figure 5.2: Types of Errors
Setting the cutoff for the output neuron determines whether sensitivity or specificity is more important. It is possible to make a neural network more sensitive or more specific by adjusting this cutoff, as illustrated in Figure 5.3:
Figure 5.3: Sensitivity vs. Specificity
As the limit line moves left, the network becomes more sensitive. The decrease in the size of the true negative (TN) area makes this sensitivity evident. Conversely, as the limit line moves right, the network becomes more specific. This specificity is evident in the decrease in size of the true positive (TP) area.
Increases in sensitivity will usually result in a decrease of specificity. Figure 5.4 shows a limit designed to make the neural network very sensitive:
Figure 5.4: Sensitive Cutoff
The neural network can also be calibrated for greater specificity, as shown in Figure 5.5:
Figure 5.5: Specific Cutoff
Attaining 100% specificity or sensitivity is not necessarily good. A medical test can reach 100% specificity by simply predicting that no one has the disease. This test will never commit a false positive error because it never gives a positive answer. Obviously, such a test is not useful. Overly specific or sensitive neural networks produce the same meaningless result. We need a way to evaluate the total effectiveness of the neural network that is independent of the cutoff point. The total prediction rate combines the percentage of true positives and true negatives. Equation 5.1 calculates the total prediction rate:
Equation 5.1: Total Prediction Rate
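As a minimal Python sketch, assuming the common definition of the total prediction rate as the fraction of all predictions that were correct, (TP + TN) / (TP + TN + FP + FN), with made-up counts:

```python
# A minimal sketch of the total prediction rate, assuming the definition
# (TP + TN) / (TP + TN + FP + FN). The counts below are made up.

def total_prediction_rate(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
print(total_prediction_rate(40, 45, 5, 10))  # 0.85
```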
Additionally, you can visualize the total prediction rate (TPR) with a receiver operating characteristic (ROC) chart, as seen in Figure 5.6:
Figure 5.6: Receiver Operating Characteristic (ROC) Chart
The above chart shows three different ROC curves. The dashed line shows an ROC with zero predictive power. The dotted line shows a better neural network, and the solid line shows a nearly perfect neural network. To understand how to read an ROC chart, look first at the origin, which is marked by 0%. All ROC lines always start at the origin and move to the upper-right corner, where true positive (TP) and false positive (FP) rates are both 100%.
The y-axis shows the TP percentages from 0 to 100. As you move up the y-axis, both TP and FP increase. As TP increases, so does sensitivity; however, specificity falls. The ROC chart allows you to select the level of sensitivity you need, but it also shows you the number of FPs you must accept to achieve that level of sensitivity.
The worst network, the dashed line, always has a 50% total prediction rate. Given that there are only two outcomes, this result is no better than random guessing. To get 100% TP, you must also have 100% FP, which still results in half of the predictions being wrong.
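To make the chart concrete, here is a hedged Python sketch, with made-up scores and labels, that sweeps the cutoff to produce ROC points and estimates the area under the curve with the trapezoidal rule:

```python
# Hedged sketch: building ROC points by sweeping the cutoff over a set of
# made-up network outputs, then estimating the area under the curve.

def roc_points(scores, labels):
    """Return (fp_rate, tp_rate) pairs for every distinct cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for cut in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # network outputs, high = positive
labels = [1, 1, 0, 1, 0, 0]               # actual classes
print(auc(roc_points(scores, labels)))    # about 0.889 for this toy data
```

An AUC near 0.5 corresponds to the dashed diagonal line; a value near 1.0 corresponds to the nearly perfect solid line.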
The following URL allows you to experiment with a simple neural network and ROC curve:
http://www.heatonresearch.com/aifh/vol3/anneal_roc.html
We can train the neural network at the above URL with simulated annealing. Each time an annealing epoch is completed, the neural network improves. We can measure this improvement with the mean squared error (MSE) calculation. As the MSE drops, the ROC curve stretches towards the upper-left corner. We will describe the MSE in greater detail later in this chapter. For now, simply think of it as a measurement of the neural network’s error when you compare it to the expected output. A lower MSE is desirable. Figure 5.7 shows the ROC curve after we have trained the network for a number of iterations:
Figure 5.7: ROC Curve
It is important to note that the goal is not always to maximize the total prediction rate. Sometimes a false positive (FP) is better than a false negative (FN). Consider a neural network that predicts a bridge collapse. An FP means that the program predicts a collapse when the bridge was actually safe. In this case, checking a structurally sound bridge would waste an engineer’s time. On the other hand, an FN would mean that the neural network predicted the bridge was safe when it actually collapsed. A bridge collapsing is a much worse outcome than wasting the time of an engineer. Therefore, you should arrange this type of neural network so that it is overly sensitive.
To evaluate the total effectiveness of the network, you should consider the area under the curve (AUC). The optimal AUC would be 1.0, which is a 100% (1.0) x 100% (1.0) rectangle that pushes the area under the curve to the maximum. When reading an ROC chart, the more effective neural networks have more space under the curve. The curves shown previously, in Figure 5.6, correspond with this assessment.
Multi-Class Classification
If you want to predict more than one outcome, you will need more than one output neuron. Because a single neuron can predict two outcomes, a neural network with two output neurons is somewhat rare. If there are three or more outcomes, there will be three or more output neurons. Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms does show a method that can encode three outcomes into two output neurons.
Consider Fisher’s iris dataset. This dataset contains four different measurements for three different species of iris flower. The following URL contains this dataset:
https://archive.ics.uci.edu/ml/datasets/Iris
Sample data from the iris dataset is shown here:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolour
6.4,3.2,4.5,1.5,Iris-versicolour
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
The four measurements can predict the species. If you are interested in reading more about how to measure an iris flower, refer to the above link. For this prediction, the meaning of the four measurements does not really matter; they are simply the features from which the neural network will learn to predict. Figure 5.8 shows a neural network structure that can predict the iris dataset:
Figure 5.8: Iris Data Set Neural Network
The above neural network accepts the four measurements and outputs three numbers. Each output corresponds with one of the iris species. The output neuron that produces the highest number determines the species predicted.
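This highest-output rule can be sketched in a few lines of Python. The output values below are illustrative, not actual network results:

```python
# Minimal sketch: picking the predicted species from three output neurons.
# The output values are made up for illustration.

SPECIES = ["Iris-setosa", "Iris-versicolour", "Iris-virginica"]

def predict_species(outputs):
    """Return the species whose output neuron produced the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return SPECIES[best]

print(predict_species([0.1, 0.7, 0.2]))  # Iris-versicolour
```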
Log Loss
Classification networks can derive a class from the input data. For example, the four iris measurements can group the data into the three species of iris. One easy method to evaluate classification is to treat it like a multiple-choice exam and return a percent score. Although this technique is common, most machine learning models do not answer multiple-choice questions like you did in school. Consider how the following question might appear on an exam:
1. Would an iris setosa have a sepal length of 5.1 cm, a sepal width of 3.5 cm, a petal length of 1.4 cm, and a petal width of 0.2 cm?
A) True
B) False
This question is exactly the type that a neural network must face in a classification task. However, the neural network will not respond with an answer of “True” or “False.” It will answer the question in the following manner:
True: 80%
The above response means that the neural network is 80% sure that the flower is a setosa. This technique would be very handy in school. If you could not decide between true and false, you could simply place 80% on “True.” Scoring is relatively easy because you receive your percentage value for the correct answer. In this case, if “True” were the correct answer, your score would be 80% for that question.
However, log loss is not quite that simple. Equation 5.2 is the equation for log loss:
Equation 5.2: Log Loss Function
You should use this equation only as an objective function for classifications that have two outcomes. The variable y-hat is the neural network’s prediction, and the variable y is the known correct answer. In this case, y will always be 0 or 1. The training data have no probabilities; each element is classified either into one class (1) or the other (0).
The variable N represents the number of elements in the training set, that is, the number of questions in the test. We divide by N because this process is customary for an average. We also begin the equation with a negative because the log function is always negative over the domain 0 to 1. This negation allows a positive score for the training to minimize.
You will notice that two terms are separated by the addition (+). Each contains a log function. Because y will be either 0 or 1, one of these two terms will cancel out to 0. If y is 0, then the first term will reduce to 0. If y is 1, then the second term will be 0.
If your prediction for the first class of a two-class prediction is y-hat, then your prediction for the second class is 1 minus y-hat. Essentially, if your prediction for class A is 70% (0.7), then your prediction for class B is 30% (0.3). Your score will increase by the log of your prediction for the correct class. If the neural network had predicted 1.0 for class A, and the correct answer was A, your score would increase by log(1), which is 0. For log loss, we seek a low score, so a correct answer results in 0. Here are some of these log values (base 10) for a neural network’s probability estimate for the correct class:
-log(1.0) = 0
-log(0.95) = 0.02
-log(0.9) = 0.05
-log(0.8) = 0.1
-log(0.5) = 0.3
-log(0.1) = 1
-log(0.01) = 2
-log(1.0e-12) = 12
-log(0.0) = infinity
As you can see, giving a low confidence to the correct answer affects the score the most. Because log(0) is negative infinity, we typically impose a minimum value. Of course, the above log values are for a single training set element. We will average the log values for the entire training set.
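A minimal Python sketch of Equation 5.2, assuming the usual natural-log form -1/N * sum(y*log(y-hat) + (1-y)*log(1-y-hat)); the small epsilon guards against log(0), the minimum value mentioned above:

```python
# A hedged sketch of binary log loss. The epsilon clip keeps predictions
# away from exactly 0 or 1, where the log would blow up.
import math

def log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        yhat = min(max(yhat, eps), 1 - eps)  # impose a minimum value
        total += y * math.log(yhat) + (1 - y) * math.log(1 - yhat)
    return -total / len(y_true)

# A confident, correct prediction scores near 0; poor confidence scores high.
print(log_loss([1, 0], [0.9, 0.2]))
```

Note that this sketch uses the natural logarithm, which is common for log loss; the illustrative table above used base-10 values.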
Multi-Class Log Loss
If more than two outcomes are classified, then we must use multi-class log loss. This loss function is very closely related to the binary log loss just described. Equation 5.3 shows the equation for multi-class log loss:
Equation 5.3: Multi-Class Log Loss
In the above equation, N is the number of training set elements, and M represents the number of categories for the classification process. Conceptually, the multi-class log loss objective function works similarly to binary log loss. The above equation essentially gives you a score that is the average of the negative log of your prediction for the correct class on each of the data items. The innermost sigma summation in the above equation functions as an if-then statement and allows only the correct class, with a y of 1.0, to contribute to the summation.
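Equation 5.3 can be sketched as follows. The one-hot targets and probability rows below are made-up values, and the inner loop mirrors the if-then role of the innermost summation:

```python
# A hedged sketch of multi-class log loss: the average negative log of the
# predicted probability assigned to the correct class. Values are made up.
import math

def multi_log_loss(y_true, y_pred, eps=1e-15):
    """y_true: list of one-hot rows; y_pred: list of probability rows."""
    total = 0.0
    for truth, probs in zip(y_true, y_pred):
        for y, p in zip(truth, probs):
            # Only the correct class (y == 1) contributes to the sum.
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

targets = [[1, 0, 0], [0, 1, 0]]            # one-hot correct classes
preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # network probabilities
print(multi_log_loss(targets, preds))
```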
Evaluating Regression
The mean squared error (MSE) calculation is the most commonly utilized process for evaluating regression machine learning. Most Internet examples of neural networks, support vector machines, and other models apply MSE (Draper, 1998), shown in Equation 5.4:
Equation 5.4: Mean Squared Error (MSE)
In the above equation, y is the ideal output and y-hat is the actual output. The mean squared error is essentially the mean of the squares of the individual differences. Because the individual differences are squared, the positive or negative nature of the difference does not matter to MSE.
You can also evaluate classification problems with MSE. To evaluate classification output with MSE, each class’s probability is simply treated as a numeric output. The expected output simply has a value of 1.0 for the correct class, and 0 for the others. For example, if the first class were correct, and the other three classes incorrect, the expected outcome vector would look like the following:
[1.0, 0, 0, 0]
You can use nearly any regression objective function for classification in this way. A variety of functions, such as root mean square (RMS) and sum of squares error (SSE), can evaluate regression, and we discussed these functions in Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms.
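Equation 5.4, applied to classification output in the way just described, can be sketched as follows; the prediction values are illustrative:

```python
# A minimal sketch of MSE, comparing a probability vector with a one-hot
# expected vector. The predicted probabilities are made up.

def mse(ideal, actual):
    """Mean of the squared differences between ideal and actual outputs."""
    return sum((y - yhat) ** 2 for y, yhat in zip(ideal, actual)) / len(ideal)

expected = [1.0, 0.0, 0.0, 0.0]       # first class is correct
predicted = [0.8, 0.1, 0.05, 0.05]    # network's class probabilities
print(mse(expected, predicted))       # about 0.01375
```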
Training with Simulated Annealing
To train a neural network, you must define its task. An objective function, also known as a scoring or loss function, measures how well the network performs that task. Essentially, an objective function evaluates the neural network and returns a number indicating the usefulness of the neural network. The training process modifies the weights of the neural network in each iteration so that the value returned from the objective function improves.
Simulated annealing is an effective optimization technique that we examined in Artificial Intelligence for Humans, Volume 1. In this chapter, we will review simulated annealing as well as show you how any vector optimization function can improve the weights of a feedforward neural network. In the next chapter, we will examine even more advanced optimization techniques that take advantage of a differentiable loss function.
As a review, simulated annealing works by first assigning the weight vector of a neural network to random values. This vector is treated like a position, and the program evaluates possible moves from that position. To understand how a neural network weight vector translates to a position, think of a neural network with just three weights. In the real world, we consider position in terms of the x, y and z coordinates. We can write any position as a vector of length 3. If we are willing to move in a single dimension at a time, we could move in a total of six different directions: forward or backward in the x, y or z dimensions.
Simulated annealing functions by moving forward or backwards in the available dimensions. If the algorithm always took the best move, a simple hill-climbing algorithm would result. Hill climbing only accepts moves that improve the score; therefore, it is called a greedy algorithm. To reach the best position, an algorithm will sometimes need to move to a lower position. As a result, simulated annealing very much follows the expression of two steps forward, one step back.
In other words, simulated annealing will sometimes allow a move to a weight configuration with a worse score. The probability of accepting such a move starts high and decreases. This probability is governed by the current temperature, which simulates the actual metallurgical annealing process where a metal cools and achieves greater hardness. Figure 5.9 shows the entire process:
Figure 5.9: Simulated Annealing
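The process in Figure 5.9 can be sketched in Python. This is a hedged illustration, not the book’s actual implementation: the error function is a stand-in (squared distance to a known target vector) rather than a real neural network evaluation, and the schedule constants are arbitrary:

```python
# A hedged sketch of simulated annealing on a weight vector. The error
# function and schedule constants below are stand-ins for illustration.
import math
import random

def anneal(weights, error_fn, k_max=100, t_start=400.0, t_end=1e-4):
    random.seed(42)                       # reproducible demo run
    best = list(weights)
    best_err = error_fn(best)
    current, current_err = list(best), best_err
    for k in range(1, k_max + 1):
        # Exponentially decay the temperature from t_start toward t_end.
        t = t_start * (t_end / t_start) ** (k / k_max)
        candidate = [w + random.uniform(-0.5, 0.5) for w in current]
        cand_err = error_fn(candidate)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t), which shrinks as t cools.
        delta = cand_err - current_err
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_err = candidate, cand_err
        if current_err < best_err:
            best, best_err = list(current), current_err
    return best, best_err

target = [1.0, -2.0, 0.5]
error = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
w, e = anneal([0.0, 0.0, 0.0], error)
print("final error:", e)
```

The high starting temperature makes nearly every move acceptable (a random walk); as the temperature decays, the search behaves more and more like greedy hill climbing.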
A feedforward neural network can utilize simulated annealing to learn the iris dataset. The following output shows this training:
Iteration #1, Score=0.3937, k=1, kMax=100, t=343.5891, prob=0.9998
Iteration #2, Score=0.3937, k=2, kMax=100, t=295.1336, prob=0.9997
Iteration #3, Score=0.3835, k=3, kMax=100, t=253.5118, prob=0.9989
Iteration #4, Score=0.3835, k=4, kMax=100, t=217.7597, prob=0.9988
Iteration #5, Score=0.3835, k=5, kMax=100, t=187.0496, prob=0.9997
Iteration #6, Score=0.3835, k=6, kMax=100, t=160.6705, prob=0.9997
Iteration #7, Score=0.3835, k=7, kMax=100, t=138.0116, prob=0.9996
...
Iteration #99, Score=0.1031, k=99, kMax=100, t=1.16E-4, prob=2.8776E-7
Iteration #100, Score=0.1031, k=100, kMax=100, t=9.9999E-5, prob=2.1443E-70
Final score: 0.1031
[0.22222222222222213, 0.6249999999999999, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa
[0.1666666666666668, 0.41666666666666663, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa
...
[0.6666666666666666, 0.41666666666666663, 0.711864406779661, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.5555555555555555, 0.20833333333333331, 0.6779661016949152, 0.75] -> Iris-virginica, Ideal: Iris-virginica
[0.611111111111111, 0.41666666666666663, 0.711864406779661, 0.7916666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.5277777777777778, 0.5833333333333333, 0.7457627118644068, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.44444444444444453, 0.41666666666666663, 0.6949152542372881, 0.7083333333333334] -> Iris-virginica, Ideal: Iris-virginica
[1.178018083703488,16.66575553359515,-0.6101619300462806,
-3.9894606091020965,13.989551673146842,-8.87489712462323,
8.027287801488647,-4.615098285283519,6.426489182215509,
-1.4672962642199618,4.136699061975335,4.20036115439746,
0.9052469139543605,-2.8923515248132063,-4.733219252086315,
18.6497884912826,2.5459600552510895,-5.618872440836617,
4.638827606092005,0.8887726364890928,8.730809901357286,
-6.4963370793479545,-6.4003385330186795,-11.820235441582424,
-3.29494170904095,-1.5320936828139837,0.1094081633203249,
0.26353076268018827,3.935780218339343,0.8881280604852664,
-5.048729642423418,8.288232057956957,-14.686080237582006,
3.058305829324875,-2.4144038920292608,21.76633883966702,
12.151853576801647,-3.6372061664901416,6.28253174293219,
-4.209863472970308,0.8614258660906541,-9.382012074551428,
-3.346419915864691,-0.6326977049713416,2.1391118323593203,
0.44832732990560714,6.853600355726914,2.8210824313745957,
1.3901883615737192,-5.962068350552335,0.502596306917136]
The initial random neural network starts out with a high multi-class log loss score of about 0.39. As the training progresses, this value falls until it is low enough for training to stop. For this example, the training stops once the score falls to roughly 0.10. To determine a good stopping point for the error, you should evaluate how well the network is performing for your intended use. A log loss below 0.5 is often in the acceptable range; however, you might not be able to achieve this score with all datasets.
The following URL shows an example of a neural network trained with simulated annealing:
http://www.heatonresearch.com/aifh/vol3/anneal_roc.html
Chapter Summary
Objective functions can evaluate neural networks. They simply return a number that indicates the success of the neural network. Regression neural networks will frequently utilize mean squared error (MSE). Classification neural networks will typically use a log loss or multi-class log loss function. You can also create custom objective functions for these neural networks.
Simulated annealing can optimize the neural network. You can utilize any of the optimization algorithms presented in Volumes 1 and 2 of Artificial Intelligence for Humans. In fact, you can optimize any vector in this way because the optimization algorithms are not tied to a neural network. In the next chapter, you will see several training methods designed specifically for neural networks. While these specialized training algorithms are often more efficient, they require objective functions that have a derivative.
Chapter 6: Backpropagation Training
Gradient Calculation
Backpropagation
Learning Rate & Momentum
Stochastic Gradient Descent
Backpropagation is one of the most common methods for training a neural network. Rumelhart, Hinton, & Williams (1986) introduced backpropagation, and it remains popular today. Programmers frequently train deep neural networks with backpropagation because it scales really well when run on graphical processing units (GPUs). To understand this algorithm for neural networks, we must examine how to train it as well as how it processes a pattern.
Classic backpropagation has been extended and modified to give rise to many different training algorithms. In this chapter, we will discuss the most commonly used training algorithms for neural networks. We begin with classic backpropagation and then end the chapter with stochastic gradient descent (SGD).
Understanding Gradients
Backpropagation is a type of gradient descent, and many texts will use these two terms interchangeably. Gradient descent refers to the calculation of a gradient on each weight in the neural network for each training element. Because the neural network will not output the expected value for a training element, the gradient of each weight will give you an indication of how to modify each weight to achieve the expected output. If the neural network did output exactly what was expected, the gradient for each weight would be 0, indicating that no change to the weight is necessary.
The gradient is the derivative of the error function at the weight’s current value. The error function measures the distance of the neural network’s output from the expected output. In fact, gradient descent is the process of following each weight’s gradient to reach ever lower values of the error function.
With respect to the error function, the gradient is essentially the partial derivative of the error for each weight in the neural network. Each weight has a gradient that is the slope of the error function. A weight is a connection between two neurons. Calculating the gradient of the error function allows the training method to determine whether it should increase or decrease the weight. In turn, this determination will decrease the error of the neural network. The error is the difference between the expected output and actual output of the neural network. Many different training methods, called propagation-training algorithms, utilize gradients. In all of them, the sign of the gradient tells the neural network the following information:
Zero gradient – The weight is not contributing to the error of the neural network.
Negative gradient – The weight should be increased to achieve a lower error.
Positive gradient – The weight should be decreased to achieve a lower error.
Because many algorithms depend on gradient calculation, we will begin with an analysis of this process.
What is a Gradient
First of all, let’s examine the gradient. Essentially, training is a search for the set of weights that will cause the neural network to have the lowest error for a training set. If we had an infinite amount of computation resources, we would simply try every possible combination of weights to determine the one that provided the lowest error during the training.
Because we do not have unlimited computing resources, we have to use some sort of shortcut to prevent the need to examine every possible weight combination. These training methods utilize clever techniques to avoid performing a brute-force search of all weight values. This type of exhaustive search would be impossible because even small networks have an essentially infinite number of weight combinations.
Consider a chart that shows the error of a neural network for each possible weight. Figure 6.1 is a graph that demonstrates the error for a single weight:
Figure 6.1: Gradient of a Single Weight
Looking at this chart, you can easily see that the optimal weight is the location where the line has the lowest y-value. The problem is that we see only the error for the current value of the weight; we do not see the entire graph because that process would require an exhaustive search. However, we can determine the slope of the error curve at a particular weight. In the above chart, we see the slope of the error curve at 1.5. The straight line that barely touches the error curve at 1.5 gives the slope. In this case, the slope, or gradient, is -0.5622. The negative slope indicates that an increase in the weight will lower the error.
The gradient is the instantaneous slope of the error function at the specified weight. The derivative of the error curve at that point gives the gradient. This line tells us the steepness of the error function at the given weight.
Derivatives are one of the most fundamental concepts in calculus. For the purposes of this book, you just need to understand that a derivative provides the slope of a function at a specific point. A training technique and this slope can give you the information to adjust the weight for a lower error. Using our working definition of the gradient, we will now show how to calculate it.
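The slope-at-a-point idea can be demonstrated numerically. This hedged sketch uses a made-up error curve, not the one plotted in Figure 6.1, and estimates the gradient with a central-difference approximation:

```python
# Hedged sketch: estimating the slope (gradient) of an error curve at a
# single weight with a central difference. The error curve is made up.

def numeric_gradient(error_fn, w, h=1e-6):
    """Central difference: slope is approximately (E(w+h) - E(w-h)) / (2h)."""
    return (error_fn(w + h) - error_fn(w - h)) / (2 * h)

error = lambda w: (w - 3.0) ** 2      # toy error curve with its minimum at w = 3.0
g = numeric_gradient(error, 1.5)
print(round(g, 6))  # -3.0: a negative slope, so increasing w lowers the error
```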
Calculating Gradients
We will calculate an individual gradient for each weight. Our focus is not only the equations but also their application to actual neural networks with real numbers. Figure 6.2 shows the neural network that we will use:
Figure 6.2: An XOR Network
Additionally, we use this same neural network in several examples on the website for this book. In this chapter, we will show several calculations that demonstrate the training of a neural network. We must use the same starting weights so that these calculations are consistent. However, the above weights have no special characteristic; the program generated them randomly.
The aforementioned neural network is a typical three-layer feedforward network like the ones we have previously studied. The circles indicate neurons. The lines connecting the circles are the weights. The rectangles in the middle of the connections give the weight for each connection.
The problem that we now face is calculating the partial derivative for each of the weights in the neural network. We use a partial derivative when an equation has more than one variable. Each of the weights is considered a variable because these weight values will change independently as the neural network changes. The partial derivative of each weight simply shows each weight’s independent effect on the error function. This partial derivative is the gradient.
We can calculate each partial derivative with the chain rule of calculus. We will begin with one training set element. For Figure 6.2, we provide an input of [1,0] and expect an output of [1]. You can see that we apply the input on the above figure. The first input neuron has an input value of 1.0, and the second input neuron has an input value of 0.0.
This input feeds through the network and eventually produces an output. Chapter 4, “Feedforward Neural Networks,” covers the exact process to calculate the output and sums. Backpropagation has both a forward and a backward pass. The forward pass occurs when we calculate the output of the neural network. We will calculate the gradients only for this item in the training set. Other items in the training set will have different gradients. We will discuss how to combine the gradients for the individual training set elements later in the chapter.
We are now ready to calculate the gradients. The steps involved in calculating the gradients for each weight are summarized here:
1. Calculate the error, based on the ideal of the training set.
2. Calculate the node (neuron) delta for the output neurons.
3. Calculate the node delta for the interior neurons.
4. Calculate the individual gradients.
We will discuss these steps in the subsequent sections.
Calculating Output Node Deltas
Calculating a constant value for every node, or neuron, in the neural network is the first step. We will start with the output nodes and work our way backwards through the neural network. The term backpropagation comes from this process. We initially calculate the errors for the output neurons and propagate these errors backwards through the neural network.
The node delta is the value that we will calculate for each node. Layer delta also describes this value because we can calculate the deltas one layer at a time. The method for determining the node deltas differs depending on whether you are calculating for an output or an interior node. The output nodes are calculated first, and they take into account the error function for the neural network. In this volume, we will examine the quadratic error function and the cross-entropy error function.
Quadratic Error Function
Programmers of neural networks frequently use the quadratic error function. In fact, you can find many examples of the quadratic error function on the Internet. If you are reading an example program, and it does not mention a specific error function, the program is probably using the quadratic error function, also known as the mean squared error (MSE) function discussed in Chapter 5, “Training and Evaluation.” Equation 6.1 shows the MSE function:
Equation 6.1: Mean Squared Error (MSE)
The above equation compares the neural network’s actual output (y-hat) with the expected output (y). The variable n contains the number of training elements times the number of output neurons. MSE handles multiple output neurons as individual cases. Equation 6.2 shows the node delta used in conjunction with the quadratic error function:
Equation 6.2: Node Delta of MSE Output Layer
The quadratic error function is very simple because it takes the difference between the expected and actual output for the neural network. The Greek letter φ′ (phi prime) represents the derivative of the activation function.
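As a hedged sketch of an output node delta for the quadratic error, assuming one common sign convention, delta = (ideal - actual) * phi'(sum), with a sigmoid output neuron and made-up values:

```python
# A hedged sketch of an output node delta for the quadratic error,
# assuming the convention delta = (ideal - actual) * phi'(sum).
# The sum and ideal values below are made up.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_node_delta(ideal, sum_value):
    out = sigmoid(sum_value)
    deriv = out * (1.0 - out)     # sigmoid derivative from its own output
    return (ideal - out) * deriv

# Output neuron with a weighted sum of 0.75 and an ideal output of 1.0
print(output_node_delta(1.0, 0.75))
```

Sign conventions for node deltas vary between texts; what matters is that the delta is the error-function difference scaled by the activation function's derivative at the neuron's sum.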
Cross-Entropy Error Function
The quadratic error function can sometimes take a long time to properly adjust the weights. Equation 6.3 shows the cross-entropy error function:
Equation 6.3: Cross-Entropy Error
The node delta calculation for the cross-entropy error turns out to be much less complex than the one for MSE, as seen in Equation 6.4:
Equation 6.4: Node Delta of Cross-Entropy Output Layer
The cross-entropy error function will typically give better results than the quadratic error function because it creates a much steeper gradient for errors. You should always use the cross-entropy error function.
Calculating Remaining Node Deltas
Now that the output node delta has been calculated according to the appropriate error function, we can calculate the node deltas for the interior nodes, as demonstrated by Equation 6.5:
Equation 6.5: Calculating Interior Node Deltas
We will calculate the node delta for all hidden and non-bias neurons, but we do not need to calculate the node delta for the input and bias neurons. Even though we could easily calculate the node delta for input and bias neurons with Equation 6.5, gradient calculation does not require these values. As you will soon see, gradient calculation for a weight only considers the neuron to which the weight is connected. Bias and input neurons are only the beginning point for a connection; they are never the end point.
If you would like to see the gradient calculation process, several JavaScript examples will show the individual calculations. These examples can be found at the following URL:
http://www.heatonresearch.com/aifh/vol3/
Derivatives of the Activation Functions
The backpropagation process requires the derivatives of the activation functions, and they often determine how the backpropagation process will perform. Most modern deep neural networks use the linear, softmax, and ReLU activation functions. We will also examine the derivatives of the sigmoid and hyperbolic tangent activation functions so that we can see why the ReLU activation function performs so well.
Derivative of the Linear Activation Function
The linear activation function is barely an activation function at all because it simply returns whatever value it is given. For this reason, the linear activation function is sometimes called the identity activation function. The derivative of this function is 1, as demonstrated by Equation 6.6:
Equation 6.6: Derivative of the Linear Activation Function
The Greek letter φ (phi) represents the activation function, as in previous chapters. However, the apostrophe just above and to the right of φ (phi) means that we are using the derivative of the activation function. This is one of several ways that a derivative is expressed in mathematical form.
DerivativeoftheSoftmaxActivationFunction
Inthisvolume,thesoftmaxactivationfunction,alongwiththelinearactivationfunction,isusedonlyontheoutputlayeroftheneuralnetworks.AsmentionedinChapter1,“NeuralNetworkBasics,”thesoftmaxactivationfunctionisdifferentfromtheotheractivationfunctionsinthatitsvalueisdependentontheotheroutputneurons,notjustontheoutputneuroncurrentlybeingcalculated.Forconvenience,thesoftmaxactivationfunctionisrepeatedinEquation6.7:
Equation6.7:SoftmaxActivationFunction
Thezvectorrepresentstheoutputfromalloutputneurons.Equation6.8showsthederivativeofthisactivationfunction:
Equation6.8:DerivativeoftheSoftmaxActivationFunction
Weusedslightlydifferentnotationfortheabovederivative.Theratio,withthecursive-stylized“d”symbolmeansapartialderivative,whichoccurswhenyoudifferentiateanequationwithmultiplevariables.Totakeapartialderivative,youdifferentiatetheequationrelativetoonevariable,holdingallothersconstant.Thetop“d”tellsyouwhatfunctionyouaredifferentiating.Inthiscase,itistheactivationfunctionφ(phi).Thebottom“d”denotestherespectivedifferentiationofthepartialderivative.Inthiscase,wearecalculatingtheoutputoftheneuron.Allothervariablesaretreatedasconstant.Aderivativeistheinstantaneousrateofchange—onlyonethingcanchangeatonce.
If you use the cross-entropy error function, you will not use the derivatives of the linear or softmax activation functions to calculate the gradients of the neural network. You should use the linear and softmax activation functions only at the output layer of a neural network, so we do not need to worry about their derivatives for the interior nodes. For the output nodes with cross entropy, the derivative of both linear and softmax is always 1.
Derivative of the Sigmoid Activation Function

Equation 6.9 shows the derivative of the sigmoid activation function:

Equation 6.9: Derivative of the Sigmoid Activation Function

φ'(x) = φ(x)(1 - φ(x))
Machine learning frequently utilizes the sigmoid function represented in the above equation. The formula results from algebraic manipulation of the sigmoid derivative so that the sigmoid activation function appears in its own derivative. This form is computationally efficient: we already calculated the value of the sigmoid function during the feedforward pass, and retaining that value makes the sigmoid derivative a simple calculation. If you are interested in how to obtain Equation 6.9, you can refer to the following URL:
http://www.heatonresearch.com/aifh/vol3/deriv_sigmoid.html
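The reuse of the feedforward value can be sketched as follows (hypothetical helper names of our own, not the book's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative_from_output(s):
    # s is the sigmoid value already computed during the feedforward pass;
    # the derivative reuses it, so no additional exp() call is needed.
    return s * (1.0 - s)
```

For example, at x = 0 the forward value is 0.5 and the derivative is 0.25, the sigmoid derivative's maximum.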
Derivative of the Hyperbolic Tangent Activation Function

Equation 6.10 shows the derivative of the hyperbolic tangent activation function:

Equation 6.10: Derivative of the Hyperbolic Tangent Activation Function

φ'(x) = 1 - φ(x)²

We recommend that you always use the hyperbolic tangent activation function instead of the sigmoid activation function.
Derivative of the ReLU Activation Function

Equation 6.11 shows the derivative of the ReLU function:

Equation 6.11: Derivative of the ReLU Activation Function

φ'(x) = 1 if x > 0; 0 otherwise

Strictly speaking, the ReLU function does not have a derivative at 0. However, by convention, a gradient of 0 is substituted when x is 0. Deep neural networks with sigmoid and hyperbolic tangent activation functions can be difficult to train using backpropagation. Several factors cause this difficulty, and the vanishing gradient problem is one of the most common. Figure 6.3 shows the hyperbolic tangent function, along with its gradient/derivative:
Figure 6.3: Tanh Activation Function & Derivative

Figure 6.3 shows that as the hyperbolic tangent (blue line) saturates to -1 and 1, the derivative of the hyperbolic tangent (red line) vanishes to 0. The sigmoid and hyperbolic tangent activation functions both have this problem, but ReLU doesn't. Figure 6.4 shows the same graph for the sigmoid activation function and its vanishing derivative:

Figure 6.4: Sigmoid Activation Function & Derivative
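To make the contrast concrete, a small sketch (our own, not the book's code) compares the tanh derivative from Equation 6.10, which vanishes as the input saturates, with the ReLU derivative from Equation 6.11, which stays at 1 for any positive input:

```python
import math

def tanh_derivative(x):
    # Equation 6.10: 1 - tanh(x)^2; this vanishes as tanh saturates.
    t = math.tanh(x)
    return 1.0 - t * t

def relu_derivative(x):
    # Equation 6.11, with the usual convention of 0 at x == 0.
    return 1.0 if x > 0.0 else 0.0
```

At an input of 10, the tanh derivative is already vanishingly small, while the ReLU derivative is still exactly 1.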
Applying Backpropagation

Backpropagation is a simple training method that adjusts the weights of the neural network with its calculated gradients. This method is a form of gradient descent, since we are descending the gradients to lower values. As the program adjusts these weights, the neural network should produce more desirable output, and the global error of the neural network should fall as it trains. Before we can examine the backpropagation weight update process, we must examine two different ways to update the weights.

Batch and Online Training

We have already shown how to calculate the gradients for an individual training set element. Earlier in this chapter, we calculated the gradients for a case in which we gave the neural network an input of [1,0] and expected an output of [1]. This result is acceptable for a single training set element. However, most training sets have many elements. Therefore, we can handle multiple training set elements through two approaches called online and batch training.

Online training implies that you modify the weights after every training set element. Using the gradients obtained from the first training set element, you calculate and apply a change to the weights. Training then progresses to the next training set element and calculates another update to the neural network. This training continues until you have used every training set element. At this point, one iteration, or epoch, of training has completed.
Batch training also utilizes all the training set elements. However, we do not update the weights after each element. Instead, we sum the gradients for each training set element. Once we have processed every training set element, we update the neural network weights. At this point, the iteration is complete.
Sometimes, we can set a batch size. For example, you might have a training set of 10,000 elements. You might choose to update the weights of the neural network every 1,000 elements, thereby causing the neural network weights to update ten times during the training iteration.
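The two update schemes can be sketched in Python (illustrative code of our own, not the book's; `gradient` is an assumed callback that returns the gradient array for one training element):

```python
def online_train(weights, training_set, gradient, learning_rate):
    # Online training: update the weights after every training element.
    for element in training_set:
        g = gradient(weights, element)
        weights = [w - learning_rate * gi for w, gi in zip(weights, g)]
    return weights

def batch_train(weights, training_set, gradient, learning_rate):
    # Batch training: sum the gradients over the whole set, then update once.
    total = [0.0] * len(weights)
    for element in training_set:
        g = gradient(weights, element)
        total = [t + gi for t, gi in zip(total, g)]
    return [w - learning_rate * t for w, t in zip(weights, total)]
```

Both loops visit every element once per iteration; they differ only in when the weight update is applied.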
Online training was the original method for backpropagation. If you would like to see the calculations for the batch version of this program, refer to the following online example:
http://www.heatonresearch.com/aifh/vol3/xor_batch.html
Stochastic Gradient Descent

Batch and online training are not the only choices for backpropagation. Stochastic gradient descent (SGD) is the most popular of the backpropagation algorithms. SGD can work in either batch or online mode. Online stochastic gradient descent simply selects a training set element at random, calculates the gradient, and performs a weight update. This process continues until the error reaches an acceptable level. Choosing random training set elements will usually converge to an acceptable weight faster than looping through the entire training set for each iteration.
Batch stochastic gradient descent works by choosing a batch size. For each iteration, a mini-batch is formed by randomly selecting training set elements up to the chosen batch size. The gradients from the mini-batch are summed just as in regular batch updating; the only difference is that the mini-batches are randomly chosen each time they are needed. In SGD, an iteration typically processes a single mini-batch. Batches are usually much smaller than the entire training set size. A common choice for the batch size is 600.
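Forming a random mini-batch can be sketched as follows (illustrative Python of our own; `random.sample` draws without replacement, and sampling with replacement is also a common choice):

```python
import random

def sample_mini_batch(training_set, batch_size, rng=random):
    # Randomly choose up to batch_size elements for one SGD iteration.
    # If the set is smaller than the batch size, use the whole set.
    return rng.sample(training_set, min(batch_size, len(training_set)))
```

Each call produces a fresh random mini-batch, which is exactly what makes the gradient estimate "stochastic."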
Backpropagation Weight Update

We are now ready to update the weights. As previously mentioned, we will treat the weights and gradients as a single-dimensional array. Given these two arrays, we are ready to calculate the weight update for an iteration of backpropagation training. Equation 6.12 shows the formula to update the weights for backpropagation:

Equation 6.12: Backpropagation Weight Update

Δw(t) = -ε · (∂E/∂w) + α · Δw(t-1)
The above equation calculates the change in weight for each element in the weight array. You will also notice that the equation calls for the weight change from the previous iteration, so you must keep these values in another array. As previously mentioned, the direction of the weight update is inversely related to the sign of the gradient: a positive gradient should cause a weight decrease, and vice versa. Because of this inverse relationship, Equation 6.12 begins with a negative.
The above equation calculates the weight delta as the product of the gradient and the learning rate (represented by ε, epsilon). Furthermore, we add the product of the previous weight change and the momentum value (represented by α, alpha). The learning rate and momentum are two parameters that we must provide to the backpropagation algorithm. Choosing values for the learning rate and momentum is very important to the performance of the training. Unfortunately, the process for determining them is mostly trial and error.
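The update in Equation 6.12, including the momentum term, can be sketched as follows (our own illustrative Python, not the book's code):

```python
def backprop_update(weights, gradients, prev_deltas, learning_rate, momentum):
    # Equation 6.12: delta = -(learning rate * gradient)
    #                        + (momentum * previous delta).
    deltas = [-learning_rate * g + momentum * d
              for g, d in zip(gradients, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas  # the deltas are kept for the next iteration
```

Returning the deltas alongside the weights mirrors the need, noted above, to store the previous weight changes in a separate array.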
The learning rate scales the gradient and can slow down or speed up learning. A learning rate below 1 will slow down learning. For example, a learning rate of 0.5 would decrease every gradient by 50%. A learning rate above 1.0 would accelerate training. In practice, the learning rate is almost always below 1.
Choosing a learning rate that is too high will cause your neural network to fail to converge: the global error will simply bounce around instead of converging to a low value. Choosing a learning rate that is too low will cause the neural network to take a great deal of time to converge.
Like the learning rate, the momentum is also a scaling factor. Although it is optional, momentum determines the percentage of the previous iteration's weight change that should be applied to the current iteration. If you do not want to use momentum, just specify a value of 0.
Momentum is a technique added to backpropagation that helps the training escape local minima, which are low points on the error graph that are not the true global minimum. Backpropagation has a tendency to find its way into a local minimum and not find its way back out again. This causes the training to converge to a higher, undesirable error. Momentum gives the neural network some force in its current direction and may allow it to break through a local minimum.
Choosing Learning Rate and Momentum

Momentum and learning rate contribute to the success of the training, but they are not actually part of the neural network. Once training is complete, the trained weights remain, and momentum and learning rate are no longer used. They are essentially part of the temporary scaffolding that creates a trained neural network. Choosing the correct momentum and learning rate can impact the effectiveness of your training.
The learning rate affects the speed at which your neural network trains. Decreasing the learning rate makes the training more meticulous, while higher learning rates might skip past optimal weight settings. A lower learning rate will usually produce better results; however, lowering the learning rate can greatly increase runtime. Lowering the learning rate as the network trains can be an effective technique.
You can use the momentum to combat local minima. If you find the neural network stagnating, a higher momentum value might push the training past the local minimum that it encountered. Ultimately, choosing good values for momentum and learning rate is a process of trial and error. You can vary both as training progresses. Momentum is often set to 0.9 and the learning rate to 0.1 or lower.
Nesterov Momentum

The stochastic gradient descent (SGD) algorithm can sometimes produce erratic results because of the randomness introduced by the mini-batches. The weights might get a very beneficial update in one iteration, but a poor choice of training elements can undo it in the next mini-batch. Momentum is therefore a resourceful tool that can mitigate this sort of erratic training result.
Nesterov momentum is a relatively new application of a technique invented by Yurii Nesterov in 1983 and updated in his book, Introductory Lectures on Convex Optimization: A Basic Course (Nesterov, 2003). This technique is occasionally referred to as Nesterov's accelerated gradient descent. Although a full mathematical explanation of Nesterov momentum is beyond the scope of this book, we will present the weight update in sufficient detail so that you can implement it. This book's examples, including the online JavaScript examples, contain an implementation of Nesterov momentum. Additionally, the book's website contains JavaScript that outputs example calculations for the weight updates of Nesterov momentum.
Equation 6.13 calculates a partial weight update based on both the learning rate (ε, epsilon) and momentum (α, alpha):

Equation 6.13: Nesterov Momentum

The current iteration is signified by t, and the previous iteration by t-1. This partial weight update is called n and initially starts out at 0. Subsequent calculations of the partial weight update are based on its previous value. The partial derivative in the above equation is the gradient of the error function at the current weight. Equation 6.14 shows the Nesterov momentum update that replaces the standard backpropagation weight update shown earlier in Equation 6.12:

Equation 6.14: Nesterov Update

The weight change is calculated as an amplification of the partial weight update. The delta weight shown in the above equation is added to the current weight. Stochastic gradient descent (SGD) with Nesterov momentum is one of the most effective training algorithms for deep learning.
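As a hedged sketch of the idea behind Equations 6.13 and 6.14, here is one common "look-ahead" formulation of Nesterov momentum (our own illustrative Python; the exact placement of terms may differ from the book's equations, and `grad_fn` is an assumed callback that returns the gradient at a given weight vector):

```python
def nesterov_update(weights, velocity, grad_fn, learning_rate, momentum):
    # Look ahead along the momentum direction, evaluate the gradient there,
    # then form the partial (velocity) update and apply it to the weights.
    lookahead = [w + momentum * v for w, v in zip(weights, velocity)]
    g = grad_fn(lookahead)
    velocity = [momentum * v - learning_rate * gi
                for v, gi in zip(velocity, g)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity
```

The velocity plays the role of the partial weight update n: it starts at 0 and each new value is based on its previous value.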
Chapter Summary

This chapter introduced classic backpropagation as well as stochastic gradient descent (SGD). These methods are all based on gradient descent. In other words, they optimize individual weights with derivatives. For a given weight value, the derivative gives the program the slope of the error function, and the slope allows the program to determine how to change the weight value. Each training algorithm interprets this slope, or gradient, differently.
Despite the fact that backpropagation is one of the oldest training algorithms, it remains one of the most popular ones. Backpropagation simply subtracts the scaled gradient from the weight: a negative gradient will increase the weight, and a positive gradient will decrease the weight. We scale the gradient by the learning rate in order to prevent the weights from changing too rapidly. A learning rate of 0.5 means applying half of the gradient to the weight, whereas a learning rate of 2.0 means applying twice the gradient.
There are a number of variants of the backpropagation algorithm. Some of these, such as resilient propagation, are somewhat popular. The next chapter will introduce some backpropagation variants. Though these variants are useful to know, stochastic gradient descent (SGD) remains the most common deep learning training algorithm.
Chapter 7: Other Propagation Training

- Resilient Propagation
- Levenberg-Marquardt
- Hessian and Jacobian Matrices
The backpropagation algorithm has influenced many training algorithms, such as the stochastic gradient descent (SGD) introduced in the previous chapter. For most purposes, the SGD algorithm, along with Nesterov momentum, is a good choice for a training algorithm. However, other options exist. In this chapter, we examine two popular algorithms inspired by elements of backpropagation.

To make use of these two algorithms, you do not need to understand every detail of their implementation. Essentially, both algorithms accomplish the same objective as backpropagation. Thus, you can substitute them for backpropagation or stochastic gradient descent (SGD) in most neural network frameworks. If you find SGD is not converging, you can switch to resilient propagation (RPROP) or the Levenberg-Marquardt algorithm in order to experiment. However, you can skip this chapter if you are not interested in the actual implementation details of either algorithm.
Resilient Propagation

RPROP functions very much like backpropagation. Both backpropagation and RPROP must first calculate the gradients for the weights of the neural network. However, backpropagation and RPROP differ in the way they use the gradients. Riedmiller & Braun (1993) introduced the RPROP algorithm.

One important feature of the RPROP algorithm is that it requires no mandatory training parameters. When you utilize backpropagation, you must specify the learning rate and momentum, and these two parameters can greatly impact the effectiveness of your training. Although RPROP does include a few training parameters, you can almost always leave them at their defaults.
The RPROP algorithm has several variants. Some of the variants are listed below:

- RPROP+
- RPROP-
- iRPROP+
- iRPROP-
We will focus on classic RPROP, as described by Riedmiller & Braun (1993). The four variants listed above are relatively minor adaptations of classic RPROP. In the next sections, we will describe how to implement the classic RPROP algorithm.
RPROP Arguments

As previously mentioned, one advantage RPROP has over backpropagation is that you don't need to provide any training arguments in order to use RPROP. However, this doesn't mean that RPROP lacks configuration settings. It simply means that you usually do not need to change the configuration settings for RPROP from their defaults. However, if you really want to change them, you can choose among the following configuration settings:

- Initial Update Values
- Maximum Step
As you will see in the next section, RPROP keeps an array of update values for the weights, which determines how much you will alter each weight. This is similar to the learning rate in backpropagation, but it is much better because the algorithm adjusts the update value of every weight in the neural network as training progresses. Although some backpropagation algorithms will vary the learning rate and momentum as learning progresses, most will use a single learning rate for the entire neural network. Therefore, the RPROP approach has an advantage over backpropagation algorithms.

We start these update values at the default of 0.1, according to the initial update values argument. As a general rule, we should never change this default. However, we can make an exception to this rule if we have already trained the neural network. In the case of a previously trained neural network, some of the initial update values are going to be too strong, and the neural network will regress for many iterations before it can improve. As a result, a trained neural network may benefit from a much smaller initial update.
Another approach for an already trained neural network is to save the update values once training stops and use them for the new training. This method will allow you to resume training without the initial spike in errors that you would normally see when resuming resilient propagation training. This approach will only work if you are continuing resilient propagation on an already trained network. If you were previously training the neural network with a different training algorithm, then you will not have an array of update values to restore from.
As training progresses, you will use the gradients to adjust the update values up and down. The maximum step argument defines the maximum upward step that the gradient can take over the update values. The default value for the maximum step argument is 50. It is unlikely that you will need to change the value of this argument.

In addition to these arguments, RPROP keeps constants during processing. These are values that you can never change. The constants are listed as follows:
- Delta Minimum (1e-6)
- Negative η (Eta) (0.5)
- Positive η (Eta) (1.2)
- Zero Tolerance (1e-16)
Delta minimum specifies the minimum value that any of the update values can reach. This floor is necessary because if an update value were ever at 0, it would never be able to increase beyond 0. We will describe negative and positive η (eta) in the next sections.

The zero tolerance defines how close a number must be to 0 before that number is treated as 0. In computer programming, it is typically bad practice to compare a floating-point number to 0 because the number would have to equal 0 exactly. Rather, you typically see if the absolute value of a number is below an arbitrarily small number. A sufficiently small number is considered 0.
Data Structures

You must keep several data structures in memory while you perform RPROP training. These structures are all arrays of floating-point numbers. They are summarized here:

- Current Update Values
- Last Weight Change Values
- Current Weight Change Values
- Current Gradient Values
- Previous Gradient Values

You keep the current update values for the training. If you want to resume training at some point, you must store this update value array. Each weight has one update value that cannot go below the minimum delta constant. Likewise, these update values cannot exceed the maximum step argument.

RPROP must keep several values between iterations. You must also track the last weight delta value. Backpropagation keeps the previous weight delta for momentum; RPROP uses this delta value in a different way that we will examine in the next section. You also need the current and previous gradients. RPROP needs to know when the sign changes from the previous gradient to the current gradient. This change indicates that you must act on the update values. We will discuss these actions in the next section.
Understanding RPROP

In the previous sections, we examined the arguments, constants, and data structures necessary for RPROP. In this section, we will show you an iteration of RPROP. When we discussed backpropagation in earlier sections, we mentioned the online and batch weight update methods. However, RPROP does not support online training, so all weight updates for RPROP will be performed in batch mode. As a result, each iteration of RPROP will receive gradients that are the sum of the individual gradients of each training set element. This aspect is consistent with backpropagation in batch mode.
Determine Sign Change of Gradient

At this point, we have gradients that are the same as the gradients calculated by the backpropagation algorithm. Because we use the same process to obtain gradients in both RPROP and backpropagation, we will not repeat it here. For the first step, we compare the gradient of the current iteration to the gradient of the previous iteration. If there is no previous iteration, then we can assume that the previous gradient was 0.

To determine whether the gradient sign has changed, we will use the sign (sgn) function. Equation 7.1 defines the sgn function:

Equation 7.1: The Sign Function (sgn)

sgn(x) = -1 if x < 0; 0 if x = 0; 1 if x > 0

The sgn function returns the sign of the number provided. If x is less than 0, the result is -1. If x is greater than 0, then the result is 1. If x is equal to 0, then the result is 0. We usually implement the sgn function to use a tolerance for 0, since it is nearly impossible for floating-point operations to hit 0 precisely on a computer.
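A sketch of the sgn function with the zero tolerance applied (our own illustrative Python; the tolerance value matches the RPROP constant listed earlier):

```python
def sgn(x, zero_tolerance=1e-16):
    # Treat values within the tolerance of zero as exactly zero, since
    # floating-point operations rarely hit 0 precisely.
    if abs(x) < zero_tolerance:
        return 0
    return 1 if x > 0 else -1
```

With the tolerance, a tiny residual value such as 1e-20 is reported as 0 rather than as a positive sign.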
To determine whether the gradient has changed sign, we use Equation 7.2:

Equation 7.2: Determine Gradient Sign Change

c = g(t) · g(t-1), where g(t) is the current gradient and g(t-1) the previous gradient

Equation 7.2 will result in a constant c. We evaluate this value as negative, positive, or close to 0. A negative value for c indicates that the sign has changed. A positive value indicates that there is no change in sign for the gradient. A value near 0 indicates a very small change in sign, or almost no change in sign.
Consider the following situations for these three outcomes:

- -1 * 1 = -1 (negative, changed from negative to positive)
- 1 * 1 = 1 (positive, no change in sign)
- 1.0 * 0.000001 = 0.000001 (near zero, almost changed signs, but not quite)

Now that we have calculated the constant c, which gives some indication of sign change, we can calculate the weight change. The next section includes a discussion of this calculation.
Calculate Weight Change

Now that we have the change in sign of the gradient, we can observe what happens in each of the three cases mentioned in the previous section. Equation 7.3 summarizes these three cases:

Equation 7.3: Calculate RPROP Weight Change

This equation calculates the actual weight change for each iteration. If the value of c is positive, then the weight change will be equal to the negative of the weight update value. Similarly, if the value of c is negative, the weight change will be equal to the positive of the weight update value. Finally, if the value of c is near 0, there will be no weight change.
Modify Update Values

We use the weight update values from the previous section to update the weights of the neural network. Every weight in the neural network has a separate weight update value, which works much better than the single learning rate of backpropagation. We modify these weight update values during each training iteration, as seen in Equation 7.4:

Equation 7.4: Modify Update Values

Δ(t) = η+ · Δ(t-1) if c > 0; η- · Δ(t-1) if c < 0; Δ(t-1) if c ≈ 0

We modify the weight update values in a way that is very similar to the changes of the weights. We base these weight update values on the previously calculated value c, just like the weights.

If the value of c is positive, then we multiply the weight update value by positive η (eta). Similarly, if the value of c is negative, we multiply the weight update value by negative η (eta). Finally, if the value of c is near 0, then we don't change the weight update value.
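Putting Equations 7.2 through 7.4 together, here is a sketch of one classic-RPROP-style update for a single weight (our own illustrative code, not the book's; the handling of the sign-change case differs slightly among the RPROP variants listed earlier):

```python
def sgn(x, tol=1e-16):
    if abs(x) < tol:
        return 0
    return 1 if x > 0 else -1

def rprop_step(weight, gradient, prev_gradient, update_value,
               eta_minus=0.5, eta_plus=1.2,
               delta_min=1e-6, max_step=50.0):
    # One weight's RPROP update. c indicates whether the gradient
    # changed sign since the previous iteration (Equation 7.2).
    c = gradient * prev_gradient
    if c > 0:
        # Same sign: accelerate by growing the update value (Equation 7.4).
        update_value = min(update_value * eta_plus, max_step)
        delta = -sgn(gradient) * update_value
    elif c < 0:
        # Sign change: we overshot a minimum, so shrink the update value
        # and skip the weight change for this iteration.
        update_value = max(update_value * eta_minus, delta_min)
        delta = 0.0
        gradient = 0.0  # forces the "near zero" case next iteration
    else:
        # Near zero: keep the update value and step by the gradient's sign.
        delta = -sgn(gradient) * update_value
    return weight + delta, update_value, gradient
```

The returned update value and gradient are the values that must be stored between iterations, as described in the Data Structures section.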
The JavaScript example site for this book has examples of the RPROP update, as well as examples of the previous equations and sample calculations.
Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm (LMA) is a very efficient training method for neural networks, and in many cases LMA will outperform RPROP. As a result, every neural network programmer should consider this training algorithm. Levenberg (1944) introduced the foundation for the LMA, and Marquardt (1963) expanded its methods.
LMA is a hybrid algorithm that is based on the Gauss-Newton algorithm (GNA) and on gradient descent (backpropagation). Thus, LMA combines the strengths of GNA and backpropagation. Although gradient descent is guaranteed to converge to a local minimum, it is slow. Newton's method is fast, but it often fails to converge. By using a damping factor to interpolate between the two, we create a hybrid method. To understand how this hybrid works, we will first examine Newton's method. Equation 7.5 shows Newton's method:

Equation 7.5: Newton's Method (GNA)

w(t+1) = w(t) - H^(-1) g
You will notice several variables in the above equation. The result of the equation is a set of deltas that you can apply to the weights of the neural network. The variable H represents the Hessian, which we will discuss in the next section. The variable g represents the gradients of the neural network. You will also notice the -1 "exponent" on the variable H, which denotes the matrix inverse; in practice, we compute this product by a matrix decomposition of H and g rather than by inverting H directly.
We could easily spend an entire chapter on matrix decomposition. However, we will simply treat matrix decomposition as a black-box atomic operator for the purposes of this book. Because we will not explain how to calculate matrix decomposition, we have included a common piece of code taken from the JAMA package. Many mathematical computer applications have used this public domain code, adapted from a FORTRAN program. To perform matrix decomposition, you can use JAMA or another source.
Although several types of matrix decomposition exist, we are going to use the LU decomposition, which requires a square matrix. This decomposition works well because the Hessian matrix has the same number of rows as columns: every weight in the neural network has a row and a column. The Hessian is a matrix of second partial derivatives taken with respect to each pair of weights. The LU decomposition solves the linear system formed by the Hessian and the gradients. These gradients are the same as those that we calculated in Chapter 6, "Backpropagation Training," except that they are taken of the squared error. Because the errors are squared, we must use the sum of squares error when dealing with LMA.
Second derivative is an important term to know: it is the derivative of the first derivative. Recall from Chapter 6, "Backpropagation Training," that the derivative of a function is the slope at any point. This slope shows the direction in which the curve descends toward a local minimum. The second derivative is also a slope, and it points in a direction that minimizes the first derivative. The goal of Newton's method, as well as of the LMA, is to reduce all of the gradients to 0.

It's interesting to note that this goal does not directly involve the error. Newton's method and LMA can be oblivious to the error because they try to reduce all the gradients to 0. In reality, they are not completely oblivious to the error, because they use it to calculate the gradients.
Newton’smethodwillconvergetheweightsofaneuralnetworktoalocalminimum,alocalmaximum,orastraddleposition.Weachievethisconvergencebyminimizingallthegradients(firstderivatives)to0.Thederivativeswillbe0atlocalminima,maxima,or
straddleposition.Figure7.1showsthesethreepoints:
Figure7.1:LocalMinimum,StraddleandLocalMaximum
Thealgorithmimplementationmustensurethatlocalmaximaandstraddlepointsarefilteredout.TheabovealgorithmworksbytakingthematrixdecompositionoftheHessianmatrixandthegradients.TheHessianmatrixistypicallyestimated.SeveralmethodsexisttoestimatetheHessianmatrix.However,ifitisinaccurate,itcanharmNewton’smethod.
LMAenhancesNewton’salgorithmtothefollowingformulainEquation7.6:
Equation7.6:Levenberg–MarquardtAlgorithm
Inthisequation,weaddadampingfactormultipliedbyanidentitymatrix.Thedampingfactorisrepresentedbyλ(lambda),andIrepresentstheidentitymatrix,whichisasquarematrixwithallthevaluesat0exceptforanorthwest(NW)lineofvaluesat1.Aslambdaincreases,theHessianwillbefactoredoutoftheaboveequation.Aslambdadecreases,theHessianbecomesmoresignificantthangradientdescent,allowingthetrainingalgorithmtointerpolatebetweengradientdescentandNewton’smethod.Higherlambdafavorsgradientdescent;lowerlambdafavorsNewton.AtrainingiterationofLMAbeginswithalowlambdaandincreasesituntiladesirableoutcomeisproduced.
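Equation 7.6 can be sketched as follows. The small Gaussian-elimination solver below stands in for the LU decomposition discussed earlier (illustrative code of our own, not the JAMA routine):

```python
def solve(a, b):
    # Solve the linear system a·x = b by Gaussian elimination with
    # partial pivoting (a stand-in for the LU decomposition).
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c]
                              for c in range(r + 1, n))) / m[r][r]
    return x

def lma_weight_delta(hessian, gradients, lam):
    # Equation 7.6: solve (H + lambda·I)·delta = g for the weight deltas.
    n = len(gradients)
    damped = [[hessian[i][j] + (lam if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve(damped, gradients)
```

With lambda near 0 the deltas approach the pure Newton step; with a very large lambda each delta approaches g/lambda, a small gradient-descent-like step.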
Calculation of the Hessian

The Hessian matrix is a square matrix with rows and columns equal to the number of weights in the neural network. Each cell in this matrix represents the second-order derivative with respect to a given weight combination. Equation 7.7 shows the Hessian:

Equation 7.7: The Hessian Matrix
It is important to note that the Hessian is symmetrical about its diagonal, which you can use to enhance the performance of the calculation. Equation 7.8 calculates the gradients:

Equation 7.8: Calculating the Gradients

g_i = 2 Σ_t e_t · (∂y_t/∂w_i)

Here e_t is the error and y_t the network output for training case t. The second derivative of the above equation becomes an element of the Hessian matrix. You can use Equation 7.9 to calculate it:

Equation 7.9: Calculating the Exact Hessian

H_ij = 2 Σ_t [ (∂y_t/∂w_i)(∂y_t/∂w_j) + e_t · (∂²y_t/∂w_i∂w_j) ]
If not for the second component, you could easily calculate the above formula. This second component involves the second partial derivative, which is difficult to calculate. Fortunately, you can actually drop it, because its value does not significantly contribute to the outcome. While the second partial derivative might be important for an individual training case, its overall contribution is not significant. The second component of Equation 7.9 is multiplied by the error of that training case, and we assume that the errors in a training set are independent and evenly distributed about 0. Over an entire training set, they should essentially cancel each other out. Because we are not using all components of the second derivative, we have only an approximation of the Hessian, but it is sufficient to get a good training result.
Equation 7.10 uses this approximation, resulting in the following:

Equation 7.10: Approximating the Exact Hessian

H_ij ≈ 2 Σ_t (∂y_t/∂w_i)(∂y_t/∂w_j)
While the above equation is only an approximation of the true Hessian, the simplification of the algorithm to calculate the second derivative is well worth the loss in accuracy. In fact, an increase in λ (lambda) will account for the loss of accuracy.

To calculate the Hessian and gradients, we must determine the partial first derivatives of the output of the neural network. Once we have these partial first derivatives, the above equations allow us to easily calculate the Hessian and gradients.
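The approximation in Equation 7.10 amounts to summing products of first derivatives over the training cases. Here is a sketch (our own Python; whether the factor of 2 from the squared error is folded into the Hessian or elsewhere varies by convention, and it is omitted below):

```python
def approximate_hessian(jacobian):
    # jacobian[t][w]: derivative of the network output for training case t
    # with respect to weight w. The approximate Hessian sums the products
    # of first derivatives over all cases; the dropped second-order term
    # is the approximation discussed above.
    n = len(jacobian[0])
    h = [[0.0] * n for _ in range(n)]
    for row in jacobian:
        for i in range(n):
            for j in range(n):
                h[i][j] += row[i] * row[j]
    return h
```

Notice that the result is symmetrical about its diagonal, as the text points out for the exact Hessian.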
Calculation of the first derivatives of the output of the neural network is very similar to the process that we used to calculate the gradients for backpropagation. The main difference is that we take the derivative of the output; in standard backpropagation, we take the derivative of the error function. We will not review the entire backpropagation process here. Chapter 6, "Backpropagation Training," covers backpropagation and gradient calculation.
LMA with Multiple Outputs

Some implementations of LMA support only a single output neuron because LMA has roots in mathematical function approximation. In mathematics, functions typically return only a single value. As a result, many books and papers do not contain discussions of multiple-output LMA. However, you can use LMA with multiple outputs.

Support for multiple output neurons involves summing each cell of the Hessian as you calculate the additional output neurons. The process works as if you calculated a separate Hessian matrix for each output neuron and then summed the Hessian matrices together. Encog (Heaton, 2015) uses this approach, and it leads to fast convergence times.
You need to realize that, with multiple outputs, you will not use every connection. You will need to calculate an update for the weights of each output neuron independently. Depending on the output neuron you are currently calculating, there will be unused connections belonging to the other output neurons. Therefore, you must set the partial derivative for each of these unused connections to 0 when you are calculating the other output neurons.
For example, consider a neural network that has two output neurons and three hidden neurons. Each of these two output neurons would have a total of four connections from the hidden layer: three connections result from the three hidden neurons, and a fourth comes from the bias neuron. This segment of the neural network would resemble Figure 7.2:

Figure 7.2: Calculating Output Neuron 1

Here we are calculating output neuron 1. Notice that output neuron 2 has four connections that must have their partial derivatives treated as 0. Because we are calculating output 1 as the current neuron, it uses only its normal partial derivatives. You can repeat this process for each output neuron.
Overview of the LMA Process

So far, we have examined only the math behind LMA. To be effective, LMA must be part of an algorithm. The following steps summarize the LMA process:

1. Calculate the first derivative of the output of the neural network with respect to every weight.
2. Calculate the Hessian.
3. Calculate the gradients of the error (ESS) with respect to every weight.
4. Either set lambda to a low value (first iteration) or to the lambda of the previous iteration.
5. Save the weights of the neural network.
6. Calculate the delta weights based on the lambda, gradients, and Hessian.
7. Apply the deltas to the weights and evaluate the error.
8. If the error has improved, end the iteration.
9. If the error has not improved, increase lambda (up to a max lambda), restore the weights, and go back to step 6.

As you can see, the process for LMA revolves around setting the lambda value low and then slowly increasing it if the error rate does not improve. You must save the weights at each change in lambda so that you can restore them if the error does not improve.
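The nine steps above can be sketched for the one-weight (scalar) case, where Equation 7.6 reduces to a simple division; `gradient_fn`, `hessian_fn`, and `error_fn` are assumed callbacks supplied by the surrounding framework (our own illustrative code):

```python
def lma_iteration(weight, gradient_fn, hessian_fn, error_fn,
                  lam=0.001, lam_factor=10.0, max_lam=1e7):
    # One LMA training iteration for a single weight.
    g = gradient_fn(weight)          # steps 1-3: derivatives and gradients
    h = hessian_fn(weight)
    start_error = error_fn(weight)
    saved = weight                   # step 5: save the weights
    while lam <= max_lam:            # steps 4 and 9: the lambda loop
        delta = g / (h + lam)        # step 6: scalar form of Equation 7.6
        trial = saved - delta        # step 7: apply the delta
        if error_fn(trial) < start_error:
            return trial, lam        # step 8: error improved, end iteration
        lam *= lam_factor            # step 9: raise lambda and retry
    return saved, lam                # give up on this iteration
```

Starting with a low lambda gives the fast Newton-like step a chance first; only if that step fails to improve the error does the algorithm back off toward gradient descent.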
Chapter Summary

Resilient propagation (RPROP) solves two limitations of simple backpropagation. First, the program assigns each weight a separate learning rate, allowing the weights to learn at different speeds. Second, RPROP recognizes that while the gradient's sign is a great indicator of the direction to move the weight, the size of the gradient does not indicate how far to move. Additionally, while the programmer must determine an appropriate learning rate and momentum for backpropagation, RPROP sets similar arguments automatically.

Genetic algorithms (GAs) are another means of training neural networks. There is an entire family of neural networks that use GAs to evolve every aspect of the neural network, from the weights to the overall structure. This family includes the NEAT, CPPN and HyperNEAT neural networks that we will discuss in the next chapter. The GA used by NEAT, CPPN and HyperNEAT is not just another training algorithm, because these neural networks introduce a new architecture based on the feedforward neural networks examined so far in this book.
Chapter 8: NEAT, CPPN & HyperNEAT

- NEAT
- Genetic Algorithms
- CPPN
- HyperNEAT
In this chapter, we discuss three closely related neural network technologies: NEAT, CPPN and HyperNEAT. Kenneth Stanley's EPLEX group at the University of Central Florida conducts extensive research on all three technologies. Information about their current research can be found at the following URL:
http://eplex.cs.ucf.edu/
NeuroEvolution of Augmenting Topologies (NEAT) is an algorithm that evolves neural network structures with genetic algorithms. The compositional pattern-producing network (CPPN) is a type of evolved neural network that can create other structures, such as images or other neural networks. Hypercube-based NEAT, or HyperNEAT, a type of CPPN, also evolves other neural networks. Once HyperNEAT trains the networks, they can easily handle much higher resolutions of their dimensions.
Many different frameworks support NEAT and HyperNEAT. For Java and C#, we recommend our own Encog implementation, which can be found at the following URL:
http://www.encog.org
You can find a complete list of NEAT implementations at Kenneth Stanley's website:
http://www.cs.ucf.edu/~kstanley/neat.html
KennethStanley’swebsitealsoincludesacompletelistofHyperNEATimplementations:
http://eplex.cs.ucf.edu/hyperNEATpage/
For the remainder of this chapter, we will explore each of these three network types.
NEAT Networks

NEAT is a neural network structure developed by Stanley and Miikkulainen (2002). NEAT optimizes both the structure and weights of a neural network with a genetic algorithm (GA). The input and output of a NEAT neural network are identical to a typical feedforward neural network, as seen in previous chapters of this book.

A NEAT network starts out with only bias neurons, input neurons, and output neurons. Generally, none of the neurons have connections at the outset. Of course, a completely unconnected network is useless. NEAT makes no assumptions about whether certain input neurons are actually needed. An unneeded input is said to be statistically independent of the output. NEAT will often discover this independence by never evolving optimal genomes that connect to that statistically independent input neuron.
Another important difference between a NEAT network and an ordinary feedforward neural network is that, other than the input and output layers, NEAT networks do not have clearly defined hidden layers. The hidden neurons do not organize themselves into clearly delineated layers. One similarity between NEAT and feedforward networks is that they both use a sigmoid activation function. Figure 8.1 shows an evolved NEAT network:
Figure 8.1: NEAT Network
Input 2 in the above image never formed any connections because the evolutionary process determined that input 2 was unnecessary. A recurrent connection also exists between hidden 3 and hidden 2. Hidden 4 has a recurrent connection to itself. Overall, you will note that a NEAT network lacks a clear delineation of layers.
You can calculate a NEAT network in exactly the same way as you do for a regular weighted feedforward network. You can manage the recurrent connections by running the NEAT network multiple times. This works by having the recurrent connection input start at 0 and updating it each time you cycle through the NEAT network. Additionally, you must define a hyper-parameter to specify the number of times to calculate the NEAT network. Figure 8.2 shows recurrent link calculation when a NEAT network is instructed to cycle three times to calculate recurrent connections:
Figure 8.2: Cycling to Calculate Recurrences
The above diagram shows the outputs from each neuron, over each connection, for three cycles. The dashed lines indicate the additional connections. For simplicity, the diagram doesn't show the weights. The purpose of Figure 8.2 is to show that the recurrent output stays one cycle behind.
For the first cycle, the recurrent connection provides a 0 to the first neuron because neurons are calculated left to right. The first cycle has no value for the recurrent connection. For the second cycle, the recurrent connection now has the output 0.3, which the first cycle provided. Cycle 3 follows the same pattern, taking the 0.5 output from cycle 2 as the recurrent connection's output. Since there would be other neurons in the calculation, we have contrived these values, which the dashed arrows show at the bottom. However, Figure 8.2 does illustrate that the recurrent connections carry values from previous cycles.
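The cycling scheme can be sketched with a single contrived neuron that has a self-recurrent link. This is a minimal illustration, not Encog's code; the neuron, weights, and function names are invented for the example:

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def cycle_neuron(input_value, w_in, w_rec, cycles):
    """Return the output of a self-recurrent neuron after several cycles.

    The recurrent input starts at 0 and always lags one cycle behind the
    neuron's own output, just as described for Figure 8.2."""
    recurrent = 0.0              # cycle 1 sees 0 on the recurrent link
    output = 0.0
    for _ in range(cycles):
        output = sigmoid(input_value * w_in + recurrent * w_rec)
        recurrent = output       # the next cycle uses this cycle's output
    return output
```

On the first cycle the recurrent term contributes nothing; each later cycle feeds the previous cycle's output back in.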
NEAT networks extensively use genetic algorithms, which we examined in Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms. Although you do not need to understand genetic algorithms completely to follow the discussion of them in this chapter, you can refer to Volume 2 as needed.
NEAT uses a typical genetic algorithm that includes:
Mutation – The program chooses one fit individual to create a new individual that has a random change from its parent.
Crossover – The program chooses two fit individuals to create a new individual that has a random sampling of elements from both parents.
All genetic algorithms engage the mutation and crossover genetic operators with a population of individual solutions. Mutation and crossover choose with greater probability the solutions that receive higher scores from an objective function. We explore mutation and crossover for NEAT networks in the next two sections.
NEAT Mutation
NEAT mutation consists of several mutation operations that can be performed on the parent genome. We discuss these operations here:
Add a neuron: By selecting a random link, we can add a neuron. A new neuron and two links replace this random link. The new neuron effectively splits the link. The program selects the weights of each of the two new links to provide nearly the same effective output as the link being replaced.
Add a link: The program chooses a source and destination, or two random neurons. The new link will be between these two neurons. Bias neurons can never be a destination. Output neurons cannot be a source. There will never be two links in the same direction between the same two neurons.
Remove a link: Links can be randomly selected for removal. Hidden neurons, which are neurons that are not input, output, or the single bias neuron, can also be removed if no remaining links interact with them.
Perturb a weight: Choose a random link, and multiply its weight by a number drawn from a normal random distribution with a standard deviation of 1 or lower. A standard deviation of 1 or lower keeps most sampled multipliers close to 1, and smaller perturbations will usually cause a quicker convergence.
You can increase the probability of the weight perturbation mutation so that it occurs more frequently, thereby allowing fit genomes to vary their weights and further adapt through their children. The structural mutations happen with much less frequency. You can adjust the exact frequency of each operation with most NEAT implementations.
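Two of the mutations above can be sketched on a toy genome of (from, to, weight) link tuples. The representation and helper names here are contrived for illustration, not Encog's; the weight convention for splitting a link (1.0 in, old weight out) is a common NEAT choice that approximately preserves the link's effect:

```python
import random

def add_neuron_mutation(links, next_neuron_id, rnd=random):
    """Split a randomly chosen link with a new neuron.

    The incoming link gets weight 1.0 and the outgoing link inherits
    the old weight, so the new structure initially behaves much like
    the replaced link.  Returns the next free neuron id."""
    old = rnd.choice(links)
    links.remove(old)
    frm, to, weight = old
    links.append((frm, next_neuron_id, 1.0))      # into the new neuron
    links.append((next_neuron_id, to, weight))    # out of the new neuron
    return next_neuron_id + 1

def perturb_weight_mutation(links, sigma=1.0, rnd=random):
    """Multiply one random link's weight by a normally distributed factor."""
    i = rnd.randrange(len(links))
    frm, to, weight = links[i]
    links[i] = (frm, to, weight * rnd.gauss(1.0, sigma))
```

A real implementation would also record the innovations produced by the structural mutation, as described in the next section.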
NEAT Crossover
NEAT crossover is more complex than in many genetic algorithms because the NEAT genome encodes both the neurons and the connections of an individual. Most genetic algorithms assume that the number of genes is consistent across all genomes in the population. In fact, child genomes in NEAT that result from both mutation and crossover may have a different number of genes than their parents. Managing this discrepancy requires some ingenuity when you implement the NEAT crossover operation.
NEAT keeps a database of all the changes made to a genome through mutation. These changes are called innovations, and they exist in order to implement mutations. Each time an innovation is added, it is given an ID. These IDs are also used to order the innovations. We will see that it is important to select the innovation with the lower ID when choosing between two innovations.
It is important to realize that the relationship between innovations and mutations is not one to one. It can take several innovations to achieve one mutation. The only two types of innovation are creating a neuron and creating a link between two neurons. One mutation might result from multiple innovations. Additionally, a mutation might not have any innovations. Only mutations that add to the structure of the network will generate innovations. The following list summarizes the innovations that the previously mentioned mutation types could potentially create.
Add a neuron: One new neuron innovation and two new link innovations
Add a link: One new link innovation
Remove a link: No innovations
Perturb a weight: No innovations
You also need to note that NEAT will not recreate innovation records if you have already attempted this type of innovation. Furthermore, innovations do not contain any weight information; innovations only contain structural information.
Crossover for two genomes occurs by considering the innovations, and this trait allows NEAT to ensure that all prerequisite innovations are also present. A naïve crossover, such as those that many genetic algorithms use, would potentially combine links with nonexistent neurons. Listing 8.1 shows the entire NEAT crossover function in pseudocode:
Listing 8.1: NEAT Crossover
def neat_crossover(rnd, mom, dad):
  # Choose the best genome (by objective function); if tied, choose randomly.
  best = favor_parent(rnd, mom, dad)
  not_best = dad if best == mom else mom

  selected_links = []
  selected_neurons = []

  # Current gene index for mom and dad.
  cur_mom = 0
  cur_dad = 0

  # Add the input, output, and bias neurons; they are always present.
  always_count = mom.input_count + mom.output_count + 1
  for i from 0 to always_count - 1:
    selected_neurons.add(i, best, not_best)

  # Loop over all genes in both mother and father.
  while (cur_mom < mom.num_genes) or (cur_dad < dad.num_genes):
    # The mom and dad gene objects.
    mom_gene = None
    mom_innovation_id = -1
    dad_gene = None
    dad_innovation_id = -1
    selected_gene = None

    # Grab the actual objects from mom and dad for the specified
    # indexes; if there are none, leave them as None.
    if cur_mom < mom.num_genes:
      mom_gene = mom.links[cur_mom]
      mom_innovation_id = mom_gene.innovation_id
    if cur_dad < dad.num_genes:
      dad_gene = dad.links[cur_dad]
      dad_innovation_id = dad_gene.innovation_id

    # Now select a gene from mom or dad. This gene is for the baby.
    # Dad gene only; mom has run out.
    if mom_gene == None and dad_gene != None:
      cur_dad = cur_dad + 1
      selected_gene = dad_gene
    # Mom gene only; dad has run out.
    elif dad_gene == None and mom_gene != None:
      cur_mom = cur_mom + 1
      selected_gene = mom_gene
    # Mom has the lower innovation number.
    elif mom_innovation_id < dad_innovation_id:
      cur_mom = cur_mom + 1
      if best == mom:
        selected_gene = mom_gene
    # Dad has the lower innovation number.
    elif dad_innovation_id < mom_innovation_id:
      cur_dad = cur_dad + 1
      if best == dad:
        selected_gene = dad_gene
    # Mom and dad have the same innovation number.
    # Flip a coin.
    elif dad_innovation_id == mom_innovation_id:
      cur_dad = cur_dad + 1
      cur_mom = cur_mom + 1
      if rnd.next_double() > 0.5:
        selected_gene = dad_gene
      else:
        selected_gene = mom_gene

    # If a gene was chosen for the child, then process it.
    # If not, the loop continues.
    if selected_gene != None:
      # Do not add the same innovation twice in a row.
      if selected_links.count == 0:
        selected_links.add(selected_gene)
      elif selected_links[selected_links.count - 1]
          .innovation_id != selected_gene.innovation_id:
        selected_links.add(selected_gene)

      # Check if we already have the neurons referred to in
      # selected_gene. If not, they need to be added.
      selected_neurons.add(
        selected_gene.from_neuron_id, best, not_best)
      selected_neurons.add(
        selected_gene.to_neuron_id, best, not_best)

  # Done looping over the parents' genes.
  baby = new NEATGenome(selected_links, selected_neurons)
  return baby
The above implementation of crossover is based on the NEAT crossover operator implemented in Encog. We provide the above comments in order to explain the critical sections of code. The primary evolution occurs on the links contained in the mother and father. Any neurons needed to support these links are brought along when the child genome is created. The code contains a main loop that iterates over both parents, selecting the most suitable link gene from each. The link genes from both parents are essentially stitched together. Because the parents might be different lengths, one will likely exhaust its genes before this process is complete.
Each time through the loop, a gene is chosen from either the mother or father according to the following criteria:
If mom or dad has run out, choose the other. Move past the chosen gene.
If mom has a lower innovation ID number, choose mom if she has the best score. In either case, move past mom's gene.
If dad has a lower innovation ID number, choose dad if he has the best score. In either case, move past dad's gene.
If mom and dad have the same innovation ID, pick one randomly, and move past their gene.
You can consider that the mother and father's genes are both on a long tape. A marker for each tape holds the current position. According to the rules above, the marker will move past a parent's gene. At some point, each parent's marker moves to the end of the tape, and that parent runs out of genes.
NEAT Speciation
Crossover is tricky for computers to perform properly. In the animal and plant kingdoms, crossover occurs only between members of the same species. What exactly do we mean by species? In biology, scientists define a species as members of a population that can produce viable offspring. Therefore, a crossover between a horse and hummingbird genome would be catastrophically unsuccessful. Yet a naive genetic algorithm would certainly try something just as disastrous with artificial computer genomes!
The NEAT speciation algorithm has several variants. In fact, one of the most advanced variants can group the population into a predefined number of clusters with a type of k-means clustering. You can subsequently determine the relative fitness of each species. The program gives each species a percentage of the next generation's population count. The members of each species then compete in virtual tournaments to determine which members of the species will be involved in crossover and mutation for the next generation.
A tournament is an effective way to select parents from a species. The program performs a certain number of trials; typically, we use five. For each trial, the program selects two random genomes from the species, and the fitter of the two advances to the next trial. This process is very efficient for threading, and it is also biologically plausible. The advantage of this selection method is that the winner doesn't have to beat the best genome in the species; it only has to beat the best genome encountered in its trials. You must run a tournament for each parent needed. Mutation requires one parent, and crossover needs two parents.
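One common form of tournament selection can be sketched as follows. The genome representation and the `fitness` callable are placeholders, and this is an illustration, not Encog's exact code:

```python
import random

def tournament_select(members, fitness, trials=5, rnd=random):
    """Pick a parent from a species by running a series of trials.

    Each trial draws a random member of the species, and the fitter of
    the reigning winner and the challenger advances to the next trial.
    The final winner only has to beat the genomes it happened to meet,
    not the best genome in the species."""
    best = rnd.choice(members)
    for _ in range(trials - 1):
        challenger = rnd.choice(members)
        if fitness(challenger) > fitness(best):
            best = challenger
    return best
```

Because each call touches only a handful of genomes, many tournaments can run in parallel, which is why this method threads so well.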
In addition to the trials, several other factors determine the species members chosen for mutation and crossover. The algorithm will always carry one or more elite genomes to the next generation. The number of elite genomes is configurable. The program gives younger genomes a bonus so they have a chance to try new innovations. Interspecies crossover will occur with a very low probability.
All of these factors together make NEAT a very effective neural network type. NEAT removes the need to define how the hidden layers of a neural network are structured. The absence of a strict structure of hidden layers allows NEAT neural networks to evolve the connections that are actually needed.
CPPN Networks
The compositional pattern-producing network (CPPN) was invented by Stanley (2007) and is a variation of the artificial neural network. CPPN recognizes one biologically plausible fact: in nature, genotypes and phenotypes are not identical. The genotype is the DNA blueprint for an organism. The phenotype is what actually results from that plan.
In nature, the genome is the set of instructions for producing a phenotype that is much more complex than the genotype. In the original NEAT, as seen in the last section, the genome describes link for link and neuron for neuron how to produce the phenotype. However, CPPN is different because it creates a population of special NEAT genomes. These genomes are special in two ways. First, CPPN doesn't have the limitations of regular NEAT, which always uses a sigmoid activation function. CPPN can use any of the following activation functions:
Clipped linear
Bipolar steepened sigmoid
Gaussian
Sine
Others you might define
You can see these activation functions in Figure 8.3:
Figure 8.3: CPPN Activation Functions
The second difference is that the NEAT networks produced by these genomes are not the final product. They are not the phenotype. However, these NEAT genomes do know how to create the final product.
The final phenotype is a regular NEAT network with a sigmoid activation function. We can use the activation functions listed above only for the genomes. The ultimate phenotype always has a sigmoid activation function.
CPPN Phenotype
CPPNs are typically used in conjunction with images, as the CPPN phenotype is usually an image. Though images are the usual product of a CPPN, the only real requirement is that the CPPN compose something, thereby earning its name of compositional pattern-producing network. There are cases where a CPPN does not produce an image. The most popular non-image-producing CPPN is HyperNEAT, which is discussed in the next section.
Creating a genome neural network to produce a phenotype neural network is a complex but worthwhile endeavor. Because we are dealing with a large number of input and output neurons, the training times can be considerable. However, CPPNs are scalable and can reduce the training times.
Once you have evolved a CPPN to create an image, the size of the image (the phenotype) does not matter. It can be 320x200, 640x480 or some other resolution altogether. The image phenotype, generated by the CPPN, will grow to the size needed. As we will see in the next section, CPPNs give HyperNEAT the same sort of scalability.
We will now look at how a CPPN, which is itself a NEAT network, produces an image, or the final phenotype. The NEAT CPPN should have three input values: the coordinate on the horizontal axis (x), the coordinate on the vertical axis (y), and the distance of the current coordinate from the center (d). Inputting d provides a bias towards symmetry. In biological genomes, symmetry is important. The output from the CPPN corresponds to the pixel color at the x-coordinate and y-coordinate. The CPPN specification only determines how to process a grayscale image with a single output that indicates intensity. For a full-color image, you could use output neurons for red, green, and blue. Figure 8.4 shows a CPPN for images:
Figure 8.4: CPPN for Images
You can query the above CPPN for every x-coordinate and y-coordinate needed. Listing 8.2 shows the pseudocode that you can use to generate the phenotype:
Listing 8.2: Generate CPPN Image
def render_cppn(net, bitmap):
  for y from 1 to bitmap.height:
    for x from 1 to bitmap.width:
      # Normalize x and y to -1..+1.
      norm_x = (2 * (x / bitmap.width)) - 1
      norm_y = (2 * (y / bitmap.height)) - 1
      # Distance from the center.
      d = sqrt((norm_x / 2)^2 + (norm_y / 2)^2)
      # Call the CPPN with the normalized coordinates.
      input = [norm_x, norm_y, d]
      color = net.compute(input)
      # Output the pixel.
      bitmap.plot(x - 1, y - 1, color)
The above code simply loops over every pixel and queries the CPPN for the color at that location. The x-coordinate and y-coordinate are normalized to between -1 and +1. You can see this process in action at the Picbreeder website at the following URL:
http://picbreeder.org/
Depending on the complexity of the CPPN, this process can produce images similar to Figure 8.5:
Figure 8.5: A CPPN-Produced Image (picbreeder.org)
Picbreeder allows you to select one or more parents to contribute to the next generation. We selected the image that resembles a mouth, as well as the image to the right. Figure 8.6 shows the subsequent generation that Picbreeder produced.
Figure 8.6: A CPPN-Produced Image (picbreeder.org)
CPPN networks handle symmetry much as the human genome does. With two hands, two kidneys, two feet, and other body part pairs, the human genome seems to have a hierarchy of repeated features. Separate instructions for creating each eye or each patch of tissue do not exist. Fundamentally, the human genome does not have to describe every detail of an adult human being. Rather, the human genome only has to describe how to build an adult human being by generalizing many of the steps. This greatly simplifies the amount of information that is needed in a genome.
Another great feature of the image CPPN is that you can create the above images at any resolution and without retraining. Because the x-coordinate and y-coordinate are normalized to between -1 and +1, you can use any resolution.
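To make this resolution independence concrete, the sketch below renders a toy stand-in CPPN (a contrived function of x, y, and d, not an evolved network) at two resolutions, using the normalization from Listing 8.2:

```python
from math import sqrt

def toy_cppn(x, y, d):
    """A contrived stand-in for an evolved CPPN: a filled disc."""
    return 1.0 if d < 0.5 else 0.0

def render(cppn, width, height):
    """Render a CPPN to a width x height grid of intensities."""
    image = []
    for yy in range(1, height + 1):
        row = []
        for xx in range(1, width + 1):
            # Normalize coordinates to -1..+1 as in Listing 8.2.
            norm_x = (2.0 * xx / width) - 1.0
            norm_y = (2.0 * yy / height) - 1.0
            d = sqrt((norm_x / 2) ** 2 + (norm_y / 2) ** 2)
            row.append(cppn(norm_x, norm_y, d))
        image.append(row)
    return image
```

Rendering the same function at 8x8 and at 16x16 produces the same disc; only the pixel density changes, with no retraining of any kind.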
HyperNEAT Networks
HyperNEAT networks, invented by Stanley, D'Ambrosio, & Gauci (2009), are based upon the CPPN; however, instead of producing an image, a HyperNEAT network creates another neural network. Just like the CPPN in the last section, HyperNEAT can easily create much higher resolution neural networks without retraining.
HyperNEAT Substrate
One interesting hyper-parameter of a HyperNEAT network is the substrate, which defines the network's structure. A substrate defines the x-coordinate and the y-coordinate for the input and output neurons. Standard HyperNEAT networks usually employ two planes to implement the substrate. Figure 8.7 shows the sandwich substrate, one of the most common substrates:
Figure 8.7: HyperNEAT Sandwich Substrate
Together with the above substrate, a HyperNEAT CPPN is capable of creating the phenotype neural network. The source plane contains the input neurons, and the target plane contains the output neurons. The x-coordinate and the y-coordinate for each are in the -1 to +1 range. There can potentially be a weight between each of the source neurons and every target neuron. Figure 8.8 shows how to query the CPPN to determine these weights:
Figure 8.8: CPPN for HyperNEAT
The input to the CPPN consists of four values: x1, y1, x2, and y2. The first two values, x1 and y1, specify a neuron on the source plane. The second two values, x2 and y2, specify a neuron on the target plane. HyperNEAT allows the presence of as many different input and output neurons as desired, without retraining. Just as the CPPN image could map more and more pixels between -1 and +1, so too can HyperNEAT pack in more input and output neurons.
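The querying process above can be sketched as follows. Here `cppn` is a stand-in callable for the evolved network, and pruning weights below a threshold is a common HyperNEAT convention rather than a detail from this text:

```python
def grid(n):
    """n x n plane coordinates, normalized to the -1..+1 square."""
    if n == 1:
        return [(0.0, 0.0)]
    step = 2.0 / (n - 1)
    return [(-1.0 + ix * step, -1.0 + iy * step)
            for iy in range(n) for ix in range(n)]

def build_weights(cppn, n, threshold=0.2):
    """Query the CPPN for every source/target neuron pair.

    Each query passes (x1, y1) for the source neuron and (x2, y2) for
    the target neuron; small outputs are pruned, leaving no connection."""
    source = grid(n)
    target = grid(n)
    weights = {}
    for i, (x1, y1) in enumerate(source):
        for j, (x2, y2) in enumerate(target):
            w = cppn(x1, y1, x2, y2)
            if abs(w) > threshold:
                weights[(i, j)] = w
    return weights
```

Changing `n` changes the resolution of the phenotype without touching the CPPN itself, which is exactly the scalability described above.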
HyperNEAT Computer Vision
Computer vision is a great application of HyperNEAT, as demonstrated by the boxes experiment provided in the original HyperNEAT paper by Stanley et al. (2009). This experiment placed two rectangles in a computer's vision field. Of these two rectangles, one is always larger than the other. The neural network is trained to place a red rectangle near the center of the larger rectangle. Figure 8.9 shows this experiment running under the Encog framework:
Figure 8.9: Boxes Experiment (11x11 resolution)
As you can see from the above image, the red rectangle is placed directly inside of the larger of the two rectangles. The "New Case" button can be pressed to move the rectangles, and the program correctly finds the larger rectangle. While this works quite well at 11x11, the size can be increased to 33x33. With the larger size, no retraining is needed, as shown in Figure 8.10:
Figure 8.10: Boxes Experiment (33x33 resolution)
When the dimensions are increased to 33x33, the neural network is still able to place the red square inside of the larger rectangle.
The above example uses a sandwich substrate with the input and output planes both equal to the size of the visual field, in this case 33x33. The input plane provides the visual field. The neuron in the output plane with the highest output is the program's guess at the center of the larger rectangle. The fact that the position of the large rectangle does not confuse the network shows that HyperNEAT possesses some of the same features as the convolutional neural networks that we will see in Chapter 10, "Convolutional Networks."
Chapter Summary
This chapter introduced NEAT, CPPN, and HyperNEAT. Kenneth Stanley's EPLEX group at the University of Central Florida extensively researches all three technologies. NeuroEvolution of Augmenting Topologies (NEAT) is an algorithm that uses genetic algorithms to automatically evolve neural network structures. Often the decision of the structure of a neural network can be one of the most complex aspects of neural network design. NEAT neural networks can evolve their own structure and even decide what input features are important.
The compositional pattern-producing network (CPPN) is a type of neural network that is evolved to create other structures, such as images or other neural networks. Image generation is a common task for CPPNs. The Picbreeder website allows new images to be bred based on previous images generated at this site. CPPNs can generate more than just images. The HyperNEAT algorithm is an application of CPPNs for producing neural networks.
Hypercube-based NEAT, or HyperNEAT, is a type of CPPN that evolves other neural networks that can easily handle much higher resolutions along their dimensions once they are trained. HyperNEAT allows a CPPN to be evolved that can create neural networks. Being able to generate the neural network allows you to introduce symmetry, and it gives you the ability to change the resolution of the problem without retraining.
Neural networks have risen and declined in popularity several times since their introduction. Currently, there is interest in neural networks that use deep learning. In fact, deep learning involves several different concepts. The next chapter introduces deep neural networks, and we expand this topic throughout the remainder of this book.
Chapter 9: Deep Learning

Convolutional Neural Networks & Dropout
Tools for Deep Learning
Contrastive Divergence
Gibbs Sampling
Deep learning is a relatively new advancement in neural network programming and represents a way to train deep neural networks. Essentially, any neural network with more than two layers is deep. The ability to create deep neural networks has existed since Pitts (1943) introduced the multilayer perceptron. However, we were not able to train these networks effectively until Hinton (1984) became the first researcher to successfully train these complex neural networks.
Deep Learning Components
Deep learning comprises a number of different technologies, and this chapter is an overview of these technologies. Subsequent chapters will contain more information on them. Deep learning typically includes the following features:
Partially labeled data
Rectified linear units (ReLU)
Convolutional neural networks
Dropout
The succeeding sections provide an overview of these technologies.
Partially Labeled Data
Most learning algorithms are either supervised or unsupervised. Supervised training datasets provide an expected outcome, called a label, for each data item. Unsupervised training datasets do not provide an expected outcome. The problem is that most datasets are a mixture of labeled and unlabeled data items.
To understand the difference between labeled and unlabeled data, consider the following real-life example. When you were a child, you probably saw many vehicles as you grew up. Early in your life, you did not know if you were seeing a car, truck, or van. You simply knew that you were seeing some sort of vehicle. You can consider this exposure as the unsupervised part of your vehicle-learning journey. At that point, you learned commonalities of features among these vehicles.
Later in your learning journey, you were given labels. As you encountered different vehicles, an adult told you that you were looking at a car, truck, or van. The unsupervised training created your foundation, and you built upon that knowledge. As you can see, supervised and unsupervised learning are very common in real life. In its own way, deep learning does well with a combination of unsupervised and supervised learning data.
Some deep learning architectures handle partially labeled data and initialize the weights by using the entire training set without the outcomes. You can independently train the individual layers without the labels. Because you can train the layers in parallel, this process is scalable. Once the unsupervised phase has initialized these weights, the supervised phase can tweak them.
Rectified Linear Units
The rectified linear unit (ReLU) has become the standard activation function for the hidden layers of a deep neural network. However, the restricted Boltzmann machine (RBM) is the standard for the deep belief neural network (DBNN). In addition to the ReLU activation functions for the hidden layers, deep neural networks will use a linear or softmax activation function for the output layer, depending on whether the neural network supports regression or classification. We introduced ReLUs in Chapter 1, "Neural Network Basics," and expanded upon this information in Chapter 6, "Backpropagation Training."
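Both activation functions mentioned here fit in a few lines of code. This is a minimal sketch for illustration, not any particular framework's implementation:

```python
from math import exp

def relu(x):
    """Rectified linear unit, typical for hidden layers."""
    return max(0.0, x)

def softmax(values):
    """Softmax for a classification output layer: the outputs are
    positive and sum to 1, so they can be read as class probabilities."""
    m = max(values)                       # subtract max for stability
    exps = [exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

A regression network would instead use a linear output, i.e. the raw weighted sum with no activation applied.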
Convolutional Neural Networks
Convolution is an important technology that is often combined with deep learning. Hinton (2014) introduced convolution to allow image-recognition networks to function similarly to biological systems and achieve more accurate results. One approach is sparse connectivity, in which we do not create every possible weight. Figure 9.1 shows sparse connectivity:
Figure 9.1: Sparse Connectivity
A regular feedforward neural network usually creates every possible weight connection between two layers. In deep learning terminology, we refer to these layers as dense layers. In addition to not representing every weight possible, convolutional neural networks will also share weights, as seen in Figure 9.2:
Figure 9.2: Shared Weights
As you can see in the above figure, the neurons share only three individual weights. The red (solid), black (dashed), and blue (dotted) lines indicate the individual weights. Sharing weights allows the program to store complex structures while maintaining memory and computation efficiency.
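Both ideas, sparse connectivity and shared weights, can be seen in a one-dimensional convolution: each output connects to only three inputs, and every output reuses the same three kernel weights instead of a dense weight matrix. A sketch for illustration:

```python
def conv1d(inputs, kernel):
    """Slide the kernel across the inputs.

    Each output neuron sees only len(kernel) inputs (sparse), and all
    output neurons reuse the same kernel weights (shared)."""
    k = len(kernel)
    return [sum(inputs[i + j] * kernel[j] for j in range(k))
            for i in range(len(inputs) - k + 1)]
```

A dense layer connecting 4 inputs to 2 outputs would need 8 weights; the 3-weight kernel above produces the same 2 outputs from only 3 stored weights.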
This section presented an overview of convolutional neural networks. Chapter 10, "Convolutional Neural Networks," is devoted entirely to this network type.
Neuron Dropout
Dropout is a regularization technique that holds many benefits for deep learning. Like most regularization techniques, dropout can prevent overfitting. You can also apply dropout to a neural network in a layer-by-layer fashion, as you do in convolution. You must designate a single layer as a dropout layer. In fact, you can mix these dropout layers with regular layers and convolutional layers in the neural network. Never mix the dropout and convolutional layers within a single layer.
Hinton (2012) introduced dropout as a simple and effective regularization algorithm to reduce overfitting. Dropout works by removing certain neurons in the dropout layer. The act of dropping these neurons prevents other neurons from becoming overly dependent on the dropped neurons. The program removes these chosen neurons, along with all of their connections. Figure 9.3 illustrates this process:
Figure 9.3: Dropout Layer
From left to right, the above neural network contains an input layer, a dropout layer, and an output layer. The dropout layer has removed several of the neurons. The circles, made of dotted lines, indicate the neurons that the dropout algorithm removed. The dashed connector lines indicate the weights that the dropout algorithm removed when it eliminated the neurons.
Both dropout and other forms of regularization are extensive topics in the field of neural networks. Chapter 12, "Dropout and Regularization," covers regularization with particular focus on dropout. That chapter also contains an explanation of the L1 and L2 regularization algorithms. L1 and L2 discourage neural networks from the excessive use of large weights and the inclusion of certain irrelevant inputs. Essentially, a single neural network commonly uses dropout as well as other regularization algorithms.
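The dropping of neurons can be sketched on a single layer's activations. Note that the original dropout paper scales the weights at test time; the "inverted" variant below instead scales the surviving activations during training, which is equivalent in expectation. A sketch, not the book's code:

```python
import random

def dropout(activations, p_drop, training, rnd=random):
    """Inverted dropout on one layer's activations.

    During training, each activation is zeroed with probability p_drop,
    and survivors are scaled by 1/(1 - p_drop) so the expected output
    is unchanged.  At test time, values pass through untouched."""
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rnd.random() < keep else 0.0
            for a in activations]
```

Because a different random subset is dropped on every training pass, no downstream neuron can rely on any single upstream neuron being present.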
GPU Training
Hinton (1987) introduced a very novel way to train the deep belief neural network (DBNN) efficiently. We examine this algorithm and DBNNs later in this chapter. As mentioned previously, deep neural networks have existed almost as long as the neural network. However, until Hinton's algorithm, no effective way to train deep neural networks existed. The backpropagation algorithms are very slow, and the vanishing gradient problem hinders the training.
The graphics processing unit (GPU), the part of the computer that is responsible for graphics display, is the way that researchers solved the training problem of feedforward neural networks. Most of us are familiar with GPUs because of modern video games that utilize 3D graphics. Rendering these graphical images is mathematically intense, and, to perform these operations, early computers relied on the central processing unit (CPU). However, this approach was not effective. The graphics systems in modern video games required dedicated circuitry, which became the GPU, or video card. Essentially, modern GPUs are computers that function within your computer.
As researchers discovered, the processing power contained in a GPU can be harnessed for mathematically intense tasks, such as neural network training. We refer to this utilization of the GPU for general computing tasks, aside from computer graphics, as general-purpose use of the GPU (GPGPU). When applied to deep learning, the GPU performs extraordinarily well. Combining it with ReLU activation functions, regularization, and regular backpropagation can produce amazing results.
However, GPGPU can be difficult to use. Programs written for the GPU must employ a very low-level programming language called C99. This language is very similar to the regular C programming language. However, in many ways, the C99 required by the GPU is much more difficult than regular C programming. Furthermore, GPUs are good only at certain tasks, and even tasks conducive to the GPU can be hard because optimizing the C99 code is challenging. GPUs must balance several classes of memory, registers, and the synchronization of hundreds of processor cores. Additionally, GPU processing has two competing standards, CUDA and OpenCL. Two standards create more for the programmer to learn.
Fortunately, you do not need to learn GPU programming to exploit its processing power. Unless you are willing to devote a considerable amount of effort to learning the nuances of a complex and evolving field, we do not recommend that you learn to program the GPU because it is quite different from CPU programming. Good techniques that produce efficient, CPU-based programs will often produce horribly inefficient GPU programs. The reverse is also true. If you would like to use the GPU, you should work with an off-the-shelf package that supports it. If your needs do not fit into a deep learning package, you might consider using a linear algebra package, such as CUBLAS, which contains many highly optimized algorithms for the sorts of linear algebra that machine learning commonly requires.
The processing power of a highly optimized framework for deep learning and a fast GPU can be amazing. GPUs can achieve outstanding results based on sheer processing power. In 2010, the Swiss AI Lab IDSIA showed that, despite the vanishing gradient problem, the superior processing power of GPUs made backpropagation feasible for deep feedforward neural networks (Ciresan et al., 2010). The method outperformed all other machine learning techniques on the famous MNIST handwritten digit problem.
Tools for Deep Learning
One of the primary challenges of deep learning is the processing time to train a network. We often run training algorithms for many hours, or even days, seeking neural networks that fit well to the datasets. We use several frameworks for our research and predictive modeling. The examples in this book also utilize these frameworks, and we will present all of these algorithms in sufficient detail for you to create your own implementation. However, unless your goal is to conduct research to enhance deep learning itself, you are best served by working with an established framework. Most of these frameworks are tuned to train very quickly.
We can divide the examples from this book into two groups. The first group shows you how to implement a neural network or a training algorithm. Most of the examples in this book are of this type, and we examine each algorithm at its lowest level.
Application examples are the second type of example contained in this book. These higher-level examples show how to use neural network and deep learning algorithms. These examples will usually utilize one of the frameworks discussed in this section. In this way, the book strikes a balance between theory and real-world application.
H2O
H2O is a machine learning framework that supports a wide variety of programming languages. Though H2O is implemented in Java, it is designed as a web service. H2O can be used with R, Python, Scala, Java, and any language that can communicate with H2O's REST API.
Additionally, H2O can be used with Apache Spark for big data and big compute operations. The Sparkling Water package allows H2O to run large models in memory across a grid of computers. For more information about H2O, refer to the following URL:
http://0xdata.com/product/deep-learning/
In addition to deep learning, H2O supports a variety of other machine learning models, such as logistic regression, decision trees, and gradient boosting.
Theano
Theano is a mathematical package for Python, similar to the widely used Python package Numpy (Bergstra, Breuleux, Bastien, et al., 2012). Like Numpy, Theano primarily targets mathematics. Though Theano does not directly implement deep neural networks, it provides all of the mathematical tools necessary for the programmer to create deep neural network applications. Theano also directly supports GPGPU. You can find the Theano package at the following URL:
http://deeplearning.net/software/theano/
The creators of Theano also wrote an extensive tutorial for deep learning using Theano, which can be found at the following URL:
http://deeplearning.net/
Lasagne and Nolearn
Because Theano does not directly support deep learning, several packages have been built upon Theano to make it easy for the programmer to implement deep learning. One pair of packages, often used together, is Lasagne and Nolearn. Nolearn is a package for Python that provides abstractions around several machine learning algorithms. In this way, Nolearn is similar to the popular framework Scikit-Learn. While Scikit-Learn focuses widely on machine learning, Nolearn specializes in neural networks. One of the neural network packages supported by Nolearn is Lasagne, which provides deep learning and can be found at the following URL:
https://pypi.python.org/pypi/Lasagne/0.1dev
You can access the Nolearn package at the following URL:
https://github.com/dnouri/nolearn
The deep learning framework Lasagne takes its name from the Italian food lasagna. The spellings "lasagne" and "lasagna" are both considered valid for the Italian food. In the Italian language, "lasagna" is singular, and "lasagne" is the plural form. Regardless of the spelling used, lasagna is a good name for a deep learning framework. Figure 9.4 shows that, like a deep neural network, lasagna is made up of many layers:
Figure 9.4: Lasagna Layers
ConvNetJS
Deep learning support has also been created for JavaScript. The ConvNetJS package implements many deep learning algorithms, particularly in the area of convolutional neural networks. ConvNetJS primarily targets the creation of deep learning examples on websites. We used ConvNetJS to provide many of the deep learning JavaScript examples on this book's website:
http://cs.stanford.edu/people/karpathy/convnetjs/
Deep Belief Neural Networks

The deep belief neural network (DBNN) was one of the first applications of deep learning. A DBNN is simply a regular belief network with many layers. Belief networks, introduced by Neal in 1992, are different from regular feedforward neural networks. Hinton (2007) describes DBNNs as "probabilistic generative models that are composed of multiple layers of stochastic, latent variables." Because this technical description is complicated, we will define some terms.
- Probabilistic: DBNNs are used to classify, and their output is the probability that an input belongs to each class.
- Generative: DBNNs can produce plausible, randomly created values for the input values. Some DBNN literature refers to this trait as dreaming.
- Multiple layers: Like a neural network, DBNNs can be made of multiple layers.
- Stochastic, latent variables: DBNNs are made up of Boltzmann machines that produce random (stochastic) values that cannot be directly observed (latent).
The primary differences between a DBNN and a feedforward neural network (FFNN) are summarized as follows:

- Input to a DBNN must be binary; input to a FFNN is a decimal number.
- The output from a DBNN is the class to which the input belongs; the output from a FFNN can be a class (classification) or a numeric prediction (regression).
- DBNNs can generate plausible input based on a given outcome. FFNNs cannot perform like the DBNNs.
These are important differences. The first bullet item is one of the most limiting factors of DBNNs. The fact that a DBNN can accept only binary input often severely limits the type of problem that it can tackle. You also need to note that a DBNN can be used only for classification and not for regression. In other words, a DBNN could classify stocks into categories such as buy, hold, or sell; however, it could not provide a numeric prediction about the stock, such as the amount that may be attained over the next 30 days. If you need any of these features, you should consider a regular deep feedforward network.

Compared to feedforward neural networks, DBNNs may initially seem somewhat restrictive. However, they do have the ability to generate plausible input cases based on a given output. One of the earliest DBNN experiments was to have a DBNN classify ten digits, using handwritten samples. These digits were from the classic MNIST handwritten digits dataset that was included in this book's introduction. Once the DBNN is trained on the MNIST digits, it can produce new representations of each digit, as seen in Figure 9.5:

Figure 9.5: DBNN Dreaming of Digits

The above digits were taken from Hinton's (2006) deep learning paper. The first row shows a variety of different zeros that the DBNN generated from its training data.
The restricted Boltzmann machine (RBM) is the center of the DBNN. Input provided to the DBNN passes through a series of stacked RBMs that make up the layers of the network. Creating additional RBM layers produces deeper DBNNs. Though RBMs are unsupervised, the desire is for the resulting DBNN to be supervised. To accomplish the supervision, a final logistic regression layer is added to distinguish one class from another. In the case of Hinton's experiment, shown in Figure 9.6, the classes are the ten digits:

Figure 9.6: Deep Belief Neural Network (DBNN)

The above diagram shows a DBNN that uses the same hyper-parameters as Hinton's experiment. Hyper-parameters specify the architecture of a neural network, such as the number of layers, hidden neuron counts, and other settings. Each of the digit images presented to the DBNN is 28x28 pixels, or a vector of 784 pixels. The digits are monochrome (black & white), so these 784 pixels are single bits and are thus compatible with the DBNN's requirement that all input be binary. The above network has three layers of stacked RBMs, containing 500 neurons, a second 500-neuron layer, and 2,000 neurons, respectively.
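Preparing such an input can be sketched in a few lines of NumPy. This is only an illustration; the sample image and the threshold of 127 are invented for the sketch:

```python
import numpy as np

# A hypothetical 28x28 grayscale digit image with values in [0, 255].
image = np.zeros((28, 28))
image[10:18, 12:16] = 200  # a crude vertical stroke

# Threshold to single bits, then flatten to the 784-element binary
# vector that a DBNN requires as input.
binary_vector = (image.flatten() > 127).astype(int)

print(binary_vector.shape)  # (784,)
```

For a truly monochrome image, the threshold is irrelevant; it only matters when binarizing grayscale data.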
The following sections discuss a number of algorithms used to implement DBNNs.
Restricted Boltzmann Machines

Because Chapter 3, "Hopfield & Boltzmann Machines," includes a discussion of Boltzmann machines, we will not repeat this material here. This chapter deals with the restricted version of the Boltzmann machine and stacking these RBMs to achieve depth. Figure 2.10, from Chapter 3, shows an RBM. The primary difference with an RBM is that the only connections are between the visible (input) neurons and the hidden (output) neurons. In the case of a stacked RBM, the hidden units become the input to the next layer. Figure 9.7 shows how two Boltzmann machines are stacked:
Figure 9.7: Stacked RBMs

We can calculate the output from an RBM exactly as shown in Chapter 3, "Hopfield & Boltzmann Machines," in Equation 3.6. The only difference is that we now have two Boltzmann machines stacked. The first Boltzmann machine receives three inputs passed to its visible units. The hidden units pass their output directly to the two inputs (visible units) of the second RBM. Notice that there are no weights between the two RBMs; the output from the H1 and H2 units in RBM 1 passes directly to I1 and I2 of RBM 2.
Training a DBNN

The process of training a DBNN requires a number of steps. Although the mathematics behind this process can become somewhat complex, you don't need to understand every detail of DBNN training in order to use DBNNs. You just need to know the following key points:
- DBNNs undergo supervised and unsupervised training. During the unsupervised portion, the DBNN uses training data without their labels, which allows DBNNs to use a mix of labeled and unlabeled data. During the supervised portion, only training data with labels are used.
- Each DBNN layer is trained independently during the unsupervised portion. It is possible to train the DBNN layers concurrently (with threads) during the unsupervised portion.
- After the unsupervised portion is complete, the output from the layers is refined with supervised logistic regression. The top logistic regression layer predicts the class to which the input belongs.
Armed with this knowledge, you can skip ahead to the deep belief classification example in this chapter. However, if you wish to learn the specific details of DBNN training, read on.

Figure 9.8 provides a summary of the steps of DBNN training:

Figure 9.8: DBNN Training

Layer-Wise Sampling
The first step when performing unsupervised training on an individual layer is to calculate all values of the DBNN up to that layer. You will do this calculation for every training set element, and the DBNN will provide you with sampled values at the layer that you are currently training. Sampled refers to the fact that the neural network randomly chooses a true/false value based on a probability.

You need to understand that sampling uses random numbers to provide your results. Because of this randomness, you will not always get the same result. If the DBNN determines that a hidden neuron's probability of true is 0.75, then you will get a value of true 75% of the time. Layer-wise sampling is very similar to the method that we used to calculate the output of Boltzmann machines in Chapter 3, "Hopfield & Boltzmann Machines." We will use Equation 3.6 from Chapter 3 to compute the probability. The only difference is that we will use the probability given by Equation 3.6 to generate a random sample.
The purpose of the layer-wise sampling is to produce a binary vector to feed into the contrastive divergence algorithm. When training each RBM, we always provide the output of the previous RBM as the input to the current RBM. If we are training the first RBM (closest to the input), we simply use the training input vector for contrastive divergence. This process allows each of the RBMs to be trained. The final softmax layer of the DBNN is not trained during the unsupervised phase. The final logistic regression phase will train the softmax layer.
Computing Positive Gradients

Once the layer-wise training has processed each of the RBM layers, we can utilize the up-down algorithm, also known as the contrastive divergence algorithm. This complete algorithm includes the following steps, covered in the next sections of this book:
- Computing Positive Gradients
- Gibbs Sampling
- Update Weights and Biases
- Supervised Backpropagation
Like many of the gradient-descent-based algorithms presented in Chapter 6, "Backpropagation Training," the contrastive divergence algorithm is also based on gradient descent. It uses the derivative of a function to find the inputs to the function that produce the lowest output for that function. Several different gradients are estimated during contrastive divergence. We can use these estimates instead of actual calculations because the real gradients are too complex to calculate. For machine learning, an estimate is often good enough.

Additionally, we must calculate the mean probability of the hidden units by propagating the visible units to the hidden ones. This computation is the "up" portion of the up-down algorithm. Equation 9.1 performs this calculation:

Equation 9.1: Propagate Up
The above equation calculates the mean probability of each of the hidden neurons (h). The bar above the h designates it as a mean, and the positive subscript indicates that we are calculating the mean for the positive (or up) part of the algorithm. The sigmoid function is applied to the bias added to the weighted sum of all visible units.

Next, a value must be sampled for each of the hidden neurons. This value will randomly be either true (1) or false (0) with the mean probability just calculated. Equation 9.2 accomplishes this sampling:

Equation 9.2: Sample a Hidden Value

This equation assumes that r is a uniform random value between 0 and 1. A uniform random number simply means that every possible number in that range has an equal probability of being chosen.
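Equations 9.1 and 9.2 can be sketched together in NumPy. This is a minimal illustration, not the book's implementation; the weight matrix W, hidden bias vector c, and visible vector x are assumed names and toy values:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical RBM with 3 visible and 2 hidden units.
W = rng.normal(scale=0.1, size=(3, 2))  # weights, visible x hidden
c = np.zeros(2)                         # hidden biases
x = np.array([1, 0, 1])                 # binary visible vector

# Equation 9.1 (propagate up): mean hidden probabilities.
h_mean_pos = sigmoid(c + x @ W)

# Equation 9.2 (sample): true (1) with the probability just computed,
# using a uniform random value r for each hidden neuron.
r = rng.uniform(size=h_mean_pos.shape)
h_sample_pos = (r < h_mean_pos).astype(int)
```

Running the sampling line repeatedly with fresh r values yields true for each neuron at the rate given by its mean probability, which is exactly the 75%-of-the-time behavior described above.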
Gibbs Sampling

The calculation of the negative gradients is the "down" phase of the up-down algorithm. To accomplish this calculation, the algorithm uses Gibbs sampling to estimate the mean of the negative gradients. Geman and Geman (1984) introduced Gibbs sampling and named it after the physicist Josiah Willard Gibbs. The technique is accomplished by looping through k iterations that improve the quality of the estimate. Each iteration performs two steps:

- Sample visible neurons, given hidden neurons.
- Sample hidden neurons, given visible neurons.

For the first iteration of Gibbs sampling, we start with the positive hidden neuron samples obtained in the last section. We will sample visible neuron mean probabilities from these (first bullet above). Next, we will use these visible neurons to sample hidden neurons (second bullet above). These new hidden probabilities are the negative gradients. For the next cycle, we will use the negative gradients in place of the positive ones. This continues for each iteration and produces better negative gradients. Equation 9.3 accomplishes the sampling of the visible neurons (first bullet):
Equation 9.3: Propagate Down, Sample Visible (negative)

This equation is essentially the reverse of Equation 9.1. Here, we determine the mean of the visible units using the hidden values. Again, just like we did for the positive gradients, we sample a negative value using Equation 9.4:

Equation 9.4: Sample a Visible Value

The above equation assumes that r is a uniform random number between 0 and 1.
The above two equations are only half of the Gibbs sampling step. These equations accomplished the first bullet point above because they sample visible neurons, given hidden neurons. Next, we must accomplish the second bullet point: we must sample hidden neurons, given visible neurons. This process is very similar to the above section, "Computing Positive Gradients." This time, however, we are calculating the negative gradients.
The visible unit samples just calculated can be used to obtain hidden means, as shown in Equation 9.5:

Equation 9.5: Propagate Up, Sample Hidden (negative)

Just as before, the mean probability can be used to sample an actual value. In this case, we use the hidden mean to sample a hidden value, as demonstrated by Equation 9.6:

Equation 9.6: Sample a Hidden Value

The Gibbs sampling process continues, refining the negative hidden samples with each iteration. Once this calculation is complete, you have the following six vectors:

- Positive mean probabilities of the hidden neurons
- Positive sampled values of the hidden neurons
- Negative mean probabilities of the visible neurons
- Negative sampled values of the visible neurons
- Negative mean probabilities of the hidden neurons
- Negative sampled values of the hidden neurons

These values will update the neural network's weights and biases.
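The k iterations of Gibbs sampling described above can be sketched as follows. This is an illustrative sketch with assumed names and toy sizes, not the book's code; it shows only the looping structure that produces the six vectors:

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(prob):
    # True (1) with the given probability, per Equations 9.2/9.4/9.6.
    return (rng.uniform(size=prob.shape) < prob).astype(int)

# Hypothetical RBM with 3 visible and 2 hidden units.
W = rng.normal(scale=0.1, size=(3, 2))
b = np.zeros(3)  # visible biases
c = np.zeros(2)  # hidden biases
x = np.array([1, 0, 1])

# Positive ("up") phase: Equations 9.1 and 9.2.
h_mean_pos = sigmoid(c + x @ W)
h_sample = sample(h_mean_pos)
h_sample_pos = h_sample

# Negative ("down") phase: k iterations of Gibbs sampling.
k = 3
for _ in range(k):
    v_mean_neg = sigmoid(b + h_sample @ W.T)    # Equation 9.3
    v_sample_neg = sample(v_mean_neg)           # Equation 9.4
    h_mean_neg = sigmoid(c + v_sample_neg @ W)  # Equation 9.5
    h_sample = sample(h_mean_neg)               # Equation 9.6
h_sample_neg = h_sample

# The six vectors listed above are now available:
six = (h_mean_pos, h_sample_pos, v_mean_neg,
       v_sample_neg, h_mean_neg, h_sample_neg)
```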
Update Weights & Biases

The purpose of any neural network training is to update the weights and biases. This adjustment is what allows the neural network to learn to perform the intended task. This is the final step of the unsupervised portion of the DBNN training process. In this step, the weights and biases of a single layer (Boltzmann machine) will be updated. As previously mentioned, the Boltzmann layers are trained independently.

The weights and biases are updated independently. Equation 9.7 shows how to update a weight:

Equation 9.7: Boltzmann Weight Update
The learning rate (ε, epsilon) specifies how much of a calculated change should be applied. High learning rates will learn quicker, but they might skip over an optimal set of weights. Lower learning rates learn more slowly, but they might produce a higher quality result. The value x represents the current training set element. Because x is a vector (array), the x enclosed in two bars represents the length of x. The above equation also uses the positive mean hidden probabilities, the negative mean hidden probabilities, and the negative sampled values.
Equation 9.8 calculates the biases in a similar fashion:

Equation 9.8: Boltzmann Bias Update

The above equation uses the sampled hidden value from the positive phase and the mean hidden value from the negative phase, as well as the input vector. Once the weights and biases have been updated, the unsupervised portion of the training is done.
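The updates described for Equations 9.7 and 9.8 can be sketched with the standard contrastive divergence rule. Because the typeset equations are described here only in prose, the exact mix of sampled versus mean quantities below is an assumption; the names continue the earlier sketches:

```python
import numpy as np

# Quantities from the positive and negative phases of one training
# element x (assumed toy values; see the sketches above).
x = np.array([1.0, 0.0, 1.0])
h_mean_pos = np.array([0.6, 0.4])
v_sample_neg = np.array([1.0, 1.0, 0.0])
h_mean_neg = np.array([0.5, 0.5])

W = np.zeros((3, 2))
b = np.zeros(3)  # visible biases
c = np.zeros(2)  # hidden biases
epsilon = 0.1    # learning rate

# Weight update: positive associations minus negative associations,
# scaled by the learning rate.
W += epsilon * (np.outer(x, h_mean_pos)
                - np.outer(v_sample_neg, h_mean_neg))

# Bias updates: difference between the positive and negative phases.
b += epsilon * (x - v_sample_neg)
c += epsilon * (h_mean_pos - h_mean_neg)
```

In a batch implementation, these differences would be averaged over the batch, which is where a division by the length of x enters the weight update.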
DBNN Backpropagation

Up to this point, the DBNN training has focused on unsupervised training. The DBNN used only the training set inputs (x values). Even if the dataset provided an expected output (y values), the unsupervised training didn't use it. Now the DBNN is trained with the expected outputs. We use only dataset items that contain an expected output during this last phase. This step allows the program to use DBNN networks with datasets where each item does not necessarily have an expected output. We refer to such data as partially labeled datasets.
The final layer of the DBNN is simply a neuron for each class. These neurons have weights to the output of the previous RBM layer. These output neurons all use sigmoid activation functions, followed by a softmax layer. The softmax layer ensures that the outputs for each of the classes sum to 1.
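The effect of that softmax layer can be sketched in a few lines. This is a generic softmax, not the book's implementation, and the raw output values are invented:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw outputs for a three-class network.
raw = np.array([2.0, 1.0, 0.1])
probs = softmax(raw)
# probs now sums to 1 (within floating-point error), and the largest
# raw value still corresponds to the largest probability.
```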
Regular backpropagation trains this final layer. The final layer is essentially the output layer of a feedforward neural network that receives its input from the top RBM. Because Chapter 6, "Backpropagation Training," contains a discussion of backpropagation, we will not repeat the information here. The main idea of a DBNN is that the hierarchy allows each layer to interpret the data for the next layer. This hierarchy allows the learning to spread across the layers. The higher layers learn more abstract notions while the lower layers form from the input data. In practice, DBNNs can process much more complex patterns than a regular backpropagation-trained feedforward neural network.
Deep Belief Application

This chapter presents a simple example of the DBNN. This example simply accepts a series of input patterns, as well as the classes to which these input patterns belong. The input patterns are shown below:
[[1,1,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,1,1,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0]]
We provide the expected output for each of these training set elements. This information specifies the class to which each of the above elements belongs and is shown below:
[[1,0],
[1,0],
[1,0],
[0,1],
[0,1],
[0,1]]
The program provided in the book's examples creates a DBNN with the following configuration:

- Input Layer Size: 8
- Hidden Layer #1: 2
- Hidden Layer #2: 3
- Output Layer Size: 2

First, we train each of the hidden layers. Finally, we perform logistic regression on the output layer. The output from this program is shown here:
Training Hidden Layer #0
Training Hidden Layer #1
Iteration: 1, Supervised training: error = 0.2478464544753616
Iteration: 2, Supervised training: error = 0.23501688281192523
Iteration: 3, Supervised training: error = 0.2228704042138232
...
Iteration: 287, Supervised training: error = 0.001080510032410002
Iteration: 288, Supervised training: error = 7.821742124428358E-4
[0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0] -> [0.9649828726012807, 0.03501712739871941]
[1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0] -> [0.9649830045627616, 0.035016995437238435]
[0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0] -> [0.03413161595489315, 0.9658683840451069]
[0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0] -> [0.03413137188719462, 0.9658686281128055]
As you can see, the program first trained the hidden layers and then went through 288 iterations of regression. The error level dropped considerably during these iterations. Finally, the sample data quizzed the network. The network responded with the probability of each input sample being in each of the two classes that we specified above.

For example, the network reported the following for this element:
[0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0]
This element had a 96% probability of being in class 1, but it had only a 4% probability of being in class 2. The two probabilities reported for each item must sum to 100%.
Chapter Summary

This chapter provided a high-level overview of many of the components of deep learning. A deep neural network is any network that contains more than two hidden layers. Although deep networks have existed for as long as multilayer neural networks, they have lacked good training methods until recently. New training techniques, activation functions, and regularization are making deep neural networks feasible.

Overfitting is a common problem for many areas of machine learning; neural networks are no exception. Regularization can prevent overfitting. Most forms of regularization involve modifying the weights of a neural network as the training occurs. Dropout is a very common regularization technique for deep neural networks that removes neurons as training progresses. This technique prevents the network from becoming overly dependent on any one neuron.

We ended the chapter with the deep belief neural network (DBNN), which classifies data that might be partially labeled. First, both labeled and unlabeled data can initialize the weights of the neural network with unsupervised training. Using these weights, a logistic regression layer can fine-tune the network to the labeled data.
We also discussed convolutional neural networks (CNNs) in this chapter. This type of neural network causes the weights to be shared between the various neurons in the network. This weight sharing allows the CNN to deal with the types of overlapping features that are very common in computer vision. We provided only a general overview of CNNs because we will examine them in greater detail in the next chapter.
Chapter 10: Convolutional Neural Networks

- Sparse Connectivity
- Shared Weights
- Max-pooling
The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This chapter follows the LeNet-5 style of convolutional neural network.
Although computer vision primarily uses CNNs, this technology has some applications outside of the field. You need to realize that if you want to utilize CNNs on non-visual data, you must find a way to encode your data so that it can mimic the properties of visual data.
CNNs are somewhat similar to the self-organizing map (SOM) architecture that we examined in Chapter 2, "Self-Organizing Maps." In both, the order of the vector elements is crucial to the training. In contrast, most neural networks that are not CNNs or SOMs treat their input data as a long vector of values, and the order in which you arrange the incoming features in this vector is irrelevant. For these types of neural networks, however, you cannot change the order after you have trained the network. In other words, CNNs and SOMs do not follow the standard treatment of input vectors.

The SOM network arranged the inputs into a grid. This arrangement worked well with images because pixels in closer proximity to each other are important to each other. Obviously, the order of pixels in an image is significant. The human body is a relevant example of this type of order. For the design of the face, we are accustomed to eyes being near to each other. In the same way, neural network types like SOMs adhere to an order of pixels. Consequently, they have many applications in computer vision.
Although SOMs and CNNs are similar in the way that they map their input into 2D grids or even higher-dimension objects such as 3D boxes, CNNs take image recognition to a higher level of capability. This advance in CNNs is due to years of research on biological eyes. In other words, CNNs utilize overlapping fields of input to simulate features of biological eyes. Until this breakthrough, AI had been unable to reproduce the capabilities of biological vision.
Scale, rotation, and noise have presented challenges for AI computer vision research in the past. You can observe the complexity of biological eyes in the example that follows. A friend raises a sheet of paper with a large number written on it. As your friend moves nearer to you, the number is still identifiable. In the same way, you can still identify the number when your friend rotates the paper. Lastly, your friend creates noise by drawing lines on top of the page, but you can still identify the number. As you can see, these examples demonstrate the high function of the biological eye and allow you to better understand the research breakthrough of CNNs. That is, this neural network has the ability to process scale, rotation, and noise in the field of computer vision.
LeNet-5

We can use the LeNet-5 architecture primarily for the classification of graphical images. This network type is similar to the feedforward network that we examined in previous chapters. Data flows from the input to the output. However, the LeNet-5 network contains several different layer types, as Figure 10.1 illustrates:
Figure 10.1: A LeNet-5 Network (LeCun, 1998)

Several important differences exist between a feedforward neural network and a LeNet-5 network:

- Vectors pass through feedforward networks; 3D cubes pass through LeNet-5 networks.
- LeNet-5 networks contain a variety of layer types.
- Computer vision is the primary application of the LeNet-5.

However, we have also explored the many similarities between the networks. The most important similarity is that we can train the LeNet-5 with the same backpropagation-based techniques. Any optimization algorithm can train the weights of either a feedforward or LeNet-5 network. Specifically, you can utilize simulated annealing, genetic algorithms, and particle swarm for training. However, LeNet-5 frequently uses backpropagation training.

The following three layer types comprise the original LeNet-5 neural network:

- Convolutional Layers
- Max-pool Layers
- Dense Layers

Other neural network frameworks will add additional layer types related to computer vision. However, we will not explore these additions beyond the LeNet-5 standard.
Adding new layer types is a common means of augmenting existing neural network research. Chapter 12, "Dropout and Regularization," will introduce the dropout layer, an additional layer type that is designed to reduce overfitting.
For now, we focus our discussion on the layer types associated with convolutional neural networks. We will begin with convolutional layers.

Convolutional Layers

The first layer type that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN:

- Number of Filters
- Filter Size
- Stride
- Padding
- Activation Function/Non-Linearity

The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. The filters can detect these features. The more filters that we give to a convolutional layer, the more features it can detect.
A filter is a square-shaped object that scans over the image, which a grid of individual pixels can represent. You can think of the convolutional layer as a smaller grid that sweeps left to right over each row of the image. There is also a hyper-parameter that specifies both the width and height of the square-shaped filter. Figure 10.1 shows this configuration, in which you see the six convolutional filters sweeping over the image grid:

A convolutional layer has weights between it and the previous layer or image grid. Each pixel on each convolutional filter is a weight. Therefore, the number of weights between a convolutional layer and its predecessor layer or image field is the following:
[Filter Size] * [Filter Size] * [# of Filters]

For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
You need to understand how the convolutional filters sweep across the previous layer's output or image grid. Figure 10.2 illustrates the sweep:

Figure 10.2: Convolutional Filter
The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding is responsible for the border of zeros in the area that the filter sweeps. Even though the image is actually 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once the far right is reached, the convolutional filter moves back to the far left; then it moves down by the stride amount and continues to the right again.

Some constraints exist in relation to the size of the stride. Obviously, the stride cannot be 0. The convolutional filter would never move if the stride were set to 0. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints among the stride (s), the padding (p) and the filter width (f) for an image of width (w). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. Equation 10.1 shows the number of steps a convolutional operator must take to cross the image:
Equation 10.1: Steps Across an Image

The number of steps must be an integer. In other words, it cannot have decimal places. The purpose of the padding (p) is to be adjusted so that this equation produces an integer value.
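The calculation behind Equation 10.1 can be sketched in its standard form; the helper name steps_across is ours, not the book's:

```python
# Number of positions a filter of width f, with padding p and stride s,
# occupies while crossing an image of width w: (w - f + 2p) / s + 1.
def steps_across(w, f, p, s):
    steps = (w - f + 2 * p) / s + 1
    if steps != int(steps):
        raise ValueError("adjust padding p so the step count is an integer")
    return int(steps)

print(steps_across(8, 4, 0, 2))  # 3
print(steps_across(7, 3, 1, 2))  # 4
```

A combination such as w=8, f=3, p=0, s=2 yields 3.5 steps and raises the error, which is exactly the case where the padding must be adjusted.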
We can use the same set of weights as the convolutional filter sweeps over the image. This process allows convolutional layers to share weights and greatly reduce the amount of processing needed. In this way, you can recognize the image in shifted positions because the same convolutional filter sweeps across the entire image.
The input and output of a convolutional layer are both 3D boxes. For the input to a convolutional layer, the width and height of the box are equal to the width and height of the input image. The depth of the box is equal to the color depth of the image. For an RGB image, the depth is 3, equal to the components of red, green, and blue. If the input to the convolutional layer is another layer, then it will also be a 3D box; however, the dimensions of that 3D box will be dictated by the hyper-parameters of that layer.
Like any other layer in the neural network, the size of the 3D box output by a convolutional layer is dictated by the hyper-parameters of the layer. The width and height of this box are determined by the filter size, stride, and padding, as given by Equation 10.1. The depth of the box is equal to the number of filters.
Max-Pool Layers

Max-pool layers downsample a 3D box to a new one with smaller dimensions. Typically, you can place a max-pool layer immediately following a convolutional layer. Figure 10.1 shows the max-pool layers immediately after layers C1 and C3. These max-pool layers progressively decrease the size of the dimensions of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).
A pooling layer has the following hyper-parameters:

- Spatial Extent (f)
- Stride (s)

Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no weights, so training does not affect them. These layers simply downsample their 3D box input.

The 3D box output by a max-pool layer will have a width given by Equation 10.2:

Equation 10.2: Width of Max-pool Output

The height of the 3D box produced by the max-pool layer is calculated similarly with Equation 10.3:

Equation 10.3: Height of Max-pool Output

The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box received as input.
The most common settings for the hyper-parameters of a max-pool layer are f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 region in the new grid. Because squares of size 4 are replaced with squares of size 1, 75% of the pixel information is lost. Figure 10.3 shows this transformation as a 6x6 grid becomes a 3x3:

Figure 10.3: Max-pooling (f=2, s=2)

Of course, the above diagram shows each pixel as a single number. A grayscale image would have this characteristic. For an RGB image, we usually take the average of the three numbers to determine which pixel has the maximum value.
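The f=2, s=2 transformation in Figure 10.3 can be sketched in NumPy for a single 2D grid. This is an illustrative implementation, not the book's code:

```python
import numpy as np

def max_pool(grid, f=2, s=2):
    # Downsample a 2D grid: each f x f window is replaced by its
    # maximum value, advancing by stride s (here f == s, no overlap).
    h, w = grid.shape
    out = np.zeros((h // s, w // s))
    for i in range(0, h - f + 1, s):
        for j in range(0, w - f + 1, s):
            out[i // s, j // s] = grid[i:i + f, j:j + f].max()
    return out

# A 6x6 grid becomes 3x3, as in Figure 10.3.
grid = np.arange(36).reshape(6, 6)
pooled = max_pool(grid)
print(pooled.shape)  # (3, 3)
```

Only one of every four values survives, which is the 75% information loss described above.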
Dense Layers

The final layer type in a LeNet-5 network is the dense layer. This layer type is exactly the same type of layer as we've seen before in feedforward neural networks. A dense layer connects every element (neuron) in the previous layer's output 3D box to each neuron in the dense layer. The resulting vector is passed through an activation function. LeNet-5 networks will typically use a ReLU activation. However, we can use a sigmoid activation function; this technique is less common. A dense layer will typically contain the following hyper-parameters:

- Neuron Count
- Activation Function

The neuron count specifies the number of neurons that make up this layer. The activation function indicates the type of activation function to use. Dense layers can employ many different kinds of activation functions, such as ReLU, sigmoid, or hyperbolic tangent.

LeNet-5 networks will typically contain several dense layers as their final layers. The final dense layer in a LeNet-5 actually performs the classification. There should be one output neuron for each class, or type of image, to classify. For example, if the network distinguishes between dogs, cats, and birds, there will be three output neurons. You can apply a final softmax function to the final layer to treat the output neurons as probabilities. Softmax allows each neuron to provide the probability of the image representing each class. Because the output neurons are now probabilities, softmax ensures that they sum to 1.0 (100%). To review softmax, you can reread Chapter 4, "Feedforward Neural Networks."
ConvNets for the MNIST Data Set

In Chapter 6, "Backpropagation Training," we used the MNIST handwritten digits as an example of using backpropagation. In this chapter, we present an example that improves our recognition of the MNIST digits with a deep convolutional neural network. The convolutional network, being a deep neural network, will have more layers than the feedforward neural network seen in Chapter 6. The hyper-parameters for this network are as follows:
- Input: Accepts box of [1, 96, 96]
- Convolutional Layer: filters=32, filter_size=[3,3]
- Max-pool Layer: [2,2]
- Convolutional Layer: filters=64, filter_size=[2,2]
- Max-pool Layer: [2,2]
- Convolutional Layer: filters=128, filter_size=[2,2]
- Max-pool Layer: [2,2]
- Dense Layer: 500 neurons
- Output Layer: 30 neurons
This network uses the very common pattern of following each convolutional layer with a max-pool layer. Additionally, the number of filters increases from the input toward the output layer, so that a small number of basic features, such as edges, lines, and small shapes, are detected near the input field. Successive convolutional layers roll up these basic features into larger and more complex features. Ultimately, the dense layer can map these higher-level features into each x-coordinate and y-coordinate of the actual 15 digit features.
Training the convolutional neural network takes considerable time, especially if you are not using GPU processing. As of July 2015, not all frameworks have equal support for GPU processing. At this time, using Python with a Theano-based neural network framework, such as Lasagne, provides the best results. Many of the same researchers who are improving deep convolutional networks are also working with Theano. Thus, they promote it before other frameworks on other languages.

For this example, we used Theano with Lasagne. The book's example download may have other languages available for this example as well, depending on the frameworks available for those languages. Training a convolutional neural network for digit feature recognition on Theano took less time with a GPU than with a CPU, as a GPU helps considerably for convolutional neural networks. The exact amount of the performance gain will vary according to hardware and platform. The accuracy comparison between the convolutional neural network and the regular ReLU network is shown here:
Relu:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.7000000000000002%)
ReLU+Conv:
Best valid loss was 0.065753 at epoch 3.
Incorrect 150/10000 (1.5%)
If you compare the results from the convolutional neural network to the standard feedforward neural network from Chapter 6, you will see that the convolutional neural network performed better. The convolutional neural network is capable of recognizing sub-features in the digits to boost its performance over the standard feedforward neural network. Of course, these results will vary, depending on the platform used.

Chapter Summary

Convolutional neural networks are a very active area in the field of computer vision. They allow the neural network to detect hierarchies of features, such as lines and small shapes. These simple features can form hierarchies that teach the neural network to recognize complex patterns composed of the simpler features. Deep convolutional networks can take considerable processing power. Some frameworks allow the use of GPU processing to enhance performance.

Yann LeCun introduced the LeNet-5, the most common type of convolutional network. This neural network type is comprised of dense layers, convolutional layers, and max-pool layers. The dense layers work exactly the same way as those in traditional feedforward networks. Max-pool layers downsample the image and remove detail. Convolutional layers detect features in any part of the image field.
There are many different approaches to determine the best architecture for a neural network. Chapter 8, "NEAT, CPPN and HyperNEAT," introduced a neural network algorithm that could automatically determine the best architecture. If you are using a feedforward neural network, you will most likely arrive at a structure through pruning and model selection, which we discuss in the next chapter.

Chapter 11: Pruning and Model Selection

- Pruning a Neural Network
- Model Selection
- Random vs. Grid Search
In previous chapters, we learned that you could better fit the weights of a neural network with various training algorithms. In effect, these algorithms adjust the weights of the neural network in order to lower its error. We often refer to the weights of a neural network as the parameters of the neural network model. Some machine learning models might have parameters other than weights. For example, logistic regression (which we discussed in Artificial Intelligence for Humans, Volume 1) has coefficients as parameters.

When we train the model, the parameters of any machine learning model change. However, these models also have hyper-parameters that do not change during training. For neural networks, the hyper-parameters specify the architecture of the neural network. Examples of hyper-parameters for neural networks include the number of hidden layers and hidden neurons.
In this chapter, we will examine two algorithms that can actually modify or suggest a structure for the neural network. Pruning works by analyzing how much each neuron contributes to the output of the neural network. If a particular neuron's connection to another neuron does not significantly affect the output of the neural network, the connection will be pruned. Through this process, connections and neurons that have only a marginal impact on the output are removed.

The other algorithm that we introduce in this chapter is model selection. While pruning starts with an already trained neural network, model selection creates and trains many neural networks with different hyper-parameters. The program then selects the hyper-parameters that produce the neural network achieving the best validation score.
Understanding Pruning
Pruning is a process that makes neural networks more efficient. Unlike the training algorithms already discussed in this book, pruning does not seek to improve the training error of the neural network. The primary goal of pruning is to decrease the amount of processing required to use the neural network. Additionally, pruning can sometimes have a regularizing effect by removing complexity from the neural network. This regularization can sometimes decrease the amount of overfitting, which can help the neural network perform better on data that were not part of the training set.
Pruning works by analyzing the connections of the neural network. The pruning algorithm looks for individual connections and neurons that can be removed from the neural network to make it operate more efficiently. By pruning unneeded connections, the neural network can be made to execute faster and possibly overfit less. In the next two sections, we will examine how to prune both connections and neurons.
Pruning Connections
Connection pruning is central to most pruning algorithms. The program analyzes the individual connections between the neurons to determine which connections have the least impact on the effectiveness of the neural network. Connections are not the only thing that the program can prune. Analyzing the pruned connections will reveal that the program can also prune individual neurons.
Pruning Neurons
Pruning focuses primarily on the connections between the individual neurons of the neural network. However, to prune individual neurons, we must examine the connections between each neuron and the other neurons. If one particular neuron is surrounded entirely by weak connections, there is no reason to keep that neuron. Applying the criteria discussed in the previous section eventually produces neurons with no connections at all, because the program has pruned all of the neuron's connections. The program can then prune this type of neuron.
Improving or Degrading Performance
It is possible that pruning a neural network may improve its performance. Any modifications to the weight matrix of a neural network will always have some impact on the accuracy of the recognitions made by the neural network. A connection that has little or no impact on the neural network may actually be degrading the accuracy with which the neural network recognizes patterns. Removing this weak connection may improve the overall output of the neural network.
Unfortunately, pruning can also decrease the effectiveness of the neural network. Thus, you must always analyze the effectiveness of the neural network before and after pruning. Since efficiency is the primary benefit of pruning, you must be careful to evaluate whether an improvement in processing time is worth a decrease in the neural network's effectiveness. We will evaluate the overall effectiveness of the neural network both before and after pruning in one of the programming examples from this chapter. This analysis will give us an idea of the impact that the pruning process has on the effectiveness of the neural network.
Pruning Algorithm
We will now review exactly how pruning takes place. Pruning works by examining the weight matrices of a previously trained neural network. The pruning algorithm will then attempt to remove neurons without disrupting the output of the neural network. Figure 11.1 shows the algorithm used for selective pruning:

Figure 11.1: Pruning a Neural Network
As you can see, the pruning algorithm takes a trial-and-error approach. The pruning algorithm attempts to remove neurons from the neural network until it cannot remove additional neurons without degrading the performance of the neural network.
To begin this process, the selective pruning algorithm loops through each of the hidden neurons. For each hidden neuron encountered, the program evaluates the error level of the neural network both with and without the specified neuron. If the error rate jumps beyond a predefined level, the program retains the neuron and evaluates the next. If the error rate does not rise significantly, the program removes the neuron.
Once the program has evaluated all neurons, it repeats the process. This cycle continues until the program has made one pass through the hidden neurons without removing a single neuron. Once this process is complete, a new neural network is achieved that performs acceptably close to the original, yet it has fewer hidden neurons.
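The loop just described can be sketched in a few lines of Python. This is an illustrative sketch, not the Encog implementation: the network is abstracted to a list of hidden-neuron ids plus a hypothetical error function, and the 5% error tolerance is an assumed threshold.

```python
def selective_prune(neurons, error_fn, max_error_rise=0.05):
    """Trial-and-error pruning: try dropping each hidden neuron in turn and
    keep the removal only if the network's error does not rise beyond a
    tolerance over the original error.  `neurons` is a list of neuron ids and
    `error_fn(kept)` returns the error of the network restricted to `kept`."""
    base_error = error_fn(neurons)
    kept = list(neurons)
    removed_any = True
    while removed_any:                       # repeat until one full clean pass
        removed_any = False
        for n in list(kept):
            trial = [k for k in kept if k != n]
            if error_fn(trial) <= base_error * (1.0 + max_error_rise):
                kept = trial                 # removal accepted
                removed_any = True
            # otherwise the error jumped, so the neuron is retained
    return kept

# Toy error function: only neurons 0 and 2 actually matter.
err = lambda kept: 0.10 if {0, 2} <= set(kept) else 0.90
print(selective_prune([0, 1, 2, 3], err))    # -> [0, 2]
```

The toy error function stands in for evaluating the trained network on a validation set; in practice each call would be a full evaluation pass.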
Model Selection
Model selection is the process where the programmer attempts to find a set of hyper-parameters that produces the best neural network, or other machine learning model. In this book, we have mentioned many different hyper-parameters, which are the settings that you must provide to the neural network framework. Examples of neural network hyper-parameters include:

The number of hidden layers
The order of the convolutional, pooling, and dropout layers
The type of activation function
The number of hidden neurons
The structure of pooling and convolutional layers
As you've read through the chapters that mention hyper-parameters, you've probably been wondering how you know which settings to use. Unfortunately, there is no easy answer. If easy methods existed to determine these settings, programmers would have constructed neural network frameworks that automatically set these hyper-parameters for you.
While we will provide more insight into hyper-parameters in Chapter 14, “Architecting Neural Networks,” you will still need to use the model selection processes described in this chapter. Unfortunately, model selection is very time-consuming. We spent 90% of our time performing model selection during our last Kaggle competition. Often, success in modeling is closely related to the amount of time you have to spend on model selection.
Grid Search Model Selection
Grid search is a trial-and-error, brute-force algorithm. For this technique, you must specify every combination of the hyper-parameters that you would like to use. You must be judicious in your selection because the number of search iterations can quickly grow. Typically, you specify each hyper-parameter that you would like to search. This specification might look like the following:

Hidden Neurons: 2 to 10, step size 2
Activation Functions: tanh, sigmoid & ReLU

The first item states that the grid search should try hidden neuron counts between 2 and 10, counting by 2, which results in the following values: 2, 4, 6, 8, and 10 (5 total possibilities). The second item states that we should also try the activation functions tanh, sigmoid, and ReLU for each neuron count. This process results in a total of fifteen iterations, because five possibilities times three possibilities is fifteen. These possibilities are listed here:
Iteration #1: [2] [sigmoid]
Iteration #2: [4] [sigmoid]
Iteration #3: [6] [sigmoid]
Iteration #4: [8] [sigmoid]
Iteration #5: [10] [sigmoid]
Iteration #6: [2] [ReLU]
Iteration #7: [4] [ReLU]
Iteration #8: [6] [ReLU]
Iteration #9: [8] [ReLU]
Iteration #10: [10] [ReLU]
Iteration #11: [2] [tanh]
Iteration #12: [4] [tanh]
Iteration #13: [6] [tanh]
Iteration #14: [8] [tanh]
Iteration #15: [10] [tanh]
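In Python, the same enumeration can be produced with `itertools.product`. This sketch is independent of any framework; note that `product` spins its last axis fastest, so the ordering differs from the listing above even though the fifteen combinations are identical.

```python
from itertools import product

# The two axes from the specification above.
hidden_neurons = range(2, 11, 2)             # 2, 4, 6, 8, 10
activations = ["sigmoid", "ReLU", "tanh"]

grid = list(product(hidden_neurons, activations))
print(len(grid))                             # 5 values * 3 values = 15 iterations

for i, (neurons, act) in enumerate(grid, start=1):
    print(f"Iteration #{i}: [{neurons}] [{act}]")
```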
Each set of possibilities is called an axis. These axes rotate through all possible combinations before they finish. You can visualize this process by thinking of a car's odometer. The fastest-spinning digit counts between 0 and 9. Once it hits 9 and needs to advance, it rolls back to 0, and the digit in the next place rolls forward by one. Unless that next digit was also on 9, the rollover stops there. At some point, all digits on the odometer are at 9, and the entire device rolls back over to 0. When this final rollover occurs, the grid search is done.
Most frameworks allow two axis types. The first type is a numeric range with a step. The second type is a list of values, like the activation functions above. The following JavaScript example allows you to try your own sets of axes to see the number of iterations produced:

http://www.heatonresearch.com/aifh/vol3/grid_iter.html
Listing 11.1 shows the pseudocode necessary to roll through all iterations of several sets of values:

Listing 11.1: Grid Search
# The variable axes contains a list of each axis.
# Each axis (in axes) is a list of possible values
# for that axis.

# The current index of each axis starts at zero, so
# create an array of zeros.
indexes = zeros(len(axes))
done = false

while not done:
  # Prepare a vector of the current iteration's
  # hyper-parameters.
  iteration = []
  for i from 0 to len(axes):
    iteration.add(axes[i][indexes[i]])

  # Perform one iteration, passing in the hyper-parameters
  # that are stored in the iteration list. This function
  # should train the neural network according to the
  # hyper-parameters and keep note of the best trained
  # network so far.
  perform_iteration(iteration)

  # Rotate the axes forward one unit, like a car's
  # odometer.
  indexes[0] = indexes[0] + 1
  counterIdx = 0

  # Roll forward the other places, if needed.
  while not done and indexes[counterIdx] >= len(axes[counterIdx]):
    indexes[counterIdx] = 0
    counterIdx = counterIdx + 1
    if counterIdx >= len(axes):
      done = true
    else:
      indexes[counterIdx] = indexes[counterIdx] + 1
The code above uses two loops to pass through every possible set of the hyper-parameters. The outer loop continues while the program is still producing hyper-parameters; each time through, it advances the first hyper-parameter to its next value. The inner loop detects if the first hyper-parameter has rolled over and keeps moving forward to the next hyper-parameter until no more rollovers occur. Once all the hyper-parameters roll over, the process is done.
As you can see, the grid search can quickly result in a large number of iterations. Consider a search for the optimal number of hidden neurons on five layers, where you allowed up to 200 neurons on each layer. This value would be equal to 200 multiplied by itself five times, or 200 to the fifth power. This process results in 320 billion iterations. Because each iteration involves training a neural network, the iterations can take minutes, hours, or even days to execute.
When performing grid searches, multi-threading and grid processing can be beneficial. Running the iterations through a thread pool can greatly speed up the search. The thread pool should have a size equal to the number of cores on the machine. This approach allows a machine with eight cores to work on eight neural networks simultaneously. The training of the individual models must be single-threaded when you run the iterations simultaneously. Many frameworks will use all available cores to train a single neural network; however, when you have a large number of neural networks to train, it is usually better to train several networks in parallel, each on its own core, than to train them one at a time with each network using all of the machine's cores.
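The thread-pool approach can be sketched with Python's standard library. This is a sketch under stated assumptions: `train_candidate` is a hypothetical stand-in that would normally build and train one network and return its validation score; here it just scores a toy function. (For CPU-bound Python training code, a process pool would avoid the GIL; many real frameworks release the GIL during training.)

```python
from concurrent.futures import ThreadPoolExecutor
import os

def train_candidate(params):
    """Stand-in for single-threaded training of one candidate network.
    A real version would train a network from `params` and return its
    validation error; here we pretend 6 hidden neurons is optimal."""
    neurons, activation = params
    return (neurons, activation), abs(neurons - 6)

candidates = [(n, act) for n in (2, 4, 6, 8, 10)
              for act in ("sigmoid", "ReLU", "tanh")]

# One worker per core: each core works on a different candidate network.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(train_candidate, candidates))

best_params, best_score = min(results, key=lambda r: r[1])
print(best_params)
```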
Random Search Model Selection
It is also possible to use a random search for model selection. Instead of systematically trying every hyper-parameter combination, the random search method chooses random values for the hyper-parameters. For numeric ranges, you no longer need to specify a step value; the random search will choose from a continuous range of floating-point numbers between your specified beginning and ending points. For a random search, the programmer typically specifies either a time or an iteration limit. The following shows a random search, using the same axes as above, but limited to ten iterations:
Iteration #1: [3.298266736790538] [sigmoid]
Iteration #2: [9.569985574809834] [ReLU]
Iteration #3: [1.241154231596738] [sigmoid]
Iteration #4: [9.140498645836487] [sigmoid]
Iteration #5: [8.041758658131585] [tanh]
Iteration #6: [2.363519841339439] [ReLU]
Iteration #7: [9.72388393455185] [tanh]
Iteration #8: [3.411276006139815] [tanh]
Iteration #9: [3.1166220877785236] [sigmoid]
Iteration #10: [8.559433702612296] [sigmoid]
As you can see, the first axis, which is the hidden neuron count, is now taking on floating-point values. You can solve this problem by rounding the neuron count to the nearest whole number. It is also advisable to avoid retesting the same hyper-parameters more than once. As a result, the program should keep a list of previously tried hyper-parameters so that it does not repeat any set that falls within a small range of a previously tried set.
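Both fixes, rounding the numeric axis and skipping repeated sets, can be sketched as follows. This is an illustrative sketch, not a framework API; the axis bounds and seed are arbitrary.

```python
import random

def random_search(iterations, low, high, activations, seed=42):
    """Random model selection over one numeric axis and one list axis.
    Neuron counts are drawn as floats and rounded to whole neurons, and
    previously tried combinations are skipped so no set is tested twice."""
    rng = random.Random(seed)
    tried = set()
    trials = []
    while len(trials) < iterations:
        neurons = round(rng.uniform(low, high))   # round float to whole neurons
        act = rng.choice(activations)
        if (neurons, act) in tried:               # already evaluated: skip
            continue
        tried.add((neurons, act))
        trials.append((neurons, act))
    return trials

for i, t in enumerate(random_search(10, 1, 10, ["sigmoid", "ReLU", "tanh"]), 1):
    print(f"Iteration #{i}: {t}")
```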
The following URL uses JavaScript to show random search in action:

http://www.heatonresearch.com/aifh/vol3/random_iter.html
Other Model Selection Techniques
Model selection is a very active area of research, and, as a result, many innovative ways exist to perform it. Think of the hyper-parameters as a vector of values and the process of finding the best neural network score for those hyper-parameters as an objective function. You can then treat the hyper-parameter search as an optimization problem. We have previously examined many optimization algorithms in earlier volumes of this book series, including the following:

Ant Colony Optimization (ACO)
Genetic Algorithms
Genetic Programming
Hill Climbing
Nelder-Mead
Particle Swarm Optimization (PSO)
Simulated Annealing

We examined many of these algorithms in detail in Volumes 1 and 2 of Artificial Intelligence for Humans. Although the list of algorithms is long, the reality is that most of them are not suited for model selection, because the objective function for model selection is computationally expensive. It might take minutes, hours, or even days to train a neural network and determine how well a given set of hyper-parameters performs.
Nelder-Mead, and sometimes hill climbing, turn out to be the best options if you wish to apply an optimization algorithm to model selection. These algorithms attempt to minimize calls to the objective function, which are very expensive for a hyper-parameter search because each call trains a neural network. A good technique is to generate a set of hyper-parameters to use as a starting point and allow Nelder-Mead to improve them. Nelder-Mead is a good choice for a hyper-parameter search because it results in a relatively small number of calls to the objective function.
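Hill climbing, the simpler of the two options, can be illustrated on a single integer hyper-parameter. This is a minimal sketch with a toy objective standing in for "train a network and return its validation error"; a real search would span several axes and would count objective calls carefully.

```python
def hill_climb(objective, start, step=1, max_evals=30):
    """Simple hill climbing over an integer hyper-parameter, such as a
    hidden neuron count.  Each call to `objective` stands in for training
    a full network, so we try to make as few calls as possible."""
    current, score = start, objective(start)
    evals = 1
    while evals < max_evals:
        neighbors = [current - step, current + step]
        cand_scores = [(objective(c), c) for c in neighbors]
        evals += len(neighbors)
        best_score, best = min(cand_scores)
        if best_score >= score:       # no neighbor improves: local optimum
            break
        current, score = best, best_score
    return current, score

# Toy objective: validation error is lowest at 7 hidden neurons.
best, err = hill_climb(lambda n: (n - 7) ** 2, start=2)
print(best, err)   # 7 0
```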
Model selection is a very common part of Kaggle data science competitions. Based on competition discussions and reports, most participants use grid and random searches for model selection. Nelder-Mead is also popular. Another technique that is gaining in popularity is Bayesian optimization, as described by Snoek, Larochelle & Adams (2012). An implementation of this algorithm, written in Python, is called Spearmint, and you can find it at the following URL:

https://github.com/JasperSnoek/spearmint

Bayesian optimization is a relatively new technique for model selection on which we have only recently conducted research. Therefore, this book does not contain a more profound examination of it. Future editions may include more information on this technique.
Chapter Summary
As you learned in this chapter, it is possible to prune neural networks. Pruning a neural network removes connections and neurons in order to make the neural network more efficient. Execution speed, number of connections, and error are all measures of efficiency. Although neural networks must be effective at recognizing patterns, efficiency is the main goal of pruning. Several different algorithms can prune a neural network, and in this chapter we examined two of them. If your neural network is already operating sufficiently fast, you must evaluate whether the pruning is justified. Even when efficiency is important, you must weigh the trade-offs between efficiency and a reduction in the effectiveness of your neural network.
Model selection plays a significant role in neural network development. Hyper-parameters are settings such as hidden neuron count, layer count, and activation function selection. Model selection is the process of finding the set of hyper-parameters that will produce the best-trained neural network. A variety of algorithms can search through the possible settings of the hyper-parameters and find the best set.
Pruning can sometimes lead to a decrease in the tendency of neural networks to overfit. This reduction in overfitting is typically only a byproduct of the pruning process. Although pruning will sometimes have a regularizing effect, an entire group of algorithms, called regularization algorithms, exists specifically to reduce overfitting. We will focus exclusively on these algorithms in the next chapter.
Chapter 12: Dropout and Regularization

Regularization
L1 & L2 Regularization
Dropout Layers
Regularization is a technique that reduces overfitting, which occurs when neural networks attempt to memorize training data rather than learn from it. Humans are capable of overfitting as well. Before we examine the ways that a machine accidentally overfits, we will first explore how humans can suffer from it.
Human programmers often take certification exams to show their competence in a given programming language. To help prepare for these exams, the test makers often make practice exams available. Consider a programmer who enters a loop of taking the practice exam, studying more, and then taking the practice exam again. At some point, the programmer has memorized much of the practice exam, rather than learning the techniques necessary to figure out the individual questions. The programmer has now overfit to the practice exam. When this programmer takes the real exam, his actual score will likely be lower than what he earned on the practice exam.
A computer can overfit as well. Although a neural network received a high score on its training data, this result does not mean that the same neural network will score well on data that was not inside the training set. Regularization is one of the techniques that can prevent overfitting. A number of different regularization techniques exist. Most work by analyzing and potentially modifying the weights of a neural network as it trains.
L1 and L2 Regularization
L1 and L2 regularization are two common regularization techniques that can reduce the effects of overfitting (Ng, 2004). Both of these algorithms can either work with an objective function or as a part of the backpropagation algorithm. In both cases, the regularization algorithm is attached to the training algorithm by adding an additional objective.
Both of these algorithms work by adding a weight penalty to the neural network training. This penalty encourages the neural network to keep the weights small. L1 and L2 calculate this penalty differently. For gradient-descent-based algorithms, such as backpropagation, you can add this penalty calculation to the calculated gradients. For objective-function-based training, such as simulated annealing, the penalty is negatively combined with the objective score.
L1 and L2 differ in the way that they penalize the size of a weight. L1 will force the weights into a pattern similar to a Laplace distribution; L2 will force the weights into a pattern similar to a Gaussian distribution, as demonstrated by Figure 12.1:

Figure 12.1: L1 vs L2

As you can see, the L1 algorithm is more tolerant of weights further from 0, whereas the L2 algorithm is less tolerant. We will highlight other important differences between L1 and L2 in the following sections. You should also note that both L1 and L2 count their penalties based only on weights; they do not count penalties on bias values.
Understanding L1 Regularization
You should use L1 regularization to create sparsity in the neural network. In other words, the L1 algorithm will push many weight connections to near 0. When a weight is near 0, the program drops it from the network. Dropping weighted connections will create a sparse neural network.
Feature selection is a useful byproduct of sparse neural networks. Features are the values that the training set provides to the input neurons. Once all the weights of an input neuron reach 0, the neural network training determines that the feature is unnecessary. If your data set has a large number of input features that may not be needed, L1 regularization can help the neural network detect and ignore unnecessary features.
Equation 12.1 shows the penalty calculation performed by L1:

Equation 12.1: L1 Error Term Objective
Essentially, a programmer must balance two competing goals: achieving a low error score for the neural network and regularizing the weights. Both results have value, but the programmer has to choose their relative importance. The λ (lambda) value determines how important the L1 objective is compared to the neural network's error. A value of 0 means that L1 regularization is not considered at all, and a low network error is all that matters. A value of 0.5 means that L1 regularization is half as important as the error objective. Typical L1 values are below 0.1 (10%).
The main calculation performed by L1 is the summing of the absolute values (as indicated by the vertical bars) of all the weights. The bias values are not summed.
If you are using an optimization algorithm, such as simulated annealing, you can simply combine the value returned by Equation 12.1 with the score. You should combine this value with the score in such a way that it has a negative effect. If you are trying to minimize the score, then you should add the L1 value. Similarly, if you are trying to maximize the score, then you should subtract the L1 value.
If you are using L1 regularization with a gradient-descent-based training algorithm, such as backpropagation, you need to use a slightly different error term, as shown by Equation 12.2:

Equation 12.2: L1 Error Term
Equation 12.2 is nearly the same as Equation 12.1 except that we divide by n. The value n represents the number of training set evaluations. For example, if there were 100 training set elements and three output neurons, n would be 300. We derive this number because the program has three values to evaluate for each of those 100 elements. It is necessary to divide by n because the program applies Equation 12.2 at every training evaluation. This characteristic contrasts with Equation 12.1, which is applied once per training iteration.
To use Equation 12.2, we need to take its partial derivative with respect to the weight. Equation 12.3 shows this partial derivative:

Equation 12.3: L1 Weight Partial Derivative

To use this gradient, we add this value to every weight gradient calculated by the gradient-descent algorithm. This addition is only performed for weight values; the biases are left alone.
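The calculations above can be sketched directly from the description. This is one common formulation consistent with the text (sum of absolute weights scaled by λ/n, with a sign-based derivative); constant factors vary between texts, so treat the exact scaling as an assumption rather than the book's precise equations.

```python
def l1_penalty(weights, lam, n):
    """L1 error term: (lam / n) times the sum of the absolute values of
    the weights.  Bias values are excluded, as the text requires."""
    return (lam / n) * sum(abs(w) for w in weights)

def l1_gradient(w, lam, n):
    """Term added to one weight's backpropagation gradient: the partial
    derivative of the L1 term with respect to w is (lam / n) * sign(w)."""
    sign = (w > 0) - (w < 0)          # -1, 0, or +1
    return (lam / n) * sign

weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights, lam=0.1, n=300))   # (0.1 / 300) * 4.0
print(l1_gradient(-2.0, lam=0.1, n=300))     # negative: pushes w toward 0
```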
Understanding L2 Regularization
You should use L2 regularization when you are less concerned about creating a sparse network and more concerned about keeping the weight values low. Lower weight values will typically lead to less overfitting.
Equation 12.4 shows the penalty calculation performed by L2:

Equation 12.4: L2 Error Term Objective
Like the L1 algorithm, the λ (lambda) value determines how important the L2 objective is compared to the neural network's error. Typical L2 values are below 0.1 (10%). The main calculation performed by L2 is the summing of the squares of all of the weights. The bias values are not summed.
If you are using an optimization algorithm, such as simulated annealing, you can simply combine the value returned by Equation 12.4 with the score. You should combine this value with the score in such a way that it has a negative effect. If you are trying to minimize the score, then you should add the L2 value. Similarly, if you are trying to maximize the score, then you should subtract the L2 value.
If you are using L2 regularization with a gradient-descent-based training algorithm, such as backpropagation, you need to use a slightly different error term, as shown by Equation 12.5:

Equation 12.5: L2 Error Term
Equation 12.5 is nearly the same as Equation 12.4, except that we again divide by n. To use Equation 12.5, we need to take its partial derivative with respect to the weight. Equation 12.6 shows the partial derivative of Equation 12.5:

Equation 12.6: L2 Weight Partial Derivative

To use this gradient, you need to add this value to every weight gradient calculated by the gradient-descent algorithm. This addition is only performed on weight values; the biases are left alone.
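As with L1, the L2 calculation can be sketched from the description. Treat the scaling as an assumption: some texts define the penalty with a factor of 1/2 so the 2 cancels in the derivative, and the book's exact constants may differ.

```python
def l2_penalty(weights, lam, n):
    """L2 error term: (lam / n) times the sum of the squared weights.
    Bias values are excluded, as the text requires."""
    return (lam / n) * sum(w * w for w in weights)

def l2_gradient(w, lam, n):
    """Term added to one weight's gradient: the derivative of
    (lam / n) * w**2 with respect to w is (2 * lam / n) * w, so large
    weights are penalized proportionally harder than small ones."""
    return (2.0 * lam / n) * w

weights = [0.5, -2.0, 0.0, 1.5]
print(l2_penalty(weights, lam=0.1, n=300))   # (0.1 / 300) * 6.5
print(l2_gradient(-2.0, lam=0.1, n=300))     # gradient step pushes w toward 0
```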
Dropout Layers
Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov (2012) introduced the dropout regularization algorithm. Although dropout works in a different way than L1 and L2, it accomplishes the same goal: the prevention of overfitting. However, the algorithm goes about the task by actually removing neurons and connections, at least temporarily. Unlike L1 and L2, no weight penalty is added. Dropout does not directly seek to train small weights.
Dropout works by causing hidden neurons of the neural network to be unavailable during part of the training. Dropping part of the neural network causes the remaining portion to be trained to still achieve a good score even without the dropped neurons. This decreases co-adaptation between neurons, which results in less overfitting.
Dropout Layer
Most neural network frameworks implement dropout as a separate layer. Dropout layers function like regular, densely connected neural network layers. The only difference is that the dropout layers will periodically drop some of their neurons during training. You can use dropout layers on regular feedforward neural networks. In fact, they can also become layers in convolutional LeNet-5 networks like we studied in Chapter 10, “Convolutional Neural Networks.”
The usual hyper-parameters for a dropout layer are the following:

Neuron Count
Activation Function
Dropout Probability

The neuron count and activation function hyper-parameters work exactly the same way as their corresponding parameters in the dense layer type mentioned in Chapter 10, “Convolutional Neural Networks.” The neuron count simply specifies the number of neurons in the dropout layer. The dropout probability indicates the likelihood of a neuron dropping out during the training iteration. Just as it does for a dense layer, the program specifies an activation function for the dropout layer.
Implementing a Dropout Layer
The program implements a dropout layer as a dense layer that can eliminate some of its neurons. Contrary to popular belief about the dropout layer, the program does not permanently remove these discarded neurons. A dropout layer does not lose any of its neurons during the training process, and it will still have exactly the same number of neurons after training. In this way, the program only temporarily masks the neurons rather than dropping them.
Figure 12.2 shows how a dropout layer might be situated with other layers:

Figure 12.2: Dropout Layer

The discarded neurons and their connections are shown as dashed lines. The input layer has two input neurons as well as a bias neuron. The second layer is a dense layer with three neurons as well as a bias neuron. The third layer is a dropout layer with six regular neurons, even though the program has dropped 50% of them. While the program drops these neurons, it neither calculates nor trains them. However, the final neural network will use all of these neurons for the output. As previously mentioned, the program only temporarily discards the neurons.
During subsequent training iterations, the program chooses different sets of neurons from the dropout layer. Although we chose a probability of 50% for dropout, the computer will not necessarily drop three neurons. It is as if we flipped a coin for each of the dropout-candidate neurons to choose if that neuron was dropped out. You must know that the program should never drop the bias neuron. Only the regular neurons on a dropout layer are candidates.
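The coin-flip masking can be sketched for a single forward pass. This sketch uses "inverted" dropout, one common implementation in which surviving activations are scaled up so their expected sum is unchanged; real frameworks differ in where they apply this scaling. The bias neuron is not passed in, since it is never a dropout candidate.

```python
import random

def dropout_forward(activations, drop_prob, rng=random):
    """One training-time forward pass through a dropout layer: each regular
    neuron is independently dropped with probability `drop_prob` (its output
    masked to 0), and survivors are scaled by 1/(1 - drop_prob) so the
    layer's expected output is unchanged."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
# Six regular neurons from the dropout layer in Figure 12.2, 50% dropout.
print(dropout_forward([0.7, 0.1, 0.9, 0.4, 0.2, 0.6], drop_prob=0.5))
```

At test (inference) time, no neurons are dropped; with inverted dropout, no extra scaling is needed then either.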
The implementation of the training algorithm influences the process of discarding neurons. The dropout set frequently changes once per training iteration or batch. The program can also provide intervals where all neurons are present. Some neural network frameworks give additional hyper-parameters to allow you to specify exactly the rate of this interval.
Why dropout is capable of decreasing overfitting is a common question. The answer is that dropout can reduce the chance of a codependency developing between two neurons. Two neurons that develop a codependency will not be able to operate effectively when one is dropped out. As a result, the neural network can no longer rely on the presence of every neuron, and it trains accordingly. This characteristic decreases its ability to memorize the information presented to it, thereby forcing generalization.
Dropout also decreases overfitting by forcing a bootstrapping process upon the neural network. Bootstrapping is a very common ensemble technique. We will discuss ensembling in greater detail in Chapter 16, “Modeling with Neural Networks.” Basically, ensembling is a technique of machine learning that combines multiple models to produce a better result than those achieved by individual models. Ensemble is a term that originates from musical ensembles, in which the final music product that the audience hears is the combination of many instruments.
Bootstrapping is one of the simplest ensemble techniques. The programmer using bootstrapping simply trains a number of neural networks to perform exactly the same task. However, each of these neural networks will perform differently because of some training techniques and the random numbers used in the neural network weight initialization. The difference in weights causes the performance variance. The output from this ensemble of neural networks becomes the average output of the members taken together. This process decreases overfitting through the consensus of differently trained neural networks.
Dropout works somewhat like bootstrapping. You might think of each neural network that results from a different set of neurons being dropped out as an individual member of an ensemble. As training progresses, the program creates more neural networks in this way. However, dropout does not require the same amount of processing as bootstrapping. The new neural networks created are temporary; they exist only for a training iteration. The final result is also a single neural network, rather than an ensemble of neural networks to be averaged together.
Using Dropout
In this chapter, we will continue to evolve the book's MNIST handwritten digits example. We examined this data set in the book's introduction and used it in several examples.
The example for this chapter uses the training set to fit a dropout neural network. The program subsequently evaluates the test set on this trained network to view the results. Both dropout and non-dropout versions of the neural network have results to examine.
The dropout neural network used the following hyper-parameters:

Activation Function: ReLU
Input Layer: 784 (28x28)
Hidden Layer 1: 1,000
Dropout Layer: 500 units, 50%
Hidden Layer 2: 250
Output Layer: 10 (because there are 10 digits)

We selected the above hyper-parameters through experimentation. By rounding the number of input neurons up to the next even unit, we chose a first hidden layer of 1,000. Each subsequent layer halved this amount. Placing the dropout layer between the two hidden layers provided the best improvement in the error rate. We also tried placing it both before hidden layer 1 and after hidden layer 2. Most of the overfitting occurred between the two hidden layers.
We used the following hyper-parameters for the regular neural network. This network is essentially the same as the dropout network, except that an additional hidden layer replaces the dropout layer.

Activation Function: ReLU
Input Layer: 784 (28x28)
Hidden Layer 1: 1,000
Hidden Layer 2: 500
Hidden Layer 3: 250
Output Layer: 10 (because there are 10 digits)
The results are shown here:

ReLU:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.70%)

ReLU + Dropout:
Best valid loss was 0.065753 at epoch 5.
Incorrect 120/10000 (1.20%)
As you can see, the dropout neural network achieved a better error rate than the ReLU-only neural network from earlier in the book. By reducing the amount of overfitting, the dropout network got a better score. You should also notice that, although the non-dropout network did achieve a better training score, this result is not good: it indicates overfitting. Of course, these results will vary, depending on the platform used.
Chapter Summary
We introduced several regularization techniques that can reduce overfitting. When the neural network memorizes the input and expected output, overfitting occurs because the program has not learned to generalize. Many different regularization techniques can force the neural network to learn to generalize. We examined L1, L2, and dropout. L1 and L2 work similarly by imposing penalties on weights that are too large. The purpose of these penalties is to reduce complexity in the neural network. Dropout takes an entirely different approach by randomly removing various neurons and forcing the training to continue with a partial neural network.
The L1 algorithm penalizes large weights and forces many of the weights to approach 0. We consider the weights that reach a zero value to be dropped from the neural network. This reduction produces a sparse neural network. If all weighted connections between an input neuron and the next layer are removed, you can assume that the feature connected to that input neuron is unimportant. Feature selection is choosing input features based on their importance to the neural network. The L2 algorithm also penalizes large weights, but it does not tend to produce neural networks that are as sparse as those produced by the L1 algorithm.
Dropout randomly drops neurons in a designated dropout layer. The neurons that were dropped from the network are not gone as they were in pruning. Instead, the dropped neurons are temporarily masked from the neural network. The set of dropped neurons changes during each training iteration. Dropout forces the neural network to continue functioning when neurons are removed. This makes it difficult for the neural network to memorize and overfit.
So far, we have explored only feedforward neural networks in this volume. In this type of network, the connections only move forward from the input layer to the hidden layers and ultimately to the output layer. Recurrent neural networks allow backward connections to previous layers. We will analyze this type of neural network in the next chapter.
Additionally, we have focused primarily on using neural networks to recognize patterns. We can also teach neural networks to predict future trends. By providing a neural network with a series of time-based values, it can predict subsequent values. In the next chapter, we will also demonstrate predictive neural networks. We refer to this type of neural network as a temporal neural network. Recurrent neural networks can often make temporal predictions.
Chapter 13: Time Series and Recurrent Networks

Time Series
Elman Networks
Jordan Networks
Deep Recurrent Networks
In this chapter, we will examine time series encoding and recurrent networks, two topics that are logical to put together because they are both methods for dealing with data that spans over time. Time series encoding deals with representing events that occur over time to a neural network. There are many different methods to encode data that occur over time to a neural network. This encoding is necessary because a feedforward neural network will always produce the same output vector for a given input vector. Recurrent neural networks do not require encoding of time series data because they are able to automatically handle data that occur over time.
The variation in temperature during the week is an example of time series data. For instance, suppose we know that today's temperature is 25 degrees and tomorrow's temperature is 27 degrees. A traditional feedforward neural network will always respond with the same output for a given input, so a feedforward neural network trained to predict tomorrow's temperature would respond 27 for an input of 25. The fact that it will always output 27 when given 25 might be a hindrance to its predictions. Surely a temperature of 27 will not always follow 25. It would be better for the neural network to consider the temperatures for a series of days before the day being predicted. Perhaps the temperature over the last week might allow us to predict tomorrow's temperature. Recurrent neural networks and time series encoding represent two different approaches to the problem of representing data over time to a neural network.
So far, the neural networks that we've examined have always had forward connections. The input layer always connects to the first hidden layer. Each hidden layer always connects to the next hidden layer. The final hidden layer always connects to the output layer. This manner of connecting layers is the reason that these networks are called “feedforward.” Recurrent neural networks are not so rigid, as backward connections are also allowed. A recurrent connection links a neuron in a layer to either a previous layer or the neuron itself. Most recurrent neural network architectures maintain state in the recurrent connections. Feedforward neural networks don't maintain any state. A recurrent neural network's state acts as a sort of short-term memory for the neural network. Consequently, a recurrent neural network will not always produce the same output for a given input.
Time Series Encoding
As we saw in previous chapters, neural networks are particularly good at recognizing patterns, which helps them predict future patterns in data. We refer to a neural network that predicts future patterns as a predictive, or temporal, neural network. These predictive neural networks can anticipate future events, such as stock market trends and sunspot cycles.

Many different kinds of neural networks can predict. In this section, the feedforward neural network will attempt to learn patterns in data so it can predict future values. Like all problems applied to neural networks, prediction is a matter of intelligently determining how to configure the input and interpret the output neurons for a problem. Because the type of feedforward neural networks in this book always produce the same output for a given input, we need to make sure that we encode the input correctly.

A wide variety of methods can encode time series data for a neural network. The sliding window algorithm is one of the simplest and most popular encoding algorithms. However, more complex algorithms allow the following considerations:

Weighting older values as less important than newer ones
Smoothing/averaging over time
Other domain-specific (e.g., finance) indicators

We will focus on the sliding window algorithm as our encoding method for time series. The sliding window algorithm works by dividing the data into two windows that represent the past and the future. You must specify the sizes of both windows. For example, if you want to predict future prices with the daily closing price of a stock, you must decide how far into the past and the future you wish to examine. You might want to predict the next two days using the last five closing prices. In this case, you would have a neural network with five input neurons and two output neurons.
Encoding Data for Input and Output Neurons
Consider a simple series of numbers, such as the sequence shown here:

1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1

A neural network that predicts numbers from this sequence might use three input neurons and a single output neuron. The following training set has a prediction window of size 1 and a past window of size 3:
[1, 2, 3] -> [4]
[2, 3, 4] -> [3]
[3, 4, 3] -> [2]
[4, 3, 2] -> [1]
As you can see, the neural network is prepared to receive several data samples in a sequence. The output neuron then predicts how the sequence will continue. The idea is that you can now feed any sequence of three numbers, and the neural network will predict the fourth number. Each data point is called a time slice. Therefore, each input neuron represents a known time slice, and the output neurons represent future time slices.
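The sliding window encoding just described is easy to sketch in a few lines of Python. The helper name `sliding_window` is ours for illustration; it is not part of any framework used in this book:

```python
def sliding_window(series, past_size, future_size):
    """Encode a time series into (past, future) training pairs."""
    pairs = []
    last_start = len(series) - past_size - future_size
    for i in range(last_start + 1):
        past = series[i:i + past_size]
        future = series[i + past_size:i + past_size + future_size]
        pairs.append((past, future))
    return pairs

seq = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
for x, y in sliding_window(seq, 3, 1)[:4]:
    print(x, '->', y)  # reproduces the four training pairs above
```

Changing `future_size` to 2 produces the two-value prediction windows shown in the next example.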
It is also possible to predict more than one value into the future. The following training set has a prediction window of size 2 and a past window of size 3:
[1, 2, 3] -> [4, 3]
[2, 3, 4] -> [3, 2]
[3, 4, 3] -> [2, 1]
[4, 3, 2] -> [1, 2]
The last two examples have only a single stream of data. It is possible to use multiple streams of data to predict. For example, you might predict the price of a stock with both its price and its volume. Consider the following two streams:

Stream #1: 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1

Stream #2: 10, 20, 30, 40, 30, 20, 10, 20, 30, 40, 30, 20, 10

You can predict stream #1 with streams #1 and #2. You simply need to add the stream #2 values next to the stream #1 values. The following training set encodes this arrangement with a prediction window of size 1 and a past window of size 3:

[1, 10, 2, 20, 3, 30] -> [4]
[2, 20, 3, 30, 4, 40] -> [3]
[3, 30, 4, 40, 3, 30] -> [2]
[4, 40, 3, 30, 2, 20] -> [1]

This same technique works for any number of streams. In this case, stream #1 helps to predict itself. However, the stream that we're predicting doesn't need to be among the streams providing the data to form the prediction. For example, you could use the stock prices of IBM and Apple to predict Microsoft; this technique would use three streams.
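Interleaving multiple streams works the same way as the single-stream window; the sketch below (again, an illustrative helper rather than framework code) produces the training pairs shown above:

```python
def multi_stream_window(streams, target, past_size, future_size):
    """Interleave several input streams into sliding-window training pairs."""
    n = len(target)
    pairs = []
    for i in range(n - past_size - future_size + 1):
        x = []
        for t in range(i, i + past_size):
            for stream in streams:  # one value per stream per time slice
                x.append(stream[t])
        y = target[i + past_size:i + past_size + future_size]
        pairs.append((x, y))
    return pairs

s1 = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
s2 = [10, 20, 30, 40, 30, 20, 10, 20, 30, 40, 30, 20, 10]
print(multi_stream_window([s1, s2], s1, 3, 1)[0])  # ([1, 10, 2, 20, 3, 30], [4])
```

To predict a stream that is not among the inputs, pass it as `target` while leaving it out of `streams`.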
Predicting the Sine Wave
The example in this section is relatively simple. We present a neural network that predicts the sine wave, which is mathematically predictable. However, programmers can easily understand the sine wave, and it varies over time. These qualities make it a good introduction to predictive neural networks.

You can see the sine wave by plotting the trigonometric sine function. Figure 13.1 shows the sine wave:

Figure 13.1: The sine wave

The sine wave function trains the neural network, and backpropagation adjusts the weights to emulate the sine wave. When you first execute the sine wave example, you will see the results of the training process. Typical output from the sine wave predictor's training process follows:
Iteration #1 Error: 0.48120350975475823
Iteration #2 Error: 0.36753445768855236
Iteration #3 Error: 0.3212066601426759
Iteration #4 Error: 0.2952410514715732
Iteration #5 Error: 0.2780102928778258
Iteration #6 Error: 0.26556861969786527
Iteration #7 Error: 0.25605359706505776
Iteration #8 Error: 0.24842242500053566
Iteration #9 Error: 0.24204767544134156
Iteration #10 Error: 0.23653845782593882
...
Iteration #4990 Error: 0.02319397662897425
Iteration #4991 Error: 0.02319310934886356
Iteration #4992 Error: 0.023192242246688515
Iteration #4993 Error: 0.02319137532183077
Iteration #4994 Error: 0.023190508573672858
Iteration #4995 Error: 0.02318964200159761
Iteration #4996 Error: 0.02318877560498862
Iteration #4997 Error: 0.02318790938322986
Iteration #4998 Error: 0.023187043335705867
Iteration #4999 Error: 0.023186177461801745
In the beginning, the error rate is fairly high at 48%. By the second iteration, this rate quickly begins to fall, reaching 36.7%. By the time the 4,999th iteration has occurred, the error rate has fallen to 2.3%. The program is designed to stop before hitting the 5,000th iteration. This succeeds in reducing the error rate to less than 0.03.

Additional training would produce a better error rate; however, by limiting the iterations, the program is able to finish in only a few minutes on a regular computer. This program took about two minutes to execute on an Intel i7 computer.
Once the training is complete, the sine wave is presented to the neural network for prediction. You can see the output from this prediction here:
5: Actual=0.76604: Predicted=0.7892166200864351: Difference=2.32%
6: Actual=0.86602: Predicted=0.8839210963512845: Difference=1.79%
7: Actual=0.93969: Predicted=0.934526031234053: Difference=0.52%
8: Actual=0.9848: Predicted=0.9559577688326862: Difference=2.88%
9: Actual=1.0: Predicted=0.9615566601973113: Difference=3.84%
10: Actual=0.9848: Predicted=0.9558060932656686: Difference=2.90%
11: Actual=0.93969: Predicted=0.9354447787244102: Difference=0.42%
12: Actual=0.86602: Predicted=0.8894014978439005: Difference=2.34%
13: Actual=0.76604: Predicted=0.801342405700056: Difference=3.53%
14: Actual=0.64278: Predicted=0.6633506809125252: Difference=2.06%
15: Actual=0.49999: Predicted=0.4910483600917853: Difference=0.89%
16: Actual=0.34202: Predicted=0.31286152780645105: Difference=2.92%
17: Actual=0.17364: Predicted=0.14608325263568134: Difference=2.76%
18: Actual=0.0: Predicted=-0.008360016796238434: Difference=0.84%
19: Actual=-0.17364: Predicted=-0.15575381460132823: Difference=1.79%
20: Actual=-0.34202: Predicted=-0.3021775158559559: Difference=3.98%
...
490: Actual=-0.64278: Predicted=-0.6515076637590029: Difference=0.87%
491: Actual=-0.76604: Predicted=-0.8133333939237001: Difference=4.73%
492: Actual=-0.86602: Predicted=-0.9076496572125671: Difference=4.16%
493: Actual=-0.93969: Predicted=-0.9492579517460149: Difference=0.96%
494: Actual=-0.9848: Predicted=-0.9644567437192423: Difference=2.03%
495: Actual=-1.0: Predicted=-0.9664801515670861: Difference=3.35%
496: Actual=-0.9848: Predicted=-0.9579489752650393: Difference=2.69%
497: Actual=-0.93969: Predicted=-0.9340105440194074: Difference=0.57%
498: Actual=-0.86602: Predicted=-0.8829925066754494: Difference=1.70%
499: Actual=-0.76604: Predicted=-0.7913823031308845: Difference=2.53%
As you can see, we present both the actual and predicted values for each element. We trained the neural network on the first 250 elements; however, the neural network is able to predict beyond the first 250. You will also notice that the difference between the actual values and the predicted values rarely exceeds 3%.

The sliding window is not the only way to encode a time series. Other time series encoding algorithms can be very useful for specific domains. For example, many technical indicators exist that help to find patterns in the value of securities such as stocks, bonds, and currency pairs.
Simple Recurrent Neural Networks
Recurrent neural networks do not force the connections to flow only from one layer to the next, from input layer to output layer. A recurrent connection occurs when a connection is formed between a neuron and one of the following other types of neurons:

The neuron itself
A neuron on the same level
A neuron on a previous level

Recurrent connections can never target the input neurons or the bias neurons.

The processing of recurrent connections can be challenging. Because the recurrent links create endless loops, the neural network must have some way to know when to stop; a neural network that entered an endless loop would not be useful. To prevent endless loops, we can calculate the recurrent connections with the following three approaches:

Context neurons
Calculating output over a fixed number of iterations
Calculating output until the neuron outputs stabilize
We refer to neural networks that use context neurons as simple recurrent networks (SRNs). The context neuron is a special neuron type that remembers its input and provides that input as its output the next time that we calculate the network. For example, if we gave a context neuron 0.5 as input, it would output 0, because context neurons always output 0 on their first call. However, if we then gave the context neuron 0.6 as input, the output would be 0.5. We never weight the input connections to a context neuron, but we can weight the output from a context neuron just like any other connection in a network. Figure 13.2 shows a typical context neuron:
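The behavior just described can be captured in a tiny class. This is a minimal sketch of the idea only, not the implementation used by any particular framework:

```python
class ContextNeuron:
    """Remembers its input and returns it on the next calculation."""

    def __init__(self):
        self.state = 0.0  # context neurons always output 0 on the first call

    def compute(self, value):
        out = self.state   # emit what was stored last time
        self.state = value  # remember the new input for the next call
        return out

c = ContextNeuron()
print(c.compute(0.5))  # 0.0 (first call)
print(c.compute(0.6))  # 0.5 (the previous input)
```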
Figure 13.2: Context Neuron
Context neurons allow us to calculate a neural network in a single feedforward pass. Context neurons usually occur in layers. A layer of context neurons will always have the same number of context neurons as neurons in its source layer, as demonstrated by Figure 13.3:

Figure 13.3: Context Layer

As you can see from the above layer, the two hidden neurons labeled hidden 1 and hidden 2 directly connect to the two context neurons. The dashed lines on these connections indicate that these are not weighted connections. These weightless connections are never dense; if they were dense, hidden 1 would be connected to both context neurons. Instead, the direct connection simply joins each hidden neuron to its corresponding context neuron. The two context neurons form dense, weighted connections to the two hidden neurons. Finally, the two hidden neurons also form dense connections to the neurons in the next layer. The two context neurons would form two connections to a single neuron in the next layer, four connections to two neurons, six connections to three neurons, and so on.

You can combine context neurons with the input, hidden, and output layers of a neural network in many different ways. In the next two sections, we explore two common SRN architectures.
Elman Neural Networks
In 1990, Elman introduced a neural network that provides pattern recognition for time series. This neural network type has one input neuron for each stream that you are using to predict and one output neuron for each time slice you are trying to predict. A single hidden layer is positioned between the input and output layers. A layer of context neurons takes its input from the hidden layer's output and feeds back into the same hidden layer. Consequently, the context layer always has the same number of neurons as the hidden layer, as demonstrated by Figure 13.4:

Figure 13.4: Elman SRN

The Elman neural network is a good general-purpose architecture for simple recurrent neural networks. You can pair any reasonable number of input neurons with any number of output neurons. Using normal weighted connections, the two context neurons are fully connected with the two hidden neurons. The two context neurons receive their state from the two non-weighted connections (dashed lines) from each of the two hidden neurons.
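A single time step of an Elman network of the kind pictured can be sketched with NumPy. The weight-matrix names are ours, the 2-2-1 sizes match the figure, and biases are omitted for brevity; this is an illustrative forward pass, not trained code:

```python
import numpy as np

def elman_step(x, context, W_in, W_ctx, W_out):
    """One forward pass of a 2-2-1 Elman SRN."""
    # the hidden layer sees the inputs plus the weighted context outputs
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    output = np.tanh(W_out @ hidden)
    return output, hidden  # hidden becomes the next step's context

rng = np.random.default_rng(42)
W_in = rng.standard_normal((2, 2))   # input -> hidden
W_ctx = rng.standard_normal((2, 2))  # context -> hidden (dense, weighted)
W_out = rng.standard_normal((1, 2))  # hidden -> output

context = np.zeros(2)  # the context starts at zero, like a context neuron's first call
for x in ([0.1, 0.2], [0.3, 0.4]):
    y, context = elman_step(np.array(x), context, W_in, W_ctx, W_out)
```

Because `context` carries the previous hidden activations forward, feeding the same input twice can produce different outputs, which is exactly the short-term memory described earlier.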
Jordan Neural Networks
In 1993, Jordan introduced a neural network to control electronic systems. This style of SRN is similar to the Elman network. However, the context neurons are fed from the output layer instead of the hidden layer. We also refer to the context units in a Jordan network as the state layer. They have a recurrent connection to themselves, with no other nodes on this connection, as seen in Figure 13.5:

Figure 13.5: Jordan SRN

The Jordan neural network requires the same number of context neurons as output neurons. Therefore, if we have one output neuron, the Jordan network will have a single context neuron. This equality can be problematic if you have only a single output neuron, because you will be able to have just one context neuron.

The Elman neural network is applicable to a wider array of problems than the Jordan network because the larger hidden layer creates more context neurons. As a result, the Elman network can recall more complex patterns because it captures the state of the hidden layer from the previous iteration. This state is never bipolar, since the hidden layer represents the first line of feature detectors.

Additionally, if we increase the size of the hidden layer to account for a more complex problem, we also get more context neurons with an Elman network. The Jordan network doesn't produce this effect. To create more context neurons with a Jordan network, we must add more output neurons, and we cannot add output neurons without changing the definition of the problem.

When to use a Jordan network is a common question. Programmers originally developed this network type for robotics research. Neural networks that are designed for robotics typically have input neurons connected to sensors and output neurons connected to actuators (typically motors). Because each motor has its own output neuron, neural networks for robots will generally have more output neurons than regression neural networks that predict a single value.
Backpropagation through Time
You can train SRNs with a variety of methods. Because SRNs are neural networks, you can train their weights with any optimization algorithm, such as simulated annealing, particle swarm optimization, Nelder-Mead, or others. Regular backpropagation-based algorithms can also train the SRN. Mozer (1995), Robinson & Fallside (1987), and Werbos (1988) each invented an algorithm specifically designed for SRNs. Programmers refer to this algorithm as backpropagation through time (BPTT). Sjoberg, Zhang, Ljung, et al. (1995) determined that backpropagation through time provides superior training performance compared to general optimization algorithms, such as simulated annealing. However, backpropagation through time is even more sensitive to local minima than standard backpropagation.

Backpropagation through time works by unfolding the SRN into a regular neural network. To unfold the SRN, we construct a chain of neural networks equal to how far back in time we wish to go. We start with a neural network that contains the inputs for the current time, known as t. Next, we replace the context with the entire neural network, up to the context neuron's input. We continue for the desired number of time slices and replace the final context neuron with a 0. Figure 13.6 illustrates this process for two time slices.
Figure 13.6: Unfolding to Two Time Slices
This unfolding can continue deeper; Figure 13.7 shows three time slices:

Figure 13.7: Unfolding to Three Time Slices
You can apply this abstract concept to actual SRNs. Figure 13.8 illustrates a two-input, two-hidden, one-output Elman neural network unfolded to two time slices:

Figure 13.8: Elman Unfolded to Two Time Slices

As you can see, there are inputs for both t (the current time) and t-1 (one time slice in the past). The bottom neural network stops at the hidden neurons because you don't need anything beyond the hidden neurons to calculate the context input. The bottom network structure becomes the context for the top network structure. Of course, the bottom structure would have had a context as well that connects to its hidden neurons. However, because the output neuron does not contribute to the context, only the top network (current time) has one.

It is also possible to unfold a Jordan neural network. Figure 13.9 shows a two-input, two-hidden, one-output Jordan network unfolded to two time slices.

Figure 13.9: Jordan Unfolded to Two Time Slices

Unlike the Elman network, you must calculate the entire Jordan network to determine the context. As a result, we calculate the previous time slice (bottom network) all the way to the output neuron.
To train the SRN, we can use regular backpropagation to train the unfolded network. However, at the end of each iteration, we average the weights of all the folds to obtain the weights for the SRN. Listing 13.1 describes the BPTT algorithm:
Listing 13.1: Backpropagation through Time (BPTT)

def bptt(a, y):
    # a[t] is the input at time t; y[t] is the output
    .. unfold the network to contain k instances of f
    .. see the figure above ..
    while stopping criteria not met:
        # x is the current context
        x = []
        for t from 0 to n - 1:
            # t is time; n is the length of the training sequence
            .. set the network inputs to x, a[t], a[t+1], ..., a[t+k-1]
            p = .. forward-propagation of the inputs
                .. over the whole unfolded network
            # error = target - prediction
            e = y[t+k] - p
            .. back-propagate the error, e, back across
            .. the whole unfolded network
            .. update all the weights in the network
            .. average the weights in each instance of f together,
            .. so that each f is identical
            # compute the context for the next time step
            x = f(x)
Gated Recurrent Units
Although recurrent neural networks have never been as popular as regular feedforward neural networks, active research on them continues. Chung, Hyun & Bengio (2014) introduced the gated recurrent unit (GRU) to allow recurrent neural networks to function in conjunction with deep neural networks by solving some inherent limitations of recurrent neural networks. GRUs are neurons that serve a role similar to the context neurons seen previously in this chapter.

It is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish (most of the time) or explode (rarely, but with severe effects), as demonstrated by Chung, Hyun & Bengio (2015).

As of the 2015 publication of this book, GRUs are less than a year old. Because of the cutting-edge nature of GRUs, most major neural network frameworks do not currently include them. If you would like to experiment with GRUs, the Python Theano-based framework Keras includes them. You can find the framework at the following URL:
https://github.com/fchollet/keras
Though we usually use Lasagne, Keras is one of many Theano-based frameworks for Python, and it is also one of the first to support GRUs. This section contains a brief, high-level introduction to the GRU, and we will update the book's examples as needed to support this technology as it becomes available. Refer to the book's example code for up-to-date information on example availability for the GRU.
A GRU uses two gates to overcome these limitations, as shown in Figure 13.10:

Figure 13.10: Gated Recurrent Unit (GRU)

The gates are indicated by z, the update gate, and r, the reset gate. The values h and tilde-h represent the activation (output) and the candidate activation. It is important to note that these gates specify ranges, rather than simply being on or off.

The primary difference between the GRU and traditional recurrent neural networks is that the entire context value does not change each iteration, as it does in the SRN. Rather, the update gate governs the degree of update to the context activation that occurs. Additionally, the reset gate allows the context to be reset.
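As a rough sketch, one GRU step can be written as follows. The weight names are ours, and the update form follows the commonly published GRU equations, which match the z/r/h/tilde-h description above; treat this as illustration rather than a framework implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, W):
    """One step of a gated recurrent unit; W is a dict of weight matrices."""
    z = sigmoid(W['z_x'] @ x + W['z_h'] @ h)  # update gate: how much h may change
    r = sigmoid(W['r_x'] @ x + W['r_h'] @ h)  # reset gate: how much old state to use
    h_tilde = np.tanh(W['h_x'] @ x + W['h_h'] @ (r * h))  # candidate activation
    return (1 - z) * h + z * h_tilde  # gated blend of old state and candidate

rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 3)) * 0.1
     for k in ('z_x', 'z_h', 'r_x', 'r_h', 'h_x', 'h_h')}
h = np.zeros(3)
h = gru_step(np.array([1.0, 0.0, 0.5]), h, W)
```

Note how the final line blends the old state and the candidate: when z is near 0 the context barely changes, unlike an SRN's context, which is overwritten every iteration.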
Chapter Summary
In this chapter, we introduced several methods that can handle time series data with neural networks. A feedforward neural network produces the same output when provided the same input. As a result, feedforward neural networks are said to be deterministic. This quality does not allow a feedforward neural network to produce output based on a series of inputs. If your application must provide output based on a series of inputs, you have two choices: you can encode the time series into an input feature vector, or you can use a recurrent neural network.

Encoding a time series is a way of capturing time series information in a feature vector that is fed to a feedforward neural network. A number of methods encode time series data. We focused on sliding window encoding. This method specifies two windows. The first window determines how far into the past to use for prediction. The second window determines how far into the future to predict.

Recurrent neural networks are another method to deal with time series data. Encoding is not necessary with a recurrent neural network because it is able to remember previous inputs to the neural network. This short-term memory allows the neural network to see patterns in time. Simple recurrent networks use context neurons to remember the state from previous computations. We examined the Elman and Jordan SRNs. Additionally, we introduced a very new neuron type called the gated recurrent unit (GRU). This neuron type does not immediately update its context value as the Elman and Jordan networks do; instead, two gates govern the degree of update.

Hyper-parameters define the structure of a neural network and ultimately determine its effectiveness for a particular problem. In the previous chapters of this book, we introduced hyper-parameters such as the number of hidden layers and neurons, the activation functions, and other governing attributes of neural networks. Determining the correct set of hyper-parameters is often a difficult task of trial and error. However, some automated processes can make this process easier, and some rules of thumb can help architect these neural networks. We cover these pointers, as well as the automated processes, in the next chapter.
Chapter 14: Architecting Neural Networks
Hyper-parameters
Learning Rate & Momentum
Hidden Structure
Activation Functions
Hyper-parameters, as mentioned in previous chapters, are the numerous settings for models such as neural networks. Activation functions, hidden neuron counts, layer structure, convolution, max-pooling, and dropout are all examples of neural network hyper-parameters. Finding the optimal set of hyper-parameters can seem a daunting task, and, indeed, it is one of the most time-consuming tasks for the AI programmer. However, do not fear: we will provide you with a summary of the current research on neural network architecture in this chapter. We will also show you how to conduct experiments to help determine the optimal architecture for your own networks.

We will make architectural recommendations in two ways. First, we will report on recommendations from scientific literature in the field of AI. These recommendations will include citations so that you can examine the original papers; however, we will strive to present the key concept of each article in an approachable manner. The second way will be through experimentation: we will run several competing architectures and report the results.

You need to remember that a few hard and fast rules do not dictate the optimal architecture for every project. Every dataset is different, and, as a result, the optimal neural network for every dataset is also different. Thus, you must always perform some experimentation to determine a good architecture for your network.
Evaluating Neural Networks
Neural networks start with random weights. Additionally, some training algorithms use random values as well. All considered, we're dealing with quite a bit of randomness when we try to make comparisons. Random number seeds are a common solution to this issue; however, a constant seed does not provide an equal comparison when we are evaluating neural networks with different neuron counts.

Consider comparing a neural network with 32 connections against another network with 64 connections. While the seed guarantees that the first 32 connections retain the same values, there are now 32 additional connections that will have new random values. Furthermore, those 32 weights in the first network might not be in the same locations in the second network, even if the random seed is maintained between the two initial weight sets.

To compare architectures, we must perform several training runs and average the final results. Because these extra training runs add to the total runtime of the program, excessive numbers of runs will quickly become impractical. It can also be beneficial to choose a training algorithm that is deterministic (one that does not use random numbers). The experiments that we will perform in this chapter use five training runs and the resilient propagation (RPROP) training algorithm. RPROP is deterministic, and five runs is an arbitrary choice that provides a reasonable level of consistency. Using the Xavier weight initialization algorithm, introduced in Chapter 4, "Feedforward Neural Networks," will also help provide consistent results.
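The evaluation scheme above reduces to averaging a score over several independent runs. A minimal sketch follows; `train_and_score` stands in for your own train-and-evaluate routine and is purely hypothetical:

```python
def average_score(train_and_score, runs=5):
    """Average the final error over several independent training runs."""
    scores = [train_and_score(run) for run in range(runs)]
    return sum(scores) / len(scores)

# stand-in scorer: pretend each run returns a slightly different final error
print(average_score(lambda run: 0.020 + 0.001 * run))
```

In practice, each call to `train_and_score` would build the architecture under test, train it from fresh random weights, and return its validation error.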
Training Parameters
Training algorithms themselves have parameters that you must tune. We don't consider the parameters related to training as hyper-parameters because they are not evident after a neural network has been trained. You can examine a trained neural network to determine easily what hyper-parameters are present: a simple examination of the network reveals the neuron counts and activation functions in use. However, determining training parameters such as learning rate and momentum from a trained network is not possible. Both training parameters and hyper-parameters greatly affect the error rates that the neural network can obtain; however, we use training parameters only during the actual training.

The three most common training parameters for neural networks are listed here:

Learning Rate
Momentum
Batch Size

Not all learning algorithms have these parameters. Additionally, you can vary the values chosen for these parameters as learning progresses. We discuss these training parameters in the subsequent sections.
Learning Rate
The learning rate determines how far each iteration of training will move the weight values. Some problems are very simple to solve, and a high learning rate will yield a quick solution. Other problems are more difficult, and a quick learning rate might skip past a good solution. Other than the runtime of your program, there is no disadvantage in choosing a small learning rate. Figure 14.1 shows how a learning rate might fare on both a simple (unimodal) and a complex (multimodal) problem:

Figure 14.1: Learning Rates

The above two charts show the relationship between a weight and the score of a network. As the program increases or decreases a single weight, the score changes. A unimodal problem is typically easy to solve because its graph has only one bump, or optimal solution. In this case, we consider a good score to be a low error rate.

A multimodal problem has many bumps, or possible good solutions. If the problem is simple (unimodal), a fast learning rate is optimal because you can charge up the hill to a great score. However, haste makes waste on the second chart, as the fast learning rate fails to find the two optimums.

Kamiyama, Iijima, Taguchi, Mitsui, et al. (1992) stated that most literature uses a learning rate of 0.2 and a momentum of 0.9. This learning rate and momentum can often be good starting points; in fact, many examples do use these values. The researchers suggest that Equation 14.1 has a strong likelihood of attaining better results.
Equation 14.1: Setting Learning Rate and Momentum

ε = K(1 - α)

The variable α (alpha) is the momentum; ε (epsilon) is the learning rate, and K is a constant related to the hidden neurons. Their research suggests that the tuning of momentum (discussed in the next section) and learning rate are related. We define the constant K by the number of hidden neurons: smaller numbers of hidden neurons should use a larger K. In our own experiments, we do not use the equation directly because it is difficult to choose a concrete value of K. The following calculations show several learning rates based on the momentum and K:
k=0.500000, alpha=0.200000 -> epsilon=0.400000
k=0.500000, alpha=0.300000 -> epsilon=0.350000
k=0.500000, alpha=0.400000 -> epsilon=0.300000
k=1.000000, alpha=0.200000 -> epsilon=0.800000
k=1.000000, alpha=0.300000 -> epsilon=0.700000
k=1.000000, alpha=0.400000 -> epsilon=0.600000
k=1.500000, alpha=0.200000 -> epsilon=1.200000
k=1.500000, alpha=0.300000 -> epsilon=1.050000
k=1.500000, alpha=0.400000 -> epsilon=0.900000
The lower values of K represent higher hidden neuron counts; therefore, the hidden neuron count is decreasing as you move down the list. As you can see, for a momentum (α, alpha) of 0.2, the suggested learning rate (ε, epsilon) increases as the hidden neuron count decreases. The learning rate and momentum have an inverse relationship: as you increase one, you should decrease the other. The hidden neuron count controls how quickly the momentum and learning rate should diverge.
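The listed values follow directly from Equation 14.1's relationship between K, momentum, and learning rate; the helper name below is ours:

```python
def suggested_epsilon(k, alpha):
    """Learning rate from momentum per Equation 14.1: epsilon = K * (1 - alpha)."""
    return k * (1 - alpha)

for k in (0.5, 1.0, 1.5):
    for alpha in (0.2, 0.3, 0.4):
        print('k=%f, alpha=%f -> epsilon=%f'
              % (k, alpha, suggested_epsilon(k, alpha)))
```

Running this reproduces the table above, making the inverse relationship between momentum and learning rate easy to see.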
Momentum
Momentum is a training property that causes the weight change to continue in its current direction, even if the gradient indicates that the weight change should reverse direction. Figure 14.2 illustrates this relationship:

Figure 14.2: Momentum and a Local Optimum

A positive gradient encourages the weight to decrease. The weight has followed the negative gradient down the hill and has now settled into a valley, or local optimum. The gradient moves to 0 and then positive as the weight hits the other side of the local optimum. Momentum allows the weight to continue in its current direction, possibly escape the local-optimum valley, and possibly find the lower point to the right.

To understand exactly how learning rate and momentum are implemented, recall Equation 6.6 from Chapter 6, "Backpropagation Training," which is repeated as Equation 14.2 for convenience:
Equation 14.2: Weight and Momentum Applied

Δw(t) = -ε (∂E/∂w) + α Δw(t-1)
This equation shows how we calculate the change in weight for training iteration t. This change is the sum of two terms, governed by the learning rate ε (epsilon) and the momentum α (alpha). The gradient is the partial derivative of the error with respect to the weight. The sign of the gradient determines whether we should increase or decrease the weight. The learning rate simply tells backpropagation the percentage of this gradient that the program should apply to the weight change. The program applies this change to the original weight and then retains the change for the next iteration. The momentum α (alpha) subsequently determines the percentage of the previous iteration's weight change that the program should apply to this iteration. Momentum allows the previous iteration's weight change to carry through to the current iteration so that the weight change maintains its direction. This process essentially gives it "momentum."
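Equation 14.2's update rule amounts to just these two terms; a minimal sketch, with variable names of our own choosing:

```python
def weight_delta(gradient, prev_delta, epsilon=0.2, alpha=0.9):
    """Backpropagation weight change: a learning-rate term plus a momentum term."""
    return -epsilon * gradient + alpha * prev_delta

delta = 0.0
for gradient in (0.5, 0.4, 0.3):  # hypothetical gradients over three iterations
    delta = weight_delta(gradient, delta)
    print(delta)
```

Notice that a positive gradient produces a negative delta, decreasing the weight, while the `alpha * prev_delta` term keeps the change moving in its established direction.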
Jacobs (1988) discovered that the learning rate should be decreased as training progresses. Additionally, as previously discussed, Kamiyama, et al. (1992) asserted that momentum should be increased as the learning rate is decayed. A decreasing learning rate, coupled with an increasing momentum, is a very common pattern in neural network training. The high initial learning rate allows the neural network to explore a larger area of the search space. Decreasing the learning rate forces the network to stop exploring and begin exploiting a more local region of the search space. Increasing the momentum at this point helps guard against local minima in this smaller search region.
Batch Size
The batch size specifies the number of training set elements that you must calculate before the weights are actually updated. The program sums all of the gradients for a single batch before it updates the weights. A batch size of 1 indicates that the weights are updated for each training set element; we refer to this process as online training. Setting the batch size to the size of the training set gives full batch training.

A good starting point is a batch size equal to 10% of the entire training set. You can increase or decrease the batch size to see its effect on training efficiency. Usually a neural network will have vastly fewer weights than training set elements. As a result, cutting the batch size in half, or even to a fourth, will not have a drastic effect on the runtime of an iteration of standard backpropagation.
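Batch updates simply partition the training elements; a short sketch of the idea (the helper name is ours):

```python
def batches(training_set, batch_size):
    """Yield successive batches; gradients are summed within each batch."""
    for i in range(0, len(training_set), batch_size):
        yield training_set[i:i + batch_size]

data = list(range(10))
print([len(b) for b in batches(data, 4)])  # [4, 4, 2]
print([len(b) for b in batches(data, 1)])  # online training: one element per batch
```

With `batch_size=len(data)` the generator yields a single batch, which is full batch training.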
General Hyper-Parameters
In addition to the training parameters just discussed, we must also consider the hyper-parameters. They are significantly more important than the training parameters because they determine the neural network's ultimate learning capacity. A neural network with a reduced learning capacity cannot overcome this deficiency with further training.
Activation Functions
Currently, two primary types of activation functions are used inside a neural network:

Sigmoidal: Logistic (sigmoid) & Hyperbolic Tangent (tanh)
Linear: ReLU

The sigmoidal (s-shaped) activation functions have been a mainstay of neural networks, but they are now losing ground to the ReLU activation function. The two most common s-shaped activation functions are the namesake sigmoid activation function and the hyperbolic tangent activation function. The name can cause confusion because sigmoid refers both to an actual activation function and to a class of activation functions. The actual sigmoid activation function has a range between 0 and 1, whereas the hyperbolic tangent function has a range of -1 to 1. We will first tackle hyperbolic tangent versus sigmoid (the activation function). Figure 14.3 shows an overlay of these two activations:

Figure 14.3: Sigmoid and Tanh

As you can see from the figure, the hyperbolic tangent stretches over a much larger range than the sigmoid. Your choice between these two activations will affect the way that you normalize your data. If you are using hyperbolic tangent at the output layer of your neural network, you must normalize the expected outcome to between -1 and 1. Similarly, if you are using the sigmoid function for the output layer, you must normalize the expected outcome to between 0 and 1. You should normalize the input to the range -1 to 1 for both of these activation functions. Input x-values above +1 saturate toward y-values of +1 for both sigmoid and hyperbolic tangent. As x-values go below -1, the sigmoid activation function saturates toward y-values of 0, and hyperbolic tangent saturates toward y-values of -1.
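The differing negative-side saturation is easy to see numerically; `sigmoid` below is the standard logistic function:

```python
import math

def sigmoid(x):
    """The logistic (sigmoid) activation function, with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, -1.0, 1.0, 4.0):
    # sigmoid heads toward 0 for negative inputs; tanh heads toward -1
    print('x=%5.1f  sigmoid=%7.4f  tanh=%7.4f' % (x, sigmoid(x), math.tanh(x)))
```

At x = -4, the sigmoid is already close to 0 while tanh is close to -1, which illustrates the saturation difference discussed next.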
The saturation of the sigmoid to values of 0 in the negative direction can be problematic for training. As a result, Kalman & Kwasny (1992) recommend hyperbolic tangent in all situations instead of sigmoid, a recommendation that corresponds with many literature sources. However, this argument only extends to the choice between sigmoidal activation functions. A growing body of recent research favors the ReLU activation function in all cases over the sigmoidal activation functions.

Zeiler et al. (2014), Maas, Hannun, Awni & Ng (2013), and Glorot, Bordes & Bengio (2013) all recommend the ReLU activation function over its sigmoidal counterparts. Chapter 9, "Deep Learning," includes the advantages of the ReLU activation function. In this section, we will examine an experiment that compares the ReLU to the sigmoid. We used a neural network with a hidden layer of 1,000 neurons and ran this neural network against the MNIST dataset. Obviously, we adjusted the number of input and output neurons to match the problem. We ran each activation function five times with different random weights and kept the best results:
Sigmoid:
Best valid loss was 0.068866 at epoch 43.
Incorrect 192/10000 (1.92%)

ReLU:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.7000000000000002%)
The above output shows the accuracy rates for each of the neural networks on a validation set. As you can see, the ReLU activation function did indeed have the lowest error rate and achieved it in fewer training iterations/epochs. Of course, these results will vary, depending on the platform used.
Hidden Neuron Configurations
Hidden neuron configurations have been a frequent source of questions. Neural network programmers often wonder exactly how to structure their networks. As of the writing of this book, a quick scan of Stack Overflow shows over 50 questions related to hidden neuron configurations. You can find the questions at the following link:

http://goo.gl/ruWpcb

Although the answers may vary, most of them simply advise that the programmer "experiment and find out." According to the universal approximation theorem, a single-hidden-layer neural network can theoretically learn any pattern (Hornik, 1991). Consequently, many researchers suggest only single-hidden-layer neural networks. However, although a single-hidden-layer neural network can learn any pattern, the universal approximation theorem does not state that this process is easy for a neural network. Now that we have efficient techniques to train deep neural networks, the universal approximation theorem is much less important.
Toseetheeffectsofhiddenneuronsandneuroncounts,wewillperformanexperimentthatwilllookatone-layerandtwo-layerneuralnetworks.Wewilltryeverycombinationofhiddenneuronsuptotwo50-neuronlayers.ThisneuralnetworkwilluseaReLUactivationfunctionandRPROP.Thisexperimenttookover30hourstorunonanIntelI7quad-core.Figure14.4showsaheatmapoftheresults:
Figure 14.4: Heat Map of Two-Layer Network (first experiment)
The best configuration reported by the experiment was 35 neurons in hidden layer 1 and 15 neurons in hidden layer 2. The results of this experiment will vary when repeated. The above diagram shows the best-trained networks in the lower-left corner, as indicated by the darker squares. This indicates that the best results favor a large first hidden layer with a smaller second hidden layer. The heat map shows the relationships between the different configurations. We achieved better results with smaller neuron counts on the second hidden layer. This occurred because the smaller neuron counts constricted the information flow to the output layer. This approach is consistent with research into auto-encoders, in which successively smaller layers force the neural network to generalize information rather than overfit. In general, based on the experiment here, we advise using at least two hidden layers with successively smaller neuron counts.
LeNet-5 Hyper-Parameters
The LeNet-5 convolutional neural networks introduce additional layer types that bring more choices in the construction of neural networks. Both the convolutional and max-pooling layers create other choices for hyper-parameters. Chapter 10, "Convolutional Neural Networks," contains a complete list of hyper-parameters that the LeNet-5 network introduces. In this section, we will review LeNet-5 architectural recommendations recently suggested in scientific papers.
Most literature on LeNet-5 networks supports the use of a max-pool layer following every convolutional layer. Ideally, several convolutional/max-pool layers reduce the resolution at each step. Chapter 10, "Convolutional Neural Networks," includes this demonstration. However, very recent literature seems to indicate that max-pool layers should not be used at all.
On November 7, 2014, the website Reddit featured Dr. Geoffrey Hinton for an "ask me anything" (AMA) session. Dr. Hinton is the foremost researcher in deep learning and neural networks. During the AMA session, Dr. Hinton was asked about max-pool layers. You can read his complete response here:
https://goo.gl/TgBakL
Overall, Dr. Hinton begins his answer saying, "The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster." He then proceeds with a technical description of why you should never use max-pooling. At the time of this book's publication, his response is fairly recent and controversial. Therefore, we suggest that you try convolutional neural networks both with and without max-pool layers, as their future looks uncertain.
Chapter Summary
Selecting a good set of hyper-parameters is one of the most difficult tasks for the neural network programmer. The number of hidden neurons, activation functions, and layer structures are all examples of neural network hyper-parameters that the programmer must adjust and fine-tune. All these hyper-parameters can affect the overall capacity of the neural network to learn patterns. As a result, you must choose them correctly.
Most current literature suggests using the ReLU activation function in place of the sigmoidal (s-shaped) activation functions. If you are going to use a sigmoidal activation, most literature recommends the hyperbolic tangent activation function instead of the sigmoid. The ReLU activation function is more compatible with deep neural networks.
The number of hidden layers and neurons is also an important hyper-parameter for neural networks. It is generally advisable that each successive hidden layer contain fewer neurons than the layer immediately before it. This adjustment has the effect of constraining the data from the inputs and forcing the neural network to generalize rather than memorize; memorization results in overfitting.
We do not consider training parameters as hyper-parameters because they do not affect the structure of the neural network. However, you still must choose a proper set of training parameters. The learning rate and momentum are two of the most common training parameters for neural networks. Generally, you should initially set the learning rate high and decrease it as training continues. You should move the momentum value inversely with the learning rate.
In this chapter, we examined how to structure neural networks. While we provided some general recommendations, the dataset generally drives the architecture of the neural network. Consequently, you must analyze the dataset. We will introduce the t-SNE dimension reduction algorithm in the next chapter. This algorithm will allow you to visualize your dataset graphically and see issues that occur while you are creating a neural network for that dataset.
Chapter 15: Visualization
Confusion Matrices, PCA, t-SNE
We frequently receive the following question about neural networks: "I've created a neural network, but when I train it, my error never goes to an acceptable level. What should I do?" The first step in this investigation is to determine if one of the following common errors has occurred:
- Correct number of input and output neurons
- Dataset normalized correctly
- Some fatal design decision of the neural network
Obviously, you must have the correct number of input neurons to match how your data are normalized. Likewise, you should have a single output neuron for regression problems and usually one output neuron per class for a classification problem. You should normalize input data to fit the activation function that you use. In a similar way, fatal mistakes, such as no hidden layer or a learning rate of 0, can create a bad situation.
However, once you eliminate all these errors, you must look to your data. For classification problems, your neural network may have difficulties differentiating between certain pairs of classes. To help you resolve this issue, some visualization algorithms exist that allow you to see the problems that your neural network might encounter. The two visualizations presented in this chapter will show the following issues with data:
- Classes that are easily confused for others
- Noisy data
- Dissimilarity between classes
We describe each issue in the subsequent sections and offer some potential solutions. We will present these potential solutions in the form of two algorithms of increasing complexity. Not only is the topic of visualization important for data analysis, it was also chosen as a topic by the readers of this book, which earned its initial funding through a Kickstarter campaign. The project's original 653 backers chose visualization from among several competing project topics. As a result, we will present two visualizations. Both examples will use the MNIST handwritten digits dataset that we have examined in previous chapters of this book.
Confusion Matrix
A neural network trained for the MNIST dataset should be able to take a handwritten digit and predict what digit was actually written. Some digits are more easily confused for others. Any classification neural network has the possibility of misclassifying data. A confusion matrix can measure these misclassifications.
Reading a Confusion Matrix
A confusion matrix is always presented as a square grid. The number of rows and columns will both be equal to the number of classes in your problem. For MNIST, this will be a 10x10 grid, as shown by Figure 15.1:
Figure 15.1: MNIST Confusion Matrix
A confusion matrix uses the columns to represent predictions. The rows represent what would have been a correct prediction. If you look at row 0, column 0, you will see the number 1,432. This result means that the neural network correctly predicted a "0" 1,432 times. If you look at row 3, column 2, you will see that a "2" was predicted 49 times when it should have been a "3." The problem occurred because it's easy to mistake a handwritten "3" for a "2," especially when a person with bad penmanship writes the numbers. The confusion matrix lets you see which digits are commonly mistaken for each other. Another important aspect of the confusion matrix is the diagonal from (0,0) to (9,9). If the program trains the neural network properly, the largest numbers should be in the diagonal. Thus, a perfectly trained neural network will have numbers only in the diagonal.
Generating a Confusion Matrix
You can create a confusion matrix with the following steps:
1. Separate the dataset into training and validation sets.
2. Train a neural network on the training set.
3. Set the confusion matrix to all zeros.
4. Loop over every element in the validation set.
5. For every element, increase the cell at row = expected, column = predicted.
6. Report the confusion matrix.
Listing 15.1 shows this process in the following pseudocode:
Listing 15.1: Compute a Confusion Matrix
# x - contains dataset inputs
# y - contains dataset expected values (ordinals, not strings)
def confusion_matrix(x, y, network):
    # Create a square matrix equal to the number of classifications
    confusion = matrix(network.num_classes, network.num_classes)
    # Loop over every element
    for i from 0 to len(x):
        prediction = network.compute(x[i])
        target = y[i]
        confusion[target][prediction] = confusion[target][prediction] + 1
    # Return the result
    return confusion
Confusion matrices are one of the classic visualizations for classification data problems. You can use them with any classification problem, not just neural networks.
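The pseudocode in Listing 15.1 translates almost directly into runnable NumPy. The following minimal sketch takes precomputed predictions rather than a network object:

```python
import numpy as np

def confusion_matrix(expected, predicted, num_classes):
    # Rows are the expected classes; columns are the predicted classes
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for target, prediction in zip(expected, predicted):
        confusion[target][prediction] += 1
    return confusion

# Toy example with three classes
expected = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
print(confusion_matrix(expected, predicted, 3))
```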
t-SNE Dimension Reduction
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a type of dimensionality reduction algorithm that programmers frequently use for visualization. We will first define dimension reduction and show its advantages for visualization and problem simplification.
The dimensions of a dataset are the number of input (x) values that the program uses to make predictions. The classic iris dataset has four dimensions because we measure the iris flowers in four dimensions. Chapter 4, "Feedforward Networks," has an explanation of the iris dataset. The MNIST digits are images of 28x28 grayscale pixels, which result in a total of 784 input neurons (28x28). As a result, the MNIST dataset has 784 dimensions.
For dimensionality reduction, we need to ask the following question: "Do we really need 784 dimensions, or could we project this dataset into fewer dimensions?" Projections are very common in cartography. Earth exists in at least three dimensions that we can directly observe. The only true three-dimensional map of Earth is a globe. However, globes are inconvenient to store and transport. As long as it still contains the information that we require, a flat (2D) representation of Earth is useful for spaces where a globe will not fit. We can project the globe onto a 2D surface in many ways. Figure 15.2 shows the Lambert projection (from Wikipedia) of Earth:
Figure 15.2: Lambert Projection (cone)
Johann Heinrich Lambert introduced the Lambert projection in 1772. Conceptually, this projection works by placing a cone over some region of the globe and projecting the globe's image onto the cone. Once the cone is unrolled, you have a flat 2D map. Accuracy is better near the tip of the cone and worsens towards the base of the cone. The Lambert projection is not the only way to project the globe and produce a map; Figure 15.3 shows the popular Mercator projection:
Figure 15.3: Mercator Projection (cylinder)
Gerardus Mercator presented the Mercator projection in 1569. This projection works by essentially wrapping a cylinder about the globe at the equator. Accuracy is best at the equator and worsens near the poles. You can see this characteristic by examining the relative size of Greenland in both projections. Along with the two projections just mentioned, many other types exist. Each is designed to show Earth in ways that are useful for different applications.
The projections above are not strictly 2D because they create a type of third dimension with other aspects, such as color. The map projections can convey additional information, such as altitude, ground cover, or even political divisions, with color. Computer projections also utilize color, as we will discover in the next section.
t-SNE as a Visualization
If we can reduce the MNIST digits' 784 dimensions down to two or three with a dimension reduction algorithm, then we can visualize the dataset. Reducing to two dimensions is popular because an article or a book can easily capture the visualization. It is important to remember that a 3D visualization is not actually 3D, as true 3D displays are extremely rare as of the writing of this book. A 3D visualization will be rendered onto a 2D monitor. As a result, it is necessary to "fly" through the space and see how parts of the visualization really appear. This flight through space is very similar to a computer video game where you do not see all aspects of a scene until you fly completely around the object being viewed. Even in the real world, you cannot see both the front and back of an object you are holding; it is necessary to rotate the object with your hands to see all sides.
Karl Pearson invented one of the most common dimensionality reduction algorithms in 1901. Principal component analysis (PCA) creates a number of principal components that match the number of dimensions to be reduced. For a 2D reduction, there would be two principal components. Conceptually, PCA attempts to pack the higher-dimensional items into the principal components in a way that maximizes the amount of variability in the data. By ensuring that the distant values in high-dimensional space remain distant, PCA can complete its function. Figure 15.4 shows a PCA reduction of the MNIST digits to two dimensions:
Figure 15.4: 2D PCA Visualization of MNIST
The first principal component is the x-axis (left and right). As you can see, the plot positions the blue dots (0's) at the far left, and the red dots (1's) are placed towards the right. Handwritten 1's and 0's are the easiest to differentiate; they have the highest variability. The second principal component is the y-axis (up and down). On the top, you have green (2's) and brown (3's), which look somewhat similar. On the bottom are purple (4's), gray (9's) and black (7's), which also look similar. Yet the variability between these two groups is high; it is easier to tell 2's and 3's from 4's, 9's and 7's.
Color is very important to the above image. If you are reading this book in a black-and-white form, this image may not make as much sense. The color represents the actual digit behind each data point. You must note that PCA and t-SNE are both unsupervised; therefore, they do not know the identities of the input vectors. In other words, they don't know which digit was selected. The program adds the colors so that we can see how well PCA grouped the digits. If the above diagram is black and white in your version, you can see that the program did not place the digits into many distinct groups. We can therefore conclude that PCA does not work well as a clustering algorithm.
The above figure is also very noisy because the dots overlap in large regions. The most well-defined region is blue, where the "1" digits reside. You can also see that purple (4), black (7), and gray (9) are easy to confuse. Additionally, brown (3), green (2), and yellow (8) can be misleading.
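To make the PCA reduction concrete, here is a minimal NumPy sketch that projects data onto its top two principal components via an eigen-decomposition of the covariance matrix. The random input is only a stand-in for the MNIST rows:

```python
import numpy as np

def pca_2d(x):
    # Center the data, then project onto the two directions (principal
    # components) that capture the most variance
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    components = eigvecs[:, -2:][:, ::-1]    # top two components, largest first
    return centered @ components

rng = np.random.RandomState(42)
x = rng.rand(200, 10)      # random stand-in for 784-dimensional MNIST rows
reduced = pca_2d(x)
print(reduced.shape)       # (200, 2)
```

Production code would normally use a library implementation such as scikit-learn's PCA, but the eigen-decomposition above is the core of the technique.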
PCA analyzes the pair-wise distances of all data points and preserves large distances. As previously stated, if two points are distant in PCA, they will remain distant. However, we have to question the importance of distance. Consider Figure 15.5, which shows two points that appear to be somewhat close:
Figure 15.5: Apparent Closeness on a Spiral
The points in question are the two red, solid points that are connected by a line. The two points, when connected by a straight line, are somewhat close. However, if the program follows the pattern in the data, the points are actually far apart, as indicated by the solid spiral line that follows all of the points. PCA would attempt to keep these two points close, as they appear in Figure 15.5. The t-SNE algorithm, invented by van der Maaten & Hinton (2008), works somewhat differently. Figure 15.6 shows the t-SNE visualization for the same dataset as featured for PCA:
Figure 15.6: 2D t-SNE Visualization of MNIST
The t-SNE for the MNIST digits shows a much clearer visual for the different digits. Again, the program adds color to indicate where the digits landed. However, even in black and white, you would see some divisions between clusters. Digits located nearer to each other share similarities. The amount of noise is reduced greatly, but you can still see some red dots (0's) sprinkled in the yellow cluster (8's) and cyan cluster (6's), as well as other clusters. You can produce a visualization for a Kaggle dataset using the t-SNE algorithm. We will examine this process in Chapter 16, "Modeling with Neural Networks."
Implementations of t-SNE exist for most modern programming languages. Laurens van der Maaten's homepage contains a list at the following URL:
http://lvdmaaten.github.io/tsne/
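For example, scikit-learn ships a t-SNE implementation. A minimal sketch, using random data as a stand-in for a real dataset:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(42)
x = rng.rand(100, 50)    # random stand-in for a high-dimensional dataset

# Reduce to two dimensions for a scatter plot; perplexity must be
# smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
x2d = tsne.fit_transform(x)
print(x2d.shape)  # (100, 2)
```

On real data, you would scatter-plot `x2d` and color each point by its known class label, as in the MNIST figures above.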
t-SNE Beyond Visualization
Although t-SNE is primarily an algorithm for reducing dimensions for visualization, feature engineering also utilizes it. The algorithm can even serve as a model component. Feature engineering occurs when you create additional input features. A very simple example of feature engineering is when you consider health insurance applicants, and you create an additional feature called BMI, based on the features weight and height, as seen in Equation 15.1:
Equation 15.1: BMI Calculation
BMI is simply a calculated field that allows humans to combine height and weight to determine how healthy someone is. Such features can sometimes help neural networks as well. You can build some additional features with a data point's location in either 2D or 3D space.
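The BMI feature can be engineered with one line of arithmetic: weight in kilograms divided by the square of height in meters. A minimal sketch with made-up applicant values:

```python
def bmi(weight_kg, height_m):
    # Body mass index: weight divided by the square of height
    return weight_kg / (height_m ** 2)

# Append the engineered BMI column to two made-up applicant records
applicants = [(70.0, 1.75), (90.0, 1.80)]
engineered = [(w, h, round(bmi(w, h), 2)) for w, h in applicants]
print(engineered[0])  # (70.0, 1.75, 22.86)
```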
In Chapter 16, "Modeling with Neural Networks," we will discuss building neural networks for the Otto Group Kaggle challenge. Several Kaggle top-ten solutions for this competition used features that were engineered with t-SNE. For this challenge, you had to organize data points into nine classes. The distance between an item and the nearest neighbor of each of the nine classes on a 3D t-SNE projection was a beneficial feature. To calculate this feature, we simply map the entire training set into t-SNE space and obtain the 3D t-SNE coordinates for each data point. Then we generate nine features with the Euclidean distance between the current data point and its nearest neighbor of each of these nine classes. Finally, the program adds these nine fields to the 93 fields already being presented to the neural network.
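The nine distance features just described can be sketched in NumPy. This is an illustration, not our competition code; note that a point's own class will report a distance of zero unless the point excludes itself from the search:

```python
import numpy as np

def nearest_class_distances(coords, labels, query, num_classes):
    # Euclidean distance from `query` to every point, then the minimum
    # distance within each class
    dists = np.linalg.norm(coords - query, axis=1)
    return np.array([dists[labels == c].min() for c in range(num_classes)])

rng = np.random.RandomState(0)
coords = rng.rand(90, 3)                 # stand-in for 3D t-SNE coordinates
labels = np.repeat(np.arange(9), 10)     # nine classes, as in the Otto data
extra = nearest_class_distances(coords, labels, coords[0], 9)
print(extra.shape)  # (9,) -- nine new engineered features for this data point
```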
As a visualization or as part of the input to another model, the t-SNE algorithm provides a great deal of information to the program. The programmer can use this information to see how the data are structured, and the model gains more details on the structure of the data. Most implementations of t-SNE also contain adaptations for large datasets or for very high dimensions. Before you construct a neural network to analyze data, you should consider the t-SNE visualization. After you train the neural network, you can use the confusion matrix to analyze its results.
Chapter Summary
Visualization is an important part of neural network programming. Each dataset presents unique challenges to a machine learning algorithm or a neural network. Visualization can expose these challenges, allowing you to design your approach to account for known issues in the dataset. We demonstrated two visualization techniques in this chapter.
The confusion matrix is a very common visualization for machine learning classification. It is always a square matrix with rows and columns equal to the number of classes in the problem. The rows represent the expected values, and the columns represent the values that the neural network actually classified. The diagonal, where the row and column numbers are equal, represents the number of times the neural network correctly classified that particular class. A well-trained neural network will have the largest numbers along the diagonal. The other cells count the number of times a misclassification occurred between each expected class and actual value.
Although you usually run the confusion matrices after the program generates a neural network, you can run the dimension reduction visualizations beforehand to expose some challenges that might be present in your dataset. You can reduce the dimensions of your dataset to 2D or 3D with the t-SNE algorithm. However, it becomes less effective in dimensions higher than 3D. With the 2D dimension reduction, you can create informative scatter plots that will show the relationship between several classes.
In the next chapter, we will present a Kaggle challenge as a way to synthesize many of the topics previously discussed. We will use the t-SNE visualization as an initial analysis of the dataset. Additionally, we will decrease the neural network's tendency to overfit with the use of dropout layers.
Chapter 16: Modeling with Neural Networks
Data Science, Kaggle, Ensemble Learning
In this chapter, we present a capstone project on modeling, a business-oriented approach for artificial intelligence, and some aspects of data science. Drew Conway (2013), a leading data scientist, characterizes data science as the intersection of hacking skills, math and statistics knowledge, and substantive expertise. Figure 16.1 depicts this definition:
Figure 16.1: Conway's Data Science Venn Diagram
Hacking skills are essentially a subset of computer programming. Although the data scientist does not necessarily need the infrastructure knowledge of an information technology (IT) professional, these technical skills will permit him or her to create short, effective programs for processing data. In the field of data science, we refer to information processing as data wrangling.
Math and statistics knowledge covers statistics, probability, and other inferential methods. Substantive knowledge describes the business knowledge as well as the comprehension of actual data. If you combine only two of these topics, you don't have all the components for data science, as Figure 16.1 illustrates. In other words, the combination of statistics and substantive expertise is simply traditional research. Those two skills alone don't encompass the capabilities, such as machine learning, required for data science.
This book series deals with hacking skills and math and statistics knowledge, two of the circles in Figure 16.1. Additionally, it teaches you to create your own models, which is more pertinent to the field of computer science than data science. Substantive expertise is more difficult to obtain because it is dependent on the industry that utilizes the data science applications. For example, if you want to apply data science in the insurance industry, substantive knowledge refers to the actual business operations of these companies.
To provide a data science capstone project, we will use the Kaggle Otto Group Product Classification Challenge. Kaggle is a platform for competitive data science. You can find the Otto Group Product Classification Challenge at the following URL:
https://www.kaggle.com/c/otto-group-product-classification-challenge
The Otto Group challenge was the first (and currently only) non-tutorial Kaggle competition in which we've competed. After obtaining a top 10% finish, we achieved one of the criteria for the Kaggle Master designation. To become a Kaggle Master, one must place in the top 10 of a competition once and in the top 10% of two other competitions. Figure 16.2 shows the results of our competition entry on the leaderboard:
Figure 16.2: Results in the Otto Group Product Classification Challenge
The above line shows several pieces of information:
- We were in position 331 of 3,514 (9.4%).
- We dropped three spots in the final day.
- Our multi-class log loss score was 0.42881.
- We made 52 submissions, up to May 18, 2015.
We will briefly describe the Otto Group Product Classification Challenge. For a complete description, refer to the Kaggle challenge website (found above). The Otto Group, the world's largest mail order company and currently one of the biggest e-commerce companies, introduced this challenge. Because the group sells many products in numerous countries, they wanted to classify these products into nine categories using 93 features (columns). These 93 columns represented counts and were often 0.
The data were completely redacted (hidden). The competitors did not know the nine categories, nor did they know the meaning behind the 93 features. They knew only that the features were integer counts. Like most Kaggle competitions, this challenge provided the competitors with a test and a training dataset. For the training dataset, the competitors received the outcomes, or correct answers. For the test set, they got only the 93 features, and they had to provide the outcome.
The competition divided the test and training sets in the following way:
- Test data: 144K rows
- Training data: 61K rows
During the competition, participants did not submit their actual models to Kaggle. Instead, they submitted their model's predictions based on the test data. As a result, they could have used any platform to make these predictions. For this competition there were nine categories, so the competitors submitted a nine-number vector that held the probability of each of these nine categories being the correct answer.
The answer in the vector that held the highest probability was the chosen class. As you can observe, this competition was not like a multiple-choice test in school where students must submit their answer as A, B, C, or D. Instead, Kaggle competitors had to submit their answers in the following way:
- A: 80% probability
- B: 16% probability
- C: 2% probability
- D: 2% probability
College exams would not be so horrendous if students could submit answers like those in the Kaggle competition. In many multiple-choice tests, students have confidence about two of the answers and eliminate the remaining two. The Kaggle-like multiple-choice test would allow students to assign a probability to each answer, and they could achieve a partial score. In the above example, if A were the correct answer, students would earn 80% of the points.
Nevertheless, the actual Kaggle score is slightly more complex. The program grades the answers with a logarithm-based scale, and participants face heavy penalties if they have a lower probability on the correct answer. You can see the Kaggle format from the following CSV file submission:
1,0.0003,0.2132,0.2340,0.5468,6.2998e-05,0.0001,0.0050,0.0001,4.3826e-05
2,0.0011,0.0029,0.0010,0.0003,0.0001,0.5207,0.0013,0.4711,0.0011
3,3.2977e-06,4.1419e-06,7.4524e-06,2.6550e-06,5.0014e-07,0.9998,5.2621e-06,0.0001,6.6447e-06
4,0.0001,0.6786,0.3162,0.0039,3.3378e-05,4.1196e-05,0.0001,0.0001,0.0006
5,0.1403,0.0002,0.0002,6.734e-05,0.0001,0.0027,0.0009,0.0297,0.8255
As you can see, each line starts with a number that specifies the data item that is being answered. The sample above shows the answers for items one through five. The next nine values are the probabilities for each of the product classes. These probabilities must add up to 1.0 (100%).
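The logarithm-based scale described above is multi-class log loss. A minimal sketch that also rescales each row to sum to 1.0, as Kaggle does:

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    # Rescale each row so its probabilities sum to 1.0, as Kaggle does
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip away exact 0/1, then average the negative log of the
    # probability assigned to the correct class; putting a low
    # probability on the truth is penalized heavily
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 1])                 # correct classes for two items
probs = np.array([[0.8, 0.1, 0.1],        # confident and correct
                  [0.2, 0.7, 0.1]])       # fairly confident and correct
print(round(multiclass_log_loss(y_true, probs), 4))  # 0.2899
```

A perfect submission would score 0; the penalty grows rapidly as the probability on the correct class shrinks toward zero.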
Lessons from the Challenge
Having success in Kaggle requires you to understand the following topics and the corresponding tools:
- Deep learning, using H2O and Lasagne
- Gradient boosting machines (GBM), using XGBoost
- Ensemble learning, using NumPy
- Feature engineering, using NumPy and scikit-learn
We also learned the following lessons:
- A GPU is really important for deep learning. It is best to use a deep learning package that supports it, such as H2O, Theano or Lasagne.
- The t-SNE visualization is awesome for high-dimension visualization and creating features.
- Ensembling is very important.
For our submission, we used Python with scikit-learn. However, you can use any language capable of generating a CSV file. Kaggle does not actually run your code; they score a submission file. The two most commonly used programming languages for Kaggle are R and Python. Both of these languages have strong data science frameworks available for them. R is actually a domain-specific language (DSL) for statistical analysis.
During this challenge, we learned the most about GBM parameter tuning and ensemble learning. GBMs have quite a few hyper-parameters to tune, and we became proficient at tuning a GBM. The individual scores for our GBMs were in line with those of the top 10% of the teams. However, the solution in this chapter will use only deep learning. GBM is beyond the scope of this book. In a future volume or edition of this series, we plan to examine GBM.
Although computer programmers and data scientists might typically utilize a single model like neural networks, participants in Kaggle need to use multiple models to be successful in the competition. These ensembled models produce better results than each of the models could generate independently.
We worked with t-SNE, examined in Chapter 15, "Visualization," for the first time in this competition. This model works like principal component analysis (PCA) in that it is capable of reducing dimensions. However, the data points separate in such a way that the visualization is often clearer than PCA. The program achieves the clear visualization by using a stochastic nearest neighbor process. Figure 16.3 shows the data from the Otto Group Product Classification Challenge visualized in t-SNE:
Figure 16.3: Challenge t-SNE
The Winning Approach to the Challenge
Kaggle is very competitive. Our primary objective as we entered the challenge was to learn. However, we also hoped to rank in the top 10% by the end in order to reach one of the steps in becoming a Kaggle Master. Earning a top 10% was difficult; in the last few weeks of the challenge, other competitors knocked us out of the bracket almost daily. The last three days were especially turbulent. Before we reveal our solution, we will show you the winning one. The following description is based on the information publicly posted about the winning solution.
The winners of the Otto Group Product Classification Challenge were Gilberto Titericz & Stanislav Semenov. They competed as a team and used a three-level ensemble, as seen in Figure 16.4:
Figure 16.4: Challenge Winning Ensemble
We will provide only a high-level overview of their approach. You can find the full description at the following URL:
https://goo.gl/fZrJA0
The winning approach employed both the R and Python programming languages. Level 1 used a total of 33 different models. Each of these 33 models provided its output to three models in level 2. Additionally, the program generated eight calculated features. An engineered feature is one that is calculated based on the others. A simple example of an engineered feature might be body mass index (BMI), which is calculated based on an individual's height and weight. The BMI value provides insights that height and weight alone might not.
The second level combined the following three model types:
- XGBoost (gradient boosting)
- Lasagne neural network (deep learning)
- AdaBoost extra trees
These three used the output of the 33 models and the eight features as input. The output from these three models was the same nine-number probability vector previously discussed. It was as if each model were being used independently, thereby producing a nine-number vector that would have been suitable as an answer submission to Kaggle. The program averaged together these output vectors with the third layer, which was simply a weighting. As you can see, the winners of the challenge used a large and complex ensemble. Most of the winning solutions in Kaggle followed a similar pattern.
A complete discussion on exactly how they constructed this model is beyond the scope of this book. Quite honestly, such a discussion is also beyond our own current knowledge of ensemble learning. Although these complex ensembles are very effective for Kaggle, they are not always necessary for general data science purposes. These types of models are the blackest of black boxes. It is impossible to explain the reasons behind the model's predictions.
However, learning about these complex models is fascinating for research, and future volumes of this series will likely include more information about these structures.
Our Approach to the Challenge
So far, we've worked only with single-model systems. Some of these models contain ensembles that are "built in," such as random forests and gradient boosting machines (GBM). However, it is possible to create higher-level ensembles of these models. We used a total of 20 models, which included ten deep neural networks and ten gradient boosting machines. Our deep neural network system provided one prediction, and the gradient boosting machines provided the other. The program blended these two predictions with a simple ratio. Then we normalized the resulting prediction vector so that the sum equaled 1.0 (100%). Figure 16.5 shows the ensemble model:
Figure 16.5: Our Challenge Group Entry
You can find our entry, written in Python, at the following URL:
https://github.com/jeffheaton/kaggle-otto-group
Modeling with Deep Learning
To stay within the scope of this book, we will present a solution to the Kaggle competition based on our entry. Because gradient boosting machines (GBM) are beyond the subject matter of this book, we will focus on using a deep neural network. To introduce ensemble learning, we will use bagging to combine ten trained neural networks together. Ensemble methods, such as bagging, will usually cause the aggregate of ten neural networks to score better than a single network. If you would like to use gradient boosting machines and replicate our solution, see the link provided above for the source code.
Neural Network Structure
For this neural network, we used a deep learning structure composed of dense layers and dropout layers. Because this structure was not an image network, we did not use convolutional layers or max-pool layers. These layer types require that input neurons in close proximity have some relevance to each other. However, the 93 input values that comprise the dataset might not have any such relationship. Figure 16.6 shows the structure of the deep neural network:
Figure 16.6: Deep Neural Network for the Challenge
As you can see, the input layer of the neural network had 93 neurons that corresponded to the 93 input columns in the dataset. Three hidden layers had 256, 128 and 64 neurons each. Additionally, the two dropout layers had 256 and 128 neurons, respectively, and a dropout probability of 20%. The output was a softmax layer that classified the nine output groups. We normalized the input data to the neural network by taking their z-scores.
Our strategy was to use two dropout layers tucked between three dense layers. We chose a power of 2 for the first dense layer. In this case, we used 2 to the power of 8 (256). Then we divided by 2 to obtain each of the next two dense layers. This process resulted in 256, 128 and then 64. The pattern of using a power of 2 for the first layer and two more dense layers, each dividing by 2, worked well. As the experiments continued, we tried other powers of 2 in the first dense layer.
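The z-score normalization mentioned above rescales every input column to a mean of 0 and a standard deviation of 1. A minimal NumPy sketch, with random data standing in for the 93 count columns:

```python
import numpy as np

def zscore(x):
    # Rescale each column to mean 0 and standard deviation 1
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.RandomState(1)
data = rng.rand(100, 93) * 50    # random stand-in for the 93 count columns
normalized = zscore(data)
print(normalized.shape)  # (100, 93)
```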
We trained the network with stochastic gradient descent (SGD). The program divided the training data into a validation set and a training set. The SGD training used only the training dataset, but it monitored the validation set's error. We trained until our validation set's error did not improve for 200 iterations. At this point, the training stopped, and the program selected the best-trained neural network over those 200 iterations. We refer to this process as early stopping, and it helps to prevent overfitting. When a neural network is no longer improving the score on the validation set, overfitting is likely occurring.
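The early-stopping loop just described can be sketched as follows; `train_epoch` and `validation_error` are hypothetical callables standing in for whatever training framework is used:

```python
import copy

def train_with_early_stopping(network, train_epoch, validation_error, patience=200):
    # Stop once the validation error has not improved for `patience`
    # consecutive epochs, then return the best network seen so far
    best_error = float("inf")
    best_network = None
    stalled = 0
    while stalled < patience:
        train_epoch(network)                # one pass over the training data
        error = validation_error(network)   # error on the held-out set
        if error < best_error:
            best_error = error
            best_network = copy.deepcopy(network)
            stalled = 0
        else:
            stalled += 1
    return best_network, best_error

# Toy demonstration: the "validation error" improves for five epochs, then stalls
errors = iter([0.9, 0.7, 0.6, 0.55, 0.5] + [0.6] * 10)
best, err = train_with_early_stopping(
    {}, lambda net: None, lambda net: next(errors), patience=3)
print(err)  # 0.5
```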
Running the neural network produces the following output:
Input (None, 93) produces 93 outputs
dense0 (None, 256) produces 256 outputs
dropout0 (None, 256) produces 256 outputs
dense1 (None, 128) produces 128 outputs
dropout1 (None, 128) produces 128 outputs
dense2 (None, 64) produces 64 outputs
output (None, 9) produces 9 outputs
epoch  train loss  valid loss  train/val  valid acc
---------------------------------------------------
    1     1.07019     0.71004    1.50723    0.73697
    2     0.78002     0.66415    1.17447    0.74626
    3     0.72560     0.64177    1.13061    0.75000
    4     0.70295     0.62789    1.11955    0.75353
    5     0.67780     0.61759    1.09750    0.75724
...
  410     0.40410     0.50785    0.79572    0.80963
  411     0.40876     0.50930    0.80260    0.80645
Early stopping.
Best valid loss was 0.495116 at epoch 211.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 1, score: 0.49511558950601003, current mlog: 0.379456064667434, bagged mlog: 0.379456064667434
Early stopping.
Best valid loss was 0.502459 at epoch 221.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 2, score: 0.5024587499599558, current mlog: 0.38050303230483773, bagged mlog: 0.3720715012362133
epoch  train loss  valid loss  train/val  valid acc
---------------------------------------------------
    1     1.07071     0.70542    1.51785    0.73658
    2     0.77458     0.66499    1.16479    0.74670
...
  370     0.41459     0.50696    0.81779    0.80760
  371     0.40849     0.50873    0.80296    0.80642
  372     0.41383     0.50855    0.81376    0.80787
Early stopping.
Best valid loss was 0.500154 at epoch 172.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 3, score: 0.5001535314594113, current mlog: 0.3872396776865103, bagged mlog: 0.3721509601621992
...
Bagged LAS model: 4, score: 0.4984386022067697, current mlog: 0.39710688423724777, bagged mlog: 0.37481605169768967
...
In general, the neural network gradually decreases its training and validation error. If you run this example, you might see different output, based on the programming language from which the example originates. The above output is from Python and the Lasagne/NoLearn frameworks.
It is important to understand why there is both a validation error and a training error. Most neural network training algorithms will separate the training set into a training and a validation set; this split might be 80% for training and 20% for validation. The neural network uses the 80% to train, and it reports that error as the training error. You can also use the validation set to generate an error, which is the validation error. Because it represents the error on data that the neural network was not trained with, the validation error is the most important measure. As the neural network trains, the training error will continue to drop even if the neural network is overfitting. However, once the validation error stops dropping, the neural network is probably beginning to overfit.
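A minimal sketch of the 80/20 split described above, assuming the data set fits in a Python list; the `split_train_validation` helper and its fixed seed are illustrative choices, not the book's code.

```python
import random

def split_train_validation(data, train_fraction=0.8, seed=42):
    """Shuffle a data set, then split it into training and validation parts."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(100))
train, valid = split_train_validation(rows)
print(len(train), len(valid))  # 80 20
```

Shuffling before splitting matters: if the data set is sorted by class, an unshuffled split would put some classes entirely into the validation set.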
Bagging Multiple Neural Networks

Bagging is a simple yet effective method to ensemble multiple models together. The example program for this chapter trains five neural networks independently. Each neural network produces its own set of nine probabilities that correspond to the nine classes provided by Kaggle. Bagging simply takes the average of the probabilities that the networks produce for each of these nine classes. Listing 16.1 provides the pseudocode to perform the bagging:
Listing 16.1: Bagging Neural Networks
# final_results is a matrix with rows equal to the rows in the training set.
# Columns = number of outcomes (1 for regression, or the class count for
# classification).
final_results = [][]
for i from 1 to 5:
    network = train_neural_network()
    results = evaluate_network(network)
    final_results = final_results + results
# Take the average
final_results = final_results / 5
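A concrete version of the averaging step in Listing 16.1 can be written with NumPy. The `bag_predictions` helper and the toy probability matrices below are illustrative; in the real program, each matrix would hold one trained network's predictions on the test set.

```python
import numpy as np

def bag_predictions(prediction_sets):
    """Average the class-probability matrices produced by several models.
    Each matrix has one row per test item and one column per class."""
    return np.mean(np.stack(prediction_sets), axis=0)

# Two toy models predicting 3 classes for 2 test items.
model_a = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]])
print(bag_predictions([model_a, model_b]))
# [[0.7 0.2 0.1]
#  [0.3 0.4 0.3]]
```

Because every input row already sums to 1.0, the averaged rows do as well.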
We performed the bagging on the test data set provided by Kaggle. Although the test set provided the 93 columns, it did not tell us the classes that it supplied. We had to produce a file that contained the ID of the item for which we were answering and then the nine probabilities. On each row, the probabilities should sum to 1.0 (100%). If we submitted a file that did not sum to 1.0, Kaggle would have scaled our values so that they did sum to 1.0.
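Rescaling each row so that its probabilities sum to 1.0 (the same adjustment Kaggle would apply) is a one-line NumPy operation. The `normalize_rows` helper and the toy values below are illustrative:

```python
import numpy as np

def normalize_rows(probs):
    """Rescale each row so its probabilities sum to 1.0."""
    return probs / probs.sum(axis=1, keepdims=True)

raw = np.array([[0.2, 0.2, 0.1],   # sums to 0.5
                [0.9, 0.6, 0.0]])  # sums to 1.5
print(normalize_rows(raw))
# [[0.4 0.4 0.2]
#  [0.6 0.4 0. ]]
```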
To see the effects of bagging, we submitted two test files to Kaggle. The first test file came from the first neural network that we trained. The second test file was the bagged average of all five. The results were as follows:

Best Single Network: 0.3794
Five Bagged Networks: 0.3717

As you can see, the bagged networks achieved a better score than a single neural network. The complete results are shown here:
Bagged LAS model: 1, score: 0.4951, current mlog: 0.3794, bagged mlog: 0.3794
Bagged LAS model: 2, score: 0.5024, current mlog: 0.3805, bagged mlog: 0.3720
Bagged LAS model: 3, score: 0.5001, current mlog: 0.3872, bagged mlog: 0.3721
Bagged LAS model: 4, score: 0.4984, current mlog: 0.3971, bagged mlog: 0.3748
Bagged LAS model: 5, score: 0.4979, current mlog: 0.3869, bagged mlog: 0.3717
As you can see, the first neural network had a multi-class log loss (mlog) error of 0.3794. The mlog measure was discussed in Chapter 5, "Training & Evaluation." The bagged score was the same because we had only one network. The amazing part happens when we bagged the second network with the first. The current scores of the first two networks were 0.3794 and 0.3805. However, when we bagged them together, we obtained 0.3720, which was lower than either network's individual score. Averaging the predictions of these two networks produced a result that was better than both. Ultimately, we settled on a bagged score of 0.3717, which was better than any of the previous single-network (current) scores.
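This effect is easy to reproduce with toy numbers: when two models err on different rows, the multi-class log loss of their averaged predictions can fall below either model's individual loss. The values below are invented purely for illustration:

```python
import numpy as np

def mlog_loss(probs, true_classes):
    """Multi-class logarithmic loss: mean of -log(p) for the true class."""
    rows = np.arange(len(true_classes))
    return -np.mean(np.log(probs[rows, true_classes]))

true_classes = np.array([0, 1])
# Model A is confident on row 0 but badly wrong on row 1; B is the reverse.
model_a = np.array([[0.9, 0.1], [0.8, 0.2]])
model_b = np.array([[0.2, 0.8], [0.1, 0.9]])
bagged = (model_a + model_b) / 2.0

print(mlog_loss(model_a, true_classes))  # ~0.857
print(mlog_loss(model_b, true_classes))  # ~0.857
print(mlog_loss(bagged, true_classes))   # ~0.598, better than either model
```

Each model's confident mistake is diluted by the other model's correct answer, so the averaged predictions score better than both.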
Chapter Summary

In the final chapter of this book, we showed how to apply deep learning to a real-world problem. We trained a deep neural network to produce a submission file for the Kaggle Otto Group Product Classification Challenge. We used dense and dropout layers to create this neural network.
We can utilize ensembles to combine several models into one. Usually, the resulting ensemble model will achieve better scores than any of the individual models that form it. We also examined how to bag five neural networks together and generate a Kaggle submission CSV.

After analyzing neural networks and deep learning in this final chapter, as well as the previous chapters, we hope that you have learned new and useful information. If you have any comments about this volume, we would love to hear from you. In the future, we plan to create additional editions of the volumes to include more technologies. Therefore, we would be interested in discovering your preferences on the technologies that you would like us to explore in future editions. You can contact us through the following website:
http://www.jeffheaton.com
Appendix A: Examples
Artificial Intelligence for Humans

These examples are part of a series of books that is currently under development. Check the website to see which volumes have been completed and are available:
http://www.heatonresearch.com/aifh
The following volumes are planned for this series:

Volume 0: Introduction to the Math of AI
Volume 1: Fundamental Algorithms
Volume 2: Nature-Inspired Algorithms
Volume 3: Deep Learning and Neural Networks
In this appendix, we describe how to obtain the Artificial Intelligence for Humans (AIFH) book series examples.

Latest Versions

The examples are probably the most dynamic part of the book. Computer languages are always changing and adding new versions. We will update the examples as it becomes necessary, fixing bugs and making corrections. As a result, make sure that you are always using the latest version of the book examples.

Because this area is so dynamic, this file may become outdated. You can always find the latest version at the following location:
https://github.com/jeffheaton/aifh
Obtaining the Examples

We provide the book's examples in many programming languages. Core example packs exist for Java, C#, C/C++, Python, and R for most volumes. Volume 3, as of publication, includes Java, C#, and Python; other languages, such as R and C/C++, are planned. We, or the community, may have added other languages since publication. You can find all examples at the GitHub repository:
https://github.com/jeffheaton/aifh
You have your choice of two different ways to download the examples.

Download ZIP File

GitHub provides an icon that allows you to download a single ZIP file containing all of the example code for the series. Because one ZIP holds every volume's examples, we update its contents frequently. If you are starting a new volume, it is important to verify that you have the latest copy. You can perform the download from the following URL:
https://github.com/jeffheaton/aifh
You can see the download link in Figure A.1:

Figure A.1: GitHub

Clone the Git Repository

You can obtain all of the examples with the source control program git, if it is installed on your system. (Cloning simply refers to the process of copying the example files.) The following command clones the examples to your computer:
git clone https://github.com/jeffheaton/aifh.git
You can also pull the latest updates with the following command:

git pull

If you would like an introduction to git, refer to the following URL:
http://git-scm.com/docs/gittutorial
Example Contents

The entire Artificial Intelligence for Humans series is contained in a single ZIP file download.

Once you open the examples file, you will see the contents shown in Figure A.2:
Figure A.2: Examples Download
The license file describes the license for the book examples. All of the examples for this series are released under the Apache v2.0 license, a free and open-source software (FOSS) license. We do retain a copyright to the files; however, you can freely reuse them in both commercial and non-commercial projects without further permission.

Although the book source code is provided free, the book text is not. These books are commercial products that we sell through a variety of channels. Consequently, you may not redistribute the actual books. This restriction includes the PDF, MOBI, EPUB, and any other format of the book. However, we provide all books in DRM-free form. We appreciate your support of this policy because it contributes to the future growth of these books.

The download also includes a README file. README.md is a "markdown" file that contains images and formatting. You can read this file either as a standard text file or in a markdown viewer; the GitHub browser automatically formats MD files. For more information on MD files, refer to the following URL:
https://help.github.com/articles/github-flavored-markdown
You will find a README file in many folders of the book's examples. The README file in the examples root (seen above) has information about the book series.

You will also notice the individual volume folders in the download. These are named vol1, vol2, vol3, etc. You may not see all of the volumes in the download because some have not yet been written. All of the volumes have the same format. For example, if you open Volume 3, you will see the contents listed in Figure A.3. Other volumes will have a similar layout, depending on the languages that have been added.

Figure A.3: Inside Volume 3 (other volumes have the same structure)

Again, you see the README file that contains information unique to this particular volume. The most important information in the volume-level README files is the current status of the examples. The community often contributes example packs, so some of the example packs may not be complete. The README for the volume will let you know this important information. The volume README also contains the FAQ for that volume.

You should also see a file named "aifh_vol3.RMD". This file contains the R Markdown source code that we used to create many of the charts in the book; we produced nearly all of the graphs and charts with the R programming language. The file ultimately allows you to see the equations behind the pictures. Nevertheless, we do not translate this file to other programming languages; we utilize R simply for the production of the book. If we had used another language, like Python, to produce some of the charts, you would see a "charts.py" file along with the R code.

Additionally, the volume currently has examples for C#, Java, and Python. However, we may add other languages, so always check the README file for the latest information on language translations.
Figure A.4 shows the contents of a typical language pack:

Figure A.4: The Java Language Pack

Pay attention to the README files. The README file in a language folder is important because it contains information specific to that language's examples. If you have difficulty using the book's examples with a particular language, the README file should be your first step toward solving the problem. The other files in the above image are all unique to Java; the README file describes them in much greater detail.
Contributing to the Project

If you would like to translate the examples to a new language, or if you have found an error in the book, you can help. Fork the project and push a commit revision to GitHub. We will credit you among the growing number of contributors.

The process begins with a fork. You create an account on GitHub and fork the AIFH project. This step creates a new project that has a copy of the AIFH files. You will then clone your new project through GitHub. Once you make your changes, you submit a "pull request." When we receive this request, we will evaluate your changes or additions and merge them with the main project.

You can find a more detailed article on contributing through GitHub at this URL:
https://help.github.com/articles/fork-a-repo