Title: AIFH, Volume 3: Deep Learning and Neural Networks
Author: Jeff Heaton
Published: December 31, 2015
Copyright: Copyright 2015 by Heaton Research, Inc., All Rights Reserved.
File Created: Sun Nov 08 15:28:13 CST 2015
ISBN: 978-1505714340
Price: 9.99 USD
Do not make illegal copies of this ebook
This eBook is copyrighted material, and public distribution is prohibited. If you did not receive this ebook from Heaton Research (http://www.heatonresearch.com), or an authorized bookseller, please contact Heaton Research, Inc. to purchase a licensed copy. DRM-free copies of our books can be purchased from:
http://www.heatonresearch.com/book
If you purchased this book, thank you! Your purchase of this book supports the Encog Machine Learning Framework. http://www.encog.org
Publisher: Heaton Research, Inc.
Artificial Intelligence for Humans, Volume 3: Neural Networks and Deep Learning
December, 2015
Author: Jeff Heaton
Editor: Tracy Heaton
ISBN: 978-1505714340
Edition: 1.0
Copyright © 2015 by Heaton Research Inc., 1734 Clarkson Rd. #107, Chesterfield, MO 63017-4976. World rights reserved. The author(s) created reusable code in this publication expressly for reuse by readers. Heaton Research, Inc. grants readers permission to reuse the code found in this publication or downloaded from our website so long as (author(s)) are attributed in any application containing the reusable code and the source code itself is never redistributed, posted online by electronic transmission, sold or commercially exploited as a stand-alone product. Aside from this specific exception concerning reusable code, no part of this publication may be stored in a retrieval system, transmitted, or reproduced in any way, including, but not limited to photocopy, photograph, magnetic, or other record, without prior agreement and written permission of the publisher.
Heaton Research, Encog, the Encog Logo and the Heaton Research logo are all trademarks of Heaton Research, Inc., in the United States and/or other countries.
TRADEMARKS: Heaton Research has attempted throughout this book to distinguish proprietary trademarks from descriptive terms by following the capitalization style used by the manufacturer.
The author and publisher have made their best efforts to prepare this book, so the content is based upon the final release of software whenever possible. Portions of the manuscript may be based upon pre-release versions supplied by software manufacturer(s). The author and the publisher make no representation or warranties of any kind with regard to the completeness or accuracy of the contents herein and accept no liability of any kind including but not limited to performance, merchantability, fitness for any particular purpose, or any losses or damages of any kind caused or alleged to be caused directly or indirectly from this book.
SOFTWARE LICENSE AGREEMENT: TERMS AND CONDITIONS
The media and/or any online materials accompanying this book that are available now or in the future contain programs and/or text files (the “Software”) to be used in connection with the book. Heaton Research, Inc. hereby grants to you a license to use and distribute software programs that make use of the compiled binary form of this book’s source code. You may not redistribute the source code contained in this book, without the written permission of Heaton Research, Inc. Your purchase, acceptance, or use of the Software will constitute your acceptance of such terms.
The Software compilation is the property of Heaton Research, Inc. unless otherwise indicated and is protected by copyright to Heaton Research, Inc. or other copyright owner(s) as indicated in the media files (the “Owner(s)”). You are hereby granted a license to use and distribute the Software for your personal, noncommercial use only. You may not reproduce, sell, distribute, publish, circulate, or commercially exploit the Software, or any portion thereof, without the written consent of Heaton Research, Inc. and the specific copyright owner(s) of any component software included on this media.
In the event that the Software or components include specific license requirements or end-user agreements, statements of condition, disclaimers, limitations or warranties (“End-User License”), those End-User Licenses supersede the terms and conditions herein as to that particular Software component. Your purchase, acceptance, or use of the Software will constitute your acceptance of such End-User Licenses.
By purchase, use or acceptance of the Software you further agree to comply with all export laws and regulations of the United States as such laws and regulations may exist from time to time.
SOFTWARE SUPPORT
Components of the supplemental Software and any offers associated with them may be supported by the specific Owner(s) of that material but they are not supported by Heaton Research, Inc. Information regarding any available support may be obtained from the Owner(s) using the information provided in the appropriate README files or listed elsewhere on the media.
Should the manufacturer(s) or other Owner(s) cease to offer support or decline to honor any offer, Heaton Research, Inc. bears no responsibility. This notice concerning support for the Software is provided for your information only. Heaton Research, Inc. is not the agent or principal of the Owner(s), and Heaton Research, Inc. is in no way responsible for providing any support for the Software, nor is it liable or responsible for any support provided, or not provided, by the Owner(s).
WARRANTY
Heaton Research, Inc. warrants the enclosed media to be free of physical defects for a period of ninety (90) days after purchase. The Software is not available from Heaton Research, Inc. in any other form or media than that enclosed herein or posted to www.heatonresearch.com. If you discover a defect in the media during this warranty period, you may obtain a replacement of identical format at no charge by sending the defective media, postage prepaid, with proof of purchase to:
Heaton Research, Inc.
Customer Support Department
1734 Clarkson Rd #107
Chesterfield, MO 63017-4976
Web: www.heatonresearch.com
E-Mail: [email protected]
DISCLAIMER
Heaton Research, Inc. makes no warranty or representation, either expressed or implied, with respect to the Software or its contents, quality, performance, merchantability, or fitness for a particular purpose. In no event will Heaton Research, Inc., its distributors, or dealers be liable to you or any other party for direct, indirect, special, incidental, consequential, or other damages arising out of the use of or inability to use the Software or its contents even if advised of the possibility of such damage. In the event that the Software includes an online update feature, Heaton Research, Inc. further disclaims any obligation to provide this feature for any specific duration other than the initial posting.
The exclusion of implied warranties is not permitted by some states. Therefore, the above exclusion may not apply to you. This warranty provides you with specific legal rights; there may be other rights that you may have that vary from state to state. The pricing of the book with the Software by Heaton Research, Inc. reflects the allocation of risk and limitations on liability contained in this agreement of Terms and Conditions.
SHAREWARE DISTRIBUTION
This Software may use various programs and libraries that are distributed as shareware. Copyright laws apply to both shareware and ordinary commercial software, and the copyright Owner(s) retains all rights. If you try a shareware program and continue using it, you are expected to register it. Individual programs differ on details of trial periods, registration, and payment. Please observe the requirements stated in appropriate files.
This book is dedicated to my mom Mary,
thank you for all the love
and encouragement over the years.
Introduction
Series Introduction
Example Computer Languages
Prerequisite Knowledge
Fundamental Algorithms
Other Resources
Structure of this Book
This book is the third in a series covering select topics in artificial intelligence (AI), a large field of study that encompasses many sub-disciplines. In this introduction, we will provide some background information for readers who might not have read Volume 1 or 2. It is not necessary to read Volume 1 or 2 before this book. We introduce needed information from both volumes in the following sections.
Series Introduction
This series of books introduces the reader to a variety of popular topics in artificial intelligence. By no means are these volumes intended to be an exhaustive AI resource. However, each book presents a specific area of AI to familiarize the reader with some of the latest techniques in this field of computer science.
In this series, we teach artificial intelligence concepts in a mathematically gentle manner, which is why we named the series Artificial Intelligence for Humans. As a result, we always follow the theories with real-world programming examples and pseudocode instead of relying solely on mathematical formulas. Still, we make these assumptions:
The reader is proficient in at least one programming language.
The reader has a basic understanding of college algebra.
The reader does not necessarily have much experience with formulas from calculus, linear algebra, differential equations, and statistics. We will introduce these formulas when necessary.
Finally, the book’s examples have been ported to a number of programming languages. Readers can adapt the examples to the language that fits their particular programming needs.
Programming Languages
Although the book’s text stays at the pseudocode level, we provide example packs for Java, C# and Python. The Scala programming language has a community-supplied port, and readers are also working on porting the examples to additional languages. So, your favorite language might have been ported since this printing. Check the book’s GitHub repository for more information. We highly encourage readers of the books to help port to other languages. If you would like to get involved, Appendix A has more information to get you started.
Online Labs
Many of the examples from this series use JavaScript and are available to run online, using HTML5. Mobile devices must also have HTML5 capability to run the programs. You can find all online lab materials at the following website:
http://www.aifh.org
These online labs allow you to experiment with the examples even as you read the e-book from a mobile device.
Code Repositories
All of the code for this project is released under the Apache Open Source License v2 and can be found at the following GitHub repository:
https://github.com/jeffheaton/aifh
If you find something broken, misspelled, or otherwise botched as you work with the examples, you can fork the project and push a commit revision to GitHub. You will also receive credit among the growing number of contributors. Refer to Appendix A for more information on contributing code.
Books Planned for the Series
The following volumes are planned for this series:
Volume 0: Introduction to the Math of AI
Volume 1: Fundamental Algorithms
Volume 2: Nature-Inspired Algorithms
Volume 3: Deep Learning and Neural Networks
We will produce Volumes 1, 2, and 3 in order. Volume 0 is a planned prequel that we will create near the end of the series. While all the books will include the required mathematical formulas to implement the programs, the prequel will recap and expand on all the concepts from the earlier volumes. We also intend to produce more books on AI after the publication of Volume 3.
In general, you can read the books in any order. Each book’s introduction will provide some background material from previous volumes. This organization allows you to jump quickly to the volume that contains your area of interest. If you want to supplement your knowledge at a later point, you can read the previous volume.
Other Resources
Many other resources on the Internet will be very useful as you read through this series of books.
The first resource is Khan Academy, a nonprofit, educational website that provides videos to demonstrate many areas of mathematics. If you need additional review on any mathematical concept in this book, Khan Academy probably has a video on that information.
http://www.khanacademy.org/
The second resource is the Neural Network FAQ. This text-only resource has a great deal of information on neural networks and other AI topics.
http://www.faqs.org/faqs/ai-faq/neural-nets/
Although the information in this book is not necessarily tied to Encog, the Encog home page has a fair amount of general information on machine learning.
http://www.encog.org
Neural Networks Introduction
Neural networks have been around since the 1940s, and, as a result, they have quite a bit of history. This book will cover the historic aspects of neural networks because you need to know some of the terminology. A good example of this historic progress is the activation function, which scales values passing through neurons in the neural network. Along with threshold activation functions, researchers introduced neural networks, and this advancement gave way to sigmoidal activation functions, then to hyperbolic tangent functions and now to the rectified linear unit (ReLU). While most current literature suggests using the ReLU activation function exclusively, you need to understand sigmoidal and hyperbolic tangent to see the benefits of ReLU.
Whenever possible, we will indicate which architectural component of a neural network to use. We will always identify the architectural components now accepted as the recommended choice over older classical components. We will bring many of these architectural elements together and provide you with some concrete recommendations for structuring your neural networks in Chapter 14, “Architecting Neural Networks.”
Neural networks have risen from the ashes of discredit several times in their history. McCulloch, W. and Pitts, W. (1943) first introduced the idea of a neural network. However, they had no method to train these neural networks. Programmers had to craft by hand the weight matrices of these early networks. Because this process was tedious, neural networks fell into disuse for the first time.
Rosenblatt, F. (1958) provided a much-needed training algorithm called backpropagation, which automatically creates the weight matrices of neural networks. In fact, backpropagation can train networks with many layers of neurons that simulate the architecture of animal brains. However, backpropagation is slow, and, as the layers increase, it becomes even slower. It appeared as if the addition of computational power in the 1980s and early 1990s helped neural networks perform tasks, but the hardware and training algorithms of this era could not effectively train neural networks with many layers, and, for the second time, neural networks fell into disuse.
The third rise of neural networks occurred when Hinton (2006) provided a radical new way to train deep neural networks. The recent advances in high-speed graphics processing units (GPUs) allowed programmers to train neural networks with three or more layers and led to a resurgence in this technology as programmers realized the benefits of deep neural networks.
In order to establish the foundation for the rest of the book, we begin with an analysis of classic neural networks, which are still useful for a variety of tasks. Our analysis includes concepts, such as self-organizing maps (SOMs), Hopfield neural networks, and Boltzmann machines. We also introduce the feedforward neural network and show several ways to train it.
A feedforward neural network with many layers becomes a deep neural network. The book contains methods, such as GPU support, to train deep networks. We also explore technologies related to deep learning, such as dropout, regularization, and convolution. Finally, we demonstrate these techniques through several real-world examples of deep learning, such as predictive modeling and image recognition.
If you would like to read in greater detail about the three phases of neural network technology, the following article presents a great overview:
http://chronicle.com/article/The-Believers/190147/
The Kickstarter Campaign
In 2013, we launched this series of books after a successful Kickstarter campaign. Figure 1 shows the home page of the Kickstarter project for Volume 3:
Figure 1: The Kickstarter Campaign
You can visit the original Kickstarter at the following link:
https://goo.gl/zW4dht
We would like to thank all of the Kickstarter backers of the project. Without your support, this series might not exist. We would like to extend a huge thank you to those who backed at the $250 and beyond level:
Figure 2: Gold Level Backers
It will be great discussing your projects with you. Thank you again for your support.
We would also like to extend a special thanks to those backers who supported the book at the $100 and higher levels. They are listed here in the order that they backed:
Figure 3: Silver Level Backers
A special thank you to my wife, Tracy Heaton, who edited the previous two volumes.
There have been three volumes so far; the repeat backers have been very valuable to this campaign! It is amazing to me how many repeat backers there are!
Thank you, everyone—you are the best!
http://www.heatonresearch.com/ThankYou/
Figure 4: Repeat Backers 1/4
Figure 5: Repeat Backers 2/4
Figure 6: Repeat Backers 3/4
Figure 7: Repeat Backers 4/4
Background Information
You can read Artificial Intelligence for Humans in any order. However, this book does expand on some topics introduced in Volumes 1 and 2. The goal of this section is to help you understand what a neural network is and how to use it. Most people, even non-programmers, have heard of neural networks. Many science fiction stories have plots that are based on ideas related to neural networks. As a result, sci-fi writers have created an influential but somewhat inaccurate view of the neural network.
Most laypeople consider neural networks to be a type of artificial brain. According to this view, neural networks could power robots or carry on intelligent conversations with human beings. However, this notion is a closer definition of artificial intelligence (AI) than of neural networks. Although AI seeks to create truly intelligent machines, the current state of computers is far below this goal. Human intelligence still trumps computer intelligence.
Neural networks are a small part of AI. As they currently exist, neural networks carry out minuscule, highly specific tasks. Unlike the human brain, computer-based neural networks are not general-purpose computational devices. Furthermore, the term neural network can create confusion because the brain is a network of neurons just as AI uses neural networks. To avoid this problem, we must make an important distinction.
We should really call the human brain a biological neural network (BNN). Most texts do not bother to make the distinction between a BNN and artificial neural networks (ANNs). Our book follows this pattern. When we refer to neural networks, we’re dealing with ANNs. We are not talking about BNNs when we use the term neural network.
Biological neural networks and artificial neural networks share some very basic similarities. For instance, biological neural networks have inspired the mathematical constructs of artificial neural networks. Biological plausibility describes various artificial neural network algorithms. This term defines how close an artificial neural network algorithm is to a biological neural network.
As previously mentioned, programmers design neural networks to execute one small task. A full application will likely use neural networks to accomplish certain parts of the application. However, the entire application will not be implemented as a neural network. It may consist of several neural networks of which each has a specific task.
Pattern recognition is a task that neural networks can easily accomplish. For this task, you can communicate a pattern to a neural network, and it communicates a pattern back to you. At the highest level, a typical neural network can perform only this function. Although some network architectures might achieve more, the vast majority of neural networks work this way. Figure 8 illustrates a neural network at this level:
Figure 8: A Typical Neural Network
As you can see, the above neural network accepts a pattern and returns a pattern. Neural networks operate synchronously and will only produce output when presented with input. This behavior is not like that of a human brain, which does not operate synchronously. The human brain responds to input, but it will produce output anytime it feels like it!
Neural Network Structure
Neural networks consist of layers of similar neurons. Most have at least an input layer and an output layer. The program presents the input pattern to the input layer. Then the output pattern is returned from the output layer. What happens between the input and output layers is a black box. By black box, we mean that you do not know exactly why a neural network outputs what it does. At this point, we are not yet concerned with the internal structure of the neural network, or the black box. Many different architectures define the interaction between the input and output layer. Later, we will examine some of these architectures.
The input and output patterns are both arrays of floating-point numbers. Consider the arrays in the following ways:
Neural Network Input: [-0.245, 0.283, 0.0]
Neural Network Output: [0.782, 0.543]
The above neural network has three neurons in the input layer and two neurons in the output layer. The number of neurons in the input and output layers does not change, even if you restructure the interior of the neural network.
To utilize the neural network, you must express your problem so that the input of the problem is an array of floating-point numbers. Likewise, the solution to the problem must be an array of floating-point numbers. Ultimately, this expression is the only process that neural networks can perform. In other words, they take one array and transform it into a second. Neural networks do not loop, call subroutines, or perform any of the other tasks you might think of with traditional programming. Neural networks simply recognize patterns.
You might think of a neural network as a hash table in traditional programming that maps keys to values. It acts somewhat like a dictionary. You can consider the following as a type of hash table:
“hear” -> “to perceive or apprehend by the ear”
“run” -> “to go faster than a walk”
“write” -> “to form (as characters or symbols) on a surface with an instrument (as a pen)”
This table creates a mapping between words and their definitions. Programming languages usually call this a hash map or a dictionary. This hash table uses a key of type string to reference another value that is also of type string. If you’ve not worked with hash tables before, they simply map one value to another, and they are a form of indexing. In other words, the dictionary returns a value when you provide it with a key. Most neural networks function in this manner. One neural network called bidirectional associative memory (BAM) allows you to provide the value and receive the key.
Programming hash tables contain keys and values. Think of the pattern sent to the input layer of the neural network as the key to the hash table. Likewise, think of the value returned from the hash table as the pattern that is returned from the output layer of the neural network. Although the comparison between a hash table and a neural network is appropriate to help you understand the concept, you need to realize that the neural network is much more than a hash table.
What would happen with the previous hash table if you were to provide a word that is not a key in the map? To answer the question, we will pass in the key of “wrote.” For this example, a hash table would return null. It would indicate in some way that it could not find the specified key. However, neural networks do not return null; they find the closest match. Not only do they find the closest match, they will modify the output to estimate the missing value. So if you passed in “wrote” to the above neural network, you would likely receive what you would have expected for “write.” You would likely get the output from one of the other keys because not enough data exist for the neural network to modify the response. The limited number of samples (in this case, there are three) causes this result.
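The hash-table analogy can be sketched in a few lines of Python. The standard-library `difflib` module stands in here for the network's closest-match behavior; this is only an illustration of the idea, not how a neural network actually computes, and the function names are ours:

```python
import difflib

# A hash table (dictionary) maps keys to values exactly.
definitions = {
    "hear": "to perceive or apprehend by the ear",
    "run": "to go faster than a walk",
    "write": "to form (as characters or symbols) on a surface "
             "with an instrument (as a pen)",
}

def lookup_exact(word):
    # An ordinary hash table returns null (None) for a missing key.
    return definitions.get(word)

def lookup_closest(word):
    # A neural network instead returns something close to what it
    # learned for the nearest key it was trained on.
    matches = difflib.get_close_matches(word, definitions.keys(), n=1)
    return definitions[matches[0]] if matches else None

print(lookup_exact("wrote"))    # None: the exact lookup fails
print(lookup_closest("wrote"))  # the definition stored for "write"
```

The exact lookup fails on “wrote,” while the closest-match lookup falls back to “write,” mirroring the behavior described above.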
The above mapping raises an important point about neural networks. As previously stated, neural networks accept an array of floating-point numbers and return another array. This behavior provokes the question about how to put string, or textual, values into the above neural network. Although a solution exists, dealing with numeric data rather than strings is much easier for the neural network.
In fact, this question reveals one of the most difficult aspects of neural network programming. How do you translate your problem into a fixed-length array of floating-point numbers? In the examples that follow, you will see the complexity of neural networks.
A Simple Example
In computer programming, it is customary to provide a “Hello World” application that simply displays the text “Hello World.” If you have previously read about neural networks, you have no doubt seen examples with the exclusive or (XOR) operator, which is one of the “Hello World” applications of neural network programming. Later in this section, we will describe more complex scenarios than XOR, but it is a great introduction. We shall begin by looking at the XOR operator as though it were a hash table. If you are not familiar with the XOR operator, it works similarly to the AND/OR operators. For an AND to be true, both sides must be true. For an OR to be true, either side must be true. For an XOR to be true, both of the sides must be different from each other. The following truth table represents an XOR:
False XOR False = False
True XOR False = True
False XOR True = True
True XOR True = False
To continue the hash table example, you would represent the above truth table as follows:
[0.0, 0.0] -> [0.0]
[1.0, 0.0] -> [1.0]
[0.0, 1.0] -> [1.0]
[1.0, 1.0] -> [0.0]
These mappings show input and the ideal expected output for the neural network.
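The truth-table-to-array mapping above can be written out directly. This is a sketch of the training data only, assuming no particular framework; the variable names are ours:

```python
# XOR truth table expressed as neural network training data: each input
# is an array of floating-point numbers, as is each ideal output.
xor_inputs = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
xor_ideals = [
    [0.0],
    [1.0],
    [1.0],
    [0.0],
]

# Sanity-check the mapping against Python's ^ (exclusive or) on booleans.
for (a, b), (ideal,) in zip(xor_inputs, xor_ideals):
    assert (bool(a) ^ bool(b)) == bool(ideal)
```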
Training: Supervised and Unsupervised
When you specify the ideal output, you are using supervised training. If you did not provide ideal outputs, you would be using unsupervised training. Supervised training teaches the neural network to produce the ideal output. Unsupervised training usually teaches the neural network to place the input data into a number of groups defined by the output neuron count.
Both supervised and unsupervised training are iterative processes. For supervised training, each training iteration calculates how close the actual output is to the ideal output and expresses this closeness as an error percent. Each iteration modifies the internal weight matrices of the neural network to decrease the error rate to an acceptably low level.
Unsupervised training is also an iterative process. However, calculating the error is not as easy. Because you have no expected output, you cannot measure how far the unsupervised neural network is from your ideal output; there is no ideal output. As a result, you will just iterate for a fixed number of iterations and try to use the network. If the neural network needs more training, the program provides it.
Another important aspect of the above training data is that you can take it in any order. The result of two zeros with XOR applied (0 XOR 0) is going to be 0, regardless of which case you used. This characteristic is not true of all neural networks. For the XOR operator, we would probably use a type of neural network called a feedforward neural network in which the order of the training set does not matter. Later in this book, we will examine recurrent neural networks that do consider the order of the training data. Order is an essential component of a simple recurrent neural network.
Previously, you saw that the simple XOR operator utilized training data. Now we will analyze a situation with more complex training data.
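To make the iterative error measurement concrete, here is a minimal sketch of how the closeness of actual to ideal outputs might be computed each iteration. We use mean squared error, one common choice; the book's own examples may measure error differently, and the numbers below are hypothetical:

```python
def iteration_error(actual, ideal):
    # Mean squared error between the network's actual outputs and the
    # ideal outputs: one common way to express "how close" per iteration.
    assert len(actual) == len(ideal)
    return sum((a - i) ** 2 for a, i in zip(actual, ideal)) / len(actual)

ideal = [0.0, 1.0, 1.0, 0.0]        # the XOR ideal outputs
untrained = [0.5, 0.5, 0.5, 0.5]    # a hypothetical untrained network
trained = [0.05, 0.93, 0.94, 0.04]  # a hypothetical trained network

print(iteration_error(untrained, ideal))  # 0.25
print(iteration_error(trained, ideal))    # roughly 0.003, much closer to 0
```

Each training iteration would adjust the weight matrices so that this error shrinks toward an acceptably low level.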
Miles per Gallon
In general, neural network problems involve a set of data that you use to predict values for later sets of data. These later sets of data result after you’ve already trained your neural network. The power of a neural network is to predict outcomes for entirely new data sets based on knowledge learned from past data sets. Consider a car database that contains the following fields:
Car Weight
Engine Displacement
Cylinder Count
Horse Power
Hybrid or Gasoline
Miles per Gallon
Although we are oversimplifying the data, this example demonstrates how to format data. Assuming you have collected some data for these fields, you should be able to construct a neural network that can predict one field value, based on the other field values. For this example, we will try to predict miles per gallon.
As previously demonstrated, we will need to define this problem in terms of an input array of floating-point numbers mapped to an output array of floating-point numbers. However, the problem has one additional requirement. The numeric range of each of these array elements should be between 0 and 1 or between -1 and 1. Meeting this range requirement is called normalization. Normalization takes real-world data and turns it into a form that the neural network can process.
First, we determine how to normalize the above data. Consider the neural network format. We have six total fields. We want to use five of these fields to predict the sixth. Consequently, the neural network would have five input neurons and one output neuron.
Your network would resemble the following:
Input Neuron 1: Car Weight
Input Neuron 2: Engine Displacement
Input Neuron 3: Cylinder Count
Input Neuron 4: Horse Power
Input Neuron 5: Hybrid or Gasoline
Output Neuron 1: Miles per Gallon
We also need to normalize the data. To accomplish this normalization, we must think of reasonable ranges for each of these values. We will then transform input data into a number between 0 and 1 that represents an actual value’s position within that range. Consider the reasonable ranges for the following values:
Car Weight: 100-5,000 lbs.
Engine Displacement: 0.1 to 10 liters
Cylinder Count: 2-12
Horse Power: 1-1,000
Hybrid or Gasoline: true or false
Miles per Gallon: 1-500
Given today’s cars, these ranges may be on the large end. However, this characteristic will allow minimal restructuring of the neural network in the future. We also want to avoid having too much data at the extreme ends of the range.
To illustrate normalization, we will consider the problem of normalizing a weight of 2,000 pounds. This weight is 1,900 into the range (2,000 - 100). The size of the range is 4,900 pounds (5,000 - 100). The position within the range is about 0.39 (1,900 / 4,900). Therefore, we would feed the value 0.39 to the input neuron in order to represent this weight. This process satisfies the range requirement of 0 to 1 for an input neuron.
The hybrid or regular value is a true/false. To represent this value, we will use 1 for hybrid and 0 for regular. We simply normalize a true/false into these two values.
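The range normalization just described is a one-line formula. A minimal Python sketch, with function names of our own choosing:

```python
def normalize(value, low, high):
    # Map a real-world value into the 0 to 1 range an input neuron expects.
    return (value - low) / (high - low)

def denormalize(norm, low, high):
    # Reverse the mapping, for example to interpret an output neuron.
    return norm * (high - low) + low

# The car-weight example: 2,000 lbs within the range 100 to 5,000 lbs.
n = normalize(2000, 100, 5000)
print(round(n, 2))  # 0.39, i.e. 1,900 / 4,900

# The hybrid/gasoline flag normalizes to 1.0 (hybrid) or 0.0 (regular).
def normalize_flag(flag):
    return 1.0 if flag else 0.0
```

The same formula, with `low = -1` style adjustments, would serve for a -1 to 1 range instead.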
Now that you’ve seen some of the uses for neural networks, it is time to determine how to select the appropriate neural network for your specific problem. The next section provides a roadmap to the various neural networks that are available.
A Neural Network Roadmap
This volume contains a wide array of neural network types. We will present these neural networks along with examples that showcase each neural network in a specific problem domain. Not all neural networks are designed to tackle every problem domain. As a neural network programmer, you need to know which neural network to use for a specific problem.
This section provides a high-level roadmap to the rest of the book that will guide your reading to the areas of the book that align with your interests. Figure 9 shows a grid of the neural network types in this volume and their applicable problem domains:
Figure 9: Neural Network Types & Problem Domains
The problem domains listed above are the following:
Clust – Unsupervised clustering problems
Regis – Regression problems; the network must output a number based on input.
Classif – Classification problems; the network must classify data points into predefined classes.
Predict – The network must predict events in time, such as signals for finance applications.
Robot – Robotics, using sensors and motor control
Vision – Computer vision (CV) problems require the computer to understand images.
Optim – Optimization problems require that the network find the best ordering or set of values to achieve an objective.
The number of check marks gives the applicability of each of the neural network types to that particular problem. If there are no checks, you cannot apply that network type to that problem domain.
All neural networks share some common characteristics. Neurons, weights, activation functions, and layers are the building blocks of neural networks. In the first chapter of this book, we will introduce these concepts and present the basic characteristics that most neural networks share.
Data Sets Used in this Book
This book contains several data sets that allow us to show the application of neural networks to real data. We chose several data sets in order to cover topics such as regression, classification, time-series, and computer vision.
MNIST Handwritten Digits
Several examples use the MNIST handwritten digits data set. The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that programmers use for training various image processing systems. This classic data set is often presented in conjunction with neural networks. This data set is essentially the “Hello World” program of neural networks. You can obtain it from the following URL:
http://yann.lecun.com/exdb/mnist/
The data set is stored in a special binary format, which is also described at the above URL. The example programs provided for this chapter are capable of reading this format.
This data set contains many handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. Labels on both sets indicate what each digit is supposed to be. MNIST is a highly studied data set that programmers frequently use as a benchmark for new machine learning algorithms and techniques. Furthermore, researchers have published many scientific papers about their attempts to achieve the lowest error rate. In one study, the researchers managed to achieve an error rate on the MNIST database of 0.23 percent while using a hierarchical system of convolutional neural networks (Schmidhuber, 2012).
We show a small sampling of the data set in Figure 10:
Figure 10: MNIST Digits
We can use this data set for classification neural networks. The networks learn to look at an image and classify it into the appropriate place among the ten digits. Even though this data set is image-based, you can think of it as a traditional data set. These images are 28 pixels by 28 pixels, resulting in a total of 784 pixels. Despite the impressive images, we begin the book by using regular neural networks that treat each image as the input to a 784-input-neuron neural network. You would use exactly the same type of neural network to handle any classification problem that has a large number of inputs. Such problems are high dimensional. Later in the book, we will see how to use neural networks that were specifically designed for image recognition. These neural networks will perform considerably better on the MNIST digits than the more traditional neural networks.
The MNIST data set is stored in a proprietary binary format that is described at the above URL. We provide a decoder in the book’s examples.
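The binary layout is documented at the MNIST URL above: an image file begins with a big-endian header of four 32-bit integers (the magic number 2051, the image count, the row count, and the column count), followed by one unsigned byte per pixel. Under that assumption, a minimal decoder might look like the following sketch; the function name is ours, not the book's, and the label file is similar but uses magic number 2049 with no row/column fields:

```python
import struct

def read_idx_images(data):
    # Decode the MNIST image-file layout: a big-endian header of magic
    # number (2051), image count, rows, and columns, then one unsigned
    # byte per pixel, image by image.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 2051, "not an MNIST image file"
    size = rows * cols
    pixels = data[16:]
    return [list(pixels[i * size:(i + 1) * size]) for i in range(count)]

# A synthetic one-image, 2x2-pixel file for illustration.
sample = struct.pack(">IIII", 2051, 1, 2, 2) + bytes([0, 255, 128, 64])
print(read_idx_images(sample))  # [[0, 255, 128, 64]]
```

For real MNIST files, each decoded image would be a list of 784 pixel values, ready to feed the 784-input-neuron network described above.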
Iris Data Set
Because AI frequently uses the iris data set (Fisher, 1936), you will see it several times in this book. Sir Ronald Fisher (1936) collected these data as an example of discriminant analysis. This data set has become very popular in machine learning even today. The following URL contains the iris data set:
https://archive.ics.uci.edu/ml/datasets/Iris
The iris data set contains measurements and species information for 150 iris flowers, and the data are essentially represented as a spreadsheet with the following columns or features:
Sepal length
Sepal width
Petal length
Petal width
Iris species
Petals refer to the innermost petals of the iris, and sepals refer to the outermost petals of the iris flower. Even though the data set seems to have a vector of length 5, the species feature must be handled differently than the other four. In other words, vectors typically contain only numbers. The first four features are inherently numerical; the species feature is not.
One of the primary applications of this data set is to create a program that will act as a classifier. That is, it will consider the flower’s features as inputs (sepal length, petal width, etc.) and ultimately determine the species. This classification would be trivial for a complete and known data set, but our goal is to see whether the model can correctly identify the species using data from unknown irises.
A simple numeric encoding would translate the iris species to a single dimension. Instead, we must use higher-dimensional encodings, such as one-of-n or equilateral, so that the species encodings are equidistant from each other. If we are classifying irises, we do not want our encoding process to create any biases.
Thinking of the iris features as dimensions in a higher-dimensional space makes a great deal of sense. Consider the individual samples (the rows in the iris dataset) as points in this search space. Points closer together likely share similarities. Let's take a look at these similarities by studying the following three rows from the iris dataset:
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolour
6.3,3.3,6.0,2.5,Iris-virginica
The first line has 5.1 as the sepal length, 3.5 as the sepal width, 1.4 as the petal length, and 0.2 as the petal width. If we use one-of-n encoding to the range 0 to 1, the above three rows would encode to the following three vectors:
[5.1,3.5,1.4,0.2,1,0,0]
[7.0,3.2,4.7,1.4,0,1,0]
[6.3,3.3,6.0,2.5,0,0,1]
Chapter 4, "Feedforward Neural Networks," will cover one-of-n encoding.
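As a preview of that encoding, the species column can be converted into the extra vector elements shown above with a few lines of Python (the species ordering below is taken from the three sample rows):

```python
# One-of-n encoding for the iris species, in the order used above.
SPECIES = ["Iris-setosa", "Iris-versicolour", "Iris-virginica"]

def one_of_n(species):
    # Produce a vector with a 1 in the position of the species, 0 elsewhere.
    encoded = [0] * len(SPECIES)
    encoded[SPECIES.index(species)] = 1
    return encoded

# The first sample row, with its species encoded as three extra elements.
row = [5.1, 3.5, 1.4, 0.2]
print(row + one_of_n("Iris-setosa"))   # [5.1, 3.5, 1.4, 0.2, 1, 0, 0]
```

The same call with "Iris-virginica" yields the [0, 0, 1] tail of the third vector.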
Auto MPG Data Set
The auto miles per gallon (MPG) dataset is commonly used for regression problems. The dataset contains attributes of several cars. Using these attributes, we can train neural networks to predict the fuel efficiency of the car. The UCI Machine Learning Repository provides this dataset, and you can download it from the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
We took these data from the StatLib library, which is maintained at Carnegie Mellon University. The data were used in the 1983 American Statistical Association Exposition, and no values are missing. Quinlan (1993), the author of the study, used this dataset to describe fuel consumption: "The data concern city-cycle fuel consumption in miles per gallon, to be predicted in terms of three multi-valued discrete and five continuous attributes" (Quinlan, 1993).
The dataset contains the following attributes:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Sunspots Data Set
Sunspots are temporary phenomena on the surface of the sun that appear visibly as dark spots compared to surrounding regions. Intense magnetic activity causes sunspots. Although they occur at temperatures of roughly 3,000-4,500 K (2,727-4,227 °C), the contrast with the surrounding material at about 5,780 K leaves them clearly visible as dark spots. Sunspots appear and disappear with regularity, making them a good dataset for time series prediction.
Figure 11 shows sunspot activity over time:

Figure 11: Sunspots Activity
The sunspot data file contains information similar to the following:

YEAR MON  SSN   DEV
1749   1  58.0  24.1
1749   2  62.6  25.1
1749   3  70.0  26.6
1749   4  55.7  23.6
1749   5  85.0  29.4
1749   6  83.5  29.2
1749   7  94.8  31.1
1749   8  66.3  25.9
1749   9  75.9  27.7
The above data provide the year, month, sunspot count, and standard deviation of the sunspots observed. Many world organizations track sunspots. The following URL contains a table of sunspot readings:
http://solarscience.msfc.nasa.gov/greenwch/spot_num.txt
XOR Operator
The exclusive or (XOR) operator is a Boolean operator. Programmers frequently use the truth table for XOR as an ultra-simple sort of "Hello World" training set for machine learning. We refer to this table as the XOR dataset. This operator is related to the parity operator, which extends XOR to three or more inputs. The two-input XOR operator has the following truth table:
0 XOR 0 = 0
1 XOR 0 = 1
0 XOR 1 = 1
1 XOR 1 = 0
We utilize the XOR operator for cases in which we would like to train or evaluate the neural network by hand.
Kaggle Otto Group Challenge
In this book, we will also utilize the Kaggle Otto Group Challenge dataset. Kaggle is a platform that fosters competition among data scientists on new datasets. We use this dataset to classify products into several groups based on unknown attributes. Additionally, we will employ a deep neural network to tackle this problem. We will also discuss advanced ensemble techniques that you can use to compete in Kaggle. We will describe this dataset in greater detail in Chapter 16.
We will begin this book with an overview of features that are common to most neural networks. These features include neurons, layers, activation functions, and connections. For the remainder of the book, we will expand on these topics as we introduce more neural network architectures.
Chapter 1: Neural Network Basics
Neurons and Layers
Neuron Types
Activation Functions
Logic Gates
This book is about neural networks and how to train, query, structure, and interpret them. We present many neural network architectures as well as the plethora of algorithms that can train these neural networks. Training is the process in which a neural network is adapted to make predictions from data. In this chapter, we will introduce the basic concepts that are most relevant to the neural network types featured in the book.
Deep learning, a relatively new set of training techniques for multilayered neural networks, is also a primary topic. It encompasses several algorithms that can train complex types of neural networks. With the development of deep learning, we now have effective methods to train neural networks with many layers.
This chapter will include a discussion of the commonalities among the different neural networks. Additionally, you will learn how neurons form weighted connections, how these neurons create layers, and how activation functions affect the output of a layer. We begin with neurons and layers.
Neurons and Layers
Most neural network structures use some type of neuron. Many different kinds of neural networks exist, and programmers introduce experimental neural network structures all the time. Consequently, it is not possible to cover every neural network architecture. However, there are some commonalities among neural network implementations. An algorithm that is called a neural network will typically be composed of individual, interconnected units, even though these units may or may not be called neurons. In fact, the name for a neural network processing unit varies among the literature sources. It could be called a node, neuron, or unit.
Figure 1.1 shows the abstract structure of a single artificial neuron:

Figure 1.1: An Artificial Neuron
The artificial neuron receives input from one or more sources that may be other neurons or data fed into the network from a computer program. This input is usually floating-point or binary. Often binary input is encoded to floating-point by representing true or false as 1 or 0. Sometimes the program also depicts the binary input using a bipolar system with true as 1 and false as -1.
An artificial neuron multiplies each of these inputs by a weight. Then it adds these multiplications and passes this sum to an activation function. Some neural networks do not use an activation function. Equation 1.1 summarizes the calculated output of a neuron:
Equation 1.1: Neuron Output

    f(x, w) = φ( Σ_i (x_i * w_i) )
In the above equation, the variables x and w represent the input and weights of the neuron. The variable i corresponds to the number of weights and inputs. You must always have the same number of weights as inputs. Each weight is multiplied by its respective input, and the products of these multiplications are fed into an activation function that is denoted by the Greek letter φ (phi). This process results in a single output from the neuron.
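Equation 1.1 translates directly into code. The sketch below assumes the sigmoid function (introduced later in this chapter) for φ, though any activation function could be substituted:

```python
import math

def sigmoid(x):
    # One common choice for the activation function phi.
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, activation=sigmoid):
    # Multiply each input by its weight, sum the products,
    # and pass the sum through the activation function.
    assert len(inputs) == len(weights)  # same number of weights as inputs
    total = sum(x * w for x, w in zip(inputs, weights))
    return activation(total)

print(neuron_output([0.5, 0.75, 0.2], [0.1, 0.2, 0.3]))
```

Swapping in a different `activation` argument reproduces any of the neuron variants discussed in this chapter.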
Figure 1.1 shows the structure with just one building block. You can chain together many artificial neurons to build an artificial neural network (ANN). Think of the artificial neurons as building blocks for which the input and output circles are the connectors. Figure 1.2 shows an artificial neural network composed of three neurons:

Figure 1.2: Simple Artificial Neural Network (ANN)
The above diagram shows three interconnected neurons. This representation is essentially Figure 1.1, minus a few inputs, repeated three times and then connected. It also has a total of four inputs and a single output. The outputs of neurons N1 and N2 feed N3 to produce the output O. To calculate the output for Figure 1.2, we perform Equation 1.1 three times. The first two times calculate N1 and N2, and the third calculation uses the output of N1 and N2 to calculate N3.
Neural network diagrams do not typically show the level of detail seen in Figure 1.2. To simplify the diagram, we can omit the activation functions and intermediate outputs. This process results in Figure 1.3:

Figure 1.3: Simplified View of ANN
Looking at Figure 1.3, you can see two additional components of neural networks. First, consider the inputs and outputs that are shown as abstract dotted-line circles. The input and output could be parts of a larger neural network. However, the input and output are often a special type of neuron that accepts data from the computer program using the neural network, and the output neurons return a result back to the program. This type of neuron is called an input neuron. We will discuss these neurons in the next section.
Figure 1.3 also shows the neurons arranged in layers. The input neurons are the first layer, the N1 and N2 neurons create the second layer, the third layer contains N3, and the fourth layer has O. While most neural networks arrange neurons into layers, this is not always the case. Stanley (2002) introduced a neural network architecture called NeuroEvolution of Augmenting Topologies (NEAT). NEAT neural networks can have a very jumbled, non-layered architecture.
The neurons that form a layer share several characteristics. First, every neuron in a layer has the same activation function. However, the layers themselves might have different activation functions. Second, layers are fully connected to the next layer. In other words, every neuron in one layer has a connection to every neuron in the next layer. Figure 1.3 is not fully connected. Several layers are missing connections. For example, I1 and N2 do not connect. Figure 1.4 is a new version of Figure 1.3 that is fully connected and has an additional layer.
Figure 1.4: Fully Connected Network
In Figure 1.4, you see a fully connected, multilayered neural network. Networks such as this one will always have an input and output layer. The number of hidden layers determines the name of the network architecture. The network in Figure 1.4 is a two-hidden-layer network. Most networks will have between zero and two hidden layers. Unless you have implemented deep learning strategies, networks with more than two hidden layers are rare.
You might also notice that the arrows always point downward or forward from the input to the output. This type of neural network is called a feedforward neural network. Later in this book, we will see recurrent neural networks that form inverted loops among the neurons.
Types of Neurons
In the last section, we briefly introduced the idea that different types of neurons exist. Now we will explain all the neuron types described in the book. Not every neural network will use every type of neuron. It is also possible for a single neuron to fill the role of several different neuron types.
Input and Output Neurons
Nearly every neural network has input and output neurons. The input neurons accept data from the program for the network. The output neurons provide processed data from the network back to the program. The program will group these input and output neurons into separate layers called the input and output layers. However, for some network structures, the neurons can act as both input and output. The Hopfield neural network, which we will discuss in Chapter 3, "Hopfield & Boltzmann Machines," is an example of a neural network architecture in which neurons are both input and output.
The program normally represents the input to a neural network as an array or vector. The number of elements contained in the vector must be equal to the number of input neurons. For example, a neural network with three input neurons might accept the following input vector:
[0.5,0.75,0.2]
Neural networks typically accept floating-point vectors as their input. Likewise, neural networks will output a vector with length equal to the number of output neurons. The output will often be a single value from a single output neuron. To be consistent, we will represent the output of a single-output-neuron network as a single-element vector.
Notice that input neurons do not have activation functions. As demonstrated by Figure 1.1, input neurons are little more than placeholders. The input is simply weighted and summed. Furthermore, the size of the input and output vectors for the neural network will be the same if the neural network has neurons that are both input and output.
Hidden Neurons
Hidden neurons have two important characteristics. First, hidden neurons only receive input from other neurons, such as input or other hidden neurons. Second, hidden neurons only output to other neurons, such as output or other hidden neurons. Hidden neurons help the neural network understand the input, and they form the output. However, they are not directly connected to the incoming data or to the eventual output. Hidden neurons are often grouped into fully connected hidden layers.
A common question for programmers concerns the number of hidden neurons in a network. Since the answer to this question is complex, more than one section of the book will include a relevant discussion of the number of hidden neurons. Prior to deep learning, it was generally suggested that anything more than a single hidden layer is excessive (Hornik, 1991). Researchers have proven that a single-hidden-layer neural network can function as a universal approximator. In other words, this network should be able to learn to produce (or approximate) any output from any input, as long as it has enough hidden neurons in a single layer.
Another reason why researchers used to scoff at the idea of additional hidden layers is that these layers would impede the training of the neural network. Training refers to the process that determines good weight values. Before researchers introduced deep learning techniques, we simply did not have an efficient way to train a deep network, which is a neural network with a large number of hidden layers. Although a single-hidden-layer neural network can theoretically learn anything, deep learning facilitates a more complex representation of patterns in the data.
Bias Neurons
Programmers add bias neurons to neural networks to help them learn patterns. Bias neurons function like an input neuron that always produces the value of 1. Because the bias neurons have a constant output of 1, they are not connected to the previous layer. The value of 1, which is called the bias activation, can be set to values other than 1. However, 1 is the most common bias activation. Not all neural networks have bias neurons. Figure 1.5 shows a single-hidden-layer neural network with bias neurons:
Figure 1.5: Network with Bias Neurons
The above network contains three bias neurons. Every layer, except for the output layer, contains a single bias neuron. Bias neurons allow the output of an activation function to be shifted. We will see exactly how this shifting occurs later in the chapter when we discuss activation functions.
Context Neurons
Context neurons are used in recurrent neural networks. This type of neuron allows the neural network to maintain state. As a result, a given input may not always produce exactly the same output. This inconsistency is similar to the workings of biological brains. Consider how context factors into your response when you hear a loud horn. If you hear the noise while you are crossing the street, you might startle, stop walking, and look in the direction of the horn. If you hear the horn while you are watching an action-adventure film in a movie theater, you don't respond in the same way. Therefore, prior inputs give you the context for processing the audio input of a horn.
Time series is one application of context neurons. You might need to train a neural network to learn input signals to perform speech recognition or to predict trends in security prices. Context neurons are one way for neural networks to deal with time series data. Figure 1.6 shows how context neurons might be arranged in a neural network:
Figure 1.6: Context Neurons
This neural network has a single input and output neuron. Between the input and output layers are two hidden neurons and two context neurons. Other than the two context neurons, this network is the same as previous networks in the chapter.
Each context neuron holds a value that starts at 0 and always receives a copy of either hidden 1 or hidden 2 from the previous use of the network. The two dashed lines in Figure 1.6 mean that the context neuron is a direct copy with no other weighting. The other lines indicate that the output is weighted by one of the six weight values listed above. Equation 1.1 still calculates the output in the same way. The value of the output neuron would be the sum of all four inputs, multiplied by their weights, and applied to the activation function.
A type of neural network called a simple recurrent neural network (SRN) uses context neurons. Jordan and Elman networks are the two most common types of SRN. Figure 1.6 shows an Elman SRN. Chapter 13, "Time Series and Recurrent Networks," includes a discussion of both types of SRN.
Other Neuron Types
The individual units that comprise a neural network are not always called neurons. Researchers will sometimes refer to these neurons as nodes, units, or summations. In later chapters of the book, we will explore deep learning that utilizes Boltzmann machines to fill the role of neurons. Regardless of the type of unit, neural networks are almost always constructed of weighted connections between these units.
Activation Functions
In neural network programming, activation or transfer functions establish bounds for the output of neurons. Neural networks can use many different activation functions. We will discuss the most common activation functions in this section.
Choosing an activation function for your neural network is an important consideration because it can affect how you must format input data. In this chapter, we will guide you on the selection of an activation function. Chapter 14, "Architecting Neural Networks," will also contain additional details on the selection process.
Linear Activation Function
The most basic activation function is the linear function because it does not change the neuron output at all. Equation 1.2 shows how the program typically implements a linear activation function:
Equation 1.2: Linear Activation Function

    φ(x) = x
As you can observe, this activation function simply returns the value that the neuron inputs passed to it. Figure 1.7 shows the graph for a linear activation function:

Figure 1.7: Linear Activation Function
Regression neural networks, those that learn to provide numeric values, will usually use a linear activation function on their output layer. Classification neural networks, those that determine an appropriate class for their input, will usually utilize a softmax activation function for their output layer.
Step Activation Function
The step or threshold activation function is another simple activation function. Neural networks were originally called perceptrons. McCulloch & Pitts (1943) introduced an early artificial neuron that used a step activation function like the one in Equation 1.3:
Equation 1.3: Step Activation Function

    φ(x) = 1, if x >= 0.5; otherwise φ(x) = 0
Equation 1.3 outputs a value of 1.0 for incoming values of 0.5 or higher and 0 for all other values. Step functions are often called threshold functions because they only return 1 (true) for values that are above the specified threshold, as seen in Figure 1.8:

Figure 1.8: Step Activation Function
Sigmoid Activation Function
The sigmoid or logistic activation function is a very common choice for feedforward neural networks that need to output only positive numbers. Despite its widespread use, the hyperbolic tangent or the rectified linear unit (ReLU) activation function is usually a more suitable choice. We introduce the ReLU activation function later in this chapter. Equation 1.4 shows the sigmoid activation function:
Equation 1.4: Sigmoid Activation Function

    φ(x) = 1 / (1 + e^-x)
Use the sigmoid function to ensure that values stay within a relatively small range, as seen in Figure 1.9:

Figure 1.9: Sigmoid Activation Function
As you can see from the above graph, all values, whether above or below 0, are compressed into the approximate range between 0 and 1.
Hyperbolic Tangent Activation Function
The hyperbolic tangent function is also a very common activation function for neural networks that must output values in the range between -1 and 1. This activation function is simply the hyperbolic tangent (tanh) function, as shown in Equation 1.5:
Equation 1.5: Hyperbolic Tangent Activation Function

    φ(x) = tanh(x)
The graph of the hyperbolic tangent function has a similar shape to the sigmoid activation function, as seen in Figure 1.10:

Figure 1.10: Hyperbolic Tangent Activation Function
The hyperbolic tangent function has several advantages over the sigmoid activation function. These involve the derivatives used in the training of the neural network, and they will be covered in Chapter 6, "Backpropagation Training."
Rectified Linear Units (ReLU)
Introduced in 2000 by Teh & Hinton, the rectified linear unit (ReLU) has seen very rapid adoption over the past few years. Prior to the ReLU activation function, the hyperbolic tangent was generally accepted as the activation function of choice. Most current research now recommends the ReLU due to superior training results. As a result, most neural networks should utilize the ReLU on hidden layers and either softmax or linear on the output layer. Equation 1.6 shows the very simple ReLU function:
Equation 1.6: Rectified Linear Unit (ReLU)

    φ(x) = max(0, x)
We will now examine why ReLU typically performs better than other activation functions for hidden layers. Part of the increased performance is due to the fact that the ReLU activation function is a linear, non-saturating function. Unlike the sigmoid/logistic or the hyperbolic tangent activation functions, the ReLU does not saturate to -1, 0, or 1. A saturating activation function moves towards and eventually attains a value. The hyperbolic tangent function, for example, saturates to -1 as x decreases and to 1 as x increases. Figure 1.11 shows the graph of the ReLU activation function:
Figure 1.11: ReLU Activation Function
Most current research states that the hidden layers of your neural network should use the ReLU activation. The reasons for the superiority of the ReLU over the hyperbolic tangent and sigmoid will be demonstrated in Chapter 6, "Backpropagation Training."
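Each of the activation functions presented so far reduces to a single line of code. A short sketch collecting them for comparison; the final line illustrates the saturation behavior that the ReLU avoids:

```python
import math

def linear(x):
    return x                          # Equation 1.2: output unchanged

def step(x):
    return 1.0 if x >= 0.5 else 0.0   # Equation 1.3: threshold at 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) # Equation 1.4: output in (0, 1)

def tanh(x):
    return math.tanh(x)               # Equation 1.5: output in (-1, 1)

def relu(x):
    return max(0.0, x)                # Equation 1.6: non-saturating for x > 0

# tanh saturates toward 1 for large inputs; ReLU keeps growing linearly.
print(round(tanh(10.0), 6), relu(10.0))   # 1.0 10.0
```

Feeding increasingly large inputs makes the saturation contrast clear: tanh flattens out while ReLU passes the value straight through.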
Softmax Activation Function
The final activation function that we will examine is the softmax activation function. Along with the linear activation function, softmax is usually found in the output layer of a neural network. The softmax function is used on classification neural networks. The neuron that has the highest value claims the input as a member of its class. The softmax activation function is a preferable method because it forces the output of the neural network to represent the probability that the input falls into each of the classes. Without the softmax, the neurons' outputs are simply numeric values, with the highest indicating the winning class.
To see how the softmax activation function is used, we will look at a common neural network classification problem. The iris dataset contains four measurements for 150 different iris flowers. Each of these flowers belongs to one of three species of iris. When you provide the measurements of a flower, the softmax function allows the neural network to give you the probability that these measurements belong to each of the three species. For example, the neural network might tell you that there is an 80% chance that the iris is setosa, a 15% probability that it is virginica, and only a 5% probability of versicolour. Because these are probabilities, they must add up to 100%. There could not be an 80% probability of setosa, a 75% probability of virginica, and a 20% probability of versicolour; this type of result would be nonsensical.
To classify input data into one of three iris species, you will need one output neuron for each of the three species. The output neurons do not inherently specify the probability of each of the three species. Therefore, it is desirable to provide probabilities that sum to 100%. The neural network will tell you the probability of a flower being each of the three species. To get the probability, use the softmax function in Equation 1.7:
Equation 1.7: The Softmax Function

    φ_i = e^(z_i) / Σ_j e^(z_j)
In the above equation, i represents the index of the output neuron (o) being calculated, and j represents the indexes of all neurons in the group/level. The variable z designates the array of output neurons. It's important to note that the softmax activation is calculated differently than the other activation functions in this chapter. When softmax is the activation function, the output of a single neuron is dependent on the other output neurons.
In Equation 1.7, you can observe that the output of the other output neurons is contained in the variable z; none of the other activation functions in this chapter utilize z. Listing 1.1 implements softmax in pseudocode:
Listing 1.1: The Softmax Function
import math

def softmax(neuron_output):
    # Sum the exponentials of every neuron's output.
    exp_sum = 0
    for v in neuron_output:
        exp_sum = exp_sum + math.exp(v)
    # Divide each neuron's exponential by the sum so the results total 1.0.
    proba = []
    for i in range(len(neuron_output)):
        proba.append(math.exp(neuron_output[i]) / exp_sum)
    return proba
To see the softmax function in operation, refer to the following URL:
http://www.heatonresearch.com/aifh/vol3/softmax.html
Consider a trained neural network that classifies data into three categories, such as the three iris species. In this case, you would use one output neuron for each of the target classes. Consider if the neural network were to output the following:
Neuron 1: setosa: 0.9
Neuron 2: versicolour: 0.2
Neuron 3: virginica: 0.4
From the above output, we can clearly see that the neural network considers the data to represent a setosa iris. However, these numbers are not probabilities. The 0.9 value does not represent a 90% likelihood of the data representing a setosa. These values sum to 1.5. In order for them to be treated as probabilities, they must sum to 1.0. The output vector for this neural network is the following:
[0.9,0.2,0.4]
If this vector is provided to the softmax function, the following vector is returned:
[0.47548495534876745,0.2361188410001125,0.28839620365112]
The above three values do sum to 1.0 and can be treated as probabilities. The likelihood of the data representing a setosa iris is 48% because the first value in the vector rounds to 0.48 (48%). You can calculate this value in the following manner:
sum = exp(0.9) + exp(0.2) + exp(0.4) = 5.17283056695839
j0 = exp(0.9) / sum = 0.47548495534876745
j1 = exp(0.2) / sum = 0.2361188410001125
j2 = exp(0.4) / sum = 0.28839620365112
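The arithmetic above can be checked with a few lines of Python:

```python
import math

outputs = [0.9, 0.2, 0.4]
# Sum of exponentials, then each exponential divided by that sum.
exp_sum = sum(math.exp(v) for v in outputs)
proba = [math.exp(v) / exp_sum for v in outputs]

print(round(exp_sum, 11))            # 5.17283056696
print([round(p, 4) for p in proba])  # [0.4755, 0.2361, 0.2884]
```

The resulting values match the hand calculation and sum to exactly 1.0.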
What Role Does Bias Play?
The activation functions seen in the previous section specify the output of a single neuron. Together, the weight and bias of a neuron shape the output of the activation function to produce the desired output. To see how this process occurs, consider Equation 1.8. It represents a single-input sigmoid activation neural network.
Equation 1.8: Single-Input Neural Network

    f(x, w, b) = 1 / (1 + e^-(w*x + b))
The x variable represents the single input to the neural network. The w and b variables specify the weight and bias of the neural network. The above equation is a combination of Equation 1.1, which specifies a neural network, and Equation 1.4, which designates the sigmoid activation function.
The weights of the neuron allow you to adjust the slope or shape of the activation function. Figure 1.12 shows the effect on the output of the sigmoid activation function if the weight is varied:
Figure 1.12: Adjusting Neuron Weight
The above diagram shows several sigmoid curves using the following parameters:
f(x, 0.5, 0.0)
f(x, 1.0, 0.0)
f(x, 1.5, 0.0)
f(x, 2.0, 0.0)
To produce the curves, we did not use bias, which is evident in the third parameter of 0 in each case. Using four weight values yields four different sigmoid curves in Figure 1.12. No matter the weight, we always get the same value of 0.5 when x is 0 because all of the curves hit the same point when x is 0. We might need the neural network to produce values other than 0.5 when the input is near 0.
Bias does shift the sigmoid curve, which allows values other than 0.5 when x is near 0. Figure 1.13 shows the effect of using a weight of 1.0 with several different biases:
Figure 1.13: Adjusting Neuron Bias
The above diagram shows several sigmoid curves with the following parameters:
f(x, 1.0, 1.0)
f(x, 1.0, 0.5)
f(x, 1.0, 1.5)
f(x, 1.0, 2.0)
We used a weight of 1.0 for these curves in all cases. When we utilized several different biases, the sigmoid curves shifted to the left or right. Because all the curves merge together at the top right or bottom left, it is not a complete shift.
When we put bias and weights together, they produce a curve that creates the necessary output from a neuron. The above curves are the output from only one neuron. In a complete network, the output from many different neurons will combine to produce complex output patterns.
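Both effects are easy to check numerically with Equation 1.8, using the same f(x, w, b) parameter order as the curves above:

```python
import math

def f(x, w, b):
    # Single-input sigmoid neuron: weight w scales x, bias b shifts the curve.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# With no bias, every weight produces 0.5 at x = 0.
print([f(0.0, w, 0.0) for w in (0.5, 1.0, 1.5, 2.0)])   # [0.5, 0.5, 0.5, 0.5]

# Adding a bias shifts the curve, so the output at x = 0 moves away from 0.5.
print(round(f(0.0, 1.0, 1.0), 4))   # 0.7311
```

The first line reproduces the common crossing point of Figure 1.12; the second shows the shift that bias provides in Figure 1.13.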
Logic with Neural Networks
As a computer programmer, you are probably familiar with logical programming. You can use the programming operators AND, OR, and NOT to govern how a program makes decisions. These logical operators often define the actual meaning of the weights and biases in a neural network. Consider the following truth table:
0 AND 0 = 0
1 AND 0 = 0
0 AND 1 = 0
1 AND 1 = 1
0 OR 0 = 0
1 OR 0 = 1
0 OR 1 = 1
1 OR 1 = 1
NOT 0 = 1
NOT 1 = 0
The truth table specifies that if both sides of the AND operator are true, the final output is also true. In all other cases, the result of the AND is false. This definition fits the English word "and" quite well. If you want a house with a nice view AND a large backyard, then both requirements must be fulfilled for you to choose a house. If you want a house that has a nice view OR a large backyard, then only one needs to be present.
These logical statements can become more complex. Consider if you want a house that has a nice view and a large backyard. However, you would also be satisfied with a house that has a small backyard yet is near a park. You can express this idea in the following way:
([nice view] AND [large yard]) OR ((NOT [large yard]) AND [park])
You can express the previous statement with the following logical operators:

([nice view] ∧ [large yard]) ∨ (¬[large yard] ∧ [park])

In the above statement, the OR looks like a letter "v," the AND looks like an upside-down "v," and the NOT looks like half of a box.
We can use neural networks to represent the basic logical operators of AND, OR, and NOT, as seen in Figure 1.14:

Figure 1.14: Basic Logic Operators
The above diagram shows the weights and bias weight for each of the three fundamental logical operators. You can easily calculate the output for any of these operators using Equation 1.1. Consider the AND operator with two true (1) inputs:
(1 * 1) + (1 * 1) + (-1.5) = 0.5
We are using a step activation function. Because 0.5 is greater than or equal to 0.5, the output is 1, or true. We can also evaluate the expression where one of the inputs is false:
(1 * 1) + (0 * 1) + (-1.5) = -0.5
Because of the step activation function, this output is 0, or false.
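The AND calculation above generalizes to all three operators. In the sketch below, the AND weights and bias come from the calculation just shown; the OR and NOT values are the same ones used in the XOR walkthrough later in this chapter (weights of 1, 1 with a bias of -0.5 for OR, and a weight of -1 with a bias of 0.5 for NOT):

```python
def step(x):
    # Step activation from Equation 1.3: fires at the 0.5 threshold.
    return 1 if x >= 0.5 else 0

def and_gate(a, b):
    return step(a * 1 + b * 1 - 1.5)

def or_gate(a, b):
    return step(a * 1 + b * 1 - 0.5)

def not_gate(a):
    return step(a * -1 + 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, "AND", b, "=", and_gate(a, b))
        print(a, "OR", b, "=", or_gate(a, b))
print("NOT 0 =", not_gate(0))
print("NOT 1 =", not_gate(1))
```

Running this loop reproduces the full truth table shown earlier in this section.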
We can build more complex logical structures from these neurons. Consider the exclusive or (XOR) operator that has the following truth table:
0 XOR 0 = 0
1 XOR 0 = 1
0 XOR 1 = 1
1 XOR 1 = 0
The XOR operator specifies that one, but not both, of the inputs can be true. For example, one of the two cars will win the race, but not both of them will win. The XOR operator can be written with the basic AND, OR, and NOT operators as follows:
Equation 1.9: The Exclusive Or Operator

    p ⊕ q = (p ∨ q) ∧ ¬(p ∧ q)
The plus with a circle is the symbol for the XOR operator, and p and q are the two inputs to evaluate. The above expression makes sense if you think of the XOR operator as meaning p or q, but not both p and q. Figure 1.15 shows a neural network that can represent an XOR operator:
Figure 1.15: XOR Neural Network
Calculating the above neural network would require several steps. First, you must calculate the values for every node that is directly connected to the inputs. In the case of the above neural network, there are two such nodes. We will show an example of calculating the XOR with the inputs [0,1]. We begin by calculating the two topmost, unlabeled (hidden) nodes:
(0 * 1) + (1 * 1) - 0.5 = 0.5 = True
(0 * 1) + (1 * 1) - 1.5 = -0.5 = False
Next we calculate the lower, unlabeled (hidden) node:
(0 * -1) + 0.5 = 0.5 = True
Finally, we calculate O1:
(1 * 1) + (1 * 1) - 1.5 = 0.5 = True
As you can see, you can manually wire the connections in a neural network to produce the desired output. However, manually creating neural networks is very tedious. The rest of the book will include several algorithms that can automatically determine the weight and bias values.
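The hand calculation above can be automated. The sketch below wires the network from the same weights used in the walkthrough: an OR node (bias -0.5), an AND node (bias -1.5), a NOT node (weight -1, bias 0.5), and a final AND node combining them:

```python
def step(x):
    # Step activation from Equation 1.3: fires at the 0.5 threshold.
    return 1 if x >= 0.5 else 0

def xor_network(i1, i2):
    or_node = step(i1 * 1 + i2 * 1 - 0.5)    # topmost hidden node (OR)
    and_node = step(i1 * 1 + i2 * 1 - 1.5)   # second hidden node (AND)
    not_node = step(and_node * -1 + 0.5)     # lower hidden node (NOT of AND)
    # Output O1: (i1 OR i2) AND NOT (i1 AND i2), as in Equation 1.9.
    return step(or_node * 1 + not_node * 1 - 1.5)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, "XOR", i2, "=", xor_network(i1, i2))
```

The loop reproduces the XOR truth table, confirming that the hand-wired weights implement Equation 1.9.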
Chapter Summary
In this chapter, we showed that a neural network is comprised of neurons, layers, and activation functions. Fundamentally, the neurons in a neural network might be input, hidden, or output in nature. Input and output neurons pass information into and out of the neural network. Hidden neurons occur between the input and output neurons and help process information.
Activation functions scale the output of a neuron. We also introduced several activation functions. The two most common activation functions are the sigmoid and hyperbolic tangent. The sigmoid function is appropriate for networks in which only positive output is needed. The hyperbolic tangent function supports both positive and negative output.
A neural network can build logical statements, and we demonstrated the weights needed to generate the AND, OR, and NOT operators. Using these three basic operators, you can build more complex logical expressions. We presented an example of building an XOR operator.
Now that we've seen the basic structure of a neural network, we will explore in the next two chapters several classic neural networks so that you can use this abstract structure. Classic neural network structures include the self-organizing map, the Hopfield neural network, and the Boltzmann machine. These classical neural networks form the foundation of other architectures that we present in the book.
Chapter 2: Self-Organizing Maps
Self-Organizing Maps
Neighborhood Functions
Unsupervised Training
Dimensionality
Now that you have explored the abstract nature of a neural network introduced in the previous chapter, you will learn about several classic neural network types. This chapter covers one of the earliest types of neural networks that is still useful today. Because neurons can be connected in various ways, many different neural network architectures exist and build on the fundamental ideas from Chapter 1, "Neural Network Basics." We begin our examination of classic neural networks with the self-organizing map (SOM).
The SOM is used to classify input data into one of several groups. Training data is provided to the SOM, as well as the number of groups into which you wish to classify these data. While training, the SOM will arrange these data into groups. Data that have the most similar characteristics will be grouped together. This process is very similar to clustering algorithms, such as k-means. However, unlike k-means, which only groups an initial set of data, the SOM can continue classifying new data beyond the initial dataset that was used for training. Unlike most of the neural networks in this book, the SOM is unsupervised; you do not tell it what groups you expect the training data to fall into. The SOM simply figures out the groups itself, based on your training data, and then it classifies any future data into similar groups. Future classification is performed using what the SOM learned from the training data.
Self-Organizing Maps
Kohonen (1988) introduced the self-organizing map (SOM), a neural network consisting of an input layer and an output layer. The two-layer SOM, also known as the Kohonen neural network, functions by mapping data from the input layer to the output layer. As the program presents patterns to the input layer, the output neuron that contains the weights most similar to the input is considered the winner. The program calculates this similarity by comparing the Euclidean distance between the input pattern and the weight vector of each output neuron. The shortest Euclidean distance wins. Calculating Euclidean distance is the focus of the next section.
Unlike the feedforward neural network discussed in Chapter 1, there are no bias values in the SOM. It just has weights from the input layer to the output layer. Additionally, it uses only a linear activation function. Figure 2.1 shows the SOM:
Figure 2.1: Self-Organizing Map
The SOM pictured above shows how the program maps three input neurons to nine output neurons arranged in a three-by-three grid. The output neurons of the SOM are often arranged into a grid, cube, or other higher-dimensional construct. Because the ordering of the output neurons in most neural networks typically conveys no meaning at all, this arrangement is very different. For example, the close proximity of output neurons #1 and #2 in most neural networks is not significant. However, for the SOM, the closeness of one output neuron to another is important. Computer vision applications make use of the closeness of neurons to identify images more accurately. Convolutional neural networks (CNNs), which will be examined in Chapter 10, "Convolutional Neural Networks," group neurons into overlapping regions based on how close the input neurons are to each other. When recognizing images, it is very important to consider which pixels are near each other. The program recognizes patterns such as edges, solid regions, and lines by looking at pixels near each other.
Common structures for the output neurons of SOMs include the following:
One-Dimensional: Output neurons are arranged in a line.
Two-Dimensional: Output neurons are arranged in a grid.
Three-Dimensional: Output neurons are arranged in a cube.
We will now see how to structure a simple SOM that learns to recognize colors that are given as RGB vectors. The individual red, green, and blue values can range between -1 and +1. Black, or the absence of color, designates -1, and +1 expresses the full intensity of red, green or blue. These three color components comprise the neural network input.
The output will be a 2,500-neuron grid arranged into 50 rows by 50 columns. This SOM will organize similar colors near each other in this output grid. Figure 2.2 shows this output:
Figure 2.2: The Output Grid
Although the above figure may not be as clear in the black and white editions of this book as it is in the color e-book editions, you can see similar colors grouped near each other. A single, color-based SOM is a very simple example that allows you to visualize the grouping capabilities of the SOM.
How are SOMs trained? The training process will update the weight matrix, which is 3 by 2,500. To start, the program initializes the weight matrix to random values. Then it randomly chooses 15 training colors.
The training will progress through a series of iterations. Unlike other neural network types, the training for SOM networks involves a fixed number of iterations. To train the color-based SOM, we will use 1,000 iterations.
Each iteration will choose one random color sample from the training set, a collection of RGB color vectors that each consist of three numbers. Likewise, the weights between each of the 2,500 output neurons and the three input neurons form a vector of three numbers. As training progresses, the program will calculate the Euclidean distance between each weight vector and the current training pattern. The Euclidean distance determines the difference between two vectors of the same size. In this case, both vectors are three numbers that represent an RGB color. We compare the color from the training data to the three weights of each neuron. Equation 2.1 shows the Euclidean distance calculation:
Equation 2.1: The Euclidean Distance between Training Data and Output Neuron
In the above equation, the variable p represents the training pattern. The variable w corresponds to the weight vector. By squaring the differences between each vector component and taking the square root of the resulting sum, we calculate the Euclidean distance. This calculation measures the difference between each weight vector and the input training pattern.
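The distance calculation and the search for the winning output neuron can be sketched in a few lines of Python. This is a minimal illustration with our own function names, not the book's example code:

```python
import math

def euclidean_distance(p, w):
    # Square the difference of each component, sum, and take the square root.
    return math.sqrt(sum((pi - wi) ** 2 for pi, wi in zip(p, w)))

def find_winner(weights, pattern):
    # Return the index of the output neuron whose weight vector has the
    # shortest Euclidean distance to the pattern.
    distances = [euclidean_distance(pattern, w) for w in weights]
    return distances.index(min(distances))
```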
The program calculates the Euclidean distance for every output neuron, and the one with the shortest distance is called the best matching unit (BMU). This neuron will learn the most from the current training pattern. The neighbors of the BMU will learn less. To perform this training, the program loops over every neuron and determines the extent to which it should be trained. Neurons that are closer to the BMU will receive more training. Equation 2.2 can make this determination:
Equation 2.2: SOM Learning Function
In the above equation, the variable t, also known as the iteration number, represents time. The purpose of the equation is to calculate the resulting weight vector Wv(t+1). You will determine the next weight by adding to the current weight, which is Wv(t). The end goal is to calculate how different the current weight is from the input vector, and this is done by the term D(t) - Wv(t). Training the SOM is the process of making a neuron's weights more similar to the training element. We do not want to simply assign the training element to the output neuron's weights, making them identical. Rather, we calculate the difference between the training element and the neuron's weights and scale this difference by multiplying it by two ratios. The first ratio, represented by θ (theta), is the neighborhood function. The second ratio, represented by α (alpha), is a monotonically decreasing learning rate. In other words, as the training progresses, the learning rate falls and never rises.
The neighborhood function considers how close each output neuron is to the BMU. For neurons that are nearer, the neighborhood function will return a value that approaches 1. For distant neighbors, the neighborhood function will approach 0. This range between 0 and 1 controls how near and far neighbors are trained. Nearer neighbors will receive more of the training adjustment to their weights. In the next section, we will analyze how the neighborhood function determines the training adjustments. In addition to the neighborhood function, the learning rate also scales how much the program will adjust the output neuron.
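Putting the learning function together with the neighborhood function and learning rate gives a single training step. The sketch below assumes theta is a neighborhood function of a neuron index and the BMU index; the names are hypothetical, not from the book's code:

```python
def som_train_step(weights, pattern, bmu, theta, alpha):
    # Wv(t+1) = Wv(t) + theta(v, bmu) * alpha * (D(t) - Wv(t))
    for v in range(len(weights)):
        influence = theta(v, bmu) * alpha
        for i in range(len(weights[v])):
            # Move each weight a fraction of the way toward the pattern.
            weights[v][i] += influence * (pattern[i] - weights[v][i])
    return weights
```

With the neighborhood fixed at 1 and a learning rate of 0.5, a single step moves a weight vector halfway toward the pattern.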
Understanding Neighborhood Functions
The neighborhood function determines the degree to which each output neuron should receive a training adjustment from the current training pattern. The function usually returns a value of 1 for the BMU. This value indicates that the BMU should receive the most training. Those neurons farther from the BMU will receive less training. The neighborhood function determines this weighting.
If the output neurons are arranged in only one dimension, you should use a simple one-dimensional neighborhood function, which will treat the output as one long array of numbers. For instance, a one-dimensional network might have 100 output neurons that form a long, single-dimensional array of 100 values.
A two-dimensional SOM might take these same 100 values and represent them as a grid, perhaps of 10 rows and 10 columns. The actual structure remains the same; the neural network has 100 output neurons. The only difference is the neighborhood function. The first would utilize a one-dimensional neighborhood function; the second would use a two-dimensional neighborhood function. The function must consider this additional dimension and factor it into the distance returned.
It is also possible to have three, four, and even more dimensions for the neighborhood function. Typically, neighborhood functions are expressed in vector form so that the number of dimensions does not matter. To represent the dimensions, the Euclidean norm (represented by two vertical bars) of all inputs is taken, as seen in Equation 2.3:
Equation 2.3: Euclidean Norm
For the above equation, the variable p represents the dimensional inputs. The variable w represents the weights. A single dimension has only a single value for p. Calculating the Euclidean norm for [2-0] is simply √((2-0)²), which is 2.
Calculating the Euclidean norm for [2-0, 3-0] is only slightly more complex: √((2-0)² + (3-0)²) = √13, or approximately 3.61.
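Both norm calculations can be checked with a short helper; this is a sketch, not library code:

```python
import math

def euclidean_norm(v):
    # ||v||: the square root of the sum of the squared components.
    return math.sqrt(sum(x * x for x in v))
```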
The most popular choice for SOMs is the two-dimensional neighborhood function. One-dimensional neighborhood functions are also common. However, neighborhood functions with three or more dimensions are more unusual. Choosing the number of dimensions really comes down to the programmer deciding how many ways an output neuron can be close to another. This decision should not be taken lightly because each additional dimension significantly affects the amount of memory and processing power needed. This additional processing is why most programmers choose two or three dimensions for the SOM application.
It can be difficult to understand why you might have more than three dimensions. The following analogy illustrates the limitations of three dimensions. While at the grocery store, John noticed a package of dried apples. As he turned his head to the left or right, traveling in the first dimension, he saw other brands of dried apples. If he looked up or down, traveling in the second dimension, he saw other types of dried fruit. The third dimension, depth, simply gives him more of exactly the same dried apples. He reached behind the front item and found additional stock. However, there is no fourth dimension, which could have been useful to allow fresh apples to be located near the dried apples. Because the supermarket only had three dimensions, this type of link is not possible. Programmers do not have this limitation, and they must decide if the extra processing time is necessary for the benefits of additional dimensions.
The Gaussian function is a popular choice for a neighborhood function. Equation 2.4 uses the Euclidean norm to calculate the Gaussian function for any number of dimensions:
Equation 2.4: The Vector Form of the Gaussian Function
The variable x represents the input to the Gaussian function, c represents the center of the Gaussian function, and w represents the widths. The variables x, w and c are all vectors with multiple dimensions. Figure 2.3 shows the graph of the one-dimensional Gaussian function:
Figure 2.3: A Single-Dimensional Gaussian Function
This figure illustrates why the Gaussian function is a popular choice for a neighborhood function. Programmers frequently use the Gaussian function to show the normal distribution, or bell curve. If the current output neuron is the BMU, then its distance (x-axis) will be 0. As a result, the training percent (y-axis) is 1.0 (100%). As the distance increases either positively or negatively, the training percentage decreases. Once the distance is large enough, the training percent approaches 0.
If the input vector to the Gaussian function has two dimensions, the graph appears as Figure 2.4:
Figure 2.4: A Two-Dimensional Gaussian Function
How does the algorithm use Gaussian constants with a neural network? The center (c) of a neighborhood function is always 0, which centers the function on the origin. If the algorithm moved the center from the origin, a neuron other than the BMU would receive the full learning. It is unlikely you would ever want to move the center from the origin. For a multi-dimensional Gaussian, set all centers to 0 in order to position the curve at the origin.
The only remaining Gaussian parameter is the width. You should set this parameter to something slightly less than the entire width of the grid or array. As training progresses, the width gradually decreases. Just like the learning rate, the width should decrease monotonically.
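A one-dimensional version of the Gaussian neighborhood, paired with a monotonically decreasing width, might look like the following sketch. The exponential decay schedule is one common choice, not a requirement from the book:

```python
import math

def gaussian_neighborhood(distance, width):
    # One-dimensional Gaussian centered on the origin:
    # exp(-(distance^2) / (2 * width^2)).
    return math.exp(-(distance ** 2) / (2 * width ** 2))

def width_at(t, total_iterations, initial_width):
    # The width decays monotonically as training progresses.
    return initial_width * math.exp(-t / total_iterations)
```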
Mexican Hat Neighborhood Function
Though it is the most popular, the Gaussian function is not the only neighborhood function available. The Ricker wave, or Mexican hat function, is another popular neighborhood function. Just like the Gaussian neighborhood function, the vector length of the x dimensions is the basis for the Mexican hat function, as seen in Equation 2.5:
Equation 2.5: Vector Form of Mexican Hat Function
Much the same as the Gaussian, the programmer can use the Mexican hat function in one or more dimensions. Figure 2.5 shows the Mexican hat function with one dimension:
Figure 2.5: A One-Dimensional Mexican Hat Function
You must be aware that the Mexican hat function penalizes neighbors that are between 2 and 4, or -2 and -4, units from the center. If your model seeks to penalize near misses, the Mexican hat function is a good choice.
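A one-dimensional Ricker wave can be sketched as follows. The normalization constant is omitted so that the function returns 1 at the center, matching its use as a training percentage; with the assumed sigma of 1, the negative lobe begins at distance 1, while the book's figure uses a wider setting:

```python
import math

def mexican_hat(distance, sigma=1.0):
    # Ricker wave (un-normalized): (1 - d^2/sigma^2) * exp(-d^2 / (2*sigma^2)).
    r = (distance / sigma) ** 2
    return (1.0 - r) * math.exp(-r / 2.0)
```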
You can also use the Mexican hat function in two or more dimensions. Figure 2.6 shows a two-dimensional Mexican hat function:
Figure 2.6: A Two-Dimensional Mexican Hat Function
Just like the one-dimensional version, the above Mexican hat penalizes near misses. The only difference is that the two-dimensional Mexican hat function utilizes a two-dimensional vector, which looks more like a Mexican sombrero than the one-dimensional variant. Although it is possible to use more than two dimensions, these variants are hard to visualize because we perceive space in three dimensions.
Calculating SOM Error
Supervised training typically reports an error measurement that decreases as training progresses. Unsupervised models, such as the SOM network, cannot directly calculate an error because there is no expected output. However, an estimation of the error can be calculated for the SOM (Masters, 1993).
We define the error as the longest Euclidean distance of all BMUs in a training iteration. Each training set element has its own BMU. As learning progresses, this longest distance should decrease, so the value also indicates the success of the SOM training.
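This error estimate follows directly from the definition; the helper names below are ours:

```python
import math

def som_error(weights, training_set):
    # The longest of the best-matching-unit distances across the
    # training set (Masters, 1993).
    def dist(p, w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, w)))
    return max(min(dist(p, w) for w in weights) for p in training_set)
```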
Chapter Summary
In the first two chapters, we explained several classic neural network types. Since Pitts (1943) introduced the neural network, many different neural network types have been invented. We have focused primarily on the classic neural network types that still have relevance and that establish the foundation for other architectures that we will cover in later chapters of the book.
This chapter focused on the self-organizing map (SOM), an unsupervised neural network type that can cluster data. The SOM has an input neuron count equal to the number of attributes for the data to be clustered. An output neuron count specifies the number of groups into which the data should be clustered. The SOM neural network is trained in an unsupervised manner. In other words, only the data points are provided to the neural network; the expected outputs are not provided. The SOM network learns to cluster the data points, especially the data points similar to the ones with which it trained.
In the next chapter, we will examine two more classic neural network types: the Hopfield neural network and the Boltzmann machine. These neural network types are similar in that they both use an energy function during their training process. The energy function measures the amount of energy in the network. As training progresses, the energy should decrease as the network learns.
Chapter 3: Hopfield & Boltzmann Machines
Hopfield Networks
Energy Functions
Hebbian Learning
Associative Memory
Optimization
Boltzmann Machines
This chapter will introduce the Hopfield network as well as the Boltzmann machine. Though neither of these classic neural networks is used extensively in modern AI applications, both are foundational to more modern algorithms. The Boltzmann machine forms the foundation of the deep belief neural network (DBNN), which is one of the fundamental algorithms of deep learning. Hopfield networks are a very simple type of neural network that utilizes many of the same features that the more complex feedforward neural networks employ.
Hopfield Neural Networks
The Hopfield neural network (Hopfield, 1982) is perhaps the simplest type of neural network because it is a fully connected, single-layer, auto-associative network. In other words, it has a single layer in which each neuron is connected to every other neuron. Additionally, the term auto-associative means that the neural network will return the entire pattern if it recognizes a pattern. As a result, the network will fill in the gaps of incomplete or distorted patterns.
Figure 3.1 shows a Hopfield neural network with just four neurons. While a four-neuron network is handy because it is small enough to visualize, it can recognize only a few patterns.
Figure 3.1: A Hopfield Neural Network with 12 Connections
Because every neuron in a Hopfield neural network is connected to every other neuron, you might assume that a four-neuron network would contain a four-by-four matrix, or 16 connections. However, 16 connections would require that every neuron be connected to itself as well as to every other neuron. In a Hopfield neural network, 16 connections do not occur; the actual number of connections is 12.
These connections are weighted and stored in a matrix. A four-by-four matrix would store the network pictured above. In fact, the diagonal of this matrix would contain 0s because there are no self-connections. All neural network examples in this book will use some form of matrix to store their weights.
Each neuron in a Hopfield network has a state of either true (1) or false (-1). These states are initially the input to the Hopfield network and ultimately become the output of the network. To determine whether a Hopfield neuron's state is -1 or 1, use Equation 3.1:
Equation 3.1: Hopfield Neuron State
The above equation calculates the state (s) of neuron i. The state of a given neuron greatly depends on the states of the other neurons. The equation multiplies and sums the weight (w) and state (s) of the other neurons (j). Essentially, the state of the current neuron (i) is +1 if this sum is greater than the threshold (θ, theta). Otherwise, it is -1. The threshold value is usually 0.
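This state calculation translates to a few lines of Python; the sketch below uses our own names:

```python
def hopfield_state(weights, state, i, threshold=0.0):
    # Neuron i becomes +1 when the weighted sum of the other neurons'
    # states exceeds the threshold; otherwise it becomes -1.
    total = sum(weights[i][j] * state[j]
                for j in range(len(state)) if j != i)
    return 1 if total > threshold else -1
```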
Because the state of a single neuron depends on the states of the remaining neurons, the order in which the equation calculates the neurons is very important. Programmers frequently employ the following two strategies to calculate the states for all neurons in a Hopfield network:
Asynchronous: This strategy updates only one neuron at a time. It picks this neuron at random.
Synchronous: It updates all neurons at the same time. This method is less realistic since biological organisms lack a global clock that synchronizes the neurons.
You should typically run a Hopfield network until the values of all neurons stabilize. Despite the fact that each neuron is dependent on the states of the others, the network will usually converge to a stable state.
It is important to have some indication of how close the network is to converging to a stable state. You can calculate an energy value for Hopfield networks. This value decreases as the Hopfield network moves to a more stable state. To evaluate the stability of the network, you can use the energy function. Equation 3.2 shows the energy calculation function:
Equation 3.2: Hopfield Energy Function
Boltzmann machines, discussed later in the chapter, also utilize this energy function. Boltzmann machines share many similarities with Hopfield neural networks. When the threshold is 0, the second term of Equation 3.2 drops out. Listing 3.1 contains the code to implement Equation 3.2:
Listing 3.1: Hopfield Energy
def energy(weights, state, threshold):
    neuron_count = len(state)
    # First term
    a = 0
    for i in range(neuron_count):
        for j in range(neuron_count):
            a = a + weights[i][j] * state[i] * state[j]
    a = a * -0.5
    # Second term
    b = 0
    for i in range(neuron_count):
        b = b + state[i] * threshold[i]
    # Result
    return a + b
Training a Hopfield Network
You can train Hopfield networks to arrange their weights in a way that allows the network to converge to desired patterns, also known as the training set.
These desired training patterns are a list of patterns with a Boolean value for each of the neurons that comprise the Hopfield network. The following data might represent a four-pattern training set for a Hopfield network with eight neurons:
11000000
00001100
10000001
00011000
The above data are completely arbitrary; however, they do represent actual patterns to train the Hopfield network. Once trained, a pattern similar to the one listed below should find equilibrium with a pattern close to the training set:
11100000
Therefore, the state of the Hopfield machine should change to the following pattern:
11000000
You can train Hopfield networks with either Hebbian (Hopfield, 1982) or Storkey (Storkey, 1999) learning. The Hebbian process for learning is biologically plausible, and it is often expressed as, “cells that fire together, wire together.” In other words, two neurons will become connected if they frequently react to the same input stimulus. Equation 3.3 summarizes this behavior mathematically:
Equation 3.3: Hopfield Hebbian Learning
The constant n represents the number of training set elements, and ε (epsilon) represents the individual training elements. The weight matrix will be square and will contain rows and columns equal to the number of neurons. The diagonal will always be 0 because a neuron is not connected to itself. The other locations in the matrix will contain values specifying how often two values in the training pattern are either +1 or -1. Listing 3.2 contains the code to implement Equation 3.3:
Listing 3.2: Hopfield Hebbian Training
def add_pattern(weights, pattern, n):
    neuron_count = len(pattern)
    for i in range(neuron_count):
        for j in range(neuron_count):
            if i == j:
                weights[i][j] = 0
            else:
                weights[i][j] = weights[i][j] \
                    + ((pattern[i] * pattern[j]) / n)
We apply the add_pattern method to add each of the training elements. The parameter weights specifies the weight matrix, and the parameter pattern specifies each individual training element. The variable n designates the number of elements in the training set.
It is possible that the equation and the code are not sufficient to show how the weights are generated from input patterns. To help you visualize this process, we provide an online Javascript application at the following URL:
http://www.heatonresearch.com/aifh/vol3/hopfield.html
Consider the following data to train a Hopfield network:
[1,0,0,1]
[0,1,1,0]
The previous data should produce a weight matrix like Figure 3.2:
Figure 3.2: Hopfield Matrix
To calculate the above matrix, divide 1 by the number of training set elements. The result is 1/2, or 0.5. The value 0.5 is placed into every row and column that has a 1 in the training set. For example, the first training element has a 1 in neurons #0 and #3, resulting in a 0.5 being added to row 0, column 3 and row 3, column 0. The same process continues for the other training set element.
Another common training technique for Hopfield neural networks is the Storkey training algorithm. Hopfield neural networks trained with Storkey have a greater capacity for patterns than the Hebbian method just described. The Storkey algorithm is more complex than the Hebbian algorithm.
The first step in the Storkey algorithm is to calculate a value called the local field. Equation 3.4 calculates this value:
Equation 3.4: Hopfield Storkey Local Field
We calculate the local field value (h) for each weight element (i & j). Just as before, we use the weights (w) and training set elements (ε, epsilon). Listing 3.3 provides the code to calculate the local field:
Listing 3.3: Calculate Storkey Local Field
def calculate_local_field(weights, i, j, pattern):
    total = 0
    for k in range(len(pattern)):
        if k != i:
            total = total + weights[i][k] * pattern[k]
    return total
Equation 3.5 uses the local field value to calculate the needed weight change (ΔW):
Equation 3.5: Hopfield Storkey Learning
Listing 3.4 calculates the values of the weight deltas:
Listing 3.4: Storkey Learning
def add_pattern(weights, pattern):
    n = len(pattern)
    sum_matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t1 = (pattern[i] * pattern[j]) / n
            t2 = (pattern[i]
                  * calculate_local_field(weights, j, i, pattern)) / n
            t3 = (pattern[j]
                  * calculate_local_field(weights, i, j, pattern)) / n
            d = t1 - t2 - t3
            sum_matrix[i][j] = sum_matrix[i][j] + d
    return sum_matrix
Once you calculate the weight deltas, you can add them to the existing weight matrix. If there is no existing weight matrix, simply allow the delta weight matrix to become the weight matrix.
Hopfield-Tank Networks
In the last section, you learned that Hopfield networks can recall patterns. They can also optimize problems such as the traveling salesman problem (TSP). Hopfield and Tank (1984) introduced a special variant, the Hopfield-Tank network, to find solutions to optimization problems.
The structure of a Hopfield-Tank network is somewhat different than a standard Hopfield network. The neurons in a regular Hopfield neural network can hold only two discrete values. However, a Hopfield-Tank neuron can have any number in the range 0 to 1. Standard Hopfield networks possess discrete values; Hopfield-Tank networks keep continuous values over a range. Another important difference is that Hopfield-Tank networks use sigmoid activation functions.
To utilize a Hopfield-Tank network, you must create a specialized energy function to express the parameters of each problem to solve. However, producing such an energy function can be a time-consuming task. Hopfield & Tank (2008) demonstrated how to construct an energy function for the traveling salesman problem (TSP). Other optimization methods, such as simulated annealing and Nelder-Mead, do not require the creation of a complex energy function. These general-purpose optimization algorithms typically perform better than the older Hopfield-Tank optimization algorithms.
Because other algorithms are typically better choices for optimization, this book does not cover the optimization Hopfield-Tank network. Nelder-Mead and simulated annealing were demonstrated in Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms. Chapter 6, “Backpropagation Training,” will have a review of stochastic gradient descent (SGD), which is one of the best training algorithms for feedforward neural networks.
Boltzmann Machines
Hinton & Sejnowski (1985) first introduced Boltzmann machines, but this neural network type has not enjoyed widespread use until recently. A special type of Boltzmann machine, the restricted Boltzmann machine (RBM), is one of the foundational technologies of deep learning and the deep belief neural network (DBNN). In this chapter, we will introduce classic Boltzmann machines. Chapter 9, “Deep Learning,” will include deep learning and the restricted Boltzmann machine.
A Boltzmann machine is essentially a fully connected, two-layer neural network. We refer to these layers as the visible and hidden layers. The visible layer is analogous to the input layer in feedforward neural networks. Despite the fact that a Boltzmann machine has a hidden layer, it functions more as an output layer. This difference in the meaning of hidden layer is often a source of confusion between Boltzmann machines and feedforward neural networks. The Boltzmann machine has no hidden layer between the input and output layers. Figure 3.3 shows the very simple structure of a Boltzmann machine:
Figure 3.3: Boltzmann Machine
The above Boltzmann machine has three hidden neurons and four visible neurons. A Boltzmann machine is fully connected because every neuron has a connection to every other neuron. However, no neuron is connected to itself. This connectivity is what differentiates a Boltzmann machine from a restricted Boltzmann machine (RBM), as seen in Figure 3.4:
Figure 3.4: Restricted Boltzmann Machine (RBM)
The above RBM is not fully connected. All hidden neurons are connected to each visible neuron. However, there are no connections among the hidden neurons, nor are there connections among the visible neurons.
Like the Hopfield neural network, a Boltzmann machine's neurons acquire only binary states, either 0 or 1. While there is some research on continuous Boltzmann machines capable of assigning decimal numbers to the neurons, nearly all research on the Boltzmann machine centers on binary units. Therefore, this book will not include information on continuous Boltzmann machines.
Boltzmann machines are also called a generative model. In other words, a Boltzmann machine does not generate constant output. The values presented to the visible neurons of a Boltzmann machine, when considered with the weights, specify a probability that the hidden neurons will assume a value of 1, as opposed to 0.
Although Boltzmann machines and Hopfield neural networks have some characteristics in common, there are several important differences:
Hopfield networks suffer from recognizing certain false patterns.
Boltzmann machines can store a greater capacity of patterns than Hopfield networks.
Hopfield networks require the input patterns to be uncorrelated.
Boltzmann machines can be stacked to form layers.
Boltzmann Machine Probability
When the program queries the value of one of the Boltzmann machine's hidden neurons, it will randomly produce a 0 or 1. Equation 3.6 obtains the calculated probability that the neuron takes the value 1:
Equation 3.6: Probability of Neuron Being One (on)
The above equation will calculate a number between 0 and 1 that represents a probability. For example, if the value 0.75 were generated, the neuron would return a 1 in 75% of the cases. Once it calculates the probability, the program can produce the output by generating a random number between 0 and 1 and returning 1 if the random number is below the probability.
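The probability and the sampling step can be sketched as follows. The logistic form 1 / (1 + exp(-ΔE/T)) is the standard Boltzmann probability; the function names are our own:

```python
import math
import random

def probability_on(delta_e, temperature):
    # Probability that the neuron takes the value 1, given the energy
    # delta and the system temperature.
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

def sample_neuron(delta_e, temperature, rng=random.random):
    # Draw a uniform random number; the neuron is on (1) when the draw
    # falls below the calculated probability.
    return 1 if rng() < probability_on(delta_e, temperature) else 0
```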
The above equation returns the probability of neuron i being on; this probability is calculated from the energy delta (ΔE) at i. The equation also uses the value T, which represents the temperature of the system. The value θ (theta) is the neuron's bias value.
The change in energy is calculated using Equation 3.7:
Equation 3.7: Calculating the Energy Change for a Neuron
This value is the energy difference between 1 (on) and 0 (off) for neuron i. It is calculated using θ (theta), which represents the bias.
Although the values of the individual neurons are stochastic (random), they will typically fall into equilibrium. To reach this equilibrium, you can repeatedly calculate the network. Each time, a unit is chosen while Equation 3.6 sets its state. After running for an adequate period of time at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy.
In other words, the log probabilities of global states become linear in their energies. This relationship is true when the machine is at thermal equilibrium, which means that the probability distribution of global states has converged. If we start running the network from a high temperature and gradually decrease it until we reach a thermal equilibrium at a low temperature, then we may converge to a distribution where the energy level fluctuates around the global minimum. We call this process simulated annealing.
Applying the Boltzmann Machine
Most research around Boltzmann machines has moved to the restricted Boltzmann machine (RBM) that we will explain in Chapter 9, “Deep Learning.” In this section, we will focus on the older, unrestricted form of the Boltzmann machine, which has been applied to both optimization and recognition problems. We will demonstrate an example of each type, beginning with an optimization problem.
Traveling Salesman Problem
The traveling salesman problem (TSP) is a classic computer science problem that is difficult to solve with traditional programming techniques. Artificial intelligence can be applied to find potential solutions to the TSP. The program must determine the order in which to visit a fixed set of cities that minimizes the total distance covered. The traveling salesman is called a combinatorial problem. If you are already familiar with the TSP, or you have read about it in a previous volume in this series, you can skip this section.
The TSP involves determining the shortest route for a traveling salesman who must visit a certain number of cities. Although he can begin and end in any city, he may visit each city only once. The TSP has several variants, some of which allow multiple visits to cities or assign different values to cities. The TSP in this chapter simply seeks the shortest possible route to visit each city one time. Figure 3.5 shows the TSP problem used here, as well as a potential shortest route:
Figure 3.5: The Traveling Salesman
Finding the shortest route may seem like an easy task for a normal iterative program. However, as the number of cities increases, the number of possible combinations increases drastically. If the problem has one or two cities, only one or two routes are possible. If it includes three cities, the possible routes increase to six. The following list shows how quickly the number of paths grows:
1 city has 1 path
2 cities have 2 paths
3 cities have 6 paths
4 cities have 24 paths
5 cities have 120 paths
6 cities have 720 paths
7 cities have 5,040 paths
8 cities have 40,320 paths
9 cities have 362,880 paths
10 cities have 3,628,800 paths
11 cities have 39,916,800 paths
12 cities have 479,001,600 paths
13 cities have 6,227,020,800 paths
...
50 cities have 3.041*10^64 paths
In the above table, the formula to calculate the total paths is the factorial of the number of cities, n, computed with the factorial operator (!). The factorial of some arbitrary value n is given by n * (n - 1) * (n - 2) * ... * 3 * 2 * 1. These values become incredibly large when a program must do a brute-force search. The traveling salesman problem is an example of a non-deterministic polynomial time (NP) hard problem. Informally, NP-hard is defined as any problem that lacks an efficient way to find a correct solution. The TSP fits this definition for more than 10 cities. You can find a formal definition of NP-hard in Computers and Intractability: A Guide to the Theory of NP-Completeness (Garey, 1979).
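The table above is simply the factorial function, and a couple of lines confirm its entries:

```python
import math

def route_count(cities):
    # n! = n * (n - 1) * ... * 2 * 1 possible orderings of n cities.
    return math.factorial(cities)
```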
Dynamic programming is another common approach to the traveling salesman problem, as seen in the xkcd.com comic in Figure 3.6:
Figure 3.6: The Traveling Salesman (from xkcd.com)
Although this book does not include a full discussion of dynamic programming, understanding its essential function is valuable. Dynamic programming breaks a large problem, such as the TSP, into smaller problems. You can reuse work for many of the smaller problems, thereby decreasing the number of iterations required by a brute-force solution.
Unlike brute-force solutions and dynamic programming, a genetic algorithm is not guaranteed to find the best solution. Although it will find a good solution, the score might not be the best. A genetic algorithm can produce an acceptable solution for the 50-city problem in a matter of minutes.
Optimization Problems
To use the Boltzmann machine for an optimization problem, it is necessary to represent a TSP solution in such a way that it fits onto the binary neurons of the Boltzmann machine. Hopfield (1984) devised an encoding for the TSP that both Boltzmann and Hopfield neural networks commonly use to represent this combinatorial problem.
The algorithm arranges the neurons of the Hopfield or Boltzmann machine on a square grid with the number of rows and columns equal to the number of cities. Each column represents a city, and each row corresponds to a segment in the journey. The number of segments in the journey is equal to the number of cities, resulting in a square grid. Each row in the matrix should have exactly one column with a value of 1. This value designates the destination city for each of the trip segments. Consider the city path shown in Figure 3.7:
Figure 3.7: Four Cities to Visit
Because the problem includes four cities, the solution requires a four-by-four grid. The first city visited is City #0. Therefore, the program marks 1 in the first column of the first row. Likewise, visiting City #3 second produces a 1 in the final column of the second row. Figure 3.8 shows the complete path:
Figure 3.8: Encoding of Four Cities
Of course, Boltzmann machines do not arrange neurons in a grid. To represent the above path as a vector of values for the neurons, the rows are simply placed sequentially. That is, the matrix is flattened in a row-wise manner, resulting in the following vector:
[1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0]
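The row-wise flattening can be sketched as a helper that converts a visiting order into the neuron vector; the tour [0, 3, 1, 2], as implied by the vector above, reproduces it exactly:

```python
def encode_tour(tour):
    # Segment k contributes one row of the grid, with a single 1 in the
    # column of the city visited at that step; the rows are laid out
    # sequentially in the output vector.
    n = len(tour)
    vector = [0] * (n * n)
    for segment, city in enumerate(tour):
        vector[segment * n + city] = 1
    return vector
```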
To create a Boltzmann machine that can provide a solution to the TSP, the program must align the weights and biases in such a way that allows the states of the Boltzmann machine neurons to stabilize at a point that minimizes the total distance between cities. Keep in mind that the above grid can also find itself in many invalid states. Therefore, a valid grid must have the following:
A single 1 value per row.
A single 1 value per column.
As a result, the program needs to construct the weights so that the Boltzmann machine will not reach equilibrium in an invalid state. Listing 3.5 shows the pseudocode that will generate this weight matrix:
Listing 3.5: Boltzmann Weights for TSP
gamma = 7
# Source
for source_tour in range(NUM_CITIES):
    for source_city in range(NUM_CITIES):
        source_index = source_tour * NUM_CITIES + source_city
        # Target
        for target_tour in range(NUM_CITIES):
            for target_city in range(NUM_CITIES):
                target_index = target_tour * NUM_CITIES + target_city
                # Calculate the weight
                weight = 0
                # Diagonal weight is 0
                if source_index != target_index:
                    # Determine the next and previous segments in the tour.
                    # Wrap between 0 and the last segment.
                    prev_target_tour = (target_tour - 1) % NUM_CITIES
                    next_target_tour = (target_tour + 1) % NUM_CITIES
                    # If same tour segment or city, then -gamma
                    if (source_tour == target_tour) \
                            or (source_city == target_city):
                        weight = -gamma
                    # If next or previous segment, -distance
                    elif (source_tour == prev_target_tour) \
                            or (source_tour == next_target_tour):
                        weight = -distance(source_city, target_city)
                    # Otherwise 0
                set_weight(source_index, target_index, weight)
        # All biases are -gamma/2
        set_bias(source_index, -gamma / 2)
Figure 3.9 displays part of the created weight matrix for four cities:
Figure 3.9: Boltzmann Machine Weights for TSP (4 cities)
Depending on your viewing device, you might have difficulty reading the above grid. Therefore, you can generate it for any number of cities with the Javascript utility at the following URL:
http://www.heatonresearch.com/aifh/vol3/boltzmann_tsp_grid.html
Essentially, the weights have the following specifications:
Matrix diagonal is assigned to 0. Shown as “\” in Figure 3.9.
Same source and target position, set to -γ (gamma). Shown as -g in Figure 3.9.
Same source and target city, set to -γ (gamma). Shown as -g in Figure 3.9.
Source and target next/previous cities, set to -distance. Shown as d(x,y) in Figure 3.9.
Otherwise, set to 0.
The matrix is symmetric: the weight from neuron i to neuron j equals the weight from neuron j to neuron i.
Boltzmann Machine Training
The previous section showed the use of hard-coded weights to construct a Boltzmann machine that was capable of finding solutions to the TSP. The program constructed these weights through its knowledge of the problem. Manually setting the weights is a necessary and difficult step for applying Boltzmann machines to optimization problems. However, this book will not include information about constructing weight matrices for general optimization problems because Nelder-Mead and simulated annealing are more often used as general-purpose algorithms.
Chapter Summary
In this chapter, we explained several classic neural network types. Since McCulloch and Pitts (1943) introduced the neural network, many different neural network types have been invented. We have focused primarily on the classic neural network types that still have relevance and that establish the foundation for other architectures that we will cover in later chapters of the book.
The self-organizing map (SOM) is an unsupervised neural network type that can cluster data. The SOM has an input neuron count equal to the number of attributes for the data to be clustered. An output neuron count specifies the number of groups into which the data should be clustered.
The Hopfield neural network is a simple neural network type that can recognize patterns and optimize problems. You must create a special energy function for each type of optimization problem that requires the Hopfield neural network. Because of this quality, programmers often choose algorithms like Nelder-Mead or simulated annealing instead of the Hopfield neural network for optimization problems.
The Boltzmann machine is a neural network architecture that shares many characteristics with the Hopfield neural network. However, unlike the Hopfield network, Boltzmann machines can be stacked. This stacking ability allows the Boltzmann machine to play a central role in the implementation of the deep belief neural network (DBNN), the basis of deep learning.
In the next chapter, we will examine the feedforward neural network, which remains one of the most popular neural network types. The chapter will focus on classic feedforward neural networks that use sigmoid and hyperbolic tangent activation functions. New training algorithms, layer types, activation functions and other innovations allow the classic feedforward neural network to be used with deep learning.
Chapter 4: Feedforward Neural Networks
Classification
Regression
Network Layers
Normalization
In this chapter, we shall examine one of the most common neural network architectures, the feedforward neural network. Because of its versatility, the feedforward neural network architecture is very popular. Therefore, we will explore how to train it and how it processes a pattern.
The term feedforward describes how this neural network processes and recalls patterns. In a feedforward neural network, each layer of the neural network contains connections to the next layer. For example, these connections extend forward from the input to the hidden layer, but no connections move backward. This arrangement differs from the Hopfield neural network featured in the previous chapter. The Hopfield neural network was fully connected, and its connections were both forward and backward. We will analyze the structure of a feedforward neural network and the way it recalls a pattern later in the chapter.
We can train feedforward neural networks with a variety of techniques from the broad category of backpropagation algorithms, a form of supervised training that we will discuss in greater detail in the next chapter. In this chapter, we will focus on applying optimization algorithms to train the weights of a neural network. If you need more information about optimization algorithms, Volumes 1 and 2 of Artificial Intelligence for Humans contain sections on this subject. Although we can employ several optimization algorithms to train the weights, we will primarily direct our attention to simulated annealing.
Optimization algorithms adjust a vector of numbers to achieve a good score from an objective function. The objective function scores the neural network based on how closely its output matches the expected output. This score allows any optimization algorithm to train neural networks.
A feedforward neural network is similar to the types of neural networks that we have already examined. Just like other types of neural networks, the feedforward neural network begins with an input layer that may connect to a hidden layer or to the output layer. If it connects to a hidden layer, the hidden layer can subsequently connect to another hidden layer or to the output layer. Any number of hidden layers can exist.
Feedforward Neural Network Structure
In Chapter 1, "Neural Network Basics," we discussed that neural networks could have multiple hidden layers and analyzed the purposes of these layers. In this chapter, we will focus more on the structure of the input and output neurons, beginning with the structure of the output layer. The type of problem dictates the structure of the output layer. A classification neural network will have an output neuron for each class, whereas a regression neural network will have one output neuron.
Single-Output Neural Networks for Regression
Though feedforward neural networks can have more than one output neuron, we will begin by looking at a single-output neuron network in a regression problem. A regression network is capable of predicting a single numeric value. Figure 4.1 illustrates a single-output feedforward neural network:
Figure 4.1: Single-Output Feedforward Network
This neural network will output a single numeric value. We can use this type of neural network in the following ways:
- Regression – Compute a number based on the inputs. (e.g., How many miles per gallon (MPG) will a specific type of car achieve?)
- Binary Classification – Decide between two options, based on the inputs. (e.g., Given its characteristics, is a tumor cancerous?)
We provide a regression example for this chapter that utilizes data about various car models and predicts the miles per gallon that the car will achieve. You can find this data set at the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
A small sampling of this data is shown here:
mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
18,8,307,130,3504,12,70,1,"chevrolet chevelle malibu"
15,8,350,165,3693,11,70,1,"buick skylark 320"
18,8,318,150,3436,11,70,1,"plymouth satellite"
16,8,304,150,3433,12,70,1,"amc rebel sst"
For a regression problem, the neural network would use columns such as cylinders, displacement, horsepower, and weight to predict the MPG. These values are all fields in the above listing that specify qualities of each car. In this case, the target is MPG; however, we could also utilize MPG, cylinders, horsepower, weight, and acceleration to predict displacement.
To make the neural network perform regression on multiple values, you might apply multiple output neurons. For example, cylinders, displacement, and horsepower can predict both MPG and weight. Although a multi-output neural network is capable of performing regression on two variables, we don't recommend this technique. You will usually achieve better results with separate neural networks for each regression outcome that you are trying to predict.
Calculating the Output
In Chapter 1, "Neural Network Basics," we explored how to calculate the individual neurons that comprise a neural network. As a brief review, the output of an individual neuron is simply the weighted sum of its inputs and a bias. This summation is passed to an activation function. Equation 4.1 summarizes the calculated output of a neuron:
Equation 4.1: Neuron Output
The neuron multiplies the input vector (x) by the weights (w) and passes the result into an activation function (φ, phi). The bias value is the last value in the weight vector (w), and it is added by concatenating a 1 value to the input. For example, consider a neuron that has two inputs and a bias. If the inputs were 0.1 and 0.2, the input vector would appear as follows:
[0.1, 0.2, 1.0]
In this example, we append the value 1.0 to support the bias weight. We can then calculate the output with the following weight vector:
[0.01, 0.02, 0.3]
The values 0.01 and 0.02 are the weights for the two inputs to the neuron. The value 0.3 is the bias. The weighted sum is calculated as follows:
(0.1 * 0.01) + (0.2 * 0.02) + (1.0 * 0.3) = 0.305
The value 0.305 is then passed to an activation function.
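The calculation above can be sketched in a few lines. The sigmoid activation used here is an assumption for illustration; activation functions are covered separately:

```python
import math

def neuron_output(inputs, weights):
    """Append 1.0 to the inputs for the bias weight, take the
    weighted sum, and pass it through a sigmoid activation."""
    augmented = list(inputs) + [1.0]
    weighted_sum = sum(x * w for x, w in zip(augmented, weights))
    return weighted_sum, 1.0 / (1.0 + math.exp(-weighted_sum))

# The example from the text: inputs 0.1 and 0.2, weights 0.01 and
# 0.02, and a bias weight of 0.3.
weighted_sum, activated = neuron_output([0.1, 0.2], [0.01, 0.02, 0.3])
print(weighted_sum)  # -> approximately 0.305
```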
Calculating an entire neural network is essentially a matter of following this same procedure for each neuron in the network. This process allows you to work your way from the input neurons to the output. You can implement this process by creating objects for each connection in the network or by aligning these connection values into matrices.
Object-oriented programming allows you to define an object for each neuron and its weights. This approach can produce very readable code, but it has two significant problems:
- The weights are stored across many objects.
- Performance suffers because it takes many function calls and memory accesses to piece all the weights together.
It is valuable to create the weights of the neural network as a single vector. A variety of different optimization algorithms can adjust a vector to perfect a scoring function. Artificial Intelligence for Humans, Volumes 1 & 2 include a discussion of these optimization functions. Later in this chapter, we will see how simulated annealing optimizes the weight vector for the neural network.
To construct a weight vector, we will first look at a network that has the following attributes:
- Input Layer: 2 neurons, 1 bias
- Hidden Layer: 2 neurons, 1 bias
- Output Layer: 1 neuron
These characteristics give this network a total of 7 neurons.
You can number these neurons for the vector in the following manner:
Neuron 0: Output 1
Neuron 1: Hidden 1
Neuron 2: Hidden 2
Neuron 3: Bias 2 (set to 1, usually)
Neuron 4: Input 1
Neuron 5: Input 2
Neuron 6: Bias 1 (set to 1, usually)
Graphically, you can see the network as Figure 4.2:
Figure 4.2: Simple Neural Network
You can create several additional vectors to define the structure of the network. These vectors hold index values to allow the quick navigation of the weight vector. These vectors are listed here:
layerFeedCounts: [1, 2, 2]
layerCounts: [1, 3, 3]
layerIndex: [0, 1, 4]
layerOutput: [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
weightIndex: [0, 3, 9]
Each vector stores the values for the output layer first and works its way to the input layer. The layerFeedCounts vector holds the count of non-bias neurons in each layer, while the layerCounts vector holds the total neuron count of each layer, including bias neurons. The layerOutput vector holds the current value of each neuron. Initially, all neurons start with 0.0 except for the bias neurons, which start at 1.0. The layerIndex vector holds indexes to where each layer begins in the layerOutput vector. The weightIndex holds indexes to the location of each layer in the weight vector.
The weights are stored in their own vector and structured as follows:
Weight 0: H1 -> O1
Weight 1: H2 -> O1
Weight 2: B2 -> O1
Weight 3: I1 -> H1
Weight 4: I2 -> H1
Weight 5: B1 -> H1
Weight 6: I1 -> H2
Weight 7: I2 -> H2
Weight 8: B1 -> H2
Once the vectors have been arranged, calculating the output of the neural network is relatively easy. Listing 4.1 can accomplish this calculation:
Listing 4.1: Calculate Feedforward Output
def compute(net, input):
    sourceIndex = len(net.layerOutput) \
        - net.layerCounts[len(net.layerCounts) - 1]
    # Copy the input into the layerOutput vector
    array_copy(input, 0, net.layerOutput, sourceIndex, net.inputCount)
    # Calculate each layer, working from the input toward the output
    for i in reversed(range(1, len(net.layerIndex))):
        compute_layer(net, i)
    # Create result
    result = vector(net.outputCount)
    array_copy(net.layerOutput, 0, result, 0, net.outputCount)
    return result

def compute_layer(net, currentLayer):
    inputIndex = net.layerIndex[currentLayer]
    outputIndex = net.layerIndex[currentLayer - 1]
    inputSize = net.layerCounts[currentLayer]
    outputSize = net.layerFeedCounts[currentLayer - 1]
    index = net.weightIndex[currentLayer - 1]
    limit_x = outputIndex + outputSize
    limit_y = inputIndex + inputSize
    # Weight values
    for x in range(outputIndex, limit_x):
        sum = 0
        for y in range(inputIndex, limit_y):
            sum += net.weights[index] * net.layerOutput[y]
            index = index + 1
        net.layerSums[x] = sum
        net.layerOutput[x] = sum
    net.activationFunctions[currentLayer - 1] \
        .activation_function(
            net.layerOutput, outputIndex, outputSize)
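A self-contained version of Listing 4.1, specialized to the 2-2-1 network of Figure 4.2, may make the vector layout clearer. The nine weight values and the sigmoid activation are arbitrary assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Structure vectors for the 2-2-1 network, output layer first.
layer_feed_counts = [1, 2, 2]
layer_counts = [1, 3, 3]
layer_index = [0, 1, 4]
weight_index = [0, 3, 9]
# Neuron values; bias neurons (indexes 3 and 6) are fixed at 1.0.
layer_output = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
# Nine weights in the order H1->O1 ... B1->H2; values are arbitrary.
weights = [0.1, -0.2, 0.3, 0.4, 0.5, -0.6, 0.7, -0.8, 0.9]

def compute(inputs):
    # Copy the inputs into the input-layer slots (indexes 4 and 5).
    layer_output[4:6] = inputs
    # Work from the input layer toward the output layer.
    for layer in reversed(range(1, len(layer_index))):
        in_start = layer_index[layer]
        out_start = layer_index[layer - 1]
        in_size = layer_counts[layer]
        out_size = layer_feed_counts[layer - 1]
        index = weight_index[layer - 1]
        for x in range(out_start, out_start + out_size):
            total = 0.0
            for y in range(in_start, in_start + in_size):
                total += weights[index] * layer_output[y]
                index += 1
            layer_output[x] = sigmoid(total)
    return layer_output[0]

print(compute([0.1, 0.2]))  # a single value between 0 and 1
```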
Initializing Weights
The weights of a neural network determine the output for the neural network. The process of training can adjust these weights so the neural network produces useful output. Most neural network training algorithms begin by initializing the weights to a random state. Training then progresses through a series of iterations that continuously improve the weights to produce better output.
The random weights of a neural network impact how well that neural network can be trained. If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons. If you add a new layer, and the network's performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:
- How consistently does this algorithm provide good weights?
- How much of an advantage do the weights of the algorithm provide?
One of the most common, yet least effective, approaches to weight initialization is to set the weights to random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If you want to ensure that you get the same set of random weights each time, you should use a seed. The seed specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000.
Not all seeds are created equal. One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others. In fact, the weights can be so bad that training is impossible. If you find that you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
Because weight initialization is a problem, there has been considerable research around it. Over the years we have studied this research and added six different weight initialization routines to the Encog project. From our research, the Xavier weight initialization algorithm, introduced by Glorot & Bengio (2010), produces good weights with reasonable consistency. This relatively simple algorithm uses normally distributed random numbers.
To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate. In fact, normally distributed random numbers are centered on a mean (μ, mu) that is typically 0. If 0 is the center (mean), then you will get an equal number of random numbers above and below 0. The next question is how far these random numbers will venture from 0. In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer. However, the reality is that you will more likely see random numbers that are between 0 and three standard deviations from the center.
The standard deviation σ (sigma) parameter specifies the size of this standard deviation. For example, if you specified a standard deviation of 10, then you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected. Figure 4.3 shows the normal distribution:
Figure 4.3: The Normal Distribution
The above figure illustrates that the center, which in this case is 0, has a probability density of about 0.4. Additionally, the probability decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviation is, you are able to control the range of random numbers that you will receive.
Most programming languages have the capability of generating normally distributed random numbers. In general, the Box-Muller algorithm is the basis for this functionality. The examples in this volume will either use the built-in normal random number generator or the Box-Muller algorithm to transform regular, uniformly distributed random numbers into a normal distribution. Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms contains an explanation of the Box-Muller algorithm, but you do not necessarily need to understand it in order to grasp the ideas in this book.
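For languages without a built-in normal generator, the transform can be sketched as follows. This is a minimal form of Box-Muller that discards the second value the transform could produce:

```python
import math
import random

def box_muller(mu=0.0, sigma=1.0):
    """One normally distributed random number built from two
    uniformly distributed random numbers."""
    u1 = random.random()
    while u1 == 0.0:  # guard against log(0)
        u1 = random.random()
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z

random.seed(1234)  # seed so the run is repeatable
samples = [box_muller(0.0, 10.0) for _ in range(100000)]
mean = sum(samples) / len(samples)
# With sigma = 10, nearly all samples fall between -30 and +30.
inside = sum(1 for s in samples if -30.0 <= s <= 30.0) / len(samples)
```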
The Xavier weight initialization sets all of the weights to normally distributed random numbers. These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights. Specifically, Equation 4.2 can determine the standard deviation:
Equation 4.2: Standard Deviation for Xavier Algorithm
The above equation shows how to obtain the variance for all of the weights. The square root of the variance is the standard deviation. Most random number generators accept a standard deviation rather than a variance. As a result, you usually need to take the square root of the above equation. Figure 4.4 shows how one layer might be initialized:
Figure 4.4: Xavier Initialization of a Layer
This process is completed for each layer in the neural network.
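As a sketch, assuming Equation 4.2 gives the variance as 2 divided by the sum of the layer's input and output counts, Xavier initialization reduces to a few lines:

```python
import math
import random

def xavier_weights(n_in, n_out, rng=random):
    """Initialize one layer of weights with the Xavier scheme:
    zero-mean normal random numbers whose variance is
    2 / (n_in + n_out)."""
    sigma = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, sigma) for _ in range(n_out)]
            for _ in range(n_in)]

# Initialize a hypothetical 13-input, 20-neuron hidden layer.
layer = xavier_weights(13, 20)
```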
Radial-Basis Function Networks
Radial-basis function (RBF) networks are a type of feedforward neural network introduced by Broomhead and Lowe (1988). These networks can be used for both classification and regression. Though they can solve a variety of problems, RBF networks seem to be losing popularity. By their very definition, RBF networks cannot be used in conjunction with deep learning.
The RBF network utilizes a parameter vector, a model that specifies weights and coefficients, in order to allow the input to generate the correct output. By adjusting a random parameter vector, the RBF network produces output consistent with the iris data set. The process of adjusting the parameter vector to produce the desired output is called training, and many different methods exist for training an RBF network. The parameter vector also represents the network's long-term memory.
In the next section, we will briefly review RBFs and describe the exact makeup of these vectors.
Radial-Basis Functions
Because many AI algorithms utilize radial-basis functions, they are a very important concept to understand. A radial-basis function is symmetric with respect to its center, which is usually somewhere along the x-axis. The RBF reaches its maximum value, or peak, at the center. A typical setting for the peak in RBF networks is 1, while the center varies according to the problem.
RBFs can have many dimensions. Regardless of the number of dimensions in the vector passed to the RBF, its output will always be a single scalar value.
RBFs are quite common in AI. We will start with the most prevalent, the Gaussian function. Figure 4.5 shows a graph of a 1D Gaussian function centered at 0:
Figure 4.5: Gaussian Function
You might recognize the above curve as a normal distribution or a bell curve, which is a radial-basis function. RBFs, such as the Gaussian function, can selectively scale numeric values. Consider Figure 4.5 above. If you applied this function to scale numeric values, the result would have maximum intensity at the center. As you moved from the center, the intensity would diminish in either the positive or negative direction.
Before we can look at the equation for the Gaussian RBF, we must consider how to process multiple dimensions. RBFs accept multi-dimensional input and return a single value by calculating the distance between the input and the center vector. This distance is called r. The RBF center and the input to the RBF must always have the same number of dimensions for the calculation to occur. Once we calculate r, we can determine the individual RBF. All of the RBFs use this calculated r.
Equation 4.3 shows how to calculate r:
Equation 4.3: Calculating r
The double vertical bars that you see in the above equation signify that the function describes a distance, or a norm. In certain cases, these distances can vary; however, RBFs typically utilize Euclidean distance. As a result, the examples that we provide in this book always apply the Euclidean distance. Therefore, r is simply the Euclidean distance between the center and the x vector. In each of the RBFs in this section, we will use this value r. Equation 4.4 shows the equation for a Gaussian RBF:
Equation 4.4: Gaussian RBF
Once you've calculated r, determining the RBF is fairly easy. The Greek letter φ, which you see at the left of the equation, always represents the RBF. The constant e in Equation 4.4 represents Euler's number, or the natural base, and is approximately 2.71828.
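Equations 4.3 and 4.4 can be combined into a short sketch; the form exp(-r²) used here is one common way to write the Gaussian RBF:

```python
import math

def euclidean_r(x, center):
    """r: the Euclidean distance between the input vector and the
    RBF center (Equation 4.3)."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))

def gaussian_rbf(x, center):
    """Gaussian RBF, phi(r) = exp(-r^2); the peak value of 1
    occurs when the input equals the center."""
    r = euclidean_r(x, center)
    return math.exp(-r * r)

print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))  # -> 1.0 (at the center)
```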
Radial-Basis Function Networks
RBF networks provide a weighted summation of one or more radial-basis functions; each of these functions receives the weighted input attributes in order to predict the output. Consider the RBF network as a long equation that contains the parameter vector. Equation 4.5 shows the equation needed to calculate the output of this network:
Equation 4.5: The RBF Network
Note that the double vertical bars in the above equation signify that you must calculate the distance. Because these symbols do not specify which distance algorithm to use, you can select the algorithm. In the above equation, x is the input vector of attributes; c is the vector center of the RBF; p is the chosen RBF (Gaussian, for example); a is the vector coefficient (or weight) for each RBF; and b specifies the vector coefficient to weight the input attributes.
In our example, we will apply an RBF network to the iris data set. Figure 4.6 provides a graphic representation of this application:
Figure 4.6: The RBF Network for the Iris Data
The above network contains four inputs (the length and width of petals and sepals) that indicate the features that describe each iris species. The above diagram assumes that we are using one-of-n encoding for the three different iris species. Using equilateral encoding for only two outputs is also possible. To keep things simple, we will use one-of-n and arbitrarily choose three RBFs. Even though additional RBFs allow the model to learn more complex data sets, they require more time to process.
Arrows represent all coefficients from the equation. In Equation 4.5, b represents the arrows between the input attributes and the RBFs. Similarly, a represents the arrows between the RBFs and the summation. Notice also the bias box, which is a synthetic function that always returns a value of 1. Because the bias function's output is constant, it does not require inputs. The weights from the bias to the summation specify the y-intercept for the equation. In short, bias is not always bad. This case demonstrates that bias is an important component of the RBF network. Bias nodes are also very common in neural networks.
Because multiple summations exist, you can see the development of a classification problem. The highest summation specifies the predicted class. For a regression problem, the model would instead output a single numeric value.
You will also notice that Figure 4.6 contains a bias node in the place of an additional RBF. Unlike the RBF, the bias node does not accept any input. It always outputs a constant value of 1. Of course, this constant value of 1 is multiplied by a coefficient value, which causes the coefficient to be added directly to the output, regardless of the input. Bias nodes are very useful because they allow the RBF layer to output values even when the input is 0 or otherwise low.
The long-term memory vector for the RBF network has several different components:
- Input coefficients
- Output/summation coefficients
- RBF width scalars (same width in all dimensions)
- RBF center vectors
The RBF network will store all of these components as a single vector that will become its long-term memory. Then an optimization algorithm can set the vector to values that will produce the correct iris species for the features presented. This book contains several optimization algorithms that can train an RBF network.
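Putting the pieces together, a single output of the network can be sketched as below. This is a simplified reading of Equation 4.5 in which each width scalar multiplies the distance r; all parameter values here are illustrative placeholders, not trained long-term memory:

```python
import math

def rbf_network(x, centers, widths, out_coef, bias_coef):
    """Weighted sum of Gaussian RBFs plus a bias term. Each RBF
    sees the distance from the input to its center, scaled by a
    width scalar."""
    total = bias_coef  # the bias node always contributes its coefficient
    for center, width, a in zip(centers, widths, out_coef):
        r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
        total += a * math.exp(-(width * r) ** 2)
    return total

# Two RBFs over a 2D input; all values are made-up placeholders.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 1.0]
out_coef = [0.5, -0.25]
y = rbf_network([0.0, 0.0], centers, widths, out_coef, bias_coef=0.1)
```

For a classifier such as the iris example, one such summation would be computed per class, and the largest would pick the predicted species.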
In conclusion, this introduction provided a basic overview of vectors, distance, and RBF networks. Since this discussion included only the prerequisite material to understand Volume 3, refer to Volumes 1 and 2 for a more thorough explanation of these topics.
Normalizing Data
Normalization was briefly mentioned previously in this book. In this section, we will see exactly how it is performed. Data are not usually presented to the neural network in exactly the same raw form as you found them. Usually data are scaled to a specific range in a process called normalization. There are many different ways to normalize data. For a full summary, refer to Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms. This chapter will present a few normalization methods most useful for neural networks.
One-of-N Encoding
If you have a categorical value, such as the species of an iris, the make of an automobile, or the digit label in the MNIST data set, you should use one-of-n encoding. This type of encoding is sometimes referred to as one-hot encoding. To encode in this way, you would use one output neuron for each class in the problem. Recall the MNIST data set from the book's introduction, where you have images of the digits between 0 and 9. This problem is most commonly encoded as ten output neurons with a softmax activation function that gives the probability of the input being each of these digits. Using one-of-n encoding, the ten digits might be encoded as follows:
0 -> [1,0,0,0,0,0,0,0,0,0]
1 -> [0,1,0,0,0,0,0,0,0,0]
2 -> [0,0,1,0,0,0,0,0,0,0]
3 -> [0,0,0,1,0,0,0,0,0,0]
4 -> [0,0,0,0,1,0,0,0,0,0]
5 -> [0,0,0,0,0,1,0,0,0,0]
6 -> [0,0,0,0,0,0,1,0,0,0]
7 -> [0,0,0,0,0,0,0,1,0,0]
8 -> [0,0,0,0,0,0,0,0,1,0]
9 -> [0,0,0,0,0,0,0,0,0,1]
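The table above reduces to a one-line encoder. The off/on parameters are a convenience of this sketch, so the same function can produce the -1/+1 variant used with hyperbolic tangent outputs:

```python
def one_of_n(index, num_classes, off=0.0, on=1.0):
    """Encode a class index as a one-of-n (one-hot) vector."""
    return [on if i == index else off for i in range(num_classes)]

# Digit 2 from the table above:
print(one_of_n(2, 10))
```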
One-of-n encoding should always be used when the classes have no ordering. Another example of this type of encoding is the make of an automobile. Usually the list of automakers is unordered unless there is some meaning you wish to convey by this ordering. For example, you might order the automakers by the number of years in business. However, this ordering should only be used if the number of years in business has meaning to your problem. If there is truly no order, then one-of-n should always be used.
Because you can easily order the digits, you might wonder why we use one-of-n encoding for them. However, the order of the digits does not help the program recognize them. The fact that "1" and "2" are numerically next to each other does nothing to help the program recognize the image. Therefore, we should not use a single output neuron that simply outputs the digit recognized. The digits 0-9 are categories, not actual numeric values. Encoding categories with a single numeric value is detrimental to the neural network's decision process.
Both the input and output can use one-of-n encoding. The above listing used 0's and 1's. If you are using the rectified linear unit (ReLU) and softmax activation functions, this type of encoding is normal. However, if you are working with a hyperbolic tangent activation function, you should utilize a value of -1 for the 0's to match the hyperbolic tangent's range of -1 to 1.
If you have an extremely large number of classes, one-of-n encoding can become cumbersome because you must have a neuron for every class. In such cases, you have several options. First, you might find a way to order your categories. With this ordering, your categories can now be encoded as a numeric value, which would be the current category's position within the ordered list.
Another approach to dealing with an extremely large number of categories is term frequency-inverse document frequency (TF-IDF) encoding, because each class essentially becomes the probability of that class's occurrence relative to the others. In this way, TF-IDF allows the program to map a large number of classes to a single neuron. A complete discussion of TF-IDF is beyond the scope of this book; however, it is built into many machine learning frameworks for languages such as R and Python.
Range Normalization
If you have a real number or an ordered list of categories, you might choose range normalization because it simply maps the input data's range into the range of your activation function. Sigmoid, ReLU and softmax use a range between 0 and 1, whereas hyperbolic tangent uses a range between -1 and 1.
You can normalize a number with Equation 4.6:
Equation 4.6: Normalize to a Range
To perform the normalization, you need the high and low values of the data to be normalized, given by dl and dh in the equation above. Similarly, you need the high and low values to normalize into (usually 0 and 1), given by nl and nh.
Sometimes you will need to undo the normalization performed on a number and return it to a denormalized state. Equation 4.7 performs this operation:
Equation 4.7: Denormalize from a Range
A very simple way to think of range normalization is percentages. Consider the following analogy. You see an advertisement stating that you will receive a $10 (USD) reduction on a product, and you have to decide if this deal is worthwhile. If you are buying a t-shirt, this offer is probably a good deal; however, if you are buying a car, $10 does not really matter. Furthermore, you need to be familiar with the current value of US dollars in order to make your decision. The situation changes if you learn that the merchant had offered a 10% discount. The value is now more meaningful: no matter if you are buying a t-shirt, a car or even a house, the 10% discount has clear ramifications on the problem because it transcends currencies. In other words, the percentage is a type of normalization. Just like in the analogy, normalizing to a range helps the neural network evaluate all inputs with equal significance.
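Equations 4.6 and 4.7 translate directly into code. This sketch assumes the standard form of range normalization, with dl/dh as the data range and nl/nh as the target range:

```python
def normalize_range(x, dl, dh, nl=0.0, nh=1.0):
    """Equation 4.6: map x from the data range [dl, dh] into the
    normalized range [nl, nh]."""
    return ((x - dl) * (nh - nl)) / (dh - dl) + nl

def denormalize_range(x, dl, dh, nl=0.0, nh=1.0):
    """Equation 4.7: undo the mapping above."""
    return ((x - nl) * (dh - dl)) / (nh - nl) + dl

# Map a value of 5 from the range [0, 10] into [0, 1], and back.
print(normalize_range(5.0, 0.0, 10.0))    # -> 0.5
print(denormalize_range(0.5, 0.0, 10.0))  # -> 5.0
```

Passing `nl=-1.0, nh=1.0` targets the hyperbolic tangent's range instead.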
Z-Score Normalization
Z-score normalization is the most common normalization for either a real number or an ordered list. For nearly all applications, z-score normalization should be used in place of range normalization. This normalization type is based on the statistical concept of z-scores, the same technique used for grading exams on a curve. Z-scores provide even more information than percentages.
Consider the following example. Student A scored 85% of the points on her exam. Student B scored 75% of the points on his exam. Which student earned the better grade? If the professor is simply reporting the percentage of correct points, then student A earned a better score. However, you might change your answer if you learned that the average (mean) score for student A's very easy exam was 95%. Similarly, you might reconsider your position if you discovered that student B's class had an average score of 65%. Student B performed above average on his exam. Even though student A earned a better score, she performed below average. To truly report a curved score (a z-score), you must have the mean score and the standard deviation. Equation 4.8 shows the calculation of a mean:
Equation 4.8: Calculate the Arithmetic Mean
You can calculate the mean (μ, mu) by adding all of the scores and dividing by the number of scores. This process is the same as taking an average. Now that you have the average, you need the standard deviation. If you had a mean score of 50 points, then everyone taking the exam varied from the mean by some amount. The average amount that students varied from the mean is essentially the standard deviation. Equation 4.9 shows the calculation of the standard deviation (σ, sigma):
Equation 4.9: Standard Deviation
Essentially, the process of taking a standard deviation is squaring each score's difference from the mean, summing these values, dividing by the number of scores, and taking the square root of this total. Now that you have the standard deviation, you can calculate the z-score with Equation 4.10:
Equation 4.10: Z-Score
Listing 4.2 shows the pseudocode needed to calculate a z-score:
Listing 4.2: Calculate a Z-Score
from math import sqrt

# Data to score:
data = [5, 10, 3, 20, 4]
# Sum the values
sum = 0
for d in data:
    sum = sum + d
# Calculate the mean
mean = float(sum) / len(data)
print("Mean: " + str(mean))
# Calculate the variance
variance = 0
for d in data:
    variance = variance + ((mean - d) ** 2)
variance = variance / len(data)
print("Variance: " + str(variance))
# Calculate the standard deviation
sdev = sqrt(variance)
print("Standard Deviation: " + str(sdev))
# Calculate the z-scores
zscore = []
for d in data:
    zscore.append((d - mean) / sdev)
print("Z-Scores: " + str(zscore))
The above code will result in the following output:
Mean: 8.4
Variance: 39.440000000000005
Standard Deviation: 6.280127387243033
Z-Scores: [-0.5413902920037097, 0.2547719021193927, -0.8598551696529507,
1.8470962903655976, -0.7006227308283302]
The z-score is a numeric value where 0 represents a score that is exactly the mean. A positive z-score is above average; a negative z-score is below average. To help visualize z-scores, consider the following mapping between z-scores and letter grades:
< -2.0 = D+
-2.0 = C-
-1.5 = C
-1.0 = C+
-0.5 = B-
0.0 = B
+0.5 = B+
+1.0 = A-
+1.5 = A
+2.0 = A+
We took the mapping listed above from an undergraduate syllabus. There is a great deal of variation in z-score to letter grade mappings. Most professors will set the 0.0 z-score to either a C or a B, depending on whether the professor/university considers a C or a B to represent an average grade. The above professor considered B to be average. The z-score works well for neural network input because it is centered at 0 and will very rarely go above +3 or below -3.
Complex Normalization
The input to a neural network is commonly called its feature vector. The process of creating a feature vector is critical to mapping your raw data to a form that the neural network can comprehend. The process of mapping the raw data to a feature vector is called encoding. To see this mapping at work, consider the auto MPG data set:
1. mpg: numeric
2. cylinders: numeric, 3 unique
3. displacement: numeric
4. horsepower: numeric
5. weight: numeric
6. acceleration: numeric
7. model year: numeric, 3 unique
8. origin: numeric, 3 unique
9. car name: string (unique for each instance)
To encode the above data, we will use MPG as the output and treat the data set as regression. The MPG feature will be z-score encoded, and it falls within the range of the linear activation function that we will use on the output.
We will discard the car name. Cylinders and model year are both one-of-n encoded, and the remaining fields will be z-score encoded. The following feature vector results:
Input Feature Vector:
Feature 1: cylinders-2, -1 no, +1 yes
Feature 2: cylinders-4, -1 no, +1 yes
Feature 3: cylinders-8, -1 no, +1 yes
Feature 4: displacement z-score
Feature 5: horsepower z-score
Feature 6: weight z-score
Feature 7: acceleration z-score
Feature 8: model year-1977, -1 no, +1 yes
Feature 9: model year-1978, -1 no, +1 yes
Feature 10: model year-1979, -1 no, +1 yes
Feature 11: origin-1, -1 no, +1 yes
Feature 12: origin-2, -1 no, +1 yes
Feature 13: origin-3, -1 no, +1 yes
Output:
mpg z-score
As you can see, the feature vector has grown from the nine raw fields to thirteen features plus an output. A neural network for these data would have thirteen input neurons and a single output. Assuming a single hidden layer of twenty neurons with the ReLU activation, this network would look like Figure 4.7:
Figure 4.7: Simple Regression Neural Network
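The feature vector above can be sketched as an encoding function. The (mean, sdev) statistics below are illustrative placeholders; real values would be computed from the training data, and a record whose cylinder or year value falls outside the listed categories would encode those features as all -1:

```python
def encode_car(cylinders, displacement, horsepower, weight,
               acceleration, model_year, origin, stats):
    """Build the 13-value feature vector described above."""
    def z(name, value):
        mean, sdev = stats[name]
        return (value - mean) / sdev

    def one_of(value, choices):
        # -1 no, +1 yes for each listed category
        return [1.0 if value == c else -1.0 for c in choices]

    features = []
    features += one_of(cylinders, [2, 4, 8])          # features 1-3
    features.append(z("displacement", displacement))  # feature 4
    features.append(z("horsepower", horsepower))      # feature 5
    features.append(z("weight", weight))              # feature 6
    features.append(z("acceleration", acceleration))  # feature 7
    features += one_of(model_year, [77, 78, 79])      # features 8-10
    features += one_of(origin, [1, 2, 3])             # features 11-13
    return features

# Placeholder statistics; real (mean, sdev) pairs would come from
# the training data.
stats = {"displacement": (194.0, 104.0), "horsepower": (104.0, 38.0),
         "weight": (2977.0, 849.0), "acceleration": (15.5, 2.8)}
vec = encode_car(8, 307, 130, 3504, 12.0, 78, 1, stats)
```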
Chapter Summary
Feedforward neural networks are one of the most common algorithms in artificial intelligence. In this chapter, we introduced the multilayer feedforward neural network and the radial-basis function (RBF) neural network. Both of these neural network types can be applied to classification and regression.
Feedforward networks have well-defined layers. The input layer accepts the input from the computer program. The output layer returns the processing result of the neural network to the calling program. Between these layers are hidden neurons that help the neural network to recognize a pattern presented at the input layer and produce the correct result at the output layer.
RBF neural networks use a series of radial-basis functions for their hidden layer. In addition to the weights, it is also possible to change the widths and centers of these RBFs. Though both an RBF and a feedforward network can approximate any function, they go about the process in different ways.
So far, we've seen only how to calculate the values for neural networks. Training is the process by which we adjust the weights of a neural network so that it outputs the values that we desire. To train neural networks, we also need a way to evaluate them. The next chapter introduces both training and validation of neural networks.
Chapter 5: Training & Evaluation
Mean Squared Error
Sensitivity & Specificity
ROC Curve
Simulated Annealing
So far we've seen how to calculate a neural network based on its weights; however, we have not seen where these weight values actually come from. Training is the process where a neural network's weights are adjusted to produce the desired output. Training uses evaluation, which is the process where the output of the neural network is compared against the expected output.
This chapter will cover evaluation and introduce training. Because neural networks can be trained and evaluated in many different ways, we need a consistent method to judge them. An objective function evaluates a neural network and returns a score. Training adjusts the neural network in ways that might achieve better results. Typically, the objective function seeks lower scores, and the process of attempting to achieve lower scores is called minimization. You might also establish maximization problems, in which the objective function seeks higher scores. Most training algorithms can be used for either minimization or maximization problems.
You can optimize the weights of a neural network with any continuous optimization algorithm, such as simulated annealing, particle swarm optimization, genetic algorithms, hill climbing, Nelder-Mead, or random walk. In this chapter, we will introduce simulated annealing as a simple training algorithm. However, in addition to optimization algorithms, you can train neural networks with backpropagation. Chapter 6, "Backpropagation Training," and Chapter 7, "Other Propagation Training," will introduce backpropagation and several algorithms based upon it.
Evaluating Classification
Classification is the process by which a neural network attempts to classify the input into one or more classes. The simplest way of evaluating a classification network is to track the percentage of training set items that were classified incorrectly. We typically score human exams in this manner. For example, you might have taken multiple-choice exams in school in which you had to shade in a bubble for choices A, B, C, or D. If you chose the wrong letter on one question of a 10-question exam, you would earn a 90%. In the same way, we can grade computers; however, most classification algorithms do not simply choose A, B, C, or D. Computers typically report a classification as their percent confidence in each class. Figure 5.1 shows how a computer and a human might both respond to question #1 on an exam:
Figure 5.1: Human Exam versus Computer Classification
As you can see, the human test taker marked the first question as “B.” However, the computer test taker had an 80% (0.8) confidence in “B” and was also somewhat sure with 10% (0.1) on “A.” The computer then distributed the remaining points to the other two choices. In the simplest sense, the machine would get 80% of the score for this question if the correct answer were “B.” The machine would get only 5% (0.05) of the points if the correct answer were “D.”
Binary Classification
Binary classification occurs when a neural network must choose between two options, which might be true/false, yes/no, correct/incorrect, or buy/sell. To see how to use binary classification, we will consider a classification system for a credit card company. This classification system must decide how to respond to a new potential customer. The system will either issue a credit card or decline a credit card.
When you have only two classes to consider, the objective function’s score is based on the number of false positive predictions versus the number of false negatives. False negatives and false positives are both types of errors, and it is important to understand the difference. For the previous example, issuing a credit card would be the positive. A false positive occurs when a credit card is issued to someone who will become a bad credit risk. A false negative happens when a credit card is declined to someone who would have been a good risk.
Because only two options exist, we can choose the mistake that is the more serious type of error, a false positive or a false negative. For most banks issuing credit cards, a false positive is worse than a false negative. Declining a potentially good credit card holder is better than accepting a credit card holder who would cause the bank to undertake expensive collection activities.
A classification problem seeks to assign the input into one or more categories. A binary classification employs a single-output neural network to classify into two categories. Consider the auto MPG dataset that is available from the University of California at Irvine (UCI) machine learning repository at the following URL:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
For the auto MPG dataset, we might create classifications for cars built inside of the United States. The field named origin provides information on the location of the car assembly. Thus, the single output neuron would give a number that indicates the probability that the car was built in the USA.
To perform this prediction, you need to change the origin field to hold values between 1 and the low end of the range of the activation function. For example, the low end of the range for the sigmoid function is 0; for the hyperbolic tangent, it is -1. The neural network will output a value that indicates the probability of a car being made in the USA or elsewhere. Values closer to 1 indicate a higher probability of the car originating in the USA; values closer to 0 or -1 indicate a car originating from outside the USA.
You must choose a cutoff value that differentiates these predictions into either USA or non-USA. If USA is 1.0 and non-USA is 0.0, we could simply choose 0.5 as the cutoff value. Consequently, a car with an output of 0.6 would be USA, and 0.4 would be non-USA.
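A short Python sketch can make the cutoff idea concrete. The outputs and labels below are made-up illustration values, not actual auto MPG results:

```python
# Hypothetical illustration: applying a 0.5 cutoff to network outputs and
# tallying the four outcome types. The outputs and labels are made up.

def classify(outputs, cutoff=0.5):
    """Convert raw network outputs into USA (True) / non-USA (False)."""
    return [o >= cutoff for o in outputs]

def confusion_counts(predicted, actual):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, tn, fp, fn

outputs = [0.9, 0.6, 0.45, 0.2]       # network confidence of "USA"
actual = [True, True, True, False]    # true origin of each car
predicted = classify(outputs)
print(confusion_counts(predicted, actual))  # (2, 1, 0, 1)
```

Note how the USA car scored at 0.45 falls below the cutoff and becomes the single false negative in the counts.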
Invariably, this neural network will produce errors as it classifies cars. A USA-made car might yield an output of 0.45; however, because this output is below the cutoff value, the neural network would not put the car in the correct category. Because we designed this neural network to classify USA-made cars, this error would be called a false negative. In other words, the neural network indicated that the car was non-USA, creating a negative result; because the car was actually from the USA, the negative classification was false. This error is also known as a type-2 error.
Similarly, the network might falsely classify a non-USA car as USA. This error is a false positive, or a type-1 error. Neural networks prone to producing false positives are characterized as more sensitive. Similarly, neural networks that produce more false negatives are labeled as more specific. Figure 5.2 summarizes these relationships between true/false, positives/negatives, type-1 & type-2 errors, and sensitivity/specificity:
Figure 5.2: Types of Errors
Setting the cutoff for the output neuron determines whether sensitivity or specificity is more important. It is possible to make a neural network more sensitive or more specific by adjusting this cutoff, as illustrated in Figure 5.3:
Figure 5.3: Sensitivity vs. Specificity
As the limit line moves left, the network becomes more sensitive. The decrease in the size of the true negative (TN) area makes this sensitivity evident. Conversely, as the limit line moves right, the network becomes more specific. This specificity is evident in the decrease in size of the true positive (TP) area.
Increases in sensitivity will usually result in a decrease of specificity. Figure 5.4 shows a limit designed to make the neural network very sensitive:
Figure 5.4: Sensitive Cutoff
The neural network can also be calibrated for greater specificity, as shown in Figure 5.5:
Figure 5.5: Specific Cutoff
Attaining 100% specificity or sensitivity is not necessarily good. A medical test can reach 100% specificity by simply predicting that no one has the disease. This test will never commit a false positive error because it never gives a positive answer. Obviously, such a test is not useful. Overly specific or sensitive neural networks produce the same meaningless result. We need a way to evaluate the total effectiveness of the neural network that is independent of the cutoff point. The total prediction rate combines the percentage of true positives and true negatives. Equation 5.1 calculates the total prediction rate:
Equation 5.1: Total Prediction Rate
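As a minimal Python sketch, assuming the common definition of the total prediction rate as the fraction of all predictions that were correct, (TP + TN) / (TP + TN + FP + FN), with made-up counts:

```python
# A minimal sketch of the total prediction rate, assuming the definition
# (TP + TN) / (TP + TN + FP + FN). The counts below are made up.

def total_prediction_rate(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
print(total_prediction_rate(40, 45, 5, 10))  # 0.85
```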
Additionally, you can visualize the total prediction rate (TPR) with a receiver operating characteristic (ROC) chart, as seen in Figure 5.6:
Figure 5.6: Receiver Operating Characteristic (ROC) Chart
The above chart shows three different ROC curves. The dashed line shows an ROC with zero predictive power. The dotted line shows a better neural network, and the solid line shows a nearly perfect neural network. To understand how to read an ROC chart, look first at the origin, which is marked by 0%. All ROC lines always start at the origin and move to the upper-right corner, where true positive (TP) and false positive (FP) rates are both 100%.
The y-axis shows the TP percentages from 0 to 100. As you move up the y-axis, both TP and FP increase. As TP increases, so does sensitivity; however, specificity falls. The ROC chart allows you to select the level of sensitivity you need, but it also shows you the number of FPs you must accept to achieve that level of sensitivity.
The worst network, the dashed line, always has a 50% total prediction rate. Given that there are only two outcomes, this result is no better than random guessing. To get 100% TP, you must also have 100% FP, which still results in half of the predictions being wrong.
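To make the chart concrete, here is a hedged Python sketch, with made-up scores and labels, that sweeps the cutoff to produce ROC points and estimates the area under the curve with the trapezoidal rule:

```python
# Hedged sketch: building ROC points by sweeping the cutoff over a set of
# made-up network outputs, then estimating the area under the curve.

def roc_points(scores, labels):
    """Return (fp_rate, tp_rate) pairs for every distinct cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for cut in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # network outputs, high = positive
labels = [1, 1, 0, 1, 0, 0]               # actual classes
print(auc(roc_points(scores, labels)))    # about 0.889 for this toy data
```

An AUC near 0.5 corresponds to the dashed diagonal line; a value near 1.0 corresponds to the nearly perfect solid line.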
The following URL allows you to experiment with a simple neural network and ROC curve:
http://www.heatonresearch.com/aifh/vol3/anneal_roc.html
We can train the neural network at the above URL with simulated annealing. Each time an annealing epoch is completed, the neural network improves. We can measure this improvement with the mean squared error (MSE) calculation. As the MSE drops, the ROC curve stretches towards the upper-left corner. We will describe the MSE in greater detail later in this chapter. For now, simply think of it as a measurement of the neural network’s error when you compare it to the expected output. A lower MSE is desirable. Figure 5.7 shows the ROC curve after we have trained the network for a number of iterations:
Figure 5.7: ROC Curve
It is important to note that the goal is not always to maximize the total prediction rate. Sometimes a false positive (FP) is better than a false negative (FN). Consider a neural network that predicts a bridge collapse. An FP means that the program predicts a collapse when the bridge was actually safe. In this case, checking a structurally sound bridge would waste an engineer’s time. On the other hand, an FN would mean that the neural network predicted the bridge was safe when it actually collapsed. A bridge collapsing is a much worse outcome than wasting the time of an engineer. Therefore, you should arrange this type of neural network so that it is overly sensitive.
To evaluate the total effectiveness of the network, you should consider the area under the curve (AUC). The optimal AUC would be 1.0, which is a 100% (1.0) x 100% (1.0) rectangle that pushes the area under the curve to the maximum. When reading an ROC chart, the more effective neural networks have more space under the curve. The curves shown previously, in Figure 5.6, correspond with this assessment.
Multi-Class Classification
If you want to predict more than one outcome, you will need more than one output neuron. Because a single neuron can predict two outcomes, a neural network with two output neurons is somewhat rare. If there are three or more outcomes, there will be three or more output neurons. Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms does show a method that can encode three outcomes into two output neurons.
Consider Fisher’s iris dataset. This dataset contains four different measurements for three different species of iris flower. The following URL contains this dataset:
https://archive.ics.uci.edu/ml/datasets/Iris
Sample data from the iris dataset is shown here:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolour
6.4,3.2,4.5,1.5,Iris-versicolour
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
The four measurements can predict the species. If you are interested in reading more about how to measure an iris flower, refer to the above link. For this prediction, the meaning of the four measurements does not really matter; they are simply the features from which the neural network will learn to predict. Figure 5.8 shows a neural network structure that can predict the iris dataset:
Figure 5.8: Iris Data Set Neural Network
The above neural network accepts the four measurements and outputs three numbers. Each output corresponds with one of the iris species. The output neuron that produces the highest number determines the species predicted.
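This highest-output rule can be sketched in a few lines of Python. The output values below are illustrative, not actual network results:

```python
# Minimal sketch: picking the predicted species from three output neurons.
# The output values are made up for illustration.

SPECIES = ["Iris-setosa", "Iris-versicolour", "Iris-virginica"]

def predict_species(outputs):
    """Return the species whose output neuron produced the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return SPECIES[best]

print(predict_species([0.1, 0.7, 0.2]))  # Iris-versicolour
```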
Log Loss
Classification networks can derive a class from the input data. For example, the four iris measurements can group the data into the three species of iris. One easy method to evaluate classification is to treat it like a multiple-choice exam and return a percent score. Although this technique is common, most machine learning models do not answer multiple-choice questions like you did in school. Consider how the following question might appear on an exam:
1. Would an iris setosa have a sepal length of 5.1 cm, a sepal width of 3.5 cm, a petal length of 1.4 cm, and a petal width of 0.2 cm?
A) True
B) False
This question is exactly the type that a neural network must face in a classification task. However, the neural network will not respond with an answer of “True” or “False.” It will answer the question in the following manner:
True: 80%
The above response means that the neural network is 80% sure that the flower is a setosa. This technique would be very handy in school. If you could not decide between true and false, you could simply place 80% on “True.” Scoring is relatively easy because you receive your percentage value for the correct answer. In this case, if “True” were the correct answer, your score would be 80% for that question.
However, log loss is not quite that simple. Equation 5.2 is the equation for log loss:
Equation 5.2: Log Loss Function
You should use this equation only as an objective function for classifications that have two outcomes. The variable y-hat is the neural network’s prediction, and the variable y is the known correct answer. In this case, y will always be 0 or 1. The training data have no probabilities; each element is classified either into one class (1) or the other (0).
The variable N represents the number of elements in the training set, that is, the number of questions in the test. We divide by N because this process is customary for an average. We also begin the equation with a negative because the log function is always negative over the domain 0 to 1. This negation allows a positive score for the training to minimize.
You will notice that two terms are separated by the addition (+). Each contains a log function. Because y will be either 0 or 1, one of these two terms will cancel out to 0. If y is 0, then the first term will reduce to 0. If y is 1, then the second term will be 0.
If your prediction for the first class of a two-class prediction is y-hat, then your prediction for the second class is 1 minus y-hat. Essentially, if your prediction for class A is 70% (0.7), then your prediction for class B is 30% (0.3). Your score will increase by the log of your prediction for the correct class. If the neural network had predicted 1.0 for class A, and the correct answer was A, your score would increase by log(1), which is 0. For log loss, we seek a low score, so a correct answer results in 0. Here are some of these log values (base 10) for a neural network’s probability estimate for the correct class:
-log(1.0) = 0
-log(0.95) = 0.02
-log(0.9) = 0.05
-log(0.8) = 0.1
-log(0.5) = 0.3
-log(0.1) = 1
-log(0.01) = 2
-log(1.0e-12) = 12
-log(0.0) = infinity
As you can see, giving a low confidence to the correct answer affects the score the most. Because log(0) is negative infinity, we typically impose a minimum value. Of course, the above log values are for a single training set element. We will average the log values for the entire training set.
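A minimal Python sketch of Equation 5.2, assuming the usual natural-log form -1/N * sum(y*log(y-hat) + (1-y)*log(1-y-hat)); the small epsilon guards against log(0), the minimum value mentioned above:

```python
# A hedged sketch of binary log loss. The epsilon clip keeps predictions
# away from exactly 0 or 1, where the log would blow up.
import math

def log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        yhat = min(max(yhat, eps), 1 - eps)  # impose a minimum value
        total += y * math.log(yhat) + (1 - y) * math.log(1 - yhat)
    return -total / len(y_true)

# A confident, correct prediction scores near 0; poor confidence scores high.
print(log_loss([1, 0], [0.9, 0.2]))
```

Note that this sketch uses the natural logarithm, which is common for log loss; the illustrative table above used base-10 values.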
Multi-Class Log Loss
If more than two outcomes are classified, then we must use multi-class log loss. This loss function is very closely related to the binary log loss just described. Equation 5.3 shows the equation for multi-class log loss:
Equation 5.3: Multi-Class Log Loss
In the above equation, N is the number of training set elements, and M represents the number of categories for the classification process. Conceptually, the multi-class log loss objective function works similarly to binary log loss. The above equation essentially gives you a score that is the average of the negative log of your prediction for the correct class on each of the data items. The innermost sigma summation in the above equation functions as an if-then statement and allows only the correct class, with a y of 1.0, to contribute to the summation.
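Equation 5.3 can be sketched as follows. The one-hot targets and probability rows below are made-up values, and the inner loop mirrors the if-then role of the innermost summation:

```python
# A hedged sketch of multi-class log loss: the average negative log of the
# predicted probability assigned to the correct class. Values are made up.
import math

def multi_log_loss(y_true, y_pred, eps=1e-15):
    """y_true: list of one-hot rows; y_pred: list of probability rows."""
    total = 0.0
    for truth, probs in zip(y_true, y_pred):
        for y, p in zip(truth, probs):
            # Only the correct class (y == 1) contributes to the sum.
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

targets = [[1, 0, 0], [0, 1, 0]]            # one-hot correct classes
preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # network probabilities
print(multi_log_loss(targets, preds))
```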
Evaluating Regression
The mean squared error (MSE) calculation is the most commonly utilized process for evaluating regression machine learning. Most Internet examples of neural networks, support vector machines, and other models apply MSE (Draper, 1998), shown in Equation 5.4:
Equation 5.4: Mean Squared Error (MSE)
In the above equation, y is the ideal output and y-hat is the actual output. The mean squared error is essentially the mean of the squares of the individual differences. Because the individual differences are squared, the positive or negative nature of the difference does not matter to MSE.
You can also evaluate classification problems with MSE. To evaluate classification output with MSE, each class’s probability is simply treated as a numeric output. The expected output simply has a value of 1.0 for the correct class, and 0 for the others. For example, if the first class were correct, and the other three classes incorrect, the expected outcome vector would look like the following:
[1.0, 0, 0, 0]
You can use nearly any regression objective function for classification in this way. A variety of functions, such as root mean square (RMS) and sum of squares error (SSE), can evaluate regression, and we discussed these functions in Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms.
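Equation 5.4, applied to classification output in the way just described, can be sketched as follows; the prediction values are illustrative:

```python
# A minimal sketch of MSE, comparing a probability vector with a one-hot
# expected vector. The predicted probabilities are made up.

def mse(ideal, actual):
    """Mean of the squared differences between ideal and actual outputs."""
    return sum((y - yhat) ** 2 for y, yhat in zip(ideal, actual)) / len(ideal)

expected = [1.0, 0.0, 0.0, 0.0]       # first class is correct
predicted = [0.8, 0.1, 0.05, 0.05]    # network's class probabilities
print(mse(expected, predicted))       # about 0.01375
```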
Training with Simulated Annealing
To train a neural network, you must define its task. An objective function, also known as a scoring or loss function, measures how well the network performs that task. Essentially, an objective function evaluates the neural network and returns a number indicating the usefulness of the neural network. The training process modifies the weights of the neural network in each iteration so that the value returned from the objective function improves.
Simulated annealing is an effective optimization technique that we examined in Artificial Intelligence for Humans, Volume 1. In this chapter, we will review simulated annealing as well as show you how any vector optimization function can improve the weights of a feedforward neural network. In the next chapter, we will examine even more advanced optimization techniques that take advantage of a differentiable loss function.
As a review, simulated annealing works by first assigning the weight vector of a neural network to random values. This vector is treated like a position, and the program evaluates possible moves from that position. To understand how a neural network weight vector translates to a position, think of a neural network with just three weights. In the real world, we consider position in terms of the x, y and z coordinates. We can write any position as a vector of length 3. If we are willing to move in a single dimension at a time, we could move in a total of six different directions: forward or backward in the x, y or z dimensions.
Simulated annealing functions by moving forward or backwards in the available dimensions. If the algorithm always took the best move, a simple hill-climbing algorithm would result. Hill climbing only accepts moves that improve the score; therefore, it is called a greedy algorithm. To reach the best position, an algorithm will sometimes need to move to a lower position. As a result, simulated annealing very much follows the expression of two steps forward, one step back.
In other words, simulated annealing will sometimes allow a move to a weight configuration with a worse score. The probability of accepting such a move starts high and decreases. This probability is governed by the current temperature, which simulates the actual metallurgical annealing process where a metal cools and achieves greater hardness. Figure 5.9 shows the entire process:
Figure 5.9: Simulated Annealing
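The process in Figure 5.9 can be sketched in Python. This is a hedged illustration, not the book’s actual implementation: the error function is a stand-in (squared distance to a known target vector) rather than a real neural network evaluation, and the schedule constants are arbitrary:

```python
# A hedged sketch of simulated annealing on a weight vector. The error
# function and schedule constants below are stand-ins for illustration.
import math
import random

def anneal(weights, error_fn, k_max=100, t_start=400.0, t_end=1e-4):
    random.seed(42)                       # reproducible demo run
    best = list(weights)
    best_err = error_fn(best)
    current, current_err = list(best), best_err
    for k in range(1, k_max + 1):
        # Exponentially decay the temperature from t_start toward t_end.
        t = t_start * (t_end / t_start) ** (k / k_max)
        candidate = [w + random.uniform(-0.5, 0.5) for w in current]
        cand_err = error_fn(candidate)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t), which shrinks as t cools.
        delta = cand_err - current_err
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_err = candidate, cand_err
        if current_err < best_err:
            best, best_err = list(current), current_err
    return best, best_err

target = [1.0, -2.0, 0.5]
error = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
w, e = anneal([0.0, 0.0, 0.0], error)
print("final error:", e)
```

The high starting temperature makes nearly every move acceptable (a random walk); as the temperature decays, the search behaves more and more like greedy hill climbing.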
A feedforward neural network can utilize simulated annealing to learn the iris dataset. The following output shows this training:
Iteration #1, Score=0.3937, k=1, kMax=100, t=343.5891, prob=0.9998
Iteration #2, Score=0.3937, k=2, kMax=100, t=295.1336, prob=0.9997
Iteration #3, Score=0.3835, k=3, kMax=100, t=253.5118, prob=0.9989
Iteration #4, Score=0.3835, k=4, kMax=100, t=217.7597, prob=0.9988
Iteration #5, Score=0.3835, k=5, kMax=100, t=187.0496, prob=0.9997
Iteration #6, Score=0.3835, k=6, kMax=100, t=160.6705, prob=0.9997
Iteration #7, Score=0.3835, k=7, kMax=100, t=138.0116, prob=0.9996
...
Iteration #99, Score=0.1031, k=99, kMax=100, t=1.16E-4, prob=2.8776E-7
Iteration #100, Score=0.1031, k=100, kMax=100, t=9.9999E-5, prob=2.1443E-70
Final score: 0.1031
[0.22222222222222213, 0.6249999999999999, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa
[0.1666666666666668, 0.41666666666666663, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa
...
[0.6666666666666666, 0.41666666666666663, 0.711864406779661, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.5555555555555555, 0.20833333333333331, 0.6779661016949152, 0.75] -> Iris-virginica, Ideal: Iris-virginica
[0.611111111111111, 0.41666666666666663, 0.711864406779661, 0.7916666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.5277777777777778, 0.5833333333333333, 0.7457627118644068, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica
[0.44444444444444453, 0.41666666666666663, 0.6949152542372881, 0.7083333333333334] -> Iris-virginica, Ideal: Iris-virginica
[1.178018083703488,16.66575553359515,-0.6101619300462806,
-3.9894606091020965,13.989551673146842,-8.87489712462323,
8.027287801488647,-4.615098285283519,6.426489182215509,
-1.4672962642199618,4.136699061975335,4.20036115439746,
0.9052469139543605,-2.8923515248132063,-4.733219252086315,
18.6497884912826,2.5459600552510895,-5.618872440836617,
4.638827606092005,0.8887726364890928,8.730809901357286,
-6.4963370793479545,-6.4003385330186795,-11.820235441582424,
-3.29494170904095,-1.5320936828139837,0.1094081633203249,
0.26353076268018827,3.935780218339343,0.8881280604852664,
-5.048729642423418,8.288232057956957,-14.686080237582006,
3.058305829324875,-2.4144038920292608,21.76633883966702,
12.151853576801647,-3.6372061664901416,6.28253174293219,
-4.209863472970308,0.8614258660906541,-9.382012074551428,
-3.346419915864691,-0.6326977049713416,2.1391118323593203,
0.44832732990560714,6.853600355726914,2.8210824313745957,
1.3901883615737192,-5.962068350552335,0.502596306917136]
The initial random neural network starts out with a high multi-class log loss score of about 0.39. As the training progresses, this value falls until it is low enough for training to stop. For this example, the training stops once the score falls to roughly 0.10. To determine a good stopping point for the error, you should evaluate how well the network is performing for your intended use. A log loss below 0.5 is often in the acceptable range; however, you might not be able to achieve this score with all datasets.
The following URL shows an example of a neural network trained with simulated annealing:
http://www.heatonresearch.com/aifh/vol3/anneal_roc.html
Chapter Summary
Objective functions can evaluate neural networks. They simply return a number that indicates the success of the neural network. Regression neural networks will frequently utilize mean squared error (MSE). Classification neural networks will typically use a log loss or multi-class log loss function. You can also create custom objective functions for these neural networks.
Simulated annealing can optimize the neural network. You can utilize any of the optimization algorithms presented in Volumes 1 and 2 of Artificial Intelligence for Humans. In fact, you can optimize any vector in this way because the optimization algorithms are not tied to a neural network. In the next chapter, you will see several training methods designed specifically for neural networks. While these specialized training algorithms are often more efficient, they require objective functions that have a derivative.
Chapter 6: Backpropagation Training
Gradient Calculation
Backpropagation
Learning Rate & Momentum
Stochastic Gradient Descent
Backpropagation is one of the most common methods for training a neural network. Rumelhart, Hinton, & Williams (1986) introduced backpropagation, and it remains popular today. Programmers frequently train deep neural networks with backpropagation because it scales really well when run on graphical processing units (GPUs). To understand this algorithm for neural networks, we must examine how to train it as well as how it processes a pattern.
Classic backpropagation has been extended and modified to give rise to many different training algorithms. In this chapter, we will discuss the most commonly used training algorithms for neural networks. We begin with classic backpropagation and then end the chapter with stochastic gradient descent (SGD).
Understanding Gradients
Backpropagation is a type of gradient descent, and many texts will use these two terms interchangeably. Gradient descent refers to the calculation of a gradient on each weight in the neural network for each training element. Because the neural network will not output the expected value for a training element, the gradient of each weight will give you an indication of how to modify each weight to achieve the expected output. If the neural network did output exactly what was expected, the gradient for each weight would be 0, indicating that no change to the weight is necessary.
The gradient is the derivative of the error function at the weight’s current value. The error function measures the distance of the neural network’s output from the expected output. In fact, gradient descent is the process of following each weight’s gradient to reach ever lower values of the error function.
With respect to the error function, the gradient is essentially the partial derivative of the error for each weight in the neural network. Each weight has a gradient that is the slope of the error function. A weight is a connection between two neurons. Calculating the gradient of the error function allows the training method to determine whether it should increase or decrease the weight. In turn, this determination will decrease the error of the neural network. The error is the difference between the expected output and actual output of the neural network. Many different training methods, called propagation-training algorithms, utilize gradients. In all of them, the sign of the gradient tells the neural network the following information:
Zero gradient – The weight is not contributing to the error of the neural network.
Negative gradient – The weight should be increased to achieve a lower error.
Positive gradient – The weight should be decreased to achieve a lower error.
Because many algorithms depend on gradient calculation, we will begin with an analysis of this process.
What is a Gradient
First of all, let’s examine the gradient. Essentially, training is a search for the set of weights that will cause the neural network to have the lowest error for a training set. If we had an infinite amount of computation resources, we would simply try every possible combination of weights to determine the one that provided the lowest error during the training.
Because we do not have unlimited computing resources, we have to use some sort of shortcut to prevent the need to examine every possible weight combination. These training methods utilize clever techniques to avoid performing a brute-force search of all weight values. This type of exhaustive search would be impossible because even small networks have an essentially infinite number of weight combinations.
Consider a chart that shows the error of a neural network for each possible weight. Figure 6.1 is a graph that demonstrates the error for a single weight:
Figure 6.1: Gradient of a Single Weight
Looking at this chart, you can easily see that the optimal weight is the location where the line has the lowest y-value. The problem is that we see only the error for the current value of the weight; we do not see the entire graph because that process would require an exhaustive search. However, we can determine the slope of the error curve at a particular weight. In the above chart, we see the slope of the error curve at 1.5. The straight line that barely touches the error curve at 1.5 gives the slope. In this case, the slope, or gradient, is -0.5622. The negative slope indicates that an increase in the weight will lower the error.
The gradient is the instantaneous slope of the error function at the specified weight. The derivative of the error curve at that point gives the gradient. This line tells us the steepness of the error function at the given weight.
Derivatives are one of the most fundamental concepts in calculus. For the purposes of this book, you just need to understand that a derivative provides the slope of a function at a specific point. A training technique and this slope can give you the information to adjust the weight for a lower error. Using our working definition of the gradient, we will now show how to calculate it.
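The slope-at-a-point idea can be demonstrated numerically. This hedged sketch uses a made-up error curve, not the one plotted in Figure 6.1, and estimates the gradient with a central-difference approximation:

```python
# Hedged sketch: estimating the slope (gradient) of an error curve at a
# single weight with a central difference. The error curve is made up.

def numeric_gradient(error_fn, w, h=1e-6):
    """Central difference: slope is approximately (E(w+h) - E(w-h)) / (2h)."""
    return (error_fn(w + h) - error_fn(w - h)) / (2 * h)

error = lambda w: (w - 3.0) ** 2      # toy error curve with its minimum at w = 3.0
g = numeric_gradient(error, 1.5)
print(round(g, 6))  # -3.0: a negative slope, so increasing w lowers the error
```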
Calculating Gradients
We will calculate an individual gradient for each weight. Our focus is not only the equations but also their application to actual neural networks with real numbers. Figure 6.2 shows the neural network that we will use:
Figure 6.2: An XOR Network
Additionally, we use this same neural network in several examples on the website for this book. In this chapter, we will show several calculations that demonstrate the training of a neural network. We must use the same starting weights so that these calculations are consistent. However, the above weights have no special characteristic; the program generated them randomly.
The aforementioned neural network is a typical three-layer feedforward network like the ones we have previously studied. The circles indicate neurons. The lines connecting the circles are the weights. The rectangles in the middle of the connections give the weight for each connection.
The problem that we now face is calculating the partial derivative for each of the weights in the neural network. We use a partial derivative when an equation has more than one variable. Each of the weights is considered a variable because these weight values will change independently as the neural network changes. The partial derivative of each weight simply shows each weight’s independent effect on the error function. This partial derivative is the gradient.
We can calculate each partial derivative with the chain rule of calculus. We will begin with one training set element. For Figure 6.2, we provide an input of [1,0] and expect an output of [1]. You can see that we apply the input on the above figure. The first input neuron has an input value of 1.0, and the second input neuron has an input value of 0.0.
This input feeds through the network and eventually produces an output. Chapter 4, “Feedforward Neural Networks,” covers the exact process to calculate the output and sums. Backpropagation has both a forward and a backward pass. The forward pass occurs when we calculate the output of the neural network. We will calculate the gradients only for this item in the training set. Other items in the training set will have different gradients. We will discuss how to combine the gradients for the individual training set elements later in the chapter.
We are now ready to calculate the gradients. The steps involved in calculating the gradients for each weight are summarized here:
1. Calculate the error, based on the ideal of the training set.
2. Calculate the node (neuron) delta for the output neurons.
3. Calculate the node delta for the interior neurons.
4. Calculate the individual gradients.
We will discuss these steps in the subsequent sections.
Calculating Output Node Deltas
Calculating a constant value for every node, or neuron, in the neural network is the first step. We will start with the output nodes and work our way backwards through the neural network. The term backpropagation comes from this process. We initially calculate the errors for the output neurons and propagate these errors backwards through the neural network.
The node delta is the value that we will calculate for each node. Layer delta also describes this value because we can calculate the deltas one layer at a time. The method for determining the node deltas differs depending on whether you are calculating for an output or an interior node. The output nodes are calculated first, and they take into account the error function for the neural network. In this volume, we will examine the quadratic error function and the cross-entropy error function.
Quadratic Error Function
Programmers of neural networks frequently use the quadratic error function. In fact, you can find many examples of the quadratic error function on the Internet. If you are reading an example program, and it does not mention a specific error function, the program is probably using the quadratic error function, also known as the mean squared error (MSE) function discussed in Chapter 5, “Training and Evaluation.” Equation 6.1 shows the MSE function:
Equation 6.1: Mean Squared Error (MSE)
The above equation compares the neural network’s actual output (y-hat) with the expected output (y). The variable n contains the number of training elements times the number of output neurons. MSE handles multiple output neurons as individual cases. Equation 6.2 shows the node delta used in conjunction with the quadratic error function:
Equation 6.2: Node Delta of MSE Output Layer
The quadratic error function is very simple because it takes the difference between the expected and actual output for the neural network. The Greek letter φ′ (phi prime) represents the derivative of the activation function.
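As a hedged sketch of an output node delta for the quadratic error, assuming one common sign convention, delta = (ideal - actual) * phi'(sum), with a sigmoid output neuron and made-up values:

```python
# A hedged sketch of an output node delta for the quadratic error,
# assuming the convention delta = (ideal - actual) * phi'(sum).
# The sum and ideal values below are made up.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_node_delta(ideal, sum_value):
    out = sigmoid(sum_value)
    deriv = out * (1.0 - out)     # sigmoid derivative from its own output
    return (ideal - out) * deriv

# Output neuron with a weighted sum of 0.75 and an ideal output of 1.0
print(output_node_delta(1.0, 0.75))
```

Sign conventions for node deltas vary between texts; what matters is that the delta is the error-function difference scaled by the activation function's derivative at the neuron's sum.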
Cross-Entropy Error Function
The quadratic error function can sometimes take a long time to properly adjust the weights. Equation 6.3 shows the cross-entropy error function:
Equation 6.3: Cross-Entropy Error
The node delta calculation for the cross-entropy error turns out to be much less complex than the one for MSE, as seen in Equation 6.4:
Equation 6.4: Node Delta of Cross-Entropy Output Layer
The cross-entropy error function will typically give better results than the quadratic error function because it creates a much steeper gradient for errors. You should always use the cross-entropy error function.
Calculating Remaining Node Deltas
Now that the output node delta has been calculated according to the appropriate error function, we can calculate the node deltas for the interior nodes, as demonstrated by Equation 6.5:
Equation 6.5: Calculating Interior Node Deltas
We will calculate the node delta for all hidden and non-bias neurons, but we do not need to calculate the node delta for the input and bias neurons. Even though we could easily calculate the node delta for input and bias neurons with Equation 6.5, gradient calculation does not require these values. As you will soon see, gradient calculation for a weight only considers the neuron to which the weight is connected. Bias and input neurons are only the beginning point for a connection; they are never the end point.
If you would like to see the gradient calculation process, several JavaScript examples will show the individual calculations. These examples can be found at the following URL:
http://www.heatonresearch.com/aifh/vol3/
Derivatives of the Activation Functions
The backpropagation process requires the derivatives of the activation functions, and they often determine how the backpropagation process will perform. Most modern deep neural networks use the linear, softmax, and ReLU activation functions. We will also examine the derivatives of the sigmoid and hyperbolic tangent activation functions so that we can see why the ReLU activation function performs so well.
Derivative of the Linear Activation Function
The linear activation function is barely an activation function at all because it simply returns whatever value it is given. For this reason, the linear activation function is sometimes called the identity activation function. The derivative of this function is 1, as demonstrated by Equation 6.6:
Equation 6.6: Derivative of the Linear Activation Function
The Greek letter φ (phi) represents the activation function, as in previous chapters. However, the apostrophe just above and to the right of φ (phi) means that we are using the derivative of the activation function. This is one of several ways that a derivative is expressed in mathematical form.
DerivativeoftheSoftmaxActivationFunction
Inthisvolume,thesoftmaxactivationfunction,alongwiththelinearactivationfunction,isusedonlyontheoutputlayeroftheneuralnetworks.AsmentionedinChapter1,“NeuralNetworkBasics,”thesoftmaxactivationfunctionisdifferentfromtheotheractivationfunctionsinthatitsvalueisdependentontheotheroutputneurons,notjustontheoutputneuroncurrentlybeingcalculated.Forconvenience,thesoftmaxactivationfunctionisrepeatedinEquation6.7:
Equation6.7:SoftmaxActivationFunction
Thezvectorrepresentstheoutputfromalloutputneurons.Equation6.8showsthederivativeofthisactivationfunction:
Equation6.8:DerivativeoftheSoftmaxActivationFunction
Weusedslightlydifferentnotationfortheabovederivative.Theratio,withthecursive-stylized“d”symbolmeansapartialderivative,whichoccurswhenyoudifferentiateanequationwithmultiplevariables.Totakeapartialderivative,youdifferentiatetheequationrelativetoonevariable,holdingallothersconstant.Thetop“d”tellsyouwhatfunctionyouaredifferentiating.Inthiscase,itistheactivationfunctionφ(phi).Thebottom“d”denotestherespectivedifferentiationofthepartialderivative.Inthiscase,wearecalculatingtheoutputoftheneuron.Allothervariablesaretreatedasconstant.Aderivativeistheinstantaneousrateofchange—onlyonethingcanchangeatonce.
If you use the cross-entropy error function, you will not use the derivatives of the linear or softmax activation functions to calculate the gradients of the neural network. You should use the linear and softmax activation functions only at the output layer of a neural network, so we do not need to worry about their derivatives for the interior nodes. For the output nodes with cross entropy, the derivative of both linear and softmax is always 1.
Derivative of the Sigmoid Activation Function

Equation 6.9 shows the derivative of the sigmoid activation function:

Equation 6.9: Derivative of the Sigmoid Activation Function

φ'(x) = φ(x)(1 - φ(x))
Machine learning frequently utilizes the sigmoid function represented in the above equation. The formula results from algebraic manipulation of the sigmoid derivative so that the sigmoid activation function appears in its own derivative. This form is computationally efficient: we already calculated the value of the sigmoid function during the feedforward pass, and retaining that value makes the sigmoid derivative a simple calculation. If you are interested in how to obtain Equation 6.9, you can refer to the following URL:
http://www.heatonresearch.com/aifh/vol3/deriv_sigmoid.html
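The reuse of the feedforward value can be sketched as follows (hypothetical helper names of our own, not the book's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative_from_output(s):
    # s is the sigmoid value already computed during the feedforward pass;
    # the derivative reuses it, so no additional exp() call is needed.
    return s * (1.0 - s)
```

For example, at x = 0 the forward value is 0.5 and the derivative is 0.25, the sigmoid derivative's maximum.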
Derivative of the Hyperbolic Tangent Activation Function

Equation 6.10 shows the derivative of the hyperbolic tangent activation function:

Equation 6.10: Derivative of the Hyperbolic Tangent Activation Function

φ'(x) = 1 - φ(x)²

We recommend that you always use the hyperbolic tangent activation function instead of the sigmoid activation function.
Derivative of the ReLU Activation Function

Equation 6.11 shows the derivative of the ReLU function:

Equation 6.11: Derivative of the ReLU Activation Function

φ'(x) = 1 if x > 0; 0 otherwise

Strictly speaking, the ReLU function does not have a derivative at 0. However, by convention, a gradient of 0 is substituted when x is 0. Deep neural networks with sigmoid and hyperbolic tangent activation functions can be difficult to train using backpropagation. Several factors cause this difficulty, and the vanishing gradient problem is one of the most common. Figure 6.3 shows the hyperbolic tangent function, along with its gradient/derivative:
Figure 6.3: Tanh Activation Function & Derivative

Figure 6.3 shows that as the hyperbolic tangent (blue line) saturates to -1 and 1, the derivative of the hyperbolic tangent (red line) vanishes to 0. The sigmoid and hyperbolic tangent activation functions both have this problem, but ReLU doesn't. Figure 6.4 shows the same graph for the sigmoid activation function and its vanishing derivative:

Figure 6.4: Sigmoid Activation Function & Derivative
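To make the contrast concrete, a small sketch (our own, not the book's code) compares the tanh derivative from Equation 6.10, which vanishes as the input saturates, with the ReLU derivative from Equation 6.11, which stays at 1 for any positive input:

```python
import math

def tanh_derivative(x):
    # Equation 6.10: 1 - tanh(x)^2; this vanishes as tanh saturates.
    t = math.tanh(x)
    return 1.0 - t * t

def relu_derivative(x):
    # Equation 6.11, with the usual convention of 0 at x == 0.
    return 1.0 if x > 0.0 else 0.0
```

At an input of 10, the tanh derivative is already vanishingly small, while the ReLU derivative is still exactly 1.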
Applying Backpropagation

Backpropagation is a simple training method that adjusts the weights of the neural network with its calculated gradients. This method is a form of gradient descent, since we are descending the gradients to lower values. As the program adjusts these weights, the neural network should produce more desirable output, and the global error of the neural network should fall as it trains. Before we can examine the backpropagation weight update process, we must examine two different ways to update the weights.

Batch and Online Training

We have already shown how to calculate the gradients for an individual training set element. Earlier in this chapter, we calculated the gradients for a case in which we gave the neural network an input of [1,0] and expected an output of [1]. This result is acceptable for a single training set element. However, most training sets have many elements. Therefore, we can handle multiple training set elements through two approaches called online and batch training.

Online training implies that you modify the weights after every training set element. Using the gradients obtained from the first training set element, you calculate and apply a change to the weights. Training then progresses to the next training set element and calculates another update to the neural network. This training continues until you have used every training set element. At this point, one iteration, or epoch, of training has completed.
Batch training also utilizes all the training set elements. However, we do not update the weights after each element. Instead, we sum the gradients for each training set element. Once we have processed every training set element, we update the neural network weights. At this point, the iteration is complete.
Sometimes, we can set a batch size. For example, you might have a training set of 10,000 elements. You might choose to update the weights of the neural network every 1,000 elements, thereby causing the neural network weights to update ten times during the training iteration.
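The two update schemes can be sketched in Python (illustrative code of our own, not the book's; `gradient` is an assumed callback that returns the gradient array for one training element):

```python
def online_train(weights, training_set, gradient, learning_rate):
    # Online training: update the weights after every training element.
    for element in training_set:
        g = gradient(weights, element)
        weights = [w - learning_rate * gi for w, gi in zip(weights, g)]
    return weights

def batch_train(weights, training_set, gradient, learning_rate):
    # Batch training: sum the gradients over the whole set, then update once.
    total = [0.0] * len(weights)
    for element in training_set:
        g = gradient(weights, element)
        total = [t + gi for t, gi in zip(total, g)]
    return [w - learning_rate * t for w, t in zip(weights, total)]
```

Both loops visit every element once per iteration; they differ only in when the weight update is applied.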
Online training was the original method for backpropagation. If you would like to see the calculations for the batch version of this program, refer to the following online example:
http://www.heatonresearch.com/aifh/vol3/xor_batch.html
Stochastic Gradient Descent

Batch and online training are not the only choices for backpropagation. Stochastic gradient descent (SGD) is the most popular of the backpropagation algorithms. SGD can work in either batch or online mode. Online stochastic gradient descent simply selects a training set element at random, calculates the gradient, and performs a weight update. This process continues until the error reaches an acceptable level. Choosing random training set elements will usually converge to an acceptable weight faster than looping through the entire training set for each iteration.
Batch stochastic gradient descent works by choosing a batch size. For each iteration, a mini-batch is formed by randomly selecting training set elements up to the chosen batch size. The gradients from the mini-batch are summed just as in regular batch updating; the only difference is that the mini-batches are randomly chosen each time they are needed. In SGD, an iteration typically processes a single mini-batch. Batches are usually much smaller than the entire training set size. A common choice for the batch size is 600.
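Forming a random mini-batch can be sketched as follows (illustrative Python of our own; `random.sample` draws without replacement, and sampling with replacement is also a common choice):

```python
import random

def sample_mini_batch(training_set, batch_size, rng=random):
    # Randomly choose up to batch_size elements for one SGD iteration.
    # If the set is smaller than the batch size, use the whole set.
    return rng.sample(training_set, min(batch_size, len(training_set)))
```

Each call produces a fresh random mini-batch, which is exactly what makes the gradient estimate "stochastic."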
Backpropagation Weight Update

We are now ready to update the weights. As previously mentioned, we will treat the weights and gradients as a single-dimensional array. Given these two arrays, we are ready to calculate the weight update for an iteration of backpropagation training. Equation 6.12 shows the formula to update the weights for backpropagation:

Equation 6.12: Backpropagation Weight Update

Δw(t) = -ε · (∂E/∂w) + α · Δw(t-1)
The above equation calculates the change in weight for each element in the weight array. You will also notice that the equation calls for the weight change from the previous iteration, so you must keep these values in another array. As previously mentioned, the direction of the weight update is inversely related to the sign of the gradient: a positive gradient should cause a weight decrease, and vice versa. Because of this inverse relationship, Equation 6.12 begins with a negative.
The above equation calculates the weight delta as the product of the gradient and the learning rate (represented by ε, epsilon). Furthermore, we add the product of the previous weight change and the momentum value (represented by α, alpha). The learning rate and momentum are two parameters that we must provide to the backpropagation algorithm. Choosing values for the learning rate and momentum is very important to the performance of the training. Unfortunately, the process for determining them is mostly trial and error.
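The update in Equation 6.12, including the momentum term, can be sketched as follows (our own illustrative Python, not the book's code):

```python
def backprop_update(weights, gradients, prev_deltas, learning_rate, momentum):
    # Equation 6.12: delta = -(learning rate * gradient)
    #                        + (momentum * previous delta).
    deltas = [-learning_rate * g + momentum * d
              for g, d in zip(gradients, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas  # the deltas are kept for the next iteration
```

Returning the deltas alongside the weights mirrors the need, noted above, to store the previous weight changes in a separate array.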
The learning rate scales the gradient and can slow down or speed up learning. A learning rate below 1 will slow down learning. For example, a learning rate of 0.5 would decrease every gradient by 50%. A learning rate above 1.0 would accelerate training. In practice, the learning rate is almost always below 1.
Choosing a learning rate that is too high will cause your neural network to fail to converge: the global error will simply bounce around instead of converging to a low value. Choosing a learning rate that is too low will cause the neural network to take a great deal of time to converge.
Like the learning rate, the momentum is also a scaling factor. Although it is optional, momentum determines the percentage of the previous iteration's weight change that should be applied to the current iteration. If you do not want to use momentum, just specify a value of 0.
Momentum is a technique added to backpropagation that helps the training escape local minima, which are low points on the error graph that are not the true global minimum. Backpropagation has a tendency to find its way into a local minimum and not find its way back out again. This causes the training to converge to a higher, undesirable error. Momentum gives the neural network some force in its current direction and may allow it to break through a local minimum.
Choosing Learning Rate and Momentum

Momentum and learning rate contribute to the success of the training, but they are not actually part of the neural network. Once training is complete, the trained weights remain, and momentum and learning rate are no longer used. They are essentially part of the temporary scaffolding that creates a trained neural network. Choosing the correct momentum and learning rate can impact the effectiveness of your training.
The learning rate affects the speed at which your neural network trains. Decreasing the learning rate makes the training more meticulous, while higher learning rates might skip past optimal weight settings. A lower learning rate will usually produce better results; however, lowering the learning rate can greatly increase runtime. Lowering the learning rate as the network trains can be an effective technique.
You can use the momentum to combat local minima. If you find the neural network stagnating, a higher momentum value might push the training past the local minimum that it encountered. Ultimately, choosing good values for momentum and learning rate is a process of trial and error. You can vary both as training progresses. Momentum is often set to 0.9 and the learning rate to 0.1 or lower.
Nesterov Momentum

The stochastic gradient descent (SGD) algorithm can sometimes produce erratic results because of the randomness introduced by the mini-batches. The weights might get a very beneficial update in one iteration, but a poor choice of training elements can undo it in the next mini-batch. Momentum is therefore a resourceful tool that can mitigate this sort of erratic training result.
Nesterov momentum is a relatively new application of a technique invented by Yurii Nesterov in 1983 and updated in his book, Introductory Lectures on Convex Optimization: A Basic Course (Nesterov, 2003). This technique is occasionally referred to as Nesterov's accelerated gradient descent. Although a full mathematical explanation of Nesterov momentum is beyond the scope of this book, we will present the weight update in sufficient detail so that you can implement it. This book's examples, including the online JavaScript examples, contain an implementation of Nesterov momentum. Additionally, the book's website contains JavaScript that outputs example calculations for the weight updates of Nesterov momentum.
Equation 6.13 calculates a partial weight update based on both the learning rate (ε, epsilon) and momentum (α, alpha):

Equation 6.13: Nesterov Momentum

The current iteration is signified by t, and the previous iteration by t-1. This partial weight update is called n and initially starts out at 0. Subsequent calculations of the partial weight update are based on its previous value. The partial derivative in the above equation is the gradient of the error function at the current weight. Equation 6.14 shows the Nesterov momentum update that replaces the standard backpropagation weight update shown earlier in Equation 6.12:

Equation 6.14: Nesterov Update

The weight change is calculated as an amplification of the partial weight update. The delta weight shown in the above equation is added to the current weight. Stochastic gradient descent (SGD) with Nesterov momentum is one of the most effective training algorithms for deep learning.
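As a hedged sketch of the idea behind Equations 6.13 and 6.14, here is one common "look-ahead" formulation of Nesterov momentum (our own illustrative Python; the exact placement of terms may differ from the book's equations, and `grad_fn` is an assumed callback that returns the gradient at a given weight vector):

```python
def nesterov_update(weights, velocity, grad_fn, learning_rate, momentum):
    # Look ahead along the momentum direction, evaluate the gradient there,
    # then form the partial (velocity) update and apply it to the weights.
    lookahead = [w + momentum * v for w, v in zip(weights, velocity)]
    g = grad_fn(lookahead)
    velocity = [momentum * v - learning_rate * gi
                for v, gi in zip(velocity, g)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity
```

The velocity plays the role of the partial weight update n: it starts at 0 and each new value is based on its previous value.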
Chapter Summary

This chapter introduced classic backpropagation as well as stochastic gradient descent (SGD). These methods are all based on gradient descent. In other words, they optimize individual weights with derivatives. For a given weight value, the derivative gives the program the slope of the error function, and the slope allows the program to determine how to change the weight value. Each training algorithm interprets this slope, or gradient, differently.
Despite the fact that backpropagation is one of the oldest training algorithms, it remains one of the most popular ones. Backpropagation simply subtracts the scaled gradient from the weight: a negative gradient will increase the weight, and a positive gradient will decrease the weight. We scale the gradient by the learning rate in order to prevent the weights from changing too rapidly. A learning rate of 0.5 means applying half of the gradient to the weight, whereas a learning rate of 2.0 means applying twice the gradient.
There are a number of variants of the backpropagation algorithm. Some of these, such as resilient propagation, are somewhat popular. The next chapter will introduce some backpropagation variants. Though these variants are useful to know, stochastic gradient descent (SGD) remains the most common deep learning training algorithm.
Chapter 7: Other Propagation Training

- Resilient Propagation
- Levenberg-Marquardt
- Hessian and Jacobian Matrices
The backpropagation algorithm has influenced many training algorithms, such as the stochastic gradient descent (SGD) introduced in the previous chapter. For most purposes, the SGD algorithm, along with Nesterov momentum, is a good choice for a training algorithm. However, other options exist. In this chapter, we examine two popular algorithms inspired by elements of backpropagation.

To make use of these two algorithms, you do not need to understand every detail of their implementation. Essentially, both algorithms accomplish the same objective as backpropagation. Thus, you can substitute them for backpropagation or stochastic gradient descent (SGD) in most neural network frameworks. If you find SGD is not converging, you can switch to resilient propagation (RPROP) or the Levenberg-Marquardt algorithm in order to experiment. However, you can skip this chapter if you are not interested in the actual implementation details of either algorithm.
Resilient Propagation

RPROP functions very much like backpropagation. Both backpropagation and RPROP must first calculate the gradients for the weights of the neural network. However, backpropagation and RPROP differ in the way they use the gradients. Riedmiller & Braun (1993) introduced the RPROP algorithm.

One important feature of the RPROP algorithm is that it requires no mandatory training parameters. When you utilize backpropagation, you must specify the learning rate and momentum, and these two parameters can greatly impact the effectiveness of your training. Although RPROP does include a few training parameters, you can almost always leave them at their defaults.
The RPROP algorithm has several variants. Some of the variants are listed below:

- RPROP+
- RPROP-
- iRPROP+
- iRPROP-
We will focus on classic RPROP, as described by Riedmiller & Braun (1993). The four variants listed above are relatively minor adaptations of classic RPROP. In the next sections, we will describe how to implement the classic RPROP algorithm.
RPROP Arguments

As previously mentioned, one advantage RPROP has over backpropagation is that you don't need to provide any training arguments in order to use RPROP. However, this doesn't mean that RPROP lacks configuration settings. It simply means that you usually do not need to change the configuration settings for RPROP from their defaults. However, if you really want to change them, you can choose among the following configuration settings:

- Initial Update Values
- Maximum Step
As you will see in the next section, RPROP keeps an array of update values for the weights, which determines how much you will alter each weight. This is similar to the learning rate in backpropagation, but it is much better because the algorithm adjusts the update value of every weight in the neural network as training progresses. Although some backpropagation algorithms will vary the learning rate and momentum as learning progresses, most will use a single learning rate for the entire neural network. Therefore, the RPROP approach has an advantage over backpropagation algorithms.

We start these update values at the default of 0.1, according to the initial update values argument. As a general rule, we should never change this default. However, we can make an exception to this rule if we have already trained the neural network. In the case of a previously trained neural network, some of the initial update values are going to be too strong, and the neural network will regress for many iterations before it can improve. As a result, a trained neural network may benefit from a much smaller initial update.
Another approach for an already trained neural network is to save the update values once training stops and use them for the new training. This method will allow you to resume training without the initial spike in errors that you would normally see when resuming resilient propagation training. This approach will only work if you are continuing resilient propagation on an already trained network. If you were previously training the neural network with a different training algorithm, then you will not have an array of update values to restore from.
As training progresses, you will use the gradients to adjust the update values up and down. The maximum step argument defines the maximum upward step that the gradient can take over the update values. The default value for the maximum step argument is 50. It is unlikely that you will need to change the value of this argument.

In addition to these arguments, RPROP keeps constants during processing. These are values that you can never change. The constants are listed as follows:
- Delta Minimum (1e-6)
- Negative η (Eta) (0.5)
- Positive η (Eta) (1.2)
- Zero Tolerance (1e-16)
Delta minimum specifies the minimum value that any of the update values can reach. This floor is necessary because if an update value were ever at 0, it would never be able to increase beyond 0. We will describe negative and positive η (eta) in the next sections.

The zero tolerance defines how close a number must be to 0 before that number is treated as 0. In computer programming, it is typically bad practice to compare a floating-point number to 0 because the number would have to equal 0 exactly. Rather, you typically see if the absolute value of a number is below an arbitrarily small number. A sufficiently small number is considered 0.
Data Structures

You must keep several data structures in memory while you perform RPROP training. These structures are all arrays of floating-point numbers. They are summarized here:

- Current Update Values
- Last Weight Change Values
- Current Weight Change Values
- Current Gradient Values
- Previous Gradient Values

You keep the current update values for the training. If you want to resume training at some point, you must store this update value array. Each weight has one update value that cannot go below the minimum delta constant. Likewise, these update values cannot exceed the maximum step argument.

RPROP must keep several values between iterations. You must also track the last weight delta value. Backpropagation keeps the previous weight delta for momentum; RPROP uses this delta value in a different way that we will examine in the next section. You also need the current and previous gradients. RPROP needs to know when the sign changes from the previous gradient to the current gradient. This change indicates that you must act on the update values. We will discuss these actions in the next section.
Understanding RPROP

In the previous sections, we examined the arguments, constants, and data structures necessary for RPROP. In this section, we will show you an iteration of RPROP. When we discussed backpropagation in earlier sections, we mentioned the online and batch weight update methods. However, RPROP does not support online training, so all weight updates for RPROP will be performed in batch mode. As a result, each iteration of RPROP will receive gradients that are the sum of the individual gradients of each training set element. This aspect is consistent with backpropagation in batch mode.
Determine Sign Change of Gradient

At this point, we have gradients that are the same as the gradients calculated by the backpropagation algorithm. Because we use the same process to obtain gradients in both RPROP and backpropagation, we will not repeat it here. For the first step, we compare the gradient of the current iteration to the gradient of the previous iteration. If there is no previous iteration, then we can assume that the previous gradient was 0.

To determine whether the gradient sign has changed, we will use the sign (sgn) function. Equation 7.1 defines the sgn function:

Equation 7.1: The Sign Function (sgn)

sgn(x) = -1 if x < 0; 0 if x = 0; 1 if x > 0

The sgn function returns the sign of the number provided. If x is less than 0, the result is -1. If x is greater than 0, then the result is 1. If x is equal to 0, then the result is 0. We usually implement the sgn function to use a tolerance for 0, since it is nearly impossible for floating-point operations to hit 0 precisely on a computer.
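A sketch of the sgn function with the zero tolerance applied (our own illustrative Python; the tolerance value matches the RPROP constant listed earlier):

```python
def sgn(x, zero_tolerance=1e-16):
    # Treat values within the tolerance of zero as exactly zero, since
    # floating-point operations rarely hit 0 precisely.
    if abs(x) < zero_tolerance:
        return 0
    return 1 if x > 0 else -1
```

With the tolerance, a tiny residual value such as 1e-20 is reported as 0 rather than as a positive sign.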
To determine whether the gradient has changed sign, we use Equation 7.2:

Equation 7.2: Determine Gradient Sign Change

c = g(t) · g(t-1), where g(t) is the current gradient and g(t-1) the previous gradient

Equation 7.2 will result in a constant c. We evaluate this value as negative, positive, or close to 0. A negative value for c indicates that the sign has changed. A positive value indicates that there is no change in sign for the gradient. A value near 0 indicates a very small change in sign, or almost no change in sign.
Consider the following situations for these three outcomes:

- -1 * 1 = -1 (negative, changed from negative to positive)
- 1 * 1 = 1 (positive, no change in sign)
- 1.0 * 0.000001 = 0.000001 (near zero, almost changed signs, but not quite)

Now that we have calculated the constant c, which gives some indication of sign change, we can calculate the weight change. The next section includes a discussion of this calculation.
Calculate Weight Change

Now that we have the change in sign of the gradient, we can observe what happens in each of the three cases mentioned in the previous section. Equation 7.3 summarizes these three cases:

Equation 7.3: Calculate RPROP Weight Change

This equation calculates the actual weight change for each iteration. If the value of c is positive, then the weight change will be equal to the negative of the weight update value. Similarly, if the value of c is negative, the weight change will be equal to the positive of the weight update value. Finally, if the value of c is near 0, there will be no weight change.
Modify Update Values

We use the weight update values from the previous section to update the weights of the neural network. Every weight in the neural network has a separate weight update value, which works much better than the single learning rate of backpropagation. We modify these weight update values during each training iteration, as seen in Equation 7.4:

Equation 7.4: Modify Update Values

Δ(t) = η+ · Δ(t-1) if c > 0; η- · Δ(t-1) if c < 0; Δ(t-1) if c ≈ 0

We modify the weight update values in a way that is very similar to the changes of the weights. We base these weight update values on the previously calculated value c, just like the weights.

If the value of c is positive, then we multiply the weight update value by positive η (eta). Similarly, if the value of c is negative, we multiply the weight update value by negative η (eta). Finally, if the value of c is near 0, then we don't change the weight update value.
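Putting Equations 7.2 through 7.4 together, here is a sketch of one classic-RPROP-style update for a single weight (our own illustrative code, not the book's; the handling of the sign-change case differs slightly among the RPROP variants listed earlier):

```python
def sgn(x, tol=1e-16):
    if abs(x) < tol:
        return 0
    return 1 if x > 0 else -1

def rprop_step(weight, gradient, prev_gradient, update_value,
               eta_minus=0.5, eta_plus=1.2,
               delta_min=1e-6, max_step=50.0):
    # One weight's RPROP update. c indicates whether the gradient
    # changed sign since the previous iteration (Equation 7.2).
    c = gradient * prev_gradient
    if c > 0:
        # Same sign: accelerate by growing the update value (Equation 7.4).
        update_value = min(update_value * eta_plus, max_step)
        delta = -sgn(gradient) * update_value
    elif c < 0:
        # Sign change: we overshot a minimum, so shrink the update value
        # and skip the weight change for this iteration.
        update_value = max(update_value * eta_minus, delta_min)
        delta = 0.0
        gradient = 0.0  # forces the "near zero" case next iteration
    else:
        # Near zero: keep the update value and step by the gradient's sign.
        delta = -sgn(gradient) * update_value
    return weight + delta, update_value, gradient
```

The returned update value and gradient are the values that must be stored between iterations, as described in the Data Structures section.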
The JavaScript example site for this book has examples of the RPROP update, as well as examples of the previous equations and sample calculations.
Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm (LMA) is a very efficient training method for neural networks, and in many cases LMA will outperform RPROP. As a result, every neural network programmer should consider this training algorithm. Levenberg (1944) introduced the foundation for the LMA, and Marquardt (1963) expanded its methods.
LMA is a hybrid algorithm that is based on the Gauss-Newton algorithm (GNA) and on gradient descent (backpropagation). Thus, LMA combines the strengths of GNA and backpropagation. Although gradient descent is guaranteed to converge to a local minimum, it is slow. Newton's method is fast, but it often fails to converge. By using a damping factor to interpolate between the two, we create a hybrid method. To understand how this hybrid works, we will first examine Newton's method. Equation 7.5 shows Newton's method:

Equation 7.5: Newton's Method (GNA)

w(t+1) = w(t) - H^(-1) g
You will notice several variables in the above equation. The result of the equation is a set of deltas that you can apply to the weights of the neural network. The variable H represents the Hessian, which we will discuss in the next section. The variable g represents the gradients of the neural network. You will also notice the -1 "exponent" on the variable H, which denotes the matrix inverse; in practice, we compute this product by a matrix decomposition of H and g rather than by inverting H directly.
We could easily spend an entire chapter on matrix decomposition. However, we will simply treat matrix decomposition as a black-box atomic operator for the purposes of this book. Because we will not explain how to calculate matrix decomposition, we have included a common piece of code taken from the JAMA package. Many mathematical computer applications have used this public domain code, adapted from a FORTRAN program. To perform matrix decomposition, you can use JAMA or another source.
Although several types of matrix decomposition exist, we are going to use the LU decomposition, which requires a square matrix. This decomposition works well because the Hessian matrix has the same number of rows as columns: every weight in the neural network has a row and a column. The Hessian is a matrix of second partial derivatives taken with respect to each pair of weights. The LU decomposition solves the linear system formed by the Hessian and the gradients. These gradients are the same as those that we calculated in Chapter 6, "Backpropagation Training," except that they are taken of the squared error. Because the errors are squared, we must use the sum of squares error when dealing with LMA.
Second derivative is an important term to know: it is the derivative of the first derivative. Recall from Chapter 6, "Backpropagation Training," that the derivative of a function is the slope at any point. This slope shows the direction in which the curve descends toward a local minimum. The second derivative is also a slope, and it points in a direction that minimizes the first derivative. The goal of Newton's method, as well as of the LMA, is to reduce all of the gradients to 0.

It's interesting to note that this goal does not directly involve the error. Newton's method and LMA can be oblivious to the error because they try to reduce all the gradients to 0. In reality, they are not completely oblivious to the error, because they use it to calculate the gradients.
Newton’smethodwillconvergetheweightsofaneuralnetworktoalocalminimum,alocalmaximum,orastraddleposition.Weachievethisconvergencebyminimizingallthegradients(firstderivatives)to0.Thederivativeswillbe0atlocalminima,maxima,or
straddleposition.Figure7.1showsthesethreepoints:
Figure7.1:LocalMinimum,StraddleandLocalMaximum
Thealgorithmimplementationmustensurethatlocalmaximaandstraddlepointsarefilteredout.TheabovealgorithmworksbytakingthematrixdecompositionoftheHessianmatrixandthegradients.TheHessianmatrixistypicallyestimated.SeveralmethodsexisttoestimatetheHessianmatrix.However,ifitisinaccurate,itcanharmNewton’smethod.
LMAenhancesNewton’salgorithmtothefollowingformulainEquation7.6:
Equation7.6:Levenberg–MarquardtAlgorithm
Inthisequation,weaddadampingfactormultipliedbyanidentitymatrix.Thedampingfactorisrepresentedbyλ(lambda),andIrepresentstheidentitymatrix,whichisasquarematrixwithallthevaluesat0exceptforanorthwest(NW)lineofvaluesat1.Aslambdaincreases,theHessianwillbefactoredoutoftheaboveequation.Aslambdadecreases,theHessianbecomesmoresignificantthangradientdescent,allowingthetrainingalgorithmtointerpolatebetweengradientdescentandNewton’smethod.Higherlambdafavorsgradientdescent;lowerlambdafavorsNewton.AtrainingiterationofLMAbeginswithalowlambdaandincreasesituntiladesirableoutcomeisproduced.
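Equation 7.6 can be sketched as follows. The small Gaussian-elimination solver below stands in for the LU decomposition discussed earlier (illustrative code of our own, not the JAMA routine):

```python
def solve(a, b):
    # Solve the linear system a·x = b by Gaussian elimination with
    # partial pivoting (a stand-in for the LU decomposition).
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c]
                              for c in range(r + 1, n))) / m[r][r]
    return x

def lma_weight_delta(hessian, gradients, lam):
    # Equation 7.6: solve (H + lambda·I)·delta = g for the weight deltas.
    n = len(gradients)
    damped = [[hessian[i][j] + (lam if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve(damped, gradients)
```

With lambda near 0 the deltas approach the pure Newton step; with a very large lambda each delta approaches g/lambda, a small gradient-descent-like step.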
Calculation of the Hessian

The Hessian matrix is a square matrix with rows and columns equal to the number of weights in the neural network. Each cell in this matrix represents the second-order derivative with respect to a given weight combination. Equation 7.7 shows the Hessian:

Equation 7.7: The Hessian Matrix
It is important to note that the Hessian is symmetrical about its diagonal, which you can use to enhance the performance of the calculation. Equation 7.8 calculates the gradients:

Equation 7.8: Calculating the Gradients

g_i = 2 Σ_t e_t · (∂y_t/∂w_i)

Here e_t is the error and y_t the network output for training case t. The second derivative of the above equation becomes an element of the Hessian matrix. You can use Equation 7.9 to calculate it:

Equation 7.9: Calculating the Exact Hessian

H_ij = 2 Σ_t [ (∂y_t/∂w_i)(∂y_t/∂w_j) + e_t · (∂²y_t/∂w_i∂w_j) ]
If not for the second component, you could easily calculate the above formula. This second component involves the second partial derivative, which is difficult to calculate. Fortunately, you can actually drop it, because its value does not significantly contribute to the outcome. While the second partial derivative might be important for an individual training case, its overall contribution is not significant. The second component of Equation 7.9 is multiplied by the error of that training case, and we assume that the errors in a training set are independent and evenly distributed about 0. Over an entire training set, they should essentially cancel each other out. Because we are not using all components of the second derivative, we have only an approximation of the Hessian, but it is sufficient to get a good training result.
Equation 7.10 uses this approximation, resulting in the following:

Equation 7.10: Approximating the Exact Hessian

H_ij ≈ 2 Σ_t (∂y_t/∂w_i)(∂y_t/∂w_j)
While the above equation is only an approximation of the true Hessian, the simplification of the algorithm to calculate the second derivative is well worth the loss in accuracy. In fact, an increase in λ (lambda) will account for the loss of accuracy.

To calculate the Hessian and gradients, we must determine the partial first derivatives of the output of the neural network. Once we have these partial first derivatives, the above equations allow us to easily calculate the Hessian and gradients.
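The approximation in Equation 7.10 amounts to summing products of first derivatives over the training cases. Here is a sketch (our own Python; whether the factor of 2 from the squared error is folded into the Hessian or elsewhere varies by convention, and it is omitted below):

```python
def approximate_hessian(jacobian):
    # jacobian[t][w]: derivative of the network output for training case t
    # with respect to weight w. The approximate Hessian sums the products
    # of first derivatives over all cases; the dropped second-order term
    # is the approximation discussed above.
    n = len(jacobian[0])
    h = [[0.0] * n for _ in range(n)]
    for row in jacobian:
        for i in range(n):
            for j in range(n):
                h[i][j] += row[i] * row[j]
    return h
```

Notice that the result is symmetrical about its diagonal, as the text points out for the exact Hessian.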
Calculation of the first derivatives of the output of the neural network is very similar to the process that we used to calculate the gradients for backpropagation. The main difference is that we take the derivative of the output; in standard backpropagation, we take the derivative of the error function. We will not review the entire backpropagation process here. Chapter 6, "Backpropagation Training," covers backpropagation and gradient calculation.
LMA with Multiple Outputs

Some implementations of LMA support only a single output neuron because LMA has roots in mathematical function approximation. In mathematics, functions typically return only a single value. As a result, many books and papers do not contain discussions of multiple-output LMA. However, you can use LMA with multiple outputs.

Support for multiple output neurons involves summing each cell of the Hessian as you calculate the additional output neurons. The process works as if you calculated a separate Hessian matrix for each output neuron and then summed the Hessian matrices together. Encog (Heaton, 2015) uses this approach, and it leads to fast convergence times.
You need to realize that, with multiple outputs, you will not use every connection. You will need to calculate an update for the weights of each output neuron independently. Depending on the output neuron you are currently calculating, there will be unused connections belonging to the other output neurons. Therefore, you must set the partial derivative for each of these unused connections to 0 when you are calculating the other output neurons.
For example, consider a neural network that has two output neurons and three hidden neurons. Each of these two output neurons would have a total of four connections from the hidden layer: three connections result from the three hidden neurons, and a fourth comes from the bias neuron. This segment of the neural network would resemble Figure 7.2:

Figure 7.2: Calculating Output Neuron 1

Here we are calculating output neuron 1. Notice that output neuron 2 has four connections that must have their partial derivatives treated as 0. Because we are calculating output 1 as the current neuron, it uses only its normal partial derivatives. You can repeat this process for each output neuron.
Overview of the LMA Process

So far, we have examined only the math behind LMA. To be effective, LMA must be part of an algorithm. The following steps summarize the LMA process:

1. Calculate the first derivative of the output of the neural network with respect to every weight.
2. Calculate the Hessian.
3. Calculate the gradients of the error (ESS) with respect to every weight.
4. Either set lambda to a low value (first iteration) or to the lambda of the previous iteration.
5. Save the weights of the neural network.
6. Calculate the delta weights based on the lambda, gradients, and Hessian.
7. Apply the deltas to the weights and evaluate the error.
8. If the error has improved, end the iteration.
9. If the error has not improved, increase lambda (up to a max lambda), restore the weights, and go back to step 6.

As you can see, the process for LMA revolves around setting the lambda value low and then slowly increasing it if the error rate does not improve. You must save the weights at each change in lambda so that you can restore them if the error does not improve.
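The nine steps above can be sketched for the one-weight (scalar) case, where Equation 7.6 reduces to a simple division; `gradient_fn`, `hessian_fn`, and `error_fn` are assumed callbacks supplied by the surrounding framework (our own illustrative code):

```python
def lma_iteration(weight, gradient_fn, hessian_fn, error_fn,
                  lam=0.001, lam_factor=10.0, max_lam=1e7):
    # One LMA training iteration for a single weight.
    g = gradient_fn(weight)          # steps 1-3: derivatives and gradients
    h = hessian_fn(weight)
    start_error = error_fn(weight)
    saved = weight                   # step 5: save the weights
    while lam <= max_lam:            # steps 4 and 9: the lambda loop
        delta = g / (h + lam)        # step 6: scalar form of Equation 7.6
        trial = saved - delta        # step 7: apply the delta
        if error_fn(trial) < start_error:
            return trial, lam        # step 8: error improved, end iteration
        lam *= lam_factor            # step 9: raise lambda and retry
    return saved, lam                # give up on this iteration
```

Starting with a low lambda gives the fast Newton-like step a chance first; only if that step fails to improve the error does the algorithm back off toward gradient descent.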
Chapter Summary

Resilient propagation (RPROP) solves two limitations of simple backpropagation. First, the program assigns each weight a separate learning rate, allowing the weights to learn at different speeds. Second, RPROP recognizes that while the gradient's sign is a great indicator of the direction to move the weight, the size of the gradient does not indicate how far to move. Additionally, while the programmer must determine an appropriate learning rate and momentum for backpropagation, RPROP sets similar arguments automatically.

Genetic algorithms (GAs) are another means of training neural networks. There is an entire family of neural networks that use GAs to evolve every aspect of the neural network, from the weights to the overall structure. This family includes the NEAT, CPPN and HyperNEAT neural networks that we will discuss in the next chapter. The GA used by NEAT, CPPN and HyperNEAT is not just another training algorithm, because these neural networks introduce a new architecture based on the feedforward neural networks examined so far in this book.
Chapter 8: NEAT, CPPN & HyperNEAT

- NEAT
- Genetic Algorithms
- CPPN
- HyperNEAT
In this chapter, we discuss three closely related neural network technologies: NEAT, CPPN and HyperNEAT. Kenneth Stanley's EPLEX group at the University of Central Florida conducts extensive research on all three technologies. Information about their current research can be found at the following URL:
http://eplex.cs.ucf.edu/
NeuroEvolution of Augmenting Topologies (NEAT) is an algorithm that evolves neural network structures with genetic algorithms. The compositional pattern-producing network (CPPN) is a type of evolved neural network that can create other structures, such as images or other neural networks. Hypercube-based NEAT, or HyperNEAT, a type of CPPN, also evolves other neural networks. Once HyperNEAT trains the networks, they can easily handle much higher resolutions of their dimensions.
Many different frameworks support NEAT and HyperNEAT. For Java and C#, we recommend our own Encog implementation, which can be found at the following URL:
http://www.encog.org
You can find a complete list of NEAT implementations at Kenneth Stanley's website:
http://www.cs.ucf.edu/~kstanley/neat.html
KennethStanley’swebsitealsoincludesacompletelistofHyperNEATimplementations:
http://eplex.cs.ucf.edu/hyperNEATpage/
For the remainder of this chapter, we will explore each of these three network types.
NEAT Networks

NEAT is a neural network structure developed by Stanley and Miikkulainen (2002). NEAT optimizes both the structure and weights of a neural network with a genetic algorithm (GA). The input and output of a NEAT neural network are identical to a typical feedforward neural network, as seen in previous chapters of this book.

A NEAT network starts out with only bias neurons, input neurons, and output neurons. Generally, none of the neurons have connections at the outset. Of course, a completely unconnected network is useless. NEAT makes no assumptions about whether certain input neurons are actually needed. An unneeded input is said to be statistically independent of the output. NEAT will often discover this independence by never evolving optimal genomes that connect to that statistically independent input neuron.
Another important difference between a NEAT network and an ordinary feedforward neural network is that, other than the input and output layers, NEAT networks do not have clearly defined hidden layers. The hidden neurons do not organize themselves into clearly delineated layers. One similarity between NEAT and feedforward networks is that they both use a sigmoid activation function. Figure 8.1 shows an evolved NEAT network:
Figure 8.1: NEAT Network
Input 2 in the above image never formed any connections because the evolutionary process determined that input 2 was unnecessary. A recurrent connection also exists between hidden 3 and hidden 2. Hidden 4 has a recurrent connection to itself. Overall, you will note that a NEAT network lacks a clear delineation of layers.
You can calculate a NEAT network in exactly the same way as you do for a regular weighted feedforward network. You can manage the recurrent connections by running the NEAT network multiple times. This works by having the recurrent connection input start at 0 and updating it each time you cycle through the NEAT network. Additionally, you must define a hyper-parameter to specify the number of times to calculate the NEAT network. Figure 8.2 shows recurrent link calculation when a NEAT network is instructed to cycle three times to calculate recurrent connections:
Figure 8.2: Cycling to Calculate Recurrences
The above diagram shows the outputs from each neuron, over each connection, for three cycles. The dashed lines indicate the additional connections. For simplicity, the diagram doesn't show the weights. The purpose of Figure 8.2 is to show that the recurrent output stays one cycle behind.
For the first cycle, the recurrent connection provides a 0 to the first neuron because neurons are calculated left to right. The first cycle has no value for the recurrent connection. For the second cycle, the recurrent connection now has the output 0.3, which the first cycle provided. Cycle 3 follows the same pattern, taking the 0.5 output from cycle 2 as the recurrent connection's output. Since there would be other neurons in the calculation, we have contrived these values, which the dashed arrows show at the bottom. However, Figure 8.2 does illustrate that the recurrent connections carry values from previous cycles.
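The cycling scheme can be sketched with a single contrived neuron that has a self-recurrent link. This is a minimal illustration, not Encog's code; the neuron, weights, and function names are invented for the example:

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def cycle_neuron(input_value, w_in, w_rec, cycles):
    """Return the output of a self-recurrent neuron after several cycles.

    The recurrent input starts at 0 and always lags one cycle behind the
    neuron's own output, just as described for Figure 8.2."""
    recurrent = 0.0              # cycle 1 sees 0 on the recurrent link
    output = 0.0
    for _ in range(cycles):
        output = sigmoid(input_value * w_in + recurrent * w_rec)
        recurrent = output       # the next cycle uses this cycle's output
    return output
```

On the first cycle the recurrent term contributes nothing; each later cycle feeds the previous cycle's output back in.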
NEAT networks extensively use genetic algorithms, which we examined in Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms. Although you do not need to understand genetic algorithms completely to follow the discussion of them in this chapter, you can refer to Volume 2 as needed.
NEAT uses a typical genetic algorithm that includes:
Mutation – The program chooses one fit individual to create a new individual that has a random change from its parent.
Crossover – The program chooses two fit individuals to create a new individual that has a random sampling of elements from both parents.
All genetic algorithms engage the mutation and crossover genetic operators with a population of individual solutions. Mutation and crossover choose with greater probability the solutions that receive higher scores from an objective function. We explore mutation and crossover for NEAT networks in the next two sections.
NEAT Mutation
NEAT mutation consists of several mutation operations that can be performed on the parent genome. We discuss these operations here:
Add a neuron: By selecting a random link, we can add a neuron. A new neuron and two links replace this random link. The new neuron effectively splits the link. The program selects the weights of each of the two new links to provide nearly the same effective output as the link being replaced.
Add a link: The program chooses a source and destination, or two random neurons. The new link will be between these two neurons. Bias neurons can never be a destination. Output neurons cannot be a source. There will never be two links in the same direction between the same two neurons.
Remove a link: Links can be randomly selected for removal. Hidden neurons, which are neurons that are not input, output, or the single bias neuron, can also be removed if no remaining links interact with them.
Perturb a weight: Choose a random link, and multiply its weight by a number drawn from a normal random distribution with a standard deviation of 1 or lower. A standard deviation of 1 or lower keeps most sampled multipliers close to 1, and smaller perturbations will usually cause a quicker convergence.
You can increase the probability of the weight perturbation mutation so that it occurs more frequently, thereby allowing fit genomes to vary their weights and further adapt through their children. The structural mutations happen with much less frequency. You can adjust the exact frequency of each operation with most NEAT implementations.
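Two of the mutations above can be sketched on a toy genome of (from, to, weight) link tuples. The representation and helper names here are contrived for illustration, not Encog's; the weight convention for splitting a link (1.0 in, old weight out) is a common NEAT choice that approximately preserves the link's effect:

```python
import random

def add_neuron_mutation(links, next_neuron_id, rnd=random):
    """Split a randomly chosen link with a new neuron.

    The incoming link gets weight 1.0 and the outgoing link inherits
    the old weight, so the new structure initially behaves much like
    the replaced link.  Returns the next free neuron id."""
    old = rnd.choice(links)
    links.remove(old)
    frm, to, weight = old
    links.append((frm, next_neuron_id, 1.0))      # into the new neuron
    links.append((next_neuron_id, to, weight))    # out of the new neuron
    return next_neuron_id + 1

def perturb_weight_mutation(links, sigma=1.0, rnd=random):
    """Multiply one random link's weight by a normally distributed factor."""
    i = rnd.randrange(len(links))
    frm, to, weight = links[i]
    links[i] = (frm, to, weight * rnd.gauss(1.0, sigma))
```

A real implementation would also record the innovations produced by the structural mutation, as described in the next section.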
NEAT Crossover
NEAT crossover is more complex than in many genetic algorithms because the NEAT genome encodes both the neurons and the connections of an individual. Most genetic algorithms assume that the number of genes is consistent across all genomes in the population. In fact, child genomes in NEAT that result from both mutation and crossover may have a different number of genes than their parents. Managing this discrepancy requires some ingenuity when you implement the NEAT crossover operation.
NEAT keeps a database of all the changes made to a genome through mutation. These changes are called innovations, and they exist in order to implement mutations. Each time an innovation is added, it is given an ID. These IDs are also used to order the innovations. We will see that it is important to select the innovation with the lower ID when choosing between two innovations.
It is important to realize that the relationship between innovations and mutations is not one to one. It can take several innovations to achieve one mutation. The only two types of innovation are creating a neuron and creating a link between two neurons. One mutation might result from multiple innovations. Additionally, a mutation might not have any innovations. Only mutations that add to the structure of the network will generate innovations. The following list summarizes the innovations that the previously mentioned mutation types could potentially create.
Add a neuron: One new neuron innovation and two new link innovations
Add a link: One new link innovation
Remove a link: No innovations
Perturb a weight: No innovations
You also need to note that NEAT will not recreate innovation records if you have already attempted this type of innovation. Furthermore, innovations do not contain any weight information; innovations only contain structural information.
Crossover for two genomes occurs by considering the innovations, and this trait allows NEAT to ensure that all prerequisite innovations are also present. A naïve crossover, such as those that many genetic algorithms use, would potentially combine links with nonexistent neurons. Listing 8.1 shows the entire NEAT crossover function in pseudocode:
Listing 8.1: NEAT Crossover
def neat_crossover(rnd, mom, dad):
  # Choose the best genome (by objective function); if tied, choose randomly.
  best = favor_parent(rnd, mom, dad)
  not_best = dad if best == mom else mom

  selected_links = []
  selected_neurons = []

  # Current gene index for mom and dad.
  cur_mom = 0
  cur_dad = 0

  # Add the input, output, and bias neurons; they are always present.
  always_count = mom.input_count + mom.output_count + 1
  for i from 0 to always_count - 1:
    selected_neurons.add(i, best, not_best)

  # Loop over all genes in both mother and father.
  while (cur_mom < mom.num_genes) or (cur_dad < dad.num_genes):
    # The mom and dad gene objects.
    mom_gene = None
    mom_innovation_id = -1
    dad_gene = None
    dad_innovation_id = -1
    selected_gene = None

    # Grab the actual objects from mom and dad for the specified
    # indexes; if there are none, leave them as None.
    if cur_mom < mom.num_genes:
      mom_gene = mom.links[cur_mom]
      mom_innovation_id = mom_gene.innovation_id
    if cur_dad < dad.num_genes:
      dad_gene = dad.links[cur_dad]
      dad_innovation_id = dad_gene.innovation_id

    # Now select a gene from mom or dad. This gene is for the baby.
    # Dad gene only; mom has run out.
    if mom_gene == None and dad_gene != None:
      cur_dad = cur_dad + 1
      selected_gene = dad_gene
    # Mom gene only; dad has run out.
    elif dad_gene == None and mom_gene != None:
      cur_mom = cur_mom + 1
      selected_gene = mom_gene
    # Mom has the lower innovation number.
    elif mom_innovation_id < dad_innovation_id:
      cur_mom = cur_mom + 1
      if best == mom:
        selected_gene = mom_gene
    # Dad has the lower innovation number.
    elif dad_innovation_id < mom_innovation_id:
      cur_dad = cur_dad + 1
      if best == dad:
        selected_gene = dad_gene
    # Mom and dad have the same innovation number.
    # Flip a coin.
    elif dad_innovation_id == mom_innovation_id:
      cur_dad = cur_dad + 1
      cur_mom = cur_mom + 1
      if rnd.next_double() > 0.5:
        selected_gene = dad_gene
      else:
        selected_gene = mom_gene

    # If a gene was chosen for the child, then process it.
    # If not, the loop continues.
    if selected_gene != None:
      # Do not add the same innovation twice in a row.
      if selected_links.count == 0:
        selected_links.add(selected_gene)
      elif selected_links[selected_links.count - 1]
          .innovation_id != selected_gene.innovation_id:
        selected_links.add(selected_gene)

      # Check if we already have the neurons referred to in
      # selected_gene. If not, they need to be added.
      selected_neurons.add(
        selected_gene.from_neuron_id, best, not_best)
      selected_neurons.add(
        selected_gene.to_neuron_id, best, not_best)

  # Done looping over the parents' genes.
  baby = new NEATGenome(selected_links, selected_neurons)
  return baby
The above implementation of crossover is based on the NEAT crossover operator implemented in Encog. We provide the above comments in order to explain the critical sections of code. The primary evolution occurs on the links contained in the mother and father. Any neurons needed to support these links are brought along when the child genome is created. The code contains a main loop that iterates over both parents, selecting the most suitable link gene from each. The link genes from both parents are essentially stitched together. Because the parents might be different lengths, one will likely exhaust its genes before this process is complete.
Each time through the loop, a gene is chosen from either the mother or father according to the following criteria:
If mom or dad has run out, choose the other. Move past the chosen gene.
If mom has a lower innovation ID number, choose mom if she has the best score. In either case, move past mom's gene.
If dad has a lower innovation ID number, choose dad if he has the best score. In either case, move past dad's gene.
If mom and dad have the same innovation ID, pick one randomly, and move past their gene.
You can consider that the mother and father's genes are both on a long tape. A marker for each tape holds the current position. According to the rules above, the marker will move past a parent's gene. At some point, each parent's marker moves to the end of the tape, and that parent runs out of genes.
NEAT Speciation
Crossover is tricky for computers to perform properly. In the animal and plant kingdoms, crossover occurs only between members of the same species. What exactly do we mean by species? In biology, scientists define a species as members of a population that can produce viable offspring. Therefore, a crossover between a horse and hummingbird genome would be catastrophically unsuccessful. Yet a naive genetic algorithm would certainly try something just as disastrous with artificial computer genomes!
The NEAT speciation algorithm has several variants. In fact, one of the most advanced variants can group the population into a predefined number of clusters with a type of k-means clustering. You can subsequently determine the relative fitness of each species. The program gives each species a percentage of the next generation's population count. The members of each species then compete in virtual tournaments to determine which members of the species will be involved in crossover and mutation for the next generation.
A tournament is an effective way to select parents from a species. The program performs a certain number of trials; typically, we use five. For each trial, the program selects two random genomes from the species, and the fitter of the two advances to the next trial. This process is very efficient for threading, and it is also biologically plausible. The advantage of this selection method is that the winner doesn't have to beat the best genome in the species; it only has to beat the best genome encountered in its trials. You must run a tournament for each parent needed. Mutation requires one parent, and crossover needs two parents.
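One common form of tournament selection can be sketched as follows. The genome representation and the `fitness` callable are placeholders, and this is an illustration, not Encog's exact code:

```python
import random

def tournament_select(members, fitness, trials=5, rnd=random):
    """Pick a parent from a species by running a series of trials.

    Each trial draws a random member of the species, and the fitter of
    the reigning winner and the challenger advances to the next trial.
    The final winner only has to beat the genomes it happened to meet,
    not the best genome in the species."""
    best = rnd.choice(members)
    for _ in range(trials - 1):
        challenger = rnd.choice(members)
        if fitness(challenger) > fitness(best):
            best = challenger
    return best
```

Because each call touches only a handful of genomes, many tournaments can run in parallel, which is why this method threads so well.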
In addition to the trials, several other factors determine the species members chosen for mutation and crossover. The algorithm will always carry one or more elite genomes to the next generation. The number of elite genomes is configurable. The program gives younger genomes a bonus so they have a chance to try new innovations. Interspecies crossover will occur with a very low probability.
All of these factors together make NEAT a very effective neural network type. NEAT removes the need to define how the hidden layers of a neural network are structured. The absence of a strict structure of hidden layers allows NEAT neural networks to evolve the connections that are actually needed.
CPPN Networks
The compositional pattern-producing network (CPPN) was invented by Stanley (2007) and is a variation of the artificial neural network. CPPN recognizes one biologically plausible fact: in nature, genotypes and phenotypes are not identical. The genotype is the DNA blueprint for an organism. The phenotype is what actually results from that plan.
In nature, the genome is the set of instructions for producing a phenotype that is much more complex than the genotype. In the original NEAT, as seen in the last section, the genome describes link for link and neuron for neuron how to produce the phenotype. However, CPPN is different because it creates a population of special NEAT genomes. These genomes are special in two ways. First, CPPN doesn't have the limitations of regular NEAT, which always uses a sigmoid activation function. CPPN can use any of the following activation functions:
Clipped linear
Bipolar steepened sigmoid
Gaussian
Sine
Others you might define
You can see these activation functions in Figure 8.3:
Figure 8.3: CPPN Activation Functions
The second difference is that the NEAT networks produced by these genomes are not the final product. They are not the phenotype. However, these NEAT genomes do know how to create the final product.
The final phenotype is a regular NEAT network with a sigmoid activation function. We can use the activation functions listed above only for the genomes. The ultimate phenotype always has a sigmoid activation function.
CPPN Phenotype
CPPNs are typically used in conjunction with images, as the CPPN phenotype is usually an image. Though images are the usual product of a CPPN, the only real requirement is that the CPPN compose something, thereby earning its name of compositional pattern-producing network. There are cases where a CPPN does not produce an image. The most popular non-image-producing CPPN is HyperNEAT, which is discussed in the next section.
Creating a genome neural network to produce a phenotype neural network is a complex but worthwhile endeavor. Because we are dealing with a large number of input and output neurons, the training times can be considerable. However, CPPNs are scalable and can reduce the training times.
Once you have evolved a CPPN to create an image, the size of the image (the phenotype) does not matter. It can be 320x200, 640x480 or some other resolution altogether. The image phenotype, generated by the CPPN, will grow to the size needed. As we will see in the next section, CPPNs give HyperNEAT the same sort of scalability.
We will now look at how a CPPN, which is itself a NEAT network, produces an image, or the final phenotype. The NEAT CPPN should have three input values: the coordinate on the horizontal axis (x), the coordinate on the vertical axis (y), and the distance of the current coordinate from the center (d). Inputting d provides a bias towards symmetry. In biological genomes, symmetry is important. The output from the CPPN corresponds to the pixel color at the x-coordinate and y-coordinate. The CPPN specification only determines how to process a grayscale image with a single output that indicates intensity. For a full-color image, you could use output neurons for red, green, and blue. Figure 8.4 shows a CPPN for images:
Figure 8.4: CPPN for Images
You can query the above CPPN for every x-coordinate and y-coordinate needed. Listing 8.2 shows the pseudocode that you can use to generate the phenotype:
Listing 8.2: Generate CPPN Image
def render_cppn(net, bitmap):
  for y from 1 to bitmap.height:
    for x from 1 to bitmap.width:
      # Normalize x and y to -1..+1.
      norm_x = (2 * (x / bitmap.width)) - 1
      norm_y = (2 * (y / bitmap.height)) - 1
      # Distance from the center.
      d = sqrt((norm_x / 2)^2 + (norm_y / 2)^2)
      # Call the CPPN with the normalized coordinates.
      input = [norm_x, norm_y, d]
      color = net.compute(input)
      # Output the pixel.
      bitmap.plot(x - 1, y - 1, color)
The above code simply loops over every pixel and queries the CPPN for the color at that location. The x-coordinate and y-coordinate are normalized to between -1 and +1. You can see this process in action at the Picbreeder website at the following URL:
http://picbreeder.org/
Depending on the complexity of the CPPN, this process can produce images similar to Figure 8.5:
Figure 8.5: A CPPN-Produced Image (picbreeder.org)
Picbreeder allows you to select one or more parents to contribute to the next generation. We selected the image that resembles a mouth, as well as the image to the right. Figure 8.6 shows the subsequent generation that Picbreeder produced.
Figure 8.6: A CPPN-Produced Image (picbreeder.org)
CPPN networks handle symmetry much as the human genome does. With two hands, two kidneys, two feet, and other body part pairs, the human genome seems to have a hierarchy of repeated features. Separate instructions for creating each eye or each patch of tissue do not exist. Fundamentally, the human genome does not have to describe every detail of an adult human being. Rather, the human genome only has to describe how to build an adult human being by generalizing many of the steps. This greatly simplifies the amount of information that is needed in a genome.
Another great feature of the image CPPN is that you can create the above images at any resolution and without retraining. Because the x-coordinate and y-coordinate are normalized to between -1 and +1, you can use any resolution.
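To make this resolution independence concrete, the sketch below renders a toy stand-in CPPN (a contrived function of x, y, and d, not an evolved network) at two resolutions, using the normalization from Listing 8.2:

```python
from math import sqrt

def toy_cppn(x, y, d):
    """A contrived stand-in for an evolved CPPN: a filled disc."""
    return 1.0 if d < 0.5 else 0.0

def render(cppn, width, height):
    """Render a CPPN to a width x height grid of intensities."""
    image = []
    for yy in range(1, height + 1):
        row = []
        for xx in range(1, width + 1):
            # Normalize coordinates to -1..+1 as in Listing 8.2.
            norm_x = (2.0 * xx / width) - 1.0
            norm_y = (2.0 * yy / height) - 1.0
            d = sqrt((norm_x / 2) ** 2 + (norm_y / 2) ** 2)
            row.append(cppn(norm_x, norm_y, d))
        image.append(row)
    return image
```

Rendering the same function at 8x8 and at 16x16 produces the same disc; only the pixel density changes, with no retraining of any kind.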
HyperNEAT Networks
HyperNEAT networks, invented by Stanley, D'Ambrosio, & Gauci (2009), are based upon the CPPN; however, instead of producing an image, a HyperNEAT network creates another neural network. Just like the CPPN in the last section, HyperNEAT can easily create much higher resolution neural networks without retraining.
HyperNEAT Substrate
One interesting hyper-parameter of a HyperNEAT network is the substrate, which defines the network's structure. A substrate defines the x-coordinate and the y-coordinate for the input and output neurons. Standard HyperNEAT networks usually employ two planes to implement the substrate. Figure 8.7 shows the sandwich substrate, one of the most common substrates:
Figure 8.7: HyperNEAT Sandwich Substrate
Together with the above substrate, a HyperNEAT CPPN is capable of creating the phenotype neural network. The source plane contains the input neurons, and the target plane contains the output neurons. The x-coordinate and the y-coordinate for each are in the -1 to +1 range. There can potentially be a weight between each of the source neurons and every target neuron. Figure 8.8 shows how to query the CPPN to determine these weights:
Figure 8.8: CPPN for HyperNEAT
The input to the CPPN consists of four values: x1, y1, x2, and y2. The first two values, x1 and y1, specify a neuron on the source plane. The second two values, x2 and y2, specify a neuron on the target plane. HyperNEAT allows the presence of as many different input and output neurons as desired, without retraining. Just as the CPPN image could map more and more pixels between -1 and +1, so too can HyperNEAT pack in more input and output neurons.
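The querying process above can be sketched as follows. Here `cppn` is a stand-in callable for the evolved network, and pruning weights below a threshold is a common HyperNEAT convention rather than a detail from this text:

```python
def grid(n):
    """n x n plane coordinates, normalized to the -1..+1 square."""
    if n == 1:
        return [(0.0, 0.0)]
    step = 2.0 / (n - 1)
    return [(-1.0 + ix * step, -1.0 + iy * step)
            for iy in range(n) for ix in range(n)]

def build_weights(cppn, n, threshold=0.2):
    """Query the CPPN for every source/target neuron pair.

    Each query passes (x1, y1) for the source neuron and (x2, y2) for
    the target neuron; small outputs are pruned, leaving no connection."""
    source = grid(n)
    target = grid(n)
    weights = {}
    for i, (x1, y1) in enumerate(source):
        for j, (x2, y2) in enumerate(target):
            w = cppn(x1, y1, x2, y2)
            if abs(w) > threshold:
                weights[(i, j)] = w
    return weights
```

Changing `n` changes the resolution of the phenotype without touching the CPPN itself, which is exactly the scalability described above.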
HyperNEAT Computer Vision
Computer vision is a great application of HyperNEAT, as demonstrated by the boxes experiment provided in the original HyperNEAT paper by Stanley et al. (2009). This experiment placed two rectangles in a computer's vision field. Of these two rectangles, one is always larger than the other. The neural network is trained to place a red rectangle near the center of the larger rectangle. Figure 8.9 shows this experiment running under the Encog framework:
Figure 8.9: Boxes Experiment (11x11 resolution)
As you can see from the above image, the red rectangle is placed directly inside of the larger of the two rectangles. The "New Case" button can be pressed to move the rectangles, and the program correctly finds the larger rectangle. While this works quite well at 11x11, the size can be increased to 33x33. With the larger size, no retraining is needed, as shown in Figure 8.10:
Figure 8.10: Boxes Experiment (33x33 resolution)
When the dimensions are increased to 33x33, the neural network is still able to place the red square inside of the larger rectangle.
The above example uses a sandwich substrate with the input and output planes both equal to the size of the visual field, in this case 33x33. The input plane provides the visual field. The neuron in the output plane with the highest output is the program's guess at the center of the larger rectangle. The fact that the position of the large rectangle does not confuse the network shows that HyperNEAT possesses some of the same features as the convolutional neural networks that we will see in Chapter 10, "Convolutional Networks."
Chapter Summary
This chapter introduced NEAT, CPPN, and HyperNEAT. Kenneth Stanley's EPLEX group at the University of Central Florida extensively researches all three technologies. NeuroEvolution of Augmenting Topologies (NEAT) is an algorithm that uses genetic algorithms to automatically evolve neural network structures. Often the decision of the structure of a neural network can be one of the most complex aspects of neural network design. NEAT neural networks can evolve their own structure and even decide what input features are important.
The compositional pattern-producing network (CPPN) is a type of neural network that is evolved to create other structures, such as images or other neural networks. Image generation is a common task for CPPNs. The Picbreeder website allows new images to be bred based on previous images generated at this site. CPPNs can generate more than just images. The HyperNEAT algorithm is an application of CPPNs for producing neural networks.
Hypercube-based NEAT, or HyperNEAT, is a type of CPPN that evolves other neural networks that can easily handle much higher resolutions along their dimensions once they are trained. HyperNEAT allows a CPPN to be evolved that can create neural networks. Being able to generate the neural network allows you to introduce symmetry, and it gives you the ability to change the resolution of the problem without retraining.
Neural networks have risen and declined in popularity several times since their introduction. Currently, there is interest in neural networks that use deep learning. In fact, deep learning involves several different concepts. The next chapter introduces deep neural networks, and we expand this topic throughout the remainder of this book.
Chapter 9: Deep Learning

Convolutional Neural Networks & Dropout
Tools for Deep Learning
Contrastive Divergence
Gibbs Sampling
Deep learning is a relatively new advancement in neural network programming and represents a way to train deep neural networks. Essentially, any neural network with more than two layers is deep. The ability to create deep neural networks has existed since Pitts (1943) introduced the multilayer perceptron. However, we were not able to train these networks effectively until Hinton (1984) became the first researcher to successfully train these complex neural networks.
Deep Learning Components
Deep learning comprises a number of different technologies, and this chapter is an overview of these technologies. Subsequent chapters will contain more information on them. Deep learning typically includes the following features:
Partially labeled data
Rectified linear units (ReLU)
Convolutional neural networks
Dropout
The succeeding sections provide an overview of these technologies.
Partially Labeled Data
Most learning algorithms are either supervised or unsupervised. Supervised training datasets provide an expected outcome, called a label, for each data item. Unsupervised training datasets do not provide an expected outcome. The problem is that most datasets are a mixture of labeled and unlabeled data items.
To understand the difference between labeled and unlabeled data, consider the following real-life example. When you were a child, you probably saw many vehicles as you grew up. Early in your life, you did not know if you were seeing a car, truck, or van. You simply knew that you were seeing some sort of vehicle. You can consider this exposure as the unsupervised part of your vehicle-learning journey. At that point, you learned commonalities of features among these vehicles.
Later in your learning journey, you were given labels. As you encountered different vehicles, an adult told you that you were looking at a car, truck, or van. The unsupervised training created your foundation, and you built upon that knowledge. As you can see, supervised and unsupervised learning are very common in real life. In its own way, deep learning does well with a combination of unsupervised and supervised learning data.
Some deep learning architectures handle partially labeled data and initialize the weights by using the entire training set without the outcomes. You can independently train the individual layers without the labels. Because you can train the layers in parallel, this process is scalable. Once the unsupervised phase has initialized these weights, the supervised phase can tweak them.
Rectified Linear Units
The rectified linear unit (ReLU) has become the standard activation function for the hidden layers of a deep neural network. However, the restricted Boltzmann machine (RBM) is the standard for the deep belief neural network (DBNN). In addition to the ReLU activation functions for the hidden layers, deep neural networks will use a linear or softmax activation function for the output layer, depending on whether the neural network supports regression or classification. We introduced ReLUs in Chapter 1, "Neural Network Basics," and expanded upon this information in Chapter 6, "Backpropagation Training."
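Both activation functions mentioned here fit in a few lines of code. This is a minimal sketch for illustration, not any particular framework's implementation:

```python
from math import exp

def relu(x):
    """Rectified linear unit, typical for hidden layers."""
    return max(0.0, x)

def softmax(values):
    """Softmax for a classification output layer: the outputs are
    positive and sum to 1, so they can be read as class probabilities."""
    m = max(values)                       # subtract max for stability
    exps = [exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

A regression network would instead use a linear output, i.e. the raw weighted sum with no activation applied.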
Convolutional Neural Networks
Convolution is an important technology that is often combined with deep learning. Hinton (2014) introduced convolution to allow image-recognition networks to function similarly to biological systems and achieve more accurate results. One approach is sparse connectivity, in which we do not create every possible weight. Figure 9.1 shows sparse connectivity:
Figure 9.1: Sparse Connectivity
A regular feedforward neural network usually creates every possible weight connection between two layers. In deep learning terminology, we refer to these layers as dense layers. In addition to not representing every weight possible, convolutional neural networks will also share weights, as seen in Figure 9.2:
Figure 9.2: Shared Weights
As you can see in the above figure, the neurons share only three individual weights. The red (solid), black (dashed), and blue (dotted) lines indicate the individual weights. Sharing weights allows the program to store complex structures while maintaining memory and computation efficiency.
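Both ideas, sparse connectivity and shared weights, can be seen in a one-dimensional convolution: each output connects to only three inputs, and every output reuses the same three kernel weights instead of a dense weight matrix. A sketch for illustration:

```python
def conv1d(inputs, kernel):
    """Slide the kernel across the inputs.

    Each output neuron sees only len(kernel) inputs (sparse), and all
    output neurons reuse the same kernel weights (shared)."""
    k = len(kernel)
    return [sum(inputs[i + j] * kernel[j] for j in range(k))
            for i in range(len(inputs) - k + 1)]
```

A dense layer connecting 4 inputs to 2 outputs would need 8 weights; the 3-weight kernel above produces the same 2 outputs from only 3 stored weights.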
This section presented an overview of convolutional neural networks. Chapter 10, "Convolutional Neural Networks," is devoted entirely to this network type.
Neuron Dropout
Dropout is a regularization technique that holds many benefits for deep learning. Like most regularization techniques, dropout can prevent overfitting. You can also apply dropout to a neural network in a layer-by-layer fashion, as you do in convolution. You must designate a single layer as a dropout layer. In fact, you can mix these dropout layers with regular layers and convolutional layers in the neural network. Never mix the dropout and convolutional layers within a single layer.
Hinton (2012) introduced dropout as a simple and effective regularization algorithm to reduce overfitting. Dropout works by removing certain neurons in the dropout layer. The act of dropping these neurons prevents other neurons from becoming overly dependent on the dropped neurons. The program removes these chosen neurons, along with all of their connections. Figure 9.3 illustrates this process:
Figure 9.3: Dropout Layer
From left to right, the above neural network contains an input layer, a dropout layer, and an output layer. The dropout layer has removed several of the neurons. The circles, made of dotted lines, indicate the neurons that the dropout algorithm removed. The dashed connector lines indicate the weights that the dropout algorithm removed when it eliminated the neurons.
Both dropout and other forms of regularization are extensive topics in the field of neural networks. Chapter 12, "Dropout and Regularization," covers regularization with particular focus on dropout. That chapter also contains an explanation of the L1 and L2 regularization algorithms. L1 and L2 discourage neural networks from the excessive use of large weights and the inclusion of certain irrelevant inputs. Essentially, a single neural network commonly uses dropout as well as other regularization algorithms.
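The dropping of neurons can be sketched on a single layer's activations. Note that the original dropout paper scales the weights at test time; the "inverted" variant below instead scales the surviving activations during training, which is equivalent in expectation. A sketch, not the book's code:

```python
import random

def dropout(activations, p_drop, training, rnd=random):
    """Inverted dropout on one layer's activations.

    During training, each activation is zeroed with probability p_drop,
    and survivors are scaled by 1/(1 - p_drop) so the expected output
    is unchanged.  At test time, values pass through untouched."""
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rnd.random() < keep else 0.0
            for a in activations]
```

Because a different random subset is dropped on every training pass, no downstream neuron can rely on any single upstream neuron being present.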
GPU Training
Hinton (1987) introduced a very novel way to train the deep belief neural network (DBNN) efficiently. We examine this algorithm and DBNNs later in this chapter. As mentioned previously, deep neural networks have existed almost as long as the neural network. However, until Hinton's algorithm, no effective way to train deep neural networks existed. The backpropagation algorithms are very slow, and the vanishing gradient problem hinders the training.
The graphics processing unit (GPU), the part of the computer that is responsible for graphics display, is the way that researchers solved the training problem of feedforward neural networks. Most of us are familiar with GPUs because of modern video games that utilize 3D graphics. Rendering these graphical images is mathematically intense, and, to perform these operations, early computers relied on the central processing unit (CPU). However, this approach was not effective. The graphics systems in modern video games required dedicated circuitry, which became the GPU, or video card. Essentially, modern GPUs are computers that function within your computer.
As researchers discovered, the processing power contained in a GPU can be harnessed for mathematically intense tasks, such as neural network training. We refer to this utilization of the GPU for general computing tasks, aside from computer graphics, as general-purpose use of the GPU (GPGPU). When applied to deep learning, the GPU performs extraordinarily well. Combining it with ReLU activation functions, regularization, and regular backpropagation can produce amazing results.
However, GPGPU can be difficult to use. Programs written for the GPU must employ a very low-level programming language called C99. This language is very similar to the regular C programming language. However, in many ways, the C99 required by the GPU is much more difficult than regular C programming. Furthermore, GPUs are good only at certain tasks, and even tasks conducive to the GPU can be hard because optimizing the C99 code is challenging. GPUs must balance several classes of memory, registers, and the synchronization of hundreds of processor cores. Additionally, GPU processing has two competing standards, CUDA and OpenCL. Two standards create more for the programmer to learn.
Fortunately, you do not need to learn GPU programming to exploit its processing power. Unless you are willing to devote a considerable amount of effort to learning the nuances of a complex and evolving field, we do not recommend that you learn to program the GPU because it is quite different from CPU programming. Good techniques that produce efficient, CPU-based programs will often produce horribly inefficient GPU programs. The reverse is also true. If you would like to use the GPU, you should work with an off-the-shelf package that supports it. If your needs do not fit into a deep learning package, you might consider using a linear algebra package, such as CUBLAS, which contains many highly optimized algorithms for the sorts of linear algebra that machine learning commonly requires.
The processing power of a highly optimized framework for deep learning and a fast GPU can be amazing. GPUs can achieve outstanding results based on sheer processing power. In 2010, the Swiss AI Lab IDSIA showed that, despite the vanishing gradient problem, the superior processing power of GPUs made backpropagation feasible for deep feedforward neural networks (Ciresan et al., 2010). The method outperformed all other machine learning techniques on the famous MNIST handwritten digit problem.
Tools for Deep Learning
One of the primary challenges of deep learning is the processing time to train a network. We often run training algorithms for many hours, or even days, seeking neural networks that fit well to the datasets. We use several frameworks for our research and predictive modeling. The examples in this book also utilize these frameworks, and we will present all of these algorithms in sufficient detail for you to create your own implementation. However, unless your goal is to conduct research to enhance deep learning itself, you are best served by working with an established framework. Most of these frameworks are tuned to train very quickly.
We can divide the examples from this book into two groups. The first group shows you how to implement a neural network or a training algorithm. Most of the examples in this book are of this type, and we examine each algorithm at its lowest level.
Application examples are the second type of example contained in this book. These higher-level examples show how to use neural network and deep learning algorithms. These examples will usually utilize one of the frameworks discussed in this section. In this way, the book strikes a balance between theory and real-world application.
H2O
H2O is a machine learning framework that supports a wide variety of programming languages. Though H2O is implemented in Java, it is designed as a web service. H2O can be used with R, Python, Scala, Java, and any language that can communicate with H2O's REST API.
Additionally, H2O can be used with Apache Spark for big data and big compute operations. The Sparkling Water package allows H2O to run large models in memory across a grid of computers. For more information about H2O, refer to the following URL:
http://0xdata.com/product/deep-learning/
In addition to deep learning, H2O supports a variety of other machine learning models, such as logistic regression, decision trees, and gradient boosting.
Theano
Theano is a mathematical package for Python, similar to the widely used Python package Numpy (Bergstra, Breuleux, Bastien, et al., 2012). Like Numpy, Theano primarily targets mathematics. Though Theano does not directly implement deep neural networks, it provides all of the mathematical tools necessary for the programmer to create deep neural network applications. Theano also directly supports GPGPU. You can find the Theano package at the following URL:
http://deeplearning.net/software/theano/
The creators of Theano also wrote an extensive tutorial for deep learning using Theano, which can be found at the following URL:
http://deeplearning.net/
Lasagne and Nolearn
Because Theano does not directly support deep learning, several packages have been built upon Theano to make it easy for the programmer to implement deep learning. One pair of packages, often used together, is Lasagne and Nolearn. Nolearn is a package for Python that provides abstractions around several machine learning algorithms. In this way, Nolearn is similar to the popular framework Scikit-Learn. While Scikit-Learn focuses widely on machine learning, Nolearn specializes in neural networks. One of the neural network packages supported by Nolearn is Lasagne, which provides deep learning and can be found at the following URL:
https://pypi.python.org/pypi/Lasagne/0.1dev
You can access the Nolearn package at the following URL:
https://github.com/dnouri/nolearn
The deep learning framework Lasagne takes its name from the Italian food lasagna. The spellings "lasagne" and "lasagna" are both considered valid for the Italian food. In the Italian language, "lasagna" is singular, and "lasagne" is the plural form. Regardless of the spelling used, lasagna is a good name for a deep learning framework. Figure 9.4 shows that, like a deep neural network, lasagna is made up of many layers:
Figure 9.4: Lasagna Layers
ConvNetJS
Deep learning support has also been created for JavaScript. The ConvNetJS package implements many deep learning algorithms, particularly in the area of convolutional neural networks. ConvNetJS primarily targets the creation of deep learning examples on websites. We used ConvNetJS to provide many of the deep learning JavaScript examples on this book's website:
http://cs.stanford.edu/people/karpathy/convnetjs/
Deep Belief Neural Networks

The deep belief neural network (DBNN) was one of the first applications of deep learning. A DBNN is simply a regular belief network with many layers. Belief networks, introduced by Neal in 1992, are different from regular feedforward neural networks. Hinton (2007) describes DBNNs as "probabilistic generative models that are composed of multiple layers of stochastic, latent variables." Because this technical description is complicated, we will define some terms.
- Probabilistic: DBNNs are used to classify, and their output is the probability that an input belongs to each class.
- Generative: DBNNs can produce plausible, randomly created values for the input values. Some DBNN literature refers to this trait as dreaming.
- Multiple layers: Like a neural network, DBNNs can be made of multiple layers.
- Stochastic, latent variables: DBNNs are made up of Boltzmann machines that produce random (stochastic) values that cannot be directly observed (latent).
The primary differences between a DBNN and a feedforward neural network (FFNN) are summarized as follows:

- Input to a DBNN must be binary; input to a FFNN is a decimal number.
- The output from a DBNN is the class to which the input belongs; the output from a FFNN can be a class (classification) or a numeric prediction (regression).
- DBNNs can generate plausible input based on a given outcome. FFNNs cannot perform like the DBNNs.
These are important differences. The first bullet item is one of the most limiting factors of DBNNs. The fact that a DBNN can accept only binary input often severely limits the type of problem that it can tackle. You also need to note that a DBNN can be used only for classification and not for regression. In other words, a DBNN could classify stocks into categories such as buy, hold, or sell; however, it could not provide a numeric prediction about the stock, such as the amount that may be attained over the next 30 days. If you need any of these features, you should consider a regular deep feedforward network.

Compared to feedforward neural networks, DBNNs may initially seem somewhat restrictive. However, they do have the ability to generate plausible input cases based on a given output. One of the earliest DBNN experiments was to have a DBNN classify ten digits, using handwritten samples. These digits were from the classic MNIST handwritten digits dataset that was included in this book's introduction. Once the DBNN is trained on the MNIST digits, it can produce new representations of each digit, as seen in Figure 9.5:

Figure 9.5: DBNN Dreaming of Digits

The above digits were taken from Hinton's (2006) deep learning paper. The first row shows a variety of different zeros that the DBNN generated from its training data.
The restricted Boltzmann machine (RBM) is the center of the DBNN. Input provided to the DBNN passes through a series of stacked RBMs that make up the layers of the network. Creating additional RBM layers produces deeper DBNNs. Though RBMs are unsupervised, the desire is for the resulting DBNN to be supervised. To accomplish the supervision, a final logistic regression layer is added to distinguish one class from another. In the case of Hinton's experiment, shown in Figure 9.6, the classes are the ten digits:

Figure 9.6: Deep Belief Neural Network (DBNN)

The above diagram shows a DBNN that uses the same hyper-parameters as Hinton's experiment. Hyper-parameters specify the architecture of a neural network, such as the number of layers, hidden neuron counts, and other settings. Each of the digit images presented to the DBNN is 28x28 pixels, or a vector of 784 pixels. The digits are monochrome (black & white), so these 784 pixels are single bits and are thus compatible with the DBNN's requirement that all input be binary. The above network has three layers of stacked RBMs, containing 500 neurons, a second 500-neuron layer, and 2,000 neurons, respectively.
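Preparing such an input can be sketched in a few lines of NumPy. This is only an illustration; the sample image and the threshold of 127 are invented for the sketch:

```python
import numpy as np

# A hypothetical 28x28 grayscale digit image with values in [0, 255].
image = np.zeros((28, 28))
image[10:18, 12:16] = 200  # a crude vertical stroke

# Threshold to single bits, then flatten to the 784-element binary
# vector that a DBNN requires as input.
binary_vector = (image.flatten() > 127).astype(int)

print(binary_vector.shape)  # (784,)
```

For a truly monochrome image, the threshold is irrelevant; it only matters when binarizing grayscale data.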
The following sections discuss a number of algorithms used to implement DBNNs.
Restricted Boltzmann Machines

Because Chapter 3, "Hopfield & Boltzmann Machines," includes a discussion of Boltzmann machines, we will not repeat this material here. This chapter deals with the restricted version of the Boltzmann machine and stacking these RBMs to achieve depth. Figure 2.10, from Chapter 3, shows an RBM. The primary difference with an RBM is that the only connections are between the visible (input) neurons and the hidden (output) neurons. In the case of a stacked RBM, the hidden units become the input to the next layer. Figure 9.7 shows how two Boltzmann machines are stacked:
Figure 9.7: Stacked RBMs

We can calculate the output from an RBM exactly as shown in Chapter 3, "Hopfield & Boltzmann Machines," in Equation 3.6. The only difference is that we now have two Boltzmann machines stacked. The first Boltzmann machine receives three inputs passed to its visible units. The hidden units pass their output directly to the two inputs (visible units) of the second RBM. Notice that there are no weights between the two RBMs; the output from the H1 and H2 units in RBM 1 passes directly to I1 and I2 of RBM 2.
Training a DBNN

The process of training a DBNN requires a number of steps. Although the mathematics behind this process can become somewhat complex, you don't need to understand every detail of DBNN training in order to use DBNNs. You just need to know the following key points:
- DBNNs undergo supervised and unsupervised training. During the unsupervised portion, the DBNN uses training data without their labels, which allows DBNNs to use a mix of labeled and unlabeled data. During the supervised portion, only training data with labels are used.
- Each DBNN layer is trained independently during the unsupervised portion. It is possible to train the DBNN layers concurrently (with threads) during the unsupervised portion.
- After the unsupervised portion is complete, the output from the layers is refined with supervised logistic regression. The top logistic regression layer predicts the class to which the input belongs.
Armed with this knowledge, you can skip ahead to the deep belief classification example in this chapter. However, if you wish to learn the specific details of DBNN training, read on.

Figure 9.8 provides a summary of the steps of DBNN training:

Figure 9.8: DBNN Training

Layer-Wise Sampling
The first step when performing unsupervised training on an individual layer is to calculate all values of the DBNN up to that layer. You will do this calculation for every training set element, and the DBNN will provide you with sampled values at the layer that you are currently training. Sampled refers to the fact that the neural network randomly chooses a true/false value based on a probability.

You need to understand that sampling uses random numbers to provide your results. Because of this randomness, you will not always get the same result. If the DBNN determines that a hidden neuron's probability of true is 0.75, then you will get a value of true 75% of the time. Layer-wise sampling is very similar to the method that we used to calculate the output of Boltzmann machines in Chapter 3, "Hopfield & Boltzmann Machines." We will use Equation 3.6 from Chapter 3 to compute the probability. The only difference is that we will use the probability given by Equation 3.6 to generate a random sample.
The purpose of the layer-wise sampling is to produce a binary vector to feed into the contrastive divergence algorithm. When training each RBM, we always provide the output of the previous RBM as the input to the current RBM. If we are training the first RBM (closest to the input), we simply use the training input vector for contrastive divergence. This process allows each of the RBMs to be trained. The final softmax layer of the DBNN is not trained during the unsupervised phase. The final logistic regression phase will train the softmax layer.
Computing Positive Gradients

Once the layer-wise training has processed each of the RBM layers, we can utilize the up-down algorithm, also known as the contrastive divergence algorithm. This complete algorithm includes the following steps, covered in the next sections of this book:
- Computing Positive Gradients
- Gibbs Sampling
- Update Weights and Biases
- Supervised Backpropagation
Like many of the gradient-descent-based algorithms presented in Chapter 6, "Backpropagation Training," the contrastive divergence algorithm is also based on gradient descent. It uses the derivative of a function to find the inputs to the function that produce the lowest output for that function. Several different gradients are estimated during contrastive divergence. We can use these estimates instead of actual calculations because the real gradients are too complex to calculate. For machine learning, an estimate is often good enough.

Additionally, we must calculate the mean probability of the hidden units by propagating the visible units to the hidden ones. This computation is the "up" portion of the up-down algorithm. Equation 9.1 performs this calculation:

Equation 9.1: Propagate Up
The above equation calculates the mean probability of each of the hidden neurons (h). The bar above the h designates it as a mean, and the positive subscript indicates that we are calculating the mean for the positive (or up) part of the algorithm. The sigmoid function is applied to the bias added to the weighted sum of all visible units.

Next, a value must be sampled for each of the hidden neurons. This value will randomly be either true (1) or false (0) with the mean probability just calculated. Equation 9.2 accomplishes this sampling:

Equation 9.2: Sample a Hidden Value

This equation assumes that r is a uniform random value between 0 and 1. A uniform random number simply means that every possible number in that range has an equal probability of being chosen.
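Equations 9.1 and 9.2 can be sketched together in NumPy. This is a minimal illustration, not the book's implementation; the weight matrix W, hidden bias vector c, and visible vector x are assumed names and toy values:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical RBM with 3 visible and 2 hidden units.
W = rng.normal(scale=0.1, size=(3, 2))  # weights, visible x hidden
c = np.zeros(2)                         # hidden biases
x = np.array([1, 0, 1])                 # binary visible vector

# Equation 9.1 (propagate up): mean hidden probabilities.
h_mean_pos = sigmoid(c + x @ W)

# Equation 9.2 (sample): true (1) with the probability just computed,
# using a uniform random value r for each hidden neuron.
r = rng.uniform(size=h_mean_pos.shape)
h_sample_pos = (r < h_mean_pos).astype(int)
```

Running the sampling line repeatedly with fresh r values yields true for each neuron at the rate given by its mean probability, which is exactly the 75%-of-the-time behavior described above.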
Gibbs Sampling

The calculation of the negative gradients is the "down" phase of the up-down algorithm. To accomplish this calculation, the algorithm uses Gibbs sampling to estimate the mean of the negative gradients. Geman and Geman (1984) introduced Gibbs sampling and named it after the physicist Josiah Willard Gibbs. The technique is accomplished by looping through k iterations that improve the quality of the estimate. Each iteration performs two steps:

- Sample visible neurons, given hidden neurons.
- Sample hidden neurons, given visible neurons.

For the first iteration of Gibbs sampling, we start with the positive hidden neuron samples obtained in the last section. We will sample visible neuron mean probabilities from these (first bullet above). Next, we will use these visible neurons to sample hidden neurons (second bullet above). These new hidden probabilities are the negative gradients. For the next cycle, we will use the negative gradients in place of the positive ones. This continues for each iteration and produces better negative gradients. Equation 9.3 accomplishes the sampling of the visible neurons (first bullet):
Equation 9.3: Propagate Down, Sample Visible (negative)

This equation is essentially the reverse of Equation 9.1. Here, we determine the mean of the visible units using the hidden values. Again, just like we did for the positive gradients, we sample a negative value using Equation 9.4:

Equation 9.4: Sample a Visible Value

The above equation assumes that r is a uniform random number between 0 and 1.
The above two equations are only half of the Gibbs sampling step. These equations accomplished the first bullet point above because they sample visible neurons, given hidden neurons. Next, we must accomplish the second bullet point: we must sample hidden neurons, given visible neurons. This process is very similar to the above section, "Computing Positive Gradients." This time, however, we are calculating the negative gradients.
The visible unit samples just calculated can be used to obtain hidden means, as shown in Equation 9.5:

Equation 9.5: Propagate Up, Sample Hidden (negative)

Just as before, the mean probability can be used to sample an actual value. In this case, we use the hidden mean to sample a hidden value, as demonstrated by Equation 9.6:

Equation 9.6: Sample a Hidden Value

The Gibbs sampling process continues, refining the negative hidden samples with each iteration. Once this calculation is complete, you have the following six vectors:

- Positive mean probabilities of the hidden neurons
- Positive sampled values of the hidden neurons
- Negative mean probabilities of the visible neurons
- Negative sampled values of the visible neurons
- Negative mean probabilities of the hidden neurons
- Negative sampled values of the hidden neurons

These values will update the neural network's weights and biases.
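The k iterations of Gibbs sampling described above can be sketched as follows. This is an illustrative sketch with assumed names and toy sizes, not the book's code; it shows only the looping structure that produces the six vectors:

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(prob):
    # True (1) with the given probability, per Equations 9.2/9.4/9.6.
    return (rng.uniform(size=prob.shape) < prob).astype(int)

# Hypothetical RBM with 3 visible and 2 hidden units.
W = rng.normal(scale=0.1, size=(3, 2))
b = np.zeros(3)  # visible biases
c = np.zeros(2)  # hidden biases
x = np.array([1, 0, 1])

# Positive ("up") phase: Equations 9.1 and 9.2.
h_mean_pos = sigmoid(c + x @ W)
h_sample = sample(h_mean_pos)
h_sample_pos = h_sample

# Negative ("down") phase: k iterations of Gibbs sampling.
k = 3
for _ in range(k):
    v_mean_neg = sigmoid(b + h_sample @ W.T)    # Equation 9.3
    v_sample_neg = sample(v_mean_neg)           # Equation 9.4
    h_mean_neg = sigmoid(c + v_sample_neg @ W)  # Equation 9.5
    h_sample = sample(h_mean_neg)               # Equation 9.6
h_sample_neg = h_sample

# The six vectors listed above are now available:
six = (h_mean_pos, h_sample_pos, v_mean_neg,
       v_sample_neg, h_mean_neg, h_sample_neg)
```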
Update Weights & Biases

The purpose of any neural network training is to update the weights and biases. This adjustment is what allows the neural network to learn to perform the intended task. This is the final step of the unsupervised portion of the DBNN training process. In this step, the weights and biases of a single layer (Boltzmann machine) will be updated. As previously mentioned, the Boltzmann layers are trained independently.

The weights and biases are updated independently. Equation 9.7 shows how to update a weight:

Equation 9.7: Boltzmann Weight Update
The learning rate (ε, epsilon) specifies how much of a calculated change should be applied. High learning rates will learn quicker, but they might skip over an optimal set of weights. Lower learning rates learn more slowly, but they might produce a higher quality result. The value x represents the current training set element. Because x is a vector (array), the x enclosed in two bars represents the length of x. The above equation also uses the positive mean hidden probabilities, the negative mean hidden probabilities, and the negative sampled values.
Equation 9.8 calculates the biases in a similar fashion:

Equation 9.8: Boltzmann Bias Update

The above equation uses the sampled hidden value from the positive phase and the mean hidden value from the negative phase, as well as the input vector. Once the weights and biases have been updated, the unsupervised portion of the training is done.
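The updates described for Equations 9.7 and 9.8 can be sketched with the standard contrastive divergence rule. Because the typeset equations are described here only in prose, the exact mix of sampled versus mean quantities below is an assumption; the names continue the earlier sketches:

```python
import numpy as np

# Quantities from the positive and negative phases of one training
# element x (assumed toy values; see the sketches above).
x = np.array([1.0, 0.0, 1.0])
h_mean_pos = np.array([0.6, 0.4])
v_sample_neg = np.array([1.0, 1.0, 0.0])
h_mean_neg = np.array([0.5, 0.5])

W = np.zeros((3, 2))
b = np.zeros(3)  # visible biases
c = np.zeros(2)  # hidden biases
epsilon = 0.1    # learning rate

# Weight update: positive associations minus negative associations,
# scaled by the learning rate.
W += epsilon * (np.outer(x, h_mean_pos)
                - np.outer(v_sample_neg, h_mean_neg))

# Bias updates: difference between the positive and negative phases.
b += epsilon * (x - v_sample_neg)
c += epsilon * (h_mean_pos - h_mean_neg)
```

In a batch implementation, these differences would be averaged over the batch, which is where a division by the length of x enters the weight update.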
DBNN Backpropagation

Up to this point, the DBNN training has focused on unsupervised training. The DBNN used only the training set inputs (x values). Even if the dataset provided an expected output (y values), the unsupervised training didn't use it. Now the DBNN is trained with the expected outputs. We use only dataset items that contain an expected output during this last phase. This step allows the program to use DBNN networks with datasets where each item does not necessarily have an expected output. We refer to such data as partially labeled datasets.
The final layer of the DBNN is simply a neuron for each class. These neurons have weights to the output of the previous RBM layer. These output neurons all use sigmoid activation functions, followed by a softmax layer. The softmax layer ensures that the outputs for each of the classes sum to 1.
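The effect of that softmax layer can be sketched in a few lines. This is a generic softmax, not the book's implementation, and the raw output values are invented:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw outputs for a three-class network.
raw = np.array([2.0, 1.0, 0.1])
probs = softmax(raw)
# probs now sums to 1 (within floating-point error), and the largest
# raw value still corresponds to the largest probability.
```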
Regular backpropagation trains this final layer. The final layer is essentially the output layer of a feedforward neural network that receives its input from the top RBM. Because Chapter 6, "Backpropagation Training," contains a discussion of backpropagation, we will not repeat the information here. The main idea of a DBNN is that the hierarchy allows each layer to interpret the data for the next layer. This hierarchy allows the learning to spread across the layers. The higher layers learn more abstract notions while the lower layers form from the input data. In practice, DBNNs can process much more complex patterns than a regular backpropagation-trained feedforward neural network.
Deep Belief Application

This chapter presents a simple example of the DBNN. This example simply accepts a series of input patterns, as well as the classes to which these input patterns belong. The input patterns are shown below:
[[1,1,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,1,1,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0]]
We provide the expected output for each of these training set elements. This information specifies the class to which each of the above elements belongs and is shown below:
[[1,0],
[1,0],
[1,0],
[0,1],
[0,1],
[0,1]]
The program provided in the book's examples creates a DBNN with the following configuration:

- Input Layer Size: 8
- Hidden Layer #1: 2
- Hidden Layer #2: 3
- Output Layer Size: 2

First, we train each of the hidden layers. Finally, we perform logistic regression on the output layer. The output from this program is shown here:
Training Hidden Layer #0
Training Hidden Layer #1
Iteration: 1, Supervised training: error = 0.2478464544753616
Iteration: 2, Supervised training: error = 0.23501688281192523
Iteration: 3, Supervised training: error = 0.2228704042138232
...
Iteration: 287, Supervised training: error = 0.001080510032410002
Iteration: 288, Supervised training: error = 7.821742124428358E-4
[0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0] -> [0.9649828726012807, 0.03501712739871941]
[1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0] -> [0.9649830045627616, 0.035016995437238435]
[0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0] -> [0.03413161595489315, 0.9658683840451069]
[0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0] -> [0.03413137188719462, 0.9658686281128055]
As you can see, the program first trained the hidden layers and then went through 288 iterations of regression. The error level dropped considerably during these iterations. Finally, the sample data quizzed the network. The network responded with the probability of each input sample being in each of the two classes that we specified above.

For example, the network reported the following for this element:
[0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0]
This element had a 96% probability of being in class 1, but it had only a 4% probability of being in class 2. The two probabilities reported for each item must sum to 100%.
Chapter Summary

This chapter provided a high-level overview of many of the components of deep learning. A deep neural network is any network that contains more than two hidden layers. Although deep networks have existed for as long as multilayer neural networks, they have lacked good training methods until recently. New training techniques, activation functions, and regularization are making deep neural networks feasible.

Overfitting is a common problem for many areas of machine learning; neural networks are no exception. Regularization can prevent overfitting. Most forms of regularization involve modifying the weights of a neural network as the training occurs. Dropout is a very common regularization technique for deep neural networks that removes neurons as training progresses. This technique prevents the network from becoming overly dependent on any one neuron.

We ended the chapter with the deep belief neural network (DBNN), which classifies data that might be partially labeled. First, both labeled and unlabeled data can initialize the weights of the neural network with unsupervised training. Using these weights, a logistic regression layer can fine-tune the network to the labeled data.
We also discussed convolutional neural networks (CNNs) in this chapter. This type of neural network causes the weights to be shared between the various neurons in the network. This weight sharing allows the CNN to deal with the types of overlapping features that are very common in computer vision. We provided only a general overview of CNNs because we will examine them in greater detail in the next chapter.
Chapter 10: Convolutional Neural Networks

- Sparse Connectivity
- Shared Weights
- Max-pooling
The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This chapter follows the LeNet-5 style of convolutional neural network.
Although computer vision primarily uses CNNs, this technology has some applications outside of the field. You need to realize that if you want to utilize CNNs on non-visual data, you must find a way to encode your data so that it can mimic the properties of visual data.
CNNs are somewhat similar to the self-organizing map (SOM) architecture that we examined in Chapter 2, "Self-Organizing Maps." In both, the order of the vector elements is crucial to the training. In contrast, most neural networks that are not CNNs or SOMs treat their input data as a long vector of values, and the order in which you arrange the incoming features in this vector is irrelevant. For these types of neural networks, however, you cannot change the order after you have trained the network. In other words, CNNs and SOMs do not follow the standard treatment of input vectors.

The SOM network arranged the inputs into a grid. This arrangement worked well with images because pixels in closer proximity to each other are important to each other. Obviously, the order of pixels in an image is significant. The human body is a relevant example of this type of order. For the design of the face, we are accustomed to eyes being near to each other. In the same way, neural network types like SOMs adhere to an order of pixels. Consequently, they have many applications in computer vision.
Although SOMs and CNNs are similar in the way that they map their input into 2D grids or even higher-dimension objects such as 3D boxes, CNNs take image recognition to a higher level of capability. This advance in CNNs is due to years of research on biological eyes. In other words, CNNs utilize overlapping fields of input to simulate features of biological eyes. Until this breakthrough, AI had been unable to reproduce the capabilities of biological vision.
Scale, rotation, and noise have presented challenges for AI computer vision research in the past. You can observe the complexity of biological eyes in the example that follows. A friend raises a sheet of paper with a large number written on it. As your friend moves nearer to you, the number is still identifiable. In the same way, you can still identify the number when your friend rotates the paper. Lastly, your friend creates noise by drawing lines on top of the page, but you can still identify the number. As you can see, these examples demonstrate the high function of the biological eye and allow you to better understand the research breakthrough of CNNs. That is, this neural network has the ability to process scale, rotation, and noise in the field of computer vision.
LeNet-5

We can use the LeNet-5 architecture primarily for the classification of graphical images. This network type is similar to the feedforward network that we examined in previous chapters. Data flows from the input to the output. However, the LeNet-5 network contains several different layer types, as Figure 10.1 illustrates:
Figure 10.1: A LeNet-5 Network (LeCun, 1998)

Several important differences exist between a feedforward neural network and a LeNet-5 network:

- Vectors pass through feedforward networks; 3D cubes pass through LeNet-5 networks.
- LeNet-5 networks contain a variety of layer types.
- Computer vision is the primary application of the LeNet-5.

However, we have also explored the many similarities between the networks. The most important similarity is that we can train the LeNet-5 with the same backpropagation-based techniques. Any optimization algorithm can train the weights of either a feedforward or LeNet-5 network. Specifically, you can utilize simulated annealing, genetic algorithms, and particle swarm for training. However, LeNet-5 frequently uses backpropagation training.

The following three layer types comprise the original LeNet-5 neural network:

- Convolutional Layers
- Max-pool Layers
- Dense Layers

Other neural network frameworks will add additional layer types related to computer vision. However, we will not explore these additions beyond the LeNet-5 standard.
Adding new layer types is a common means of augmenting existing neural network research. Chapter 12, "Dropout and Regularization," will introduce the dropout layer, an additional layer type that is designed to reduce overfitting.
For now, we focus our discussion on the layer types associated with convolutional neural networks. We will begin with convolutional layers.

Convolutional Layers

The first layer type that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN:

- Number of Filters
- Filter Size
- Stride
- Padding
- Activation Function/Non-Linearity

The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. The filters can detect these features. The more filters that we give to a convolutional layer, the more features it can detect.
A filter is a square-shaped object that scans over the image, which a grid of individual pixels can represent. You can think of the convolutional layer as a smaller grid that sweeps left to right over each row of the image. There is also a hyper-parameter that specifies both the width and height of the square-shaped filter. Figure 10.1 shows this configuration, in which you see the six convolutional filters sweeping over the image grid:

A convolutional layer has weights between it and the previous layer or image grid. Each pixel on each convolutional filter is a weight. Therefore, the number of weights between a convolutional layer and its predecessor layer or image field is the following:
[Filter Size] * [Filter Size] * [# of Filters]

For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
You need to understand how the convolutional filters sweep across the previous layer's output or image grid. Figure 10.2 illustrates the sweep:

Figure 10.2: Convolutional Filter
The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding is responsible for the border of zeros in the area that the filter sweeps. Even though the image is actually 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once the far right is reached, the convolutional filter moves back to the far left; then it moves down by the stride amount and continues to the right again.

Some constraints exist in relation to the size of the stride. Obviously, the stride cannot be 0. The convolutional filter would never move if the stride were set to 0. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints among the stride (s), the padding (p) and the filter width (f) for an image of width (w). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. Equation 10.1 shows the number of steps a convolutional operator must take to cross the image:
Equation 10.1: Steps Across an Image

The number of steps must be an integer. In other words, it cannot have decimal places. The purpose of the padding (p) is to be adjusted so that this equation produces an integer value.
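The calculation behind Equation 10.1 can be sketched in its standard form; the helper name steps_across is ours, not the book's:

```python
# Number of positions a filter of width f, with padding p and stride s,
# occupies while crossing an image of width w: (w - f + 2p) / s + 1.
def steps_across(w, f, p, s):
    steps = (w - f + 2 * p) / s + 1
    if steps != int(steps):
        raise ValueError("adjust padding p so the step count is an integer")
    return int(steps)

print(steps_across(8, 4, 0, 2))  # 3
print(steps_across(7, 3, 1, 2))  # 4
```

A combination such as w=8, f=3, p=0, s=2 yields 3.5 steps and raises the error, which is exactly the case where the padding must be adjusted.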
We can use the same set of weights as the convolutional filter sweeps over the image. This process allows convolutional layers to share weights and greatly reduce the amount of processing needed. In this way, you can recognize the image in shifted positions because the same convolutional filter sweeps across the entire image.
The input and output of a convolutional layer are both 3D boxes. For the input to a convolutional layer, the width and height of the box are equal to the width and height of the input image. The depth of the box is equal to the color depth of the image. For an RGB image, the depth is 3, equal to the components of red, green, and blue. If the input to the convolutional layer is another layer, then it will also be a 3D box; however, the dimensions of that 3D box will be dictated by the hyper-parameters of that layer.
Like any other layer in the neural network, the size of the 3D box output by a convolutional layer is dictated by the hyper-parameters of the layer. The width and height of this box are determined by the filter size, stride, and padding, as given by Equation 10.1. The depth of the box is equal to the number of filters.
Max-Pool Layers

Max-pool layers downsample a 3D box to a new one with smaller dimensions. Typically, you can place a max-pool layer immediately following a convolutional layer. Figure 10.1 shows the max-pool layers immediately after layers C1 and C3. These max-pool layers progressively decrease the size of the dimensions of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).
A pooling layer has the following hyper-parameters:

- Spatial Extent (f)
- Stride (s)

Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no weights, so training does not affect them. These layers simply downsample their 3D box input.

The 3D box output by a max-pool layer will have a width given by Equation 10.2:

Equation 10.2: Width of Max-pool Output

The height of the 3D box produced by the max-pool layer is calculated similarly with Equation 10.3:

Equation 10.3: Height of Max-pool Output

The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box received as input.
The most common settings for the hyper-parameters of a max-pool layer are f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 region in the new grid. Because squares of size 4 are replaced with squares of size 1, 75% of the pixel information is lost. Figure 10.3 shows this transformation as a 6x6 grid becomes a 3x3:

Figure 10.3: Max-pooling (f=2, s=2)

Of course, the above diagram shows each pixel as a single number. A grayscale image would have this characteristic. For an RGB image, we usually take the average of the three numbers to determine which pixel has the maximum value.
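The f=2, s=2 transformation in Figure 10.3 can be sketched in NumPy for a single 2D grid. This is an illustrative implementation, not the book's code:

```python
import numpy as np

def max_pool(grid, f=2, s=2):
    # Downsample a 2D grid: each f x f window is replaced by its
    # maximum value, advancing by stride s (here f == s, no overlap).
    h, w = grid.shape
    out = np.zeros((h // s, w // s))
    for i in range(0, h - f + 1, s):
        for j in range(0, w - f + 1, s):
            out[i // s, j // s] = grid[i:i + f, j:j + f].max()
    return out

# A 6x6 grid becomes 3x3, as in Figure 10.3.
grid = np.arange(36).reshape(6, 6)
pooled = max_pool(grid)
print(pooled.shape)  # (3, 3)
```

Only one of every four values survives, which is the 75% information loss described above.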
Dense Layers

The final layer type in a LeNet-5 network is the dense layer. This layer type is exactly the same type of layer as we've seen before in feedforward neural networks. A dense layer connects every element (neuron) in the previous layer's output 3D box to each neuron in the dense layer. The resulting vector is passed through an activation function. LeNet-5 networks will typically use a ReLU activation. However, we can use a sigmoid activation function; this technique is less common. A dense layer will typically contain the following hyper-parameters:

- Neuron Count
- Activation Function

The neuron count specifies the number of neurons that make up this layer. The activation function indicates the type of activation function to use. Dense layers can employ many different kinds of activation functions, such as ReLU, sigmoid, or hyperbolic tangent.

LeNet-5 networks will typically contain several dense layers as their final layers. The final dense layer in a LeNet-5 actually performs the classification. There should be one output neuron for each class, or type of image, to classify. For example, if the network distinguishes between dogs, cats, and birds, there will be three output neurons. You can apply a final softmax function to the final layer to treat the output neurons as probabilities. Softmax allows each neuron to provide the probability of the image representing each class. Because the output neurons are now probabilities, softmax ensures that they sum to 1.0 (100%). To review softmax, you can reread Chapter 4, "Feedforward Neural Networks."
ConvNets for the MNIST Data Set

In Chapter 6, "Backpropagation Training," we used the MNIST handwritten digits as an example of using backpropagation. In this chapter, we present an example that improves our recognition of the MNIST digits with a deep convolutional neural network. The convolutional network, being a deep neural network, will have more layers than the feedforward neural network seen in Chapter 6. The hyper-parameters for this network are as follows:
- Input: Accepts box of [1, 96, 96]
- Convolutional Layer: filters=32, filter_size=[3,3]
- Max-pool Layer: [2,2]
- Convolutional Layer: filters=64, filter_size=[2,2]
- Max-pool Layer: [2,2]
- Convolutional Layer: filters=128, filter_size=[2,2]
- Max-pool Layer: [2,2]
- Dense Layer: 500 neurons
- Output Layer: 30 neurons
This network uses the very common pattern of following each convolutional layer with a max-pool layer. Additionally, the number of filters increases from the input toward the output layer, so that a small number of basic features, such as edges, lines, and small shapes, are detected near the input field. Successive convolutional layers roll up these basic features into larger and more complex features. Ultimately, the dense layer can map these higher-level features into each x-coordinate and y-coordinate of the actual 15 digit features.
Training the convolutional neural network takes considerable time, especially if you are not using GPU processing. As of July 2015, not all frameworks have equal support for GPU processing. At this time, using Python with a Theano-based neural network framework, such as Lasagne, provides the best results. Many of the same researchers who are improving deep convolutional networks are also working with Theano. Thus, they promote it before other frameworks on other languages.

For this example, we used Theano with Lasagne. The book's example download may have other languages available for this example as well, depending on the frameworks available for those languages. Training a convolutional neural network for digit feature recognition on Theano took less time with a GPU than with a CPU, as a GPU helps considerably for convolutional neural networks. The exact amount of the performance gain will vary according to hardware and platform. The accuracy comparison between the convolutional neural network and the regular ReLU network is shown here:
Relu:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.7000000000000002%)
ReLU+Conv:
Best valid loss was 0.065753 at epoch 3.
Incorrect 150/10000 (1.5%)
If you compare the results from the convolutional neural network to the standard feedforward neural network from Chapter 6, you will see that the convolutional neural network performed better. The convolutional neural network is capable of recognizing sub-features in the digits to boost its performance over the standard feedforward neural network. Of course, these results will vary, depending on the platform used.

Chapter Summary

Convolutional neural networks are a very active area in the field of computer vision. They allow the neural network to detect hierarchies of features, such as lines and small shapes. These simple features can form hierarchies that teach the neural network to recognize complex patterns composed of the simpler features. Deep convolutional networks can take considerable processing power. Some frameworks allow the use of GPU processing to enhance performance.

Yann LeCun introduced the LeNet-5, the most common type of convolutional network. This neural network type is comprised of dense layers, convolutional layers, and max-pool layers. The dense layers work exactly the same way as those in traditional feedforward networks. Max-pool layers downsample the image and remove detail. Convolutional layers detect features in any part of the image field.
There are many different approaches to determine the best architecture for a neural network. Chapter 8, "NEAT, CPPN and HyperNEAT," introduced a neural network algorithm that could automatically determine the best architecture. If you are using a feedforward neural network, you will most likely arrive at a structure through pruning and model selection, which we discuss in the next chapter.

Chapter 11: Pruning and Model Selection

- Pruning a Neural Network
- Model Selection
- Random vs. Grid Search
In previous chapters, we learned that you could better fit the weights of a neural network with various training algorithms. In effect, these algorithms adjust the weights of the neural network in order to lower its error. We often refer to the weights of a neural network as the parameters of the neural network model. Some machine learning models might have parameters other than weights. For example, logistic regression (which we discussed in Artificial Intelligence for Humans, Volume 1) has coefficients as parameters.

When we train the model, the parameters of any machine learning model change. However, these models also have hyper-parameters that do not change during training. For neural networks, the hyper-parameters specify the architecture of the neural network. Examples of hyper-parameters for neural networks include the number of hidden layers and hidden neurons.
In this chapter, we will examine two algorithms that can actually modify or suggest a structure for the neural network. Pruning works by analyzing how much each neuron contributes to the output of the neural network. If a particular neuron's connection to another neuron does not significantly affect the output of the neural network, the connection will be pruned. Through this process, connections and neurons that have only a marginal impact on the output are removed.

The other algorithm that we introduce in this chapter is model selection. While pruning starts with an already trained neural network, model selection creates and trains many neural networks with different hyper-parameters. The program then selects the hyper-parameters that produce the neural network achieving the best validation score.
Understanding Pruning
Pruning is a process that makes neural networks more efficient. Unlike the training algorithms already discussed in this book, pruning does not seek to improve the training error of the neural network. The primary goal of pruning is to decrease the amount of processing required to use the neural network. Additionally, pruning can sometimes have a regularizing effect by removing complexity from the neural network. This regularization can sometimes decrease the amount of overfitting, which can help the neural network perform better on data that were not part of the training set.
Pruning works by analyzing the connections of the neural network. The pruning algorithm looks for individual connections and neurons that can be removed from the neural network to make it operate more efficiently. By pruning unneeded connections, the neural network can be made to execute faster and possibly overfit less. In the next two sections, we will examine how to prune both connections and neurons.
Pruning Connections
Connection pruning is central to most pruning algorithms. The program analyzes the individual connections between the neurons to determine which connections have the least impact on the effectiveness of the neural network. Connections are not the only thing that the program can prune. Analyzing the pruned connections will reveal that the program can also prune individual neurons.
Pruning Neurons
Pruning focuses primarily on the connections between the individual neurons of the neural network. However, to prune individual neurons, we must examine the connections between each neuron and the other neurons. If one particular neuron is surrounded entirely by weak connections, there is no reason to keep that neuron. Applying the criteria discussed in the previous section eventually produces neurons with no connections at all, because the program has pruned all of the neuron's connections. The program can then prune this type of neuron.
Improving or Degrading Performance
It is possible that pruning a neural network may improve its performance. Any modifications to the weight matrix of a neural network will always have some impact on the accuracy of the recognitions made by the neural network. A connection that has little or no impact on the neural network may actually be degrading the accuracy with which the neural network recognizes patterns. Removing this weak connection may improve the overall output of the neural network.
Unfortunately, pruning can also decrease the effectiveness of the neural network. Thus, you must always analyze the effectiveness of the neural network before and after pruning. Since efficiency is the primary benefit of pruning, you must be careful to evaluate whether an improvement in processing time is worth a decrease in the neural network's effectiveness. We will evaluate the overall effectiveness of the neural network both before and after pruning in one of the programming examples from this chapter. This analysis will give us an idea of the impact that the pruning process has on the effectiveness of the neural network.
Pruning Algorithm
We will now review exactly how pruning takes place. Pruning works by examining the weight matrices of a previously trained neural network. The pruning algorithm will then attempt to remove neurons without disrupting the output of the neural network. Figure 11.1 shows the algorithm used for selective pruning:

Figure 11.1: Pruning a Neural Network
As you can see, the pruning algorithm takes a trial-and-error approach. The pruning algorithm attempts to remove neurons from the neural network until it cannot remove additional neurons without degrading the performance of the neural network.
To begin this process, the selective pruning algorithm loops through each of the hidden neurons. For each hidden neuron encountered, the program evaluates the error level of the neural network both with and without the specified neuron. If the error rate jumps beyond a predefined level, the program retains the neuron and evaluates the next. If the error rate does not rise significantly, the program removes the neuron.
Once the program has evaluated all neurons, it repeats the process. This cycle continues until the program has made one pass through the hidden neurons without removing a single neuron. Once this process is complete, a new neural network is achieved that performs acceptably close to the original, yet it has fewer hidden neurons.
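The loop just described can be sketched in a few lines of Python. This is an illustrative sketch, not the Encog implementation: the network is abstracted to a list of hidden-neuron ids plus a hypothetical error function, and the 5% error tolerance is an assumed threshold.

```python
def selective_prune(neurons, error_fn, max_error_rise=0.05):
    """Trial-and-error pruning: try dropping each hidden neuron in turn and
    keep the removal only if the network's error does not rise beyond a
    tolerance over the original error.  `neurons` is a list of neuron ids and
    `error_fn(kept)` returns the error of the network restricted to `kept`."""
    base_error = error_fn(neurons)
    kept = list(neurons)
    removed_any = True
    while removed_any:                       # repeat until one full clean pass
        removed_any = False
        for n in list(kept):
            trial = [k for k in kept if k != n]
            if error_fn(trial) <= base_error * (1.0 + max_error_rise):
                kept = trial                 # removal accepted
                removed_any = True
            # otherwise the error jumped, so the neuron is retained
    return kept

# Toy error function: only neurons 0 and 2 actually matter.
err = lambda kept: 0.10 if {0, 2} <= set(kept) else 0.90
print(selective_prune([0, 1, 2, 3], err))    # -> [0, 2]
```

The toy error function stands in for evaluating the trained network on a validation set; in practice each call would be a full evaluation pass.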
Model Selection
Model selection is the process where the programmer attempts to find a set of hyper-parameters that produces the best neural network, or other machine learning model. In this book, we have mentioned many different hyper-parameters, which are the settings that you must provide to the neural network framework. Examples of neural network hyper-parameters include:

The number of hidden layers
The order of the convolutional, pooling, and dropout layers
The type of activation function
The number of hidden neurons
The structure of pooling and convolutional layers
As you've read through the chapters that mention hyper-parameters, you've probably been wondering how you know which settings to use. Unfortunately, there is no easy answer. If easy methods existed to determine these settings, programmers would have constructed neural network frameworks that automatically set these hyper-parameters for you.
While we will provide more insight into hyper-parameters in Chapter 14, “Architecting Neural Networks,” you will still need to use the model selection processes described in this chapter. Unfortunately, model selection is very time-consuming. We spent 90% of our time performing model selection during our last Kaggle competition. Often, success in modeling is closely related to the amount of time you have to spend on model selection.
Grid Search Model Selection
Grid search is a trial-and-error, brute-force algorithm. For this technique, you must specify every combination of the hyper-parameters that you would like to use. You must be judicious in your selection because the number of search iterations can quickly grow. Typically, you specify each hyper-parameter that you would like to search. This specification might look like the following:

Hidden Neurons: 2 to 10, step size 2
Activation Functions: tanh, sigmoid & ReLU

The first item states that the grid search should try hidden neuron counts between 2 and 10, counting by 2, which results in the following values: 2, 4, 6, 8, and 10 (5 total possibilities). The second item states that we should also try the activation functions tanh, sigmoid, and ReLU for each neuron count. This process results in a total of fifteen iterations, because five possibilities times three possibilities is fifteen. These possibilities are listed here:
Iteration #1: [2] [sigmoid]
Iteration #2: [4] [sigmoid]
Iteration #3: [6] [sigmoid]
Iteration #4: [8] [sigmoid]
Iteration #5: [10] [sigmoid]
Iteration #6: [2] [ReLU]
Iteration #7: [4] [ReLU]
Iteration #8: [6] [ReLU]
Iteration #9: [8] [ReLU]
Iteration #10: [10] [ReLU]
Iteration #11: [2] [tanh]
Iteration #12: [4] [tanh]
Iteration #13: [6] [tanh]
Iteration #14: [8] [tanh]
Iteration #15: [10] [tanh]
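In Python, the same enumeration can be produced with `itertools.product`. This sketch is independent of any framework; note that `product` spins its last axis fastest, so the ordering differs from the listing above even though the fifteen combinations are identical.

```python
from itertools import product

# The two axes from the specification above.
hidden_neurons = range(2, 11, 2)             # 2, 4, 6, 8, 10
activations = ["sigmoid", "ReLU", "tanh"]

grid = list(product(hidden_neurons, activations))
print(len(grid))                             # 5 values * 3 values = 15 iterations

for i, (neurons, act) in enumerate(grid, start=1):
    print(f"Iteration #{i}: [{neurons}] [{act}]")
```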
Each set of possibilities is called an axis. These axes rotate through all possible combinations before they finish. You can visualize this process by thinking of a car's odometer. The fastest-spinning digit counts between 0 and 9. Once it hits 9 and needs to advance, it rolls back to 0, and the digit in the next place rolls forward by one. Unless that next digit was also on 9, the rollover stops there. At some point, all digits on the odometer are at 9, and the entire device rolls back over to 0. When this final rollover occurs, the grid search is done.
Most frameworks allow two axis types. The first type is a numeric range with a step. The second type is a list of values, like the activation functions above. The following JavaScript example allows you to try your own sets of axes to see the number of iterations produced:

http://www.heatonresearch.com/aifh/vol3/grid_iter.html
Listing 11.1 shows the pseudocode necessary to roll through all iterations of several sets of values:

Listing 11.1: Grid Search
# The variable axes contains a list of each axis.
# Each axis (in axes) is a list of possible values
# for that axis.

# The current index of each axis starts at zero, so
# create an array of zeros.
indexes = zeros(len(axes))
done = false

while not done:
  # Prepare a vector of the current iteration's
  # hyper-parameters.
  iteration = []
  for i from 0 to len(axes):
    iteration.add(axes[i][indexes[i]])

  # Perform one iteration, passing in the hyper-parameters
  # that are stored in the iteration list. This function
  # should train the neural network according to the
  # hyper-parameters and keep note of the best trained
  # network so far.
  perform_iteration(iteration)

  # Rotate the axes forward one unit, like a car's
  # odometer.
  indexes[0] = indexes[0] + 1
  counterIdx = 0

  # Roll forward the other places, if needed.
  while not done and indexes[counterIdx] >= len(axes[counterIdx]):
    indexes[counterIdx] = 0
    counterIdx = counterIdx + 1
    if counterIdx >= len(axes):
      done = true
    else:
      indexes[counterIdx] = indexes[counterIdx] + 1
The code above uses two loops to pass through every possible set of the hyper-parameters. The outer loop continues while the program is still producing hyper-parameters; each time through, it advances the first hyper-parameter to its next value. The inner loop detects if the first hyper-parameter has rolled over and keeps moving forward to the next hyper-parameter until no more rollovers occur. Once all the hyper-parameters roll over, the process is done.
As you can see, the grid search can quickly result in a large number of iterations. Consider a search for the optimal number of hidden neurons on five layers, where you allowed up to 200 neurons on each layer. This value would be equal to 200 multiplied by itself five times, or 200 to the fifth power. This process results in 320 billion iterations. Because each iteration involves training a neural network, the iterations can take minutes, hours, or even days to execute.
When performing grid searches, multi-threading and grid processing can be beneficial. Running the iterations through a thread pool can greatly speed up the search. The thread pool should have a size equal to the number of cores on the machine. This approach allows a machine with eight cores to work on eight neural networks simultaneously. The training of the individual models must be single-threaded when you run the iterations simultaneously. Many frameworks will use all available cores to train a single neural network; however, when you have a large number of neural networks to train, it is usually better to train several networks in parallel, each on its own core, than to train them one at a time with each network using all of the machine's cores.
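The thread-pool approach can be sketched with Python's standard library. This is a sketch under stated assumptions: `train_candidate` is a hypothetical stand-in that would normally build and train one network and return its validation score; here it just scores a toy function. (For CPU-bound Python training code, a process pool would avoid the GIL; many real frameworks release the GIL during training.)

```python
from concurrent.futures import ThreadPoolExecutor
import os

def train_candidate(params):
    """Stand-in for single-threaded training of one candidate network.
    A real version would train a network from `params` and return its
    validation error; here we pretend 6 hidden neurons is optimal."""
    neurons, activation = params
    return (neurons, activation), abs(neurons - 6)

candidates = [(n, act) for n in (2, 4, 6, 8, 10)
              for act in ("sigmoid", "ReLU", "tanh")]

# One worker per core: each core works on a different candidate network.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(train_candidate, candidates))

best_params, best_score = min(results, key=lambda r: r[1])
print(best_params)
```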
Random Search Model Selection
It is also possible to use a random search for model selection. Instead of systematically trying every hyper-parameter combination, the random search method chooses random values for the hyper-parameters. For numeric ranges, you no longer need to specify a step value; the random search will choose from a continuous range of floating-point numbers between your specified beginning and ending points. For a random search, the programmer typically specifies either a time or an iteration limit. The following shows a random search, using the same axes as above, but limited to ten iterations:
Iteration #1: [3.298266736790538] [sigmoid]
Iteration #2: [9.569985574809834] [ReLU]
Iteration #3: [1.241154231596738] [sigmoid]
Iteration #4: [9.140498645836487] [sigmoid]
Iteration #5: [8.041758658131585] [tanh]
Iteration #6: [2.363519841339439] [ReLU]
Iteration #7: [9.72388393455185] [tanh]
Iteration #8: [3.411276006139815] [tanh]
Iteration #9: [3.1166220877785236] [sigmoid]
Iteration #10: [8.559433702612296] [sigmoid]
As you can see, the first axis, which is the hidden neuron count, is now taking on floating-point values. You can solve this problem by rounding the neuron count to the nearest whole number. It is also advisable to avoid retesting the same hyper-parameters more than once. As a result, the program should keep a list of previously tried hyper-parameters so that it does not repeat any set that falls within a small range of a previously tried set.
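Both fixes, rounding the numeric axis and skipping repeated sets, can be sketched as follows. This is an illustrative sketch, not a framework API; the axis bounds and seed are arbitrary.

```python
import random

def random_search(iterations, low, high, activations, seed=42):
    """Random model selection over one numeric axis and one list axis.
    Neuron counts are drawn as floats and rounded to whole neurons, and
    previously tried combinations are skipped so no set is tested twice."""
    rng = random.Random(seed)
    tried = set()
    trials = []
    while len(trials) < iterations:
        neurons = round(rng.uniform(low, high))   # round float to whole neurons
        act = rng.choice(activations)
        if (neurons, act) in tried:               # already evaluated: skip
            continue
        tried.add((neurons, act))
        trials.append((neurons, act))
    return trials

for i, t in enumerate(random_search(10, 1, 10, ["sigmoid", "ReLU", "tanh"]), 1):
    print(f"Iteration #{i}: {t}")
```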
The following URL uses JavaScript to show random search in action:

http://www.heatonresearch.com/aifh/vol3/random_iter.html
Other Model Selection Techniques
Model selection is a very active area of research, and, as a result, many innovative ways exist to perform it. Think of the hyper-parameters as a vector of values and the process of finding the best neural network score for those hyper-parameters as an objective function. You can then treat the hyper-parameter search as an optimization problem. We have previously examined many optimization algorithms in earlier volumes of this book series, including the following:

Ant Colony Optimization (ACO)
Genetic Algorithms
Genetic Programming
Hill Climbing
Nelder-Mead
Particle Swarm Optimization (PSO)
Simulated Annealing

We examined many of these algorithms in detail in Volumes 1 and 2 of Artificial Intelligence for Humans. Although the list of algorithms is long, the reality is that most of them are not suited for model selection, because the objective function for model selection is computationally expensive. It might take minutes, hours, or even days to train a neural network and determine how well a given set of hyper-parameters performs.
Nelder-Mead, and sometimes hill climbing, turn out to be the best options if you wish to apply an optimization algorithm to model selection. These algorithms attempt to minimize calls to the objective function, which are very expensive for a hyper-parameter search because each call trains a neural network. A good technique is to generate a set of hyper-parameters to use as a starting point and allow Nelder-Mead to improve them. Nelder-Mead is a good choice for a hyper-parameter search because it results in a relatively small number of calls to the objective function.
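Hill climbing, the simpler of the two options, can be illustrated on a single integer hyper-parameter. This is a minimal sketch with a toy objective standing in for "train a network and return its validation error"; a real search would span several axes and would count objective calls carefully.

```python
def hill_climb(objective, start, step=1, max_evals=30):
    """Simple hill climbing over an integer hyper-parameter, such as a
    hidden neuron count.  Each call to `objective` stands in for training
    a full network, so we try to make as few calls as possible."""
    current, score = start, objective(start)
    evals = 1
    while evals < max_evals:
        neighbors = [current - step, current + step]
        cand_scores = [(objective(c), c) for c in neighbors]
        evals += len(neighbors)
        best_score, best = min(cand_scores)
        if best_score >= score:       # no neighbor improves: local optimum
            break
        current, score = best, best_score
    return current, score

# Toy objective: validation error is lowest at 7 hidden neurons.
best, err = hill_climb(lambda n: (n - 7) ** 2, start=2)
print(best, err)   # 7 0
```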
Model selection is a very common part of Kaggle data science competitions. Based on competition discussions and reports, most participants use grid and random searches for model selection. Nelder-Mead is also popular. Another technique that is gaining in popularity is Bayesian optimization, as described by Snoek, Larochelle & Adams (2012). An implementation of this algorithm, written in Python, is called Spearmint, and you can find it at the following URL:

https://github.com/JasperSnoek/spearmint

Bayesian optimization is a relatively new technique for model selection on which we have only recently conducted research. Therefore, this book does not contain a more profound examination of it. Future editions may include more information on this technique.
Chapter Summary
As you learned in this chapter, it is possible to prune neural networks. Pruning a neural network removes connections and neurons in order to make the neural network more efficient. Execution speed, number of connections, and error are all measures of efficiency. Although neural networks must be effective at recognizing patterns, efficiency is the main goal of pruning. Several different algorithms can prune a neural network, and in this chapter we examined two of them. If your neural network is already operating sufficiently fast, you must evaluate whether the pruning is justified. Even when efficiency is important, you must weigh the trade-offs between efficiency and a reduction in the effectiveness of your neural network.
Model selection plays a significant role in neural network development. Hyper-parameters are settings such as hidden neuron count, layer count, and activation function selection. Model selection is the process of finding the set of hyper-parameters that will produce the best-trained neural network. A variety of algorithms can search through the possible settings of the hyper-parameters and find the best set.
Pruning can sometimes lead to a decrease in the tendency of neural networks to overfit. This reduction in overfitting is typically only a byproduct of the pruning process. Although pruning will sometimes have a regularizing effect, an entire group of algorithms, called regularization algorithms, exists specifically to reduce overfitting. We will focus exclusively on these algorithms in the next chapter.
Chapter 12: Dropout and Regularization

Regularization
L1 & L2 Regularization
Dropout Layers
Regularization is a technique that reduces overfitting, which occurs when neural networks attempt to memorize training data rather than learn from it. Humans are capable of overfitting as well. Before we examine the ways that a machine accidentally overfits, we will first explore how humans can suffer from it.
Human programmers often take certification exams to show their competence in a given programming language. To help prepare for these exams, the test makers often make practice exams available. Consider a programmer who enters a loop of taking the practice exam, studying more, and then taking the practice exam again. At some point, the programmer has memorized much of the practice exam, rather than learning the techniques necessary to figure out the individual questions. The programmer has now overfit to the practice exam. When this programmer takes the real exam, his actual score will likely be lower than what he earned on the practice exam.
A computer can overfit as well. Although a neural network received a high score on its training data, this result does not mean that the same neural network will score well on data that was not inside the training set. Regularization is one of the techniques that can prevent overfitting. A number of different regularization techniques exist. Most work by analyzing and potentially modifying the weights of a neural network as it trains.
L1 and L2 Regularization
L1 and L2 regularization are two common regularization techniques that can reduce the effects of overfitting (Ng, 2004). Both of these algorithms can either work with an objective function or as a part of the backpropagation algorithm. In both cases, the regularization algorithm is attached to the training algorithm by adding an additional objective.
Both of these algorithms work by adding a weight penalty to the neural network training. This penalty encourages the neural network to keep the weights small. L1 and L2 calculate this penalty differently. For gradient-descent-based algorithms, such as backpropagation, you can add this penalty calculation to the calculated gradients. For objective-function-based training, such as simulated annealing, the penalty is negatively combined with the objective score.
L1 and L2 differ in the way that they penalize the size of a weight. L1 will force the weights into a pattern similar to a Laplace distribution; L2 will force the weights into a pattern similar to a Gaussian distribution, as demonstrated by Figure 12.1:

Figure 12.1: L1 vs L2

As you can see, the L1 algorithm is more tolerant of weights further from 0, whereas the L2 algorithm is less tolerant. We will highlight other important differences between L1 and L2 in the following sections. You should also note that both L1 and L2 count their penalties based only on weights; they do not count penalties on bias values.
Understanding L1 Regularization
You should use L1 regularization to create sparsity in the neural network. In other words, the L1 algorithm will push many weight connections to near 0. When a weight is near 0, the program drops it from the network. Dropping weighted connections will create a sparse neural network.
Feature selection is a useful byproduct of sparse neural networks. Features are the values that the training set provides to the input neurons. Once all the weights of an input neuron reach 0, the neural network training determines that the feature is unnecessary. If your data set has a large number of input features that may not be needed, L1 regularization can help the neural network detect and ignore unnecessary features.
Equation 12.1 shows the penalty calculation performed by L1:

Equation 12.1: L1 Error Term Objective
Essentially, a programmer must balance two competing goals: achieving a low error score for the neural network and regularizing the weights. Both results have value, but the programmer has to choose their relative importance. The λ (lambda) value determines how important the L1 objective is compared to the neural network's error. A value of 0 means that L1 regularization is not considered at all, and a low network error is all that matters. A value of 0.5 means that L1 regularization is half as important as the error objective. Typical L1 values are below 0.1 (10%).
The main calculation performed by L1 is the summing of the absolute values (as indicated by the vertical bars) of all the weights. The bias values are not summed.
If you are using an optimization algorithm, such as simulated annealing, you can simply combine the value returned by Equation 12.1 with the score. You should combine this value with the score in such a way that it has a negative effect. If you are trying to minimize the score, then you should add the L1 value. Similarly, if you are trying to maximize the score, then you should subtract the L1 value.
If you are using L1 regularization with a gradient-descent-based training algorithm, such as backpropagation, you need to use a slightly different error term, as shown by Equation 12.2:

Equation 12.2: L1 Error Term
Equation 12.2 is nearly the same as Equation 12.1 except that we divide by n. The value n represents the number of training set evaluations. For example, if there were 100 training set elements and three output neurons, n would be 300. We derive this number because the program has three values to evaluate for each of those 100 elements. It is necessary to divide by n because the program applies Equation 12.2 at every training evaluation. This characteristic contrasts with Equation 12.1, which is applied once per training iteration.
To use Equation 12.2, we need to take its partial derivative with respect to the weight. Equation 12.3 shows this partial derivative:

Equation 12.3: L1 Weight Partial Derivative

To use this gradient, we add this value to every weight gradient calculated by the gradient-descent algorithm. This addition is only performed for weight values; the biases are left alone.
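The calculations above can be sketched directly from the description. This is one common formulation consistent with the text (sum of absolute weights scaled by λ/n, with a sign-based derivative); constant factors vary between texts, so treat the exact scaling as an assumption rather than the book's precise equations.

```python
def l1_penalty(weights, lam, n):
    """L1 error term: (lam / n) times the sum of the absolute values of
    the weights.  Bias values are excluded, as the text requires."""
    return (lam / n) * sum(abs(w) for w in weights)

def l1_gradient(w, lam, n):
    """Term added to one weight's backpropagation gradient: the partial
    derivative of the L1 term with respect to w is (lam / n) * sign(w)."""
    sign = (w > 0) - (w < 0)          # -1, 0, or +1
    return (lam / n) * sign

weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights, lam=0.1, n=300))   # (0.1 / 300) * 4.0
print(l1_gradient(-2.0, lam=0.1, n=300))     # negative: pushes w toward 0
```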
Understanding L2 Regularization
You should use L2 regularization when you are less concerned about creating a sparse network and more concerned about keeping the weight values low. Lower weight values will typically lead to less overfitting.
Equation 12.4 shows the penalty calculation performed by L2:

Equation 12.4: L2 Error Term Objective
Like the L1 algorithm, the λ (lambda) value determines how important the L2 objective is compared to the neural network's error. Typical L2 values are below 0.1 (10%). The main calculation performed by L2 is the summing of the squares of all of the weights. The bias values are not summed.
If you are using an optimization algorithm, such as simulated annealing, you can simply combine the value returned by Equation 12.4 with the score. You should combine this value with the score in such a way that it has a negative effect. If you are trying to minimize the score, then you should add the L2 value. Similarly, if you are trying to maximize the score, then you should subtract the L2 value.
If you are using L2 regularization with a gradient-descent-based training algorithm, such as backpropagation, you need to use a slightly different error term, as shown by Equation 12.5:

Equation 12.5: L2 Error Term
Equation 12.5 is nearly the same as Equation 12.4, except that we again divide by n. To use Equation 12.5, we need to take its partial derivative with respect to the weight. Equation 12.6 shows the partial derivative of Equation 12.5:

Equation 12.6: L2 Weight Partial Derivative

To use this gradient, you need to add this value to every weight gradient calculated by the gradient-descent algorithm. This addition is only performed on weight values; the biases are left alone.
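As with L1, the L2 calculation can be sketched from the description. Treat the scaling as an assumption: some texts define the penalty with a factor of 1/2 so the 2 cancels in the derivative, and the book's exact constants may differ.

```python
def l2_penalty(weights, lam, n):
    """L2 error term: (lam / n) times the sum of the squared weights.
    Bias values are excluded, as the text requires."""
    return (lam / n) * sum(w * w for w in weights)

def l2_gradient(w, lam, n):
    """Term added to one weight's gradient: the derivative of
    (lam / n) * w**2 with respect to w is (2 * lam / n) * w, so large
    weights are penalized proportionally harder than small ones."""
    return (2.0 * lam / n) * w

weights = [0.5, -2.0, 0.0, 1.5]
print(l2_penalty(weights, lam=0.1, n=300))   # (0.1 / 300) * 6.5
print(l2_gradient(-2.0, lam=0.1, n=300))     # gradient step pushes w toward 0
```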
Dropout Layers
Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov (2012) introduced the dropout regularization algorithm. Although dropout works in a different way than L1 and L2, it accomplishes the same goal: the prevention of overfitting. However, the algorithm goes about the task by actually removing neurons and connections, at least temporarily. Unlike L1 and L2, no weight penalty is added. Dropout does not directly seek to train small weights.
Dropout works by causing hidden neurons of the neural network to be unavailable during part of the training. Dropping part of the neural network causes the remaining portion to be trained to still achieve a good score even without the dropped neurons. This decreases co-adaptation between neurons, which results in less overfitting.
Dropout Layer
Most neural network frameworks implement dropout as a separate layer. Dropout layers function like regular, densely connected neural network layers. The only difference is that the dropout layers will periodically drop some of their neurons during training. You can use dropout layers on regular feedforward neural networks. In fact, they can also become layers in convolutional LeNet-5 networks like we studied in Chapter 10, “Convolutional Neural Networks.”
The usual hyper-parameters for a dropout layer are the following:

Neuron Count
Activation Function
Dropout Probability

The neuron count and activation function hyper-parameters work exactly the same way as their corresponding parameters in the dense layer type mentioned in Chapter 10, “Convolutional Neural Networks.” The neuron count simply specifies the number of neurons in the dropout layer. The dropout probability indicates the likelihood of a neuron dropping out during the training iteration. Just as it does for a dense layer, the program specifies an activation function for the dropout layer.
Implementing a Dropout Layer
The program implements a dropout layer as a dense layer that can eliminate some of its neurons. Contrary to popular belief about the dropout layer, the program does not permanently remove these discarded neurons. A dropout layer does not lose any of its neurons during the training process, and it will still have exactly the same number of neurons after training. In this way, the program only temporarily masks the neurons rather than dropping them.
Figure 12.2 shows how a dropout layer might be situated with other layers:

Figure 12.2: Dropout Layer

The discarded neurons and their connections are shown as dashed lines. The input layer has two input neurons as well as a bias neuron. The second layer is a dense layer with three neurons as well as a bias neuron. The third layer is a dropout layer with six regular neurons, even though the program has dropped 50% of them. While the program drops these neurons, it neither calculates nor trains them. However, the final neural network will use all of these neurons for the output. As previously mentioned, the program only temporarily discards the neurons.
During subsequent training iterations, the program chooses different sets of neurons from the dropout layer. Although we chose a probability of 50% for dropout, the computer will not necessarily drop three neurons. It is as if we flipped a coin for each of the dropout-candidate neurons to choose if that neuron was dropped out. You must know that the program should never drop the bias neuron. Only the regular neurons on a dropout layer are candidates.
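The coin-flip masking can be sketched for a single forward pass. This sketch uses "inverted" dropout, one common implementation in which surviving activations are scaled up so their expected sum is unchanged; real frameworks differ in where they apply this scaling. The bias neuron is not passed in, since it is never a dropout candidate.

```python
import random

def dropout_forward(activations, drop_prob, rng=random):
    """One training-time forward pass through a dropout layer: each regular
    neuron is independently dropped with probability `drop_prob` (its output
    masked to 0), and survivors are scaled by 1/(1 - drop_prob) so the
    layer's expected output is unchanged."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
# Six regular neurons from the dropout layer in Figure 12.2, 50% dropout.
print(dropout_forward([0.7, 0.1, 0.9, 0.4, 0.2, 0.6], drop_prob=0.5))
```

At test (inference) time, no neurons are dropped; with inverted dropout, no extra scaling is needed then either.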
The implementation of the training algorithm influences the process of discarding neurons. The dropout set frequently changes once per training iteration or batch. The program can also provide intervals where all neurons are present. Some neural network frameworks give additional hyper-parameters to allow you to specify exactly the rate of this interval.
Why dropout is capable of decreasing overfitting is a common question. The answer is that dropout can reduce the chance of a codependency developing between two neurons. Two neurons that develop a codependency will not be able to operate effectively when one is dropped out. As a result, the neural network can no longer rely on the presence of every neuron, and it trains accordingly. This characteristic decreases its ability to memorize the information presented to it, thereby forcing generalization.
Dropout also decreases overfitting by forcing a bootstrapping process upon the neural network. Bootstrapping is a very common ensemble technique. We will discuss ensembling in greater detail in Chapter 16, “Modeling with Neural Networks.” Basically, ensembling is a technique of machine learning that combines multiple models to produce a better result than those achieved by individual models. Ensemble is a term that originates from musical ensembles, in which the final music product that the audience hears is the combination of many instruments.
Bootstrapping is one of the simplest ensemble techniques. The programmer using bootstrapping simply trains a number of neural networks to perform exactly the same task. However, each of these neural networks will perform differently because of some training techniques and the random numbers used in the neural network weight initialization. The difference in weights causes the performance variance. The output from this ensemble of neural networks becomes the average output of the members taken together. This process decreases overfitting through the consensus of differently trained neural networks.
Dropout works somewhat like bootstrapping. You might think of each neural network that results from a different set of neurons being dropped out as an individual member of an ensemble. As training progresses, the program creates more neural networks in this way. However, dropout does not require the same amount of processing as bootstrapping. The new neural networks created are temporary; they exist only for a training iteration. The final result is also a single neural network, rather than an ensemble of neural networks to be averaged together.
Using Dropout
In this chapter, we will continue to evolve the book's MNIST handwritten digits example. We examined this data set in the book's introduction and used it in several examples.
The example for this chapter uses the training set to fit a dropout neural network. The program subsequently evaluates the test set on this trained network to view the results. Both dropout and non-dropout versions of the neural network have results to examine.
The dropout neural network used the following hyper-parameters:

Activation Function: ReLU
Input Layer: 784 (28x28)
Hidden Layer 1: 1,000
Dropout Layer: 500 units, 50%
Hidden Layer 2: 250
Output Layer: 10 (because there are 10 digits)

We selected the above hyper-parameters through experimentation. By rounding the number of input neurons up to the next even unit, we chose a first hidden layer of 1,000. Each subsequent layer halved this amount. Placing the dropout layer between the two hidden layers provided the best improvement in the error rate. We also tried placing it both before hidden layer 1 and after hidden layer 2. Most of the overfitting occurred between the two hidden layers.
We used the following hyper-parameters for the regular neural network. This network is essentially the same as the dropout network, except that an additional hidden layer replaces the dropout layer.

Activation Function: ReLU
Input Layer: 784 (28x28)
Hidden Layer 1: 1,000
Hidden Layer 2: 500
Hidden Layer 3: 250
Output Layer: 10 (because there are 10 digits)
The results are shown here:

ReLU:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.70%)

ReLU + Dropout:
Best valid loss was 0.065753 at epoch 5.
Incorrect 120/10000 (1.20%)
As you can see, the dropout neural network achieved a better error rate than the ReLU-only neural network from earlier in the book. By reducing the amount of overfitting, the dropout network got a better score. You should also notice that, although the non-dropout network did achieve a better training score, this result is not good: it indicates overfitting. Of course, these results will vary, depending on the platform used.
Chapter Summary
We introduced several regularization techniques that can reduce overfitting. When the neural network memorizes the input and expected output, overfitting occurs because the program has not learned to generalize. Many different regularization techniques can force the neural network to learn to generalize. We examined L1, L2, and dropout. L1 and L2 work similarly by imposing penalties on weights that are too large. The purpose of these penalties is to reduce complexity in the neural network. Dropout takes an entirely different approach by randomly removing various neurons and forcing the training to continue with a partial neural network.
The L1 algorithm penalizes large weights and forces many of the weights to approach 0. We consider the weights that reach a zero value to be dropped from the neural network. This reduction produces a sparse neural network. If all weighted connections between an input neuron and the next layer are removed, you can assume that the feature connected to that input neuron is unimportant. Feature selection is choosing input features based on their importance to the neural network. The L2 algorithm also penalizes large weights, but it does not tend to produce neural networks that are as sparse as those produced by the L1 algorithm.
Dropout randomly drops neurons in a designated dropout layer. The neurons that were dropped from the network are not gone as they were in pruning. Instead, the dropped neurons are temporarily masked from the neural network. The set of dropped neurons changes during each training iteration. Dropout forces the neural network to continue functioning when neurons are removed. This makes it difficult for the neural network to memorize and overfit.
So far, we have explored only feedforward neural networks in this volume. In this type of network, the connections only move forward from the input layer to the hidden layers and ultimately to the output layer. Recurrent neural networks allow backward connections to previous layers. We will analyze this type of neural network in the next chapter.
Additionally, we have focused primarily on using neural networks to recognize patterns. We can also teach neural networks to predict future trends. By providing a neural network with a series of time-based values, it can predict subsequent values. In the next chapter, we will also demonstrate predictive neural networks. We refer to this type of neural network as a temporal neural network. Recurrent neural networks can often make temporal predictions.
Chapter 13: Time Series and Recurrent Networks

Time Series
Elman Networks
Jordan Networks
Deep Recurrent Networks
In this chapter, we will examine time series encoding and recurrent networks, two topics that are logical to put together because they are both methods for dealing with data that spans over time. Time series encoding deals with representing events that occur over time to a neural network. There are many different methods to encode data that occur over time to a neural network. This encoding is necessary because a feedforward neural network will always produce the same output vector for a given input vector. Recurrent neural networks do not require encoding of time series data because they are able to automatically handle data that occur over time.
The variation in temperature during the week is an example of time series data. For instance, suppose we know that today's temperature is 25 degrees and tomorrow's temperature is 27 degrees. A traditional feedforward neural network will always respond with the same output for a given input, so a feedforward neural network trained to predict tomorrow's temperature would respond 27 for an input of 25. The fact that it will always output 27 when given 25 might be a hindrance to its predictions. Surely a temperature of 27 will not always follow 25. It would be better for the neural network to consider the temperatures for a series of days before the day being predicted. Perhaps the temperature over the last week might allow us to predict tomorrow's temperature. Recurrent neural networks and time series encoding represent two different approaches to the problem of representing data over time to a neural network.
So far, the neural networks that we've examined have always had forward connections. The input layer always connects to the first hidden layer. Each hidden layer always connects to the next hidden layer. The final hidden layer always connects to the output layer. This manner of connecting layers is the reason that these networks are called “feedforward.” Recurrent neural networks are not so rigid, as backward connections are also allowed. A recurrent connection links a neuron in a layer to either a previous layer or the neuron itself. Most recurrent neural network architectures maintain state in the recurrent connections. Feedforward neural networks don't maintain any state. A recurrent neural network's state acts as a sort of short-term memory for the neural network. Consequently, a recurrent neural network will not always produce the same output for a given input.
Time Series Encoding
As we saw in previous chapters, neural networks are particularly good at recognizing patterns, which helps them predict future patterns in data. We refer to a neural network that predicts future patterns as a predictive, or temporal, neural network. These predictive neural networks can anticipate future events, such as stock market trends and sunspot cycles.

Many different kinds of neural networks can predict. In this section, the feedforward neural network will attempt to learn patterns in data so it can predict future values. Like all problems applied to neural networks, prediction is a matter of intelligently determining how to configure the input and interpret the output neurons for a problem. Because the type of feedforward neural networks in this book always produce the same output for a given input, we need to make sure that we encode the input correctly.

A wide variety of methods can encode time series data for a neural network. The sliding window algorithm is one of the simplest and most popular encoding algorithms. However, more complex algorithms allow the following considerations:

Weighting older values as less important than newer ones
Smoothing/averaging over time
Other domain-specific (e.g., finance) indicators

We will focus on the sliding window algorithm as our encoding method for time series. The sliding window algorithm works by dividing the data into two windows that represent the past and the future. You must specify the sizes of both windows. For example, if you want to predict future prices with the daily closing price of a stock, you must decide how far into the past and the future you wish to examine. You might want to predict the next two days using the last five closing prices. In this case, you would have a neural network with five input neurons and two output neurons.
Encoding Data for Input and Output Neurons
Consider a simple series of numbers, such as the sequence shown here:

1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1

A neural network that predicts numbers from this sequence might use three input neurons and a single output neuron. The following training set has a prediction window of size 1 and a past window of size 3:
[1, 2, 3] -> [4]
[2, 3, 4] -> [3]
[3, 4, 3] -> [2]
[4, 3, 2] -> [1]
As you can see, the neural network is prepared to receive several data samples in a sequence. The output neuron then predicts how the sequence will continue. The idea is that you can now feed any sequence of three numbers, and the neural network will predict the fourth number. Each data point is called a time slice. Therefore, each input neuron represents a known time slice, and the output neurons represent future time slices.
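The sliding window encoding just described is easy to sketch in a few lines of Python. The helper name `sliding_window` is ours for illustration; it is not part of any framework used in this book:

```python
def sliding_window(series, past_size, future_size):
    """Encode a time series into (past, future) training pairs."""
    pairs = []
    last_start = len(series) - past_size - future_size
    for i in range(last_start + 1):
        past = series[i:i + past_size]
        future = series[i + past_size:i + past_size + future_size]
        pairs.append((past, future))
    return pairs

seq = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
for x, y in sliding_window(seq, 3, 1)[:4]:
    print(x, '->', y)  # reproduces the four training pairs above
```

Changing `future_size` to 2 produces the two-value prediction windows shown in the next example.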
It is also possible to predict more than one value into the future. The following training set has a prediction window of size 2 and a past window of size 3:
[1, 2, 3] -> [4, 3]
[2, 3, 4] -> [3, 2]
[3, 4, 3] -> [2, 1]
[4, 3, 2] -> [1, 2]
The last two examples have only a single stream of data. It is possible to use multiple streams of data to predict. For example, you might predict the price of a stock with both its price and its volume. Consider the following two streams:

Stream #1: 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1

Stream #2: 10, 20, 30, 40, 30, 20, 10, 20, 30, 40, 30, 20, 10

You can predict stream #1 with streams #1 and #2. You simply need to add the stream #2 values next to the stream #1 values. The following training set encodes this arrangement with a prediction window of size 1 and a past window of size 3:

[1, 10, 2, 20, 3, 30] -> [4]
[2, 20, 3, 30, 4, 40] -> [3]
[3, 30, 4, 40, 3, 30] -> [2]
[4, 40, 3, 30, 2, 20] -> [1]

This same technique works for any number of streams. In this case, stream #1 helps to predict itself. However, the stream that we're predicting doesn't need to be among the streams providing the data to form the prediction. For example, you could use the stock prices of IBM and Apple to predict Microsoft; this technique would use three streams.
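Interleaving multiple streams works the same way as the single-stream window; the sketch below (again, an illustrative helper rather than framework code) produces the training pairs shown above:

```python
def multi_stream_window(streams, target, past_size, future_size):
    """Interleave several input streams into sliding-window training pairs."""
    n = len(target)
    pairs = []
    for i in range(n - past_size - future_size + 1):
        x = []
        for t in range(i, i + past_size):
            for stream in streams:  # one value per stream per time slice
                x.append(stream[t])
        y = target[i + past_size:i + past_size + future_size]
        pairs.append((x, y))
    return pairs

s1 = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
s2 = [10, 20, 30, 40, 30, 20, 10, 20, 30, 40, 30, 20, 10]
print(multi_stream_window([s1, s2], s1, 3, 1)[0])  # ([1, 10, 2, 20, 3, 30], [4])
```

To predict a stream that is not among the inputs, pass it as `target` while leaving it out of `streams`.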
Predicting the Sine Wave
The example in this section is relatively simple. We present a neural network that predicts the sine wave, which is mathematically predictable. However, programmers can easily understand the sine wave, and it varies over time. These qualities make it a good introduction to predictive neural networks.

You can see the sine wave by plotting the trigonometric sine function. Figure 13.1 shows the sine wave:

Figure 13.1: The sine wave

The sine wave function trains the neural network, and backpropagation adjusts the weights to emulate the sine wave. When you first execute the sine wave example, you will see the results of the training process. Typical output from the sine wave predictor's training process follows:
Iteration #1 Error: 0.48120350975475823
Iteration #2 Error: 0.36753445768855236
Iteration #3 Error: 0.3212066601426759
Iteration #4 Error: 0.2952410514715732
Iteration #5 Error: 0.2780102928778258
Iteration #6 Error: 0.26556861969786527
Iteration #7 Error: 0.25605359706505776
Iteration #8 Error: 0.24842242500053566
Iteration #9 Error: 0.24204767544134156
Iteration #10 Error: 0.23653845782593882
...
Iteration #4990 Error: 0.02319397662897425
Iteration #4991 Error: 0.02319310934886356
Iteration #4992 Error: 0.023192242246688515
Iteration #4993 Error: 0.02319137532183077
Iteration #4994 Error: 0.023190508573672858
Iteration #4995 Error: 0.02318964200159761
Iteration #4996 Error: 0.02318877560498862
Iteration #4997 Error: 0.02318790938322986
Iteration #4998 Error: 0.023187043335705867
Iteration #4999 Error: 0.023186177461801745
In the beginning, the error rate is fairly high at 48%. By the second iteration, this rate quickly begins to fall, reaching 36.7%. By the time the 4,999th iteration has occurred, the error rate has fallen to 2.3%. The program is designed to stop before hitting the 5,000th iteration. This succeeds in reducing the error rate to less than 0.03.

Additional training would produce a better error rate; however, by limiting the iterations, the program is able to finish in only a few minutes on a regular computer. This program took about two minutes to execute on an Intel i7 computer.
Once the training is complete, the sine wave is presented to the neural network for prediction. You can see the output from this prediction here:
5: Actual=0.76604: Predicted=0.7892166200864351: Difference=2.32%
6: Actual=0.86602: Predicted=0.8839210963512845: Difference=1.79%
7: Actual=0.93969: Predicted=0.934526031234053: Difference=0.52%
8: Actual=0.9848: Predicted=0.9559577688326862: Difference=2.88%
9: Actual=1.0: Predicted=0.9615566601973113: Difference=3.84%
10: Actual=0.9848: Predicted=0.9558060932656686: Difference=2.90%
11: Actual=0.93969: Predicted=0.9354447787244102: Difference=0.42%
12: Actual=0.86602: Predicted=0.8894014978439005: Difference=2.34%
13: Actual=0.76604: Predicted=0.801342405700056: Difference=3.53%
14: Actual=0.64278: Predicted=0.6633506809125252: Difference=2.06%
15: Actual=0.49999: Predicted=0.4910483600917853: Difference=0.89%
16: Actual=0.34202: Predicted=0.31286152780645105: Difference=2.92%
17: Actual=0.17364: Predicted=0.14608325263568134: Difference=2.76%
18: Actual=0.0: Predicted=-0.008360016796238434: Difference=0.84%
19: Actual=-0.17364: Predicted=-0.15575381460132823: Difference=1.79%
20: Actual=-0.34202: Predicted=-0.3021775158559559: Difference=3.98%
...
490: Actual=-0.64278: Predicted=-0.6515076637590029: Difference=0.87%
491: Actual=-0.76604: Predicted=-0.8133333939237001: Difference=4.73%
492: Actual=-0.86602: Predicted=-0.9076496572125671: Difference=4.16%
493: Actual=-0.93969: Predicted=-0.9492579517460149: Difference=0.96%
494: Actual=-0.9848: Predicted=-0.9644567437192423: Difference=2.03%
495: Actual=-1.0: Predicted=-0.9664801515670861: Difference=3.35%
496: Actual=-0.9848: Predicted=-0.9579489752650393: Difference=2.69%
497: Actual=-0.93969: Predicted=-0.9340105440194074: Difference=0.57%
498: Actual=-0.86602: Predicted=-0.8829925066754494: Difference=1.70%
499: Actual=-0.76604: Predicted=-0.7913823031308845: Difference=2.53%
As you can see, we present both the actual and predicted values for each element. We trained the neural network on the first 250 elements; however, the neural network is able to predict beyond the first 250. You will also notice that the difference between the actual values and the predicted values rarely exceeds 3%.

The sliding window is not the only way to encode a time series. Other time series encoding algorithms can be very useful for specific domains. For example, many technical indicators exist that help to find patterns in the value of securities such as stocks, bonds, and currency pairs.
Simple Recurrent Neural Networks
Recurrent neural networks do not force the connections to flow only from one layer to the next, from input layer to output layer. A recurrent connection occurs when a connection is formed between a neuron and one of the following other types of neurons:

The neuron itself
A neuron on the same level
A neuron on a previous level

Recurrent connections can never target the input neurons or the bias neurons.

The processing of recurrent connections can be challenging. Because the recurrent links create endless loops, the neural network must have some way to know when to stop; a neural network that entered an endless loop would not be useful. To prevent endless loops, we can calculate the recurrent connections with the following three approaches:

Context neurons
Calculating output over a fixed number of iterations
Calculating output until the neuron outputs stabilize
We refer to neural networks that use context neurons as simple recurrent networks (SRNs). The context neuron is a special neuron type that remembers its input and provides that input as its output the next time that we calculate the network. For example, if we gave a context neuron 0.5 as input, it would output 0, because context neurons always output 0 on their first call. However, if we then gave the context neuron 0.6 as input, the output would be 0.5. We never weight the input connections to a context neuron, but we can weight the output from a context neuron just like any other connection in a network. Figure 13.2 shows a typical context neuron:
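The behavior just described can be captured in a tiny class. This is a minimal sketch of the idea only, not the implementation used by any particular framework:

```python
class ContextNeuron:
    """Remembers its input and returns it on the next calculation."""

    def __init__(self):
        self.state = 0.0  # context neurons always output 0 on the first call

    def compute(self, value):
        out = self.state   # emit what was stored last time
        self.state = value  # remember the new input for the next call
        return out

c = ContextNeuron()
print(c.compute(0.5))  # 0.0 (first call)
print(c.compute(0.6))  # 0.5 (the previous input)
```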
Figure 13.2: Context Neuron
Context neurons allow us to calculate a neural network in a single feedforward pass. Context neurons usually occur in layers. A layer of context neurons will always have the same number of context neurons as neurons in its source layer, as demonstrated by Figure 13.3:

Figure 13.3: Context Layer

As you can see from the above layer, the two hidden neurons labeled hidden 1 and hidden 2 directly connect to the two context neurons. The dashed lines on these connections indicate that these are not weighted connections. These weightless connections are never dense; if they were dense, hidden 1 would be connected to both context neurons. Instead, the direct connection simply joins each hidden neuron to its corresponding context neuron. The two context neurons form dense, weighted connections to the two hidden neurons. Finally, the two hidden neurons also form dense connections to the neurons in the next layer. The two context neurons would form two connections to a single neuron in the next layer, four connections to two neurons, six connections to three neurons, and so on.

You can combine context neurons with the input, hidden, and output layers of a neural network in many different ways. In the next two sections, we explore two common SRN architectures.
Elman Neural Networks
In 1990, Elman introduced a neural network that provides pattern recognition for time series. This neural network type has one input neuron for each stream that you are using to predict and one output neuron for each time slice you are trying to predict. A single hidden layer is positioned between the input and output layers. A layer of context neurons takes its input from the hidden layer's output and feeds back into the same hidden layer. Consequently, the context layer always has the same number of neurons as the hidden layer, as demonstrated by Figure 13.4:

Figure 13.4: Elman SRN

The Elman neural network is a good general-purpose architecture for simple recurrent neural networks. You can pair any reasonable number of input neurons with any number of output neurons. Using normal weighted connections, the two context neurons are fully connected with the two hidden neurons. The two context neurons receive their state from the two non-weighted connections (dashed lines) from each of the two hidden neurons.
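A single time step of an Elman network of the kind pictured can be sketched with NumPy. The weight-matrix names are ours, the 2-2-1 sizes match the figure, and biases are omitted for brevity; this is an illustrative forward pass, not trained code:

```python
import numpy as np

def elman_step(x, context, W_in, W_ctx, W_out):
    """One forward pass of a 2-2-1 Elman SRN."""
    # the hidden layer sees the inputs plus the weighted context outputs
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    output = np.tanh(W_out @ hidden)
    return output, hidden  # hidden becomes the next step's context

rng = np.random.default_rng(42)
W_in = rng.standard_normal((2, 2))   # input -> hidden
W_ctx = rng.standard_normal((2, 2))  # context -> hidden (dense, weighted)
W_out = rng.standard_normal((1, 2))  # hidden -> output

context = np.zeros(2)  # the context starts at zero, like a context neuron's first call
for x in ([0.1, 0.2], [0.3, 0.4]):
    y, context = elman_step(np.array(x), context, W_in, W_ctx, W_out)
```

Because `context` carries the previous hidden activations forward, feeding the same input twice can produce different outputs, which is exactly the short-term memory described earlier.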
Jordan Neural Networks
In 1993, Jordan introduced a neural network to control electronic systems. This style of SRN is similar to the Elman network. However, the context neurons are fed from the output layer instead of the hidden layer. We also refer to the context units in a Jordan network as the state layer. They have a recurrent connection to themselves, with no other nodes on this connection, as seen in Figure 13.5:

Figure 13.5: Jordan SRN

The Jordan neural network requires the same number of context neurons as output neurons. Therefore, if we have one output neuron, the Jordan network will have a single context neuron. This equality can be problematic if you have only a single output neuron, because you will be able to have just one context neuron.

The Elman neural network is applicable to a wider array of problems than the Jordan network because the larger hidden layer creates more context neurons. As a result, the Elman network can recall more complex patterns because it captures the state of the hidden layer from the previous iteration. This state is never bipolar, since the hidden layer represents the first line of feature detectors.

Additionally, if we increase the size of the hidden layer to account for a more complex problem, we also get more context neurons with an Elman network. The Jordan network doesn't produce this effect. To create more context neurons with a Jordan network, we must add more output neurons, and we cannot add output neurons without changing the definition of the problem.

When to use a Jordan network is a common question. Programmers originally developed this network type for robotics research. Neural networks that are designed for robotics typically have input neurons connected to sensors and output neurons connected to actuators (typically motors). Because each motor has its own output neuron, neural networks for robots will generally have more output neurons than regression neural networks that predict a single value.
Backpropagation through Time
You can train SRNs with a variety of methods. Because SRNs are neural networks, you can train their weights with any optimization algorithm, such as simulated annealing, particle swarm optimization, Nelder-Mead, or others. Regular backpropagation-based algorithms can also train the SRN. Mozer (1995), Robinson & Fallside (1987), and Werbos (1988) each invented an algorithm specifically designed for SRNs. Programmers refer to this algorithm as backpropagation through time (BPTT). Sjoberg, Zhang, Ljung, et al. (1995) determined that backpropagation through time provides superior training performance compared to general optimization algorithms, such as simulated annealing. However, backpropagation through time is even more sensitive to local minima than standard backpropagation.

Backpropagation through time works by unfolding the SRN into a regular neural network. To unfold the SRN, we construct a chain of neural networks equal to how far back in time we wish to go. We start with a neural network that contains the inputs for the current time, known as t. Next, we replace the context with the entire neural network, up to the context neuron's input. We continue for the desired number of time slices and replace the final context neuron with a 0. Figure 13.6 illustrates this process for two time slices.
Figure 13.6: Unfolding to Two Time Slices
This unfolding can continue deeper; Figure 13.7 shows three time slices:

Figure 13.7: Unfolding to Three Time Slices
You can apply this abstract concept to actual SRNs. Figure 13.8 illustrates a two-input, two-hidden, one-output Elman neural network unfolded to two time slices:

Figure 13.8: Elman Unfolded to Two Time Slices

As you can see, there are inputs for both t (the current time) and t-1 (one time slice in the past). The bottom neural network stops at the hidden neurons because you don't need anything beyond the hidden neurons to calculate the context input. The bottom network structure becomes the context for the top network structure. Of course, the bottom structure would have had a context as well that connects to its hidden neurons. However, because the output neuron does not contribute to the context, only the top network (current time) has one.

It is also possible to unfold a Jordan neural network. Figure 13.9 shows a two-input, two-hidden, one-output Jordan network unfolded to two time slices.

Figure 13.9: Jordan Unfolded to Two Time Slices

Unlike the Elman network, you must calculate the entire Jordan network to determine the context. As a result, we calculate the previous time slice (bottom network) all the way to the output neuron.
To train the SRN, we can use regular backpropagation to train the unfolded network. However, at the end of each iteration, we average the weights of all the folds to obtain the weights for the SRN. Listing 13.1 describes the BPTT algorithm:
Listing 13.1: Backpropagation through Time (BPTT)

def bptt(a, y):
    # a[t] is the input at time t; y[t] is the output
    .. unfold the network to contain k instances of f
    .. see the figure above ..
    while stopping criteria not met:
        # x is the current context
        x = []
        for t from 0 to n - 1:
            # t is time; n is the length of the training sequence
            .. set the network inputs to x, a[t], a[t+1], ..., a[t+k-1]
            p = .. forward-propagation of the inputs
                .. over the whole unfolded network
            # error = target - prediction
            e = y[t+k] - p
            .. back-propagate the error, e, back across
            .. the whole unfolded network
            .. update all the weights in the network
            .. average the weights in each instance of f together,
            .. so that each f is identical
            # compute the context for the next time step
            x = f(x)
Gated Recurrent Units
Although recurrent neural networks have never been as popular as regular feedforward neural networks, active research on them continues. Chung, Hyun & Bengio (2014) introduced the gated recurrent unit (GRU) to allow recurrent neural networks to function in conjunction with deep neural networks by solving some inherent limitations of recurrent neural networks. GRUs are neurons that serve a role similar to the context neurons seen previously in this chapter.

It is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish (most of the time) or explode (rarely, but with severe effects), as demonstrated by Chung, Hyun & Bengio (2015).

As of the 2015 publication of this book, GRUs are less than a year old. Because of the cutting-edge nature of GRUs, most major neural network frameworks do not currently include them. If you would like to experiment with GRUs, the Python Theano-based framework Keras includes them. You can find the framework at the following URL:
https://github.com/fchollet/keras
Though we usually use Lasagne, Keras is one of many Theano-based frameworks for Python, and it is also one of the first to support GRUs. This section contains a brief, high-level introduction to the GRU, and we will update the book's examples as needed to support this technology as it becomes available. Refer to the book's example code for up-to-date information on example availability for the GRU.
A GRU uses two gates to overcome these limitations, as shown in Figure 13.10:

Figure 13.10: Gated Recurrent Unit (GRU)

The gates are indicated by z, the update gate, and r, the reset gate. The values h and tilde-h represent the activation (output) and the candidate activation. It is important to note that these gates specify ranges, rather than simply being on or off.

The primary difference between the GRU and traditional recurrent neural networks is that the entire context value does not change each iteration, as it does in the SRN. Rather, the update gate governs the degree of update to the context activation that occurs. Additionally, the reset gate allows the context to be reset.
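As a rough sketch, one GRU step can be written as follows. The weight names are ours, and the update form follows the commonly published GRU equations, which match the z/r/h/tilde-h description above; treat this as illustration rather than a framework implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, W):
    """One step of a gated recurrent unit; W is a dict of weight matrices."""
    z = sigmoid(W['z_x'] @ x + W['z_h'] @ h)  # update gate: how much h may change
    r = sigmoid(W['r_x'] @ x + W['r_h'] @ h)  # reset gate: how much old state to use
    h_tilde = np.tanh(W['h_x'] @ x + W['h_h'] @ (r * h))  # candidate activation
    return (1 - z) * h + z * h_tilde  # gated blend of old state and candidate

rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 3)) * 0.1
     for k in ('z_x', 'z_h', 'r_x', 'r_h', 'h_x', 'h_h')}
h = np.zeros(3)
h = gru_step(np.array([1.0, 0.0, 0.5]), h, W)
```

Note how the final line blends the old state and the candidate: when z is near 0 the context barely changes, unlike an SRN's context, which is overwritten every iteration.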
Chapter Summary
In this chapter, we introduced several methods that can handle time series data with neural networks. A feedforward neural network produces the same output when provided the same input. As a result, feedforward neural networks are said to be deterministic. This quality does not allow a feedforward neural network to produce output based on a series of inputs. If your application must provide output based on a series of inputs, you have two choices: you can encode the time series into an input feature vector, or you can use a recurrent neural network.

Encoding a time series is a way of capturing time series information in a feature vector that is fed to a feedforward neural network. A number of methods encode time series data. We focused on sliding window encoding. This method specifies two windows. The first window determines how far into the past to use for prediction. The second window determines how far into the future to predict.

Recurrent neural networks are another method to deal with time series data. Encoding is not necessary with a recurrent neural network because it is able to remember previous inputs to the neural network. This short-term memory allows the neural network to see patterns in time. Simple recurrent networks use context neurons to remember the state from previous computations. We examined the Elman and Jordan SRNs. Additionally, we introduced a very new neuron type called the gated recurrent unit (GRU). This neuron type does not immediately update its context value as the Elman and Jordan networks do; instead, two gates govern the degree of update.

Hyper-parameters define the structure of a neural network and ultimately determine its effectiveness for a particular problem. In the previous chapters of this book, we introduced hyper-parameters such as the number of hidden layers and neurons, the activation functions, and other governing attributes of neural networks. Determining the correct set of hyper-parameters is often a difficult task of trial and error. However, some automated processes can make this process easier, and some rules of thumb can help architect these neural networks. We cover these pointers, as well as the automated processes, in the next chapter.
Chapter 14: Architecting Neural Networks
Hyper-parameters
Learning Rate & Momentum
Hidden Structure
Activation Functions
Hyper-parameters, as mentioned in previous chapters, are the numerous settings for models such as neural networks. Activation functions, hidden neuron counts, layer structure, convolution, max-pooling, and dropout are all examples of neural network hyper-parameters. Finding the optimal set of hyper-parameters can seem a daunting task, and, indeed, it is one of the most time-consuming tasks for the AI programmer. However, do not fear: we will provide you with a summary of the current research on neural network architecture in this chapter. We will also show you how to conduct experiments to help determine the optimal architecture for your own networks.

We will make architectural recommendations in two ways. First, we will report on recommendations from scientific literature in the field of AI. These recommendations will include citations so that you can examine the original papers; however, we will strive to present the key concept of each article in an approachable manner. The second way will be through experimentation: we will run several competing architectures and report the results.

You need to remember that a few hard and fast rules do not dictate the optimal architecture for every project. Every dataset is different, and, as a result, the optimal neural network for every dataset is also different. Thus, you must always perform some experimentation to determine a good architecture for your network.
Evaluating Neural Networks
Neural networks start with random weights. Additionally, some training algorithms use random values as well. All considered, we're dealing with quite a bit of randomness when we try to make comparisons. Random number seeds are a common solution to this issue; however, a constant seed does not provide an equal comparison when we are evaluating neural networks with different neuron counts.

Consider comparing a neural network with 32 connections against another network with 64 connections. While the seed guarantees that the first 32 connections retain the same values, there are now 32 additional connections that will have new random values. Furthermore, those 32 weights in the first network might not be in the same locations in the second network, even if the random seed is maintained between the two initial weight sets.

To compare architectures, we must perform several training runs and average the final results. Because these extra training runs add to the total runtime of the program, excessive numbers of runs will quickly become impractical. It can also be beneficial to choose a training algorithm that is deterministic (one that does not use random numbers). The experiments that we will perform in this chapter use five training runs and the resilient propagation (RPROP) training algorithm. RPROP is deterministic, and five runs is an arbitrary choice that provides a reasonable level of consistency. Using the Xavier weight initialization algorithm, introduced in Chapter 4, "Feedforward Neural Networks," will also help provide consistent results.
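The evaluation scheme above reduces to averaging a score over several independent runs. A minimal sketch follows; `train_and_score` stands in for your own train-and-evaluate routine and is purely hypothetical:

```python
def average_score(train_and_score, runs=5):
    """Average the final error over several independent training runs."""
    scores = [train_and_score(run) for run in range(runs)]
    return sum(scores) / len(scores)

# stand-in scorer: pretend each run returns a slightly different final error
print(average_score(lambda run: 0.020 + 0.001 * run))
```

In practice, each call to `train_and_score` would build the architecture under test, train it from fresh random weights, and return its validation error.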
Training Parameters
Training algorithms themselves have parameters that you must tune. We don't consider the parameters related to training as hyper-parameters because they are not evident after a neural network has been trained. You can examine a trained neural network to determine easily what hyper-parameters are present: a simple examination of the network reveals the neuron counts and activation functions in use. However, determining training parameters such as learning rate and momentum from a trained network is not possible. Both training parameters and hyper-parameters greatly affect the error rates that the neural network can obtain; however, we use training parameters only during the actual training.

The three most common training parameters for neural networks are listed here:

Learning Rate
Momentum
Batch Size

Not all learning algorithms have these parameters. Additionally, you can vary the values chosen for these parameters as learning progresses. We discuss these training parameters in the subsequent sections.
Learning Rate
The learning rate determines how far each iteration of training will move the weight values. Some problems are very simple to solve, and a high learning rate will yield a quick solution. Other problems are more difficult, and a quick learning rate might skip past a good solution. Other than the runtime of your program, there is no disadvantage in choosing a small learning rate. Figure 14.1 shows how a learning rate might fare on both a simple (unimodal) and a complex (multimodal) problem:

Figure 14.1: Learning Rates

The above two charts show the relationship between a weight and the score of a network. As the program increases or decreases a single weight, the score changes. A unimodal problem is typically easy to solve because its graph has only one bump, or optimal solution. In this case, we consider a good score to be a low error rate.

A multimodal problem has many bumps, or possible good solutions. If the problem is simple (unimodal), a fast learning rate is optimal because you can charge up the hill to a great score. However, haste makes waste on the second chart, as the fast learning rate fails to find the two optimums.

Kamiyama, Iijima, Taguchi, Mitsui, et al. (1992) stated that most literature uses a learning rate of 0.2 and a momentum of 0.9. This learning rate and momentum can often be good starting points; in fact, many examples do use these values. The researchers suggest that Equation 14.1 has a strong likelihood of attaining better results.
Equation 14.1: Setting Learning Rate and Momentum

ε = K(1 - α)

The variable α (alpha) is the momentum; ε (epsilon) is the learning rate, and K is a constant related to the hidden neurons. Their research suggests that the tuning of momentum (discussed in the next section) and learning rate are related. We define the constant K by the number of hidden neurons: smaller numbers of hidden neurons should use a larger K. In our own experiments, we do not use the equation directly because it is difficult to choose a concrete value of K. The following calculations show several learning rates based on the momentum and K:
k=0.500000, alpha=0.200000 -> epsilon=0.400000
k=0.500000, alpha=0.300000 -> epsilon=0.350000
k=0.500000, alpha=0.400000 -> epsilon=0.300000
k=1.000000, alpha=0.200000 -> epsilon=0.800000
k=1.000000, alpha=0.300000 -> epsilon=0.700000
k=1.000000, alpha=0.400000 -> epsilon=0.600000
k=1.500000, alpha=0.200000 -> epsilon=1.200000
k=1.500000, alpha=0.300000 -> epsilon=1.050000
k=1.500000, alpha=0.400000 -> epsilon=0.900000
The lower values of K represent higher hidden neuron counts; therefore, the hidden neuron count is decreasing as you move down the list. As you can see, for a momentum (α, alpha) of 0.2, the suggested learning rate (ε, epsilon) increases as the hidden neuron count decreases. The learning rate and momentum have an inverse relationship: as you increase one, you should decrease the other. The hidden neuron count controls how quickly the momentum and learning rate should diverge.
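The listed values follow directly from Equation 14.1's relationship between K, momentum, and learning rate; the helper name below is ours:

```python
def suggested_epsilon(k, alpha):
    """Learning rate from momentum per Equation 14.1: epsilon = K * (1 - alpha)."""
    return k * (1 - alpha)

for k in (0.5, 1.0, 1.5):
    for alpha in (0.2, 0.3, 0.4):
        print('k=%f, alpha=%f -> epsilon=%f'
              % (k, alpha, suggested_epsilon(k, alpha)))
```

Running this reproduces the table above, making the inverse relationship between momentum and learning rate easy to see.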
Momentum
Momentum is a training property that causes the weight change to continue in its current direction, even if the gradient indicates that the weight change should reverse direction. Figure 14.2 illustrates this relationship:

Figure 14.2: Momentum and a Local Optimum

A positive gradient encourages the weight to decrease. The weight has followed the negative gradient down the hill and has now settled into a valley, or local optimum. The gradient moves to 0 and then positive as the weight hits the other side of the local optimum. Momentum allows the weight to continue in its current direction, possibly escape the local-optimum valley, and possibly find the lower point to the right.

To understand exactly how learning rate and momentum are implemented, recall Equation 6.6 from Chapter 6, "Backpropagation Training," which is repeated as Equation 14.2 for convenience:
Equation 14.2: Weight and Momentum Applied

Δw(t) = -ε (∂E/∂w) + α Δw(t-1)
This equation shows how we calculate the change in weight for training iteration t. This change is the sum of two terms, governed by the learning rate ε (epsilon) and the momentum α (alpha). The gradient is the partial derivative of the error with respect to the weight. The sign of the gradient determines whether we should increase or decrease the weight. The learning rate simply tells backpropagation the percentage of this gradient that the program should apply to the weight change. The program applies this change to the original weight and then retains the change for the next iteration. The momentum α (alpha) subsequently determines the percentage of the previous iteration's weight change that the program should apply to this iteration. Momentum allows the previous iteration's weight change to carry through to the current iteration so that the weight change maintains its direction. This process essentially gives it "momentum."
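Equation 14.2's update rule amounts to just these two terms; a minimal sketch, with variable names of our own choosing:

```python
def weight_delta(gradient, prev_delta, epsilon=0.2, alpha=0.9):
    """Backpropagation weight change: a learning-rate term plus a momentum term."""
    return -epsilon * gradient + alpha * prev_delta

delta = 0.0
for gradient in (0.5, 0.4, 0.3):  # hypothetical gradients over three iterations
    delta = weight_delta(gradient, delta)
    print(delta)
```

Notice that a positive gradient produces a negative delta, decreasing the weight, while the `alpha * prev_delta` term keeps the change moving in its established direction.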
Jacobs (1988) discovered that the learning rate should be decreased as training progresses. Additionally, as previously discussed, Kamiyama, et al. (1992) asserted that momentum should be increased as the learning rate is decayed. A decreasing learning rate, coupled with an increasing momentum, is a very common pattern in neural network training. The high initial learning rate allows the neural network to explore a larger area of the search space. Decreasing the learning rate forces the network to stop exploring and begin exploiting a more local region of the search space. Increasing the momentum at this point helps guard against local minima in this smaller search region.
Batch Size
The batch size specifies the number of training set elements that you must calculate before the weights are actually updated. The program sums all of the gradients for a single batch before it updates the weights. A batch size of 1 indicates that the weights are updated for each training set element; we refer to this process as online training. Setting the batch size to the size of the training set gives full batch training.

A good starting point is a batch size equal to 10% of the entire training set. You can increase or decrease the batch size to see its effect on training efficiency. Usually a neural network will have vastly fewer weights than training set elements. As a result, cutting the batch size in half, or even to a fourth, will not have a drastic effect on the runtime of an iteration of standard backpropagation.
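Batch updates simply partition the training elements; a short sketch of the idea (the helper name is ours):

```python
def batches(training_set, batch_size):
    """Yield successive batches; gradients are summed within each batch."""
    for i in range(0, len(training_set), batch_size):
        yield training_set[i:i + batch_size]

data = list(range(10))
print([len(b) for b in batches(data, 4)])  # [4, 4, 2]
print([len(b) for b in batches(data, 1)])  # online training: one element per batch
```

With `batch_size=len(data)` the generator yields a single batch, which is full batch training.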
General Hyper-Parameters
In addition to the training parameters just discussed, we must also consider the hyper-parameters. They are significantly more important than the training parameters because they determine the neural network's ultimate learning capacity. A neural network with a reduced learning capacity cannot overcome this deficiency with further training.
Activation Functions
Currently, two primary types of activation functions are used inside a neural network:

Sigmoidal: Logistic (sigmoid) & Hyperbolic Tangent (tanh)
Linear: ReLU

The sigmoidal (s-shaped) activation functions have been a mainstay of neural networks, but they are now losing ground to the ReLU activation function. The two most common s-shaped activation functions are the namesake sigmoid activation function and the hyperbolic tangent activation function. The name can cause confusion because sigmoid refers both to an actual activation function and to a class of activation functions. The actual sigmoid activation function has a range between 0 and 1, whereas the hyperbolic tangent function has a range of -1 to 1. We will first tackle hyperbolic tangent versus sigmoid (the activation function). Figure 14.3 shows an overlay of these two activations:

Figure 14.3: Sigmoid and Tanh

As you can see from the figure, the hyperbolic tangent stretches over a much larger range than the sigmoid. Your choice between these two activations will affect the way that you normalize your data. If you are using hyperbolic tangent at the output layer of your neural network, you must normalize the expected outcome to between -1 and 1. Similarly, if you are using the sigmoid function for the output layer, you must normalize the expected outcome to between 0 and 1. You should normalize the input to the range -1 to 1 for both of these activation functions. Input x-values above +1 saturate toward y-values of +1 for both sigmoid and hyperbolic tangent. As x-values go below -1, the sigmoid activation function saturates toward y-values of 0, and hyperbolic tangent saturates toward y-values of -1.
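The differing negative-side saturation is easy to see numerically; `sigmoid` below is the standard logistic function:

```python
import math

def sigmoid(x):
    """The logistic (sigmoid) activation function, with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, -1.0, 1.0, 4.0):
    # sigmoid heads toward 0 for negative inputs; tanh heads toward -1
    print('x=%5.1f  sigmoid=%7.4f  tanh=%7.4f' % (x, sigmoid(x), math.tanh(x)))
```

At x = -4, the sigmoid is already close to 0 while tanh is close to -1, which illustrates the saturation difference discussed next.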
The saturation of the sigmoid to values of 0 in the negative direction can be problematic for training. As a result, Kalman & Kwasny (1992) recommend hyperbolic tangent in all situations instead of sigmoid, a recommendation that corresponds with many literature sources. However, this argument only extends to the choice between sigmoidal activation functions. A growing body of recent research favors the ReLU activation function in all cases over the sigmoidal activation functions.

Zeiler et al. (2014), Maas, Hannun, Awni & Ng (2013), and Glorot, Bordes & Bengio (2013) all recommend the ReLU activation function over its sigmoidal counterparts. Chapter 9, "Deep Learning," includes the advantages of the ReLU activation function. In this section, we will examine an experiment that compares the ReLU to the sigmoid. We used a neural network with a hidden layer of 1,000 neurons and ran this neural network against the MNIST dataset. Obviously, we adjusted the number of input and output neurons to match the problem. We ran each activation function five times with different random weights and kept the best results:
Sigmoid:
Best valid loss was 0.068866 at epoch 43.
Incorrect 192/10000 (1.92%)

ReLU:
Best valid loss was 0.068229 at epoch 17.
Incorrect 170/10000 (1.7000000000000002%)
The above output shows the accuracy rates for each of the neural networks on a validation set. As you can see, the ReLU activation function did indeed have the lowest error rate and achieved it in fewer training iterations/epochs. Of course, these results will vary, depending on the platform used.
Hidden Neuron Configurations
Hidden neuron configurations have been a frequent source of questions. Neural network programmers often wonder exactly how to structure their networks. As of the writing of this book, a quick scan of Stack Overflow shows over 50 questions related to hidden neuron configurations. You can find the questions at the following link:

http://goo.gl/ruWpcb

Although the answers may vary, most of them simply advise that the programmer "experiment and find out." According to the universal approximation theorem, a single-hidden-layer neural network can theoretically learn any pattern (Hornik, 1991). Consequently, many researchers suggest only single-hidden-layer neural networks. However, although a single-hidden-layer neural network can learn any pattern, the universal approximation theorem does not state that this process is easy for a neural network. Now that we have efficient techniques to train deep neural networks, the universal approximation theorem is much less important.
Toseetheeffectsofhiddenneuronsandneuroncounts,wewillperformanexperimentthatwilllookatone-layerandtwo-layerneuralnetworks.Wewilltryeverycombinationofhiddenneuronsuptotwo50-neuronlayers.ThisneuralnetworkwilluseaReLUactivationfunctionandRPROP.Thisexperimenttookover30hourstorunonanIntelI7quad-core.Figure14.4showsaheatmapoftheresults:
Figure 14.4: Heat Map of Two-Layer Network (first experiment)
The best configuration reported by the experiment was 35 neurons in hidden layer 1 and 15 neurons in hidden layer 2. The results of this experiment will vary when repeated. The above diagram shows the best-trained networks in the lower-left corner, as indicated by the darker squares. This indicates that the best results favor a large first hidden layer with a smaller second hidden layer. The heat map shows the relationships between the different configurations. We achieved better results with smaller neuron counts on the second hidden layer. This occurred because the smaller neuron counts constricted the information flow to the output layer. This approach is consistent with research into auto-encoders, in which successively smaller layers force the neural network to generalize information rather than overfit. In general, based on the experiment here, we advise using at least two hidden layers with successively smaller neuron counts.
LeNet-5 Hyper-Parameters
The LeNet-5 convolutional neural networks introduce additional layer types that bring more choices in the construction of neural networks. Both the convolutional and max-pooling layers create other choices for hyper-parameters. Chapter 10, "Convolutional Neural Networks," contains a complete list of hyper-parameters that the LeNet-5 network introduces. In this section, we will review LeNet-5 architectural recommendations recently suggested in scientific papers.
Most literature on LeNet-5 networks supports the use of a max-pool layer following every convolutional layer. Ideally, several convolutional/max-pool layers reduce the resolution at each step. Chapter 10, "Convolutional Neural Networks," includes this demonstration. However, very recent literature seems to indicate that max-pool layers should not be used at all.
On November 7, 2014, the website Reddit featured Dr. Geoffrey Hinton for an "ask me anything" (AMA) session. Dr. Hinton is the foremost researcher in deep learning and neural networks. During the AMA session, Dr. Hinton was asked about max-pool layers. You can read his complete response here:
https://goo.gl/TgBakL
Overall, Dr. Hinton begins his answer saying, "The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster." He then proceeds with a technical description of why you should never use max-pooling. At the time of this book's publication, his response is fairly recent and controversial. Therefore, we suggest that you try convolutional neural networks both with and without max-pool layers, as their future looks uncertain.
Chapter Summary
Selecting a good set of hyper-parameters is one of the most difficult tasks for the neural network programmer. The number of hidden neurons, activation functions, and layer structures are all examples of neural network hyper-parameters that the programmer must adjust and fine-tune. All these hyper-parameters can affect the overall capacity of the neural network to learn patterns. As a result, you must choose them correctly.
Most current literature suggests using the ReLU activation function in place of the sigmoidal (s-shaped) activation functions. If you are going to use a sigmoidal activation, most literature recommends the hyperbolic tangent activation function instead of the sigmoid. The ReLU activation function is more compatible with deep neural networks.
The number of hidden layers and neurons is also an important hyper-parameter for neural networks. It is generally advisable that each successive hidden layer contain fewer neurons than the layer immediately before it. This adjustment has the effect of constraining the data from the inputs and forcing the neural network to generalize rather than memorize; memorization results in overfitting.
We do not consider training parameters as hyper-parameters because they do not affect the structure of the neural network. However, you still must choose a proper set of training parameters. The learning rate and momentum are two of the most common training parameters for neural networks. Generally, you should initially set the learning rate high and decrease it as training continues. You should move the momentum value inversely with the learning rate.
In this chapter, we examined how to structure neural networks. While we provided some general recommendations, the dataset generally drives the architecture of the neural network. Consequently, you must analyze the dataset. We will introduce the t-SNE dimension reduction algorithm in the next chapter. This algorithm will allow you to visualize your dataset graphically and see issues that occur while you are creating a neural network for that dataset.
Chapter 15: Visualization
Confusion Matrices, PCA, t-SNE
We frequently receive the following question about neural networks: "I've created a neural network, but when I train it, my error never goes to an acceptable level. What should I do?" The first step in this investigation is to determine if one of the following common errors has occurred:
- Correct number of input and output neurons
- Dataset normalized correctly
- Some fatal design decision of the neural network
Obviously, you must have the correct number of input neurons to match how your data are normalized. Likewise, you should have a single output neuron for regression problems and usually one output neuron per class for a classification problem. You should normalize input data to fit the activation function that you use. In a similar way, fatal mistakes, such as no hidden layer or a learning rate of 0, can create a bad situation.
However, once you eliminate all these errors, you must look to your data. For classification problems, your neural network may have difficulties differentiating between certain pairs of classes. To help you resolve this issue, some visualization algorithms exist that allow you to see the problems that your neural network might encounter. The two visualizations presented in this chapter will show the following issues with data:
- Classes that are easily confused for others
- Noisy data
- Dissimilarity between classes
We describe each issue in the subsequent sections and offer some potential solutions. We will present these potential solutions in the form of two algorithms of increasing complexity. Not only is the topic of visualization important for data analysis, it was also chosen as a topic by the readers of this book, which earned its initial funding through a Kickstarter campaign. The project's original 653 backers chose visualization from among several competing project topics. As a result, we will present two visualizations. Both examples will use the MNIST handwritten digits dataset that we have examined in previous chapters of this book.
Confusion Matrix
A neural network trained for the MNIST dataset should be able to take a handwritten digit and predict what digit was actually written. Some digits are more easily confused for others. Any classification neural network has the possibility of misclassifying data. A confusion matrix can measure these misclassifications.
Reading a Confusion Matrix
A confusion matrix is always presented as a square grid. The number of rows and columns will both be equal to the number of classes in your problem. For MNIST, this will be a 10x10 grid, as shown by Figure 15.1:
Figure 15.1: MNIST Confusion Matrix
A confusion matrix uses the columns to represent predictions. The rows represent what would have been a correct prediction. If you look at row 0, column 0, you will see the number 1,432. This result means that the neural network correctly predicted a "0" 1,432 times. If you look at row 3, column 2, you will see that a "2" was predicted 49 times when it should have been a "3." The problem occurred because it's easy to mistake a handwritten "3" for a "2," especially when a person with bad penmanship writes the numbers. The confusion matrix lets you see which digits are commonly mistaken for each other. Another important aspect of the confusion matrix is the diagonal from (0,0) to (9,9). If the program trains the neural network properly, the largest numbers should be in the diagonal. Thus, a perfectly trained neural network will have numbers only in the diagonal.
Generating a Confusion Matrix
You can create a confusion matrix with the following steps:
1. Separate the dataset into training and validation sets.
2. Train a neural network on the training set.
3. Set the confusion matrix to all zeros.
4. Loop over every element in the validation set.
5. For every element, increase the cell at row = expected, column = predicted.
6. Report the confusion matrix.
Listing 15.1 shows this process in the following pseudocode:
Listing 15.1: Compute a Confusion Matrix
# x - contains dataset inputs
# y - contains dataset expected values (ordinals, not strings)
def confusion_matrix(x, y, network):
    # Create a square matrix equal to the number of classifications
    confusion = matrix(network.num_classes, network.num_classes)
    # Loop over every element
    for i from 0 to len(x):
        prediction = network.compute(x[i])
        target = y[i]
        confusion[target][prediction] = confusion[target][prediction] + 1
    # Return the result
    return confusion
Confusion matrices are one of the classic visualizations for classification data problems. You can use them with any classification problem, not just neural networks.
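The pseudocode in Listing 15.1 translates almost directly into runnable NumPy. The following minimal sketch takes precomputed predictions rather than a network object:

```python
import numpy as np

def confusion_matrix(expected, predicted, num_classes):
    # Rows are the expected classes; columns are the predicted classes
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for target, prediction in zip(expected, predicted):
        confusion[target][prediction] += 1
    return confusion

# Toy example with three classes
expected = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
print(confusion_matrix(expected, predicted, 3))
```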
t-SNE Dimension Reduction
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a type of dimensionality reduction algorithm that programmers frequently use for visualization. We will first define dimension reduction and show its advantages for visualization and problem simplification.
The dimensions of a dataset are the number of input (x) values that the program uses to make predictions. The classic iris dataset has four dimensions because we measure the iris flowers in four dimensions. Chapter 4, "Feedforward Networks," has an explanation of the iris dataset. The MNIST digits are images of 28x28 grayscale pixels, which result in a total of 784 input neurons (28x28). As a result, the MNIST dataset has 784 dimensions.
For dimensionality reduction, we need to ask the following question: "Do we really need 784 dimensions, or could we project this dataset into fewer dimensions?" Projections are very common in cartography. Earth exists in at least three dimensions that we can directly observe. The only true three-dimensional map of Earth is a globe. However, globes are inconvenient to store and transport. As long as it still contains the information that we require, a flat (2D) representation of Earth is useful for spaces where a globe will not fit. We can project the globe onto a 2D surface in many ways. Figure 15.2 shows the Lambert projection (from Wikipedia) of Earth:
Figure 15.2: Lambert Projection (cone)
Johann Heinrich Lambert introduced the Lambert projection in 1772. Conceptually, this projection works by placing a cone over some region of the globe and projecting the globe's image onto the cone. Once the cone is unrolled, you have a flat 2D map. Accuracy is better near the tip of the cone and worsens towards the base of the cone. The Lambert projection is not the only way to project the globe and produce a map; Figure 15.3 shows the popular Mercator projection:
Figure 15.3: Mercator Projection (cylinder)
Gerardus Mercator presented the Mercator projection in 1569. This projection works by essentially wrapping a cylinder about the globe at the equator. Accuracy is best at the equator and worsens near the poles. You can see this characteristic by examining the relative size of Greenland in both projections. Along with the two projections just mentioned, many other types exist. Each is designed to show Earth in ways that are useful for different applications.
The projections above are not strictly 2D because they create a type of third dimension with other aspects, such as color. The map projections can convey additional information, such as altitude, ground cover, or even political divisions, with color. Computer projections also utilize color, as we will discover in the next section.
t-SNE as a Visualization
If we can reduce the MNIST digits' 784 dimensions down to two or three with a dimension reduction algorithm, then we can visualize the dataset. Reducing to two dimensions is popular because an article or a book can easily capture the visualization. It is important to remember that a 3D visualization is not actually 3D, as true 3D displays are extremely rare as of the writing of this book. A 3D visualization will be rendered onto a 2D monitor. As a result, it is necessary to "fly" through the space and see how parts of the visualization really appear. This flight through space is very similar to a computer video game where you do not see all aspects of a scene until you fly completely around the object being viewed. Even in the real world, you cannot see both the front and back of an object you are holding; it is necessary to rotate the object with your hands to see all sides.
Karl Pearson invented one of the most common dimensionality reduction algorithms in 1901. Principal component analysis (PCA) creates a number of principal components that match the number of dimensions to be reduced. For a 2D reduction, there would be two principal components. Conceptually, PCA attempts to pack the higher-dimensional items into the principal components in a way that maximizes the amount of variability in the data. By ensuring that the distant values in high-dimensional space remain distant, PCA can complete its function. Figure 15.4 shows a PCA reduction of the MNIST digits to two dimensions:
Figure 15.4: 2D PCA Visualization of MNIST
The first principal component is the x-axis (left and right). As you can see, the plot positions the blue dots (0's) at the far left, and the red dots (1's) are placed towards the right. Handwritten 1's and 0's are the easiest to differentiate; they have the highest variability. The second principal component is the y-axis (up and down). On the top, you have green (2's) and brown (3's), which look somewhat similar. On the bottom are purple (4's), gray (9's) and black (7's), which also look similar. Yet the variability between these two groups is high; it is easier to tell 2's and 3's from 4's, 9's and 7's.
Color is very important to the above image. If you are reading this book in a black-and-white form, this image may not make as much sense. The color represents the actual digit behind each data point. You must note that PCA and t-SNE are both unsupervised; therefore, they do not know the identities of the input vectors. In other words, they don't know which digit was selected. The program adds the colors so that we can see how well PCA grouped the digits. If the above diagram is black and white in your version, you can see that the program did not place the digits into many distinct groups. We can therefore conclude that PCA does not work well as a clustering algorithm.
The above figure is also very noisy because the dots overlap in large regions. The most well-defined region is blue, where the "1" digits reside. You can also see that purple (4), black (7), and gray (9) are easy to confuse. Additionally, brown (3), green (2), and yellow (8) can be misleading.
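To make the PCA reduction concrete, here is a minimal NumPy sketch that projects data onto its top two principal components via an eigen-decomposition of the covariance matrix. The random input is only a stand-in for the MNIST rows:

```python
import numpy as np

def pca_2d(x):
    # Center the data, then project onto the two directions (principal
    # components) that capture the most variance
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    components = eigvecs[:, -2:][:, ::-1]    # top two components, largest first
    return centered @ components

rng = np.random.RandomState(42)
x = rng.rand(200, 10)      # random stand-in for 784-dimensional MNIST rows
reduced = pca_2d(x)
print(reduced.shape)       # (200, 2)
```

Production code would normally use a library implementation such as scikit-learn's PCA, but the eigen-decomposition above is the core of the technique.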
PCA analyzes the pair-wise distances of all data points and preserves large distances. As previously stated, if two points are distant in PCA, they will remain distant. However, we have to question the importance of distance. Consider Figure 15.5, which shows two points that appear to be somewhat close:
Figure 15.5: Apparent Closeness on a Spiral
The points in question are the two red, solid points that are connected by a line. The two points, when connected by a straight line, are somewhat close. However, if the program follows the pattern in the data, the points are actually far apart, as indicated by the solid spiral line that follows all of the points. PCA would attempt to keep these two points close, as they appear in Figure 15.5. The t-SNE algorithm, invented by van der Maaten & Hinton (2008), works somewhat differently. Figure 15.6 shows the t-SNE visualization for the same dataset as featured for PCA:
Figure 15.6: 2D t-SNE Visualization of MNIST
The t-SNE for the MNIST digits shows a much clearer visual for the different digits. Again, the program adds color to indicate where the digits landed. However, even in black and white, you would see some divisions between clusters. Digits located nearer to each other share similarities. The amount of noise is reduced greatly, but you can still see some red dots (0's) sprinkled in the yellow cluster (8's) and cyan cluster (6's), as well as other clusters. You can produce a visualization for a Kaggle dataset using the t-SNE algorithm. We will examine this process in Chapter 16, "Modeling with Neural Networks."
Implementations of t-SNE exist for most modern programming languages. Laurens van der Maaten's homepage contains a list at the following URL:
http://lvdmaaten.github.io/tsne/
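For example, scikit-learn ships a t-SNE implementation. A minimal sketch, using random data as a stand-in for a real dataset:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(42)
x = rng.rand(100, 50)    # random stand-in for a high-dimensional dataset

# Reduce to two dimensions for a scatter plot; perplexity must be
# smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
x2d = tsne.fit_transform(x)
print(x2d.shape)  # (100, 2)
```

On real data, you would scatter-plot `x2d` and color each point by its known class label, as in the MNIST figures above.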
t-SNE Beyond Visualization
Although t-SNE is primarily an algorithm for reducing dimensions for visualization, feature engineering also utilizes it. The algorithm can even serve as a model component. Feature engineering occurs when you create additional input features. A very simple example of feature engineering is when you consider health insurance applicants, and you create an additional feature called BMI, based on the features weight and height, as seen in Equation 15.1:
Equation 15.1: BMI Calculation
BMI is simply a calculated field that allows humans to combine height and weight to determine how healthy someone is. Such features can sometimes help neural networks as well. You can build some additional features with a data point's location in either 2D or 3D space.
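The BMI feature can be engineered with one line of arithmetic: weight in kilograms divided by the square of height in meters. A minimal sketch with made-up applicant values:

```python
def bmi(weight_kg, height_m):
    # Body mass index: weight divided by the square of height
    return weight_kg / (height_m ** 2)

# Append the engineered BMI column to two made-up applicant records
applicants = [(70.0, 1.75), (90.0, 1.80)]
engineered = [(w, h, round(bmi(w, h), 2)) for w, h in applicants]
print(engineered[0])  # (70.0, 1.75, 22.86)
```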
In Chapter 16, "Modeling with Neural Networks," we will discuss building neural networks for the Otto Group Kaggle challenge. Several Kaggle top-ten solutions for this competition used features that were engineered with t-SNE. For this challenge, you had to organize data points into nine classes. The distance between an item and the nearest neighbor of each of the nine classes on a 3D t-SNE projection was a beneficial feature. To calculate this feature, we simply map the entire training set into t-SNE space and obtain the 3D t-SNE coordinates for each data point. Then we generate nine features with the Euclidean distance between the current data point and its nearest neighbor of each of these nine classes. Finally, the program adds these nine fields to the 93 fields already being presented to the neural network.
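The nine distance features just described can be sketched in NumPy. This is an illustration, not our competition code; note that a point's own class will report a distance of zero unless the point excludes itself from the search:

```python
import numpy as np

def nearest_class_distances(coords, labels, query, num_classes):
    # Euclidean distance from `query` to every point, then the minimum
    # distance within each class
    dists = np.linalg.norm(coords - query, axis=1)
    return np.array([dists[labels == c].min() for c in range(num_classes)])

rng = np.random.RandomState(0)
coords = rng.rand(90, 3)                 # stand-in for 3D t-SNE coordinates
labels = np.repeat(np.arange(9), 10)     # nine classes, as in the Otto data
extra = nearest_class_distances(coords, labels, coords[0], 9)
print(extra.shape)  # (9,) -- nine new engineered features for this data point
```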
As a visualization or as part of the input to another model, the t-SNE algorithm provides a great deal of information to the program. The programmer can use this information to see how the data are structured, and the model gains more details on the structure of the data. Most implementations of t-SNE also contain adaptations for large datasets or for very high dimensions. Before you construct a neural network to analyze data, you should consider the t-SNE visualization. After you train the neural network, you can use the confusion matrix to analyze its results.
Chapter Summary
Visualization is an important part of neural network programming. Each dataset presents unique challenges to a machine learning algorithm or a neural network. Visualization can expose these challenges, allowing you to design your approach to account for known issues in the dataset. We demonstrated two visualization techniques in this chapter.
The confusion matrix is a very common visualization for machine learning classification. It is always a square matrix with rows and columns equal to the number of classes in the problem. The rows represent the expected values, and the columns represent the values that the neural network actually classified. The diagonal, where the row and column numbers are equal, represents the number of times the neural network correctly classified that particular class. A well-trained neural network will have the largest numbers along the diagonal. The other cells count the number of times a misclassification occurred between each expected class and actual value.
Although you usually run the confusion matrices after the program generates a neural network, you can run the dimension reduction visualizations beforehand to expose some challenges that might be present in your dataset. You can reduce the dimensions of your dataset to 2D or 3D with the t-SNE algorithm. However, it becomes less effective in dimensions higher than 3D. With the 2D dimension reduction, you can create informative scatter plots that will show the relationship between several classes.
In the next chapter, we will present a Kaggle challenge as a way to synthesize many of the topics previously discussed. We will use the t-SNE visualization as an initial analysis of the dataset. Additionally, we will decrease the neural network's tendency to overfit with the use of dropout layers.
Chapter 16: Modeling with Neural Networks
Data Science, Kaggle, Ensemble Learning
In this chapter, we present a capstone project on modeling, a business-oriented approach for artificial intelligence, and some aspects of data science. Drew Conway (2013), a leading data scientist, characterizes data science as the intersection of hacking skills, math and statistics knowledge, and substantive expertise. Figure 16.1 depicts this definition:
Figure 16.1: Conway's Data Science Venn Diagram
Hacking skills are essentially a subset of computer programming. Although the data scientist does not necessarily need the infrastructure knowledge of an information technology (IT) professional, these technical skills will permit him or her to create short, effective programs for processing data. In the field of data science, we refer to information processing as data wrangling.
Math and statistics knowledge covers statistics, probability, and other inferential methods. Substantive knowledge describes the business knowledge as well as the comprehension of actual data. If you combine only two of these topics, you don't have all the components for data science, as Figure 16.1 illustrates. In other words, the combination of statistics and substantive expertise is simply traditional research. Those two skills alone don't encompass the capabilities, such as machine learning, required for data science.
This book series deals with hacking skills and math and statistics knowledge, two of the circles in Figure 16.1. Additionally, it teaches you to create your own models, which is more pertinent to the field of computer science than data science. Substantive expertise is more difficult to obtain because it is dependent on the industry that utilizes the data science applications. For example, if you want to apply data science in the insurance industry, substantive knowledge refers to the actual business operations of these companies.
To provide a data science capstone project, we will use the Kaggle Otto Group Product Classification Challenge. Kaggle is a platform for competitive data science. You can find the Otto Group Product Classification Challenge at the following URL:
https://www.kaggle.com/c/otto-group-product-classification-challenge
The Otto Group challenge was the first (and currently only) non-tutorial Kaggle competition in which we've competed. After obtaining a top 10% finish, we achieved one of the criteria for the Kaggle Master designation. To become a Kaggle Master, one must place in the top 10 of a competition once and in the top 10% of two other competitions. Figure 16.2 shows the results of our competition entry on the leaderboard:
Figure 16.2: Results in the Otto Group Product Classification Challenge
The above line shows several pieces of information:
- We were in position 331 of 3,514 (9.4%).
- We dropped three spots in the final day.
- Our multi-class log loss score was 0.42881.
- We made 52 submissions, up to May 18, 2015.
We will briefly describe the Otto Group Product Classification Challenge. For a complete description, refer to the Kaggle challenge website (found above). The Otto Group, the world's largest mail order company and currently one of the biggest e-commerce companies, introduced this challenge. Because the group sells many products in numerous countries, they wanted to classify these products into nine categories using 93 features (columns). These 93 columns represented counts and were often 0.
The data were completely redacted (hidden). The competitors did not know the nine categories, nor did they know the meaning behind the 93 features. They knew only that the features were integer counts. Like most Kaggle competitions, this challenge provided the competitors with a test and a training dataset. For the training dataset, the competitors received the outcomes, or correct answers. For the test set, they got only the 93 features, and they had to provide the outcome.
The competition divided the test and training sets in the following way:
- Test data: 144K rows
- Training data: 61K rows
During the competition, participants did not submit their actual models to Kaggle. Instead, they submitted their model's predictions based on the test data. As a result, they could have used any platform to make these predictions. For this competition there were nine categories, so the competitors submitted a nine-number vector that held the probability of each of these nine categories being the correct answer.
The answer in the vector that held the highest probability was the chosen class. As you can observe, this competition was not like a multiple-choice test in school where students must submit their answer as A, B, C, or D. Instead, Kaggle competitors had to submit their answers in the following way:
- A: 80% probability
- B: 16% probability
- C: 2% probability
- D: 2% probability
College exams would not be so horrendous if students could submit answers like those in the Kaggle competition. In many multiple-choice tests, students have confidence about two of the answers and eliminate the remaining two. The Kaggle-like multiple-choice test would allow students to assign a probability to each answer, and they could achieve a partial score. In the above example, if A were the correct answer, students would earn 80% of the points.
Nevertheless, the actual Kaggle score is slightly more complex. The program grades the answers with a logarithm-based scale, and participants face heavy penalties if they have a lower probability on the correct answer. You can see the Kaggle format from the following CSV file submission:
1,0.0003,0.2132,0.2340,0.5468,6.2998e-05,0.0001,0.0050,0.0001,4.3826e-05
2,0.0011,0.0029,0.0010,0.0003,0.0001,0.5207,0.0013,0.4711,0.0011
3,3.2977e-06,4.1419e-06,7.4524e-06,2.6550e-06,5.0014e-07,0.9998,5.2621e-06,0.0001,6.6447e-06
4,0.0001,0.6786,0.3162,0.0039,3.3378e-05,4.1196e-05,0.0001,0.0001,0.0006
5,0.1403,0.0002,0.0002,6.734e-05,0.0001,0.0027,0.0009,0.0297,0.8255
As you can see, each line starts with a number that specifies the data item that is being answered. The sample above shows the answers for items one through five. The next nine values are the probabilities for each of the product classes. These probabilities must add up to 1.0 (100%).
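The logarithm-based scale described above is multi-class log loss. A minimal sketch that also rescales each row to sum to 1.0, as Kaggle does:

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    # Rescale each row so its probabilities sum to 1.0, as Kaggle does
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip away exact 0/1, then average the negative log of the
    # probability assigned to the correct class; putting a low
    # probability on the truth is penalized heavily
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 1])                 # correct classes for two items
probs = np.array([[0.8, 0.1, 0.1],        # confident and correct
                  [0.2, 0.7, 0.1]])       # fairly confident and correct
print(round(multiclass_log_loss(y_true, probs), 4))  # 0.2899
```

A perfect submission would score 0; the penalty grows rapidly as the probability on the correct class shrinks toward zero.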
Lessons from the Challenge
Having success in Kaggle requires you to understand the following topics and the corresponding tools:
- Deep learning, using H2O and Lasagne
- Gradient boosting machines (GBM), using XGBoost
- Ensemble learning, using NumPy
- Feature engineering, using NumPy and scikit-learn
We also learned the following lessons:
- A GPU is really important for deep learning. It is best to use a deep learning package that supports it, such as H2O, Theano or Lasagne.
- The t-SNE visualization is awesome for high-dimension visualization and creating features.
- Ensembling is very important.
For our submission, we used Python with scikit-learn. However, you can use any language capable of generating a CSV file. Kaggle does not actually run your code; they score a submission file. The two most commonly used programming languages for Kaggle are R and Python. Both of these languages have strong data science frameworks available for them. R is actually a domain-specific language (DSL) for statistical analysis.
During this challenge, we learned the most about GBM parameter tuning and ensemble learning. GBMs have quite a few hyper-parameters to tune, and we became proficient at tuning a GBM. The individual scores for our GBMs were in line with those of the top 10% of the teams. However, the solution in this chapter will use only deep learning. GBM is beyond the scope of this book. In a future volume or edition of this series, we plan to examine GBM.
Although computer programmers and data scientists might typically utilize a single model like neural networks, participants in Kaggle need to use multiple models to be successful in the competition. These ensembled models produce better results than each of the models could generate independently.
We worked with t-SNE, examined in Chapter 15, "Visualization," for the first time in this competition. This model works like principal component analysis (PCA) in that it is capable of reducing dimensions. However, the data points separate in such a way that the visualization is often clearer than PCA. The program achieves the clear visualization by using a stochastic nearest neighbor process. Figure 16.3 shows the data from the Otto Group Product Classification Challenge visualized in t-SNE:
Figure 16.3: Challenge t-SNE
The Winning Approach to the Challenge
Kaggle is very competitive. Our primary objective as we entered the challenge was to learn. However, we also hoped to rank in the top 10% by the end in order to reach one of the steps in becoming a Kaggle Master. Earning a top 10% was difficult; in the last few weeks of the challenge, other competitors knocked us out of the bracket almost daily. The last three days were especially turbulent. Before we reveal our solution, we will show you the winning one. The following description is based on the information publicly posted about the winning solution.
The winners of the Otto Group Product Classification Challenge were Gilberto Titericz & Stanislav Semenov. They competed as a team and used a three-level ensemble, as seen in Figure 16.4:
Figure 16.4: Challenge Winning Ensemble
We will provide only a high-level overview of their approach. You can find the full description at the following URL:
https://goo.gl/fZrJA0
The winning approach employed both the R and Python programming languages. Level 1 used a total of 33 different models. Each of these 33 models provided its output to three models in level 2. Additionally, the program generated eight calculated features. An engineered feature is one that is calculated based on the others. A simple example of an engineered feature might be body mass index (BMI), which is calculated based on an individual's height and weight. The BMI value provides insights that height and weight alone might not.
The second level combined the following three model types:
- XGBoost (gradient boosting)
- Lasagne neural network (deep learning)
- AdaBoost extra trees
These three used the output of the 33 models and the eight features as input. The output from these three models was the same nine-number probability vector previously discussed. It was as if each model were being used independently, thereby producing a nine-number vector that would have been suitable as an answer submission to Kaggle. The program averaged together these output vectors with the third layer, which was simply a weighting. As you can see, the winners of the challenge used a large and complex ensemble. Most of the winning solutions in Kaggle followed a similar pattern.
A complete discussion on exactly how they constructed this model is beyond the scope of this book. Quite honestly, such a discussion is also beyond our own current knowledge of ensemble learning. Although these complex ensembles are very effective for Kaggle, they are not always necessary for general data science purposes. These types of models are the blackest of black boxes. It is impossible to explain the reasons behind the model's predictions.
However, learning about these complex models is fascinating for research, and future volumes of this series will likely include more information about these structures.
Our Approach to the Challenge
So far, we've worked only with single-model systems. Some of these models contain ensembles that are "built in," such as random forests and gradient boosting machines (GBM). However, it is possible to create higher-level ensembles of these models. We used a total of 20 models, which included ten deep neural networks and ten gradient boosting machines. Our deep neural network system provided one prediction, and the gradient boosting machines provided the other. The program blended these two predictions with a simple ratio. Then we normalized the resulting prediction vector so that the sum equaled 1.0 (100%). Figure 16.5 shows the ensemble model:
Figure 16.5: Our Challenge Group Entry
You can find our entry, written in Python, at the following URL:
https://github.com/jeffheaton/kaggle-otto-group
Modeling with Deep Learning
To stay within the scope of this book, we will present a solution to the Kaggle competition based on our entry. Because gradient boosting machines (GBM) are beyond the subject matter of this book, we will focus on using a deep neural network. To introduce ensemble learning, we will use bagging to combine ten trained neural networks together. Ensemble methods, such as bagging, will usually cause the aggregate of ten neural networks to score better than a single network. If you would like to use gradient boosting machines and replicate our solution, see the link provided above for the source code.
Neural Network Structure
For this neural network, we used a deep learning structure composed of dense layers and dropout layers. Because this structure was not an image network, we did not use convolutional layers or max-pool layers. These layer types require that input neurons in close proximity have some relevance to each other. However, the 93 input values that comprise the dataset might not have any such relationship. Figure 16.6 shows the structure of the deep neural network:
Figure 16.6: Deep Neural Network for the Challenge
As you can see, the input layer of the neural network had 93 neurons that corresponded to the 93 input columns in the dataset. Three hidden layers had 256, 128 and 64 neurons each. Additionally, the two dropout layers had 256 and 128 neurons, respectively, and a dropout probability of 20%. The output was a softmax layer that classified the nine output groups. We normalized the input data to the neural network by taking their z-scores.
Our strategy was to use two dropout layers tucked between three dense layers. We chose a power of 2 for the first dense layer. In this case, we used 2 to the power of 8 (256). Then we divided by 2 to obtain each of the next two dense layers. This process resulted in 256, 128 and then 64. The pattern of using a power of 2 for the first layer and two more dense layers, each dividing by 2, worked well. As the experiments continued, we tried other powers of 2 in the first dense layer.
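The z-score normalization mentioned above rescales every input column to a mean of 0 and a standard deviation of 1. A minimal NumPy sketch, with random data standing in for the 93 count columns:

```python
import numpy as np

def zscore(x):
    # Rescale each column to mean 0 and standard deviation 1
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.RandomState(1)
data = rng.rand(100, 93) * 50    # random stand-in for the 93 count columns
normalized = zscore(data)
print(normalized.shape)  # (100, 93)
```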
We trained the network with stochastic gradient descent (SGD). The program divided the training data into a validation set and a training set. The SGD training used only the training dataset, but it monitored the validation set's error. We trained until our validation set's error did not improve for 200 iterations. At this point, the training stopped, and the program selected the best-trained neural network over those 200 iterations. We refer to this process as early stopping, and it helps to prevent overfitting. When a neural network is no longer improving the score on the validation set, overfitting is likely occurring.
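The early-stopping loop just described can be sketched as follows; `train_epoch` and `validation_error` are hypothetical callables standing in for whatever training framework is used:

```python
import copy

def train_with_early_stopping(network, train_epoch, validation_error, patience=200):
    # Stop once the validation error has not improved for `patience`
    # consecutive epochs, then return the best network seen so far
    best_error = float("inf")
    best_network = None
    stalled = 0
    while stalled < patience:
        train_epoch(network)                # one pass over the training data
        error = validation_error(network)   # error on the held-out set
        if error < best_error:
            best_error = error
            best_network = copy.deepcopy(network)
            stalled = 0
        else:
            stalled += 1
    return best_network, best_error

# Toy demonstration: the "validation error" improves for five epochs, then stalls
errors = iter([0.9, 0.7, 0.6, 0.55, 0.5] + [0.6] * 10)
best, err = train_with_early_stopping(
    {}, lambda net: None, lambda net: next(errors), patience=3)
print(err)  # 0.5
```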
Running the neural network produces the following output:
Input (None, 93) produces 93 outputs
dense0 (None, 256) produces 256 outputs
dropout0 (None, 256) produces 256 outputs
dense1 (None, 128) produces 128 outputs
dropout1 (None, 128) produces 128 outputs
dense2 (None, 64) produces 64 outputs
output (None, 9) produces 9 outputs
epoch  train loss  valid loss  train/val  valid acc
---------------------------------------------------
    1     1.07019     0.71004    1.50723    0.73697
    2     0.78002     0.66415    1.17447    0.74626
    3     0.72560     0.64177    1.13061    0.75000
    4     0.70295     0.62789    1.11955    0.75353
    5     0.67780     0.61759    1.09750    0.75724
...
  410     0.40410     0.50785    0.79572    0.80963
  411     0.40876     0.50930    0.80260    0.80645
Early stopping.
Best valid loss was 0.495116 at epoch 211.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 1, score: 0.49511558950601003, current mlog: 0.379456064667434, bagged mlog: 0.379456064667434
Early stopping.
Best valid loss was 0.502459 at epoch 221.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 2, score: 0.5024587499599558, current mlog: 0.38050303230483773, bagged mlog: 0.3720715012362133
epoch  train loss  valid loss  train/val  valid acc
---------------------------------------------------
    1     1.07071     0.70542    1.51785    0.73658
    2     0.77458     0.66499    1.16479    0.74670
...
  370     0.41459     0.50696    0.81779    0.80760
  371     0.40849     0.50873    0.80296    0.80642
  372     0.41383     0.50855    0.81376    0.80787
Early stopping.
Best valid loss was 0.500154 at epoch 172.
Wrote submission to file las-submit.csv.
Wrote submission to file las-val.csv.
Bagged LAS model: 3, score: 0.5001535314594113, current mlog: 0.3872396776865103, bagged mlog: 0.3721509601621992
...
Bagged LAS model: 4, score: 0.4984386022067697, current mlog: 0.39710688423724777, bagged mlog: 0.37481605169768967
...
In general, the neural network gradually decreases its training and validation error. If you run this example, you might see different output, based on the programming language from which the example originates. The above output is from Python and the Lasagne/NoLearn frameworks.
It is important to understand why there is both a validation error and a training error. Most neural network training algorithms will separate the training set into a training and a validation set; this split might be 80% for training and 20% for validation. The neural network uses the 80% to train, and it reports that error as the training error. You can also use the validation set to generate an error, which is the validation error. Because it represents the error on data that the neural network was not trained with, the validation error is the most important measure. As the neural network trains, the training error will continue to drop even if the neural network is overfitting. However, once the validation error stops dropping, the neural network is probably beginning to overfit.
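A minimal sketch of the 80/20 split described above, assuming the data set fits in a Python list; the `split_train_validation` helper and its fixed seed are illustrative choices, not the book's code.

```python
import random

def split_train_validation(data, train_fraction=0.8, seed=42):
    """Shuffle a data set, then split it into training and validation parts."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(100))
train, valid = split_train_validation(rows)
print(len(train), len(valid))  # 80 20
```

Shuffling before splitting matters: if the data set is sorted by class, an unshuffled split would put some classes entirely into the validation set.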
Bagging Multiple Neural Networks

Bagging is a simple yet effective method to ensemble multiple models together. The example program for this chapter trains five neural networks independently. Each neural network produces its own set of nine probabilities that correspond to the nine classes provided by Kaggle. Bagging simply takes the average of the probabilities that the networks produce for each of these nine classes. Listing 16.1 provides the pseudocode to perform the bagging:
Listing 16.1: Bagging Neural Networks
# final_results is a matrix with rows equal to the rows in the training set.
# Columns = number of outcomes (1 for regression, or the class count for
# classification).
final_results = [][]
for i from 1 to 5:
    network = train_neural_network()
    results = evaluate_network(network)
    final_results = final_results + results
# Take the average
final_results = final_results / 5
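A concrete version of the averaging step in Listing 16.1 can be written with NumPy. The `bag_predictions` helper and the toy probability matrices below are illustrative; in the real program, each matrix would hold one trained network's predictions on the test set.

```python
import numpy as np

def bag_predictions(prediction_sets):
    """Average the class-probability matrices produced by several models.
    Each matrix has one row per test item and one column per class."""
    return np.mean(np.stack(prediction_sets), axis=0)

# Two toy models predicting 3 classes for 2 test items.
model_a = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]])
print(bag_predictions([model_a, model_b]))
# [[0.7 0.2 0.1]
#  [0.3 0.4 0.3]]
```

Because every input row already sums to 1.0, the averaged rows do as well.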
We performed the bagging on the test data set provided by Kaggle. Although the test set provided the 93 columns, it did not tell us the classes that it supplied. We had to produce a file that contained the ID of the item for which we were answering and then the nine probabilities. On each row, the probabilities should sum to 1.0 (100%). If we submitted a file that did not sum to 1.0, Kaggle would have scaled our values so that they did sum to 1.0.
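Rescaling each row so that its probabilities sum to 1.0 (the same adjustment Kaggle would apply) is a one-line NumPy operation. The `normalize_rows` helper and the toy values below are illustrative:

```python
import numpy as np

def normalize_rows(probs):
    """Rescale each row so its probabilities sum to 1.0."""
    return probs / probs.sum(axis=1, keepdims=True)

raw = np.array([[0.2, 0.2, 0.1],   # sums to 0.5
                [0.9, 0.6, 0.0]])  # sums to 1.5
print(normalize_rows(raw))
# [[0.4 0.4 0.2]
#  [0.6 0.4 0. ]]
```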
To see the effects of bagging, we submitted two test files to Kaggle. The first test file came from the first neural network that we trained. The second test file was the bagged average of all five. The results were as follows:

Best Single Network: 0.3794
Five Bagged Networks: 0.3717

As you can see, the bagged networks achieved a better score than a single neural network. The complete results are shown here:
Bagged LAS model: 1, score: 0.4951, current mlog: 0.3794, bagged mlog: 0.3794
Bagged LAS model: 2, score: 0.5024, current mlog: 0.3805, bagged mlog: 0.3720
Bagged LAS model: 3, score: 0.5001, current mlog: 0.3872, bagged mlog: 0.3721
Bagged LAS model: 4, score: 0.4984, current mlog: 0.3971, bagged mlog: 0.3748
Bagged LAS model: 5, score: 0.4979, current mlog: 0.3869, bagged mlog: 0.3717
As you can see, the first neural network had a multi-class log loss (mlog) error of 0.3794. The mlog measure was discussed in Chapter 5, "Training & Evaluation." The bagged score was the same because we had only one network. The amazing part happens when we bagged the second network with the first. The current scores of the first two networks were 0.3794 and 0.3805. However, when we bagged them together, we obtained 0.3720, which was lower than either network's individual score. Averaging the predictions of these two networks produced a result that was better than both. Ultimately, we settled on a bagged score of 0.3717, which was better than any of the previous single-network (current) scores.
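This effect is easy to reproduce with toy numbers: when two models err on different rows, the multi-class log loss of their averaged predictions can fall below either model's individual loss. The values below are invented purely for illustration:

```python
import numpy as np

def mlog_loss(probs, true_classes):
    """Multi-class logarithmic loss: mean of -log(p) for the true class."""
    rows = np.arange(len(true_classes))
    return -np.mean(np.log(probs[rows, true_classes]))

true_classes = np.array([0, 1])
# Model A is confident on row 0 but badly wrong on row 1; B is the reverse.
model_a = np.array([[0.9, 0.1], [0.8, 0.2]])
model_b = np.array([[0.2, 0.8], [0.1, 0.9]])
bagged = (model_a + model_b) / 2.0

print(mlog_loss(model_a, true_classes))  # ~0.857
print(mlog_loss(model_b, true_classes))  # ~0.857
print(mlog_loss(bagged, true_classes))   # ~0.598, better than either model
```

Each model's confident mistake is diluted by the other model's correct answer, so the averaged predictions score better than both.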
Chapter Summary

In the final chapter of this book, we showed how to apply deep learning to a real-world problem. We trained a deep neural network to produce a submission file for the Kaggle Otto Group Product Classification Challenge. We used dense and dropout layers to create this neural network.
We can utilize ensembles to combine several models into one. Usually, the resulting ensemble model will achieve better scores than any of the individual models that form it. We also examined how to bag five neural networks together and generate a Kaggle submission CSV.

After analyzing neural networks and deep learning in this final chapter, as well as the previous chapters, we hope that you have learned new and useful information. If you have any comments about this volume, we would love to hear from you. In the future, we plan to create additional editions of the volumes to include more technologies. Therefore, we would be interested in discovering your preferences on the technologies that you would like us to explore in future editions. You can contact us through the following website:
http://www.jeffheaton.com
Appendix A: Examples
Artificial Intelligence for Humans

These examples are part of a series of books that is currently under development. Check the website to see which volumes have been completed and are available:
http://www.heatonresearch.com/aifh
The following volumes are planned for this series:

Volume 0: Introduction to the Math of AI
Volume 1: Fundamental Algorithms
Volume 2: Nature-Inspired Algorithms
Volume 3: Deep Learning and Neural Networks
In this appendix, we describe how to obtain the Artificial Intelligence for Humans (AIFH) book series examples.

Latest Versions

The examples are probably the most dynamic part of the book. Computer languages are always changing and adding new versions. We will update the examples as it becomes necessary, fixing bugs and making corrections. As a result, make sure that you are always using the latest version of the book examples.

Because this area is so dynamic, this file may become outdated. You can always find the latest version at the following location:
https://github.com/jeffheaton/aifh
Obtaining the Examples

We provide the book's examples in many programming languages. Core example packs exist for Java, C#, C/C++, Python, and R for most volumes. Volume 3, as of publication, includes Java, C#, and Python; other languages, such as R and C/C++, are planned. We, or the community, may have added other languages since publication. You can find all examples at the GitHub repository:
https://github.com/jeffheaton/aifh
You have your choice of two different ways to download the examples.

Download ZIP File

GitHub provides an icon that allows you to download a single ZIP file containing all of the example code for the series. Because one ZIP holds every volume's examples, we update its contents frequently. If you are starting a new volume, it is important to verify that you have the latest copy. You can perform the download from the following URL:
https://github.com/jeffheaton/aifh
You can see the download link in Figure A.1:

Figure A.1: GitHub

Clone the Git Repository

You can obtain all of the examples with the source control program git, if it is installed on your system. (Cloning simply refers to the process of copying the example files.) The following command clones the examples to your computer:
git clone https://github.com/jeffheaton/aifh.git
You can also pull the latest updates with the following command:

git pull

If you would like an introduction to git, refer to the following URL:
http://git-scm.com/docs/gittutorial
Example Contents

The entire Artificial Intelligence for Humans series is contained in a single ZIP file download.

Once you open the examples file, you will see the contents shown in Figure A.2:
Figure A.2: Examples Download
The license file describes the license for the book examples. All of the examples for this series are released under the Apache v2.0 license, a free and open-source software (FOSS) license. We do retain a copyright to the files; however, you can freely reuse them in both commercial and non-commercial projects without further permission.

Although the book source code is provided free, the book text is not. These books are commercial products that we sell through a variety of channels. Consequently, you may not redistribute the actual books. This restriction includes the PDF, MOBI, EPUB, and any other format of the book. However, we provide all books in DRM-free form. We appreciate your support of this policy because it contributes to the future growth of these books.

The download also includes a README file. README.md is a "markdown" file that contains images and formatting. You can read this file either as a standard text file or in a markdown viewer; the GitHub browser automatically formats MD files. For more information on MD files, refer to the following URL:
https://help.github.com/articles/github-flavored-markdown
You will find a README file in many folders of the book's examples. The README file in the examples root (seen above) has information about the book series.

You will also notice the individual volume folders in the download. These are named vol1, vol2, vol3, etc. You may not see all of the volumes in the download because some have not yet been written. All of the volumes have the same format. For example, if you open Volume 3, you will see the contents listed in Figure A.3. Other volumes will have a similar layout, depending on the languages that have been added.

Figure A.3: Inside Volume 3 (other volumes have the same structure)

Again, you see the README file that contains information unique to this particular volume. The most important information in the volume-level README files is the current status of the examples. The community often contributes example packs, so some of the example packs may not be complete. The README for the volume will let you know this important information. The volume README also contains the FAQ for that volume.

You should also see a file named "aifh_vol3.RMD". This file contains the R Markdown source code that we used to create many of the charts in the book; we produced nearly all of the graphs and charts with the R programming language. The file ultimately allows you to see the equations behind the pictures. Nevertheless, we do not translate this file to other programming languages; we utilize R simply for the production of the book. If we had used another language, like Python, to produce some of the charts, you would see a "charts.py" file along with the R code.

Additionally, the volume currently has examples for C#, Java, and Python. However, we may add other languages, so always check the README file for the latest information on language translations.
Figure A.4 shows the contents of a typical language pack:

Figure A.4: The Java Language Pack

Pay attention to the README files. The README file in a language folder is important because it contains information specific to that language's examples. If you have difficulty using the book's examples with a particular language, the README file should be your first step toward solving the problem. The other files in the above image are all unique to Java; the README file describes them in much greater detail.
Contributing to the Project

If you would like to translate the examples to a new language, or if you have found an error in the book, you can help. Fork the project and push a commit revision to GitHub. We will credit you among the growing number of contributors.

The process begins with a fork. You create an account on GitHub and fork the AIFH project. This step creates a new project that has a copy of the AIFH files. You will then clone your new project through GitHub. Once you make your changes, you submit a "pull request." When we receive this request, we will evaluate your changes or additions and merge them with the main project.

You can find a more detailed article on contributing through GitHub at this URL:
https://help.github.com/articles/fork-a-repo