
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

On-line Resources
• http://neuralnetworksanddeeplearning.com/index.html - online book by Michael Nielsen
• http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo - online demo of convolutions
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html - demo of a CNN
• http://scs.ryerson.ca/~aharley/vis/conv/ - 3D visualization
• http://cs231n.github.io/ - Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition
• http://www.deeplearningbook.org/ - MIT Press book from Bengio et al., free online version

A history of neural networks
• 1940s-60's:
  – McCulloch & Pitts; Hebb: modeling real neurons
  – Rosenblatt, Widrow-Hoff: perceptrons
  – 1969: Minsky & Papert's Perceptrons book showed formal limitations of one-layer linear networks
• 1970's - mid-1980's: …
• mid-1980's - mid-1990's:
  – backprop and multi-layer networks
  – Rumelhart and McClelland's PDP book set
  – Sejnowski's NETtalk, BP-based text-to-speech
  – Neural Info Processing Systems (NIPS) conference starts
• Mid-1990's - early 2000's: …
• Mid-2000's to current:
  – More and more interest and experimental success


Multilayer networks
• Simplest case: the classifier is a multilayer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• The output of one unit can be the input of another

[Figure: a small network with an input layer (x1, x2, and a constant 1 for the bias), a hidden layer computing v1 = σ(wᵀx) and v2 = σ(wᵀx), and an output layer computing z1 = σ(wᵀv); the edges carry weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2 and w1, w2.]
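To make the figure concrete, here is a minimal NumPy sketch of the forward pass through that small network (the weight values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one example: the leading 1 is the constant bias input, then x1, x2
x = np.array([1.0, 0.5, -0.2])
# hidden-layer weights: column j holds [w0,j, w1,j, w2,j] for hidden unit vj (made-up values)
W_hidden = np.array([[0.1,  0.4],
                     [0.8, -0.3],
                     [-0.5, 0.7]])
w_out = np.array([-0.6, 0.9])        # output-layer weights [w1, w2]

v = sigmoid(x @ W_hidden)            # hidden-layer outputs v1, v2
z1 = sigmoid(v @ w_out)              # output z1 = σ(wᵀv)
print(v, z1)
```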

Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²

[Figure: the same two-input, two-hidden-unit network as above.]

ANNs in the 90's
• Mostly 2-layer networks, or else carefully constructed “deep” networks (e.g., CNNs)
• Worked well, but training was slow and finicky
  (Nov 1998 – Yann LeCun, Bottou, Bengio, Haffner)

ANNs in the 90's
• Mostly 2-layer networks, or else carefully constructed “deep” networks
• Worked well, but training typically took weeks when guided by an expert
  – SVMs: 98.9–99.2% accurate
  – CNNs: 98.3–99.3% accurate

Learning a multilayer network (recap)
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²

Example: weight updates for a multilayer ANN with square loss and logistic units

For nodes k in the output layer:   δ_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer:   δ_j ≡ (Σ_k δ_k w_{kj}) a_j (1 − a_j)
For all weights:                   w_{kj} ← w_{kj} − ε δ_k a_j,    w_{ji} ← w_{ji} − ε δ_j a_i

“Propagate errors backward”: BACKPROP
Can carry this recursion out further if you have multiple hidden layers (see the sketch below).
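A minimal NumPy sketch of these updates for one example, assuming a single hidden layer of logistic units and square loss (variable names are mine, not from the slides). One sign caveat: with δ_k defined via (t_k − a_k), the gradient of the squared error is −δ_k·a_j, so a gradient-descent step adds ε·δ·a; the sketch uses that descent direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    """One SGD step on a one-hidden-layer logistic network with square loss."""
    a_hid = sigmoid(x @ W_hid)                              # hidden activations a_j
    a_out = sigmoid(a_hid @ W_out)                          # output activations a_k

    delta_out = (t - a_out) * a_out * (1 - a_out)           # delta_k = (t_k - a_k) a_k (1 - a_k)
    delta_hid = (W_out @ delta_out) * a_hid * (1 - a_hid)   # delta_j = (sum_k delta_k w_kj) a_j (1 - a_j)

    W_out += eps * np.outer(a_hid, delta_out)               # output-layer update uses delta_k and a_j
    W_hid += eps * np.outer(x, delta_hid)                   # hidden-layer update uses delta_j and a_i
    return W_hid, W_out
```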

BACKPROP FOR MLPS

BackProp in Matrix-Vector Notation
(Michael Nielsen: http://neuralnetworksanddeeplearning.com/)

Notation
• Each digit is 28x28 pixels = 784 inputs
• w^l is the weight matrix for layer l
• a^l and b^l are the activation and bias vectors for layer l
• z^l is the pre-sigmoid activation (weighted input) vector for layer l
• σ is applied as a vector → vector function: componentwise logistic

Computation is “feedforward”:

  for l = 1, 2, …, L:
      z^l = w^l a^{l−1} + b^l
      a^l = σ(z^l)
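A minimal sketch of this feedforward loop in NumPy, assuming `weights[l]` and `biases[l]` hold w^l and b^l (the names are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Compute a^L layer by layer: z^l = w^l a^(l-1) + b^l, a^l = sigmoid(z^l)."""
    for W, b in zip(weights, biases):
        z = W @ a + b        # pre-sigmoid activation z^l
        a = sigmoid(z)       # activation a^l
    return a
```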

Notation
Cost function to optimize: a sum over examples x,

  C = Σ_x C_x,   where   C_x = ½ ‖y(x) − a^L(x)‖²   (squared error on example x)

BackProp: last layer

Notation for levels l = 1, …, L: matrix w^l; vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l.

Components:   δ^L_j = (∂C/∂a^L_j) · σ′(z^L_j)
Matrix form:  δ^L = ∇_a C ⊙ σ′(z^L),   where ⊙ is the componentwise product of vectors

BackProp: last layer (square loss)

Matrix form for square loss:  δ^L = (a^L − y) ⊙ σ′(z^L)

BackProp: error at level l in terms of error at level l+1

  δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l)

which we can use to compute the gradients:
  ∂C/∂b^l_j = δ^l_j      ∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j

BackProp: summary

  δ^L = ∇_a C ⊙ σ′(z^L)            (for square loss: δ^L = (a^L − y) ⊙ σ′(z^L))
  δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l)
  ∂C/∂b^l_j = δ^l_j
  ∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j

Computation propagates errors backward:

  for l = 1, 2, …, L:       feedforward, computing z^l and a^l
  for l = L, L−1, …, 1:     compute δ^l, then ∂C/∂b^l and ∂C/∂w^l
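Putting the pieces together, here is a compact NumPy sketch of the whole procedure for square loss and logistic units (array names are mine, loosely following Nielsen's notation; a sketch, not the course's reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Gradients of 0.5*||y - a^L||^2 w.r.t. every weight matrix and bias vector."""
    # feedforward: for l = 1..L compute z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # backward pass: delta^L = (a^L - y) * sigma'(z^L), then delta^l from delta^(l+1)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [np.zeros_like(b) for b in biases]
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b[-1] = delta
    grad_W[-1] = np.outer(delta, activations[-2])
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_W[-l] = np.outer(delta, activations[-l - 1])
    return grad_W, grad_b
```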

EXPRESSIVENESS OF DEEP NETWORKS

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any boolean function
  (But it might need lots and lots of hidden units)

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any boolean function
• Example: parity(x1, …, xN) = 1 iff an odd number of the xi's are set to one

  Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR …   # list all the “1s”
                    OR (a & b & c & -d) OR (a & b & -c & d) OR …    # list all the “3s”

  The size in general is O(2^N)

Deeper ANNs are more expressive
• A two-layer network needs O(2^N) units (to express parity)
• A two-layer network can express binary XOR
• A (2·log N)-layer network can express the parity of N inputs (even/odd number of 1's)
  – with the units arranged in a binary tree of depth O(log N); see the sketch below
• Deep network + parameter tying ≈ subroutines

[Figure: a binary tree of units over inputs x1 … x8.]
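A small pure-Python sketch of that binary-tree construction (names mine): parity is computed by pairwise XORs, halving the number of values at each of the O(log N) levels.

```python
def parity_tree(bits):
    """Parity of the inputs via a binary tree of XOR nodes (N-1 nodes, depth O(log N))."""
    level = list(bits)
    while len(level) > 1:
        nxt = [a ^ b for a, b in zip(level[0::2], level[1::2])]  # XOR adjacent pairs
        if len(level) % 2:                                       # an odd element passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(parity_tree([1, 0, 1, 1]))   # 1: three inputs are set, so the parity is odd
```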

Hypothetical code for face recognition
(figure from http://neuralnetworksanddeeplearning.com/chap1.html)

PARALLEL TRAINING FOR ANNS

How are ANNs trained?
• Typically, with some variant of streaming SGD
  – Keep the data on disk, in a preprocessed form
  – Loop over it multiple times
  – Keep the model in memory
• A solution for big data: but long training times!
• However, some parallelism is often used…

Recap: logistic regression with SGD

  P(Y = 1 | X = x) = p = 1 / (1 + e^(−x·w))

The computation has two parts: first the inner product ⟨x, w⟩, then the logistic of ⟨x, w⟩.

On one example this computes the inner product ⟨x, w⟩. There's some chance to compute this in parallel… can we do more?
In ANNs we have many, many logistic regression nodes.

Recap: logistic regression with SGD
• Let x be an example and let w_i be the input weights for the i-th hidden unit. Then the output of unit i is a_i = x·w_i.
• Stack the weight vectors as the columns of a matrix W = [w_1 w_2 w_3 … w_m]. Then a = xW is the output for all m units.
• Let X be a matrix with k examples (one per row). Then A = XW is the output for all m units for all k examples:

  (XW)_{ij} = x_i·w_j,   i.e. row i of XW is [x_i·w_1, x_i·w_2, …, x_i·w_m]

There are a lot of chances to do this in parallel.
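A small NumPy sketch of this batching (the shapes are the point; the numbers are random placeholders):

```python
import numpy as np

k, d, m = 4, 3, 5                  # k examples, d input features, m hidden units
X = np.random.randn(k, d)          # one example per row
W = np.random.randn(d, m)          # one column of input weights per hidden unit

A = X @ W                          # A[i, j] = x_i . w_j for every example and unit at once
Z = 1.0 / (1.0 + np.exp(-A))       # apply the logistic componentwise
print(A.shape)                     # (4, 5)
```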

ANNs and multicore CPUs
• Modern libraries (Matlab, numpy, …) do matrix operations fast, in parallel
• Many ANN implementations exploit this parallelism automatically
• The key implementation issue is working with matrices comfortably

ANNs and GPUs
• GPUs do matrix operations very fast, in parallel
  – For dense matrices, not sparse ones!
• Training ANNs on GPUs is common
  – SGD with minibatch sizes of 128
• Modern ANN implementations can exploit this
• GPUs are not super-expensive
  – $500 for a high-end one
  – large models with O(10^7) parameters can fit in a large-memory GPU (12 GB)
• Speedups of 20x–50x have been reported

ANNs and multi-GPU systems
• There are ways to set up ANN computations so that they are spread across multiple GPUs
  – Sometimes this involves some sort of IPM
  – Sometimes this involves partitioning the model across multiple GPUs
  – Often needed for very large networks
  – Not especially easy to implement with most current tools

WHY ARE DEEP NETWORKS HARD TO TRAIN?

Recap: weight updates for a multilayer ANN

For nodes k in the output layer L:   δ^L_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in hidden layer h:       δ^h_j ≡ (Σ_k δ^{h+1}_k w_{kj}) a_j (1 − a_j)

What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it?

Gradients are unstable
What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?
In such a chain, the gradient is a product of one factor of roughly w·σ′(z) per layer, and σ′ has its maximum at 1/4.
If weights are usually < 1, then we are multiplying by many numbers < 1, so the gradients get very small:
the vanishing gradient problem.
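A toy numeric illustration (mine, not from the slides) of how a product of per-layer factors w·σ′(z), each at most 1/4 when |w| < 1, shrinks the gradient exponentially with depth:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

w, z = 0.8, 0.0                    # weight below 1, unit near the middle of the sigmoid
factor = w * sigmoid_prime(z)      # 0.8 * 0.25 = 0.2 per layer
print(factor ** 10)                # ~1e-7: after 10 layers the gradient has all but vanished
```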

Gradients are unstable (continued)
The same question: what's the gradient for the bias term with several layers after it in a trivial net?
If weights are usually > 1, then we are multiplying by many numbers > 1, so the gradients get very big:
the exploding gradient problem (less common, but possible).

AISTATS 2010 (Glorot & Bengio)
[Figure: histogram of gradients in a 5-layer network for an artificial image recognition task, shown for each layer from input to output.]

AISTATS 2010 (Glorot & Bengio)
We will get to these tricks eventually…

It's easy for sigmoid units to saturate
When a unit saturates, σ′(z) is nearly zero, so the effective learning rate approaches zero and the unit is “stuck”.

It's easy for sigmoid units to saturate
For a big network there are lots of weighted inputs to each neuron. If any of them are too large, then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1), then the SD of the weighted sum is √500 ≈ 22.4
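A quick numeric check of that claim (assuming 500 unit inputs with independent N(0,1) weights, so the pre-sigmoid sum has standard deviation √500):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(10_000, 500)).sum(axis=1)   # 10,000 draws of a 500-term weighted sum
print(np.sqrt(500), z.std())                     # both are roughly 22.4
```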

It's easy for sigmoid units to saturate
• Saturation visualization from Glorot & Bengio 2010, using a smarter initialization scheme
  [Figure: the hidden layer closest to the output is still stuck for the first 100 epochs.]

WHAT'S DIFFERENT ABOUT MODERN ANNS?

Some key differences
• Use of softmax and entropic loss instead of quadratic loss
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Cross-entropy loss
For a single logistic unit with cross-entropy loss, the gradient is

  ∂C/∂w = σ(z) − y    (here for input x = 1)

Compare to the gradient for square loss when a ≈ 1, y = 0 and x = 1: it carries an extra σ′(z) factor, so it is nearly zero when the unit saturates.
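For reference, here is the comparison written out (a standard derivation, not copied from the slides), for a single logistic unit with a = σ(z), z = w·x + b:

```latex
% Square loss: C = (y - a)^2 / 2
\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x
% near a \approx 1 with y = 0, \sigma'(z) \approx 0, so learning is very slow

% Cross-entropy loss: C = -[y \ln a + (1 - y)\ln(1 - a)]
\frac{\partial C}{\partial w} = (a - y)\,x
% the \sigma'(z) factor cancels, so the gradient stays large when the unit is badly wrong
```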


Softmax output layer
• With a softmax output layer, the network outputs a probability distribution!
• Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient:

  Δw_{ij} = (y_i − z_i) y_j
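A minimal NumPy sketch of this combination (names mine; `probs` is the softmax output, `target` a one-hot label, and the gradient of the cross-entropy with respect to the pre-softmax logits is simply probs − target):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])       # one-hot label

probs = softmax(logits)
loss = -np.sum(target * np.log(probs))   # cross-entropy loss
grad_logits = probs - target             # simple, stable gradient w.r.t. the logits
print(probs, loss, grad_logits)
```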

Some key differences
• Use of softmax and entropic loss instead of quadratic loss
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Alternative non-linearities
• Changes so far
  – Changed the loss from square error to cross-entropy
  – Proposed adding another output layer (softmax)
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Alternative non-linearities
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
• Alternative 1: tanh
  – Like the logistic function, but shifted to the range [−1, +1]

AISTATS 2010 (Glorot & Bengio)
We will get to these tricks eventually…
[Figure: results for a depth-5 network.]

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
• Alternative 2: rectified linear unit (reLU)
  – Linear with a cutoff at zero
  – (Implementation: clip the gradient when you pass zero)

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
• Alternative 2: rectified linear unit
  – Soft version: log(exp(x) + 1) (“softplus”)
  – Doesn't saturate (at one end)
  – Sparsifies outputs
  – Helps with the vanishing gradient
A short code sketch of these alternatives follows below.
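A brief NumPy sketch of the nonlinearities mentioned above (the “soft version” log(exp(x)+1) is commonly called softplus; the gradient function mirrors the reLU implementation note):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                    # like the logistic, but with range [-1, +1]

def relu(x):
    return np.maximum(0.0, x)            # linear with a cutoff at zero

def relu_grad(x):
    return (x > 0).astype(float)         # "clip the gradient when you pass zero"

def softplus(x):
    return np.log1p(np.exp(x))           # soft version of reLU; doesn't saturate for large x

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(softplus(x))
```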

Some key differences (recap)
• Use of softmax and entropic loss instead of quadratic loss
• Use of alternate non-linearities (reLU and hyperbolic tangent)
• Better understanding of weight initialization
• Data augmentation, especially for image data

It's easy for sigmoid units to saturate (recap)
For a big network there are lots of weighted inputs to each neuron. If any of them are too large, then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1), then the SD is √500 ≈ 22.4
• Common heuristics for initializing weights:
  – uniform: w ~ U(−1/√(#inputs), +1/√(#inputs))
  – Gaussian: w ~ N(0, σ²) with σ = 1/√(#inputs)

It's easy for sigmoid units to saturate
• Saturation visualization from Glorot & Bengio 2010, using w ~ U(−1/√(#inputs), +1/√(#inputs))

Initializing to avoid saturation
• In Glorot and Bengio, they suggest drawing the weights of level j (with n_j inputs) from

  W ~ U(−√6 / √(n_j + n_{j+1}),  +√6 / √(n_j + n_{j+1}))

• This is not always the solution, but good initialization is very important for deep nets!
• The first breakthrough deep learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies.
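A small sketch of that initialization rule in NumPy (assuming fan_in = n_j and fan_out = n_{j+1}; the function name is mine):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Draw a (fan_in, fan_out) weight matrix from U(-limit, +limit), limit = sqrt(6/(fan_in+fan_out))."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = glorot_uniform(784, 100)    # e.g., 784 MNIST inputs feeding 100 hidden units
print(W1.std())                  # roughly sqrt(2 / (784 + 100)) ≈ 0.048
```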
