
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

On-line Resources
• http://neuralnetworksanddeeplearning.com/index.html - online book by Michael Nielsen
• http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo - online demo of convolutions
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html - demo of a CNN
• http://scs.ryerson.ca/~aharley/vis/conv/ - 3D visualization
• http://cs231n.github.io/ - Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition
• http://www.deeplearningbook.org/ - MIT Press book from Bengio et al., free online version

A history of neural networks
• 1940s-60's:
  – McCulloch & Pitts; Hebb: modeling real neurons
  – Rosenblatt, Widrow-Hoff: perceptrons
  – 1969: Minsky & Papert's Perceptrons book showed formal limitations of one-layer linear networks
• 1970's - mid-1980's: …
• mid-1980's - mid-1990's:
  – backprop and multi-layer networks
  – Rumelhart and McClelland's PDP book set
  – Sejnowski's NETtalk, BP-based text-to-speech
  – Neural Info Processing Systems (NIPS) conference starts
• Mid-1990's - early 2000's: …
• Mid-2000's to current:
  – More and more interest and experimental success


Multilayer networks
• Simplest case: the classifier is a multilayer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• The output of one unit can be the input of another

[Figure: a small network with an input layer (x1, x2, and a constant 1 for the bias), a hidden layer computing v1 = σ(wᵀx) and v2 = σ(wᵀx), and an output layer computing z1 = σ(wᵀv); the edges carry weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2 and w1, w2.]
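To make the figure concrete, here is a minimal NumPy sketch of the forward pass through that small network (the weight values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one example: the leading 1 is the constant bias input, then x1, x2
x = np.array([1.0, 0.5, -0.2])
# hidden-layer weights: column j holds [w0,j, w1,j, w2,j] for hidden unit vj (made-up values)
W_hidden = np.array([[0.1,  0.4],
                     [0.8, -0.3],
                     [-0.5, 0.7]])
w_out = np.array([-0.6, 0.9])        # output-layer weights [w1, w2]

v = sigmoid(x @ W_hidden)            # hidden-layer outputs v1, v2
z1 = sigmoid(v @ w_out)              # output z1 = σ(wᵀv)
print(v, z1)
```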

Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²

[Figure: the same two-input, two-hidden-unit network as above.]

ANNs in the 90's
• Mostly 2-layer networks, or else carefully constructed “deep” networks (e.g., CNNs)
• Worked well, but training was slow and finicky
  (Nov 1998 – Yann LeCun, Bottou, Bengio, Haffner)

ANNs in the 90's
• Mostly 2-layer networks, or else carefully constructed “deep” networks
• Worked well, but training typically took weeks when guided by an expert
  – SVMs: 98.9–99.2% accurate
  – CNNs: 98.3–99.3% accurate

Learning a multilayer network (recap)
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²

Example: weight updates for a multilayer ANN with square loss and logistic units

For nodes k in the output layer:   δ_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer:   δ_j ≡ (Σ_k δ_k w_{kj}) a_j (1 − a_j)
For all weights:                   w_{kj} ← w_{kj} − ε δ_k a_j,    w_{ji} ← w_{ji} − ε δ_j a_i

“Propagate errors backward”: BACKPROP
Can carry this recursion out further if you have multiple hidden layers (see the sketch below).
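A minimal NumPy sketch of these updates for one example, assuming a single hidden layer of logistic units and square loss (variable names are mine, not from the slides). One sign caveat: with δ_k defined via (t_k − a_k), the gradient of the squared error is −δ_k·a_j, so a gradient-descent step adds ε·δ·a; the sketch uses that descent direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    """One SGD step on a one-hidden-layer logistic network with square loss."""
    a_hid = sigmoid(x @ W_hid)                              # hidden activations a_j
    a_out = sigmoid(a_hid @ W_out)                          # output activations a_k

    delta_out = (t - a_out) * a_out * (1 - a_out)           # delta_k = (t_k - a_k) a_k (1 - a_k)
    delta_hid = (W_out @ delta_out) * a_hid * (1 - a_hid)   # delta_j = (sum_k delta_k w_kj) a_j (1 - a_j)

    W_out += eps * np.outer(a_hid, delta_out)               # output-layer update uses delta_k and a_j
    W_hid += eps * np.outer(x, delta_hid)                   # hidden-layer update uses delta_j and a_i
    return W_hid, W_out
```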

BACKPROP FOR MLPS

BackProp in Matrix-Vector Notation
(Michael Nielsen: http://neuralnetworksanddeeplearning.com/)

Notation
• Each digit is 28x28 pixels = 784 inputs
• w^l is the weight matrix for layer l
• a^l and b^l are the activation and bias vectors for layer l
• z^l is the pre-sigmoid activation (weighted input) vector for layer l
• σ is applied as a vector → vector function: componentwise logistic

Computation is “feedforward”:

  for l = 1, 2, …, L:
      z^l = w^l a^{l−1} + b^l
      a^l = σ(z^l)
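A minimal sketch of this feedforward loop in NumPy, assuming `weights[l]` and `biases[l]` hold w^l and b^l (the names are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Compute a^L layer by layer: z^l = w^l a^(l-1) + b^l, a^l = sigmoid(z^l)."""
    for W, b in zip(weights, biases):
        z = W @ a + b        # pre-sigmoid activation z^l
        a = sigmoid(z)       # activation a^l
    return a
```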

Notation
Cost function to optimize: a sum over examples x,

  C = Σ_x C_x,   where   C_x = ½ ‖y(x) − a^L(x)‖²   (squared error on example x)

BackProp: last layer

Notation for levels l = 1, …, L: matrix w^l; vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l.

Components:   δ^L_j = (∂C/∂a^L_j) · σ′(z^L_j)
Matrix form:  δ^L = ∇_a C ⊙ σ′(z^L),   where ⊙ is the componentwise product of vectors

BackProp: last layer (square loss)

Matrix form for square loss:  δ^L = (a^L − y) ⊙ σ′(z^L)

BackProp: error at level l in terms of error at level l+1

  δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l)

which we can use to compute the gradients:
  ∂C/∂b^l_j = δ^l_j      ∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j

BackProp: summary

  δ^L = ∇_a C ⊙ σ′(z^L)            (for square loss: δ^L = (a^L − y) ⊙ σ′(z^L))
  δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l)
  ∂C/∂b^l_j = δ^l_j
  ∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j

Computation propagates errors backward:

  for l = 1, 2, …, L:       feedforward, computing z^l and a^l
  for l = L, L−1, …, 1:     compute δ^l, then ∂C/∂b^l and ∂C/∂w^l
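Putting the pieces together, here is a compact NumPy sketch of the whole procedure for square loss and logistic units (array names are mine, loosely following Nielsen's notation; a sketch, not the course's reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Gradients of 0.5*||y - a^L||^2 w.r.t. every weight matrix and bias vector."""
    # feedforward: for l = 1..L compute z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # backward pass: delta^L = (a^L - y) * sigma'(z^L), then delta^l from delta^(l+1)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [np.zeros_like(b) for b in biases]
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b[-1] = delta
    grad_W[-1] = np.outer(delta, activations[-2])
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_W[-l] = np.outer(delta, activations[-l - 1])
    return grad_W, grad_b
```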

EXPRESSIVENESS OF DEEP NETWORKS

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any boolean function
  (But it might need lots and lots of hidden units)

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any boolean function
• Example: parity(x1, …, xN) = 1 iff an odd number of the xi's are set to one

  Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR …   # list all the “1s”
                    OR (a & b & c & -d) OR (a & b & -c & d) OR …    # list all the “3s”

  The size in general is O(2^N)

Deeper ANNs are more expressive
• A two-layer network needs O(2^N) units (to express parity)
• A two-layer network can express binary XOR
• A (2·log N)-layer network can express the parity of N inputs (even/odd number of 1's)
  – with the units arranged in a binary tree of depth O(log N); see the sketch below
• Deep network + parameter tying ≈ subroutines

[Figure: a binary tree of units over inputs x1 … x8.]
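A small pure-Python sketch of that binary-tree construction (names mine): parity is computed by pairwise XORs, halving the number of values at each of the O(log N) levels.

```python
def parity_tree(bits):
    """Parity of the inputs via a binary tree of XOR nodes (N-1 nodes, depth O(log N))."""
    level = list(bits)
    while len(level) > 1:
        nxt = [a ^ b for a, b in zip(level[0::2], level[1::2])]  # XOR adjacent pairs
        if len(level) % 2:                                       # an odd element passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(parity_tree([1, 0, 1, 1]))   # 1: three inputs are set, so the parity is odd
```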

Hypothetical code for face recognition
(figure from http://neuralnetworksanddeeplearning.com/chap1.html)

PARALLEL TRAINING FOR ANNS

How are ANNs trained?
• Typically, with some variant of streaming SGD
  – Keep the data on disk, in a preprocessed form
  – Loop over it multiple times
  – Keep the model in memory
• A solution for big data: but long training times!
• However, some parallelism is often used…

Recap: logistic regression with SGD

  P(Y = 1 | X = x) = p = 1 / (1 + e^(−x·w))

The computation has two parts: first the inner product ⟨x, w⟩, then the logistic of ⟨x, w⟩.

On one example this computes the inner product ⟨x, w⟩. There's some chance to compute this in parallel… can we do more?
In ANNs we have many, many logistic regression nodes.

Recap: logistic regression with SGD
• Let x be an example and let w_i be the input weights for the i-th hidden unit. Then the output of unit i is a_i = x·w_i.
• Stack the weight vectors as the columns of a matrix W = [w_1 w_2 w_3 … w_m]. Then a = xW is the output for all m units.
• Let X be a matrix with k examples (one per row). Then A = XW is the output for all m units for all k examples:

  (XW)_{ij} = x_i·w_j,   i.e. row i of XW is [x_i·w_1, x_i·w_2, …, x_i·w_m]

There are a lot of chances to do this in parallel.
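A small NumPy sketch of this batching (the shapes are the point; the numbers are random placeholders):

```python
import numpy as np

k, d, m = 4, 3, 5                  # k examples, d input features, m hidden units
X = np.random.randn(k, d)          # one example per row
W = np.random.randn(d, m)          # one column of input weights per hidden unit

A = X @ W                          # A[i, j] = x_i . w_j for every example and unit at once
Z = 1.0 / (1.0 + np.exp(-A))       # apply the logistic componentwise
print(A.shape)                     # (4, 5)
```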

ANNs and multicore CPUs
• Modern libraries (Matlab, numpy, …) do matrix operations fast, in parallel
• Many ANN implementations exploit this parallelism automatically
• The key implementation issue is working with matrices comfortably

ANNs and GPUs
• GPUs do matrix operations very fast, in parallel
  – For dense matrices, not sparse ones!
• Training ANNs on GPUs is common
  – SGD with minibatch sizes of 128
• Modern ANN implementations can exploit this
• GPUs are not super-expensive
  – $500 for a high-end one
  – large models with O(10^7) parameters can fit in a large-memory GPU (12 GB)
• Speedups of 20x–50x have been reported

ANNs and multi-GPU systems
• There are ways to set up ANN computations so that they are spread across multiple GPUs
  – Sometimes this involves some sort of IPM
  – Sometimes this involves partitioning the model across multiple GPUs
  – Often needed for very large networks
  – Not especially easy to implement with most current tools

WHY ARE DEEP NETWORKS HARD TO TRAIN?

Recap: weight updates for a multilayer ANN

For nodes k in the output layer L:   δ^L_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in hidden layer h:       δ^h_j ≡ (Σ_k δ^{h+1}_k w_{kj}) a_j (1 − a_j)

What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it?

Gradients are unstable
What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?
In such a chain, the gradient is a product of one factor of roughly w·σ′(z) per layer, and σ′ has its maximum at 1/4.
If weights are usually < 1, then we are multiplying by many numbers < 1, so the gradients get very small:
the vanishing gradient problem.
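A toy numeric illustration (mine, not from the slides) of how a product of per-layer factors w·σ′(z), each at most 1/4 when |w| < 1, shrinks the gradient exponentially with depth:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

w, z = 0.8, 0.0                    # weight below 1, unit near the middle of the sigmoid
factor = w * sigmoid_prime(z)      # 0.8 * 0.25 = 0.2 per layer
print(factor ** 10)                # ~1e-7: after 10 layers the gradient has all but vanished
```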

Gradients are unstable (continued)
The same question: what's the gradient for the bias term with several layers after it in a trivial net?
If weights are usually > 1, then we are multiplying by many numbers > 1, so the gradients get very big:
the exploding gradient problem (less common, but possible).

AISTATS 2010 (Glorot & Bengio)
[Figure: histogram of gradients in a 5-layer network for an artificial image recognition task, shown for each layer from input to output.]

AISTATS 2010 (Glorot & Bengio)
We will get to these tricks eventually…

It's easy for sigmoid units to saturate
When a unit saturates, σ′(z) is nearly zero, so the effective learning rate approaches zero and the unit is “stuck”.

It's easy for sigmoid units to saturate
For a big network there are lots of weighted inputs to each neuron. If any of them are too large, then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1), then the SD of the weighted sum is √500 ≈ 22.4
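A quick numeric check of that claim (assuming 500 unit inputs with independent N(0,1) weights, so the pre-sigmoid sum has standard deviation √500):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(10_000, 500)).sum(axis=1)   # 10,000 draws of a 500-term weighted sum
print(np.sqrt(500), z.std())                     # both are roughly 22.4
```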

It's easy for sigmoid units to saturate
• Saturation visualization from Glorot & Bengio 2010, using a smarter initialization scheme
  [Figure: the hidden layer closest to the output is still stuck for the first 100 epochs.]

WHAT'S DIFFERENT ABOUT MODERN ANNS?

Some key differences
• Use of softmax and entropic loss instead of quadratic loss
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Cross-entropy loss
For a single logistic unit with cross-entropy loss, the gradient is

  ∂C/∂w = σ(z) − y    (here for input x = 1)

Compare to the gradient for square loss when a ≈ 1, y = 0 and x = 1: it carries an extra σ′(z) factor, so it is nearly zero when the unit saturates.
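For reference, here is the comparison written out (a standard derivation, not copied from the slides), for a single logistic unit with a = σ(z), z = w·x + b:

```latex
% Square loss: C = (y - a)^2 / 2
\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x
% near a \approx 1 with y = 0, \sigma'(z) \approx 0, so learning is very slow

% Cross-entropy loss: C = -[y \ln a + (1 - y)\ln(1 - a)]
\frac{\partial C}{\partial w} = (a - y)\,x
% the \sigma'(z) factor cancels, so the gradient stays large when the unit is badly wrong
```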


Softmax output layer
• With a softmax output layer, the network outputs a probability distribution!
• Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient:

  Δw_{ij} = (y_i − z_i) y_j
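A minimal NumPy sketch of this combination (names mine; `probs` is the softmax output, `target` a one-hot label, and the gradient of the cross-entropy with respect to the pre-softmax logits is simply probs − target):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])       # one-hot label

probs = softmax(logits)
loss = -np.sum(target * np.log(probs))   # cross-entropy loss
grad_logits = probs - target             # simple, stable gradient w.r.t. the logits
print(probs, loss, grad_logits)
```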

Some key differences
• Use of softmax and entropic loss instead of quadratic loss
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Alternative non-linearities
• Changes so far
  – Changed the loss from square error to cross-entropy
  – Proposed adding another output layer (softmax)
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Alternative non-linearities
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
• Alternative 1: tanh
  – Like the logistic function, but shifted to the range [−1, +1]

AISTATS 2010 (Glorot & Bengio)
We will get to these tricks eventually…
[Figure: results for a depth-5 network.]

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
• Alternative 2: rectified linear unit (reLU)
  – Linear with a cutoff at zero
  – (Implementation: clip the gradient when you pass zero)

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
• Alternative 2: rectified linear unit
  – Soft version: log(exp(x) + 1) (“softplus”)
  – Doesn't saturate (at one end)
  – Sparsifies outputs
  – Helps with the vanishing gradient
A short code sketch of these alternatives follows below.
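A brief NumPy sketch of the nonlinearities mentioned above (the “soft version” log(exp(x)+1) is commonly called softplus; the gradient function mirrors the reLU implementation note):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                    # like the logistic, but with range [-1, +1]

def relu(x):
    return np.maximum(0.0, x)            # linear with a cutoff at zero

def relu_grad(x):
    return (x > 0).astype(float)         # "clip the gradient when you pass zero"

def softplus(x):
    return np.log1p(np.exp(x))           # soft version of reLU; doesn't saturate for large x

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(softplus(x))
```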

Some key differences (recap)
• Use of softmax and entropic loss instead of quadratic loss
• Use of alternate non-linearities (reLU and hyperbolic tangent)
• Better understanding of weight initialization
• Data augmentation, especially for image data

It's easy for sigmoid units to saturate (recap)
For a big network there are lots of weighted inputs to each neuron. If any of them are too large, then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1), then the SD is √500 ≈ 22.4
• Common heuristics for initializing weights:
  – uniform: w ~ U(−1/√(#inputs), +1/√(#inputs))
  – Gaussian: w ~ N(0, σ²) with σ = 1/√(#inputs)

It's easy for sigmoid units to saturate
• Saturation visualization from Glorot & Bengio 2010, using w ~ U(−1/√(#inputs), +1/√(#inputs))

Initializing to avoid saturation
• In Glorot and Bengio, they suggest drawing the weights of level j (with n_j inputs) from

  W ~ U(−√6 / √(n_j + n_{j+1}),  +√6 / √(n_j + n_{j+1}))

• This is not always the solution, but good initialization is very important for deep nets!
• The first breakthrough deep learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies.
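A small sketch of that initialization rule in NumPy (assuming fan_in = n_j and fan_out = n_{j+1}; the function name is mine):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Draw a (fan_in, fan_out) weight matrix from U(-limit, +limit), limit = sqrt(6/(fan_in+fan_out))."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = glorot_uniform(784, 100)    # e.g., 784 MNIST inputs feeding 100 hidden units
print(W1.std())                  # roughly sqrt(2 / (784 + 100)) ≈ 0.048
```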
