DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
On-line Resources
• http://neuralnetworksanddeeplearning.com/index.html - online book by Michael Nielsen
• http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo - demo of convolutions
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html - demo of a CNN
• http://scs.ryerson.ca/~aharley/vis/conv/ - 3D visualization
• http://cs231n.github.io/ - Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition
• http://www.deeplearningbook.org/ - MIT Press book from Bengio et al., free online version
A history of neural networks
• 1940s-60s:
  – McCulloch & Pitts; Hebb: modeling real neurons
  – Rosenblatt, Widrow-Hoff: perceptrons
  – 1969: Minsky & Papert's Perceptrons book showed formal limitations of one-layer linear networks
• 1970s - mid-1980s: …
• mid-1980s - mid-1990s:
  – backprop and multi-layer networks
  – Rumelhart and McClelland's PDP book set
  – Sejnowski's NETtalk, BP-based text-to-speech
  – Neural Info Processing Systems (NIPS) conference starts
• Mid-1990s - early 2000s: …
• Mid-2000s to current:
  – More and more interest and experimental success
Multilayer networks
• Simplest case: the classifier is a multilayer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• The output of one unit can be the input of another
[Figure: input layer (x1, x2, and a constant-1 bias input), a hidden layer with units v1 = σ(wᵀx) and v2 = σ(wᵀx), and an output layer with z1 = σ(wᵀv); weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2 feed the hidden units and w1, w2 feed the output unit]
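A minimal numpy sketch of the pictured network may make the wiring concrete: two inputs plus a bias feed two logistic hidden units, whose outputs (plus a bias) feed one logistic output unit. The weight values and names below are invented purely for illustration.

import numpy as np

def logistic(z):
    # componentwise logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

# made-up weights; row 0 plays the role of the bias weights w0,1 and w0,2
w_hidden = np.array([[0.1, -0.4],   # w0,1  w0,2
                     [0.8,  0.3],   # w1,1  w1,2
                     [-0.5, 0.9]])  # w2,1  w2,2
w_output = np.array([0.2, 0.7, -0.6])  # bias, w1, w2 for the output unit

def forward(x1, x2):
    x = np.array([1.0, x1, x2])                    # prepend the constant-1 bias input
    v = logistic(x @ w_hidden)                     # hidden activations v1, v2
    z = logistic(np.array([1.0, *v]) @ w_output)   # output of the network
    return z

print(forward(0.5, -1.0))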
Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of "units"
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²

[Figure: the same two-hidden-unit network as above]
ANNs in the '90s
• Mostly 2-layer networks, or else carefully constructed "deep" networks (e.g., CNNs)
• Worked well, but training was slow and finicky
Nov 1998 - Yann LeCun, Bottou, Bengio, Haffner

ANNs in the '90s
• Mostly 2-layer networks, or else carefully constructed "deep" networks
• Worked well, but training typically took weeks when guided by an expert
SVMs: 98.9-99.2% accurate
CNNs: 98.3-99.3% accurate
Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of "units"
• Minimize the loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

  J_{X,y}(w) = Σ_i (y_i − ŷ_i)²
Example: weight updates for a multilayer ANN with square loss and logistic units

For nodes k in the output layer:   δ_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer:   δ_j ≡ (Σ_k δ_k w_kj) a_j (1 − a_j)
For all weights:                   w_kj ← w_kj + ε δ_k a_j,   w_ji ← w_ji + ε δ_j a_i
(with δ defined via (t − a), gradient descent on the squared error adds ε δ times the incoming activation)

"Propagate errors backward": BACKPROP
Can carry this recursion out further if you have multiple hidden layers.
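Here is a minimal numpy sketch of one such update for a network with a single hidden layer (names, shapes, and toy values are mine; biases are omitted to keep it short):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    # x: input vector, t: target vector
    # W_hid: (n_in, n_hid) hidden-layer weights, W_out: (n_hid, n_out) output weights
    a_hid = logistic(x @ W_hid)                              # hidden activations a_j
    a_out = logistic(a_hid @ W_out)                          # output activations a_k
    delta_out = (t - a_out) * a_out * (1 - a_out)            # delta_k for output nodes
    delta_hid = (W_out @ delta_out) * a_hid * (1 - a_hid)    # delta_j for hidden nodes
    W_out += eps * np.outer(a_hid, delta_out)                # w_kj <- w_kj + eps * delta_k * a_j
    W_hid += eps * np.outer(x, delta_hid)                    # w_ji <- w_ji + eps * delta_j * x_i
    return a_out

# toy usage with made-up sizes
rng = np.random.default_rng(0)
W_hid = 0.1 * rng.standard_normal((3, 4))
W_out = 0.1 * rng.standard_normal((4, 2))
backprop_step(np.array([1.0, 0.5, -0.2]), np.array([1.0, 0.0]), W_hid, W_out)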
BACKPROP FOR MLPS
BackProp in Matrix-Vector Notation
Michael Nielsen: http://neuralnetworksanddeeplearning.com/
Notation
• Each digit is 28x28 pixels = 784 inputs
• Matrix w^l is the weight matrix for layer l
• Vector b^l is the bias for layer l
• Vector a^l is the activation (output) of layer l
• Vector z^l is the pre-sigmoid activation of layer l
• σ is a vector→vector function: the componentwise logistic
Computation is "feedforward"
for l = 1, 2, …, L:
  z^l = w^l a^{l-1} + b^l
  a^l = σ(z^l)
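In code, the feedforward pass is just this loop. A short numpy sketch in the notation above (function and variable names are mine):

import numpy as np

def sigma(z):
    # componentwise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    # weights[l] is w^l (shape n_l x n_{l-1}); biases[l] is b^l (length n_l)
    a = x
    activations = [a]
    for w_l, b_l in zip(weights, biases):
        z_l = w_l @ a + b_l      # z^l = w^l a^{l-1} + b^l
        a = sigma(z_l)           # a^l = sigma(z^l)
        activations.append(a)
    return activations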
Notation
Cost function to optimize: a sum over examples x, e.g. the quadratic cost
  C(w, b) = (1/2n) Σ_x || y(x) − a^L(x) ||²
where y(x) is the target output (vector y) for example x and a^L(x) is the network's output.
BackProp: last layer
(Notation for levels l = 1, …, L: matrix w^l; vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, "local error" δ^l)
Matrix form:  δ^L = ∇_a C ⊙ σ′(z^L)
where ⊙ is the componentwise product of vectors; the components are δ^L_j = (∂C/∂a^L_j) σ′(z^L_j).

BackProp: last layer
Matrix form for square loss:  δ^L = (a^L − y) ⊙ σ′(z^L)

BackProp: error at level l in terms of error at level l+1
  δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
which we can use to compute the gradients
  ∂C/∂b^l_j = δ^l_j   and   ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j

BackProp: summary
  δ^L = ∇_a C ⊙ σ′(z^L)
  δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
  ∂C/∂b^l_j = δ^l_j
  ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j

Computation propagates errors backward
  forward, for l = 1, 2, …, L: compute z^l and a^l
  backward, for l = L, L-1, …, 1: compute δ^l and the gradients ∂C/∂w^l, ∂C/∂b^l
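Putting the four equations together, here is a minimal numpy sketch of backprop for the quadratic cost in this matrix-vector notation (function and variable names are mine):

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    # feedforward, for l = 1..L, saving z^l and a^l
    a, activations, zs = x, [x], []
    for w_l, b_l in zip(weights, biases):
        z = w_l @ a + b_l
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    # local error at the last layer: delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigma_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    # propagate errors backward, for l = L, L-1, ..., 1
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                              # dC/db^l = delta^l
        grad_w[l] = np.outer(delta, activations[l])    # dC/dw^l_jk = a^{l-1}_k delta^l_j
        if l > 0:
            delta = (weights[l].T @ delta) * sigma_prime(zs[l - 1])
    return grad_w, grad_b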
EXPRESSIVENESS OF DEEP NETWORKS
Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any boolean function
  (But it might need lots and lots of hidden units)
• Example: parity(x1, …, xN) = 1 iff an odd number of the xi's are set to one (see the sketch below)
  Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR …   # list all the "1s"
                    OR (a & b & c & -d) OR (a & b & -c & d) OR …    # list all the "3s"
  The size in general is O(2^N)
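To make the blow-up concrete, the following small script (mine, purely for illustration) enumerates the AND terms of the OR-of-ANDs formula for parity:

from itertools import product

def parity_dnf(n):
    # one AND term per input assignment with an odd number of ones
    terms = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 1:
            term = " & ".join(f"x{i+1}" if b else f"-x{i+1}"
                              for i, b in enumerate(bits))
            terms.append("(" + term + ")")
    return terms

terms = parity_dnf(4)
print(len(terms))                          # 8 terms, i.e. 2^(N-1) of the 2^N assignments
print(" OR ".join(terms[:2]), "OR ...")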
Deeper ANNs are more expressive
• A two-layer network needs O(2^N) units
• A two-layer network can express binary XOR
• A 2·log(N)-layer network can express the parity of N inputs (even/odd number of 1's)
  – With O(N) units arranged in a binary tree of depth O(log N)
• Deep network + parameter tying ~= subroutines
[Figure: binary tree over inputs x1, x2, …, x8]
Hypothetical code for face recognition
http://neuralnetworksanddeeplearning.com/chap1.html
[Figure: a face-recognition network decomposed into sub-networks, each answering a simpler sub-question]
PARALLEL TRAINING FOR ANNS
How are ANNs trained?
• Typically, with some variant of streaming SGD
  – Keep the data on disk, in a preprocessed form
  – Loop over it multiple times
  – Keep the model in memory
• A solution to big data: but long training times!
• However, some parallelism is often used…
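A minimal sketch of what such a streaming SGD loop might look like for a logistic-regression model (the file name, line format, and every name below are invented for illustration):

import numpy as np

def streaming_sgd(path="train_preprocessed.txt", n_features=784, epochs=5, eps=0.01):
    # assumed line format: feature values followed by a 0/1 label
    w = np.zeros(n_features)                    # model kept in memory
    for _ in range(epochs):                     # loop over the data multiple times
        with open(path) as f:
            for line in f:                      # one example at a time, from disk
                *x, y = map(float, line.split())
                x = np.asarray(x)
                p = 1.0 / (1.0 + np.exp(-x @ w))   # logistic prediction
                w += eps * (y - p) * x             # SGD step on the log-likelihood
    return w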
Recap: logistic regression with SGD

  P(Y = 1 | X = x) = p = 1 / (1 + e^{−x·w})

One part of this computes the inner product <x, w>; the other part takes the logistic of <x, w>.

On one example this computes the inner product <x, w>. There's some chance to compute this in parallel… can we do more?
In ANNs we have many, many logistic regression nodes.
[Figure: a single logistic unit with pre-sigmoid input z and activation a]
Recap: logistic regression with SGD
Let x be an example, and let w_i be the input weights for the i-th hidden unit. Then the output of the i-th unit is a_i = x · w_i.
Stacking the weight vectors as the columns of a matrix W = [w_1 w_2 w_3 … w_m], a = xW is the output for all m units at once.
[Figure: hidden units with pre-sigmoid inputs z_i and activations a_i; W shown as a matrix whose columns are w_1 … w_m]
Recap: logistic regression with SGD
Let X be a matrix with k examples (one example x_j per row), and let w_i be the input weights for the i-th hidden unit. Then A = XW is the output for all m units for all k examples:

  (XW)[j, i] = x_j · w_i

so row j of A holds the outputs of all m hidden units on example x_j. There are a lot of chances to do this in parallel.
ANNs and multicore CPUs
• Modern libraries (Matlab, numpy, …) do matrix operations fast, in parallel
• Many ANN implementations exploit this parallelism automatically
• The key implementation issue is working with matrices comfortably

ANNs and GPUs
• GPUs do matrix operations very fast, in parallel
  – For dense matrices, not sparse ones!
• Training ANNs on GPUs is common
  – SGD and minibatch sizes of 128
• Modern ANN implementations can exploit this
• GPUs are not super-expensive
  – $500 for a high-end one
  – large models with O(10^7) parameters can fit in a large-memory GPU (12GB)
• Speedups of 20x-50x have been reported

ANNs and multi-GPU systems
• There are ways to set up ANN computations so that they are spread across multiple GPUs
  – Sometimes involves some sort of IPM
  – Sometimes involves partitioning the model across multiple GPUs
  – Often needed for very large networks
  – Not especially easy to implement with most current tools
WHY ARE DEEP NETWORKS HARD TO TRAIN?

Recap: weight updates for a multilayer ANN
For nodes k in the output layer L:   δ^L_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in hidden layer h:       δ^h_j ≡ (Σ_k δ^{h+1}_k w_kj) a_j (1 − a_j)
What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it?
Gradients are unstable
What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?
[Figure: σ′(z) has its maximum value of 1/4, at z = 0]
If weights are usually < 1 then we are multiplying by many numbers < 1, so the gradients get very small.
The vanishing gradient problem

Gradients are unstable
If weights are usually > 1 then we are multiplying by many numbers > 1, so the gradients get very big.
The exploding gradient problem (less common, but possible)
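A tiny numerical illustration of both regimes, assuming a trivial chain of one sigmoid unit per layer (the depth and weight values are invented):

import numpy as np

def sigma_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)          # maximum value is 1/4, at z = 0

# the gradient for an early bias contains one factor w_l * sigma'(z_l)
# per later layer; multiply a few such factors to see the two regimes
depth = 10
for w in (0.8, 5.0):            # typical |w| < 1  vs  |w| > 1
    factor = np.prod([w * sigma_prime(0.0) for _ in range(depth)])
    print(f"w = {w}: product of {depth} factors = {factor:.3g}")
# w = 0.8 -> ~1e-7 (vanishing); w = 5.0 -> ~9.3 (exploding)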
AISTATS 2010
[Figure: histogram of gradients in a 5-layer network for an artificial image recognition task, from the input layer to the output layer]

AISTATS 2010
We will get to these tricks eventually…
It’seasyforsigmoidunitstosaturate
Learningrateapproacheszero
andunitis“stuck”
componentsare
51
It’seasyforsigmoidunitstosaturate
Forabignetworktherearelotsofweightedinputstoeachneuron.Ifanyofthemaretoolargethentheneuronwillsaturate.Soneuronsget
stuckwithafewlargeinputsORmanysmallones.52
It’seasyforsigmoidunitstosaturate• Ifthereare500non-zeroinputsinitializedwithaGaussian~N(0,1)thentheSDis 500 ≈ 22.4
53
• SaturationvisualizationfromGlorot &Bengio 2010-- usingasmarterinitializationscheme
It’seasyforsigmoidunitstosaturate
Closest-to-outputhiddenlayerstillstuckforfirst100
epochs
54
WHAT'S DIFFERENT ABOUT MODERN ANNS?

Some key differences
• Use of softmax and entropic loss instead of quadratic loss
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data
Cross-entropy loss
For a single logistic unit with cross-entropy loss, the gradient is
  ∂C/∂w = σ(z) − y        (for input x = 1)
Compare to the gradient for square loss when a ≈ 1, y = 0 and x = 1: the square-loss gradient has an extra σ′(z) factor, so it is near zero when the unit saturates, while the cross-entropy gradient is not.
[Figure: a single logistic unit with pre-sigmoid input z and activation a]
Softmax output layer
The network outputs a probability distribution! Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient:
  Δw_ij = (y_i − z_i) y_j
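In the notation used below, which differs slightly from the slide (p for the softmax outputs, t for the one-hot target, a for the previous layer's activations), the gradient with respect to the pre-softmax scores is simply p − t. A minimal sketch:

import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_xent_grad(a, W, t):
    z = W @ a                       # pre-softmax scores
    p = softmax(z)                  # network outputs a probability distribution
    dC_dz = p - t                   # the very simple, stable gradient
    dC_dW = np.outer(dC_dz, a)      # gradient w.r.t. the output-layer weights
    loss = -np.log(p[t.argmax()])   # cross-entropy with a one-hot target
    return loss, dC_dW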
Some key differences
• Use of softmax and entropic loss instead of quadratic loss
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data
Alternative non-linearities
• Changes so far
  – Changed the loss from square error to cross-entropy
  – Proposed adding another output layer (softmax)
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Alternate 1: tanh
Like the logistic function, but shifted to the range [-1, +1]
AISTATS 2010
We will get to these tricks eventually…
[Figure: results at depth 5]

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implement: clip the gradient when you pass zero)
Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU is often used in vision tasks
Alternate 2: rectified linear unit
Soft version: log(exp(x) + 1)
Doesn't saturate (at one end), sparsifies outputs, helps with the vanishing gradient
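Both nonlinearities, and the gradients used in backprop, as a small numpy sketch (function names are mine):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # linear with a cutoff at zero

def relu_grad(x):
    return (x > 0).astype(float)     # gradient clipped to 0 when you pass zero

def softplus(x):
    return np.log1p(np.exp(x))       # soft version: log(exp(x) + 1)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))  # the logistic function, never exactly 0 or 1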
Some key differences
• Use of softmax and entropic loss instead of quadratic loss
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

It's easy for sigmoid units to saturate
For a big network there are lots of weighted inputs to each neuron. If any of them are too large then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.
It’seasyforsigmoidunitstosaturate• Ifthereare500non-zeroinputsinitializedwithaGaussian~N(0,1)thentheSDis
• Commonheuristicsforinitializingweights:
500 ≈ 22.4
U −1#inputs
, −1#inputs
"
#$$
%
&''N 0, 1
#inputs
!
"##
$
%&&
70
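A quick numerical check of those numbers (a sketch; the all-ones input vector matches the slide's back-of-the-envelope argument, and the sample size is arbitrary):

import numpy as np

n = 500
rng = np.random.default_rng(0)
x = np.ones(n)                                                   # 500 non-zero inputs

z_naive = rng.normal(0, 1, size=(10000, n)) @ x                  # N(0, 1) weights
z_gauss = rng.normal(0, np.sqrt(1.0 / n), size=(10000, n)) @ x   # N(0, 1/#inputs)
z_unif = rng.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(10000, n)) @ x

print(z_naive.std())   # ~ 22.4: deep in the saturated region of a sigmoid
print(z_gauss.std())   # ~ 1.0
print(z_unif.std())    # ~ 0.58 (uniform on [-1/sqrt(n), 1/sqrt(n)] has per-weight sd 1/sqrt(3n))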
It's easy for sigmoid units to saturate
• Saturation visualization from Glorot & Bengio 2010, using weights drawn from U(−1/√(#inputs), +1/√(#inputs))
[Figure: activation values of each hidden layer during training]

Initializing to avoid saturation
• In Glorot and Bengio they suggest drawing the weights of level j (with n_j inputs and n_{j+1} outputs) from their "normalized" uniform distribution
  w ~ U(−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1}))
• This is not always the solution, but good initialization is very important for deep nets!
The first breakthrough deep learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies.