Learning Deep Learning
M1 Sonse Shimaoka
Neural Network
(From the lecture slides of Nando de Freitas)
Machine Learning
Supervised Learning
Training data:
Input                    Output
[0,0,1,0,1,1,0,0,0,1,1]  1
[1,1,1,0,1,1,1,0,0,1,1]  0
[1,1,1,0,1,1,0,0,0,1,1]  0
[0,0,0,0,1,1,1,0,0,0,0]  1
[1,0,1,0,1,1,0,0,0,0,0]  1
[1,0,1,0,0,0,0,0,0,1,1]  0
[0,0,0,0,1,1,0,1,0,1,1]  1

Test data:
Input                    Output
[1,0,1,0,1,1,0,0,0,1,0]  ?
[1,1,1,1,1,1,1,0,0,1,1]  ?
[1,0,1,0,1,1,0,1,0,1,1]  ?

Generalization
Perceptron
(Diagram: a perceptron. Inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed a summation unit ∑ followed by a sign unit, producing output y.)

$$y = \operatorname{sign}\left(\sum_{j=1}^{3} w_j x_j + b\right)$$
Perceptron
(Diagram: the same perceptron with inputs (1, 3, −2), weights (2, 1, 1.5), and bias 0.5; the weighted sum is 1·2 + 3·1 − 2·1.5 + 0.5 = 2.5, and the output is sign(2.5) = 1.)

$$y = \operatorname{sign}\left(\sum_{j=1}^{3} w_j x_j + b\right)$$
Perceptron

$$(x_1, x_2, x_3) = (1, 3, -2), \quad (w_1, w_2, w_3) = (2, 1, 1.5), \quad b = 0.5$$

$$y = \operatorname{sign}\left(\sum_{i=1}^{3} w_i x_i + b\right) = \operatorname{sign}\left(w_1 x_1 + w_2 x_2 + w_3 x_3 + b\right) = \operatorname{sign}(1 \cdot 2 + 3 \cdot 1 - 2 \cdot 1.5 + 0.5) = \operatorname{sign}(2.5) = 1$$
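The same computation in code: a minimal NumPy sketch of the perceptron forward pass above (the function and variable names are illustrative, not from the slides).

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron forward pass: y = sign(w . x + b)."""
    return np.sign(w @ x + b)

# Values from the worked example on the slide.
x = np.array([1.0, 3.0, -2.0])   # inputs  (x1, x2, x3)
w = np.array([2.0, 1.0, 1.5])    # weights (w1, w2, w3)
b = 0.5                          # bias

print(perceptron(x, w, b))       # sign(2.5) = 1.0
```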
Perceptron

(Figure: in the $(x_1, x_2)$ plane, the perceptron's decision boundary is the line $w_1 x_1 + w_2 x_2 + b = 0$.)
Problem with Perceptron

(Figure: the decision boundary $w_1 x_1 + w_2 x_2 + b = 0$ in the $(x_1, x_2)$ plane, with a point lying near it.)

What is the probability that this point belongs to the positive class? The perceptron can't answer this!
Problem with Perceptron

(Figure: two classes of points in the $(x_1, x_2)$ plane arranged so that no straight line separates them.)

Impossible to separate linearly!!
Logistic Regression

(Diagram: the same unit with the sign function replaced by a sigmoid: inputs x1, x2, x3, weights w1, w2, w3, bias b, a summation unit ∑, a sigmoid unit, and output y.)

$$y = \operatorname{sigmoid}\left(\sum_{j=1}^{3} w_j x_j + b\right)$$
Logistic Regression

$$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$$
Logistic Regression

(Diagram: the logistic unit with inputs (1, 3, −2), weights (2, 1, 1.5), and bias 0.5.)

$$y = \operatorname{sigmoid}\left(\sum_{j=1}^{3} w_j x_j + b\right)$$

$$1 \cdot 2 + 3 \cdot 1 - 2 \cdot 1.5 + 0.5 = 2.5, \qquad \operatorname{sigmoid}(2.5) \approx 0.924$$

A probability!!
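A minimal NumPy sketch of this forward pass (illustrative code, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_regression(x, w, b):
    """Forward pass: y = sigmoid(w . x + b), a probability in (0, 1)."""
    return sigmoid(w @ x + b)

x = np.array([1.0, 3.0, -2.0])
w = np.array([2.0, 1.0, 1.5])
b = 0.5

print(logistic_regression(x, w, b))  # sigmoid(2.5) ≈ 0.924
```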
Feature Transformation

(Figure: points in the original space $(x_1, x_2)$ are mapped by a non-linear transformation $\Phi$ into a new space with coordinates $(\varphi_1(x_1, x_2),\ \varphi_2(x_1, x_2))$.)

But we must still design the transformation…
Feed Forward Neural Network

(Diagram: a single neuron: a summation unit ∑ followed by an activation function f.)
Feed Forward Neural Network

(Diagram: inputs x1, x2, x3 feed three neurons with activation f, producing h1, h2, h3, which feed one neuron with activation g, producing y1.)
Feed Forward Neural Network

(Diagram: the same network with its layers labeled: input layer (x1, x2, x3), hidden layer (h1, h2, h3), and output layer (y1).)
Abstraction by Layer

(Diagram: the network as a pipeline of layers: x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y.)

FFNs can learn representations!!
Activation Functions

$$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)} = \frac{\exp(x)}{\exp(x) + 1}$$
Activation Functions

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
Activation Functions

$$\operatorname{rectifier}(x) = \max(0, x)$$
Activation Functions

$$\operatorname{softmax}(x_1, \ldots, x_m)_c = \frac{\exp(x_c)}{\sum_{k=1}^{m} \exp(x_k)}$$
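A minimal NumPy sketch of these four activation functions (illustrative; the max-subtraction in softmax is a standard numerical-stability trick, not on the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def rectifier(x):                 # also known as ReLU
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # entries sum to 1
```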
Loss Functions
• When you want a model to learn to do something, you give it feedback on how well it is doing.
• The function that computes an objective measure of the model's performance is called a loss function.
• A typical loss function takes the model's output and the ground truth and computes a value that quantifies the model's performance.
• The model then corrects itself to achieve a smaller loss.
L2 Norm

Task: Regression
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss:
$$L = \frac{1}{n} \sum_{i=1}^{n} \left\| t_i - y_i \right\|_2^2$$
Cross Entropy

Task: Binary Classification
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss:
$$L = \frac{1}{n} \sum_{i=1}^{n} \left[ -t_i \log y_i - (1 - t_i) \log(1 - y_i) \right]$$
Class Negative Log Likelihood

Task: Multi-Class Classification
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss:
$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} t_{i,k} \log y_{i,k}$$
Output Activation Functions and Loss Functions

Task                       | Output activation | Loss function
Regression                 | Linear            | L2 norm
Binary Classification      | Sigmoid           | Cross Entropy
Multi-Class Classification | Softmax           | Class NLL
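A minimal NumPy sketch of the three losses in this table (illustrative code; `y` and `t` are batches of outputs and targets as defined on the previous slides):

```python
import numpy as np

def l2_loss(y, t):
    """Regression: y and t are (n, d) arrays of outputs and targets."""
    return np.mean(np.sum((t - y) ** 2, axis=-1))

def cross_entropy(y, t):
    """Binary classification: y and t are length-n vectors, y in (0, 1)."""
    return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

def class_nll(y, t):
    """Multi-class: y and t are (n, m) arrays, t one-hot, y softmax outputs."""
    return -np.mean(np.sum(t * np.log(y), axis=1))
```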
Probabilistic Perspective
• We can assume NNs are computing conditional probabilities.

(Diagram: the feed-forward network with inputs x1, x2, x3 and hidden units h1, h2, h3, whose output is interpreted as $p(t_1 \mid x_1, x_2, x_3)$.)
Probabilistic Perspective
• When
$$p(t \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t - y)^2}{2\sigma^2} \right),$$

$$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t_i - y_i)^2}{2\sigma^2} \right) \right] = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (t_i - y_i)^2 + n \log \sqrt{2\pi}\,\sigma$$

→ the L2 norm, up to constants.
Probabilistic Perspective
• When
$$p(t \mid x) = y^t (1 - y)^{1 - t},$$

$$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\log \prod_{i=1}^{n} y_i^{t_i} (1 - y_i)^{1 - t_i} = \sum_{i=1}^{n} \left[ -t_i \log y_i - (1 - t_i) \log(1 - y_i) \right]$$

→ Cross Entropy.
Probabilistic Perspective
• When
$$p(t \mid x) = \prod_{k=1}^{m} y_k^{t_k},$$

$$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\log \prod_{i=1}^{n} \prod_{k=1}^{m} y_{i,k}^{t_{i,k}} = -\sum_{i=1}^{n} \sum_{k=1}^{m} t_{i,k} \log y_{i,k}$$

→ Class Negative Log Likelihood.
Gradient Descent
• Gradient
• Gradient Descent
Gradient Descent

Function to be minimized: $L(w)$
Initial point: $w_{init}$
Learning rate: $\alpha$
Update rule:
$$w_{new} \leftarrow w_{old} - \alpha \left. \frac{\partial L}{\partial w} \right|_{w = w_{old}}$$
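A minimal sketch of this update rule as a loop, assuming we can evaluate the gradient (the function names are illustrative):

```python
def gradient_descent(grad_L, w_init, alpha=0.1, n_steps=100):
    """Repeatedly apply w <- w - alpha * dL/dw."""
    w = w_init
    for _ in range(n_steps):
        w = w - alpha * grad_L(w)
    return w

# Example: minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w_init=0.0)
print(w_star)  # converges toward 3.0
```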
Gradient Descent
(Figure: gradient descent trajectories with a big learning rate vs. a small learning rate.)
Loss Function for Logistic Regression

$$L(w, b; D) = -\log \prod_{i=1}^{n} y_i^{t_i} (1 - y_i)^{1 - t_i} = -\sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$$

where
$$y_i = \frac{1}{1 + \exp(-w^T x_i - b)}$$

(The minus sign makes this a negative log-likelihood, so that gradient descent minimizes it.)
Gradient with Respect to w

$$\begin{aligned}
\frac{\partial L(w, b; D)}{\partial w}
&= -\frac{\partial}{\partial w} \sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right] \\
&= -\sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \frac{\partial}{\partial y_i} \left( t_i \log y_i + (1 - t_i) \log(1 - y_i) \right) \\
&= -\sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \left( \frac{t_i}{y_i} - \frac{1 - t_i}{1 - y_i} \right)
 = -\sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \frac{t_i - y_i}{y_i (1 - y_i)} \\
&= -\sum_{i=1}^{n} x_i y_i (1 - y_i) \frac{t_i - y_i}{y_i (1 - y_i)}
 = \sum_{i=1}^{n} x_i (y_i - t_i)
\end{aligned}$$

because
$$\frac{\partial y_i}{\partial w} = \frac{\partial}{\partial w} \left( \frac{1}{1 + \exp(-w^T x_i - b)} \right)
= \frac{-\frac{\partial}{\partial w}\left(1 + \exp(-w^T x_i - b)\right)}{\left(1 + \exp(-w^T x_i - b)\right)^2}
= \frac{x_i \exp(-w^T x_i - b)}{\left(1 + \exp(-w^T x_i - b)\right)^2}
= x_i\, y_i (1 - y_i).$$
Gradient with Respect to b

$$\begin{aligned}
\frac{\partial L(w, b; D)}{\partial b}
&= -\frac{\partial}{\partial b} \sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right] \\
&= -\sum_{i=1}^{n} \frac{\partial y_i}{\partial b} \left( \frac{t_i}{y_i} - \frac{1 - t_i}{1 - y_i} \right)
 = -\sum_{i=1}^{n} \frac{\partial y_i}{\partial b} \frac{t_i - y_i}{y_i (1 - y_i)} \\
&= -\sum_{i=1}^{n} y_i (1 - y_i) \frac{t_i - y_i}{y_i (1 - y_i)}
 = \sum_{i=1}^{n} (y_i - t_i)
\end{aligned}$$

because
$$\frac{\partial y_i}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{1 + \exp(-w^T x_i - b)} \right)
= \frac{\exp(-w^T x_i - b)}{\left(1 + \exp(-w^T x_i - b)\right)^2}
= y_i (1 - y_i).$$
Gradient Descent for Logistic Regression

Function to be minimized:
$$L(w, b; D) = -\sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$$

Update rules:
$$w_{new} \leftarrow w_{old} - \alpha \sum_{i=1}^{n} x_i (y_i - t_i), \qquad b_{new} \leftarrow b_{old} - \alpha \sum_{i=1}^{n} (y_i - t_i)$$
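A minimal NumPy sketch of these update rules on a toy dataset (illustrative code, using the gradients derived above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, t, alpha=0.1, n_steps=1000):
    """Batch gradient descent with dL/dw = sum_i x_i (y_i - t_i)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        y = sigmoid(X @ w + b)          # predictions for all n examples
        w -= alpha * X.T @ (y - t)      # sum_i x_i (y_i - t_i)
        b -= alpha * np.sum(y - t)
    return w, b

# Toy data: the class is 1 when the first feature is positive.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic_regression(X, t)
print(sigmoid(X @ w + b))  # close to the targets t
```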
Exercise: Gradient Descent for Linear Regression

$$L(w, b; D) = \sum_{i=1}^{n} (t_i - y_i)^2, \qquad y_i = w^T x_i + b$$
Answer

Function to be minimized:
$$L(w, b; D) = \sum_{i=1}^{n} (t_i - y_i)^2$$

Update rules:
$$w_{new} \leftarrow w_{old} - 2\alpha \sum_{i=1}^{n} x_i (y_i - t_i), \qquad b_{new} \leftarrow b_{old} - 2\alpha \sum_{i=1}^{n} (y_i - t_i)$$

(The factor of 2 from $\partial L / \partial w = 2 \sum_i x_i (y_i - t_i)$ is often absorbed into the learning rate.)
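One way to verify such hand-derived gradients is a finite-difference check; a minimal sketch (illustrative, not from the slides):

```python
import numpy as np

def loss(w, b, X, t):
    y = X @ w + b
    return np.sum((t - y) ** 2)

def analytic_grad_w(w, b, X, t):
    """dL/dw = 2 * sum_i x_i (y_i - t_i), as derived above."""
    y = X @ w + b
    return 2 * X.T @ (y - t)

def numeric_grad_w(w, b, X, t, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w); e[j] = eps
        g[j] = (loss(w + e, b, X, t) - loss(w - e, b, X, t)) / (2 * eps)
    return g

X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])
t = np.array([1.0, 2.0, 0.0])
w, b = np.array([0.3, -0.7]), 0.1
print(analytic_grad_w(w, b, X, t))
print(numeric_grad_w(w, b, X, t))   # should agree to ~6 decimal places
```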
Backpropagation

How do we compute $\partial L / \partial W$ and $\partial L / \partial V$?

(Diagram: the two-layer network. Inputs x1, x2, x3 feed hidden pre-activations u1, u2, u3 through weights W (e.g. W2,3); the activation f gives h1, h2, h3; weights V (e.g. V3,1) give output pre-activations l1, l2; the activation g gives outputs y1, y2, which enter the loss L.)
Backpropagation

Use the Chain Rule!!!

$$\frac{\partial}{\partial x} q(s(x)) = \frac{\partial s(x)}{\partial x} \frac{\partial q(s(x))}{\partial s(x)}$$

(Diagram: $x \rightarrow s \rightarrow s(x) \rightarrow q \rightarrow q(s(x))$; the derivatives flow backwards along the same chain.)
Backpropagation

Start from the output layer:
$$\frac{\partial L}{\partial y_1}$$

(Diagram: the same two-layer network, with $\partial L / \partial y_1$ attached at the output $y_1$.)
Backpropagation

Apply the chain rule:
$$\frac{\partial L}{\partial l_1} = \frac{\partial y_1}{\partial l_1} \frac{\partial L}{\partial y_1} = g'(l_1)\, \frac{\partial L}{\partial y_1}$$

(Diagram: the gradient has moved one step back, from $y_1$ to the pre-activation $l_1$.)
Backpropagation

Apply the chain rule:
$$\frac{\partial L}{\partial V_{3,1}} = \frac{\partial l_1}{\partial V_{3,1}} \frac{\partial L}{\partial l_1} = h_3\, \frac{\partial L}{\partial l_1}$$

(Diagram: the gradient now reaches the weight $V_{3,1}$.)
Backpropagation

Apply the chain rule, summing over both paths (through $l_1$ and $l_2$):
$$\frac{\partial L}{\partial h_3} = \frac{\partial l_1}{\partial h_3} \frac{\partial L}{\partial l_1} + \frac{\partial l_2}{\partial h_3} \frac{\partial L}{\partial l_2} = V_{3,1} \frac{\partial L}{\partial l_1} + V_{3,2} \frac{\partial L}{\partial l_2}$$

(Diagram: the gradient reaches the hidden unit $h_3$.)
Backpropagation

Apply the chain rule:
$$\frac{\partial L}{\partial u_3} = \frac{\partial h_3}{\partial u_3} \frac{\partial L}{\partial h_3} = f'(u_3)\, \frac{\partial L}{\partial h_3}$$

(Diagram: the gradient reaches the hidden pre-activation $u_3$.)
Backpropagation

Apply the chain rule (since $u_3 = \sum_j W_{j,3}\, x_j$, we have $\partial u_3 / \partial W_{2,3} = x_2$):
$$\frac{\partial L}{\partial W_{2,3}} = \frac{\partial u_3}{\partial W_{2,3}} \frac{\partial L}{\partial u_3} = x_2\, \frac{\partial L}{\partial u_3}$$

(Diagram: the gradient finally reaches the weight $W_{2,3}$.)
Abstraction by Layer

(Diagram: the layer pipeline x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L(y, t), annotated with the backward quantities ∂L/∂x, ∂L/∂(Wx), ∂L/∂h, ∂L/∂(Vh), ∂L/∂y and the parameter gradients ∂L/∂W, ∂L/∂V.)
Abstraction by Layer

(Diagram: a generic Layer maps input to output in the forward direction, and maps ∂loss/∂output back to ∂loss/∂input in the backward direction.)
Abstraction by Layer

Forward computation:
output = Layer.forward(input)
Abstraction by Layer

Backward computation:
∂loss/∂input = Layer.backward(input, ∂loss/∂output)
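A minimal Python sketch of this layer abstraction, using a linear layer as the example; the forward/backward methods follow the slide's convention, while the initialization and stored gradients are illustrative choices:

```python
import numpy as np

class Linear:
    """A layer computing output = W @ input + b."""
    def __init__(self, n_in, n_out):
        self.W = 0.1 * np.random.randn(n_out, n_in)
        self.b = np.zeros(n_out)

    def forward(self, input):
        return self.W @ input + self.b

    def backward(self, input, dloss_doutput):
        # Parameter gradients, stored for the update step.
        self.dW = np.outer(dloss_doutput, input)
        self.db = dloss_doutput
        # Gradient passed on to the layer below.
        return self.W.T @ dloss_doutput
```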
Backpropagation

① Execute the forward computation.

(Diagram: the pipeline x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L, evaluated left to right.)
Backpropagation

② Compute the derivative of the loss function with respect to the output: ∂L/∂y.

(Diagram: the same pipeline with ∂L/∂y attached at the output.)
Backpropagation

③ Starting from the final layer, backpropagate the derivatives through the layers: ∂L/∂y, then ∂L/∂(Vh), and so on.

(Diagram: the same pipeline with the derivatives flowing right to left.)
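Putting the three steps together for the two-layer network; a minimal NumPy sketch assuming a tanh hidden activation, an identity output activation, and an L2 loss (all illustrative choices):

```python
import numpy as np

def forward_backward(x, t, W, bW, V, bV):
    # (1) Forward computation.
    u = W @ x + bW                        # hidden pre-activation
    h = np.tanh(u)                        # hidden activation f
    y = V @ h + bV                        # output layer (g = identity)
    L = 0.5 * np.sum((t - y) ** 2)        # L2 loss

    # (2) Derivative of the loss with respect to the output.
    dL_dy = y - t

    # (3) Backpropagate through the layers.
    dL_dV, dL_dbV = np.outer(dL_dy, h), dL_dy
    dL_dh = V.T @ dL_dy
    dL_du = (1 - h ** 2) * dL_dh          # tanh'(u) = 1 - tanh(u)^2
    dL_dW, dL_dbW = np.outer(dL_du, x), dL_du
    return L, (dL_dW, dL_dbW, dL_dV, dL_dbV)
```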
Classifying Digits

32×32 = 1024 pixels; classes: 10 digits (0-9); training: 60000 examples; testing: 60000 examples.
Classifying Digits

Input: $x \in \mathbb{R}^{1024}$
Target: a one-hot class vector, e.g. $t = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)^T$
Classifying Digits

(Diagram: the network x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → y, trained with the Class NLL loss against the target t.)
Classifying Digits

$$u = Wx + b$$
Classifying Digits

$$h = \tanh(u)$$
Classifying Digits

$$l = Vh + c$$
Classifying Digits

$$y = \operatorname{softmax}(l)$$
Classifying Digits

$$L = -\sum_{k=1}^{10} t_k \log y_k = -t^T \log y$$
Classifying Digits

$$\frac{\partial L}{\partial y} = \frac{\partial}{\partial y}\left(-t^T \log y\right) = -\left(\frac{t_1}{y_1}, \ldots, \frac{t_{10}}{y_{10}}\right)^T$$
Classifying Digits

Because $\partial y_k / \partial l_j = y_k (\delta_{kj} - y_j)$ for the softmax,

$$\frac{\partial L}{\partial l_j} = \sum_{k=1}^{10} \frac{\partial y_k}{\partial l_j} \frac{\partial L}{\partial y_k} = -\sum_{k=1}^{10} y_k (\delta_{kj} - y_j)\, \frac{t_k}{y_k} = y_j \sum_{k} t_k - t_j = y_j - t_j,$$

so
$$\frac{\partial L}{\partial l} = y - t.$$
Classifying Digits

$$\frac{\partial L}{\partial h} = V^T \frac{\partial L}{\partial l}, \qquad \frac{\partial L}{\partial V} = \frac{\partial L}{\partial l}\, h^T, \qquad \frac{\partial L}{\partial c} = \frac{\partial L}{\partial l}$$
Classifying Digits

$$\frac{\partial L}{\partial u} = (1 + h) \odot (1 - h) \odot \frac{\partial L}{\partial h} = \left(1 - h^2\right) \odot \frac{\partial L}{\partial h}$$

(since $\tanh'(u) = 1 - \tanh^2(u) = (1 + h)(1 - h)$; $\odot$ is elementwise multiplication)
Classifying Digits

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial u}\, x^T, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial u}, \qquad \frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial u}$$
Classifying Digits

Parameter updates:
$$W_{new} \leftarrow W - \alpha \frac{\partial L}{\partial W}, \quad b_{new} \leftarrow b - \alpha \frac{\partial L}{\partial b}, \quad V_{new} \leftarrow V - \alpha \frac{\partial L}{\partial V}, \quad c_{new} \leftarrow c - \alpha \frac{\partial L}{\partial c}$$
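One full training step for this network as a minimal NumPy sketch, combining the forward pass, the gradients derived above, and the parameter updates (the hidden size of 100 and all names are illustrative assumptions):

```python
import numpy as np

def softmax(l):
    e = np.exp(l - np.max(l))
    return e / e.sum()

def train_step(x, t, W, b, V, c, alpha=0.01):
    # Forward pass.
    u = W @ x + b
    h = np.tanh(u)
    l = V @ h + c
    y = softmax(l)
    L = -t @ np.log(y)                 # Class NLL

    # Backward pass (gradients derived above).
    dl = y - t                         # dL/dl for softmax + NLL
    dV, dc = np.outer(dl, h), dl
    dh = V.T @ dl
    du = (1 - h ** 2) * dh             # through tanh
    dW, db = np.outer(du, x), du

    # Gradient descent updates.
    W -= alpha * dW; b -= alpha * db
    V -= alpha * dV; c -= alpha * dc
    return L

# Shapes for the digits task: 1024 inputs, e.g. 100 hidden units, 10 classes.
rng = np.random.default_rng(0)
W, b = 0.01 * rng.standard_normal((100, 1024)), np.zeros(100)
V, c = 0.01 * rng.standard_normal((10, 100)), np.zeros(10)
x, t = rng.random(1024), np.eye(10)[4]   # a dummy image and one-hot target
print(train_step(x, t, W, b, V, c))      # the loss for this example
```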
Torch7