Computation Graphs
Brian Thompson
slides by Philipp Koehn
2 October 2018
1. Neural Network Cartoon
• A common way to illustrate a neural network
2. Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)
• Final layer: y = sigmoid(W2 h + b2)
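In code, the two layers are just two matrix-vector products, each followed by an element-wise sigmoid; a minimal NumPy sketch (the variable names mirror the math):

import numpy as np

def sigmoid(v):
    # element-wise logistic function 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)   # hidden layer
    y = sigmoid(W2 @ h + b2)   # final layer
    return h, y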
3. Computation Graph

[Diagram: the computation graph for this network, read bottom-up. x and W1 feed a prod node; its output and b1 feed a sum node; a sigmoid node follows. The sigmoid output and W2 feed a second prod node; its output and b2 feed a second sum node; a final sigmoid node produces y.]
4. Simple Neural Network

[Figure: a feed-forward network with two input nodes, two hidden nodes, one output node, and a bias unit (constant 1) feeding each layer. The edge weights form the running example: hidden layer weights 3.7, 3.7, 2.9, 2.9 with biases −1.5, −4.5; output layer weights 4.5, −5.2 with bias −2.0.]
5. Computation Graph

[Diagram: the computation graph from slide 3, annotated with the parameter values:
W1 = [3.7 3.7; 2.9 2.9], b1 = [−1.5; −4.5], W2 = [4.5 −5.2], b2 = [−2.0]]
6–10. Processing Input

[Diagram sequence: the annotated graph is evaluated bottom-up on the input x = [1.0; 0.0], one node per slide:]
• prod: W1 x = [3.7; 2.9]
• sum: W1 x + b1 = [2.2; −1.6]
• sigmoid: h = [.900; .168]
• prod: W2 h = [3.18]
• sum: W2 h + b2 = [1.18]
• sigmoid: y = [.765]
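The same forward pass as a short NumPy sketch, reproducing the numbers above (parameter values from slide 5):

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); b1 = np.array([-1.5, -4.5])
W2 = np.array([[4.5, -5.2]]);            b2 = np.array([-2.0])
x = np.array([1.0, 0.0])

p1 = W1 @ x        # prod:    [3.7, 2.9]
s1 = p1 + b1       # sum:     [2.2, -1.6]
h  = sigmoid(s1)   # sigmoid: [0.900, 0.168]
p2 = W2 @ h        # prod:    [3.18]
s2 = p2 + b2       # sum:     [1.18]
y  = sigmoid(s2)   # sigmoid: [0.765]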
11. Error Function

• For training, we need a measure of how well we are doing
⇒ cost function (also known as objective function, loss, gain, cost, ...)
• For instance, the L2 norm:
  error = ½ (t − y)²
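For the running example (target t = 1.0, network output y = .765), this is easy to evaluate; the value reappears as the output of the L2 node in the backward-pass slides below:

error = 0.5 * (1.0 - 0.765) ** 2   # = .0276 (the slides show .0277, computed before rounding y)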
12. Gradient Descent

• We view the error as a function of the trainable parameters: error(λ)
• We want to optimize error(λ) by moving λ towards its optimum

[Figure: three plots of error(λ) against λ, each marking the current λ and the optimal λ. The slope at the current λ determines the direction and size of the step: gradient = 1, gradient = 2, and gradient = 0.2, respectively.]

• Why not just set it to its optimum?
  – we are updating based on one training example, and do not want to overfit to it
  – we are also changing all the other parameters, so the curve will look different
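A sketch of the update rule this picture implies — plain gradient descent with learning rate µ on a toy one-dimensional error curve (the curve is a stand-in example, not from the slides):

def error(lam):
    return 0.5 * (lam - 3.0) ** 2   # toy error(lambda) with optimum at 3.0

def gradient(lam):
    return lam - 3.0                # its derivative

lam, mu = 0.0, 0.1                  # starting point and learning rate
for _ in range(100):
    lam -= mu * gradient(lam)       # step against the gradient
# lam is now close to the optimal value 3.0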
13. Calculus Refresher: Chain Rule

• Formula for computing the derivative of the composition of two or more functions
  – functions f and g
  – the composition f ∘ g maps x to f(g(x))
• Chain rule:
  (f ∘ g)′ = (f′ ∘ g) · g′
  or
  F′(x) = f′(g(x)) g′(x)
• Leibniz's notation:
  dz/dx = dz/dy · dy/dx
  if z = f(y) and y = g(x), then
  dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)
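A quick numeric sanity check of the chain rule, using f(y) = sin(y) and g(x) = x², so dz/dx = cos(x²) · 2x (the functions are illustrative, not from the slides):

import math

def g(x): return x * x
def f(y): return math.sin(y)

x = 1.3
analytic = math.cos(g(x)) * 2 * x                      # f'(g(x)) * g'(x)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
# analytic and numeric agree to about 1e-9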
14. Final Layer Update

• Linear combination of weights: s = Σ_k w_k h_k
• Activation function: y = sigmoid(s)
• Error (L2 norm): E = ½ (t − y)²
• Derivative of the error with regard to one weight w_k:
  dE/dw_k = dE/dy · dy/ds · ds/dw_k
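Plugging in the running example (t = 1.0, y = .765, h = [.900, .168]), a sketch of the three factors, following the slides' practice of tracking the error term t − y:

import numpy as np

t = 1.0                         # target output
y = 0.765                       # network output sigmoid(s), with s = 1.18
h = np.array([0.900, 0.168])    # hidden-layer values h_k

err = t - y                     # error term of E = 0.5*(t-y)**2: 0.235
dy_ds = y * (1 - y)             # sigmoid'(s), expressed via y: 0.180
per_weight = err * dy_ds * h    # one value per weight w_k: [0.0382, 0.00712]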
15. Error Computation in Computation Graph

[Diagram: the computation graph from slide 3, extended with an L2 node at the top that takes the network output (the final sigmoid) and the target value t as inputs.]
16. Error Propagation in Computation Graph

[Diagram: a chain of nodes A → B → E.]

• Compute the derivative at node A: dE/dA = dE/dB · dB/dA
• Assume that we already computed dE/dB (backward pass through the graph)
• So now we only have to get the formula for dB/dA
• For instance, B is a square node:
  – forward computation: B = A²
  – backward computation: dB/dA = dA²/dA = 2A
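This per-node view maps directly onto code; a minimal sketch of a graph node that supports both passes (a hypothetical class, not from any toolkit):

class SquareNode:
    # computation-graph node for B = A^2
    def forward(self, A):
        self.A = A              # remember the input for the backward pass
        return A * A

    def backward(self, dE_dB):
        # chain rule: dE/dA = dE/dB * dB/dA, with dB/dA = 2A
        return dE_dB * 2 * self.A

node = SquareNode()
B = node.forward(3.0)           # forward:  B = 9.0
dE_dA = node.backward(1.0)      # backward: dE/dA = 6.0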
17–20. Derivatives for Each Node

[Diagram sequence: the graph with the L2 error node, with the local derivative attached to each node type, one per slide:]
• L2 node: do/di = d/di ½(t − i)² = t − i
• sigmoid node: do/di = d/di σ(i) = σ(i)(1 − σ(i))
• sum node: do/di1 = d/di1 (i1 + i2) = 1, do/di2 = 1
• prod node: do/di1 = d/di1 (i1 i2) = i2, do/di2 = i1
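These local derivatives are all the toolkit needs to know about each node type; a sketch of them as plain Python functions (the dictionary layout is illustrative):

import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

# local derivative of each node's output with respect to its input(s)
local_derivative = {
    "l2":      lambda t, i: t - i,                       # d/di 0.5*(t-i)**2, as given on the slides
    "sigmoid": lambda i: sigmoid(i) * (1 - sigmoid(i)),  # d/di sigma(i)
    "sum":     lambda i1, i2: (1, 1),                    # d/di1 and d/di2 of i1 + i2
    "prod":    lambda i1, i2: (i2, i1),                  # d/di1 and d/di2 of i1 * i2
}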
21–24. Backward Pass: Derivative Computation

[Diagram sequence: the graph is now traversed top-down. The forward values are as before, with the L2 node adding E = ½(1.0 − .765)² = [.0277]. Each node multiplies the derivative arriving from above with its own local derivative, one step per slide:]
• L2 node (inputs t = 1.0 and y = .765): t − y = [.235]
• output sigmoid (σ′(1.18) = .180): [.180] × [.235] = [.0424]
• output sum (local derivatives 1, 1): [.0424] towards b2, [.0424] onward
• output prod (local derivatives i2, i1): [.0382 .00712] towards W2, [.191 −.220] towards the hidden layer
• hidden sigmoid (σ′ = [.090 .140]): [.0171 −.0308]
• hidden sum: [.0171 −.0308] towards b1, [.0171 −.0308] onward
• hidden prod: [.0171 0; −.0308 0] towards W1, [−.0260 −.0260] towards x
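The whole backward pass fits in a few lines of NumPy; a sketch that reproduces the values above, again tracking t − y as on the slides:

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); b1 = np.array([-1.5, -4.5])
W2 = np.array([[4.5, -5.2]]);            b2 = np.array([-2.0])
x, t = np.array([1.0, 0.0]), np.array([1.0])

# forward pass (as before)
s1 = W1 @ x + b1; h = sigmoid(s1)   # h = [0.900, 0.168]
s2 = W2 @ h + b2; y = sigmoid(s2)   # y = [0.765]

# backward pass: one local derivative per node
d_y  = t - y                 # L2 node:         [0.235]
d_s2 = d_y * y * (1 - y)     # output sigmoid:  [0.0424]
d_b2 = d_s2                  # output sum:      [0.0424]
d_W2 = np.outer(d_s2, h)     # output prod:     [[0.0382, 0.00712]]
d_h  = W2.T @ d_s2           # ... towards h:   [0.191, -0.220]
d_s1 = d_h * h * (1 - h)     # hidden sigmoid:  [0.0171, -0.0308]
d_b1 = d_s1                  # hidden sum
d_W1 = np.outer(d_s1, x)     # hidden prod:     [[0.0171, 0], [-0.0308, 0]]
d_x  = W1.T @ d_s1           # ... towards x:   [-0.0260, -0.0260]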
25. Gradients for Parameter Update

[Diagram: the same graph, keeping only the derivatives that arrive at the trainable parameters:]
• W1: [.0171 0; −.0308 0]
• b1: [.0171; −.0308]
• W2: [.0382 .00712]
• b2: [.0424]
26. Parameter Update

[Diagram: each parameter is updated by subtracting its gradient, scaled with a learning rate µ:]
• W1 ← [3.7 3.7; 2.9 2.9] − µ [.0171 0; −.0308 0]
• b1 ← [−1.5; −4.5] − µ [.0171; −.0308]
• W2 ← [4.5 −5.2] − µ [.0382 .00712]
• b2 ← [−2.0] − µ [.0424]
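As code, the update step looks like the following sketch, with the gradients from the previous slide (the learning rate value is assumed for illustration):

import numpy as np

mu = 0.1   # learning rate

W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); d_W1 = np.array([[0.0171, 0.0], [-0.0308, 0.0]])
b1 = np.array([-1.5, -4.5]);             d_b1 = np.array([0.0171, -0.0308])
W2 = np.array([[4.5, -5.2]]);            d_W2 = np.array([[0.0382, 0.00712]])
b2 = np.array([-2.0]);                   d_b2 = np.array([0.0424])

# gradient descent update: parameter <- parameter - mu * gradient
for param, grad in ((W1, d_W1), (b1, d_b1), (W2, d_W2), (b2, d_b2)):
    param -= mu * grad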
27. Toolkits
28. Explosion of Deep Learning Toolkits

• University of Montreal: Theano
• Google: TensorFlow
• Microsoft: CNTK
• Facebook: Torch, PyTorch
• Amazon: MXNet
• CMU: DyNet
• AMU/Edinburgh: Marian
• ... and many more
29. Toolkits

• Machine learning architectures built around computation graphs are very powerful:
  – define a computation graph
  – provide data and a training strategy (e.g., batching)
  – the toolkit does the rest
30. Example: Theano

• Deep learning toolkit for Python
• Imported as a library
> import numpy
> import theano
> import theano.tensor as T
31. Example: Theano

• Definition of parameters
> x = T.dmatrix()
> W = theano.shared(value=numpy.array([[3.0,2.0],[4.0,3.0]]))
> b = theano.shared(value=numpy.array([-2.0,-4.0]))
• Definition of the feed-forward layer
> h = T.nnet.sigmoid(T.dot(x,W)+b)
note: x is a matrix → processes several training examples (a sequence of vectors)
• Define it as a callable function
> h_function = theano.function([x], h)
• Apply it to data
> h_function([[1,0]])
array([[ 0.73105858, 0.11920292]])
32. Example: Theano

• Same setup for the hidden → output layer
> W2 = theano.shared(value=numpy.array([5.0,-5.0]))
> b2 = theano.shared(-2.0)
> y_pred = T.nnet.sigmoid(T.dot(h,W2)+b2)
• Define it as a callable function
> predict = theano.function([x], y_pred)
• Apply it to data
> predict([[1,0]])
array([[ 0.7425526]])
33. Model Training

• First, define the variable for the correct output
> y = T.dvector()
• Definition of a cost function (we use the L2 norm)
> l2 = (y-y_pred)**2
> cost = l2.mean()
• Gradient descent training: computation of the derivative
> gW, gb, gW2, gb2 = T.grad(cost, [W,b,W2,b2])
• Update rule (with learning rate 0.1)
> train = theano.function(inputs=[x,y], outputs=[y_pred,cost],
      updates=((W, W-0.1*gW), (b, b-0.1*gb),
               (W2, W2-0.1*gW2), (b2, b2-0.1*gb2)))
34. Model Training

• Training data
> DATA_X = numpy.array([[0,0],[0,1],[1,0],[1,1]])
> DATA_Y = numpy.array([0,1,1,0])
• Predict the output for the training data
> predict(DATA_X)
array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961])
35. Model Training

• Train with the training data
> train(DATA_X,DATA_Y)
[array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961]),
 array(0.06948320612438118)]
• Predictions and cost after one training step
> train(DATA_X,DATA_Y)
[array([ 0.18353091, 0.74260499, 0.74321824, 0.33324929]),
 array(0.06923193686092949)]
36. Example: DyNet
37. DyNet

• Our example: static computation graph, fixed set of data
• But: language requires different computations for different data items (sentences have different lengths)
⇒ Dynamically create a computation graph for each data item
38. Example: DyNet

model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))
trainer = dy.SimpleSGDTrainer(model)
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(
            W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_sym.forward()
        loss_sym.backward()
        trainer.update()
39–43. Model Parameters · Training Setup · Setting up the Computation Graph · Operations · Training Loop

(Slides 39–43 repeat the code above, each highlighting one part:)
• Model Parameters — dy.Model() with add_parameters / add_lookup_parameters: the model holds the values for the weight matrices and weight vectors
• Training Setup — trainer = dy.SimpleSGDTrainer(model): defines the model update function (could also be AdaGrad, Adam, ...)
• Setting up the Computation Graph — dy.renew_cg(), dy.parameter(W_p), dy.parameter(b_p): creates a new computation graph and informs it about the parameters
• Operations — the expressions for score_sym and loss_sym: build the computation graph by defining operations
• Training Loop — loss_sym.forward(), loss_sym.backward(), trainer.update(): processes the training data; the computations are done in the forward and backward passes