
Computation Graphs

Brian Thompson (slides by Philipp Koehn)

2 October 2018


Antonis Anastasopoulos, MTMA, May 27th, 2019

1. Neural Network Cartoon

• A common way to illustrate a neural network


2. Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)

3. Computation Graph

(Graph, read bottom-up: x and W1 feed a prod node; that result and b1 feed a sum node, followed by a sigmoid node; the sigmoid output and W2 feed a second prod node; that result and b2 feed a second sum node, followed by the final sigmoid node.)
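
Read this way, the graph computes h = sigmoid(W1 x + b1) and y = sigmoid(W2 h + b2) as a chain of elementary nodes. A minimal sketch of that decomposition (numpy; the function names mirror the node labels, everything else is illustrative):

import numpy as np

def prod(A, B): return A @ B                      # product node
def add(A, B):  return A + B                      # sum node (named add to avoid shadowing Python's sum)
def sigmoid(S): return 1.0 / (1.0 + np.exp(-S))   # sigmoid node

def graph_forward(x, W1, b1, W2, b2):
    h = sigmoid(add(prod(W1, x), b1))     # lower half of the graph
    return sigmoid(add(prod(W2, h), b2))  # upper half of the graph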

4. Simple Neural Network

(Network diagram with the weights used in the following slides: hidden-layer weights 3.7, 3.7 and 2.9, 2.9 with biases −1.5 and −4.5; output-layer weights 4.5, −5.2 with bias −2.0.)

5. Computation Graph

(The same graph, now annotated with the parameter values: W1 = [[3.7, 3.7], [2.9, 2.9]], b1 = [−1.5, −4.5], W2 = [4.5, −5.2], b2 = [−2.0].)

6. Processing Input

(Input value attached to the graph: x = [1.0, 0.0].)

7. Processing Input

(First prod node computed: W1 x = [3.7, 2.9].)

8. Processing Input

(First sum node computed: W1 x + b1 = [2.2, −1.6].)

9. Processing Input

(First sigmoid node computed: h = sigmoid([2.2, −1.6]) = [.900, .168].)

10. Processing Input

(Second layer computed: W2 h = [3.18], adding b2 gives [1.18], and the final sigmoid gives y = [.765].)
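
Plugging the values from the graph above into a few lines of numpy reproduces these numbers (a sketch; variable names are mine):

import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

x  = np.array([1.0, 0.0])
W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); b1 = np.array([-1.5, -4.5])
W2 = np.array([[4.5, -5.2]]);            b2 = np.array([-2.0])

h = sigmoid(W1 @ x + b1)   # [2.2, -1.6] -> [0.900, 0.168]
y = sigmoid(W2 @ h + b2)   # [3.18] + b2 -> [1.18] -> [0.765]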

11. Error Function

• For training, we need a measure of how well we do

⇒ Cost function

also known as objective function, loss, gain, cost, ...

• For instance the L2 norm: error = 1/2 (t − y)²
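
For instance, with the example network's output y ≈ .765 and target t = 1.0, the error is 1/2 (1.0 − .765)² ≈ .0277, the value that appears at the L2 node in the backward-pass slides below.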

12. Gradient Descent

• We view the error as a function of the trainable parameters

error(λ)

• We want to optimize error(λ) by moving it towards its optimum

(Three sketches of error(λ) with the current λ and the optimal λ marked, showing gradients of 1, 2, and 0.2 at the current λ.)

• Why not just set it to its optimum?

– we are updating based on one training example, do not want to overfit to it
– we are also changing all the other parameters, the curve will look different
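
The resulting update therefore moves λ a step against the gradient instead of jumping to the optimum. A minimal sketch (mu is the learning rate; name and value are illustrative):

def gradient_descent_step(lam, error_gradient, mu=0.1):
    # move against the gradient; a steeper gradient gives a larger step
    return lam - mu * error_gradient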

13. Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions

– functions f and g
– composition f ◦ g maps x to f(g(x))

• Chain rule

(f ◦ g)′ = (f′ ◦ g) · g′

or

F′(x) = f′(g(x)) g′(x)

• Leibniz's notation: dz/dx = dz/dy · dy/dx

if z = f(y) and y = g(x), then dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)
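
A quick numeric sanity check of the chain rule, using sigmoid(x²) as an illustrative composition (not from the slides):

import math

def sigmoid(s): return 1.0 / (1.0 + math.exp(-s))

x = 1.5
g = x ** 2                                                  # inner function g(x)
analytic = sigmoid(g) * (1 - sigmoid(g)) * 2 * x            # f'(g(x)) * g'(x)
numeric  = (sigmoid((x + 1e-6) ** 2) - sigmoid(g)) / 1e-6   # finite difference
# analytic and numeric agree to about 6 decimal places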

14. Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t − y)²

• Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk
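
Each factor in this product is a simple local derivative: dE/dy comes from the error function, dy/ds = sigmoid(s)(1 − sigmoid(s)) from the activation, and ds/dwk = hk from the linear combination; the node-by-node versions of these derivatives are worked out on the following slides.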

15. Error Computation in Computation Graph

(The graph from before, extended at the top with an L2 node that compares the final sigmoid output with the target t.)

16. Error Propagation in Computation Graph

(Chain of nodes: A feeds into B, which feeds into the error E.)

• Compute derivative at node A: dE/dA = dE/dB · dB/dA

• Assume that we already computed dE/dB (backward pass through graph)

• So now we only have to get the formula for dB/dA

• For instance, B is a square node

– forward computation: B = A²
– backward computation: dB/dA = dA²/dA = 2A
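
A minimal sketch of that square node (names are illustrative; dE_dB is assumed to have been computed already further up the graph):

def square_forward(A):
    return A * A            # forward: B = A^2

def square_backward(A, dE_dB):
    dB_dA = 2 * A           # local derivative of the square node
    return dE_dB * dB_dA    # chain rule: dE/dA = dE/dB * dB/dA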

17. Derivatives for Each Node

(The same graph as before; the local derivative of each node type is derived in turn.)

dL2/dsigmoid: do/di = d/di 1/2 (t − i)² = t − i

18. Derivatives for Each Node

dsigmoid/dsum: do/di = d/di σ(i) = σ(i)(1 − σ(i))

19. Derivatives for Each Node

dsum/dprod: do/di1 = d/di1 (i1 + i2) = 1, do/di2 = 1

20. Derivatives for Each Node

For the prod node: do/di1 = d/di1 (i1 · i2) = i2, do/di2 = i1
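
Collected in code, these local derivatives are all a toolkit needs to know about each node type. A sketch (the dictionary layout and names are illustrative; the L2 entry keeps the slides' sign convention):

import math

def sigmoid(i): return 1.0 / (1.0 + math.exp(-i))

# local derivatives of each node's output o with respect to its input(s) i
local_derivative = {
    "l2":      lambda i, t:   t - i,                         # d/di 1/2 (t - i)^2
    "sigmoid": lambda i:      sigmoid(i) * (1 - sigmoid(i)),
    "sum":     lambda i1, i2: (1, 1),                        # d/di1, d/di2 of i1 + i2
    "prod":    lambda i1, i2: (i2, i1),                      # d/di1, d/di2 of i1 * i2
}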

21. Backward Pass: Derivative Computation

(The graph with all forward values filled in, the target t = [1.0], and each node labeled with its local derivative: i2 − i1 for the L2 node, σ′(i) for the sigmoid nodes, (1, 1) for the sum nodes, (i2, i1) for the prod nodes. The L2 node's value is [.0277]; the derivative passed back to the output sigmoid is [.235].)

22. Backward Pass: Derivative Computation

(At the output sigmoid node, σ′(i) = [.180]; multiplied by the incoming derivative [.235] this gives [.0424].)

23. Backward Pass: Derivative Computation

(The sum node passes the derivative [.0424] on unchanged to both of its inputs, b2 and the prod node.)

24. Backward Pass: Derivative Computation

(Continuing down the graph: the prod node sends [.0382, .00712] to W2 and [.191, −.220] to the hidden sigmoid node; after multiplying by σ′(i) this becomes [.0171, −.0308], which the sum node passes to b1 and the lower prod node; that prod node finally sends [[.0171, 0], [−.0308, 0]] to W1 and [−.0260, −.0260] to x.)
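
The entire backward pass fits in a few lines of numpy. A sketch reproducing the numbers above (variable names are mine; the t − y sign convention follows the slides):

import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

# forward pass with the example values
x,  t  = np.array([1.0, 0.0]), np.array([1.0])
W1, b1 = np.array([[3.7, 3.7], [2.9, 2.9]]), np.array([-1.5, -4.5])
W2, b2 = np.array([[4.5, -5.2]]), np.array([-2.0])
s1 = W1 @ x + b1; h = sigmoid(s1)    # [2.2, -1.6] -> [.900, .168]
s2 = W2 @ h + b2; y = sigmoid(s2)    # [1.18]      -> [.765]

# backward pass
d_y  = t - y                  # ~ [.235]
d_s2 = d_y * y * (1 - y)      # ~ [.0424]   (gradient for b2)
g_W2 = np.outer(d_s2, h)      # ~ [[.0382, .00712]]
d_h  = W2.T @ d_s2            # ~ [.191, -.220]
d_s1 = d_h * h * (1 - h)      # ~ [.0171, -.0308]   (gradient for b1)
g_W1 = np.outer(d_s1, x)      # ~ [[.0171, 0], [-.0308, 0]]
d_x  = W1.T @ d_s1            # ~ [-.0260, -.0260]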

25. Gradients for Parameter Update

(The gradients computed in the backward pass, attached to the parameters they update: for W1 = [[3.7, 3.7], [2.9, 2.9]] the gradient [[.0171, 0], [−.0308, 0]]; for b1 = [−1.5, −4.5] the gradient [.0171, −.0308]; for W2 = [4.5, −5.2] the gradient [.0382, .00712]; for b2 = [−2.0] the gradient [.0424].)

26. Parameter Update

W1 ← [[3.7, 3.7], [2.9, 2.9]] − µ [[.0171, 0], [−.0308, 0]]

b1 ← [−1.5, −4.5] − µ [.0171, −.0308]

W2 ← [4.5, −5.2] − µ [.0382, .00712]

b2 ← [−2.0] − µ [.0424]
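
As code, the update is the same one-liner for every parameter. A minimal sketch (mu is the learning rate µ; its value here is illustrative):

def sgd_update(param, gradient, mu=0.1):
    # move the parameter a small step against its gradient
    return param - mu * gradient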

27. toolkits

28. Explosion of Deep Learning Toolkits

• University of Montreal: Theano

• Google: Tensorflow

• Microsoft: CNTK

• Facebook: Torch, pyTorch

• Amazon: MX-Net

• CMU: Dynet

• AMU/Edinburgh: Marian

• ... and many more


29. Toolkits

• Machine learning architectures built around computation graphs are very powerful

– define a computation graph
– provide data and a training strategy (e.g., batching)
– the toolkit does the rest

30. Example: Theano

• Deep learning toolkit for Python

• Included as a library

> import numpy

> import theano

> import theano.tensor as T


31. Example: Theano

• Definition of parameters

> x = T.dmatrix()

> W = theano.shared(value=numpy.array([[3.0,2.0],[4.0,3.0]]))

> b = theano.shared(value=numpy.array([-2.0,-4.0]))

• Definition of feed-forward layer

> h = T.nnet.sigmoid(T.dot(x,W)+b)

note: x is a matrix → processes several training examples (a sequence of vectors).

• Define as callable function

> h_function = theano.function([x], h)

• Apply to data

> h_function([[1,0]])

array([[ 0.73105858, 0.11920292]])

32. Example: Theano

• Same setup for hidden → output layer

> W2 = theano.shared(value=numpy.array([5.0,-5.0]))

> b2 = theano.shared(-2.0)

> y_pred = T.nnet.sigmoid(T.dot(h,W2)+b2)

• Define as callable function

> predict = theano.function([x], y_pred)

• Apply to data

> predict([[1,0]])

array([[ 0.7425526]])

33. Model Training

• First, define the variable for the correct output

> y = T.dvector()

• Definition of a cost function (we use the L2 norm).

> l2 = (y-y_pred)**2

> cost = l2.mean()

• Gradient descent training: computation of the derivative

> gW, gb, gW2, gb2 = T.grad(cost, [W,b,W2,b2])

• Update rule (with learning rate 0.1)

> train = theano.function(inputs=[x,y], outputs=[y_pred,cost],
      updates=((W, W-0.1*gW), (b, b-0.1*gb),
               (W2, W2-0.1*gW2), (b2, b2-0.1*gb2)))

34. Model Training

• Training data

> DATA_X = numpy.array([[0,0],[0,1],[1,0],[1,1]])

> DATA_Y = numpy.array([0,1,1,0])

• Predict output for training data

> predict(DATA_X)

array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961])

35. Model Training

• Train with training data

> train(DATA_X,DATA_Y)

[array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961]),
 array(0.06948320612438118)]

• Prediction after training

> train(DATA_X,DATA_Y)

[array([ 0.18353091, 0.74260499, 0.74321824, 0.33324929]),
 array(0.06923193686092949)]

36. example: dynet

37. Dynet

• Our example: static computation graph, fixed set of data

• But: language requires different computations for different data items (sentences have different lengths)

⇒ Dynamically create a computation graph for each data item

38. Example: Dynet

import dynet as dy

model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))
trainer = dy.SimpleSGDTrainer(model)
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(
            W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_sym.forward()
        loss_sym.backward()
        trainer.update()

39. Model Parameters

The model (dy.Model(), with add_parameters and add_lookup_parameters in the code above) holds the values for the weight matrices and weight vectors.

40. Training Setup

trainer = dy.SimpleSGDTrainer(model) in the code above defines the model update function (could also be Adagrad, Adam, ...).

41. Setting up Computation Graph

dy.renew_cg() in the code above creates a new computation graph; dy.parameter(W_p) and dy.parameter(b_p) inform it about the parameters.

Note (Antonis Anastasopoulos): the latest version does NOT require these; use W_p, b_p directly.

42. Operations

The score_sym and loss_sym lines in the code above build the computation graph by defining operations.

43. Training Loop

The two for loops in the code above process the training data; the computations are done in loss_sym.forward() and loss_sym.backward(), followed by trainer.update().
