
Computation Graphs

Brian Thompson (slides by Philipp Koehn)

2 October 2018


Antonis Anastasopoulos, MTMA, May 27th, 2019

1. Neural Network Cartoon

• A common way to illustrate a neural network


2. Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)

3. Computation Graph

(Graph, read bottom-up: x and W1 feed a prod node; that result and b1 feed a sum node, followed by a sigmoid node; the sigmoid output and W2 feed a second prod node; that result and b2 feed a second sum node, followed by the final sigmoid node.)
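
Read this way, the graph computes h = sigmoid(W1 x + b1) and y = sigmoid(W2 h + b2) as a chain of elementary nodes. A minimal sketch of that decomposition (numpy; the function names mirror the node labels, everything else is illustrative):

import numpy as np

def prod(A, B): return A @ B                      # product node
def add(A, B):  return A + B                      # sum node (named add to avoid shadowing Python's sum)
def sigmoid(S): return 1.0 / (1.0 + np.exp(-S))   # sigmoid node

def graph_forward(x, W1, b1, W2, b2):
    h = sigmoid(add(prod(W1, x), b1))     # lower half of the graph
    return sigmoid(add(prod(W2, h), b2))  # upper half of the graph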

4. Simple Neural Network

(Network diagram with the weights used in the following slides: hidden-layer weights 3.7, 3.7 and 2.9, 2.9 with biases −1.5 and −4.5; output-layer weights 4.5, −5.2 with bias −2.0.)

5. Computation Graph

(The same graph, now annotated with the parameter values: W1 = [[3.7, 3.7], [2.9, 2.9]], b1 = [−1.5, −4.5], W2 = [4.5, −5.2], b2 = [−2.0].)

6. Processing Input

(Input value attached to the graph: x = [1.0, 0.0].)

7. Processing Input

(First prod node computed: W1 x = [3.7, 2.9].)

8. Processing Input

(First sum node computed: W1 x + b1 = [2.2, −1.6].)

9. Processing Input

(First sigmoid node computed: h = sigmoid([2.2, −1.6]) = [.900, .168].)

10. Processing Input

(Second layer computed: W2 h = [3.18], adding b2 gives [1.18], and the final sigmoid gives y = [.765].)
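
Plugging the values from the graph above into a few lines of numpy reproduces these numbers (a sketch; variable names are mine):

import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

x  = np.array([1.0, 0.0])
W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); b1 = np.array([-1.5, -4.5])
W2 = np.array([[4.5, -5.2]]);            b2 = np.array([-2.0])

h = sigmoid(W1 @ x + b1)   # [2.2, -1.6] -> [0.900, 0.168]
y = sigmoid(W2 @ h + b2)   # [3.18] + b2 -> [1.18] -> [0.765]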

11. Error Function

• For training, we need a measure of how well we do

⇒ Cost function

also known as objective function, loss, gain, cost, ...

• For instance the L2 norm: error = 1/2 (t − y)²
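
For instance, with the example network's output y ≈ .765 and target t = 1.0, the error is 1/2 (1.0 − .765)² ≈ .0277, the value that appears at the L2 node in the backward-pass slides below.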

12. Gradient Descent

• We view the error as a function of the trainable parameters

error(λ)

• We want to optimize error(λ) by moving it towards its optimum

(Three sketches of error(λ) with the current λ and the optimal λ marked, showing gradients of 1, 2, and 0.2 at the current λ.)

• Why not just set it to its optimum?

– we are updating based on one training example, do not want to overfit to it
– we are also changing all the other parameters, the curve will look different
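
The resulting update therefore moves λ a step against the gradient instead of jumping to the optimum. A minimal sketch (mu is the learning rate; name and value are illustrative):

def gradient_descent_step(lam, error_gradient, mu=0.1):
    # move against the gradient; a steeper gradient gives a larger step
    return lam - mu * error_gradient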

13. Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions

– functions f and g
– composition f ◦ g maps x to f(g(x))

• Chain rule

(f ◦ g)′ = (f′ ◦ g) · g′

or

F′(x) = f′(g(x)) g′(x)

• Leibniz's notation: dz/dx = dz/dy · dy/dx

if z = f(y) and y = g(x), then dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)
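
A quick numeric sanity check of the chain rule, using sigmoid(x²) as an illustrative composition (not from the slides):

import math

def sigmoid(s): return 1.0 / (1.0 + math.exp(-s))

x = 1.5
g = x ** 2                                                  # inner function g(x)
analytic = sigmoid(g) * (1 - sigmoid(g)) * 2 * x            # f'(g(x)) * g'(x)
numeric  = (sigmoid((x + 1e-6) ** 2) - sigmoid(g)) / 1e-6   # finite difference
# analytic and numeric agree to about 6 decimal places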

14. Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t − y)²

• Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk
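
Each factor in this product is a simple local derivative: dE/dy comes from the error function, dy/ds = sigmoid(s)(1 − sigmoid(s)) from the activation, and ds/dwk = hk from the linear combination; the node-by-node versions of these derivatives are worked out on the following slides.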

15. Error Computation in Computation Graph

(The graph from before, extended at the top with an L2 node that compares the final sigmoid output with the target t.)

16. Error Propagation in Computation Graph

(Chain of nodes: A feeds into B, which feeds into the error E.)

• Compute derivative at node A: dE/dA = dE/dB · dB/dA

• Assume that we already computed dE/dB (backward pass through graph)

• So now we only have to get the formula for dB/dA

• For instance, B is a square node

– forward computation: B = A²
– backward computation: dB/dA = dA²/dA = 2A
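
A minimal sketch of that square node (names are illustrative; dE_dB is assumed to have been computed already further up the graph):

def square_forward(A):
    return A * A            # forward: B = A^2

def square_backward(A, dE_dB):
    dB_dA = 2 * A           # local derivative of the square node
    return dE_dB * dB_dA    # chain rule: dE/dA = dE/dB * dB/dA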

17. Derivatives for Each Node

(The same graph as before; the local derivative of each node type is derived in turn.)

dL2/dsigmoid: do/di = d/di 1/2 (t − i)² = t − i

18. Derivatives for Each Node

dsigmoid/dsum: do/di = d/di σ(i) = σ(i)(1 − σ(i))

19. Derivatives for Each Node

dsum/dprod: do/di1 = d/di1 (i1 + i2) = 1, do/di2 = 1

20. Derivatives for Each Node

For the prod node: do/di1 = d/di1 (i1 · i2) = i2, do/di2 = i1
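
Collected in code, these local derivatives are all a toolkit needs to know about each node type. A sketch (the dictionary layout and names are illustrative; the L2 entry keeps the slides' sign convention):

import math

def sigmoid(i): return 1.0 / (1.0 + math.exp(-i))

# local derivatives of each node's output o with respect to its input(s) i
local_derivative = {
    "l2":      lambda i, t:   t - i,                         # d/di 1/2 (t - i)^2
    "sigmoid": lambda i:      sigmoid(i) * (1 - sigmoid(i)),
    "sum":     lambda i1, i2: (1, 1),                        # d/di1, d/di2 of i1 + i2
    "prod":    lambda i1, i2: (i2, i1),                      # d/di1, d/di2 of i1 * i2
}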

21. Backward Pass: Derivative Computation

(The graph with all forward values filled in, the target t = [1.0], and each node labeled with its local derivative: i2 − i1 for the L2 node, σ′(i) for the sigmoid nodes, (1, 1) for the sum nodes, (i2, i1) for the prod nodes. The L2 node's value is [.0277]; the derivative passed back to the output sigmoid is [.235].)

22. Backward Pass: Derivative Computation

(At the output sigmoid node, σ′(i) = [.180]; multiplied by the incoming derivative [.235] this gives [.0424].)

23. Backward Pass: Derivative Computation

(The sum node passes the derivative [.0424] on unchanged to both of its inputs, b2 and the prod node.)

24. Backward Pass: Derivative Computation

(Continuing down the graph: the prod node sends [.0382, .00712] to W2 and [.191, −.220] to the hidden sigmoid node; after multiplying by σ′(i) this becomes [.0171, −.0308], which the sum node passes to b1 and the lower prod node; that prod node finally sends [[.0171, 0], [−.0308, 0]] to W1 and [−.0260, −.0260] to x.)
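
The entire backward pass fits in a few lines of numpy. A sketch reproducing the numbers above (variable names are mine; the t − y sign convention follows the slides):

import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

# forward pass with the example values
x,  t  = np.array([1.0, 0.0]), np.array([1.0])
W1, b1 = np.array([[3.7, 3.7], [2.9, 2.9]]), np.array([-1.5, -4.5])
W2, b2 = np.array([[4.5, -5.2]]), np.array([-2.0])
s1 = W1 @ x + b1; h = sigmoid(s1)    # [2.2, -1.6] -> [.900, .168]
s2 = W2 @ h + b2; y = sigmoid(s2)    # [1.18]      -> [.765]

# backward pass
d_y  = t - y                  # ~ [.235]
d_s2 = d_y * y * (1 - y)      # ~ [.0424]   (gradient for b2)
g_W2 = np.outer(d_s2, h)      # ~ [[.0382, .00712]]
d_h  = W2.T @ d_s2            # ~ [.191, -.220]
d_s1 = d_h * h * (1 - h)      # ~ [.0171, -.0308]   (gradient for b1)
g_W1 = np.outer(d_s1, x)      # ~ [[.0171, 0], [-.0308, 0]]
d_x  = W1.T @ d_s1            # ~ [-.0260, -.0260]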

25. Gradients for Parameter Update

(The gradients computed in the backward pass, attached to the parameters they update: for W1 = [[3.7, 3.7], [2.9, 2.9]] the gradient [[.0171, 0], [−.0308, 0]]; for b1 = [−1.5, −4.5] the gradient [.0171, −.0308]; for W2 = [4.5, −5.2] the gradient [.0382, .00712]; for b2 = [−2.0] the gradient [.0424].)

26. Parameter Update

W1 ← [[3.7, 3.7], [2.9, 2.9]] − µ [[.0171, 0], [−.0308, 0]]

b1 ← [−1.5, −4.5] − µ [.0171, −.0308]

W2 ← [4.5, −5.2] − µ [.0382, .00712]

b2 ← [−2.0] − µ [.0424]
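
As code, the update is the same one-liner for every parameter. A minimal sketch (mu is the learning rate µ; its value here is illustrative):

def sgd_update(param, gradient, mu=0.1):
    # move the parameter a small step against its gradient
    return param - mu * gradient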

27. toolkits

28. Explosion of Deep Learning Toolkits

• University of Montreal: Theano

• Google: Tensorflow

• Microsoft: CNTK

• Facebook: Torch, pyTorch

• Amazon: MX-Net

• CMU: Dynet

• AMU/Edinburgh: Marian

• ... and many more


29. Toolkits

• Machine learning architectures built around computation graphs are very powerful

– define a computation graph
– provide data and a training strategy (e.g., batching)
– the toolkit does the rest

30. Example: Theano

• Deep learning toolkit for Python

• Included as a library

> import numpy

> import theano

> import theano.tensor as T


31. Example: Theano

• Definition of parameters

> x = T.dmatrix()

> W = theano.shared(value=numpy.array([[3.0,2.0],[4.0,3.0]]))

> b = theano.shared(value=numpy.array([-2.0,-4.0]))

• Definition of feed-forward layer

> h = T.nnet.sigmoid(T.dot(x,W)+b)

note: x is a matrix → processes several training examples (a sequence of vectors).

• Define as callable function

> h_function = theano.function([x], h)

• Apply to data

> h_function([[1,0]])

array([[ 0.73105858, 0.11920292]])

32. Example: Theano

• Same setup for hidden → output layer

> W2 = theano.shared(value=numpy.array([5.0,-5.0]))

> b2 = theano.shared(-2.0)

> y_pred = T.nnet.sigmoid(T.dot(h,W2)+b2)

• Define as callable function

> predict = theano.function([x], y_pred)

• Apply to data

> predict([[1,0]])

array([[ 0.7425526]])

33. Model Training

• First, define the variable for the correct output

> y = T.dvector()

• Definition of a cost function (we use the L2 norm).

> l2 = (y-y_pred)**2

> cost = l2.mean()

• Gradient descent training: computation of the derivative

> gW, gb, gW2, gb2 = T.grad(cost, [W,b,W2,b2])

• Update rule (with learning rate 0.1)

> train = theano.function(inputs=[x,y], outputs=[y_pred,cost],
      updates=((W, W-0.1*gW), (b, b-0.1*gb),
               (W2, W2-0.1*gW2), (b2, b2-0.1*gb2)))

34. Model Training

• Training data

> DATA_X = numpy.array([[0,0],[0,1],[1,0],[1,1]])

> DATA_Y = numpy.array([0,1,1,0])

• Predict output for training data

> predict(DATA_X)

array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961])

35. Model Training

• Train with training data

> train(DATA_X,DATA_Y)

[array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961]),
 array(0.06948320612438118)]

• Prediction after training

> train(DATA_X,DATA_Y)

[array([ 0.18353091, 0.74260499, 0.74321824, 0.33324929]),
 array(0.06923193686092949)]

36. example: dynet

37. Dynet

• Our example: static computation graph, fixed set of data

• But: language requires different computations for different data items (sentences have different lengths)

⇒ Dynamically create a computation graph for each data item

38. Example: Dynet

import dynet as dy

model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))
trainer = dy.SimpleSGDTrainer(model)
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(
            W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_sym.forward()
        loss_sym.backward()
        trainer.update()

39. Model Parameters

The model (dy.Model(), with add_parameters and add_lookup_parameters in the code above) holds the values for the weight matrices and weight vectors.

40. Training Setup

trainer = dy.SimpleSGDTrainer(model) in the code above defines the model update function (could also be Adagrad, Adam, ...).

41. Setting up Computation Graph

dy.renew_cg() in the code above creates a new computation graph; dy.parameter(W_p) and dy.parameter(b_p) inform it about the parameters.

Note (Antonis Anastasopoulos): the latest version does NOT require these; use W_p, b_p directly.

42. Operations

The score_sym and loss_sym lines in the code above build the computation graph by defining operations.

43. Training Loop

The two for loops in the code above process the training data; the computations are done in loss_sym.forward() and loss_sym.backward(), followed by trainer.update().
