
Neural Networks: basics and parallelization/vectorization

Kenjiro Taura


Contents

1 What is machine learning?
    A simple linear regression
    A handwritten digits recognition

2 Training
    A simple gradient descent
    Stochastic gradient descent

3 Chain Rule

4 Back Propagation in Action


What is machine learning?

input: a training data set

D = { (x_i, t_i) | i = 0, 1, · · · }

each x_i is normally a real vector (i.e., many real values)
each t_i is a real value (regression), 0/1 (binary classification), a discrete value (multi-class classification), etc., depending on the task

goal: supervised machine learning tries to find a function f that “matches” the training data well, i.e.,

f(x_i) ≈ t_i for (x_i, t_i) ∈ D

put formally, find f that minimizes an error or a loss:

L(f; D) ≡ ∑_{(x_i, t_i) ∈ D} err(f(x_i), t_i),

where err(y_i, t_i) is a function that measures an “error” or a “distance” between the predicted output and the true value


Machine learning as an optimization problem

finding a good function from the space of literally all possible functions is neither easy nor meaningful

we normally fix a search space of functions F parameterized by w and find a good function f_w ∈ F (parametric models)

the task is then to find the value of w that minimizes the loss:

L(w; D) ≡ ∑_{(x_i, t_i) ∈ D} err(f_w(x_i), t_i)


A simple example (linear regression)

training data D = { (x_i, t_i) | i = 0, 1, · · · }
    x_i : a real value
    t_i : a real value

let the search space be the set of polynomials of degree ≤ 2; a function is then parameterized by w = (w_0, w_1, w_2), i.e.,

f_w(x) ≡ w_2 x² + w_1 x + w_0

let the error function be the simple squared distance:

err(y, t) ≡ (y − t)²

the task is to find w = (w_0, w_1, w_2) that minimizes:

L(w; D) = ∑_{(x_i, t_i) ∈ D} err(f_w(x_i), t_i) = ∑_{(x_i, t_i) ∈ D} (w_2 x_i² + w_1 x_i + w_0 − t_i)²
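As a concrete illustration, here is a minimal NumPy sketch of this model and loss; the data points are made-up values, not from the slides:

```python
import numpy as np

# toy training data (x_i, t_i): made-up real values
xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
ts = np.array([1.0, 1.8, 3.1, 4.9, 7.2])

def f(w, x):
    # f_w(x) = w2*x^2 + w1*x + w0, applied elementwise
    w0, w1, w2 = w
    return w2 * x**2 + w1 * x + w0

def loss(w, xs, ts):
    # L(w; D) = sum of squared errors over the training set
    return np.sum((f(w, xs) - ts) ** 2)

print(loss(np.zeros(3), xs, ts))
```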


A somewhat more realistic example: image (digits) recognition

training data D = { (x_i, t_i) | i = 0, 1, · · · }
    x_i : a vector of pixel values of an image
    t_i : a “one hot” vector representing the class ∈ {0, 1, · · · , 9} (e.g., t(0 0 0 0 1 0 0 0 0 0) represents “4”); we write i to mean the one-hot vector v having v_i = 1

D = { (⟨image of “4”⟩, 4), (⟨image of “9”⟩, 9), . . . }

the search space: the following composition parameterized by three matrices W_0, W_1, and W_2

f_{W_0,W_1,W_2}(x) ≡ softmax(W_2 ReLU(W_1 ReLU(W_0 x)))
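A minimal NumPy sketch of this forward computation; the layer sizes and random weights are placeholders for illustration:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def softmax(v):
    e = np.exp(v - v.max())   # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W0 = rng.standard_normal((100, 784))   # e.g., a 28x28 image flattened to 784 pixels
W1 = rng.standard_normal((100, 100))
W2 = rng.standard_normal((10, 100))

def f(x):
    # softmax(W2 ReLU(W1 ReLU(W0 x)))
    return softmax(W2 @ relu(W1 @ relu(W0 @ x)))

x = rng.random(784)   # a fake "image"
print(f(x).sum())     # the 10 class probabilities sum to 1
```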


A handwritten digits recognition

the value f_{W_0,W_1,W_2}(x) is a 10-vector representing the probabilities that x belongs to each of the ten classes

the loss function is the cross entropy commonly used in multiclass classification (· : a dot product)

err(y, t) = H(t, y) ≡ −t · log y

the task is to find W_0, W_1, and W_2 that minimize:

L(W_0, W_1, W_2; D) = ∑_{(x_i, t_i) ∈ D} H(t_i, softmax(W_2 ReLU(W_1 ReLU(W_0 x_i))))

[figure: the network x →(×W_0)→ ReLU →(×W_1)→ ReLU →(×W_2)→ softmax → y, compared with the target t by the cross entropy H to give the error e]
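A tiny self-contained sketch of this loss on a single example; the prediction y here is a made-up probability vector:

```python
import numpy as np

def cross_entropy(t, y):
    # H(t, y) = -t . log y  (t one-hot, y a probability vector)
    return -np.dot(t, np.log(y))

y = np.full(10, 0.1)            # made-up prediction: uniform over the 10 classes
t = np.zeros(10); t[4] = 1.0    # one-hot target for the digit "4"
print(cross_entropy(t, y))      # -log 0.1 ≈ 2.30
```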


How to find the minimizing parameter?

it boils down to minimizing a function of many parameters w:

L(w; D) = ∑_{(x_i, t_i) ∈ D} err(f_w(x_i), t_i),

for which we compute the derivative of L with respect to w and move w in the opposite direction (gradient descent; GD):

w = w − η t(∂L/∂w)    (η : a scalar controlling the learning rate)

repeat this until L(w;D) converges


A linear regression example

recall that in the linear regression example:

L(w; D) = ∑_{(x_i, t_i) ∈ D} (w_2 x_i² + w_1 x_i + w_0 − t_i)²

differentiate L with respect to w = t(w_0 w_1 w_2) to get:

∂L/∂w = ∑_{(x_i, t_i) ∈ D} 2 (w_2 x_i² + w_1 x_i + w_0 − t_i) (1  x_i  x_i²)

(remark: we used the chain rule)

so you repeat:

w = w − η ∑_{(x_i, t_i) ∈ D} 2 (w_2 x_i² + w_1 x_i + w_0 − t_i) t(1  x_i  x_i²)

until L(w;D) converges
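Putting the update rule together, a minimal NumPy sketch of this gradient descent loop; the data, learning rate, and iteration count are made-up values:

```python
import numpy as np

# made-up training data (x_i, t_i)
xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
ts = np.array([ 2.1,  0.9, 0.2, 0.8, 2.0, 4.1])

w = np.zeros(3)     # w = (w0, w1, w2)
eta = 0.01          # learning rate

for step in range(2000):
    pred = w[2] * xs**2 + w[1] * xs + w[0]           # f_w(x_i) for all i
    resid = pred - ts                                # f_w(x_i) - t_i
    basis = np.stack([np.ones_like(xs), xs, xs**2])  # rows: 1, x_i, x_i^2
    grad = 2.0 * basis @ resid                       # dL/dw as a 3-vector
    w = w - eta * grad                               # gradient descent update

print(w)            # the fitted (w0, w1, w2)
```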


A problem of the gradient descent

the loss function we want to minimize is normally a summation over all training data:

L(w; D) = ∑_{(x_i, t_i) ∈ D} err(f_w(x_i), t_i)

the gradient descent method just described:

1 computes ∂/∂w err(f_w(x_i), t_i) for each training datum (x_i, t_i) ∈ D, with the current value of w
2 sums them over the whole data set and then updates w

it is commonly observed that convergence becomes faster when we update w more “incrementally” → Stochastic Gradient Descent (SGD)


SGD

repeat:

1 randomly draw a subset of the training data D′ (a mini batch; D′ ⊂ D)

2 compute the gradient of the loss over the mini batch

∂L(w; D′)/∂w = ∑_{(x_i, t_i) ∈ D′} ∂/∂w err(f_w(x_i), t_i)

3 update w

w = w − η t(∂L(w; D′)/∂w)

4 “update sooner rather than later”
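A minimal sketch of this loop for the earlier polynomial example; the synthetic data, batch size, and learning rate are arbitrary choices, and the gradient is averaged over the mini batch so the step size does not depend on the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=1000)                                # made-up inputs
ts = 3.0 * xs**2 - 2.0 * xs + 1.0 + 0.1 * rng.standard_normal(1000)   # noisy targets

w = np.zeros(3)
eta, batch = 0.1, 32

for step in range(3000):
    idx = rng.choice(len(xs), size=batch, replace=False)   # draw a mini batch D'
    xb, tb = xs[idx], ts[idx]
    resid = w[2] * xb**2 + w[1] * xb + w[0] - tb
    basis = np.stack([np.ones_like(xb), xb, xb**2])
    grad = 2.0 * basis @ resid / batch    # mean gradient over the mini batch
    w = w - eta * grad                    # update after every mini batch

print(w)   # approaches (1, -2, 3)
```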


SGD and neural networks

in neural networks, a function is a composition of many stages, each represented by a lot of parameters:

x_1 = f_1(w_1; x)

x_2 = f_2(w_2; x_1)

. . .

y = f_n(w_n; x_{n−1})

e = err(y, t)

we need to differentiate e by w_1, · · · , w_n

[figure: a single stage with parameter W maps x_i to x_{i+1} = f(W x_i)]


The digits recognition example

x_1 = W_0 x
x_2 = ReLU(x_1)
x_3 = W_1 x_2
x_4 = ReLU(x_3)
x_5 = W_2 x_4
y = softmax(x_5)
e = H(t, y)

you need to get the differentiation of e by W_0, W_1, and W_2 done right


Differentiating multivariable functions

x = t(x_0 · · · x_{n−1}) ∈ R^n (a column vector)

f(x) : a scalar

definition: the derivative of f with respect to x, written f′(x) or ∂f/∂x, is the row n-vector a s.t.

f(x + ∆x) ≈ f(x) + a ∆x = f(x) + ∑_{i=0}^{n−1} a_i ∆x_i

(a row vector × a column vector yields a 1 × 1 matrix, identified with a scalar)

when it exists,

a = (∂f/∂x_0  · · ·  ∂f/∂x_{n−1})
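A quick numerical illustration of this definition; the function and the point are arbitrary choices, and the check simply compares the actual change of f with the first-order prediction a ∆x:

```python
import numpy as np

def f(x):
    # an arbitrary scalar function of a vector: x0^2 + 3*x0*x1 + sin(x2)
    return x[0]**2 + 3*x[0]*x[1] + np.sin(x[2])

def grad_f(x):
    # the row vector a = (df/dx0, df/dx1, df/dx2), computed by hand
    return np.array([2*x[0] + 3*x[1], 3*x[0], np.cos(x[2])])

x  = np.array([1.0, -2.0, 0.5])
dx = 1e-4 * np.array([0.3, -0.7, 0.2])

print(f(x + dx) - f(x))   # actual change of f
print(grad_f(x) @ dx)     # first-order prediction a . dx (nearly equal)
```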


The Chain Rule

consider a function f that depends on y = (y_0, · · · , y_{m−1}) ∈ R^m, each of which in turn depends on x = (x_0, · · · , x_{n−1}) ∈ R^n

the chain rule (math textbook version):

∂f/∂x_i = ∑_{0≤j<m} (∂f/∂y_j)(∂y_j/∂x_i)    (0 ≤ i < n)

[figure: a dependency graph; x = (x_0 · · · x_{n−1}) feeds y = (y_0 · · · y_{m−1}), which feeds f]


The Chain Rule: intuition


say you increase an input variable x_i by ∆x_i; each y_j will increase by

≈ (∂y_j/∂x_i) ∆x_i,

which will contribute to increasing the final output (f) by

≈ (∂f/∂y_j)(∂y_j/∂x_i) ∆x_i


Chain Rule

master the following “index-free” version for neural networks

x, y : scalars (single components of a vector/matrix/high-dimensional array)

the chain rule (ML practitioner’s version):

∂f/∂x = ∑_{all variables y that x directly affects} (∂f/∂y)(∂y/∂x)


Chain Rule and “Back Propagation”

The chain rule allows you to compute ∂L/∂x, the derivative of the loss with respect to a variable, from ∂L/∂y, the derivatives of the loss with respect to upstream variables:

∂L/∂x = ∑_{all variables y a step ahead of x} (∂L/∂y)(∂y/∂x)


Component functions

we used the following functions

Convolution(W; x) : linear/local transformations to images
Linear(W; x) : a linear or “fully connected” layer
ReLU(x) : rectified linear units
softmax(x)
H(t, x) : cross entropy

we summarize their definitions and their derivatives


Convolution

it takes
    an image = 2D pixels × a number of channels
    a “filter” or a “kernel”, which is essentially a small image

and slides the filter over all pixels of the input, taking the local inner product at each pixel; an illustration of a single-channel 2D convolution (imagine a grayscale image):

[figure: an input image x convolved (*) with a filter (kernel) W gives an output image y]


Convolution (a single channel version)

W_{i,j} : a filter (−K ≤ i ≤ K, −K ≤ j ≤ K)
x_{i,j} : an input image (0 ≤ i < H, 0 ≤ j < W)
y_{i,j} : an output image (0 ≤ i < H, 0 ≤ j < W)

assuming x_{i,j} = 0 for underflowed/overflowed indices, for brevity,

y_{i,j} = ∑_{−K≤i′≤K, −K≤j′≤K} w_{i′,j′} x_{i+i′, j+j′}    (for each i, j)
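A direct (unvectorized) Python sketch of this definition, treating out-of-range pixels as zero; the image and filter below are made-up values:

```python
import numpy as np

def conv2d_single(x, w):
    # y[i,j] = sum over (i',j') of w[i'+K, j'+K] * x[i+i', j+j'], with x = 0 outside its bounds
    H, W = x.shape
    K = w.shape[0] // 2            # the filter is (2K+1) x (2K+1)
    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            s = 0.0
            for di in range(-K, K + 1):
                for dj in range(-K, K + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        s += w[di + K, dj + K] * x[ii, jj]
            y[i, j] = s
    return y

x = np.arange(25, dtype=float).reshape(5, 5)   # a tiny 5x5 "image"
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter (K = 1)
print(conv2d_single(x, w))
```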


Convolution

say the input has IC channels and the output OC channels

W_{oc,ic,i,j} : a filter (0 ≤ ic < IC, 0 ≤ oc < OC)
x_{ic,i,j} : an input image
y_{oc,i,j} : an output image

y_{oc,i,j} = ∑_{ic,i′,j′} w_{oc,ic,i′,j′} x_{ic,i+i′,j+j′}    (for each oc, i, j)

the actual code does this for each image in a batch:

y_{b,oc,i,j} = ∑_{ic,i′,j′} w_{oc,ic,i′,j′} x_{b,ic,i+i′,j+j′}    (for each b, oc, i, j)
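Since the deck is about vectorization, here is one way this batched form can be written without explicit loops in NumPy, as a sketch; sliding_window_view and einsum are standard NumPy functions, and the shapes below are arbitrary examples:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_batched(x, w):
    # x: (B, IC, H, W) images, w: (OC, IC, 2K+1, 2K+1) filters -> y: (B, OC, H, W)
    K = w.shape[-1] // 2
    # zero-pad the two spatial dimensions so the output keeps the input size
    xp = np.pad(x, ((0, 0), (0, 0), (K, K), (K, K)))
    # windows[b, ic, i, j, u, v] = xp[b, ic, i + u, j + v] for every output pixel (i, j)
    windows = sliding_window_view(xp, (2 * K + 1, 2 * K + 1), axis=(2, 3))
    # y[b, oc, i, j] = sum over ic, u, v of w[oc, ic, u, v] * windows[b, ic, i, j, u, v]
    return np.einsum('bcijuv,ocuv->boij', windows, w)

x = np.random.rand(8, 3, 16, 16)     # a batch of 8 three-channel 16x16 "images"
w = np.random.rand(4, 3, 3, 3)       # 4 output channels, 3x3 filters
print(conv2d_batched(x, w).shape)    # (8, 4, 16, 16)
```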


Convolution (Back propagation)

∂L/∂x_{b,ic,i+i′,j+j′} = ∑_{b′,oc,i,j} (∂L/∂y_{b′,oc,i,j}) (∂y_{b′,oc,i,j}/∂x_{b,ic,i+i′,j+j′}) = ∑_{oc,i,j} (∂L/∂y_{b,oc,i,j}) w_{oc,ic,i′,j′}

∂L/∂w_{oc,ic,i′,j′} = ∑_{b,oc′,i,j} (∂L/∂y_{b,oc′,i,j}) (∂y_{b,oc′,i,j}/∂w_{oc,ic,i′,j′}) = ∑_{b,i,j} (∂L/∂y_{b,oc,i,j}) x_{b,ic,i+i′,j+j′}
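These two sums can also be written with einsum, reusing the padded sliding-window idea from the previous sketch; this is only one possible vectorization, with gy standing for ∂L/∂y and made-up shapes:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_backward(x, w, gy):
    # x: (B, IC, H, W), w: (OC, IC, 2K+1, 2K+1), gy = dL/dy: (B, OC, H, W)
    K = w.shape[-1] // 2
    xp  = np.pad(x,  ((0, 0), (0, 0), (K, K), (K, K)))
    gyp = np.pad(gy, ((0, 0), (0, 0), (K, K), (K, K)))
    xw  = sliding_window_view(xp,  (2 * K + 1, 2 * K + 1), axis=(2, 3))
    gyw = sliding_window_view(gyp, (2 * K + 1, 2 * K + 1), axis=(2, 3))
    # dL/dw[oc,ic,u,v] = sum over b,i,j of gy[b,oc,i,j] * x_padded[b,ic,i+u,j+v]
    gw = np.einsum('boij,bcijuv->ocuv', gy, xw)
    # dL/dx is a "full" convolution of gy with the spatially flipped filter
    gx = np.einsum('boijuv,ocuv->bcij', gyw, w[:, :, ::-1, ::-1])
    return gw, gx

x  = np.random.rand(8, 3, 16, 16)
w  = np.random.rand(4, 3, 3, 3)
gy = np.random.rand(8, 4, 16, 16)
gw, gx = conv2d_backward(x, w, gy)
print(gw.shape, gx.shape)    # (4, 3, 3, 3) (8, 3, 16, 16)
```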


Linear (a.k.a. Fully Connected Layer)

definition:

y = Linear(W; x) ≡ W x,    i.e.,  y_i = ∑_j W_{ij} x_j

derivatives:

by W :    ∂y_{i′}/∂W_{ij} = x_j (i′ = i),  0 (i′ ≠ i)

by x :    ∂y_i/∂x_j = W_{ij}


Linear (a.k.a. Fully Connected Layer)

back propagation:

∂L/∂W :

∂L/∂W_{ij} = ∑_{i′} (∂L/∂y_{i′}) (∂y_{i′}/∂W_{ij}) = (∂L/∂y_i) x_j,

i.e.,  ∂L/∂W = (∂L/∂y) × x    (an outer product)

∂L/∂x :

∂L/∂x_j = ∑_i (∂L/∂y_i) (∂y_i/∂x_j) = ∑_i W_{ij} (∂L/∂y_i),

i.e.,  ∂L/∂x = (∂L/∂y) W    (a vector-matrix product)
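A minimal NumPy sketch of these two rules; gy stands for ∂L/∂y, and the sizes are arbitrary:

```python
import numpy as np

def linear_forward(W, x):
    return W @ x                 # y_i = sum_j W_ij x_j

def linear_backward(W, x, gy):
    # gy = dL/dy; returns dL/dW and dL/dx
    gW = np.outer(gy, x)         # dL/dW_ij = gy_i * x_j       (outer product)
    gx = gy @ W                  # dL/dx_j  = sum_i W_ij gy_i  (vector-matrix product)
    return gW, gx

W, x, gy = np.random.rand(4, 6), np.random.rand(6), np.random.rand(4)
gW, gx = linear_backward(W, x, gy)
print(gW.shape, gx.shape)        # (4, 6) (6,)
```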


ReLU

definition (scalar ReLU): for x ∈ R, define

relu(x) ≡ max(x, 0)

derivatives of relu: for y = relu(x),

∂y/∂x = 1 (x > 0),  0 (x ≤ 0);  i.e.,  ∂y/∂x = max(sign(x), 0)

[figure: the graph of y = relu(x)]


ReLU

definition (vector ReLU): for a vector x ∈ R^n, define ReLU as the application of relu to each component:

ReLU(x) ≡ t(relu(x_0) · · · relu(x_{n−1}))

derivatives of ReLU:

∂y_j/∂x_i = max(sign(x_i), 0) (i = j),  0 (i ≠ j)


ReLU

back propagation:

∂L/∂x_j = ∑_i (∂L/∂y_i)(∂y_i/∂x_j) = (∂L/∂y_j)(∂y_j/∂x_j) = ∂L/∂y_j (x_j ≥ 0),  0 (x_j < 0)
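A NumPy sketch of the vector ReLU and its backward rule; gy again stands for ∂L/∂y:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, gy):
    # pass dL/dy_j through where x_j is positive, zero elsewhere
    return gy * (x > 0)

x  = np.array([-1.0, 0.5, 2.0, -3.0])
gy = np.array([10.0, 10.0, 10.0, 10.0])
print(relu_backward(x, gy))    # [ 0. 10. 10.  0.]
```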


softmax

definition: for x ∈ R^n,

y = softmax(x) ≡ (1 / ∑_{j=0}^{n−1} exp(x_j)) t(exp(x_0) · · · exp(x_{n−1}))

it is a vector whose:
    components are each > 0,
    components sum to 1,
    largest component “dominates”

[figure: a bar chart over the classes 0–9 before and after softmax; after softmax the largest component is close to 1.0]


Derivative of softmax

for y = softmax(x), ∂y/∂x is an n × n matrix whose elements are given by:

(diagonal elements)

(∂y/∂x)_{i,i} = ∂/∂x_i [ exp(x_i) / ∑_k exp(x_k) ] = [ exp(x_i) ∑_k exp(x_k) − exp(x_i)² ] / (∑_k exp(x_k))² = y_i (1 − y_i)

(non-diagonal elements) for i ≠ j,

(∂y/∂x)_{i,j} = ∂/∂x_j [ exp(x_i) / ∑_k exp(x_k) ] = − exp(x_i) exp(x_j) / (∑_k exp(x_k))² = −y_i y_j
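A quick numerical check of these formulas; the input vector is arbitrary, and the comparison against a finite difference is only a sanity test:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(y):
    # (dy/dx)_{i,j} = y_i (1 - y_i) on the diagonal, -y_i y_j off the diagonal
    return np.diag(y) - np.outer(y, y)

x = np.array([0.3, -1.2, 2.0, 0.0])
J = softmax_jacobian(softmax(x))

# finite-difference check of column j (i.e., perturbing x_j)
eps, j = 1e-6, 2
col = (softmax(x + eps * np.eye(4)[j]) - softmax(x - eps * np.eye(4)[j])) / (2 * eps)
print(np.max(np.abs(J[:, j] - col)))   # tiny
```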


Cross entropy

definition: for y ∈ R^n,

H(t, y) ≡ −∑_{i=0}^{n−1} t_i log y_i

derivative of H: for z = H(t, y), ∂z/∂y is an n-vector:

∂z/∂y = − (t_0/y_0  · · ·  t_{n−1}/y_{n−1})

if t is a one-hot vector, this vector also has a single nonzero component; if t = c (the one-hot vector for class c),

∂z/∂y = − (0 · · · 0  1/y_c  0 · · · 0) = − c / y_c


Composition of softmax and cross entropy

In particular, the composition of softmax and H enjoys a remarkable simplification when differentiated:

for z = H(t, y) = H(t, softmax(x)),

∂z/∂x = (∂z/∂y)(∂y/∂x)
      = (− c / y_c) (∂y/∂x)
      = − (1/y_c) ∂y_c/∂x
      = − (1/y_c) (−y_0 y_c  −y_1 y_c  · · ·  y_c(1 − y_c)  · · ·  −y_{n−1} y_c)
      = (y_0  y_1  · · ·  (y_c − 1)  · · ·  y_{n−1})
      = t(y − t)
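To close the loop, here is a compact NumPy sketch of back propagation through the whole digits network using the rules derived above; the layer sizes, the random data, and the learning rate are made-up placeholders:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [784, 100, 100, 10]
Ws = [rng.standard_normal((sizes[k + 1], sizes[k])) * 0.01 for k in range(3)]  # W0, W1, W2

x = rng.random(784)                  # a fake image
t = np.zeros(10); t[4] = 1.0         # one-hot target ("4")

# forward pass, keeping the intermediate values for the backward pass
x1 = Ws[0] @ x;  x2 = relu(x1)
x3 = Ws[1] @ x2; x4 = relu(x3)
x5 = Ws[2] @ x4; y  = softmax(x5)
e  = -np.dot(t, np.log(y))           # cross entropy H(t, y)

# backward pass
g5  = y - t                          # dL/dx5: the softmax + cross entropy simplification
gW2 = np.outer(g5, x4)               # dL/dW2 (outer product rule for Linear)
g4  = g5 @ Ws[2]                     # dL/dx4 (vector-matrix product rule)
g3  = g4 * (x3 > 0)                  # dL/dx3 (ReLU rule)
gW1 = np.outer(g3, x2)
g2  = g3 @ Ws[1]
g1  = g2 * (x1 > 0)
gW0 = np.outer(g1, x)

eta = 0.1                            # one (S)GD step on this single example
Ws = [W - eta * gW for W, gW in zip(Ws, [gW0, gW1, gW2])]
print(e, gW0.shape, gW1.shape, gW2.shape)
```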
