Bits of Machine Learning, Part 1: Supervised Learning
Alexandre Proutiere and Vahan Petrosyan, KTH (The Royal Institute of Technology)


Page 1

Bits of Machine Learning

Part 1: Supervised Learning

Alexandre Proutiere and Vahan Petrosyan

KTH (The Royal Institute of Technology)

Page 2: Bits of Machine Learning Part 1: Supervised Learning - …wasp-sweden.org/custom/...proutiere-supervised-learning-20161028.pdf · Bits of Machine Learning Part 1: Supervised Learning

Outline of the Course

1. Supervised Learning

• Regression and Classification

• Neural Networks (Deep Learning)

2. Unsupervised Learning

• Clustering

• Reinforcement Learning

Page 3

Learning

Supervised (or Predictive) learning

Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs (the training set)

$D_n = \{(X_i, Y_i),\ i = 1, \ldots, n\}$

Example (iris data): we learn the classification function $f = 1$ if versicolor, $f = -1$ if virginica.

Page 4

Learning

Unsupervised (or Descriptive) learning

Find interesting patterns in the data $D_n = \{X_i,\ i = 1, \ldots, n\}$

We learn there are 2 distinct types of iris and how to distinguish them!

Page 5

Examples

Supervised learning:

• recognition of digits, flowers, pictures, ...

• music classification

• price prediction

• predicting the outcome of chemical reactions

• source separation: identify the instruments present in a recording

• ...

Unsupervised learning:

• classification (without a training set): spam filtering, news aggregation (Google News), ...

• identifying structures and causes in the data: e.g. John Snow's study of cholera deaths vs. pollution

• image segmentation

• ...

Page 6

Outline of the Course

1. Supervised Learning

• Regression and Classification

• Neural Networks (Deep Learning)

2. Unsupervised Learning

• Clustering

• Reinforcement Learning

Page 7

Supervised Learning: Outline

1. A statistical framework

2. Algorithms using Local Averaging (k-nn, kernels)

3. Linear regression

4. Support Vector Machines

5. The kernel trick

6. Neural Networks

Page 8

1. Supervised learning

• Training set: $D_n = \{(X_i, Y_i),\ i = 1, \ldots, n\}$
  - Input features: $X_i \in \mathbb{R}^d$
  - Output: $Y_i \in \mathcal{Y}$, with $\mathcal{Y} = \mathbb{R}$ for regression (price, position, etc.) and $\mathcal{Y}$ finite for classification (type, mode, etc.)

• y is a non-deterministic and complicated function of x, i.e., $y = f(x) + z$ where z is unknown (e.g. noise). Goal: learn f.

• Learning algorithm: maps the training set $D_n$ to an estimate of $f$ (formalized on the next slide)

Page 9

A Statistical Approach

• $(X_i, Y_i),\ i = 1, \ldots, n$ are i.i.d. random variables with joint distribution $P$

• Model: look for an estimate of $f$ in a class of functions $\mathcal{F}$, e.g. linear, polynomial, ...

• Learning algorithm: $\mathcal{A} : D_n \mapsto f_n \in \mathcal{F}$, where $f_n$ estimates the true function $f$

• Performance of predictions defined through a loss function $\ell$
  - Example 1. Regression, Least Squares (LS): $\mathcal{Y} = \mathbb{R}$, $\ell(y, y') = \frac{1}{2}|y - y'|^2$
  - Example 2. Classification in $\mathcal{Y} = \{0, 1\}$: $\ell(y, y') = 1_{\{y \neq y'\}}$

Page 10

Risks

• The risk of a prediction function $g$ is $R(g) := \mathbb{E}_P[\ell(g(X), Y)]$ (often referred to as the out-of-sample error)

  Target function: $f^\star := \arg\min_{g \in \mathcal{F}} R(g)$

• Empirical risk: $R_n(g) := \frac{1}{n}\sum_{i=1}^n \ell(g(X_i), Y_i)$ (often referred to as the in-sample error)

  Law of large numbers: $\lim_{n \to \infty} R_n(g) = R(g)$ almost surely

  Central limit theorem: $\sqrt{n}\,(R_n(g) - R(g))$ converges in distribution to $\mathcal{N}\big(0, \mathrm{var}_P(\ell(g(X), Y))\big)$

• Consistent algorithm: if $\lim_{n \to \infty} \mathbb{E}[R(f_n)] = R(f^\star)$

  Warning. Consistency $\neq$ good algorithm ($\mathcal{F}$ can be poor, $f \neq f^\star$)
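As a quick numerical illustration of these definitions (not part of the slides), the sketch below draws samples from a toy distribution P of my own choosing and checks that the empirical risk $R_n(g)$ of a fixed predictor $g$ approaches the risk $R(g)$ as $n$ grows; numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(y_pred, y):
    # LS loss from Example 1: l(y, y') = 1/2 |y - y'|^2
    return 0.5 * (y_pred - y) ** 2

def sample(n):
    # Toy joint distribution P (an assumption for this sketch):
    # X ~ Uniform[0, 1], Y = sin(2 pi X) + Gaussian noise
    X = rng.uniform(0.0, 1.0, size=n)
    Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=n)
    return X, Y

g = lambda x: 0.5 - x            # some fixed prediction function g

# Empirical risk R_n(g) (in-sample error) for growing n ...
for n in [10, 100, 10_000, 1_000_000]:
    X, Y = sample(n)
    print(n, loss(g(X), Y).mean())

# ... approaches the risk R(g) = E_P[l(g(X), Y)] (law of large numbers),
# estimated here on a large independent sample.
X, Y = sample(5_000_000)
print("R(g) approx", loss(g(X), Y).mean())
```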

Page 11

LS regression and binary classification

Example 1. Regression, Least Squares (LS)

• Risk: $R(g) = \frac{1}{2}\mathbb{E}[|g(X) - Y|^2]$

• Target function: $f^\star(x) = \mathbb{E}[Y \mid X = x]$

Example 2. Binary classification: $\mathcal{Y} = \{-1, 1\}$

• Risk: $R(g) = \mathbb{P}(g(X) \neq Y)$

• Target function: $f^\star(x) = \arg\max_{k \in \{-1, 1\}} \mathbb{P}(Y = k \mid X = x)$

Page 12

Model selection - Choosing F

Avoid overfitting!

A few principles:

• Occam's razor: choose the simpler of two models if they explain the data equally well

• Hadamard's notion of a well-posed problem: a unique solution, depending smoothly on the parameters

Regularization: minimize $\mathrm{error}(f_n) + \Omega(f_n)$, where $\Omega$ penalizes the model complexity

Page 13

2. Local averaging algorithms

Principle: predict at x based on the neighbours of x in the training data

$f_n(x) = \Psi\Big(\sum_{i=1}^n W_i(x) Y_i\Big), \qquad \sum_{i=1}^n W_i(x) = 1$

$\Psi(\cdot)$ is the identity for regression and $\Psi(\cdot) = \mathrm{sgn}(\cdot)$ for binary classification

Page 14

Examples

Prediction function: $f_n(x) = \Psi\big(\sum_{i=1}^n W_i(x) Y_i\big)$, with $\sum_{i=1}^n W_i(x) = 1$

• k-nearest neighbours (k-NN): $W_i(x) = 1/k$ if $X_i$ is one of the k nearest neighbours of x, and $W_i(x) = 0$ otherwise

• Nadaraya-Watson algorithm: $W_i(x) \propto K\big(\frac{x - X_i}{h}\big)$, e.g. $K(\cdot) = e^{-\|\cdot\|^2}$
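Below is a minimal sketch (not from the slides) of both weighting schemes on a toy 1-d regression problem; the data, the bandwidth h, and the choice k = 10 are illustrative assumptions, and numpy is assumed.

```python
import numpy as np

def knn_weights(x, X, k):
    # W_i(x) = 1/k if X_i is one of the k nearest neighbours of x, else 0
    d = np.linalg.norm(X - x, axis=1)
    W = np.zeros(len(X))
    W[np.argsort(d)[:k]] = 1.0 / k
    return W

def nw_weights(x, X, h):
    # Nadaraya-Watson: W_i(x) proportional to K((x - X_i)/h), Gaussian kernel K(u) = exp(-||u||^2)
    K = np.exp(-np.sum(((X - x) / h) ** 2, axis=1))
    return K / K.sum()

def predict(x, X, Y, W, classification=False):
    # f_n(x) = Psi(sum_i W_i(x) Y_i); Psi = identity (regression) or sign (classification)
    s = W @ Y
    return np.sign(s) if classification else s

# Toy 1-d regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

x0 = np.array([0.3])
print("k-NN :", predict(x0, X, Y, knn_weights(x0, X, k=10)))
print("NW   :", predict(x0, X, Y, nw_weights(x0, X, h=0.05)))
print("truth:", np.sin(2 * np.pi * 0.3))
```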

Page 15

Performance analysis

Stone's theorem gives conditions on the weights $W_i$ under which the corresponding algorithm is consistent.

Example: LS target function $f^\star(x) = \mathbb{E}[Y \mid X = x]$, distribution $P$ on $(X, Y)$.

Assume that:

1. $X$ lies in a bounded set $A$ almost surely under $P$

2. For all $x \in A$, $\mathrm{var}(Y \mid X = x) \leq \sigma^2$

3. $f^\star$ is L-Lipschitz

Then the k-NN estimator satisfies:

$\mathbb{E}[R(f_n)] - R(f^\star) \;\leq\; \frac{\sigma^2}{k} + c L^2 \Big(\frac{k}{n}\Big)^{2/d}$

The k-NN algorithm is consistent for such P if k goes to infinity sublinearly with n.
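A worked consequence that the slide does not spell out (a standard calculation under the stated assumptions): balancing the two terms of the bound gives the usual choice of k and the resulting rate.

```latex
% Not stated on the slide: balancing the two terms of the bound
%   E[R(f_n)] - R(f*)  <=  sigma^2 / k  +  c L^2 (k/n)^{2/d}
% gives the standard choice of k and the resulting rate.
\[
  \frac{\sigma^2}{k} \asymp c L^2 \Big(\tfrac{k}{n}\Big)^{2/d}
  \;\Longrightarrow\;
  k \asymp n^{\frac{2}{d+2}},
  \qquad
  \mathbb{E}[R(f_n)] - R(f^\star) = O\!\big(n^{-\frac{2}{d+2}}\big).
\]
% With this choice k -> infinity and k/n -> 0, so the estimator is consistent,
% and the rate degrades quickly with the dimension d (curse of dimensionality).
```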

Page 16

3. Linear regression: Minimizing the empirical risk

• $\mathcal{F}$ = set of linear functions from $\mathbb{R}^d$ ($X_i \in \mathbb{R}^d$) to $\mathbb{R}$ ($Y_i \in \mathbb{R}$)

• Function $f_\theta \in \mathcal{F}$ parametrized by $\theta \in \mathbb{R}^{d+1}$ (by convention $x_0 = 1$):

  $f_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d = \theta^\top x$

• Least squares model. Empirical risk of $\theta$: $R_n(\theta) = \frac{1}{2n}\sum_{i=1}^n (f_\theta(X_i) - Y_i)^2$

• Minimal risk achieved for $\theta^\star = (X^\top X)^{-1} X^\top y$, where $y = [Y_1 \ldots Y_n]^\top$ and $X$ is the matrix whose i-th row is $X_i^\top$
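A minimal sketch of the closed-form solution on synthetic data (my own toy example, numpy assumed); solving the normal equations directly and calling a least-squares solver give the same estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
theta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # [theta_0, ..., theta_d]

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])     # x_0 = 1 convention
Y = X @ theta_true + 0.1 * rng.normal(size=n)

# Normal equations: theta* = (X^T X)^{-1} X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferable equivalent: a least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_star)       # both close to theta_true
print(theta_lstsq)
```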

Page 17

Gradient Descents for Regression

Alternative methods to find $\theta^\star$: sequential algorithms.

Batch gradient descent. Repeat:

$\forall k \in \{0, 1, \ldots, d\}: \quad \theta_k := \theta_k - \alpha \sum_{i=1}^n (f_\theta(X_i) - Y_i) X_{ik}$

Stochastic gradient descent. Repeat:

1. Select a sample i uniformly at random

2. Perform a descent step using $(X_i, Y_i)$ only:

$\forall k \in \{0, 1, \ldots, d\}: \quad \theta_k := \theta_k - \alpha (f_\theta(X_i) - Y_i) X_{ik}$
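The sketch below implements both update rules on the same toy data (an assumed setup, not from the slides); the learning rates and iteration counts are illustrative, and the batch step size absorbs the 1/n factor of the empirical risk.

```python
import numpy as np

def batch_gd(X, Y, alpha=0.1, iters=1000):
    # Batch update: theta := theta - alpha * (1/n) * sum_i (f_theta(X_i) - Y_i) X_i
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (X @ theta - Y) / len(Y)
    return theta

def sgd(X, Y, alpha=0.01, iters=20000, seed=0):
    # Stochastic update: one uniformly sampled pair (X_i, Y_i) per iteration
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        i = rng.integers(len(Y))
        theta -= alpha * (X[i] @ theta - Y[i]) * X[i]
    return theta

# Toy data with the x_0 = 1 convention (first column of ones)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(batch_gd(X, Y))   # both should be close to [1, -2, 0.5]
print(sgd(X, Y))
```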

Page 18

Linear regression: Regularization

• For $d > n$, $X^\top X \in \mathbb{R}^{d \times d}$ is not invertible, so $\theta^\star$ is not uniquely defined. We need additional constraints on the model to get a well-posed problem.

• Regularization: add a cost for the magnitude of $\theta$

$\min_\theta \; \frac{1}{2n}\sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda\,\Omega(\theta)$

  - Ridge: $\Omega(\theta) = \|\theta\|_2^2$
  - LASSO: $\Omega(\theta) = \|\theta\|_1$
  - $\ell_p$: $\Omega(\theta) = \|\theta\|_p$

p small (< 1): sparse solutions but hard optimization problems

p large (≥ 1): less sparse solutions but convex optimization problems

Page 19

Linear regression: Regularization

• Regularization: add a cost for the magnitude of $\theta$

$\min_\theta \; \frac{1}{2n}\sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda\,\Omega(\theta)$

• $\lambda > 0$ controls the bias-variance trade-off
  - High value of $\lambda$: the data has a low weight in the objective function (low variance but high bias)
  - Low value of $\lambda$: the data has a high weight in the objective function (low bias but high variance)

• Solution for Ridge regression: $\theta^\star = (X^\top X + n\lambda I)^{-1} X^\top y$

  Prediction: for all $x \in \mathbb{R}^d$, $f_\lambda(x) = y^\top X (X^\top X + n\lambda I)^{-1} x$
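A minimal sketch of the ridge closed form in the d > n regime discussed on the previous slide (the toy data and the values of lambda are my own assumptions, numpy assumed).

```python
import numpy as np

def ridge_fit(X, Y, lam):
    # theta* = (X^T X + n*lambda*I)^{-1} X^T y
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d = 20, 50                                  # d > n: the unregularized problem is ill-posed
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ theta_true + 0.1 * rng.normal(size=n)

for lam in [1e-3, 1e-1, 10.0]:                 # lambda sweeps the bias-variance trade-off
    theta = ridge_fit(X, Y, lam)
    print(lam, np.round(theta[:4], 2))         # larger lambda shrinks the estimates
```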

Page 20

4. Support Vector Machines

SVM is a classification algorithm aiming at maximizing the margin of the separating hyperplane.

The data points (or vectors) attaining the minimum distance to this hyperplane are the support vectors (they carry the hyperplane).

Page 21

Support Vector Machines: Derivation

Separating hyperplane $H_{\theta,b}$: $\theta^\top x + b = 0$

$\theta$ and $b$ are normalized so that $|\theta^\top X_{sv} + b| = 1$, where $X_{sv}$ is any support vector, i.e., $X_{sv} \in \arg\min_i d(X_i, H_{\theta,b})$.

With this choice, the margin is simply $2/\|\theta\|_2$.

Max margin problem:

$\max_{\theta, b} \; 1/\|\theta\|_2 \quad \text{s.t.} \quad \min_i |\theta^\top X_i + b| = 1
\qquad \Longleftrightarrow \qquad
\min_{\theta, b} \; \|\theta\|_2^2 \quad \text{s.t.} \quad Y_i(\theta^\top X_i + b) \geq 1, \ \forall i$

A simple quadratic program ...

Page 22

Support Vector Machines: Derivation

Lagrangian: $L(\theta, b, \alpha) = \frac{1}{2}\theta^\top\theta - \sum_i \alpha_i \big(Y_i(\theta^\top X_i + b) - 1\big)$

Dual problem:

$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j Y_i Y_j X_i^\top X_j\, \alpha_i \alpha_j
\quad \text{s.t.} \quad \alpha_i \geq 0 \ \forall i, \quad \sum_i \alpha_i Y_i = 0$

A quadratic program: given the solution $\alpha^\star$, we have the following:

• (complementary slackness) $\alpha_i^\star > 0$ iff $X_i$ is a support vector

• (hyperplane) $\theta = \sum_i \alpha_i^\star Y_i X_i$, and $b$ is determined by $Y_i(\theta^\top X_i + b) = 1$ for any $i$ with $\alpha_i^\star > 0$

• (prediction function) $f_n(x) = \mathrm{sgn}\big(\sum_i \alpha_i^\star Y_i X_i^\top x + b\big)$

Remark. The solution and the prediction can be computed by evaluating only the inner products $X_i^\top X_j$ and $X_i^\top x$.
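A hedged sketch using scikit-learn (assumed available): SVC solves the soft-margin dual, so a very large C approximates the hard-margin problem above; the snippet then reproduces the prediction rule from the dual coefficients.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data with labels in {-1, +1}
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)
y = 2 * y - 1

# A very large C makes the soft-margin solver approximate the hard-margin SVM above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_.ravel()       # alpha_i* Y_i, stored for the support vectors only
sv = clf.support_vectors_
b = clf.intercept_[0]

# Prediction from the dual solution: f_n(x) = sgn(sum_i alpha_i* Y_i X_i^T x + b)
x_new = X[:5]
scores = x_new @ sv.T @ alpha_y + b
print(np.sign(scores))
print(clf.predict(x_new))              # should agree
```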

Page 23

Support Vector Machines vs. Local Averaging

The predictor resembles that of a local averaging algorithm:

$g(x) = \mathrm{sgn}\big(\sum_i W_i(x) Y_i + b\big)$, with $W_i(x) = \alpha_i^\star X_i^\top x$

But $W_i(x)$ is not local anymore, as the support vectors may lie anywhere in the data.

Page 24

SVM: Extensions

What if the data is not linearly separable?

• Data almost linearly separable: soft-margin SVM (impose a cost on misclassification)

• Data not linearly separable at all: go to a higher-dimensional space where the data becomes separable, via the kernel trick

Page 25

5. SVM with the kernel trick

• Non-linear transformation: $\phi : \mathbb{R}^d \to \mathbb{R}^e$ with $e \geq d$

• Apply the SVM in $\mathbb{R}^e$ with the training data $(\phi(X_i), Y_i)_{i=1,\ldots,n}$

• The optimal prediction can be constructed by considering only the scalar products

  $K(X_i, X_j) = \phi(X_i)^\top \phi(X_j)$, and $K(X_i, x) = \phi(X_i)^\top \phi(x)$

• $K$ is called a kernel (by definition, when all Gram matrices built from $K$ are positive semi-definite)

• Kernel trick: knowing $K$ is enough; we can use spaces of arbitrarily large dimension!

Page 26

SVM with the kernel trick

SVM with kernel K.

1. Using QP packages, find $\alpha^\star$ solution of:

$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j Y_i Y_j K(X_i, X_j)\, \alpha_i \alpha_j
\quad \text{s.t.} \quad \alpha_i \geq 0 \ \forall i, \quad \sum_i \alpha_i Y_i = 0$

2. Prediction function: $f_K(x) = \mathrm{sgn}\big(\sum_i \alpha_i^\star Y_i K(X_i, x) + b\big)$
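A short sketch with scikit-learn (assumed available) on data that is not linearly separable in the plane; it uses a Gaussian kernel and then recomputes $f_K(x)$ from the dual solution, matching clf.predict. The kernel parameters are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in R^2
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
y = 2 * y - 1

# Gaussian kernel K(x, x') = exp(-||x - x'||^2 / h); sklearn's "rbf" uses gamma = 1/h
gamma = 2.0
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))

def K(A, B):
    # Gram matrix of the Gaussian kernel between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Manual prediction f_K(x) = sgn(sum_i alpha_i* Y_i K(X_i, x) + b) from the dual solution
scores = K(X[:5], clf.support_vectors_) @ clf.dual_coef_.ravel() + clf.intercept_[0]
print(np.sign(scores), clf.predict(X[:5]))
```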

Page 27

Kernels

• $K$ is a kernel iff there exist a Hilbert space and a map $\phi$ such that $K(x, x') = \phi(x)^\top \phi(x')$ (Aronszajn's theorem)

• Examples:
  - Polynomial kernel: $K(x, x') = (1 + x^\top x')^p$ (feature space of dimension $e = \binom{d+p}{p}$)
  - Gaussian kernel: $K(x, x') = \exp(-\|x - x'\|^2/h)$ ($e = \infty$)
  - Kernel cookbook: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
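A small numeric check (my own example, numpy assumed) that the polynomial kernel with p = 2 and d = 2 is an inner product in a 6-dimensional feature space, without ever constructing that space when evaluating the kernel.

```python
import numpy as np

def phi(x):
    # Explicit feature map of the polynomial kernel (1 + x^T z)^2 for d = 2 (6 dimensions)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

print((1 + x @ z) ** 2)     # kernel evaluation, computed entirely in R^2
print(phi(x) @ phi(z))      # the same number, as an inner product in the feature space
```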

Page 28

Outline of the Course

1. Supervised Learning

• Regression and Classification

• Neural Networks (Deep Learning)

2. Unsupervised Learning

• Clustering

• Reinforcement Learning

Page 29

6. Neural networks

Loosely inspired by how the brain works¹: construct a network of simplified neurones, with the hope of approximating and learning any possible function.

¹McCulloch-Pitts, 1943

Page 30

The perceptron

The first artificial neural network, with one layer and $\sigma(x) = \mathrm{sgn}(x)$ (classification).

Input $x \in \mathbb{R}^d$, output in $\{-1, 1\}$. It can represent separating hyperplanes.

Page 31

Multilayer perceptrons

They can represent any function from $\mathbb{R}^d$ to $\{-1, 1\}$ ... but the structure depends on the unknown target function $f$, and is difficult to optimise.

Page 32

From perceptrons to neural networks

... and the number of layers can rapidly grow with the complexity of the function.

A key idea to make neural networks practical: soft-thresholding ...

Page 33

Soft-thresholding

Replace the hard-thresholding function $\sigma$ by smoother functions.

Theorem (Cybenko, 1989). Any continuous function $f$ from $[0, 1]^d$ to $\mathbb{R}$ can be approximated arbitrarily well by a function of the form $\sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$, where $\sigma$ is any sigmoid function.

Page 34

Soft-thresholding

Cybenko's theorem tells us that $f$ can be represented using a single hidden-layer network ...

But the proof is non-constructive: how many neurones do we need? This might depend on $f$ ...

Page 35

Neural networks

A feedforward layered network (deep learning = enough layers)

Page 36

Deep Learning and the ILSVRC challenge

Deep learning has outperformed other techniques in the major machine learning competitions (image classification, speech recognition, and natural language processing).

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

1. Training: 1.2 million images (227×227), each labeled with one of 1000 categories

2. Test: 100,000 images (227×227)

3. Error measure: each team predicts 5 of the 1000 classes per image, and an image is counted as correct if at least one of the predictions is the ground truth.

Page 37

ILSVRC challenge

(Results figure omitted; from the Stanford CS231n lecture notes.)

Page 38

Architectures

Page 39

Architectures

Page 40

Computing with neural networks

• Layer 0: inputs $x = (x_1^{(0)}, \ldots, x_d^{(0)})$, with $x_0^{(0)} = 1$

• Layers $1, \ldots, L-1$: hidden layer $\ell$ has $d(\ell) + 1$ nodes; the state of node $i$ is $x_i^{(\ell)}$, with $x_0^{(\ell)} = 1$

• Layer L: output $y = x_1^{(L)}$

Signal at node k: $s_k^{(\ell)} = \sum_{i=0}^{d(\ell-1)} w_{ik}^{(\ell)} x_i^{(\ell-1)}$

State at node k: $x_k^{(\ell)} = \sigma(s_k^{(\ell)})$

Output: the state of the last node, $y = x_1^{(L)}$
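A minimal sketch of this forward pass (not from the slides): weights are stored per layer with an explicit row for the constant node, and tanh stands in for the generic σ; the toy 2-3-1 architecture is my own assumption.

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    # weights[l-1] has shape (d(l-1) + 1, d(l)); entry (i, k) is w_ik^(l),
    # with row 0 multiplying the constant node x_0^(l-1) = 1
    state = np.concatenate(([1.0], x))                 # layer 0, with x_0^(0) = 1
    for l, W in enumerate(weights, start=1):
        s = W.T @ state                                # s_k^(l) = sum_i w_ik^(l) x_i^(l-1)
        x_l = sigma(s)                                 # x_k^(l) = sigma(s_k^(l))
        state = x_l if l == len(weights) else np.concatenate(([1.0], x_l))
    return state[0]                                    # output y = x_1^(L)

# Random weights for a toy 2-3-1 architecture
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward(np.array([0.5, -1.0]), weights))
```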

Page 41

Training neural networks

The output of the network is a function of $w = (w_{ij}^{(\ell)})_{i,j,\ell}$: $y = f_w(x)$.

We wish to optimise over $w$ to find the most accurate estimate of the target function.

Training data: $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$

Objective: find $w$ minimising the empirical risk

$E(w) := R_n(f_w) = \frac{1}{2n}\sum_{l=1}^n |f_w(X_l) - Y_l|^2$

Page 42

Stochastic Gradient Descent

$E(w) = \frac{1}{2n}\sum_{l=1}^n E_l(w)$, where $E_l(w) := |f_w(X_l) - Y_l|^2$

In each iteration of the SGD algorithm, only one function $E_l$ is considered ...

Parameter. Learning rate $\alpha > 0$

1. Initialization. $w := w_0$

2. Sample selection. Select $l$ uniformly at random in $\{1, \ldots, n\}$

3. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.

Is there an efficient way of computing $\nabla E_l(w)$?

Page 43

Backpropagation

We fix $l$ and introduce $e(w) = E_l(w)$.

Let us compute $\nabla e(w)$:

$\frac{\partial e}{\partial w_{ij}^{(\ell)}} = \underbrace{\frac{\partial e}{\partial s_j^{(\ell)}}}_{:=\,\delta_j^{(\ell)}} \times \underbrace{\frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}}}_{=\,x_i^{(\ell-1)}}$

The sensitivity of the error w.r.t. the signal at node $j$ can be computed recursively ...

Page 44

Backward recursion

Output layer. $\delta_1^{(L)} := \frac{\partial e}{\partial s_1^{(L)}}$ and $e(w) = (\sigma(s_1^{(L)}) - Y_l)^2$, so

$\delta_1^{(L)} = 2\,(x_1^{(L)} - Y_l)\,\sigma'(s_1^{(L)})$

From layer $\ell$ to layer $\ell - 1$.

$\delta_i^{(\ell-1)} := \frac{\partial e}{\partial s_i^{(\ell-1)}} = \sum_{j=1}^{d(\ell)} \underbrace{\frac{\partial e}{\partial s_j^{(\ell)}}}_{:=\,\delta_j^{(\ell)}} \times \underbrace{\frac{\partial s_j^{(\ell)}}{\partial x_i^{(\ell-1)}}}_{=\,w_{ij}^{(\ell)}} \times \underbrace{\frac{\partial x_i^{(\ell-1)}}{\partial s_i^{(\ell-1)}}}_{=\,\sigma'(s_i^{(\ell-1)})}$

Summary.

$\frac{\partial E_l}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} x_i^{(\ell-1)}, \qquad \delta_i^{(\ell-1)} = \sum_{j=1}^{d(\ell)} \delta_j^{(\ell)} w_{ij}^{(\ell)} \sigma'(s_i^{(\ell-1)})$

Page 45

Backpropagation algorithm

Parameter. Learning rate $\alpha > 0$

Input. $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$

1. Initialization. $w := w_0$

2. Sample selection. Select $l$ uniformly at random in $\{1, \ldots, n\}$

3. Gradient of $E_l$.
   • $x_i^{(0)} := X_{li}$ for all $i = 1, \ldots, d$
   • Forward propagation: compute the state and signal at each node, $(x_i^{(\ell)}, s_i^{(\ell)})$
   • Backward propagation: propagate $Y_l$ back to compute $\delta_i^{(\ell)}$ at each node and the partial derivatives $\frac{\partial E_l}{\partial w_{ij}^{(\ell)}}$

4. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.
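A compact sketch of the full algorithm (my own implementation under the conventions above, numpy assumed): forward pass, backward recursion for the δ's, and the stochastic gradient step, applied to a toy XOR-like problem. With a few hidden units it typically learns the labels, although SGD on this non-convex objective can occasionally stall in a poor local minimum.

```python
import numpy as np

sigma  = np.tanh
dsigma = lambda s: 1.0 - np.tanh(s) ** 2

def forward(x, W):
    # Returns the signals s^(l) and states x^(l) (with the bias node x_0 = 1) of every layer
    states, signals = [np.concatenate(([1.0], x))], []
    for l, Wl in enumerate(W, start=1):
        s = Wl.T @ states[-1]
        signals.append(s)
        xl = sigma(s)
        states.append(xl if l == len(W) else np.concatenate(([1.0], xl)))
    return signals, states

def grad(x, y, W):
    # Gradient of E_l(w) = |f_w(X_l) - Y_l|^2 via the backward recursion
    signals, states = forward(x, W)
    grads = [None] * len(W)
    delta = 2.0 * (states[-1] - y) * dsigma(signals[-1])          # delta^(L)
    for l in range(len(W) - 1, -1, -1):
        grads[l] = np.outer(states[l], delta)                     # dE_l/dw_ij = delta_j x_i
        if l > 0:                                                 # delta for the layer below
            delta = (W[l][1:, :] @ delta) * dsigma(signals[l - 1])
    return grads

def train(X, Y, hidden=3, alpha=0.05, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    dims = [X.shape[1], hidden, 1]
    W = [0.5 * rng.normal(size=(dims[l] + 1, dims[l + 1])) for l in range(2)]
    for _ in range(iters):
        l = rng.integers(len(Y))                                          # sample selection
        W = [Wl - alpha * g for Wl, g in zip(W, grad(X[l], Y[l], W))]     # GD iteration
    return W

# Toy XOR-like classification problem with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([-1.0, 1.0, 1.0, -1.0])
W = train(X, Y)
print([float(np.sign(forward(x, W)[1][-1][0])) for x in X])
```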

Page 46

Example: tensorflow

http://playground.tensorflow.org/
