Bits of Machine Learning
Part 1: Supervised Learning
Alexandre Proutiere and Vahan Petrosyan
KTH (The Royal Institute of Technology)
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
1
Learning
Supervised (or Predictive) learning
Learn a mapping from inputs x to outputs y, given a labeled set
of input-output pairs (the training set)
Dn = {(Xi, Yi), i = 1, . . . , n}
We learn the classification function f: f = 1 if versicolor, f = −1 if virginica
2
Learning
Unsupervised (or Descriptive) learning
Find interesting patterns in the data Dn = {Xi, i = 1, . . . , n}
We learn there are 2 distinct types of iris and how to distinguish them!
3
Examples
Supervised learning:
• digit, flower, picture, ... recognition
• music classification
• predict prices
• predict the outcome of chemical reactions
• source separation: identify the instruments present in a recording
• ...
Unsupervised learning:
• classification (without a training set): spam filtering, news aggregation (Google News),
...
• identify structures and causes in the data: e.g. J. Snow – cholera
deaths vs. contaminated water sources
• image segmentation
• ...
4
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
5
Supervised Learning: Outline
1. A statistical framework
2. Algorithms using Local Averaging (k-nn, kernels)
3. Linear regression
4. Support Vector Machines
5. The kernel trick
6. Neural Networks
6
1. Supervised learning
• Training set: Dn = {(Xi, Yi), i = 1, . . . , n}
- Input features: Xi ∈ R^d
- Output: Yi ∈ Y, with Y = R for regression (price, position, etc.)
and Y finite for classification (type, mode, etc.)
• y is a non-deterministic and complicated function of x,
i.e., y = f(x) + z where z is unknown (e.g. noise). Goal: learn f.
• Learning algorithm: maps the training data Dn to an estimate fn of f
7
A Statistical Approach
• (Xi, Yi), i = 1, . . . , n are i.i.d. random variables with joint
distribution P
• Model: look for an estimate of f in a class of functions F , e.g.
linear, polynomial, ...
• Learning algorithm: A : Dn ↦ fn ∈ F, fn estimates the true
function f
• Performance of predictions defined through a loss function ℓ
- Example 1. Regression, Least Squares (LS): Y = R,
ℓ(y, y′) = (1/2)|y − y′|²
- Example 2. Classification in Y = {0, 1}: ℓ(y, y′) = 1_{y ≠ y′}
8
Risks
• The risk of a prediction function g is R(g) := E_P[ℓ(g(X), Y)]
(often referred to as out-of-sample error)
Target function: f⋆ := arg min_{g∈F} R(g)
• Empirical risk: Rn(g) := (1/n) ∑_{i=1}^n ℓ(g(Xi), Yi)
(often referred to as in-sample error)
Law of large numbers: lim_{n→∞} Rn(g) = R(g) almost surely
Central limit theorem: √n (Rn(g) − R(g)) converges in distribution to N(0, Var_P(ℓ(g(X), Y)))
• Consistent algorithm: lim_{n→∞} E[R(fn)] = R(f⋆)
Warning. Consistency ≠ good algorithm (F can be poor, f ≠ f⋆)
9
LS regression and binary classification
Example 1. Regression, Least Squares (LS)
• Risk R(g) = (1/2) E[|g(X) − Y|²]
• Target function f⋆(x) = E[Y | X = x]
Example 2. Binary classification: Y = {−1, 1}
• Risk R(g) = P(g(X) ≠ Y)
• Target function f⋆(x) = arg max_{k ∈ {−1,1}} P(Y = k | X = x)
10
Model selection - Choosing F
Avoid overfitting!
A few principles:
• Occam's razor: of two models that explain the data equally well, choose
the simpler one
• Hadamard's well-posed problems: a unique solution that depends smoothly
on the parameters
Regularization: minimize error(fn) + Ω(fn), which penalizes the model
complexity
11
2. Local averaging algorithms
Principle: predict for x based on neighbours of x in the training data
fn(x) = Ψ(∑_{i=1}^n Wi(x) Yi), with ∑_{i=1}^n Wi(x) = 1
Ψ(·) = · for regression, Ψ(·) = sgn(·) for binary classification
12
Examples
Prediction function: fn(x) = Ψ(∑_{i=1}^n Wi(x) Yi), with ∑_{i=1}^n Wi(x) = 1
• k-n.n.: Wi(x) = 1/k if Xi is one of the k nearest neighbours of x, and 0 otherwise
• Nadaraya-Watson algorithm: Wi(x) ∝ K((x − Xi)/h), e.g. K(u) = e^{−‖u‖²}
13
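A minimal numpy sketch of these two weighting schemes and the resulting local-averaging predictor (the function names and the use of the Euclidean norm are illustrative assumptions):

```python
import numpy as np

def knn_weights(x, X, k):
    # Wi(x) = 1/k if Xi is one of the k nearest neighbours of x, 0 otherwise
    dist = np.linalg.norm(X - x, axis=1)
    W = np.zeros(len(X))
    W[np.argsort(dist)[:k]] = 1.0 / k
    return W

def nw_weights(x, X, h):
    # Nadaraya-Watson: Wi(x) proportional to K((x - Xi)/h), with K(u) = exp(-||u||^2)
    K = np.exp(-np.linalg.norm((X - x) / h, axis=1) ** 2)
    return K / K.sum()

def local_average_predict(x, X, Y, W, classification=False):
    s = W @ Y                                   # sum_i Wi(x) Yi
    return np.sign(s) if classification else s  # Psi = sgn for classification, identity for regression

# Example: k-n.n. classification of a new point x with k = 5 (X, Y: training data)
# y_hat = local_average_predict(x, X, Y, knn_weights(x, X, k=5), classification=True)
```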
Performance analysis
Stone’s theorem: Conditions for Wi under which the corresponding
algorithm is consistent!
Example: LS target function f?(x) = E[Y |X = x]. Distribution P on
(X,Y ).
Assume that:
1. X lies a.s. in a bounded set A under P
2. For all x ∈ A, var(Y | X = x) ≤ σ²
3. f⋆ is L-Lipschitz
Then the k-n.n. estimator satisfies:
E[R(fn)] − R(f⋆) ≤ σ²/k + c L² (k/n)^{2/d}
The k-n.n. algorithm is consistent with P if k goes to infinity sublinearly
with n.
14
3. Linear regression: Minimizing the empirical risk
• F = set of linear functions from R^d (Xi ∈ R^d) to R (Yi ∈ R)
• Function fθ ∈ F parametrized by θ ∈ R^{d+1} (by convention x0 = 1):
fθ(x) = θ0 + θ1 x1 + . . . + θd xd = θᵀx
• Least squares model. Empirical risk of θ:
Rn(θ) = (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)²
• Minimal risk achieved for θ⋆ = (XᵀX)^{−1} Xᵀ y, where
y = [Y1 . . . Yn]ᵀ and X is the matrix whose i-th row is Xiᵀ
15
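A quick numpy sketch of this closed-form solution on synthetic data (the data-generating parameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # prepend the convention x0 = 1
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)              # y = f(x) + noise

theta_star = np.linalg.solve(X.T @ X, X.T @ y)             # θ⋆ = (XᵀX)^{-1} Xᵀ y
predict = lambda x: theta_star @ x                         # fθ(x) = θᵀx, with x0 = 1 prepended
```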
Gradient Descent for Regression
Alternative methods to find θ⋆: sequential algorithms.
Batch gradient descent. Repeat:
∀k ∈ {0, 1, . . . , d}, θk := θk − α ∑_{i=1}^n (fθ(Xi) − Yi) Xik
Stochastic gradient descent. Repeat:
1. Select a sample i uniformly at random
2. Perform a descent step using (Xi, Yi) only:
∀k ∈ {0, 1, . . . , d}, θk := θk − α (fθ(Xi) − Yi) Xik
16
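A short numpy sketch of the stochastic variant (the learning rate and iteration count are arbitrary illustrative choices; X and y are as in the previous snippet, with x0 = 1 already prepended):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        i = rng.integers(len(y))              # select a sample i uniformly at random
        grad = (theta @ X[i] - y[i]) * X[i]   # gradient of (1/2)(fθ(Xi) - Yi)^2 w.r.t. θ
        theta -= alpha * grad                 # descent step using (Xi, Yi) only
    return theta
```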
Linear regression: Regularization
• For d > n, XᵀX ∈ R^{d×d} is not invertible, so θ⋆ is not uniquely
defined. Need to put additional constraints on the model to get a
well-posed problem
• Regularization: add a cost for the magnitude of θ:
min_θ (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)² + λ Ω(θ)
Ridge: Ω(θ) = ‖θ‖₂²
LASSO: Ω(θ) = ‖θ‖₁
ℓp: Ω(θ) = ‖θ‖p
p small (< 1): sparse solutions but hard optimization problems
p large (≥ 1): less sparse solutions but convex optimization problems
17
Linear regression: Regularization
• Regularization: add a cost for the magnitude of θ:
min_θ (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)² + λ Ω(θ)
• λ > 0 controls the bias-variance trade-off
- High value of λ: the data has a low weight in the objective function
(low variance but high bias)
- Low value of λ: the data has a high weight in the objective function
(low bias but high variance)
• Solution for Ridge regression: θ⋆ = (XᵀX + nλI)^{−1} Xᵀ y
Prediction: for all x ∈ R^d, fλ(x) = θ⋆ᵀx = yᵀ X (XᵀX + nλI)^{−1} x
18
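A numpy sketch of the ridge solution (the value of λ is arbitrary; X and y as in the earlier least-squares snippet):

```python
import numpy as np

def ridge_fit(X, y, lam):
    n, d = X.shape
    # θ⋆ = (XᵀX + nλI)^{-1} Xᵀ y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

theta_ridge = ridge_fit(X, y, lam=0.1)
predict_ridge = lambda x: theta_ridge @ x
```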
4. Support Vector Machines
SVM is a classification algorithm aiming at maximizing the margin of the
separating hyperplane
The data points (or vectors) realizing the minimum distance to this
hyperplane are called support vectors (as they carry the hyperplane)
19
Support Vector Machines: Derivation
Separating hyperplane Hθ,b: θᵀx + b = 0
θ and b are normalized so that |θᵀXsv + b| = 1, where Xsv is any support
vector, i.e., any Xi ∈ arg min_i d(Xi, Hθ,b).
With this choice, the margin is simply 2/‖θ‖₂.
Max margin problem:
max_{θ,b} 1/‖θ‖₂  s.t.  min_i |θᵀXi + b| = 1
⇐⇒
min_{θ,b} ‖θ‖₂²  s.t.  Yi(θᵀXi + b) ≥ 1, ∀i
A simple quadratic program ...
20
Support Vector Machines: Derivation
Lagrangian: L(θ, b, α) = (1/2) θᵀθ − ∑_i αi (Yi(θᵀXi + b) − 1)
Dual problem:
max_α ∑_i αi − (1/2) ∑_i ∑_j Yi Yj Xiᵀ Xj αi αj
s.t. αi ≥ 0 ∀i, ∑_i αi Yi = 0
A quadratic program: given the solution α⋆, we have the following:
• (complementary slackness) α⋆i > 0 iff Xi is a support vector
• (hyperplane): θ = ∑_i α⋆i Yi Xi, and b is such that Yi(θᵀXi + b) = 1 whenever α⋆i > 0
• (prediction function): fn(x) = sgn(∑_i α⋆i Yi Xiᵀx + b)
Remark. The solution and the prediction can be computed by just
evaluating inner products XiᵀXj and Xiᵀx.
21
Support Vector Machines vs. Local Averaging
The predictor resembles that of a local averaging algorithm:
g(x) = sgn(∑_i Wi(x) Yi + b), with Wi(x) = α⋆i Xiᵀx
But Wi(x) is no longer local ... as the support vectors span the
entire data
22
SVM: Extensions
What if the data is not linearly separable?
• Data almost linearly separable: soft-margin SVM (impose a cost of
misclassification)
• Data not linearly separable at all: go to a higher-dimensional space
where the data becomes separable – the kernel trick
23
5. SVM with the kernel trick
• Non-linear transformation: φ : R^d → R^e where e ≥ d
• Apply the SVM in R^e with the training data (φ(Xi), Yi), i = 1, . . . , n
• The optimal prediction can be constructed by only considering scalar
products
K(Xi, Xj) = φ(Xi)ᵀφ(Xj), and K(Xi, x) = φ(Xi)ᵀφ(x)
• K is called a kernel (by definition: all its Gram matrices are positive
semi-definite)
• Kernel trick: the knowledge of K is enough; we can use spaces
with arbitrarily large dimension!
24
SVM with the kernel trick
SVM with kernel K.
1. Using QP packages, find α⋆ solution of:
max_α ∑_i αi − (1/2) ∑_i ∑_j Yi Yj K(Xi, Xj) αi αj
s.t. αi ≥ 0 ∀i, ∑_i αi Yi = 0
2. Prediction function: fK(x) = sgn(∑_i α⋆i Yi K(Xi, x) + b)
25
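A minimal Python sketch of this recipe; rather than calling a QP package directly, it delegates the dual problem to scikit-learn's SVC with a precomputed Gaussian kernel (the bandwidth h and the helper names are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(A, B, h=1.0):
    # K(x, x') = exp(-||x - x'||^2 / h)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h)

def fit_kernel_svm(X_train, y_train, h=1.0):
    K = gaussian_kernel(X_train, X_train, h)          # Gram matrix K(Xi, Xj)
    return SVC(kernel="precomputed").fit(K, y_train)  # solves the dual QP internally

def predict_kernel_svm(clf, X_train, X_test, h=1.0):
    # prediction only needs the inner products K(Xi, x) between training and test points
    return clf.predict(gaussian_kernel(X_test, X_train, h))
```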
Kernels
• K is a kernel iff there exists a Hilbert space and φ such that
K(x, x′) = φ(x)ᵀφ(x′) (Aronszajn's theorem)
• Examples:
- Polynomial kernel: K(x, x′) = (1 + xᵀx′)^p (e = 1 + pd)
- Gaussian kernel: K(x, x′) = exp(−‖x − x′‖²/h) (e = ∞)
- Kernel cookbook:
http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
26
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
27
6. Neural networks
Loosely inspired by how the brain works [1]. Construct a network of
simplified neurones, with the hope of approximating and learning any
possible function
[1] McCulloch and Pitts, 1943
28
The perceptron
The first artificial neural network with one layer, and σ(x) = sgn(x)
(classification)
Input x ∈ R^d, output in {−1, 1}. Can represent separating hyperplanes.
29
Multilayer perceptrons
They can represent any function from R^d to {−1, 1}
... but the structure depends on the unknown target function f , and is
difficult to optimise
30
From perceptrons to neural networks
... and the number of layers can
rapidly grow with the complexity
of the function
A key idea to make neural networks practical: soft-thresholding ...
31
Soft-thresholding
Replace hard-thresholding function σ by smoother functions
Theorem (Cybenko 1989). Any continuous function f from [0, 1]^d to R
can be approximated arbitrarily well by a function of the form
∑_{j=1}^N αj σ(wjᵀx + bj), where σ is any sigmoid function.
32
Soft-thresholding
Cybenko's theorem tells us that f can be approximated using a single
hidden-layer network ...
A non-constructive proof: how many neurones do we need? This might
depend on f ...
33
Neural networks
A feedforward layered network (deep learning = enough layers)
34
Deep Learning and the ILSVR challenge
Deep learning has outperformed all other techniques in all major machine
learning competitions (image classification, speech recognition and
natural language processing)
The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC).
1. Training: 1.2 million images (227×227), each labeled with one out of
1000 categories
2. Test: 100,000 images (227×227)
3. Error measure: each team predicts 5 (out of 1000) classes per image;
an image is considered correct if at least one of the predictions
matches the ground truth.
35
ILSVR challenge
(Figure from the Stanford CS231n lecture notes)
36
Architectures
37
Architectures
38
Computing with neural networks
• Layer 0: inputs x = (x_1^{(0)}, . . . , x_d^{(0)}), with x_0^{(0)} = 1
• Layers ℓ = 1, . . . , L − 1: hidden layer ℓ with d^{(ℓ)} + 1 nodes; state of node i is
x_i^{(ℓ)}, with x_0^{(ℓ)} = 1
• Layer L: output y = x_1^{(L)}
Signal at node k: s_k^{(ℓ)} = ∑_{i=0}^{d^{(ℓ−1)}} w_{ik}^{(ℓ)} x_i^{(ℓ−1)}
State at node k: x_k^{(ℓ)} = σ(s_k^{(ℓ)})
Output: the state y = x_1^{(L)}
39
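A numpy sketch of this forward computation for an arbitrary list of weight matrices (the choice of tanh for σ, and the weight-matrix layout with row 0 holding the bias weights w_{0k}, are illustrative assumptions):

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    # weights[l] has shape (d(l-1) + 1, d(l)); row 0 corresponds to x_0 = 1
    state = np.asarray(x, dtype=float)
    for W in weights:
        state_with_bias = np.concatenate(([1.0], state))  # x_0^{(l-1)} = 1
        signal = state_with_bias @ W                      # s_k^{(l)} = sum_i w_ik x_i^{(l-1)}
        state = sigma(signal)                             # x_k^{(l)} = sigma(s_k^{(l)})
    return state[0]                                       # output y = x_1^{(L)}
```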
Training neural networks
The output of the network is a function of w = (w_{ij}^{(ℓ)})_{i,j,ℓ}: y = fw(x)
We wish to optimise over w to find the most accurate estimate of the
target function
Training data: (X1, Y1), . . . , (Xn, Yn) ∈ R^d × {−1, 1}
Objective: find w minimising the empirical risk:
E(w) := Rn(fw) = (1/2n) ∑_{l=1}^n |fw(Xl) − Yl|²
40
Stochastic Gradient Descent
E(w) = (1/2n) ∑_{l=1}^n El(w), where El(w) := |fw(Xl) − Yl|²
In each iteration of the SGD algorithm, only one function El is
considered ...
Parameter. Learning rate α > 0
1. Initialization. w := w0
2. Sample selection. Select l uniformly at random in {1, . . . , n}
3. GD iteration. w := w − α ∇El(w), go to 2.
Is there an efficient way of computing ∇El(w)?
41
Backpropagation
We fix l, and introduce e(w) = El(w).
Let us compute ∇e(w). By the chain rule,
∂e/∂w_{ij}^{(ℓ)} = (∂e/∂s_j^{(ℓ)}) × (∂s_j^{(ℓ)}/∂w_{ij}^{(ℓ)}) = δ_j^{(ℓ)} x_i^{(ℓ−1)},
where δ_j^{(ℓ)} := ∂e/∂s_j^{(ℓ)} and we used ∂s_j^{(ℓ)}/∂w_{ij}^{(ℓ)} = x_i^{(ℓ−1)}.
The sensitivity of the error w.r.t. the signal at node j can be computed
recursively ...
42
Backward recursion
Output layer. δ_1^{(L)} := ∂e/∂s_1^{(L)}, and e(w) = (σ(s_1^{(L)}) − Yl)², so
δ_1^{(L)} = 2 (x_1^{(L)} − Yl) σ′(s_1^{(L)})
From layer ℓ to layer ℓ − 1.
δ_i^{(ℓ−1)} := ∂e/∂s_i^{(ℓ−1)} = ∑_{j=1}^{d^{(ℓ)}} (∂e/∂s_j^{(ℓ)}) × (∂s_j^{(ℓ)}/∂x_i^{(ℓ−1)}) × (∂x_i^{(ℓ−1)}/∂s_i^{(ℓ−1)})
= ∑_{j=1}^{d^{(ℓ)}} δ_j^{(ℓ)} w_{ij}^{(ℓ)} σ′(s_i^{(ℓ−1)})
Summary.
∂El/∂w_{ij}^{(ℓ)} = δ_j^{(ℓ)} x_i^{(ℓ−1)},    δ_i^{(ℓ−1)} = ∑_{j=1}^{d^{(ℓ)}} δ_j^{(ℓ)} w_{ij}^{(ℓ)} σ′(s_i^{(ℓ−1)})
43
Backpropagation algorithm
Parameter. Learning rate α > 0
Input. (X1, Y1), . . . , (Xn, Yn) ∈ R^d × {−1, 1}
1. Initialization. w := w0
2. Sample selection. Select l uniformly at random in {1, . . . , n}
3. Gradient of El.
• x_i^{(0)} := X_{li} for all i = 1, . . . , d
• Forward propagation: compute the signal and state at each
node (s_i^{(ℓ)}, x_i^{(ℓ)})
• Backward propagation: propagate Yl back to compute δ_i^{(ℓ)}
at each node and the partial derivatives ∂El/∂w_{ij}^{(ℓ)}
4. GD iteration. w := w − α ∇El(w), go to 2.
44
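A compact numpy sketch of one backpropagation/SGD step for a network with tanh units (the architecture, initialization and learning rate in the usage lines are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

sigma, dsigma = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2  # sigma and sigma'

def backprop_step(weights, x_l, y_l, alpha=0.1):
    # Forward propagation: store the signal s and state x at every layer
    states, signals = [np.asarray(x_l, dtype=float)], []
    for W in weights:                                # W has shape (d(l-1) + 1, d(l)); row 0 = bias
        s = np.concatenate(([1.0], states[-1])) @ W  # s_k = sum_i w_ik x_i, with x_0 = 1
        signals.append(s)
        states.append(sigma(s))
    # Output layer: delta_1^(L) = 2 (x_1^(L) - Yl) sigma'(s_1^(L))
    delta = 2.0 * (states[-1] - y_l) * dsigma(signals[-1])
    # Backward propagation and GD update, from layer L down to layer 1
    for l in reversed(range(len(weights))):
        grad = np.outer(np.concatenate(([1.0], states[l])), delta)       # dEl/dw_ij = delta_j x_i
        if l > 0:                                                        # delta_i^(l-1) = sum_j delta_j w_ij sigma'(s_i^(l-1))
            delta = (weights[l][1:, :] @ delta) * dsigma(signals[l - 1])
        weights[l] -= alpha * grad                                       # w := w - alpha * grad(El)
    return weights

# Example usage: one SGD step for a (hypothetical) 3 -> 4 -> 1 network
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(5, 1))]
weights = backprop_step(weights, x_l=rng.normal(size=3), y_l=1.0)
```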