Bits of Machine Learning
Part 1: Supervised Learning
Alexandre Proutiere and Vahan Petrosyan
KTH (The Royal Institute of Technology)
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
1
Learning
Supervised (or Predictive) learning
Learn a mapping from inputs x to outputs y, given a labeled set
of input-output pairs (the training set)
Dn = {(Xi, Yi), i = 1, . . . , n}
We learn the classification function f: f = 1 if versicolor, f = −1 if virginica
2
Learning
Unsupervised (or Descriptive) learning
Find interesting patterns in the data Dn = {Xi, i = 1, . . . , n}
We learn there are 2 distinct types of iris and how to distinguish them!
3
Examples
Supervised learning:
• digit, flower, picture, ... recognition
• music classification
• predict prices
• predict the outcome of chemical reactions
• source separation: identify the instruments present in a recording
• ...
Unsupervised learning:
• classification (without a training set): spam filtering, news aggregation (Google News),
...
• identify structures and causes in the data: e.g. J. Snow – cholera
deaths vs. contaminated water sources
• image segmentation
• ...
4
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
5
Supervised Learning: Outline
1. A statistical framework
2. Algorithms using Local Averaging (k-nn, kernels)
3. Linear regression
4. Support Vector Machines
5. The kernel trick
6. Neural Networks
6
1. Supervised learning
• Training set: Dn = {(Xi, Yi), i = 1, . . . , n}
- Input features: Xi ∈ R^d
- Output: Yi ∈ Y, with Y = R for regression (price, position, etc.)
and Y finite for classification (type, mode, etc.)
• y is a non-deterministic and complicated function of x,
i.e., y = f(x) + z where z is unknown (e.g. noise). Goal: learn f.
• Learning algorithm: maps the training data Dn to an estimate fn of f
7
A Statistical Approach
• (Xi, Yi), i = 1, . . . , n are i.i.d. random variables with joint
distribution P
• Model: look for an estimate of f in a class of functions F , e.g.
linear, polynomial, ...
• Learning algorithm: A : Dn ↦ fn ∈ F, fn estimates the true
function f
• Performance of predictions defined through a loss function ℓ
- Example 1. Regression, Least Squares (LS): Y = R,
ℓ(y, y′) = (1/2)|y − y′|²
- Example 2. Classification in Y = {0, 1}: ℓ(y, y′) = 1_{y ≠ y′}
8
Risks
• The risk of a prediction function g is R(g) := E_P[ℓ(g(X), Y)]
(often referred to as out-of-sample error)
Target function: f⋆ := arg min_{g∈F} R(g)
• Empirical risk: Rn(g) := (1/n) ∑_{i=1}^n ℓ(g(Xi), Yi)
(often referred to as in-sample error)
Law of large numbers: lim_{n→∞} Rn(g) = R(g) almost surely
Central limit theorem: √n (Rn(g) − R(g)) converges in distribution to N(0, Var_P(ℓ(g(X), Y)))
• Consistent algorithm: lim_{n→∞} E[R(fn)] = R(f⋆)
Warning. Consistency ≠ good algorithm (F can be poor, f ≠ f⋆)
9
LS regression and binary classification
Example 1. Regression, Least Squares (LS)
• Risk R(g) = (1/2) E[|g(X) − Y|²]
• Target function f⋆(x) = E[Y | X = x]
Example 2. Binary classification: Y = {−1, 1}
• Risk R(g) = P(g(X) ≠ Y)
• Target function f⋆(x) = arg max_{k ∈ {−1,1}} P(Y = k | X = x)
10
Model selection - Choosing F
Avoid overfitting!
A few principles:
• Occam's razor: of two models that explain the data equally well, choose
the simpler one
• Hadamard's well-posed problems: a unique solution that depends smoothly
on the parameters
Regularization: minimize error(fn) + Ω(fn), which penalizes the model
complexity
11
2. Local averaging algorithms
Principle: predict for x based on neighbours of x in the training data
fn(x) = Ψ(∑_{i=1}^n Wi(x) Yi), with ∑_{i=1}^n Wi(x) = 1
Ψ(·) = · for regression, Ψ(·) = sgn(·) for binary classification
12
Examples
Prediction function: fn(x) = Ψ(∑_{i=1}^n Wi(x) Yi), with ∑_{i=1}^n Wi(x) = 1
• k-n.n.: Wi(x) = 1/k if Xi is one of the k nearest neighbours of x, and 0 otherwise
• Nadaraya-Watson algorithm: Wi(x) ∝ K((x − Xi)/h), e.g. K(u) = e^{−‖u‖²}
13
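A minimal numpy sketch of these two weighting schemes and the resulting local-averaging predictor (the function names and the use of the Euclidean norm are illustrative assumptions):

```python
import numpy as np

def knn_weights(x, X, k):
    # Wi(x) = 1/k if Xi is one of the k nearest neighbours of x, 0 otherwise
    dist = np.linalg.norm(X - x, axis=1)
    W = np.zeros(len(X))
    W[np.argsort(dist)[:k]] = 1.0 / k
    return W

def nw_weights(x, X, h):
    # Nadaraya-Watson: Wi(x) proportional to K((x - Xi)/h), with K(u) = exp(-||u||^2)
    K = np.exp(-np.linalg.norm((X - x) / h, axis=1) ** 2)
    return K / K.sum()

def local_average_predict(x, X, Y, W, classification=False):
    s = W @ Y                                   # sum_i Wi(x) Yi
    return np.sign(s) if classification else s  # Psi = sgn for classification, identity for regression

# Example: k-n.n. classification of a new point x with k = 5 (X, Y: training data)
# y_hat = local_average_predict(x, X, Y, knn_weights(x, X, k=5), classification=True)
```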
Performance analysis
Stone’s theorem: Conditions for Wi under which the corresponding
algorithm is consistent!
Example: LS target function f?(x) = E[Y |X = x]. Distribution P on
(X,Y ).
Assume that:
1. X lies a.s. in a bounded set A under P
2. For all x ∈ A, var(Y | X = x) ≤ σ²
3. f⋆ is L-Lipschitz
Then the k-n.n. estimator satisfies:
E[R(fn)] − R(f⋆) ≤ σ²/k + c L² (k/n)^{2/d}
The k-n.n. algorithm is consistent with P if k goes to infinity sublinearly
with n.
14
3. Linear regression: Minimizing the empirical risk
• F = set of linear functions from R^d (Xi ∈ R^d) to R (Yi ∈ R)
• Function fθ ∈ F parametrized by θ ∈ R^{d+1} (by convention x0 = 1):
fθ(x) = θ0 + θ1 x1 + . . . + θd xd = θᵀx
• Least squares model. Empirical risk of θ:
Rn(θ) = (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)²
• Minimal risk achieved for θ⋆ = (XᵀX)^{−1} Xᵀ y, where
y = [Y1 . . . Yn]ᵀ and X is the matrix whose i-th row is Xiᵀ
15
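A quick numpy sketch of this closed-form solution on synthetic data (the data-generating parameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # prepend the convention x0 = 1
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)              # y = f(x) + noise

theta_star = np.linalg.solve(X.T @ X, X.T @ y)             # θ⋆ = (XᵀX)^{-1} Xᵀ y
predict = lambda x: theta_star @ x                         # fθ(x) = θᵀx, with x0 = 1 prepended
```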
Gradient Descent for Regression
Alternative methods to find θ⋆: sequential algorithms.
Batch gradient descent. Repeat:
∀k ∈ {0, 1, . . . , d}, θk := θk − α ∑_{i=1}^n (fθ(Xi) − Yi) Xik
Stochastic gradient descent. Repeat:
1. Select a sample i uniformly at random
2. Perform a descent step using (Xi, Yi) only:
∀k ∈ {0, 1, . . . , d}, θk := θk − α (fθ(Xi) − Yi) Xik
16
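A short numpy sketch of the stochastic variant (the learning rate and iteration count are arbitrary illustrative choices; X and y are as in the previous snippet, with x0 = 1 already prepended):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        i = rng.integers(len(y))              # select a sample i uniformly at random
        grad = (theta @ X[i] - y[i]) * X[i]   # gradient of (1/2)(fθ(Xi) - Yi)^2 w.r.t. θ
        theta -= alpha * grad                 # descent step using (Xi, Yi) only
    return theta
```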
Linear regression: Regularization
• For d > n, XᵀX ∈ R^{d×d} is not invertible, so θ⋆ is not uniquely
defined. Need to put additional constraints on the model to get a
well-posed problem
• Regularization: add a cost for the magnitude of θ:
min_θ (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)² + λ Ω(θ)
Ridge: Ω(θ) = ‖θ‖₂²
LASSO: Ω(θ) = ‖θ‖₁
ℓp: Ω(θ) = ‖θ‖p
p small (< 1): sparse solutions but hard optimization problems
p large (≥ 1): less sparse solutions but convex optimization problems
17
Linear regression: Regularization
• Regularization: add a cost for the magnitude of θ:
min_θ (1/2n) ∑_{i=1}^n (fθ(Xi) − Yi)² + λ Ω(θ)
• λ > 0 controls the bias-variance trade-off
- High value of λ: the data has a low weight in the objective function
(low variance but high bias)
- Low value of λ: the data has a high weight in the objective function
(low bias but high variance)
• Solution for Ridge regression: θ⋆ = (XᵀX + nλI)^{−1} Xᵀ y
Prediction: for all x ∈ R^d, fλ(x) = θ⋆ᵀx = yᵀ X (XᵀX + nλI)^{−1} x
18
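A numpy sketch of the ridge solution (the value of λ is arbitrary; X and y as in the earlier least-squares snippet):

```python
import numpy as np

def ridge_fit(X, y, lam):
    n, d = X.shape
    # θ⋆ = (XᵀX + nλI)^{-1} Xᵀ y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

theta_ridge = ridge_fit(X, y, lam=0.1)
predict_ridge = lambda x: theta_ridge @ x
```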
4. Support Vector Machines
SVM is a classification algorithm aiming at maximizing the margin of the
separating hyperplane
The data points (or vectors) realizing the minimum distance to this
hyperplane are called support vectors (as they carry the hyperplane)
19
Support Vector Machines: Derivation
Separating hyperplane Hθ,b: θᵀx + b = 0
θ and b are normalized so that |θᵀXsv + b| = 1, where Xsv is any support
vector, i.e., any Xi ∈ arg min_i d(Xi, Hθ,b).
With this choice, the margin is simply 2/‖θ‖₂.
Max margin problem:
max_{θ,b} 1/‖θ‖₂  s.t.  min_i |θᵀXi + b| = 1
⇐⇒
min_{θ,b} ‖θ‖₂²  s.t.  Yi(θᵀXi + b) ≥ 1, ∀i
A simple quadratic program ...
20
Support Vector Machines: Derivation
Lagrangian: L(θ, b, α) = (1/2) θᵀθ − ∑_i αi (Yi(θᵀXi + b) − 1)
Dual problem:
max_α ∑_i αi − (1/2) ∑_i ∑_j Yi Yj Xiᵀ Xj αi αj
s.t. αi ≥ 0 ∀i, ∑_i αi Yi = 0
A quadratic program: given the solution α⋆, we have the following:
• (complementary slackness) α⋆i > 0 iff Xi is a support vector
• (hyperplane): θ = ∑_i α⋆i Yi Xi, and b is such that Yi(θᵀXi + b) = 1 whenever α⋆i > 0
• (prediction function): fn(x) = sgn(∑_i α⋆i Yi Xiᵀx + b)
Remark. The solution and the prediction can be computed by just
evaluating inner products XiᵀXj and Xiᵀx.
21
Support Vector Machines vs. Local Averaging
The predictor resembles that of a local averaging algorithm:
g(x) = sgn(∑_i Wi(x) Yi + b), with Wi(x) = α⋆i Xiᵀx
But Wi(x) is no longer local ... as the support vectors span the
entire data
22
SVM: Extensions
What if the data is not linearly separable?
• Data almost linearly separable: soft-margin SVM (impose a cost of
misclassification)
• Data not linearly separable at all: go to a higher-dimensional space
where the data becomes separable – the kernel trick
23
5. SVM with the kernel trick
• Non-linear transformation: φ : R^d → R^e where e ≥ d
• Apply the SVM in R^e with the training data (φ(Xi), Yi), i = 1, . . . , n
• The optimal prediction can be constructed by only considering scalar
products
K(Xi, Xj) = φ(Xi)ᵀφ(Xj), and K(Xi, x) = φ(Xi)ᵀφ(x)
• K is called a kernel (by definition: all its Gram matrices are positive
semi-definite)
• Kernel trick: the knowledge of K is enough; we can use spaces
with arbitrarily large dimension!
24
SVM with the kernel trick
SVM with kernel K.
1. Using QP packages, find α⋆ solution of:
max_α ∑_i αi − (1/2) ∑_i ∑_j Yi Yj K(Xi, Xj) αi αj
s.t. αi ≥ 0 ∀i, ∑_i αi Yi = 0
2. Prediction function: fK(x) = sgn(∑_i α⋆i Yi K(Xi, x) + b)
25
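A minimal Python sketch of this recipe; rather than calling a QP package directly, it delegates the dual problem to scikit-learn's SVC with a precomputed Gaussian kernel (the bandwidth h and the helper names are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(A, B, h=1.0):
    # K(x, x') = exp(-||x - x'||^2 / h)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h)

def fit_kernel_svm(X_train, y_train, h=1.0):
    K = gaussian_kernel(X_train, X_train, h)          # Gram matrix K(Xi, Xj)
    return SVC(kernel="precomputed").fit(K, y_train)  # solves the dual QP internally

def predict_kernel_svm(clf, X_train, X_test, h=1.0):
    # prediction only needs the inner products K(Xi, x) between training and test points
    return clf.predict(gaussian_kernel(X_test, X_train, h))
```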
Kernels
• K is a kernel iff there exists a Hilbert space and φ such that
K(x, x′) = φ(x)ᵀφ(x′) (Aronszajn's theorem)
• Examples:
- Polynomial kernel: K(x, x′) = (1 + xᵀx′)^p (e = 1 + pd)
- Gaussian kernel: K(x, x′) = exp(−‖x − x′‖²/h) (e = ∞)
- Kernel cookbook:
http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
26
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
27
6. Neural networks
Loosely inspired by how the brain works [1]. Construct a network of
simplified neurones, with the hope of approximating and learning any
possible function
[1] McCulloch and Pitts, 1943
28
The perceptron
The first artificial neural network with one layer, and σ(x) = sgn(x)
(classification)
Input x ∈ R^d, output in {−1, 1}. Can represent separating hyperplanes.
29
Multilayer perceptrons
They can represent any function from R^d to {−1, 1}
... but the structure depends on the unknown target function f , and is
difficult to optimise
30
From perceptrons to neural networks
... and the number of layers can
rapidly grow with the complexity
of the function
A key idea to make neural networks practical: soft-thresholding ...
31
Soft-thresholding
Replace hard-thresholding function σ by smoother functions
Theorem (Cybenko 1989). Any continuous function f from [0, 1]^d to R
can be approximated arbitrarily well by a function of the form
∑_{j=1}^N αj σ(wjᵀx + bj), where σ is any sigmoid function.
32
Soft-thresholding
Cybenko's theorem tells us that f can be approximated using a single
hidden-layer network ...
A non-constructive proof: how many neurones do we need? This might
depend on f ...
33
Neural networks
A feedforward layered network (deep learning = enough layers)
34
Deep Learning and the ILSVR challenge
Deep learning has outperformed all other techniques in all major machine
learning competitions (image classification, speech recognition and
natural language processing)
The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC).
1. Training: 1.2 million images (227×227), each labeled with one out of
1000 categories
2. Test: 100,000 images (227×227)
3. Error measure: each team predicts 5 (out of 1000) classes per image;
an image is considered correct if at least one of the predictions
matches the ground truth.
35
ILSVR challenge
(Figure from the Stanford CS231n lecture notes)
36
Architectures
37
Architectures
38
Computing with neural networks
• Layer 0: inputs x = (x_1^{(0)}, . . . , x_d^{(0)}), with x_0^{(0)} = 1
• Layers ℓ = 1, . . . , L − 1: hidden layer ℓ with d^{(ℓ)} + 1 nodes; state of node i is
x_i^{(ℓ)}, with x_0^{(ℓ)} = 1
• Layer L: output y = x_1^{(L)}
Signal at node k: s_k^{(ℓ)} = ∑_{i=0}^{d^{(ℓ−1)}} w_{ik}^{(ℓ)} x_i^{(ℓ−1)}
State at node k: x_k^{(ℓ)} = σ(s_k^{(ℓ)})
Output: the state y = x_1^{(L)}
39
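A numpy sketch of this forward computation for an arbitrary list of weight matrices (the choice of tanh for σ, and the weight-matrix layout with row 0 holding the bias weights w_{0k}, are illustrative assumptions):

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    # weights[l] has shape (d(l-1) + 1, d(l)); row 0 corresponds to x_0 = 1
    state = np.asarray(x, dtype=float)
    for W in weights:
        state_with_bias = np.concatenate(([1.0], state))  # x_0^{(l-1)} = 1
        signal = state_with_bias @ W                      # s_k^{(l)} = sum_i w_ik x_i^{(l-1)}
        state = sigma(signal)                             # x_k^{(l)} = sigma(s_k^{(l)})
    return state[0]                                       # output y = x_1^{(L)}
```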
Training neural networks
The output of the network is a function of w = (w_{ij}^{(ℓ)})_{i,j,ℓ}: y = fw(x)
We wish to optimise over w to find the most accurate estimate of the
target function
Training data: (X1, Y1), . . . , (Xn, Yn) ∈ R^d × {−1, 1}
Objective: find w minimising the empirical risk:
E(w) := Rn(fw) = (1/2n) ∑_{l=1}^n |fw(Xl) − Yl|²
40
Stochastic Gradient Descent
E(w) = (1/2n) ∑_{l=1}^n El(w), where El(w) := |fw(Xl) − Yl|²
In each iteration of the SGD algorithm, only one function El is
considered ...
Parameter. Learning rate α > 0
1. Initialization. w := w0
2. Sample selection. Select l uniformly at random in {1, . . . , n}
3. GD iteration. w := w − α ∇El(w), go to 2.
Is there an efficient way of computing ∇El(w)?
41
Backpropagation
We fix l, and introduce e(w) = El(w).
Let us compute ∇e(w). By the chain rule,
∂e/∂w_{ij}^{(ℓ)} = (∂e/∂s_j^{(ℓ)}) × (∂s_j^{(ℓ)}/∂w_{ij}^{(ℓ)}) = δ_j^{(ℓ)} x_i^{(ℓ−1)},
where δ_j^{(ℓ)} := ∂e/∂s_j^{(ℓ)} and we used ∂s_j^{(ℓ)}/∂w_{ij}^{(ℓ)} = x_i^{(ℓ−1)}.
The sensitivity of the error w.r.t. the signal at node j can be computed
recursively ...
42
Backward recursion
Output layer. δ_1^{(L)} := ∂e/∂s_1^{(L)}, and e(w) = (σ(s_1^{(L)}) − Yl)², so
δ_1^{(L)} = 2 (x_1^{(L)} − Yl) σ′(s_1^{(L)})
From layer ℓ to layer ℓ − 1.
δ_i^{(ℓ−1)} := ∂e/∂s_i^{(ℓ−1)} = ∑_{j=1}^{d^{(ℓ)}} (∂e/∂s_j^{(ℓ)}) × (∂s_j^{(ℓ)}/∂x_i^{(ℓ−1)}) × (∂x_i^{(ℓ−1)}/∂s_i^{(ℓ−1)})
= ∑_{j=1}^{d^{(ℓ)}} δ_j^{(ℓ)} w_{ij}^{(ℓ)} σ′(s_i^{(ℓ−1)})
Summary.
∂El/∂w_{ij}^{(ℓ)} = δ_j^{(ℓ)} x_i^{(ℓ−1)},    δ_i^{(ℓ−1)} = ∑_{j=1}^{d^{(ℓ)}} δ_j^{(ℓ)} w_{ij}^{(ℓ)} σ′(s_i^{(ℓ−1)})
43
Backpropagation algorithm
Parameter. Learning rate α > 0
Input. (X1, Y1), . . . , (Xn, Yn) ∈ R^d × {−1, 1}
1. Initialization. w := w0
2. Sample selection. Select l uniformly at random in {1, . . . , n}
3. Gradient of El.
• x_i^{(0)} := X_{li} for all i = 1, . . . , d
• Forward propagation: compute the signal and state at each
node (s_i^{(ℓ)}, x_i^{(ℓ)})
• Backward propagation: propagate Yl back to compute δ_i^{(ℓ)}
at each node and the partial derivatives ∂El/∂w_{ij}^{(ℓ)}
4. GD iteration. w := w − α ∇El(w), go to 2.
44
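A compact numpy sketch of one backpropagation/SGD step for a network with tanh units (the architecture, initialization and learning rate in the usage lines are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

sigma, dsigma = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2  # sigma and sigma'

def backprop_step(weights, x_l, y_l, alpha=0.1):
    # Forward propagation: store the signal s and state x at every layer
    states, signals = [np.asarray(x_l, dtype=float)], []
    for W in weights:                                # W has shape (d(l-1) + 1, d(l)); row 0 = bias
        s = np.concatenate(([1.0], states[-1])) @ W  # s_k = sum_i w_ik x_i, with x_0 = 1
        signals.append(s)
        states.append(sigma(s))
    # Output layer: delta_1^(L) = 2 (x_1^(L) - Yl) sigma'(s_1^(L))
    delta = 2.0 * (states[-1] - y_l) * dsigma(signals[-1])
    # Backward propagation and GD update, from layer L down to layer 1
    for l in reversed(range(len(weights))):
        grad = np.outer(np.concatenate(([1.0], states[l])), delta)       # dEl/dw_ij = delta_j x_i
        if l > 0:                                                        # delta_i^(l-1) = sum_j delta_j w_ij sigma'(s_i^(l-1))
            delta = (weights[l][1:, :] @ delta) * dsigma(signals[l - 1])
        weights[l] -= alpha * grad                                       # w := w - alpha * grad(El)
    return weights

# Example usage: one SGD step for a (hypothetical) 3 -> 4 -> 1 network
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(5, 1))]
weights = backprop_step(weights, x_l=rng.normal(size=3), y_l=1.0)
```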