Nonlinear models for Classification and Regression
State of the Art Methods of Data Modeling and Machine Learning, IMRIS program, Fall 2011
Marco Trincavelli
Mobile Robotics and Olfaction Lab, AASS Research Centre, Örebro University
21/11/2011
Acknowledgments
These slides have been adapted from the slides used in previous years
for the Machine Learning course at Örebro University.
My gratitude to the former teachers of this course, Erik Berglund and Thorsteinn Rögnvaldsson, who provided me with their slides and greatly simplified my work.
Repetition
1. Repetition
2. Nonlinear models for regression
3. Nonlinear models for classification
4. Artificial Neural Networks
Summary of previous lecture
Learning issues: Generalization; Bias & Variance; Hypothesis space; Cost of inputs.
Linear systems: Linear regression; LMS (adaptive filters); Simple perceptron; Gaussian PDF-based classifier; Logistic regression.
Summary – Classification
What does classification mean?
Decision theory
Bayes rule
Linear classifiers
Simple Perceptron
Linear Gaussian Classifier
Logistic Regression
Summary – Regression
The fixed regressor assumption: noise in output, static model
Bias & Variance
Error Measures
Analytical solution or learning (gradient descent)
Linear regressors: Linear regression; Ridge regression; LMS (on-line learning)
Bayes’ Rule
$$p(c_k, x) = p(x, c_k) \;\Rightarrow\; p(c_k \mid x) = \frac{p(x \mid c_k)\, p(c_k)}{p(x)}$$
where
$$p(x) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)$$
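As a minimal numerical sketch of Bayes' rule (the class-conditional likelihoods and priors below are made-up values for three hypothetical classes, not from the slides):

```python
import numpy as np

likelihoods = np.array([0.05, 0.20, 0.01])  # p(x|c_k) evaluated at one x (assumed values)
priors = np.array([0.50, 0.30, 0.20])       # p(c_k); must sum to 1

evidence = np.sum(likelihoods * priors)      # p(x) = sum_k p(x|c_k) p(c_k)
posterior = likelihoods * priors / evidence  # p(c_k|x) by Bayes' rule

print(posterior, posterior.sum())  # the posteriors sum to 1
```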
Assumptions about the process
The ”fixed regressor” model: x(n) Observed input
y(n) Observed output
g[x(n)] True underlying function
e(n) I.I.D. noise process with zero mean
$$y(n) = g[x(n)] + e(n)$$
Data set: $D = \{(x(n), y(n))\}_{n=1}^{N}$
Idealized regression
Use (find) appropriate model family F.
Find f(x) in F with minimum “distance” to g(x) (“error”).
Modify the model parameters until the “error” is minimized.
[Diagram: the model family F (our hypothesis set), the true function g(x), the model f(x) in F, and the error between f and g.]
Error I – Summed Square Error (SSE)
w are the parameters of the function f.
SSE assumes zero-mean IID noise.
SSE is the error measure used in least-squares fitting.
$$E_{SSE} = \sum_{n=1}^{N} \left[ f(x(n), w) - y(n) \right]^2$$
Error II – Negative log-likelihood
w are the parameters of the function f.
D is the dataset.
It is common to assume normally distributed noise, which leads to:
$$E = -\ln P(D \mid w) = -\sum_{n=1}^{N} \ln p\big( f(x(n), w) - y(n) \big) \;\propto\; E_{SSE}$$
Error III – The Bayesian error measure
Allows including a prior belief, expressed in p(w), about the function f(x, w). A common example is:
$$p(w) \propto \exp\left( -\frac{\|w\|^2}{2\sigma_w^2} \right)$$
$$E = -\ln p(w \mid D) = -\ln \frac{p(D \mid w)\, p(w)}{p(D)} = -L - \ln p(w) + \ln p(D)$$
Linear regression
We assume a linear process: $y(x) = w^{*T} x + e$
We use a linear model family F: $\hat{y}(x) = f(x, w) = w^T x$
...and the goal is to make $w = w^*$. Analytical solution: $w = (X^T X)^{-1} X^T y = X^{\dagger} y$
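A minimal numpy sketch of the analytical solution on synthetic data (the data and names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.5])          # the unknown w*
X = rng.normal(size=(100, 3))                # observed inputs
y = X @ w_true + 0.1 * rng.normal(size=100)  # y = w*^T x + e

# w = (X^T X)^{-1} X^T y; lstsq (or pinv) is numerically
# safer than forming the inverse explicitly.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to w_true
```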
Gradient Descent
$$\Delta w = -\eta \nabla_w E(w)$$
Go downhill.
The learning rate η is set heuristically.
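A minimal sketch of gradient descent on the SSE for a linear model (function name and constants are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iter=1000):
    """Minimize the SSE of f(x, w) = w^T x by repeated downhill steps."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        e = X @ w - y          # residuals f(x(n), w) - y(n)
        grad = 2 * X.T @ e     # gradient of the SSE w.r.t. w
        w -= eta * grad        # step downhill: Δw = -η ∇_w E(w)
    return w
```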
Bias & Variance
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma_{\epsilon}^2$$
A comment on learning...
Learning can be done in two forms:
Storing the information as examples (e.g. a look-up table). This
requires a ”distance” measure between samples.
Storing the information in the form of parameters w of a function
(e.g. linear regression). This requires a parameter update equation
(e.g. gradient descent).
There are intermediate forms of this, e.g. models that are updated locally around examples.
Nonlinear models for regression
1. Repetition
2. Nonlinear models for regression
3. Nonlinear models for classification
4. Artificial Neural Networks
Nonlinear regression
We assume a nonlinear process, with i.i.d. noise $e$: $y(x) = g(x) + e$
We use a nonlinear model family F: $\hat{y}(x) = f(x, w)$
Polynomial model family
Linear in w: it reduces to the linear regression case, but with more variables.
$$f(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$$
Polynomial regression – 1 dimension
$$X = \begin{pmatrix} 1 & x(1) & \cdots & x(1)^M \\ 1 & x(2) & \cdots & x(2)^M \\ \vdots & \vdots & & \vdots \\ 1 & x(N) & \cdots & x(N)^M \end{pmatrix}$$
Analytic solution: $w = (X^T X)^{-1} X^T y = X^{\dagger} y$
Requires the estimation of M+1 parameters, where M is the
order of the polynomial.
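A sketch of 1-dimensional polynomial regression via the design matrix, using synthetic data for illustration (names are not from the slides):

```python
import numpy as np

def poly_design_matrix(x, M):
    """Rows [1, x, x^2, ..., x^M]: M+1 parameters to estimate."""
    return np.vander(x, M + 1, increasing=True)

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).normal(size=50)
X = poly_design_matrix(x, M=5)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # w = X† y
y_hat = X @ w
```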
Polynomial regression – 2 dimensions
The number of parameters to estimate scales as $M^D$, where M is the order of the polynomial and D the dimensionality of the input space.
$$f(x; w) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + \cdots + w_{M^2} x_1^M x_2^M$$
$$X = \begin{pmatrix} 1 & x_1(1) & x_2(1) & x_1(1) x_2(1) & \cdots & x_1(1)^M x_2(1)^M \\ 1 & x_1(2) & x_2(2) & x_1(2) x_2(2) & \cdots & x_1(2)^M x_2(2)^M \\ \vdots & & & & & \vdots \\ 1 & x_1(N) & x_2(N) & x_1(N) x_2(N) & \cdots & x_1(N)^M x_2(N)^M \end{pmatrix}$$
Example – Polynomial model
The true function is
a Bessel function
Generalized Linear model
Linear in the parameters w. Reduces to the linear regression case,
but with more variables.
Requires a good guess on the basis functions hk(x).
$$f(x; w) = w_0 + w_1 h_1(x) + \cdots + w_M h_M(x)$$
Example – Generalized Linear model
$$f(x, w) = w_1 J_1(x) + w_2 J_2(x) + \cdots$$
$J_k(x)$ is a Bessel function.
Fourier Series
Fourier series are another example of a generalized linear model:
$$f(x; w) = w_0 + \sum_k w_k \exp\left( i \alpha_k^T x \right)$$
K Nearest Neighbour regression
The prediction equals y of the nearest neighbour (K=1).
The prediction equals the average, mode, median, etc... of the y
of K nearest neighbours.
The prediction equals the weighted average of the y of K nearest
neighbours.
$$\hat{y}(x) = \sum_{k=1}^{K} w(r_k)\, y(m_k)$$
where $m_k$ is the index of the k-th neighbour and $r_k$ is the distance $\|x - x(m_k)\|$.
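A sketch of K nearest neighbour regression covering the uniform and inverse-distance weightings described here and on the following slides (all names are illustrative):

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, K=3, weighting="uniform"):
    """Predict y at x_query as a (weighted) average of the K nearest y's."""
    r = np.linalg.norm(X_train - x_query, axis=1)  # distances r_k
    idx = np.argsort(r)[:K]                        # neighbour indices m_k
    if weighting == "uniform":
        w = np.full(K, 1.0 / K)                    # w(r_k) = 1/K
    else:
        w = 1.0 / (r[idx] + 1e-12)                 # w(r_k) proportional to 1/r_k
        w /= w.sum()
    return np.sum(w * y_train[idx])
```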
1 Nearest Neighbour
$$w(r_k) = \begin{cases} 1 & \text{for } k = 1 \\ 0 & \text{otherwise} \end{cases}$$
In the example: $m_1 = 2$, so $\hat{y} = y(2)$.
3 Nearest Neighbours
$$w(r_k) = \begin{cases} 1/K & \text{for } k \le K \\ 0 & \text{otherwise} \end{cases}$$
In the example: $m_1 = 2$, $m_2 = 4$, $m_3 = 5$, so $\hat{y} = \big( y(2) + y(4) + y(5) \big) / 3$.
Linear Interpolation
$1/r$ is an interpolation kernel. Can consider all the observations in the dataset.
$$w(r_k) = \frac{1/r_k}{\sum_{m=1}^{N} 1/r_m}$$
$$\hat{y}(x) = \frac{\sum_{k=1}^{N} y(m_k)/r_k}{\sum_{k=1}^{N} 1/r_k}$$
Example: KNNR
Kernel Regression
Kernel functions are placed around each $x(n)$.
Example: the Nadaraya-Watson estimator (Bishop's book, Ch. 6.3.1):
$$f(x; w_0) = \frac{\sum_{n=1}^{N} y(n) \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}$$
i.e. the weights are
$$w(r_k) = \frac{\exp\left[ -r_k^2 / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r_n^2 / 2w_0^2 \right]}$$
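A sketch of the Nadaraya-Watson estimator with a Gaussian kernel (function name illustrative; w0 is the kernel width to be tuned):

```python
import numpy as np

def nadaraya_watson(x_query, X_train, y_train, w0=0.1):
    r2 = np.sum((X_train - x_query) ** 2, axis=1)  # squared distances r^2(x(n))
    k = np.exp(-r2 / (2 * w0 ** 2))                # Gaussian kernel weights
    return np.sum(k * y_train) / np.sum(k)         # normalized weighted average
```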
Example: Kernel Regression
Note on nonlinear regression
Polynomial regression and generalized linear regression are fitted
using error based learning.
KNN regression is just a look-up method.
Kernel regression is a combination of the two: the ”width” (w0) is fitted using an iterative (often manual) method.
Nonlinear models for classification
1. Repetition
2. Nonlinear models for regression
3. Nonlinear models for classification
4. Artificial Neural Networks
Quadratic Gaussian Classifier
Assume $p(x \mid c_k)$ Gaussian with different means $u_k$ and different covariance matrices $\Sigma_k$. D is the dimension of the input space.
Estimate the means and covariance matrices for the categories by maximizing the likelihood of the dataset $p(D \mid u_k, \Sigma_k)$:
$$p(x \mid c_k) = \frac{1}{(2\pi)^{D/2} \det(\Sigma_k)^{1/2}} \exp\left[ -\frac{1}{2} (x - u_k)^T \Sigma_k^{-1} (x - u_k) \right]$$
$$\hat{u}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} x(n)$$
$$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} \big( x(n) - \hat{u}_k \big) \big( x(n) - \hat{u}_k \big)^T$$
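A sketch of fitting and scoring the quadratic Gaussian classifier (names are illustrative; class priors are assumed to be estimated separately, e.g. from class frequencies):

```python
import numpy as np

def fit_gaussian_classes(X, labels):
    """ML estimates of mean u_k and covariance Sigma_k for each class."""
    params = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        u = Xk.mean(axis=0)
        S = (Xk - u).T @ (Xk - u) / len(Xk)  # ML covariance (divide by N_k)
        params[k] = (u, S)
    return params

def log_score(x, u, S, prior):
    """Unnormalized log posterior: ln p(x|c_k) + ln p(c_k)."""
    d = x - u
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * logdet + np.log(prior)
    # classify x as argmax_k log_score(x, *params[k], prior_k)
```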
Example: Quadratic Gaussian Classifier
Training error = 0.07%
Test error = 0.03%
Linear Gaussian class boundary
11399 green samples
2142 red samples
Training error = 0.06%
Test error = 0.10%
Quadratic logistic regression
Fit w maximizing conditional likelihood like for the linear logistic
regression.
$$f(x, w) = \frac{1}{1 + \exp\left( x^T A x + b^T x + c \right)}, \qquad w = \{A, b, c\}$$
K Nearest Neighbours classification
Estimate the posterior probabilities from the class counts among the K nearest neighbours:
$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$
Maximum a posteriori classification:
$$\hat{c} = \arg\max_{c_j} \hat{p}(c_j \mid x)$$
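A sketch of the corresponding K-NN classifier (names illustrative):

```python
import numpy as np

def knn_classify(x_query, X_train, labels, K=5):
    """MAP class: the most frequent class among the K nearest neighbours."""
    r = np.linalg.norm(X_train - x_query, axis=1)
    nearest = labels[np.argsort(r)[:K]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]  # argmax over K_j / K
```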
Example: 1-NN classifier
Test error = 0.10%
Example: 5-NN classifier
Test error = 0.14%
Decision Trees
Split into smaller and smaller subsets.
Each split increases node purity (i.e. decreases an impurity measure such as entropy).
Splits usually made along variable axes. This generates a subdivision
into “hypercubes”.
Backwards pruning is important.
Example: Decision Tree
First cut along x1:
Rule: if x1 < -0.1515 then red otherwise green.
No suitable cut along x2 after the first cut along x1.
Training error = 0.06%
Test error = 0.07%
Inductive learning of a Decision Tree
Simplest: Construct a decision tree with one leaf for every
example. This is memory-based learning, with not very good generalization.
Advanced: Split on each variable so that the purity of each split increases (ideally until each node contains only samples belonging to one class).
An impurity measure can be, for example, entropy:
$$\text{Entropy} = -\sum_i p(c_i) \ln p(c_i)$$
Entropy: a measure of “order”
The entropy is maximal
when all possibilities
are equally likely.
The goal of the decision
tree is to decrease the
entropy in each node.
Entropy is zero in a
“pure” node, i.e. a node
containing only samples
belonging to one class.
Entropy function
Plot the entropy function for a 2 class problem, where the classes
are yes and no as a function of p(yes).
$$\text{Entropy} = -p(\text{yes}) \ln[p(\text{yes})] - p(\text{no}) \ln[p(\text{no})]$$
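A small sketch that evaluates this two-class entropy curve (names illustrative):

```python
import numpy as np

def entropy(p):
    """-sum_i p_i ln p_i, with the 0 ln 0 terms taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_yes = np.linspace(0.0, 1.0, 101)
H = [entropy([p, 1 - p]) for p in p_yes]  # maximal at p(yes) = 0.5, zero at 0 and 1
```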
Decision Tree learning algorithm
Create pure nodes whenever possible.
If pure nodes are not possible, choose the split that leads to the
largest decrease in entropy.
Decision Tree learning - 1
Apply the decision tree learning algorithm to the following data
set with 10 features and 12 observations.
Decision Tree learning - 2
Dataset:
Variable to predict: TRUE or FALSE
Decision Tree learning result
True Decision Tree
Considerations – Inductive learning
The induced decision tree cannot be more complex than what the
data support.
The tree was constructed based on perfect learning, i.e. we
assume that there are no mistakes on the training data. This is
often not a good idea!
Probably good to stop learning before having pure nodes or to prune
some nodes and branches. Then estimate the a posteriori probabilities
from the number of observations of different classes in the leaf.
$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$
where $K_j$ is the number of observations of class $c_j$ in the leaf and $K$ the total number of observations in the leaf.
How do we know that f≈g?
In other words, how do we know that what we learned is correct?
Try f on a new test set of examples (cross-validation)...
...and assume the ”principle of uniformity”, i.e. the result we get on this test data should be indicative of results on future data.
Figure: learning curve for the decision tree algorithm on 100 randomly generated examples (test set) in the restaurant domain. The graph plots the average of 20 trials.
Cross Validation
Split your data set into two parts, one for training your model and the other for validating your model. The error on the validation data is called the “validation error” ($E_{val}$).
$$E_{gen} \approx E_{val}$$
K fold Cross Validation
More accurate than using only one validation set.
Fit your model K times and test it K times. Then average the
performance.
$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k)$$
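A sketch of K fold cross validation given generic fit/error callables (all names are illustrative):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, K=5, seed=0):
    """Average validation error over K folds: E_val = (1/K) sum_k E_val(k)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])          # fit on K-1 folds
        errs.append(error(model, X[val], y[val]))  # validate on the held-out fold
    return np.mean(errs)
```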
Leave One Out Cross Validation
Use every data point once as the validation set.
Leave One Out is a K fold cross validation with K equal to the number of observations.
$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k), \qquad K = N$$
Artificial Neural Networks
1. Repetition
2. Nonlinear models for regression
3. Nonlinear models for classification
4. Artificial Neural Networks
Problems with single perceptron
1. A single layer allows only a linear machine.
2. Perceptron learning
oscillates if data
distributions overlap.
3. Step functions
complicate learning
with many perceptrons.
1. Use many perceptrons
arranged in multiple layers.
2. Use a different learning
algorithm.
3. Use smooth transfer
functions.
Multilayer Perceptron
a.k.a. Artificial Neural Network (ANN)
The Multilayer Perceptron
Combine several single layer perceptrons.
Each single layer perceptron uses a sigmoid-shaped transfer function like the logistic or hyperbolic tangent function:
$$\varphi(z) = \frac{1}{1 + \exp(-z)} \qquad \text{(logistic)}$$
$$\varphi(z) = \tanh(z) \qquad \text{(hyperbolic tangent)}$$
Transfer functions
Training a Multilayer Perceptron
The simplest algorithm for training a multilayer perceptron is the
backpropagation algorithm:
1. Select small random weights w.
2. Until halting condition:
   1. Select a random training example.
   2. Calculate the output of the hidden layer. (forward step)
   3. Calculate the output of the output layer. (forward step)
   4. Calculate the error for the output layer. (backwards step)
   5. Calculate the error for the hidden layer. (backwards step)
   6. Update the weights. (backwards step)
The backpropagation algorithm
The error is propagated backwards, hence the name of the algorithm.
Forward step: $x \rightarrow h(x) \rightarrow \hat{y}[h]$
Backwards step: the error $e$ is propagated back through the layers as local errors $\delta_i$, $\delta_{ij}$.
Backpropagation is gradient descent on the SSE:
$$\Delta w(t) = -\eta \nabla_w E(t)$$
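A minimal numpy sketch of one stochastic backpropagation update for a 2-2-1 sigmoid network, assuming the SSE with a 1/2 factor (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, eta=0.5):
    """One update for a 2-2-1 network with E = 0.5 * (y_hat - y)^2."""
    # Forward step
    h = sigmoid(W1 @ x + b1)      # hidden layer output
    y_hat = sigmoid(W2 @ h + b2)  # output layer output
    # Backwards step: output delta, then hidden deltas
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    delta_hid = delta_out * W2 * h * (1 - h)
    # Weight updates: Δw = -η ∇_w E
    W2 = W2 - eta * delta_out * h
    b2 = b2 - eta * delta_out
    W1 = W1 - eta * np.outer(delta_hid, x)
    b1 = b1 - eta * delta_hid
    return W1, b1, W2, b2
```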
Example: 2-2-1 backpropagation
Training error = 0.20%
Test error = 0.24%
Converges after 3000
epochs (forever!!)
Speeding up BackProp: bold driver
Adaptive learning rate: If things are going well increase speed
If things are going bad decrease speed
$$\Delta w(t) = -\eta(t) \nabla_w E(t)$$
$$\eta(t) = \begin{cases} 1.2\, \eta(t-1) & \text{if } E(t) < E(t-1) \\ 0.5\, \eta(t-1) & \text{if } E(t) \ge E(t-1) \end{cases}$$
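A one-line sketch of the bold driver rule, using the factors 1.2 and 0.5 from the slide (function name illustrative):

```python
def bold_driver(eta, E_now, E_prev):
    """Grow the learning rate after a good step, shrink it after a bad one."""
    return 1.2 * eta if E_now < E_prev else 0.5 * eta
```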
Example: bold driver
Backpropagation: fixed η
Bold driver: adaptive η
Speeding up BackProp: momentum
Speed up if several steps are in the same direction.
Slow down if steps change direction all the time.
The update is an exponentially weighted moving average of the updates calculated at every iteration.
Update rule for backpropagation: $\Delta w(t) = -\eta \nabla_w E(t)$
Update rule for backpropagation with momentum: $\Delta w(t) = -\eta \nabla_w E(t) + \alpha\, \Delta w(t-1)$
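A sketch of the momentum update rule (names illustrative):

```python
def momentum_update(grad, dw_prev, eta=0.1, alpha=0.9):
    """Δw(t) = -η ∇E(t) + α Δw(t-1)."""
    return -eta * grad + alpha * dw_prev
```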
Example: momentum
Backpropagation:
no momentum
Backpropagation:
with momentum
Second order search
Go downhill.
The step length is estimated from the curvature of the function.
$$\Delta w = -H^{-1} \nabla_w E$$
Second order learning algorithms
Adjust the step length analytically.
Jacobian: matrix of the first partial derivatives of a function:
$$J(f)_{i,j} = \frac{\partial f_i}{\partial x_j}$$
Hessian: matrix of the second partial derivatives of a function:
$$H(f)_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
Why second order learning algorithms?
Taylor expansion of the error function:
$$E(w + \Delta w) = E(w) + \nabla E(w)^T \Delta w + \frac{1}{2} \Delta w^T H(w) \Delta w$$
Hessian matrix: $H(w) = \nabla \nabla^T E(w)$
Setting the first derivative of $E(w + \Delta w)$ w.r.t. $\Delta w$ equal to zero yields:
$$\nabla_{\Delta w} E(w + \Delta w) = 0 \;\Rightarrow\; \Delta w = -H^{-1} \nabla_w E$$
Example: second order learning
Solution in one step
for a quadratic
error function!
Levenberg-Marquardt algorithm
The Hessian is expensive to calculate, and second derivatives are often very noisy. Use a Jacobian based approximation of the Hessian.
Apply regularization.
$$H \approx 2 \sum_n J(n)^T J(n)$$
$$\Delta w(t) = -\left[ \sum_n J(n, t)^T J(n, t) + \lambda I \right]^{-1} \nabla_w E(t)$$
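A sketch of one Levenberg-Marquardt step from a residual Jacobian (names illustrative; the factor-2 convention is absorbed into the gradient):

```python
import numpy as np

def lm_step(J, e, lam):
    """J: (N, P) Jacobian of the residuals w.r.t. the P weights;
    e: (N,) residuals; lam: regularization (large lam -> short, safe steps)."""
    H_approx = J.T @ J            # Jacobian-based Hessian approximation
    g = J.T @ e                   # gradient of 0.5 * SSE
    P = J.shape[1]
    return -np.linalg.solve(H_approx + lam * np.eye(P), g)
```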
Example: 2-2-1 backpropagation
Training error = 0.20%
Test error = 0.24%
Converges after 3000
epochs (forever!!)
Example: 2-2-1 Levenberg-Marquardt
Training error = 0.03%
Test error = 0.07%
Converges after
3 epochs
Example: 2-3-1 Levenberg-Marquardt
Training error = 0.17%
Test error = 0.16%
Converges after
150 epochs
ANN for nonlinear regression
No unique solution!
When the optimization
algorithm terminates,
nothing certifies that the
global minimum of the
error function has been
reached.
Often ANN training
algorithms terminate in
local minima.
Interpretation of ANN
Classification: nonlinear logistic regression
$$\hat{p}(c \mid x) = \hat{y}(x) = \frac{1}{1 + \exp[-f(x)]}$$
Regression: projection pursuit regression
$$\hat{y}(x) = \sum_j v_j h_j(w_j^T x)$$
Book Reading (Bishop)
Ch. 2.5
Ch. 4.1, 4.2, 4.3
Ch. 5.1, 5.2, 5.3, 5.4