Lecture Slides for
Introduction to Machine Learning 2e
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
CHAPTER 8: Nonparametric Methods
Parametric Estimation

Parametric (single global model):
- Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
- Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.

Semiparametric (small number of local models):
- Mixture model (Chapter 7): the density is written as a mixture of a small number of parametric models.
Nonparametric Estimation

Assumptions:
- Similar inputs have similar outputs.
- Functions (pdf, discriminant, regression) change smoothly.

Keep the training data; "let the data speak for itself."
Given x, find a small number of the closest training instances and interpolate from these.
Nonparametric methods are also called memory-based or instance-based learning algorithms.
Density Estimation

Given the training set X = {x^t}_{t=1}^{N} drawn i.i.d. (independent and identically distributed) from p(x).

(PS. In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent. See http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)

Divide the data into bins of size h. Histogram estimator (Figure 8.1):

\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}
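A minimal Python sketch of the histogram estimator (the function name hist_estimate, the bin origin, and the normal sample are illustrative choices, not from the slides):

```python
import numpy as np

def hist_estimate(x, data, h, origin=0.0):
    """Histogram estimate: #{x^t in the same bin as x} / (N h)."""
    n = len(data)
    bin_idx = np.floor((x - origin) / h)           # bin containing the query point x
    data_idx = np.floor((np.asarray(data) - origin) / h)  # bin of each training instance
    return np.sum(data_idx == bin_idx) / (n * h)

# Example: estimate the density of a standard normal sample near x = 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(hist_estimate(0.0, sample, h=0.5))           # roughly the N(0,1) density near 0
```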
[Figure 8.1: Histogram estimator.]
Density Estimation

Given the training set X = {x^t}_t drawn i.i.d. from p(x), let x always be at the center of a bin of size h.
Naive estimator (Figure 8.2):

\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{N h}

or

\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} w\left(\frac{x - x^t}{h}\right), \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}
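A corresponding sketch of the naive estimator, under the same illustrative conventions (naive_estimate is a made-up name):

```python
import numpy as np

def naive_estimate(x, data, h):
    """Naive estimator: a bin of width h is always centered at the query x."""
    u = (x - np.asarray(data)) / h
    w = (np.abs(u) < 0.5).astype(float)   # weight 1 for instances within h/2 of x
    return w.sum() / (len(data) * h)
```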
[Figure 8.2: Naive estimator with h = 0.5, 1, and 2.]
Kernel Estimator

Kernel function, e.g., the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{u^2}{2}\right]

Kernel estimator (Parzen windows), Figure 8.3:

\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\left(\frac{x - x^t}{h}\right)

If K is Gaussian, then \hat{p}(x) will be smooth, having all the derivatives.
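A sketch of the Parzen-window estimate with the Gaussian kernel above (kernel_estimate is an illustrative name):

```python
import numpy as np

def kernel_estimate(x, data, h):
    """Parzen-window estimate with a Gaussian kernel K(u)."""
    u = (x - np.asarray(data)) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum() / (len(data) * h)
```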
[Figure 8.3: Kernel estimator.]
k-Nearest Neighbor Estimator

Instead of fixing the bin width h and counting the number of instances, fix the number of instances (neighbors) k and check the bin width:

\hat{p}(x) = \frac{k}{2 N d_k(x)}

where d_k(x) is the distance from x to its kth closest training instance.
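A one-dimensional sketch of this estimator (knn_density is an illustrative name; the formula above assumes univariate x):

```python
import numpy as np

def knn_density(x, data, k):
    """k-NN density estimate p_hat(x) = k / (2 N d_k(x)) in one dimension."""
    d = np.sort(np.abs(np.asarray(data) - x))
    return k / (2 * len(data) * d[k - 1])   # d[k-1] is the distance to the kth neighbor
```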
[Figure: k-nearest neighbor estimator.]
Generalization to Multivariate Data

Kernel density estimator:

\hat{p}(\mathbf{x}) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right)

with the requirement that

\int_{R^d} K(\mathbf{x}) \, d\mathbf{x} = 1

Multivariate Gaussian kernel:

- spheric: K(\mathbf{u}) = \left(\frac{1}{\sqrt{2\pi}}\right)^d \exp\left[-\frac{\|\mathbf{u}\|^2}{2}\right]
- ellipsoid: K(\mathbf{u}) = \frac{1}{(2\pi)^{d/2} |\mathbf{S}|^{1/2}} \exp\left[-\frac{1}{2} \mathbf{u}^T \mathbf{S}^{-1} \mathbf{u}\right]
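A sketch of the multivariate estimator with the spheric Gaussian kernel (mv_kernel_estimate and the array shapes are illustrative assumptions):

```python
import numpy as np

def mv_kernel_estimate(x, data, h):
    """Multivariate Parzen estimate with the spheric Gaussian kernel.

    x: query point, shape (d,); data: training set, shape (N, d).
    """
    X = np.asarray(data)
    n, d = X.shape
    u = (x - X) / h                                   # shape (N, d)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return K.sum() / (n * h**d)
```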
Nonparametric Classification

Estimate p(x|C_i) and use Bayes' rule.

Kernel estimator:

\hat{p}(\mathbf{x} \mid C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}

where r_i^t = 1 if x^t belongs to C_i and 0 otherwise.

The discriminant function:

g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid C_i)\, \hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t
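A sketch of classification by these kernel discriminants (function names and the label encoding via an array y are illustrative; taking the argmax over g_i implements Bayes' rule with equal losses):

```python
import numpy as np

def kernel_discriminant(x, X, y, h, label):
    """g_i(x) = (1 / (N h^d)) * sum_t K((x - x^t)/h) r_i^t, spheric Gaussian K."""
    X, y = np.asarray(X), np.asarray(y)
    d = X.shape[1]
    Xi = X[y == label]                            # the instances with r_i^t = 1
    u = (x - Xi) / h
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return K.sum() / (len(X) * h**d)

def classify(x, X, y, h):
    """Choose the class with the largest discriminant g_i(x)."""
    labels = np.unique(y)
    scores = [kernel_discriminant(x, X, y, h, c) for c in labels]
    return labels[int(np.argmax(scores))]
```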
Nonparametric Classification: k-NN Estimator

For the special case of the k-NN estimator:

\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}

where
- k_i: the number of neighbors out of the k nearest that belong to C_i
- V^k(x): the volume of the d-dimensional hypersphere centered at x with radius r = \|\mathbf{x} - \mathbf{x}_{(k)}\|, so that V^k = c_d r^d
- c_d: the volume of the unit sphere in d dimensions. For example:

d = 1: V^k = 2r (c_1 = 2)
d = 2: V^k = \pi r^2 (c_2 = \pi)
d = 3: V^k = \frac{4}{3}\pi r^3 (c_3 = 4\pi/3)
Nonparametric Classification: k-NN Estimator

From

\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}, \qquad \hat{p}(\mathbf{x}) = \frac{k}{N V^k(\mathbf{x})}, \qquad \hat{P}(C_i) = \frac{N_i}{N}

it follows that

\hat{P}(C_i \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_i)\, \hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{k}
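A sketch of the resulting k-NN classifier, which just counts class labels among the k nearest neighbors (knn_posteriors is an illustrative name):

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, X, y, k):
    """P_hat(C_i | x) = k_i / k for each class among the k nearest neighbors."""
    dist = np.linalg.norm(np.asarray(X) - x, axis=1)
    nearest = np.asarray(y)[np.argsort(dist)[:k]]
    counts = Counter(nearest)
    return {c: counts[c] / k for c in counts}
```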
Condensed Nearest Neighbor

Time/space complexity of k-NN is O(N).
Find a subset Z of X that is small and is accurate in classifying X (Hart, 1968).

Error function:

E'(Z \mid X) = E(X \mid Z) + \lambda |Z|

where E(X|Z) is the error on X storing Z, and |Z| is the cardinality of Z.
Condensed Nearest Neighbor

Incremental algorithm: add an instance to Z only if it is needed, that is, if the current Z misclassifies it.
This is a greedy method and a local search; the results depend on the order of the training instances. A sketch follows below.
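A sketch of one plausible reading of this incremental algorithm, assuming 1-NN classification with Z and a fixed cap on the number of passes (condensed_nn and the pass limit are illustrative choices):

```python
import numpy as np

def condensed_nn(X, y, max_passes=5):
    """Condensed 1-NN (sketch): keep an instance only if the current Z misclassifies it."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Z, labels = [X[0]], [y[0]]                    # seed Z with the first instance
    for _ in range(max_passes):                   # stop when a full pass adds nothing
        added = False
        for xt, yt in zip(X, y):
            d = np.linalg.norm(np.asarray(Z) - xt, axis=1)
            if labels[int(np.argmin(d))] != yt:   # misclassified by Z: add it
                Z.append(xt)
                labels.append(yt)
                added = True
        if not added:
            break
    return np.asarray(Z), np.asarray(labels)
```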
Nonparametric Regression

Smoothing models. A nonparametric regression estimator is also called a smoother:
- Running mean smoother: regressogram vs. naive estimator
- Kernel smoother
- Running line smoother

Regressogram (Figure 8.7):

\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}

where

b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}
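A sketch of the regressogram (regressogram and the bin origin parameter are illustrative):

```python
import numpy as np

def regressogram(x, X, r, h, origin=0.0):
    """Average of the outputs r^t whose inputs fall in the same bin as x."""
    b = np.floor((np.asarray(X) - origin) / h) == np.floor((x - origin) / h)
    return np.sum(b * np.asarray(r)) / np.sum(b) if b.any() else np.nan
```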
[Figure 8.7: Regressogram.]
Running Mean Smoother

Running mean smoother (naive estimator), Figure 8.8:

\hat{g}(x) = \frac{\sum_{t=1}^{N} w\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\left(\frac{x - x^t}{h}\right)}

where

w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}
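A sketch of the running mean smoother (running_mean is an illustrative name):

```python
import numpy as np

def running_mean(x, X, r, h):
    """Average of r^t over the training instances within h of x."""
    w = np.abs((x - np.asarray(X)) / h) < 1
    return np.sum(w * np.asarray(r)) / np.sum(w) if w.any() else np.nan
```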
[Figure 8.8: Running mean smoother.]
Kernel Smoother

Kernel smoother (Figure 8.9):

\hat{g}(x) = \frac{\sum_{t=1}^{N} K\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\left(\frac{x - x^t}{h}\right)}

where K(·) is the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{u^2}{2}\right]
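A sketch of the kernel smoother, also known as the Nadaraya-Watson estimator (kernel_smoother is an illustrative name):

```python
import numpy as np

def kernel_smoother(x, X, r, h):
    """Gaussian-kernel-weighted average of the outputs r^t."""
    u = (x - np.asarray(X)) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.sum(K * np.asarray(r)) / np.sum(K)
```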
[Figure 8.9: Kernel smoother.]
Running Line Smoother

Instead of taking an average and giving a constant fit at a point, we can take into account one more term in the Taylor expansion and calculate a linear fit:

F(x) = F(a) + \frac{F'(a)}{1!}(x - a) + \frac{F''(a)}{2!}(x - a)^2 + \cdots

Running line smoother (Figure 8.10):
- Use the data points in the neighborhood, as defined by h or k.
- Fit a local regression line.
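A sketch of the running line smoother using an ordinary least-squares line over the neighborhood (running_line and the fallback for tiny neighborhoods are illustrative choices):

```python
import numpy as np

def running_line(x, X, r, h):
    """Fit a local least-squares line to the neighbors of x and evaluate it at x."""
    X, r = np.asarray(X), np.asarray(r)
    mask = np.abs(X - x) < h                       # neighborhood defined by bandwidth h
    if mask.sum() < 2:
        return r[mask].mean() if mask.any() else np.nan
    slope, intercept = np.polyfit(X[mask], r[mask], deg=1)   # local linear fit
    return slope * x + intercept
```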
[Figure 8.10: Running line smoother.]
How to Choose k or h?

When k or h is small:
- Single instances matter; bias is small, variance is large.
- Undersmoothing; high complexity.

As k or h increases:
- We average over more instances; variance decreases but bias increases.
- Oversmoothing; low complexity.

Cross-validation is used to fine-tune k or h; a sketch follows below.
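A sketch of choosing h by k-fold cross-validation of a smoother's squared error (cv_bandwidth, the fold assignment, and the reuse of running_mean are illustrative choices):

```python
import numpy as np

def running_mean(x, X, r, h):
    """Running mean smoother, as defined above."""
    w = np.abs((x - X) / h) < 1
    return np.sum(w * r) / np.sum(w) if w.any() else np.nan

def cv_bandwidth(X, r, candidates, folds=5):
    """Return the h among the candidates with the lowest cross-validated error."""
    X, r = np.asarray(X), np.asarray(r)
    fold = np.arange(len(X)) % folds               # simple deterministic fold split
    best_h, best_err = None, np.inf
    for h in candidates:
        err = 0.0
        for f in range(folds):
            tr, va = fold != f, fold == f
            pred = np.array([running_mean(xv, X[tr], r[tr], h) for xv in X[va]])
            err += np.nansum((pred - r[va]) ** 2)  # ignore empty-neighborhood NaNs
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```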
[Figure: Kernel estimate for various bin lengths for a two-class problem. Plotted are the conditional densities, p(x|C_i). It seems that the top one oversmooths and the bottom undersmooths, but whichever is best will depend on where the validation data points are.]