Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Chapter 8: Nonparametric Methods

ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Page 3: Parametric Estimation

Parametric (single global model)
- Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
- Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.

Semiparametric (small number of local models)
- Mixture model (Chapter 7): the density is written as a disjunction of a small number of parametric models.

Page 4: Nonparametric Estimation

Assumptions:
- Similar inputs have similar outputs.
- Functions (pdf, discriminant, regression) change smoothly.

Keep the training data; "let the data speak for itself."
Given x, find a small number of the closest training instances and interpolate from these.
Nonparametric methods are also called memory-based or instance-based learning algorithms.

Page 5: Density Estimation

Given the training set X = {x^t}_t drawn iid (independent and identically distributed) from p(x), divide the data into bins of size h.

Histogram estimator (Figure 8.1):

$$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$$

PS. In probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
(http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)
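To make the estimator concrete, here is a minimal NumPy sketch; the function name, the bin-origin convention, and the synthetic sample are illustrative choices, not from the slides.

```python
import numpy as np

def histogram_estimate(x, X, h, origin=0.0):
    """Histogram density estimate: #{x^t in the same bin as x} / (N h)."""
    N = len(X)
    # Index of the bin containing each point, for bins [origin + m*h, origin + (m+1)*h)
    bin_x = np.floor((x - origin) / h)
    bin_X = np.floor((X - origin) / h)
    return np.sum(bin_X == bin_x) / (N * h)

# Example: estimate the density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
print(histogram_estimate(0.0, X, h=0.5))  # roughly 1/sqrt(2*pi) ~ 0.40
```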

Page 6: Histogram estimator

Figure: histogram estimates of the same sample for several bin widths h, computed with $\hat{p}(x) = \#\{x^t \text{ in the same bin as } x\}/(Nh)$.

Page 7: Density Estimation (continued)

Given the training set X = {x^t}_t drawn iid from p(x), the naive estimator always places x at the center of a bin of size h.

Naive estimator (Figure 8.2):

$$\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{Nh}$$

or equivalently

$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}$$

Page 8: Naive estimator

Figure: naive estimates of the same sample for bin widths h = 0.5, h = 1, and h = 2.

Page 9: Kernel Estimator

Kernel function, e.g., the Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left[-\frac{u^2}{2}\right]$$

Kernel estimator (Parzen windows), Figure 8.3:

$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$

If K is Gaussian, then $\hat{p}(x)$ will be smooth, having all the derivatives.
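A minimal sketch of the Parzen window estimator with a Gaussian kernel, assuming one-dimensional data; the function name and test data are illustrative.

```python
import numpy as np

def parzen_estimate(x, X, h):
    """Gaussian-kernel (Parzen window) density estimate at point(s) x."""
    x = np.atleast_1d(x)[:, None]        # query points, shape (M, 1)
    u = (x - X[None, :]) / h             # pairwise scaled differences, shape (M, N)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(X) * h)

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(parzen_estimate([0.0, 1.0], X, h=0.3))
```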

Page 10: Kernel estimator

Figure: kernel estimates of the same sample for several bandwidths h.

Page 11: k-Nearest Neighbor Estimator

Instead of fixing the bin width h and counting the number of instances that fall inside, fix the number of instances (neighbors) k and let the bin width adapt:

$$\hat{p}(x) = \frac{k}{2N d_k(x)}$$

where d_k(x) is the distance from x to its kth closest training instance.
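A one-dimensional sketch of the k-NN density estimate; the helper name and the sample are illustrative.

```python
import numpy as np

def knn_estimate(x, X, k):
    """k-NN density estimate: k / (2 N d_k(x)) in one dimension."""
    d = np.sort(np.abs(X - x))          # distances to all training instances
    return k / (2 * len(X) * d[k - 1])  # d[k-1] is the kth closest distance

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(knn_estimate(0.0, X, k=25))
```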

Page 12: k-Nearest Neighbor Estimator

Figure: k-NN density estimates of the same sample for several values of k.

Page 13: Generalization to Multivariate Data

Kernel density estimator:

$$\hat{p}(\mathbf{x}) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right)$$

with the requirement that

$$\int_{\mathbb{R}^d} K(\mathbf{x})\, d\mathbf{x} = 1$$

Multivariate Gaussian kernel:

- Spheric: $K(\mathbf{u}) = \left(\dfrac{1}{\sqrt{2\pi}}\right)^{d}\exp\!\left[-\dfrac{\lVert\mathbf{u}\rVert^2}{2}\right]$
- Ellipsoid: $K(\mathbf{u}) = \dfrac{1}{(2\pi)^{d/2}\,|\mathbf{S}|^{1/2}}\exp\!\left[-\dfrac{1}{2}\mathbf{u}^T\mathbf{S}^{-1}\mathbf{u}\right]$, where S is the covariance matrix.
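A sketch of the multivariate estimator with the spheric Gaussian kernel, assuming a single shared bandwidth h; names and data are illustrative.

```python
import numpy as np

def mv_parzen_estimate(x, X, h):
    """Multivariate Parzen estimate with a spheric Gaussian kernel.

    x : query point, shape (d,); X : training data, shape (N, d)."""
    N, d = X.shape
    u = (x - X) / h                      # shape (N, d)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return K.sum() / (N * h**d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(mv_parzen_estimate(np.zeros(2), X, h=0.5))
```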

Page 14: Nonparametric Classification

Estimate p(x|C_i) and use Bayes' rule.

Kernel estimator:

$$\hat{p}(\mathbf{x}|C_i) = \frac{1}{N_i h^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$

where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise, and $N_i = \sum_t r_i^t$.

The discriminant function:

$$g_i(\mathbf{x}) = \hat{p}(\mathbf{x}|C_i)\,\hat{P}(C_i) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t$$
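A sketch of the kernel discriminant g_i(x), assuming integer class labels 0..K-1 and a spheric Gaussian kernel; the function and the two-Gaussian toy data are illustrative.

```python
import numpy as np

def kernel_discriminant(x, X, y, h, n_classes):
    """g_i(x) = (1 / (N h^d)) * sum_t K((x - x^t)/h) * r_i^t, for each class i."""
    N, d = X.shape
    u = (x - X) / h
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    g = np.array([K[y == i].sum() for i in range(n_classes)]) / (N * h**d)
    return g  # predict with argmax

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(np.argmax(kernel_discriminant(np.array([1.0, 1.0]), X, y, h=0.5, n_classes=2)))
```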

Page 15: Nonparametric Classification: k-NN Estimator

For the special case of the k-NN estimator:

$$\hat{p}(\mathbf{x}|C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}$$

where
- k_i : the number of neighbors out of the k nearest that belong to C_i
- V^k(x) : the volume of the d-dimensional hypersphere centered at x with radius $r = \lVert \mathbf{x} - \mathbf{x}_{(k)} \rVert$, i.e. $V^k = r^d c_d$
- c_d : the volume of the unit sphere in d dimensions. For example:

$$d = 1:\ V = 2r \ (c_1 = 2); \qquad d = 2:\ V = \pi r^2 \ (c_2 = \pi); \qquad d = 3:\ V = \tfrac{4}{3}\pi r^3 \ (c_3 = \tfrac{4}{3}\pi)$$

Page 16: Nonparametric Classification: k-NN Estimator (continued)

From

$$\hat{p}(\mathbf{x}|C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}, \qquad \hat{p}(\mathbf{x}) = \frac{k}{N V^k(\mathbf{x})}, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$

it follows that

$$\hat{P}(C_i|\mathbf{x}) = \frac{\hat{p}(\mathbf{x}|C_i)\,\hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{k}$$

That is, the k-NN classifier assigns x to the class having the most examples among the k nearest neighbors.
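A sketch computing $\hat{P}(C_i|\mathbf{x}) = k_i/k$ directly; Euclidean distance and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_posterior(x, X, y, k, n_classes):
    """P_hat(C_i|x) = k_i / k: class fractions among the k nearest neighbors."""
    d = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(d)[:k]
    return np.bincount(y[nearest], minlength=n_classes) / k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(knn_posterior(np.array([0.5, 0.5]), X, y, k=5, n_classes=2))
```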

Page 17: Condensed Nearest Neighbor

The time/space complexity of k-NN is O(N). Find a subset Z of X that is small and accurate in classifying X (Hart, 1968), by minimizing the error function

$$E'(Z|X) = E(X|Z) + \lambda |Z|$$

where E(X|Z) is the error on X when storing Z, and |Z| is the cardinality of Z.

Page 18: Condensed Nearest Neighbor (continued)

Incremental algorithm: add an instance to Z only if it is needed, i.e., if the instances already stored misclassify it.
This is a greedy method and a local search; the result depends on the order of the training instances.
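A sketch of the incremental condensation loop in the spirit of Hart (1968), using a 1-NN check on the stored subset; the fixed pass limit, the random ordering, and the data are illustrative choices.

```python
import numpy as np

def condensed_nn(X, y, passes=5, seed=0):
    """Hart's condensed NN: store an instance only if the subset misclassifies it."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    Z = [order[0]]                           # start with one stored instance
    for _ in range(passes):                  # repeat until no change (bounded here)
        changed = False
        for t in order:
            d = np.linalg.norm(X[Z] - X[t], axis=1)
            if y[Z][np.argmin(d)] != y[t]:   # 1-NN on Z misclassifies x^t
                Z.append(t)
                changed = True
        if not changed:
            break
    return np.array(Z)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(+2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(len(condensed_nn(X, y)), "of", len(X), "instances stored")
```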

Page 19: Nonparametric Regression

Smoothing models: a nonparametric regression estimator is also called a smoother.
- Running mean smoother (the regressogram vs. the naive estimator)
- Kernel smoother
- Running line smoother

Regressogram (Figure 8.7):

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$
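A sketch of the regressogram at a single query point; the bin origin and the noisy-sine sample are illustrative.

```python
import numpy as np

def regressogram(x, X, r, h, origin=0.0):
    """Regressogram: average of the r^t whose x^t fall in the same bin as x."""
    same_bin = np.floor((X - origin) / h) == np.floor((x - origin) / h)
    return r[same_bin].mean() if same_bin.any() else np.nan

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(regressogram(3.0, X, r, h=1.0))
```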

Page 20: Regressogram

Figure: regressogram fits of the same sample for several bin widths h.

Page 21: Regressogram (figure only)

Page 22: Running Mean Smoother

Running mean smoother (the naive estimator of regression), Figure 8.8:

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
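The running mean differs from the regressogram only in that the bin is centered at the query point; a sketch with illustrative data:

```python
import numpy as np

def running_mean(x, X, r, h):
    """Running mean smoother: average r^t over the window |x - x^t| < h."""
    w = np.abs((x - X) / h) < 1
    return r[w].mean() if w.any() else np.nan

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(running_mean(3.0, X, r, h=0.5))
```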

Page 23: Running mean smoother

Figure: running mean fits of the same sample for several window widths h.

Page 24: Kernel Smoother

Kernel smoother (Figure 8.9):

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}$$

where K(·) is a kernel, e.g., the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left[-\frac{u^2}{2}\right]$.
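A sketch of the kernel smoother (often called Nadaraya-Watson) with a Gaussian kernel; note that the kernel's constant factor cancels between numerator and denominator. Names and data are illustrative.

```python
import numpy as np

def kernel_smoother(x, X, r, h):
    """Gaussian kernel smoother: weighted average of r^t with weights K((x - x^t)/h)."""
    K = np.exp(-0.5 * ((x - X) / h) ** 2)  # 1/sqrt(2*pi) cancels in the ratio
    return (K * r).sum() / K.sum()

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(kernel_smoother(3.0, X, r, h=0.5))
```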

Page 25: Kernel smoother

Figure: kernel smoother fits of the same sample for several bandwidths h.

Page 26: Running Line Smoother

Instead of taking an average and giving a constant fit at a point, we can take into account one more term in the Taylor expansion and calculate a linear fit:

$$F(x) = F(a) + \frac{F'(a)}{1!}(x-a) + \frac{F''(a)}{2!}(x-a)^2 + \cdots$$

Running line smoother (Figure 8.10): use the data points in the neighborhood, as defined by h or k, and fit a local regression line (a sketch follows).
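A sketch fitting a least-squares line to the h-neighborhood, as described above; np.polyfit does the local fit, and the data are illustrative.

```python
import numpy as np

def running_line(x, X, r, h):
    """Running line smoother: least-squares line on the points with |x - x^t| < h,
    evaluated at x."""
    w = np.abs(X - x) < h
    if w.sum() < 2:
        return np.nan                       # need at least two points for a line
    b1, b0 = np.polyfit(X[w], r[w], deg=1)  # slope, intercept
    return b0 + b1 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(running_line(3.0, X, r, h=0.5))
```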

Page 27: Running line smoother

Figure: running line fits of the same sample for several neighborhood widths.

Page 28: How to Choose k or h?

When k or h is small:
- Single instances matter; bias is small, variance is large (undersmoothing).
- High complexity.

As k or h increases:
- We average over more instances; variance decreases but bias increases (oversmoothing).
- Low complexity.

Cross-validation is used to fine-tune k or h, as sketched below.
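A sketch of choosing the kernel-smoother bandwidth by cross-validated squared error; the fold count, candidate grid, and data are illustrative choices.

```python
import numpy as np

def cv_bandwidth(X, r, candidates, n_folds=5, seed=0):
    """Return the bandwidth h with the lowest cross-validated squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)

    def smooth(x, Xtr, rtr, h):
        K = np.exp(-0.5 * ((x - Xtr) / h) ** 2)
        return (K * rtr).sum() / K.sum()

    errors = []
    for h in candidates:
        se = 0.0
        for val in folds:                              # held-out fold
            tr = np.setdiff1d(np.arange(len(X)), val)  # training indices
            se += sum((smooth(X[v], X[tr], r[tr], h) - r[v]) ** 2 for v in val)
        errors.append(se / len(X))
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 200)
r = np.sin(X) + rng.normal(0, 0.2, 200)
print(cv_bandwidth(X, r, candidates=[0.1, 0.3, 0.5, 1.0, 2.0]))
```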

Page 29: Choosing the smoothing parameter

Figure: kernel estimates for various bin lengths for a two-class problem. Plotted are the conditional densities p(x|C_i). It seems that the top one oversmooths and the bottom undersmooths, but whichever is best will depend on where the validation data points are.