Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Chapter 8: Nonparametric Methods

ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Page 3: Parametric Estimation

Parametric (single global model)
- Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
- Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.

Semiparametric (small number of local models)
- Mixture model (Chapter 7): the density is written as a disjunction of a small number of parametric models.

Page 4: Nonparametric Estimation

Assumptions:
- Similar inputs have similar outputs.
- Functions (pdf, discriminant, regression) change smoothly.

Keep the training data; "let the data speak for itself."
Given x, find a small number of the closest training instances and interpolate from these.
Nonparametric methods are also called memory-based or instance-based learning algorithms.

Page 5: Density Estimation

Given the training set X = {x^t}_t drawn iid (independent and identically distributed) from p(x), divide the data into bins of size h.

Histogram estimator (Figure 8.1):

$$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$$

PS. In probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
(http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)
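To make the estimator concrete, here is a minimal NumPy sketch; the function name, the bin-origin convention, and the synthetic sample are illustrative choices, not from the slides.

```python
import numpy as np

def histogram_estimate(x, X, h, origin=0.0):
    """Histogram density estimate: #{x^t in the same bin as x} / (N h)."""
    N = len(X)
    # Index of the bin containing each point, for bins [origin + m*h, origin + (m+1)*h)
    bin_x = np.floor((x - origin) / h)
    bin_X = np.floor((X - origin) / h)
    return np.sum(bin_X == bin_x) / (N * h)

# Example: estimate the density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
print(histogram_estimate(0.0, X, h=0.5))  # roughly 1/sqrt(2*pi) ~ 0.40
```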

Page 6: Histogram estimator

Figure: histogram estimates of the same sample for several bin widths h, computed with $\hat{p}(x) = \#\{x^t \text{ in the same bin as } x\}/(Nh)$.

Page 7: Density Estimation (continued)

Given the training set X = {x^t}_t drawn iid from p(x), the naive estimator always places x at the center of a bin of size h.

Naive estimator (Figure 8.2):

$$\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{Nh}$$

or equivalently

$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}$$

Page 8: Naive estimator

Figure: naive estimates of the same sample for bin widths h = 0.5, h = 1, and h = 2.

Page 9: Kernel Estimator

Kernel function, e.g., the Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left[-\frac{u^2}{2}\right]$$

Kernel estimator (Parzen windows), Figure 8.3:

$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$

If K is Gaussian, then $\hat{p}(x)$ will be smooth, having all the derivatives.
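A minimal sketch of the Parzen window estimator with a Gaussian kernel, assuming one-dimensional data; the function name and test data are illustrative.

```python
import numpy as np

def parzen_estimate(x, X, h):
    """Gaussian-kernel (Parzen window) density estimate at point(s) x."""
    x = np.atleast_1d(x)[:, None]        # query points, shape (M, 1)
    u = (x - X[None, :]) / h             # pairwise scaled differences, shape (M, N)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(X) * h)

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(parzen_estimate([0.0, 1.0], X, h=0.3))
```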

Page 10: Kernel estimator

Figure: kernel estimates of the same sample for several bandwidths h.

Page 11: k-Nearest Neighbor Estimator

Instead of fixing the bin width h and counting the number of instances that fall inside, fix the number of instances (neighbors) k and let the bin width adapt:

$$\hat{p}(x) = \frac{k}{2N d_k(x)}$$

where d_k(x) is the distance from x to its kth closest training instance.
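A one-dimensional sketch of the k-NN density estimate; the helper name and the sample are illustrative.

```python
import numpy as np

def knn_estimate(x, X, k):
    """k-NN density estimate: k / (2 N d_k(x)) in one dimension."""
    d = np.sort(np.abs(X - x))          # distances to all training instances
    return k / (2 * len(X) * d[k - 1])  # d[k-1] is the kth closest distance

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(knn_estimate(0.0, X, k=25))
```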

Page 12: k-Nearest Neighbor Estimator

Figure: k-NN density estimates of the same sample for several values of k.

Page 13: Generalization to Multivariate Data

Kernel density estimator:

$$\hat{p}(\mathbf{x}) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right)$$

with the requirement that

$$\int_{\mathbb{R}^d} K(\mathbf{x})\, d\mathbf{x} = 1$$

Multivariate Gaussian kernel:

- Spheric: $K(\mathbf{u}) = \left(\dfrac{1}{\sqrt{2\pi}}\right)^{d}\exp\!\left[-\dfrac{\lVert\mathbf{u}\rVert^2}{2}\right]$
- Ellipsoid: $K(\mathbf{u}) = \dfrac{1}{(2\pi)^{d/2}\,|\mathbf{S}|^{1/2}}\exp\!\left[-\dfrac{1}{2}\mathbf{u}^T\mathbf{S}^{-1}\mathbf{u}\right]$, where S is the covariance matrix.
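A sketch of the multivariate estimator with the spheric Gaussian kernel, assuming a single shared bandwidth h; names and data are illustrative.

```python
import numpy as np

def mv_parzen_estimate(x, X, h):
    """Multivariate Parzen estimate with a spheric Gaussian kernel.

    x : query point, shape (d,); X : training data, shape (N, d)."""
    N, d = X.shape
    u = (x - X) / h                      # shape (N, d)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return K.sum() / (N * h**d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(mv_parzen_estimate(np.zeros(2), X, h=0.5))
```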

Page 14: Nonparametric Classification

Estimate p(x|C_i) and use Bayes' rule.

Kernel estimator:

$$\hat{p}(\mathbf{x}|C_i) = \frac{1}{N_i h^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$

where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise, and $N_i = \sum_t r_i^t$.

The discriminant function:

$$g_i(\mathbf{x}) = \hat{p}(\mathbf{x}|C_i)\,\hat{P}(C_i) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t$$
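A sketch of the kernel discriminant g_i(x), assuming integer class labels 0..K-1 and a spheric Gaussian kernel; the function and the two-Gaussian toy data are illustrative.

```python
import numpy as np

def kernel_discriminant(x, X, y, h, n_classes):
    """g_i(x) = (1 / (N h^d)) * sum_t K((x - x^t)/h) * r_i^t, for each class i."""
    N, d = X.shape
    u = (x - X) / h
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    g = np.array([K[y == i].sum() for i in range(n_classes)]) / (N * h**d)
    return g  # predict with argmax

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(np.argmax(kernel_discriminant(np.array([1.0, 1.0]), X, y, h=0.5, n_classes=2)))
```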

Page 15: Nonparametric Classification: k-NN Estimator

For the special case of the k-NN estimator:

$$\hat{p}(\mathbf{x}|C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}$$

where
- k_i : the number of neighbors out of the k nearest that belong to C_i
- V^k(x) : the volume of the d-dimensional hypersphere centered at x with radius $r = \lVert \mathbf{x} - \mathbf{x}_{(k)} \rVert$, i.e. $V^k = r^d c_d$
- c_d : the volume of the unit sphere in d dimensions. For example:

$$d = 1:\ V = 2r \ (c_1 = 2); \qquad d = 2:\ V = \pi r^2 \ (c_2 = \pi); \qquad d = 3:\ V = \tfrac{4}{3}\pi r^3 \ (c_3 = \tfrac{4}{3}\pi)$$

Page 16: Nonparametric Classification: k-NN Estimator (continued)

From

$$\hat{p}(\mathbf{x}|C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}, \qquad \hat{p}(\mathbf{x}) = \frac{k}{N V^k(\mathbf{x})}, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$

it follows that

$$\hat{P}(C_i|\mathbf{x}) = \frac{\hat{p}(\mathbf{x}|C_i)\,\hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{k}$$

That is, the k-NN classifier assigns x to the class having the most examples among the k nearest neighbors.
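A sketch computing $\hat{P}(C_i|\mathbf{x}) = k_i/k$ directly; Euclidean distance and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_posterior(x, X, y, k, n_classes):
    """P_hat(C_i|x) = k_i / k: class fractions among the k nearest neighbors."""
    d = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(d)[:k]
    return np.bincount(y[nearest], minlength=n_classes) / k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(knn_posterior(np.array([0.5, 0.5]), X, y, k=5, n_classes=2))
```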

Page 17: Condensed Nearest Neighbor

The time/space complexity of k-NN is O(N). Find a subset Z of X that is small and accurate in classifying X (Hart, 1968), by minimizing the error function

$$E'(Z|X) = E(X|Z) + \lambda |Z|$$

where E(X|Z) is the error on X when storing Z, and |Z| is the cardinality of Z.

Page 18: Condensed Nearest Neighbor (continued)

Incremental algorithm: add an instance to Z only if it is needed, i.e., if the instances already stored misclassify it.
This is a greedy method and a local search; the result depends on the order of the training instances.
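A sketch of the incremental condensation loop in the spirit of Hart (1968), using a 1-NN check on the stored subset; the fixed pass limit, the random ordering, and the data are illustrative choices.

```python
import numpy as np

def condensed_nn(X, y, passes=5, seed=0):
    """Hart's condensed NN: store an instance only if the subset misclassifies it."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    Z = [order[0]]                           # start with one stored instance
    for _ in range(passes):                  # repeat until no change (bounded here)
        changed = False
        for t in order:
            d = np.linalg.norm(X[Z] - X[t], axis=1)
            if y[Z][np.argmin(d)] != y[t]:   # 1-NN on Z misclassifies x^t
                Z.append(t)
                changed = True
        if not changed:
            break
    return np.array(Z)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(+2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(len(condensed_nn(X, y)), "of", len(X), "instances stored")
```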

Page 19: Nonparametric Regression

Smoothing models: a nonparametric regression estimator is also called a smoother.
- Running mean smoother (the regressogram vs. the naive estimator)
- Kernel smoother
- Running line smoother

Regressogram (Figure 8.7):

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$
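A sketch of the regressogram at a single query point; the bin origin and the noisy-sine sample are illustrative.

```python
import numpy as np

def regressogram(x, X, r, h, origin=0.0):
    """Regressogram: average of the r^t whose x^t fall in the same bin as x."""
    same_bin = np.floor((X - origin) / h) == np.floor((x - origin) / h)
    return r[same_bin].mean() if same_bin.any() else np.nan

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(regressogram(3.0, X, r, h=1.0))
```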

Page 20: Regressogram

Figure: regressogram fits of the same sample for several bin widths h.

Page 21: Regressogram (figure only)

Page 22: Running Mean Smoother

Running mean smoother (the naive estimator of regression), Figure 8.8:

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
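The running mean differs from the regressogram only in that the bin is centered at the query point; a sketch with illustrative data:

```python
import numpy as np

def running_mean(x, X, r, h):
    """Running mean smoother: average r^t over the window |x - x^t| < h."""
    w = np.abs((x - X) / h) < 1
    return r[w].mean() if w.any() else np.nan

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(running_mean(3.0, X, r, h=0.5))
```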

Page 23: Running mean smoother

Figure: running mean fits of the same sample for several window widths h.

Page 24: Kernel Smoother

Kernel smoother (Figure 8.9):

$$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}$$

where K(·) is a kernel, e.g., the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left[-\frac{u^2}{2}\right]$.
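A sketch of the kernel smoother (often called Nadaraya-Watson) with a Gaussian kernel; note that the kernel's constant factor cancels between numerator and denominator. Names and data are illustrative.

```python
import numpy as np

def kernel_smoother(x, X, r, h):
    """Gaussian kernel smoother: weighted average of r^t with weights K((x - x^t)/h)."""
    K = np.exp(-0.5 * ((x - X) / h) ** 2)  # 1/sqrt(2*pi) cancels in the ratio
    return (K * r).sum() / K.sum()

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(kernel_smoother(3.0, X, r, h=0.5))
```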

Page 25: Kernel smoother

Figure: kernel smoother fits of the same sample for several bandwidths h.

Page 26: Running Line Smoother

Instead of taking an average and giving a constant fit at a point, we can take into account one more term in the Taylor expansion and calculate a linear fit:

$$F(x) = F(a) + \frac{F'(a)}{1!}(x-a) + \frac{F''(a)}{2!}(x-a)^2 + \cdots$$

Running line smoother (Figure 8.10): use the data points in the neighborhood, as defined by h or k, and fit a local regression line (a sketch follows).
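A sketch fitting a least-squares line to the h-neighborhood, as described above; np.polyfit does the local fit, and the data are illustrative.

```python
import numpy as np

def running_line(x, X, r, h):
    """Running line smoother: least-squares line on the points with |x - x^t| < h,
    evaluated at x."""
    w = np.abs(X - x) < h
    if w.sum() < 2:
        return np.nan                       # need at least two points for a line
    b1, b0 = np.polyfit(X[w], r[w], deg=1)  # slope, intercept
    return b0 + b1 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 100)
r = np.sin(X) + rng.normal(0, 0.2, 100)
print(running_line(3.0, X, r, h=0.5))
```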

Page 27: Running line smoother

Figure: running line fits of the same sample for several neighborhood widths.

Page 28: How to Choose k or h?

When k or h is small:
- Single instances matter; bias is small, variance is large (undersmoothing).
- High complexity.

As k or h increases:
- We average over more instances; variance decreases but bias increases (oversmoothing).
- Low complexity.

Cross-validation is used to fine-tune k or h, as sketched below.
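A sketch of choosing the kernel-smoother bandwidth by cross-validated squared error; the fold count, candidate grid, and data are illustrative choices.

```python
import numpy as np

def cv_bandwidth(X, r, candidates, n_folds=5, seed=0):
    """Return the bandwidth h with the lowest cross-validated squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)

    def smooth(x, Xtr, rtr, h):
        K = np.exp(-0.5 * ((x - Xtr) / h) ** 2)
        return (K * rtr).sum() / K.sum()

    errors = []
    for h in candidates:
        se = 0.0
        for val in folds:                              # held-out fold
            tr = np.setdiff1d(np.arange(len(X)), val)  # training indices
            se += sum((smooth(X[v], X[tr], r[tr], h) - r[v]) ** 2 for v in val)
        errors.append(se / len(X))
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
X = rng.uniform(0, 8, 200)
r = np.sin(X) + rng.normal(0, 0.2, 200)
print(cv_bandwidth(X, r, candidates=[0.1, 0.3, 0.5, 1.0, 2.0]))
```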

Page 29: Choosing the smoothing parameter

Figure: kernel estimates for various bin lengths for a two-class problem. Plotted are the conditional densities p(x|C_i). It seems that the top one oversmooths and the bottom undersmooths, but whichever is best will depend on where the validation data points are.