Last week Parametric Models Kernel Density Estimation Nearest-neighbour
Non-parametric Methods
Machine Learning
Torsten Möller
©Möller/Mori 1
Reading
• Chapter 2 of “Pattern Recognition and Machine Learning” by Bishop (with an emphasis on section 2.5)
Outline
Last week
Parametric Models
Kernel Density Estimation
Nearest-neighbour
Last week
• model selection / generalization, or the tale of finding good parameters
• curve fitting
• decision theory
• probability theory
Which Degree of Polynomial?
[Figure: four panels of polynomial fits to data, t vs. x with x ∈ [0, 1] and t ∈ [−1, 1], for increasing polynomial degree M]
• A model selection problem
• M = 9 → E(w∗) = 0: this is over-fitting
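The over-fitting behaviour above can be reproduced in a few lines. A minimal sketch, assuming the sin(2πx) toy data with Gaussian noise from Bishop Ch. 1 and using NumPy's polyfit for the least-squares fit; with 10 training points, the degree-9 polynomial interpolates them exactly (training error ≈ 0) while the test error is large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data in the style of Bishop Ch. 1: t = sin(2*pi*x) + noise (assumed)
def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, t

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def rms_error(w, x, t):
    # E_RMS = sqrt(2 E(w*) / N) with E(w) = 0.5 * sum of squared residuals,
    # which reduces to the root of the mean squared residual
    residuals = np.polyval(w, x) - t
    return np.sqrt(np.mean(residuals ** 2))

for M in [0, 1, 3, 9]:
    w = np.polyfit(x_train, t_train, M)   # least-squares fit of degree M
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

The training error decreases monotonically with M, but the test error eventually grows again: the model-selection problem from the slide.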
Generalization
[Figure: ERMS on training and test sets vs. polynomial degree M (0–9)]
• Generalization is the holy grail of ML
• Want good performance for new data
• Measure generalization using a separate test set
• Use root-mean-squared (RMS) error: ERMS = √(2E(w∗)/N)
Decision Theory
For a sample x, decide which class Ck it is from.

Ideas:
• Maximum Likelihood
• Minimum Loss/Cost (e.g. misclassification rate)
• Maximum A Posteriori (MAP)
Decision: Maximum Likelihood
• Inference step: determine statistics from training data: p(x, t) or p(x|Ck)
• Decision step: determine the optimal t for test input x:

t = arg maxk p(x|Ck)  (the likelihood)
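The two steps can be sketched for a toy one-dimensional problem. The data, class means, and variances below are assumptions for illustration; the inference step fits a Gaussian p(x|Ck) to each class, and the decision step takes the arg max over likelihoods:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D training data for two classes (assumed for illustration)
x_c0 = rng.normal(0.0, 1.0, 200)   # samples from class C0
x_c1 = rng.normal(3.0, 1.0, 200)   # samples from class C1

# Inference step: estimate the statistics of p(x | Ck) for each class
params = [(x.mean(), x.std()) for x in (x_c0, x_c1)]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ml_decide(x):
    # Decision step: pick the class with the highest likelihood p(x | Ck)
    likelihoods = [gaussian_pdf(x, mu, s) for mu, s in params]
    return int(np.argmax(likelihoods))

print(ml_decide(-0.5), ml_decide(3.2))   # → 0 1
```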
Parametric Models
• Problem statement: what is a good probabilistic model (probability distribution that produced) our data?
• Answer: density estimation — given a finite set x1, . . . , xN of observations, model the probability distribution p(x) of a random variable.
• One standard approach is parametric methods: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit its parameters based on the given observations.
• Today we will focus on non-parametric methods: rather than having a fixed set of parameters (e.g. a weight vector for regression, µ, Σ for a Gaussian) we have a possibly infinite set of parameters, based on each data point
• Kernel density estimation
• Nearest-neighbour methods
Histograms
• Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days
• We could build histograms of pixel values for each class
Histograms
[Figure: histogram density estimates of the same data for three different bin widths]
• E.g. for sunny days
• Count ni, the number of datapoints (pixels) with brightness value falling into each bin: pi = ni/(N∆i)
• Sensitive to bin width ∆i
• Discontinuous due to bin edges
• In D-dim space with M bins per dimension, M^D bins
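A minimal sketch of the estimator pi = ni/(N∆i), using NumPy's histogram and hypothetical brightness data; the normalisation makes the bar heights a proper density (the bars integrate to one):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.5, 0.15, 1000)   # hypothetical brightness values in [0, 1]

def histogram_density(data, n_bins):
    # p_i = n_i / (N * delta_i): normalise counts by N and the bin width
    counts, edges = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    delta = edges[1] - edges[0]
    return counts / (len(data) * delta), edges

p, edges = histogram_density(data, 20)
# The estimate integrates to (approximately) one over the histogram range
print(np.sum(p * np.diff(edges)))
```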
Local Density Estimation
• In a histogram we use nearby points to estimate density
• Assuming our space is divided into smallish regions R of size V, then their probability mass is

P = ∫R p(x) dx ≈ p(x)V

• We assume that p(x) is approximately constant in that region
• Further, the number of data points per region is roughly K ≈ NP
• Hence, for a small region around x, estimate the density as

p(x) = K/(NV)

• K is the number of points in the region, V is the volume of the region, N is the total number of datapoints
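A quick numerical check of p(x) ≈ K/(NV) in one dimension, using an assumed standard-Gaussian sample so the estimate can be compared against the true density:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: N samples from a standard Gaussian
N = 100_000
data = rng.normal(0.0, 1.0, N)

def local_density(data, x, half_width):
    # p(x) ≈ K / (N V): K points inside the region, V its volume
    # (in 1-D the "volume" is just the interval length)
    V = 2 * half_width
    K = np.count_nonzero(np.abs(data - x) < half_width)
    return K / (len(data) * V)

true_p0 = 1.0 / np.sqrt(2 * np.pi)   # standard normal density at x = 0
print(local_density(data, 0.0, 0.05), true_p0)
```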
Kernel Density Estimation
• First idea: keep the volume V of the neighbourhood constant
• Try to keep the idea of using nearby points to estimate density, but obtain a smoother estimate
• Estimate density by placing a small bump at each datapoint
• Kernel function k(·) determines the shape of these bumps
• Density estimate is

p(x) ∝ (1/N) ∑n=1..N k((x − xn)/h)
Kernel Density Estimation

[Figure: Gaussian kernel density estimates of the same data for three bandwidths h]

• Example using Gaussian kernel:

p(x) = (1/N) ∑n=1..N (2πh²)^(−1/2) exp{ −||x − xn||² / (2h²) }
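The Gaussian-kernel estimator above can be sketched directly; one Gaussian bump per datapoint, averaged. The bimodal sample below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical bimodal data (assumed for illustration)
data = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(1.0, 0.8, 150)])

def kde_gaussian(x, data, h):
    # p(x) = (1/N) * sum_n N(x | x_n, h^2): one Gaussian bump per datapoint
    diffs = x[:, None] - data[None, :]             # shape (len(x), N)
    bumps = np.exp(-0.5 * (diffs / h) ** 2) / np.sqrt(2 * np.pi * h ** 2)
    return bumps.mean(axis=1)

grid = np.linspace(-5, 5, 1001)
p = kde_gaussian(grid, data, h=0.3)
# Unlike the K-NN estimate later, this is a proper density: it integrates to ~1
print(np.sum(p) * (grid[1] - grid[0]))
```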
Kernel Density Estimation
[Figure: kernel density estimates for two example datasets]
• Other kernels: Rectangle, Triangle, Epanechnikov
• Fast at training time, slow at test time – keep all datapoints
• Sensitive to kernel bandwidth h
Nearest-neighbour

[Figure: K-nearest-neighbour density estimates of the same data for three values of K]

• Instead of relying on kernel bandwidth to get a proper density estimate, fix the number of nearby points K:

p(x) = K/(NV)

• Note: diverges, not a proper density estimate
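A sketch of the K-nearest-neighbour density estimate in one dimension, where the "volume" V is the length of the interval that just reaches the K-th nearest neighbour; the Gaussian sample is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, 2000)   # hypothetical 1-D sample

def knn_density(x, data, K):
    # Grow the region around x until it contains K points, then p(x) = K/(N V)
    dists = np.sort(np.abs(data - x))
    V = 2 * dists[K - 1]            # 1-D "volume": length of the interval
    return K / (len(data) * V)

true_p0 = 1.0 / np.sqrt(2 * np.pi)   # standard normal density at x = 0
print(knn_density(0.0, data, K=50), true_p0)
```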
Nearest-neighbour for Classification
[Figure: panels (a) and (b): K-nearest-neighbour classification in a two-dimensional feature space (x1, x2)]

• K-nearest-neighbour is often used for classification
• Classification: predict labels ti from xi
• e.g. xi ∈ R² and ti ∈ {0, 1}, 3-nearest-neighbour
• K = 1 is referred to as nearest-neighbour
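A minimal K-nearest-neighbour classifier by majority vote; the two-blob training data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical 2-D training data: two Gaussian blobs, labels 0 and 1
X0 = rng.normal([0.0, 0.0], 0.5, (50, 2))
X1 = rng.normal([2.0, 2.0], 0.5, (50, 2))
X = np.vstack([X0, X1])
t = np.array([0] * 50 + [1] * 50)

def knn_classify(x, X, t, K=3):
    # Label x by majority vote among its K nearest training points
    dists = np.linalg.norm(X - x, axis=1)
    nearest = t[np.argsort(dists)[:K]]
    return int(np.bincount(nearest).argmax())

print(knn_classify(np.array([0.2, -0.1]), X, t),
      knn_classify(np.array([1.9, 2.1]), X, t))   # → 0 1
```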
Nearest-neighbour for Classification
• Good baseline method
• Slow, but can use fancy data structures for efficiency (KD-trees, Locality Sensitive Hashing)
• Nice theoretical properties
• As we obtain more training data points, space becomes more filled with labelled data
• As N → ∞, the error is no more than twice the Bayes error
Bayes Error
[Figure: joint densities p(x, C1) and p(x, C2) over x; decision regions R1 and R2 split at x̂, optimal split at x0]

• Best classification possible given the features
• Two classes, PDFs shown
• Decision rule: C1 if x ≤ x̂; makes errors on the red, green, and blue regions
• Optimal decision rule: C1 if x ≤ x0; the Bayes error is the area of the green and blue regions
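The Bayes error can be computed numerically as the integral of the smaller of the two joint densities, i.e. the overlap that no decision rule can avoid. A sketch for an assumed two-Gaussian problem with equal priors:

```python
import numpy as np

# Hypothetical two-class problem (assumed for illustration), equal priors:
# p(x, C1) = 0.5 * N(x | 0, 1), p(x, C2) = 0.5 * N(x | 2, 1)
def gaussian(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-8, 10, 20001)
joint1 = 0.5 * gaussian(x, 0.0)
joint2 = 0.5 * gaussian(x, 2.0)

# Bayes error: integrate the pointwise minimum of the joint densities
dx = x[1] - x[0]
bayes_error = np.sum(np.minimum(joint1, joint2)) * dx
print(bayes_error)
```

In this equal-variance case the optimal threshold x0 lies midway between the means, and the numerical result matches the closed form Φ(−1) ≈ 0.159.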
Conclusion
• Readings: Ch. 2.5
• Kernel density estimation
• Model the density p(x) using kernels around the training datapoints
• Nearest neighbour
• Model the density or perform classification using the nearest training datapoints
• Multivariate Gaussian
• Needed for next week’s lectures; if you need a refresher read pp. 78–81