Last week Parametric Models Kernel Density Estimation Nearest-neighbour
Non-parametric Methods
Machine Learning
Torsten Möller
©Möller/Mori 1
Reading
• Chapter 2 of “Pattern Recognition and Machine Learning” by Bishop (with an emphasis on section 2.5)
Outline
Last week
Parametric Models
Kernel Density Estimation
Nearest-neighbour
Last week
• model selection / generalization, or the tale of finding good parameters
• curve fitting
• decision theory
• probability theory
Which Degree of Polynomial?
[Figure: four panels of polynomial fits to data, t vs. x with x ∈ [0, 1] and t ∈ [−1, 1], for increasing polynomial degree M]
• A model selection problem
• M = 9 → E(w∗) = 0: this is over-fitting
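The over-fitting behaviour above can be reproduced in a few lines. A minimal sketch, assuming the sin(2πx) toy data with Gaussian noise from Bishop Ch. 1 and using NumPy's polyfit for the least-squares fit; with 10 training points, the degree-9 polynomial interpolates them exactly (training error ≈ 0) while the test error is large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data in the style of Bishop Ch. 1: t = sin(2*pi*x) + noise (assumed)
def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, t

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def rms_error(w, x, t):
    # E_RMS = sqrt(2 E(w*) / N) with E(w) = 0.5 * sum of squared residuals,
    # which reduces to the root of the mean squared residual
    residuals = np.polyval(w, x) - t
    return np.sqrt(np.mean(residuals ** 2))

for M in [0, 1, 3, 9]:
    w = np.polyfit(x_train, t_train, M)   # least-squares fit of degree M
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

The training error decreases monotonically with M, but the test error eventually grows again: the model-selection problem from the slide.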
Generalization
[Figure: ERMS on training and test sets vs. polynomial degree M (0–9)]
• Generalization is the holy grail of ML
• Want good performance for new data
• Measure generalization using a separate test set
• Use root-mean-squared (RMS) error: ERMS = √(2E(w∗)/N)
Decision Theory
For a sample x, decide which class Ck it is from.

Ideas:
• Maximum Likelihood
• Minimum Loss/Cost (e.g. misclassification rate)
• Maximum A Posteriori (MAP)
Decision: Maximum Likelihood
• Inference step: determine statistics from training data: p(x, t) or p(x|Ck)
• Decision step: determine the optimal t for test input x:

t = arg maxk p(x|Ck)  (the likelihood)
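The two steps can be sketched for a toy one-dimensional problem. The data, class means, and variances below are assumptions for illustration; the inference step fits a Gaussian p(x|Ck) to each class, and the decision step takes the arg max over likelihoods:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D training data for two classes (assumed for illustration)
x_c0 = rng.normal(0.0, 1.0, 200)   # samples from class C0
x_c1 = rng.normal(3.0, 1.0, 200)   # samples from class C1

# Inference step: estimate the statistics of p(x | Ck) for each class
params = [(x.mean(), x.std()) for x in (x_c0, x_c1)]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ml_decide(x):
    # Decision step: pick the class with the highest likelihood p(x | Ck)
    likelihoods = [gaussian_pdf(x, mu, s) for mu, s in params]
    return int(np.argmax(likelihoods))

print(ml_decide(-0.5), ml_decide(3.2))   # → 0 1
```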
Parametric Models
• Problem statement: what is a good probabilistic model (probability distribution that produced) our data?
• Answer: density estimation — given a finite set x1, . . . , xN of observations, model the probability distribution p(x) of a random variable.
• One standard approach is parametric methods: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit its parameters based on the given observations.
• Today we will focus on non-parametric methods: rather than having a fixed set of parameters (e.g. a weight vector for regression, µ, Σ for a Gaussian) we have a possibly infinite set of parameters, based on each data point
• Kernel density estimation
• Nearest-neighbour methods
Histograms
• Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days
• We could build histograms of pixel values for each class
Histograms
[Figure: histogram density estimates of the same data for three different bin widths]
• E.g. for sunny days
• Count ni, the number of datapoints (pixels) with brightness value falling into each bin: pi = ni/(N∆i)
• Sensitive to bin width ∆i
• Discontinuous due to bin edges
• In D-dim space with M bins per dimension, M^D bins
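A minimal sketch of the estimator pi = ni/(N∆i), using NumPy's histogram and hypothetical brightness data; the normalisation makes the bar heights a proper density (the bars integrate to one):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.5, 0.15, 1000)   # hypothetical brightness values in [0, 1]

def histogram_density(data, n_bins):
    # p_i = n_i / (N * delta_i): normalise counts by N and the bin width
    counts, edges = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    delta = edges[1] - edges[0]
    return counts / (len(data) * delta), edges

p, edges = histogram_density(data, 20)
# The estimate integrates to (approximately) one over the histogram range
print(np.sum(p * np.diff(edges)))
```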
Local Density Estimation
• In a histogram we use nearby points to estimate density
• Assuming our space is divided into smallish regions R of size V, then their probability mass is

P = ∫R p(x) dx ≈ p(x)V

• We assume that p(x) is approximately constant in that region
• Further, the number of data points per region is roughly K ≈ NP
• Hence, for a small region around x, estimate the density as

p(x) = K/(NV)

• K is the number of points in the region, V is the volume of the region, N is the total number of datapoints
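A quick numerical check of p(x) ≈ K/(NV) in one dimension, using an assumed standard-Gaussian sample so the estimate can be compared against the true density:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: N samples from a standard Gaussian
N = 100_000
data = rng.normal(0.0, 1.0, N)

def local_density(data, x, half_width):
    # p(x) ≈ K / (N V): K points inside the region, V its volume
    # (in 1-D the "volume" is just the interval length)
    V = 2 * half_width
    K = np.count_nonzero(np.abs(data - x) < half_width)
    return K / (len(data) * V)

true_p0 = 1.0 / np.sqrt(2 * np.pi)   # standard normal density at x = 0
print(local_density(data, 0.0, 0.05), true_p0)
```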
Kernel Density Estimation
• First idea: keep the volume V of the neighbourhood constant
• Try to keep the idea of using nearby points to estimate density, but obtain a smoother estimate
• Estimate density by placing a small bump at each datapoint
• Kernel function k(·) determines the shape of these bumps
• Density estimate is

p(x) ∝ (1/N) ∑n=1..N k((x − xn)/h)
Kernel Density Estimation

[Figure: Gaussian kernel density estimates of the same data for three bandwidths h]

• Example using Gaussian kernel:

p(x) = (1/N) ∑n=1..N (2πh²)^(−1/2) exp{ −||x − xn||² / (2h²) }
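The Gaussian-kernel estimator above can be sketched directly; one Gaussian bump per datapoint, averaged. The bimodal sample below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical bimodal data (assumed for illustration)
data = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(1.0, 0.8, 150)])

def kde_gaussian(x, data, h):
    # p(x) = (1/N) * sum_n N(x | x_n, h^2): one Gaussian bump per datapoint
    diffs = x[:, None] - data[None, :]             # shape (len(x), N)
    bumps = np.exp(-0.5 * (diffs / h) ** 2) / np.sqrt(2 * np.pi * h ** 2)
    return bumps.mean(axis=1)

grid = np.linspace(-5, 5, 1001)
p = kde_gaussian(grid, data, h=0.3)
# Unlike the K-NN estimate later, this is a proper density: it integrates to ~1
print(np.sum(p) * (grid[1] - grid[0]))
```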
Kernel Density Estimation
[Figure: kernel density estimates for two example datasets]
• Other kernels: Rectangle, Triangle, Epanechnikov
• Fast at training time, slow at test time – keep all datapoints
• Sensitive to kernel bandwidth h
Nearest-neighbour

[Figure: K-nearest-neighbour density estimates of the same data for three values of K]

• Instead of relying on kernel bandwidth to get a proper density estimate, fix the number of nearby points K:

p(x) = K/(NV)

• Note: diverges, not a proper density estimate
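A sketch of the K-nearest-neighbour density estimate in one dimension, where the "volume" V is the length of the interval that just reaches the K-th nearest neighbour; the Gaussian sample is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, 2000)   # hypothetical 1-D sample

def knn_density(x, data, K):
    # Grow the region around x until it contains K points, then p(x) = K/(N V)
    dists = np.sort(np.abs(data - x))
    V = 2 * dists[K - 1]            # 1-D "volume": length of the interval
    return K / (len(data) * V)

true_p0 = 1.0 / np.sqrt(2 * np.pi)   # standard normal density at x = 0
print(knn_density(0.0, data, K=50), true_p0)
```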
Nearest-neighbour for Classification
[Figure: panels (a) and (b): K-nearest-neighbour classification in a two-dimensional feature space (x1, x2)]

• K-nearest-neighbour is often used for classification
• Classification: predict labels ti from xi
• e.g. xi ∈ R² and ti ∈ {0, 1}, 3-nearest-neighbour
• K = 1 is referred to as nearest-neighbour
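A minimal K-nearest-neighbour classifier by majority vote; the two-blob training data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical 2-D training data: two Gaussian blobs, labels 0 and 1
X0 = rng.normal([0.0, 0.0], 0.5, (50, 2))
X1 = rng.normal([2.0, 2.0], 0.5, (50, 2))
X = np.vstack([X0, X1])
t = np.array([0] * 50 + [1] * 50)

def knn_classify(x, X, t, K=3):
    # Label x by majority vote among its K nearest training points
    dists = np.linalg.norm(X - x, axis=1)
    nearest = t[np.argsort(dists)[:K]]
    return int(np.bincount(nearest).argmax())

print(knn_classify(np.array([0.2, -0.1]), X, t),
      knn_classify(np.array([1.9, 2.1]), X, t))   # → 0 1
```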
Nearest-neighbour for Classification
• Good baseline method
• Slow, but can use fancy data structures for efficiency (KD-trees, Locality Sensitive Hashing)
• Nice theoretical properties
• As we obtain more training data points, space becomes more filled with labelled data
• As N → ∞, the error is no more than twice the Bayes error
Bayes Error
[Figure: joint densities p(x, C1) and p(x, C2) over x; decision regions R1 and R2 split at x̂, optimal split at x0]

• Best classification possible given the features
• Two classes, PDFs shown
• Decision rule: C1 if x ≤ x̂; makes errors on the red, green, and blue regions
• Optimal decision rule: C1 if x ≤ x0; the Bayes error is the area of the green and blue regions
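The Bayes error can be computed numerically as the integral of the smaller of the two joint densities, i.e. the overlap that no decision rule can avoid. A sketch for an assumed two-Gaussian problem with equal priors:

```python
import numpy as np

# Hypothetical two-class problem (assumed for illustration), equal priors:
# p(x, C1) = 0.5 * N(x | 0, 1), p(x, C2) = 0.5 * N(x | 2, 1)
def gaussian(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-8, 10, 20001)
joint1 = 0.5 * gaussian(x, 0.0)
joint2 = 0.5 * gaussian(x, 2.0)

# Bayes error: integrate the pointwise minimum of the joint densities
dx = x[1] - x[0]
bayes_error = np.sum(np.minimum(joint1, joint2)) * dx
print(bayes_error)
```

In this equal-variance case the optimal threshold x0 lies midway between the means, and the numerical result matches the closed form Φ(−1) ≈ 0.159.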
Conclusion
• Readings: Ch. 2.5
• Kernel density estimation
• Model the density p(x) using kernels around the training datapoints
• Nearest neighbour
• Model the density or perform classification using the nearest training datapoints
• Multivariate Gaussian
• Needed for next week’s lectures; if you need a refresher read pp. 78–81