Nonparametric Density Estimation (one dimension) › tine › nonpar › Nonparametric...Derivation ^f h(x) = 1 2hn #fXi 2 [x ¡h;x +h)g 1 2hn Xn i=1 I (jx ¡Xij • h) 1 hn Xn i=1

Nonparametric Density Estimation(one dimension)

Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction

Nonparametric kernel density estimation

Tine Buch-Kromann

February 7, 2007

Nonparametric kernel density estimation Nonparametric Density Estimation (one dimension)

Derivation

Idea of the histogram:

1

n · interval length#{obs that fall into a small interval containing x}

Idea of the kernel density estimator:

1

n · interval length#{obs that fall into a small interval around x}

Note: That the estimator does not depends on a origin of a bin grid.

When we want to estimate the density in x , we consider theinterval [x − h, x + h):

f̂h(x) =1

2hn#{Xi ∈ [x − h, x + h)}

Note: Interval length is 2h.


Derivation

f̂h(x) =1

2hn#{Xi ∈ [x − h, x + h)}

=1

2hn

n∑

i=1

I (|x − Xi | ≤ h)

=1

hn

n∑

i=1

1

2I

(∣∣∣∣x − Xi

h

∣∣∣∣ ≤ 1)

=1

hn

n∑

i=1

K

(x − Xi

h

)

where K (u) = 12 I (|u| ≤ 1) is the uniform kernel function, whichassigns weight 1/2 to each observation in the interval around x .Points outside the interval assigns the weight 0.


Derivation

Improvement:More weight to observations very close to x and less weight toobservations farther away x (The Epanechnikov kernel function)

K (u) =3

4(1− u2)I (|u| ≤ 1)

Kernel K (u)

Uniform 12 I (|u| ≤ 1)Triangle (1− |u|)I (|u| ≤ 1)Epanechnikov 34(1− u2)I (|u| ≤ 1)Quartic 1516(1− u2)2I (|u| ≤ 1)Triweight 3532(1− u2)3I (|u| ≤ 1)Gaussian 1√

2πexp(−12u2)

Cosine π4 cos(

π2 u

)I (|u| ≤ 1)


Derivation

Kernel Density EstimatorGeneral form of the kernel density estimator of a probabilitydensity f , based on a sample X1, ...,Xn

f̂h(x) =1

n

n∑

i=1

Kh(x − Xi )

where Kh(·) = 1hK (·/h) and h is the bandwidth.


Bandwidth

Bandwidth, h:h controls the smoothness of the estimate (similar to thehistogram) and the choice of h is a crucial problem.

The problem of how to determine the value of h is a reasonableway is handled later.


The Kernel Function

Properties:Kernel functions are usually probability density function:

∫K (u) du = 1

K (u) ≥ 0 ∀u in the domain of K

Consequences:The kernel density estimator is a pdf

∫f̂h(x) dx = 1

Moreover, f̂h(x) inherits all the continuity and differentiabilityproperties of K : If K is ν times continuously differentiable thenalso f̂h(x) is ν times cont. diff.


The Kernel Function

The smoothness of f̂h(x) (for the same value of h) depends ofkernel function.

...even if both kernel functions are continuous


Kernel Density Estimator as a Sum of Bumps

An alternative view of the kernel density estimation.Rescaled kernel function

1

nhK

(x − Xi

h

)=

1

nKh (x − Xi )

Note: The area under the rescaled kernel function is∫

1

nhK

(x − Xi

h

)dx =

1

nh

∫K (u)h du

=1

nhh

∫K (u) du

=1

n


Rescaled kernel function

Rewrite the kernel density function as the sum of rescaledkernel functions

f̂h(x) =1

n

n∑

i=1

Kh(x − Xi ) =n∑

i=1

1

nhK

(x − Xi

h

)


Statistical Properties: Bias

Bias:

Bias(f̂h(x)) =h2

2f ′′(x)µ2(K ) + o(h2), h → 0

where µ2(K ) =∫

s2K (s) ds.

The bias is proportional to h2:Choose a small h to reduce the bias.


Statistical Properties: Bias

Bias depends on f ′′(x) (curvature):In peaks of f : The bias < 0 since f ′′ < 0 around a local maximumof f .In ”valleys” of f : The bias > 0 since f ′′ > 0 around a local mini-mum of f .The magnitude of the bias depends of the absolute value of f ′′.


Statistical Properties: Variance

Variance:

V(f̂h(x)) =1

nh||K ||22f (x) + o

(1

nh

), nh →∞

where ||K ||22 =∫

K 2(s) ds, the squared L2 norm of K .

The variance is proportional to (nh)−1:

I Choose large h to reduce the variance.I Increase n to reduce the variance.I The variance increases in ||K ||22: Flat kernels reduce the

variance.


Statistical Properties: MSE

Trade-off between bias and variance:Increasing h will lower the variance but increases the bias and viceversa.

Minimizing the MSE is a compromise between over- andundersmoothing.


Statistical Properties: MSE

Mean Squared Error:

MSE(f̂h(x)) =h4

4f ′′(x)2µ2(K )2 +

1

nh||K ||22f (x) + o(h4) + o

(1

nh

)

Note: MSE → 0 as h → 0 and nh →∞,ie. the kernel density estimator is a consistent estimator.


Statistical Properties: MISE and AMISE

MISE: Mean Integrated Squared Error

MISE(f̂h) =1

nh||K ||22 +

h4

4µ2(K )

2||f ′′||22 + o(

1

nh

)+ o(h4),

as h → 0, nh →∞.AMISE: Approx. formula for MISE, ignoring higher order terms

AMISE(f̂h) =1

nh||K ||22 +

h4

4µ22(K )||f ′′||22


Statistical Properties: Optimal bandwidth

Bandwidth: Optimal wrt. AMISE

hopt =

( ||K ||22||f ′′||22 µ2(K )2 n

)1/5∼ n−1/5

Depends on ||f ′′||22 - unknown.Convergence of AMISE:

AMISE(f̂hopt ) =5

4

(||K ||22)4/5 (

µ2(K )||f ′′||2)2/5

n−4/5 ∼ n−4/5

Note: AMISE converges at the rate n−4/5.


Statistical Properties: Comparison with the histogram

AMISE: Histogram

AMISE(f̂h) =1

nh+

h2

12||f ′||22

Bandwidth: Optimal wrt. AMISE (histogram)

h0 =

(6

n||f ′||22

)1/3∼ n−1/3

AMISE converges at the rate n−2/3 in the histogram.

Slower rate of convergence in the histogram compared to thekernel density estimator.


Bandwidth selection

The two most frequently used method of banwidth selection:

I The plug-in method,I Cross-validation methods.


Bandwidth selection: Silverman’s Rule of Thumb

Plug-in methods:Replace unknown parameters with estimates.

AMISE optimal bandwidth

hopt =

( ||K ||22||f ′′||22 µ2(K )2 n

)1/5

Unknown parameter is ||f ′′||22.Assume f is a normal(µ, σ2)-distribution, then

||f ′′||22 = σ−53

8√

π≈ 0.212σ−5

Replace the σ with σ̂.



Choose kernel function: Gaussian kernel.

Then the ”Rule-of-Thumb” bandwidth

ĥrot =

(||ϕ||22

||f̂ ′′||22 µ22(ϕ) n

)1/5

=

(4σ̂5

3n

)1/5

≈ 1.06σ̂n−1/5

Applicable formula for bandwidth selection.If X is normally distributed, then ĥrot gives the optimal bandwidth.If X is not normally distributed, then ĥrot will give a bandwidth nottoo far from the optimal bandwidth, if the distribution of X is nottoo different from the normal distribution.



Practical problem:The Rule-of-Thumb bandwidths is sensitive to outliers:A single outlier may cause a too large estimator of σ and hence atoo large bandwidth.

Robust estimator:Calculate

R = X[0.75n]︸︷︷︸75%−quantile

− X[0.25n]︸︷︷︸25%−quantile

We assume X ∼ N(µ, σ2) and Z ∼ N(0, 1), thereforeR = X[0.75n] − X[0.25n]

= (µ + σZ[0.75n])− (µ + σZ[0.25n])= σ(Z[0.75n] − Z[0.25n])≈ σ(0.67− (−0.67))= 1.34σ



Therefore

σ̂ =R

1.34

Plug it into the ”Rule-of-Thumb” bandwidth

ĥrot ≈ = 1.06σ̂n−1/5

= 1.06R

1.34n−1/5

≈ 0.79R̂n−1/n

”Better-Rule-of-Thumb”:Combine the fist and the robust Rule-of-Thumb

ĥrot = 1.06 min

(σ̂,

R

1.34

)n−1/5


Documents

Nonparametric Density Estimation (one dimension) › tine › nonpar › Nonparametric...Derivation ^f h(x) = 1 2hn #fXi 2 [x ¡h;x +h)g 1 2hn Xn i=1 I (jx ¡Xij • h) 1 hn Xn i=1