
Density Estimation

Density Estimation:

• Deals with the problem of estimating probability density functions (PDFs) based on some data sampled from the PDF.
• May use assumed forms of the distribution, parameterized in some way (parametric statistics); or
• May avoid making assumptions about the form of the PDF (non-parametric statistics).

We are concerned more here with the non-parametric case (see Roger Barlow's lectures for parametric statistics).

Frank Porter, SLUO Lectures on Statistics, 15–17 August 2006


Some References (I)

Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).

David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).

Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).

B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986); http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

K. S. Cranmer, "Kernel Estimation in High Energy Physics", Comp. Phys. Comm. 136, 198 (2001) [hep-ex/0011057v1]; http://arxiv.org/PS_cache/hep-ex/pdf/0011/0011057.pdf


Some References (II)

M. Pivk & F. R. Le Diberder, "sPlot: a statistical tool to unfold data distributions", Nucl. Instr. Meth. A 555, 356 (2005).

R. Cahn, "How sPlots are Best" (2005), http://babar-hn.slac.stanford.edu:5090/hn/aux/auxvol01/rncahn/rev_splots_best.pdf

BaBar Statistics Working Group, "Recommendations for Display of Projections in Multi-Dimensional Analyses", http://www.slac.stanford.edu/BFROOT/www/Physics/Analysis/Statistics/Documents/MDgraphRec.pdf

Additional specific references will be noted in the course of the lectures.


Preliminaries

• We'll couch discussion in terms of observations (dataset) from some "experiment". Our dataset consists of the values $x_i,\ i = 1, 2, \ldots, n$.
  – Our dataset consists of repeated samplings from a (presumed unknown) probability distribution, IID ≡ "Independent, Identically Distributed".
• We'll note generalizations here and there.
  – Order is not important; if we are discussing a time series, we could introduce ordered pairs $\{(x_i, t_i),\ i = 1, \ldots, n\}$ and call it two-dimensional [but beware the correlations then; probably not IID!].
  – In general, our quantities can be multi-dimensional; no special notation will be used to distinguish one- from multi-variate cases. We'll discuss where issues enter with dimensionality.


Notation

At our convenience we may use "$E$", "$\langle\,\cdot\,\rangle$", and an overbar all to mean "expectation":

$$E(x) \equiv \bar{x} \equiv \langle x \rangle \equiv \int x\, p(x)\, dx,$$

where $p(x)$ is the probability density function (PDF) for $x$ (or, more generally, $p(x)\,dx \to \mu(dx)$ is the probability measure).

Estimators are denoted with a "hat": in these lectures, we'll be concerned with estimators for the density function itself, hence $\hat{p}(x)$ is a random variable giving our estimate for $p(x)$.

We will not be especially rigorous. For example, we won't make a notational distinction between the random variable and an instance.


Motivation

Why do we want to estimate densities?
– Well, that is the whole point...

Harder question: Why non-parametric estimates?
– Comparison with models (which may be parametric)
– May be easier/better than parametric modeling for efficiency corrections and background subtraction
– Visualization
– "Unfolding"
– Comparing samples


R, A Toolkit, er, Language, You Might be Interested In...

The S Language: developed with statistical analysis of data in mind.

> x <- rnorm(100, 10, 1)
> hist(x, xlim = range(5, 15))

[Figure: the resulting histogram of x, Frequency vs. x.]

Free, open-source version is R, from the R Project. Downloads available for Linux/MacOS X/Windows, e.g., at: http://cran.cnr.berkeley.edu/

Commercial version is S-Plus, at http://www.insightful.com/


Empirical Probability Density Function

Place a delta function at each data point. The estimator (EPDF, for "Empirical Probability Density Function") is

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i).$$

[Figure: EPDF of a sample, a spike at each data value, 0 < x < 1000.]

Note that $x$ could be multi-dimensional here.

This is "the" sampling density for the bootstrap (more later; also see Ilya Narsky's lectures).


The Histogram

Perhaps our most ubiquitous density estimator is the histogram:

$$h(x) = \sum_{i=1}^{n} B(x - \tilde{x}_i; w),$$

where $\tilde{x}_i$ is the center of the bin in which observation $x_i$ lies, $w$ is the bin width, and

$$B(x; w) = \begin{cases} 1 & x \in (-w/2,\, w/2) \\ 0 & \text{otherwise} \end{cases}$$

(called the "indicator function" in probability).

[Figure: the bin function $B(x - \tilde{x}_i; w)$, a box of unit height and width $w$ centered on the bin center $\tilde{x}_i$.]

This is written for uniform bin widths, but may be generalized to differing widths with appropriate relative normalization factors. The estimator for the probability density function (PDF) is:

$$\hat{p}(x) = \frac{1}{nw}\, h(x).$$
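A quick R check of this normalization (toy data; the bin width is arbitrary) — for uniform bins, hist() reports exactly $h(x)/(nw)$ as its density:

# Histogram as density estimator: p-hat(x) = h(x)/(n w) in each bin.
set.seed(2)
x <- rnorm(1000)
w <- 0.5                                          # bin width (arbitrary here)
breaks <- seq(floor(min(x)), ceiling(max(x)), by = w)
h <- hist(x, breaks = breaks, plot = FALSE)
phat <- h$counts / (length(x) * w)                # p-hat = h(x)/(n w)
all.equal(phat, h$density)                        # TRUE: same normalization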


Histogram Example

[Figure: left, EPDF of the data, 0 < x < 1000; right, histogram of the same data, Events/10 MeV vs. $m(p\pi) - m(p) - m(\pi)$.]

Left: EPDF; Right: Histogram with w = 10 MeV.

[Actual sampling is 100 points from a ∆(1232) Breit-Wigner (Cauchy) on a second-order polynomial background. Background probability is 50%.]


Criticisms of Histogram as Density Estimator

• Discontinuous even if PDF is continuous.
• Dependence on bin size and bin origin.
• Information from location of datum within a bin is ignored.


Kernel Estimation

Take the histogram, but replace "bin" function $B$ with something else:

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} k(x - x_i; w),$$

where $k(x; w)$ is the "kernel function", normalized to unity:

$$\int_{-\infty}^{\infty} k(x; w)\, dx = 1.$$

Usually we are interested in kernels of the form

$$k(x - x_i; w) = \frac{1}{w} K\!\left(\frac{x - x_i}{w}\right);$$

indeed this may be used as the definition of "kernel". The kernel estimator for the PDF is then:

$$\hat{p}(x) = \frac{1}{nw}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right).$$

The role of the parameter $w$ as a smoothing parameter is now clearer.
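A minimal R sketch of this estimator with a Gaussian kernel (toy data; $w$ arbitrary), checked against the built-in density():

# Kernel estimator p-hat(x) = (1/nw) sum_i K((x - x_i)/w), Gaussian K.
set.seed(3)
x <- rnorm(200)
w <- 0.3                                   # smoothing parameter (arbitrary)
phat <- function(x0) mean(dnorm((x0 - x) / w)) / w
grid <- seq(-4, 4, length.out = 101)
p1 <- sapply(grid, phat)
p2 <- density(x, bw = w, kernel = "gaussian", from = -4, to = 4, n = 101)$y
max(abs(p1 - p2))    # small; density() bins the data and uses an FFT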


Multi-Variate Kernel Estimation

Explicit multi-variate case, $d = 2$ dimensions:

$$\hat{p}(x, y) = \frac{1}{n w_x w_y}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w_x}\right) K\!\left(\frac{y - y_i}{w_y}\right).$$

This is a "product kernel" form, with the same kernel in each dimension, except for possibly different smoothing parameters. It does not include correlations between the dimensions.

The kernels we have introduced are classified more explicitly as "fixed kernels": the smoothing parameter is independent of $x$.
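A sketch of the product-kernel form above in R (Gaussian in each coordinate; the data and bandwidths below are invented):

# Product-kernel estimate of p(x, y); no correlations between coordinates.
kde2 <- function(x0, y0, x, y, wx, wy)
  mean(dnorm((x0 - x) / wx) * dnorm((y0 - y) / wy)) / (wx * wy)
set.seed(4)
x <- rnorm(500)
y <- rnorm(500, sd = 2)
kde2(0, 0, x, y, wx = 0.3, wy = 0.6)   # estimate of p(0, 0)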


Ideogram

A simple variant on the kernel idea is to permit the kernel to depend on additional knowledge in the data.

• Physicists call this an "ideogram".
• Most common is the "Gaussian ideogram", in which each data point is entered as a Gaussian of area one and standard deviation appropriate to that datum.
• This addresses a way that the IID assumption might be broken.

[Aside: Be careful to get your likelihood function right if you are incorporating variable resolution information in your fits; see, e.g., Punzi: http://www.slac.stanford.edu/econf/C030908/papers/WELT002.pdf]
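A minimal R sketch of a Gaussian ideogram (the per-point resolutions sigma are invented for illustration):

# Gaussian ideogram: datum i enters as a unit-area Gaussian of width sigma_i.
ideogram <- function(x0, x, sigma) mean(dnorm(x0, mean = x, sd = sigma))
set.seed(5)
x <- rnorm(100, 10, 1)
sigma <- runif(100, 0.1, 0.5)          # per-point resolutions (invented)
grid <- seq(6, 14, length.out = 200)
plot(grid, sapply(grid, ideogram, x = x, sigma = sigma), type = "l")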


Sample Ideograms (II)

Figure 1. A histogram of magnetic field values (black), compared with a smoothed frequency distribution constructed using a Gaussian ideogram technique (red). Note the detailed comparison.

(from J. S. Halekas et al., "Magnetic Properties of Lunar Geologic Terranes: New Statistical Results", Lunar and Planetary Science XXXIII (2002), 1368.pdf)


Parametric vs non-Parametric Density Estimation (I)

Distinction is fuzzy.

A histogram is non-parametric, in the sense that no assumption about the form of the sampling distribution is made.
– Often there is an implicit assumption that the distribution is "smooth" on a scale smaller than the bin size. For example, we know something about the resolution of our apparatus.

But the estimator of the parent distribution made with a histogram is parametric: the parameters are the populations (or frequencies) in each bin. The estimators for those parameters are the observed histogram populations. Even more parameters than a typical parametric fit!


Parametric vs non-Parametric Density Estimation (II)

Essence of the difference may be captured in notions of "local" and "non-local":

If a datum at $x_i$ influences the density estimator at some other point $x$, this is non-local. A non-parametric estimator is one in which the influence of a point at $x_i$ on the estimate at any $x$ with $d(x_i, x) > \epsilon$ vanishes, asymptotically.†

Notice that for a kernel estimator, the bigger the smoothing parameter $w$, the more non-local the estimator:

$$\hat{p}(x) = \frac{1}{nw}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right).$$

†As we'll discuss, the "optimal" choice of smoothing parameter depends on $n$.


Optimization

We would like to make an optimal density estimate from our data. What does that mean?
– Need a criterion for "optimal".
– Choice of criterion is subjective; it depends on what you want to achieve.

We may compare the estimator for a quantity (here, the value of the density at $x$) with the true value:

$$\Delta(x) = \hat{f}(x) - f(x).$$

[Figure: the true density $f(x)$, an estimate $\hat{f}(x)$, and their difference $\Delta(x)$.]


Mean Squared Error (I)

A common choice in parametric estimation is to minimize the sum of the squares. We may take this idea over here, and form the "Mean Squared Error" (MSE):

$$\mathrm{MSE}[\hat{f}(x)] \equiv \left\langle \left(\hat{f}(x) - f(x)\right)^2 \right\rangle = \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)],$$

where

$$\mathrm{Var}[\hat{f}(x)] \equiv E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right], \qquad \mathrm{Bias}[\hat{f}(x)] \equiv E[\hat{f}(x)] - f(x).$$
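A Monte Carlo sketch of this decomposition in R (assuming a standard normal true PDF and a Gaussian-kernel estimator; $w$ and $n$ arbitrary) — repeating the "experiment" many times, the two pieces add up to the MSE:

# Check MSE = Var + Bias^2 for a Gaussian KDE at the point x0.
set.seed(6)
x0 <- 0; w <- 0.5; n <- 100
f.true <- dnorm(x0)                      # true PDF at x0 (toy setup)
fhat <- replicate(2000, { x <- rnorm(n); mean(dnorm((x0 - x) / w)) / w })
b <- mean(fhat) - f.true                 # Bias[f-hat(x0)]
v <- mean((fhat - mean(fhat))^2)         # Var[f-hat(x0)]
c(MSE = mean((fhat - f.true)^2), VarPlusBias2 = v + b^2)   # equal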


Mean Squared Error (II)

Since this isn't quite our familiar parameter estimation, let's take a little time to make sure it is understood.

Suppose $\hat{p}(x)$ is an estimator for the PDF $p(x)$, based on data $\{x_i;\ i = 1, \ldots, n\}$, IID from $p(x)$. Then

$$E[\hat{p}(x)] = \int\cdots\int \hat{p}(x; \{x_i\})\, \mathrm{Prob}(\{x_i\})\, d^n(\{x_i\}) = \int\cdots\int \hat{p}(x; \{x_i\}) \prod_{i=1}^{n} [p(x_i)\, dx_i].$$


Exercise: Proof of formula for the MSE

$$\begin{aligned}
\mathrm{MSE}[\hat{f}(x)] &= \left\langle \left(\hat{f}(x) - f(x)\right)^2 \right\rangle \\
&= \int\cdots\int \left(\hat{f}(x; \{x_i\}) - f(x)\right)^2 \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \int\cdots\int \left(\hat{f}(x; \{x_i\}) - E(\hat{f}) + E(\hat{f}) - f(x)\right)^2 \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \int\cdots\int \left[\left(\hat{f}(x; \{x_i\}) - E(\hat{f})\right)^2 + \left(E(\hat{f}) - f(x)\right)^2 + 2\left(\hat{f}(x; \{x_i\}) - E(\hat{f})\right)\left(E(\hat{f}) - f(x)\right)\right] \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)] + 0,
\end{aligned}$$

since the cross term integrates to zero by the definition of $E(\hat{f})$.

[In typical treatments of parametric statistics, we assume unbiased estimators, hence the "Bias" term is zero. That isn't a good assumption here.]


The Problem With Smoothing (I)

Thm: [Rosenblatt (1956)] A uniform minimum variance unbiased estimator for $p(x)$ does not exist.

Unbiased:
$$E[\hat{p}(x)] = p(x).$$

Uniform minimum variance:
$$\mathrm{Var}[\hat{p}(x) \mid p(x)] \le \mathrm{Var}[\hat{q}(x) \mid p(x)], \quad \forall x,$$
for all $p(x)$, where $\hat{q}(x)$ is any other estimator of $p(x)$.


The Problem With Smoothing (II)

For example, suppose we have a kernel estimator:

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} k(x - x_i; w).$$

Its expectation is:

$$E[\hat{p}(x)] = \frac{1}{n}\sum_{i=1}^{n} \int k(x - x_i; w)\, p(x_i)\, dx_i = \int k(x - y; w)\, p(y)\, dy.$$

Unless $k(x - y) = \delta(x - y)$, $\hat{p}(x)$ will be biased for some $p(x)$. But $\delta(x - y)$ has infinite variance.
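A quick numeric check of this convolution in R (assuming a Gaussian kernel and a standard normal $p(y)$; crude quadrature): at the peak, the expectation falls below the true density, i.e., the estimator is biased there.

# E[p-hat(x)] = integral of k(x - y; w) p(y) dy; evaluate at the peak x = 0.
w <- 0.5
y <- seq(-10, 10, by = 0.01)
Ephat0 <- sum(dnorm((0 - y) / w) / w * dnorm(y)) * 0.01   # crude quadrature
c(true = dnorm(0), smeared = Ephat0)    # smeared < true: biased at the peak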


The Problem with Smoothing (III)

So the nice properties we strive for in parameter estimation (and sometimes achieve) are beyond reach.

Intuition: smoothing lowers peaks and fills in valleys.

[Figure: red curve, the PDF; histogram, a sampling from the PDF; black curve, Gaussian kernel estimator for the PDF.]


Comment on Number of Bins in Histogram

Note: "Sturges' rule," based on optimizing MSE, was used in deciding how many bins, $k$, to make in the histogram:

$$k = 1 + \log_2 n.$$

The argument behind this rule has been criticized (1995):
http://www-personal.buseco.monash.edu.au/~hyndman/papers/sturges.pdf

Indeed we see in our example that we would have "by hand" selected more bins; our histogram is "over-smoothed". There are other rules for optimizing the number of bins. For example, "Scott's rule" for the bin width is:

$$w = 3.5\, s\, n^{-1/3},$$

where $s$ is the sample standard deviation.

[More later]
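Both rules happen to be built into R (grDevices), so a comparison on toy data is one line each:

# Sturges vs. Scott on a toy sample.
set.seed(7)
x <- rnorm(100)
nclass.Sturges(x)                  # k = ceiling(1 + log2(n)) = 8 for n = 100
3.5 * sd(x) * length(x)^(-1/3)     # Scott's bin width w = 3.5 s n^(-1/3)
nclass.scott(x)                    # the corresponding number of bins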


Dependence on Smoothing Parameter

Plot showing the effect of the choice of smoothing parameter:

[Figure: kernel density estimates of the sample, Frequency vs. x.]
Red: Sampling PDF
Black: Default smoothing (w)
Blue: w/2 smoothing
Turquoise: w/4 smoothing
Green: 2w smoothing
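A scan like this can be reproduced with R's density(), whose adjust argument scales the default bandwidth (toy data; colors follow the legend above):

# Vary the smoothing parameter around the default bandwidth w.
set.seed(8)
x <- rnorm(400, 9, 2)
plot(density(x), main = "Smoothing parameter scan")   # default w (black)
lines(density(x, adjust = 0.5),  col = "blue")        # w/2
lines(density(x, adjust = 0.25), col = "turquoise")   # w/4
lines(density(x, adjust = 2),    col = "green")       # 2w
curve(dnorm(x, 9, 2), add = TRUE, col = "red")        # sampling PDF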


The Curse of Dimensionality

Roger Barlow gave a nice example of the impact of the "Curse of Dimensionality" in parametric statistics. It is a significant affliction in density estimation as well.

• Difficult to display and visualize as the number of dimensions increases.
• "All" the volume (of a bounded region) goes to the boundary (exponentially!) as the number of dimensions increases; i.e., data becomes "sparse". A central region of half the linear size contains a volume fraction $1/2,\ 1/4,\ 1/8,\ \ldots,\ 1/2^d$ for $d = 1, 2, 3, \ldots$ (see the sketch below).
• Tendency for exponentially growing computation requirements with dimensions.
• Even worse than parametric statistics.
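The volume fractions in that sequence are just $2^{-d}$; a one-line R illustration (dimensions chosen arbitrarily):

# Fraction of a unit cube's volume inside the central cube of half the side:
d <- c(1, 2, 3, 10, 100)
setNames(0.5^d, paste0("d = ", d))   # virtually nothing is "central" for large d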


Summary

We have introduced:
• Basic notions in (non-parametric) density estimation
• Some simple variations on the theme
• A foundation towards optimization
• An idea of where and how things will fail

Next: Further sophistication on these ideas, and introduction of other variations in approach and application.
