
Density Estimation

Density Estimation:

• Deals with the problem of estimating probability density functions (PDFs) based on some data sampled from the PDF.
• May use assumed forms of the distribution, parameterized in some way (parametric statistics); or
• May avoid making assumptions about the form of the PDF (non-parametric statistics).

We are concerned more here with the non-parametric case (see Roger Barlow's lectures for parametric statistics).

Frank Porter, SLUO Lectures on Statistics, 15–17 August 2006


Some References (I)

Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).

David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).

Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).

B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986); http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

K. S. Cranmer, "Kernel Estimation in High Energy Physics", Comp. Phys. Comm. 136, 198 (2001) [hep-ex/0011057v1]; http://arxiv.org/PS_cache/hep-ex/pdf/0011/0011057.pdf


Some References (II)

M. Pivk & F. R. Le Diberder, "sPlot: a statistical tool to unfold data distributions", Nucl. Instr. Meth. A 555, 356 (2005).

R. Cahn, "How sPlots are Best" (2005), http://babar-hn.slac.stanford.edu:5090/hn/aux/auxvol01/rncahn/rev_splots_best.pdf

BaBar Statistics Working Group, "Recommendations for Display of Projections in Multi-Dimensional Analyses", http://www.slac.stanford.edu/BFROOT/www/Physics/Analysis/Statistics/Documents/MDgraphRec.pdf

Additional specific references will be noted in the course of the lectures.


Preliminaries

• We'll couch discussion in terms of observations (dataset) from some "experiment". Our dataset consists of the values $x_i,\ i = 1, 2, \ldots, n$.
  – Our dataset consists of repeated samplings from a (presumed unknown) probability distribution, IID ≡ "Independent, Identically Distributed".
• We'll note generalizations here and there.
  – Order is not important; if we are discussing a time series, we could introduce ordered pairs $\{(x_i, t_i),\ i = 1, \ldots, n\}$ and call it two-dimensional [but beware the correlations then; probably not IID!].
  – In general, our quantities can be multi-dimensional; no special notation will be used to distinguish one- from multi-variate cases. We'll discuss where issues enter with dimensionality.


Notation

At our convenience we may use "$E$", "$\langle\,\cdot\,\rangle$", and an overbar all to mean "expectation":

$$E(x) \equiv \bar{x} \equiv \langle x \rangle \equiv \int x\, p(x)\, dx,$$

where $p(x)$ is the probability density function (PDF) for $x$ (or, more generally, $p(x)\,dx \to \mu(dx)$ is the probability measure).

Estimators are denoted with a "hat": in these lectures, we'll be concerned with estimators for the density function itself, hence $\hat{p}(x)$ is a random variable giving our estimate for $p(x)$.

We will not be especially rigorous. For example, we won't make a notational distinction between the random variable and an instance.


Motivation

Why do we want to estimate densities?
– Well, that is the whole point...

Harder question: Why non-parametric estimates?
– Comparison with models (which may be parametric)
– May be easier/better than parametric modeling for efficiency corrections and background subtraction
– Visualization
– "Unfolding"
– Comparing samples


R, A Toolkit, er, Language, You Might be Interested In...

The S Language: developed with statistical analysis of data in mind.

> x <- rnorm(100, 10, 1)
> hist(x, xlim = range(5, 15))

[Figure: the resulting histogram of x, Frequency vs. x.]

Free, open-source version is R, from the R Project. Downloads available for Linux/MacOS X/Windows, e.g., at: http://cran.cnr.berkeley.edu/

Commercial version is S-Plus, at http://www.insightful.com/


Empirical Probability Density Function

Place a delta function at each data point. The estimator (EPDF, for "Empirical Probability Density Function") is

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i).$$

[Figure: EPDF of a sample, a spike at each data value, 0 < x < 1000.]

Note that $x$ could be multi-dimensional here.

This is "the" sampling density for the bootstrap (more later; also see Ilya Narsky's lectures).


The Histogram

Perhaps our most ubiquitous density estimator is the histogram:

$$h(x) = \sum_{i=1}^{n} B(x - \tilde{x}_i; w),$$

where $\tilde{x}_i$ is the center of the bin in which observation $x_i$ lies, $w$ is the bin width, and

$$B(x; w) = \begin{cases} 1 & x \in (-w/2,\, w/2) \\ 0 & \text{otherwise} \end{cases}$$

(called the "indicator function" in probability).

[Figure: the bin function $B(x - \tilde{x}_i; w)$, a box of unit height and width $w$ centered on the bin center $\tilde{x}_i$.]

This is written for uniform bin widths, but may be generalized to differing widths with appropriate relative normalization factors. The estimator for the probability density function (PDF) is:

$$\hat{p}(x) = \frac{1}{nw}\, h(x).$$
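A quick R check of this normalization (toy data; the bin width is arbitrary) — for uniform bins, hist() reports exactly $h(x)/(nw)$ as its density:

# Histogram as density estimator: p-hat(x) = h(x)/(n w) in each bin.
set.seed(2)
x <- rnorm(1000)
w <- 0.5                                          # bin width (arbitrary here)
breaks <- seq(floor(min(x)), ceiling(max(x)), by = w)
h <- hist(x, breaks = breaks, plot = FALSE)
phat <- h$counts / (length(x) * w)                # p-hat = h(x)/(n w)
all.equal(phat, h$density)                        # TRUE: same normalization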


Histogram Example

[Figure: left, EPDF of the data, 0 < x < 1000; right, histogram of the same data, Events/10 MeV vs. $m(p\pi) - m(p) - m(\pi)$.]

Left: EPDF; Right: Histogram with w = 10 MeV.

[Actual sampling is 100 points from a ∆(1232) Breit-Wigner (Cauchy) on a second-order polynomial background. Background probability is 50%.]


Criticisms of Histogram as Density Estimator

• Discontinuous even if PDF is continuous.
• Dependence on bin size and bin origin.
• Information from location of datum within a bin is ignored.


Kernel Estimation

Take the histogram, but replace "bin" function $B$ with something else:

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} k(x - x_i; w),$$

where $k(x; w)$ is the "kernel function", normalized to unity:

$$\int_{-\infty}^{\infty} k(x; w)\, dx = 1.$$

Usually we are interested in kernels of the form

$$k(x - x_i; w) = \frac{1}{w} K\!\left(\frac{x - x_i}{w}\right);$$

indeed this may be used as the definition of "kernel". The kernel estimator for the PDF is then:

$$\hat{p}(x) = \frac{1}{nw}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right).$$

The role of the parameter $w$ as a smoothing parameter is now clearer.
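A minimal R sketch of this estimator with a Gaussian kernel (toy data; $w$ arbitrary), checked against the built-in density():

# Kernel estimator p-hat(x) = (1/nw) sum_i K((x - x_i)/w), Gaussian K.
set.seed(3)
x <- rnorm(200)
w <- 0.3                                   # smoothing parameter (arbitrary)
phat <- function(x0) mean(dnorm((x0 - x) / w)) / w
grid <- seq(-4, 4, length.out = 101)
p1 <- sapply(grid, phat)
p2 <- density(x, bw = w, kernel = "gaussian", from = -4, to = 4, n = 101)$y
max(abs(p1 - p2))    # small; density() bins the data and uses an FFT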


Multi-Variate Kernel Estimation

Explicit multi-variate case, $d = 2$ dimensions:

$$\hat{p}(x, y) = \frac{1}{n w_x w_y}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w_x}\right) K\!\left(\frac{y - y_i}{w_y}\right).$$

This is a "product kernel" form, with the same kernel in each dimension, except for possibly different smoothing parameters. It does not include correlations between the dimensions.

The kernels we have introduced are classified more explicitly as "fixed kernels": the smoothing parameter is independent of $x$.
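A sketch of the product-kernel form above in R (Gaussian in each coordinate; the data and bandwidths below are invented):

# Product-kernel estimate of p(x, y); no correlations between coordinates.
kde2 <- function(x0, y0, x, y, wx, wy)
  mean(dnorm((x0 - x) / wx) * dnorm((y0 - y) / wy)) / (wx * wy)
set.seed(4)
x <- rnorm(500)
y <- rnorm(500, sd = 2)
kde2(0, 0, x, y, wx = 0.3, wy = 0.6)   # estimate of p(0, 0)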


Ideogram

A simple variant on the kernel idea is to permit the kernel to depend on additional knowledge in the data.

• Physicists call this an "ideogram".
• Most common is the "Gaussian ideogram", in which each data point is entered as a Gaussian of area one and standard deviation appropriate to that datum.
• This addresses a way that the IID assumption might be broken.

[Aside: Be careful to get your likelihood function right if you are incorporating variable resolution information in your fits; see, e.g., Punzi: http://www.slac.stanford.edu/econf/C030908/papers/WELT002.pdf]
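A minimal R sketch of a Gaussian ideogram (the per-point resolutions sigma are invented for illustration):

# Gaussian ideogram: datum i enters as a unit-area Gaussian of width sigma_i.
ideogram <- function(x0, x, sigma) mean(dnorm(x0, mean = x, sd = sigma))
set.seed(5)
x <- rnorm(100, 10, 1)
sigma <- runif(100, 0.1, 0.5)          # per-point resolutions (invented)
grid <- seq(6, 14, length.out = 200)
plot(grid, sapply(grid, ideogram, x = x, sigma = sigma), type = "l")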


Sample Ideograms (II)

Figure 1. A histogram of magnetic field values (black), compared with a smoothed frequency distribution constructed using a Gaussian ideogram technique (red). Note the detailed comparison.

(from J. S. Halekas et al., "Magnetic Properties of Lunar Geologic Terranes: New Statistical Results", Lunar and Planetary Science XXXIII (2002), 1368.pdf)


Parametric vs non-Parametric Density Estimation (I)

Distinction is fuzzy.

A histogram is non-parametric, in the sense that no assumption about the form of the sampling distribution is made.
– Often there is an implicit assumption that the distribution is "smooth" on a scale smaller than the bin size. For example, we know something about the resolution of our apparatus.

But the estimator of the parent distribution made with a histogram is parametric: the parameters are the populations (or frequencies) in each bin. The estimators for those parameters are the observed histogram populations. Even more parameters than a typical parametric fit!


Parametric vs non-Parametric Density Estimation (II)

Essence of the difference may be captured in notions of "local" and "non-local":

If a datum at $x_i$ influences the density estimator at some other point $x$, this is non-local. A non-parametric estimator is one in which the influence of a point at $x_i$ on the estimate at any $x$ with $d(x_i, x) > \epsilon$ vanishes, asymptotically.†

Notice that for a kernel estimator, the bigger the smoothing parameter $w$, the more non-local the estimator:

$$\hat{p}(x) = \frac{1}{nw}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right).$$

†As we'll discuss, the "optimal" choice of smoothing parameter depends on $n$.


Optimization

We would like to make an optimal density estimate from our data. What does that mean?
– Need a criterion for "optimal".
– Choice of criterion is subjective; it depends on what you want to achieve.

We may compare the estimator for a quantity (here, the value of the density at $x$) with the true value:

$$\Delta(x) = \hat{f}(x) - f(x).$$

[Figure: the true density $f(x)$, an estimate $\hat{f}(x)$, and their difference $\Delta(x)$.]


Mean Squared Error (I)

A common choice in parametric estimation is to minimize the sum of the squares. We may take this idea over here, and form the "Mean Squared Error" (MSE):

$$\mathrm{MSE}[\hat{f}(x)] \equiv \left\langle \left(\hat{f}(x) - f(x)\right)^2 \right\rangle = \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)],$$

where

$$\mathrm{Var}[\hat{f}(x)] \equiv E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right], \qquad \mathrm{Bias}[\hat{f}(x)] \equiv E[\hat{f}(x)] - f(x).$$
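A Monte Carlo sketch of this decomposition in R (assuming a standard normal true PDF and a Gaussian-kernel estimator; $w$ and $n$ arbitrary) — repeating the "experiment" many times, the two pieces add up to the MSE:

# Check MSE = Var + Bias^2 for a Gaussian KDE at the point x0.
set.seed(6)
x0 <- 0; w <- 0.5; n <- 100
f.true <- dnorm(x0)                      # true PDF at x0 (toy setup)
fhat <- replicate(2000, { x <- rnorm(n); mean(dnorm((x0 - x) / w)) / w })
b <- mean(fhat) - f.true                 # Bias[f-hat(x0)]
v <- mean((fhat - mean(fhat))^2)         # Var[f-hat(x0)]
c(MSE = mean((fhat - f.true)^2), VarPlusBias2 = v + b^2)   # equal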


Mean Squared Error (II)

Since this isn't quite our familiar parameter estimation, let's take a little time to make sure it is understood.

Suppose $\hat{p}(x)$ is an estimator for the PDF $p(x)$, based on data $\{x_i;\ i = 1, \ldots, n\}$, IID from $p(x)$. Then

$$E[\hat{p}(x)] = \int\cdots\int \hat{p}(x; \{x_i\})\, \mathrm{Prob}(\{x_i\})\, d^n(\{x_i\}) = \int\cdots\int \hat{p}(x; \{x_i\}) \prod_{i=1}^{n} [p(x_i)\, dx_i].$$


Exercise: Proof of formula for the MSE

$$\begin{aligned}
\mathrm{MSE}[\hat{f}(x)] &= \left\langle \left(\hat{f}(x) - f(x)\right)^2 \right\rangle \\
&= \int\cdots\int \left(\hat{f}(x; \{x_i\}) - f(x)\right)^2 \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \int\cdots\int \left(\hat{f}(x; \{x_i\}) - E(\hat{f}) + E(\hat{f}) - f(x)\right)^2 \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \int\cdots\int \left[\left(\hat{f}(x; \{x_i\}) - E(\hat{f})\right)^2 + \left(E(\hat{f}) - f(x)\right)^2 + 2\left(\hat{f}(x; \{x_i\}) - E(\hat{f})\right)\left(E(\hat{f}) - f(x)\right)\right] \prod_{i=1}^{n} [p(x_i)\, dx_i] \\
&= \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)] + 0,
\end{aligned}$$

since the cross term integrates to zero by the definition of $E(\hat{f})$.

[In typical treatments of parametric statistics, we assume unbiased estimators, hence the "Bias" term is zero. That isn't a good assumption here.]


The Problem With Smoothing (I)

Thm: [Rosenblatt (1956)] A uniform minimum variance unbiased estimator for $p(x)$ does not exist.

Unbiased:
$$E[\hat{p}(x)] = p(x).$$

Uniform minimum variance:
$$\mathrm{Var}[\hat{p}(x) \mid p(x)] \le \mathrm{Var}[\hat{q}(x) \mid p(x)], \quad \forall x,$$
for all $p(x)$, where $\hat{q}(x)$ is any other estimator of $p(x)$.


The Problem With Smoothing (II)

For example, suppose we have a kernel estimator:

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} k(x - x_i; w).$$

Its expectation is:

$$E[\hat{p}(x)] = \frac{1}{n}\sum_{i=1}^{n} \int k(x - x_i; w)\, p(x_i)\, dx_i = \int k(x - y; w)\, p(y)\, dy.$$

Unless $k(x - y) = \delta(x - y)$, $\hat{p}(x)$ will be biased for some $p(x)$. But $\delta(x - y)$ has infinite variance.
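A quick numeric check of this convolution in R (assuming a Gaussian kernel and a standard normal $p(y)$; crude quadrature): at the peak, the expectation falls below the true density, i.e., the estimator is biased there.

# E[p-hat(x)] = integral of k(x - y; w) p(y) dy; evaluate at the peak x = 0.
w <- 0.5
y <- seq(-10, 10, by = 0.01)
Ephat0 <- sum(dnorm((0 - y) / w) / w * dnorm(y)) * 0.01   # crude quadrature
c(true = dnorm(0), smeared = Ephat0)    # smeared < true: biased at the peak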


The Problem with Smoothing (III)

So the nice properties we strive for in parameter estimation (and sometimes achieve) are beyond reach.

Intuition: smoothing lowers peaks and fills in valleys.

[Figure: red curve, the PDF; histogram, a sampling from the PDF; black curve, Gaussian kernel estimator for the PDF.]


Comment on Number of Bins in Histogram

Note: "Sturges' rule," based on optimizing MSE, was used in deciding how many bins, $k$, to make in the histogram:

$$k = 1 + \log_2 n.$$

The argument behind this rule has been criticized (1995):
http://www-personal.buseco.monash.edu.au/~hyndman/papers/sturges.pdf

Indeed we see in our example that we would have "by hand" selected more bins; our histogram is "over-smoothed". There are other rules for optimizing the number of bins. For example, "Scott's rule" for the bin width is:

$$w = 3.5\, s\, n^{-1/3},$$

where $s$ is the sample standard deviation.

[More later]
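Both rules happen to be built into R (grDevices), so a comparison on toy data is one line each:

# Sturges vs. Scott on a toy sample.
set.seed(7)
x <- rnorm(100)
nclass.Sturges(x)                  # k = ceiling(1 + log2(n)) = 8 for n = 100
3.5 * sd(x) * length(x)^(-1/3)     # Scott's bin width w = 3.5 s n^(-1/3)
nclass.scott(x)                    # the corresponding number of bins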


Dependence on Smoothing Parameter

Plot showing the effect of the choice of smoothing parameter:

[Figure: kernel density estimates of the sample, Frequency vs. x.]
Red: Sampling PDF
Black: Default smoothing (w)
Blue: w/2 smoothing
Turquoise: w/4 smoothing
Green: 2w smoothing
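A scan like this can be reproduced with R's density(), whose adjust argument scales the default bandwidth (toy data; colors follow the legend above):

# Vary the smoothing parameter around the default bandwidth w.
set.seed(8)
x <- rnorm(400, 9, 2)
plot(density(x), main = "Smoothing parameter scan")   # default w (black)
lines(density(x, adjust = 0.5),  col = "blue")        # w/2
lines(density(x, adjust = 0.25), col = "turquoise")   # w/4
lines(density(x, adjust = 2),    col = "green")       # 2w
curve(dnorm(x, 9, 2), add = TRUE, col = "red")        # sampling PDF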


The Curse of Dimensionality

Roger Barlow gave a nice example of the impact of the "Curse of Dimensionality" in parametric statistics. It is a significant affliction in density estimation as well.

• Difficult to display and visualize as the number of dimensions increases.
• "All" the volume (of a bounded region) goes to the boundary (exponentially!) as the number of dimensions increases; i.e., data becomes "sparse". A central region of half the linear size contains a volume fraction $1/2,\ 1/4,\ 1/8,\ \ldots,\ 1/2^d$ for $d = 1, 2, 3, \ldots$ (see the sketch below).
• Tendency for exponentially growing computation requirements with dimensions.
• Even worse than parametric statistics.
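The volume fractions in that sequence are just $2^{-d}$; a one-line R illustration (dimensions chosen arbitrarily):

# Fraction of a unit cube's volume inside the central cube of half the side:
d <- c(1, 2, 3, 10, 100)
setNames(0.5^d, paste0("d = ", d))   # virtually nothing is "central" for large d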


Summary

We have introduced:
• Basic notions in (non-parametric) density estimation
• Some simple variations on the theme
• A foundation towards optimization
• An idea of where and how things will fail

Next: Further sophistication on these ideas, and introduction of other variations in approach and application.
