Density Estimation
Density Estimation:
Deals with the problem of estimating probability density functions (PDFs)based on some data sampled from the PDF.
May use assumed forms of the distribution, parameterized in some
way (parametric statistics);
or
May avoid making assumptions about the form of the PDF (non-
parametric statistics).
We are concerned more here with the non-parametric case (see Roger
Barlow’s lectures for parametric statistics)
1 Frank Porter, SLUO Lectures on Statistics, 15–17 August 2006
Some References (I)
Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).
David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).
Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).
B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986);
http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html
K. S. Cranmer, “Kernel Estimation in High Energy Physics”, Comp. Phys. Comm. 136, 198 (2001) [hep-ex/0011057v1];
http://arxiv.org/PS_cache/hep-ex/pdf/0011/0011057.pdf
Some References (II)
M. Pivk & F. R. Le Diberder, “sPlot: a statistical tool to unfold data
distributions”, Nucl. Instr. Meth. A 555, 356 (2005).
R. Cahn, “How sPlots are Best” (2005),
http://babar-hn.slac.stanford.edu:5090/hn/aux/auxvol01/rncahn/rev_splots_best.pdf
BaBar Statistics Working Group, “Recommendations for Display of Projections in Multi-Dimensional Analyses”,
http://www.slac.stanford.edu/BFROOT/www/Physics/Analysis/
Statistics/Documents/MDgraphRec.pdf
Additional specific references will be noted in the course of the lectures.
Preliminaries
We’ll couch the discussion in terms of observations (a dataset) from some
“experiment”. Our dataset consists of the values xi, i = 1, 2, . . . , n.
– Our dataset consists of repeated samplings from a (presumed unknown)
probability distribution, assumed IID (“Independent, Identically Distributed”).
We’ll note generalizations here and there.
– Order is not important; if we are discussing a time series, we could
introduce ordered pairs {(xi, ti), i = 1, . . . , n}, and call it
two-dimensional [but beware the correlations then; probably not IID!].
– In general, our quantities can be multi-dimensional; no special notation
will be used to distinguish one- from multi-variate cases. We’ll discuss
where issues enter with dimensionality.
Notation
At our convenience we may use “E”, “⟨ ⟩”, and “¯” all to mean “expectation”:

    E(x) ≡ x̄ ≡ ⟨x⟩ ≡ ∫ x p(x) dx,

where p(x) is the probability density function (PDF) for x (or, more
generally, p(x)dx → µ(dx) is the probability measure).
Estimators are denoted with a “hat”: in these lectures, we’ll be
concerned with estimators for the density function itself, hence p̂(x)
is a random variable giving our estimate for p(x).
We will not be especially rigorous. For example, we won’t make a
notational distinction between the random variable and an instance.
Motivation
Why do we want to estimate densities?
– Well, that is the whole point. . .
Harder question: Why non-parametric estimates?
– Comparison with models (which may be parametric)
– May be easier/better than parametric modeling for efficiency cor-
rections and background subtraction
– Visualization
– “Unfolding”
– Comparing samples
R, A Toolkit, er, Language, You Might be Interested In. . .
The S Language: developed with statistical analysis of data in mind.
For example:

> x <- rnorm(100,10,1)
> hist(x,xlim=range(5,15))

[Figure: “Histogram of x”; x from 6 to 14 on the horizontal axis,
Frequency from 0 to 20 on the vertical axis.]
Free, open source version is R, from the R Project. Downloads available
for Linux/MacOS X/Windows, e.g., at:
http://cran.cnr.berkeley.edu/
Commercial version is S-Plus, at http://www.insightful.com/
Empirical Probability Density Function
Place a delta function at each data point. The estimator (EPDF, for
“Empirical Probability Density Function”) is

    p̂(x) = (1/n) Σ_{i=1}^{n} δ(x − xi).

[Figure: EPDF of a sample, plotted over x from 0 to 1000.]

Note that x could be multi-dimensional here.
This is “the” sampling density for the bootstrap (more later; also see
Ilya Narsky’s lectures).
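Sampling from the EPDF just means drawing the observed values with replacement, which is why it is the bootstrap’s sampling density. A minimal sketch (the lectures use R; this Python version and its names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(10.0, 1.0, size=100)  # stand-in for our observations x_i

# Sampling from p-hat(x) = (1/n) * sum_i delta(x - x_i) means picking each
# x_i with probability 1/n, i.e. resampling the dataset with replacement.
def sample_epdf(x, size, rng):
    return rng.choice(x, size=size, replace=True)

boot = sample_epdf(data, size=100, rng=rng)
# Every bootstrap value is one of the original observations:
assert set(boot).issubset(set(data))
```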
The Histogram
Perhaps our most ubiquitous density estimator is the histogram:
    h(x) = Σ_{i=1}^{n} B(x − x̃i; w),

where x̃i is the center of the bin in which observation xi lies, w is the
bin width, and

    B(x; w) = 1 if x ∈ (−w/2, w/2), 0 otherwise

(called the “indicator function” in probability).

[Figure: the bin function B(x − x̃i; w), a box of width w and unit height
centered on the bin center x̃i.]

This is written for uniform bin widths, but may be generalized to
differing widths with appropriate relative normalization factors.

The estimator for the probability density function (PDF) is:

    p̂(x) = h(x)/(n w).
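The histogram estimator above can be sketched directly; a hedged Python version (the function name and toy data are mine, not from the lectures):

```python
import numpy as np

def hist_density(x, data, w, origin=0.0):
    """Histogram density estimate p-hat(x) = h(x)/(n*w), where h(x) counts
    observations whose bin (of width w, starting at `origin`) contains x."""
    n = len(data)
    # center x~_i of the bin holding each observation
    centers = origin + (np.floor((data - origin) / w) + 0.5) * w
    # B(x - x~_i; w) is 1 when x lies within w/2 of that bin center
    h = np.sum(np.abs(x - centers) < w / 2)
    return h / (n * w)

data = np.array([1.2, 1.9, 2.1, 2.2, 3.7])
# Bin [2, 3) holds {2.1, 2.2}, so p-hat(2.5) = 2 / (5 * 1.0) = 0.4
print(hist_density(2.5, data, w=1.0))  # -> 0.4
```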
Histogram Example
[Figure: two panels; horizontal axis m(p π) − m(p) − m(π), vertical axis
Events/10 MeV on the right panel.]

Left: EPDF; Right: Histogram with w = 10 MeV.
[Actual sampling is 100 points from a ∆(1232) Breit-Wigner (Cauchy) on a
second-order polynomial background. Background probability is 50%.]
Criticisms of Histogram as Density Estimator
Discontinuous even if PDF is continuous.
Dependence on bin size and bin origin.
Information from location of datum within a bin is ignored.
Kernel Estimation
Take the histogram, but replace “bin” function B with something else:
    p̂(x) = (1/n) Σ_{i=1}^{n} k(x − xi; w),

where k(x; w) is the “kernel function”, normalized to unity:

    ∫_{−∞}^{∞} k(x; w) dx = 1.

Usually interested in kernels of the form

    k(x − xi; w) = (1/w) K((x − xi)/w);

indeed this may be used as the definition of “kernel”. The kernel esti-
mator for the PDF is then:

    p̂(x) = (1/(n w)) Σ_{i=1}^{n} K((x − xi)/w).

The role of parameter w as a smoothing parameter is clearer in this form.
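A minimal fixed-kernel estimator with a Gaussian K, as a sketch (Python rather than the lectures’ R; the grid, sample, and bandwidth are arbitrary choices of mine):

```python
import numpy as np

def kde(x, data, w):
    """p-hat(x) = (1/(n*w)) * sum_i K((x - x_i)/w), with standard normal K."""
    u = (np.asarray(x) - data[:, None]) / w          # shape (n, npoints)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=0) / (len(data) * w)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=500)
grid = np.linspace(-4, 4, 81)
p = kde(grid, data, w=0.3)

# A valid density estimate integrates to ~1 (crude rectangle rule here):
print(p.sum() * (grid[1] - grid[0]))  # ≈ 1
```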
Multi-Variate Kernel Estimation
Explicit multi-variate case, d = 2 dimensions:
    p̂(x, y) = (1/(n wx wy)) Σ_{i=1}^{n} K((x − xi)/wx) K((y − yi)/wy).
This is a “product kernel” form, with the same kernel in each dimension,
except for possibly different smoothing parameters. It does not have
correlations.
The kernels we have introduced are classified more explicitly as “fixed
kernels”: The smoothing parameter is independent of x.
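The product-kernel formula translates directly; a sketch with a Gaussian K in each dimension (Python; data and bandwidths are illustrative):

```python
import numpy as np

def kde2d(x, y, data_x, data_y, wx, wy):
    """Product-kernel estimate p-hat(x, y) = (1/(n*wx*wy)) *
    sum_i K((x - x_i)/wx) * K((y - y_i)/wy), with standard normal K."""
    Kx = np.exp(-0.5 * ((x - data_x) / wx) ** 2) / np.sqrt(2 * np.pi)
    Ky = np.exp(-0.5 * ((y - data_y) / wy) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(Kx * Ky) / (len(data_x) * wx * wy)

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=200), rng.normal(size=200)
# For independent unit Gaussians the true density at the origin is
# 1/(2*pi) ≈ 0.159; the smoothed estimate should land in that vicinity.
print(kde2d(0.0, 0.0, dx, dy, wx=0.4, wy=0.4))
```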
Ideogram
A simple variant on the kernel idea is to permit the kernel to depend on
additional knowledge in the data.
• Physicists call this an “ideogram”.
• Most common is the “Gaussian ideogram”, in which each data point
is entered as a Gaussian of area one and standard deviation appropriate
to that datum.
• This addresses a way that the IID assumption might be broken.
[Aside: Be careful to get your likelihood function right if you are
incorporating variable resolution information in your fits; see, e.g., Punzi:
http://www.slac.stanford.edu/econf/C030908/papers/WELT002.pdf ]
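A Gaussian-ideogram sketch, where each datum carries its own resolution σi (Python; the data and σ values are invented for illustration):

```python
import numpy as np

def ideogram(x, data, sigmas):
    """Each datum enters as a unit-area Gaussian of width sigma_i, so the
    smoothing varies point by point (unlike a fixed-w kernel estimator)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for xi, si in zip(data, sigmas):
        total += np.exp(-0.5 * ((x - xi) / si) ** 2) / (si * np.sqrt(2 * np.pi))
    return total / len(data)

data = [5.0, 5.2]
sigmas = [0.1, 0.4]          # the second measurement is less precise
grid = np.linspace(3, 7, 401)
p = ideogram(grid, data, sigmas)
print(p.sum() * (grid[1] - grid[0]))  # ≈ 1: total area stays normalized
```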
Sample Ideograms (II)
Figure 1. A histogram of magnetic field values (black),
compared with a smoothed frequency distribution con-
structed using a Gaussian ideogram technique (red).
Note detailed comparison.
(from J. S. Halekas et al., “Magnetic Properties of Lunar Geologic Ter-
ranes: New Statistical Results”, Lunar and Planetary Science XXXIII
(2002), 1368.pdf)
Parametric vs non-Parametric Density Estimation (I)
Distinction is fuzzy
A histogram is non-parametric, in the sense that no assumption about
the form of the sampling distribution is made.
– Often an implicit assumption that distribution is “smooth” on scale
smaller than bin size. For example, we know something about the
resolution of our apparatus.
But the estimator of the parent distribution made with a histogram is
parametric – the parameters are populations (or frequencies) in each
bin. The estimators for those parameters are the observed histogram
populations. Even more parameters than a typical parametric fit!
Parametric vs non-Parametric Density Estimation (II)
Essence of difference may be captured in notions of “local” and “non-
local”:
If a datum at xi influences the density estimator at some
other point x, this is non-local. A non-parametric estimator is
one in which the influence of a point at xi on the estimate at
any x with d(xi, x) > ε vanishes, asymptotically.†
Notice that for a kernel estimator, the bigger the smoothing parameter
w, the more non-local the estimator:

    p̂(x) = (1/(n w)) Σ_{i=1}^{n} K((x − xi)/w).
†As we’ll discuss, the “optimal” choice of smoothing parameter depends
on n.
Optimization
We would like to make an optimal density estimate from our data.
What does that mean?
– Need a criterion for “optimal”.
– Choice of criterion is subjective; it depends on what you want to
achieve.
We may compare the estimator for a quantity (here, the value of the
density at x) with the true value:

    ∆(x) = f̂(x) − f(x).

[Figure: the true f(x) and the estimate f̂(x), with ∆(x) their difference
at a given x.]
Mean Squared Error (I)
A common choice in parametric estimation is to minimize the sum
of the squares. We may take this idea over here, and form the “Mean
Squared Error” (MSE):

    MSE[f̂(x)] ≡ E{[f̂(x) − f(x)]²} = Var[f̂(x)] + Bias²[f̂(x)],

where

    Var[f̂(x)] ≡ E{[f̂(x) − E[f̂(x)]]²},
    Bias[f̂(x)] ≡ E[f̂(x)] − f(x).
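The decomposition can be checked by simulation: repeat the “experiment” many times, estimate the density at one point each time, and compare the mean squared error with Var + Bias². A sketch (Python; the Gaussian-kernel estimator and all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 1 / np.sqrt(2 * np.pi)          # N(0,1) density at x = 0

def kde_at_zero(data, w):
    # Gaussian-kernel estimate p-hat(0) = (1/(n*w)) * sum_i K(-x_i / w)
    return np.mean(np.exp(-0.5 * (data / w) ** 2)) / (w * np.sqrt(2 * np.pi))

# Many repetitions of the experiment approximate the expectations E[.]:
est = np.array([kde_at_zero(rng.normal(size=100), w=0.5) for _ in range(2000)])
mse = np.mean((est - true_p) ** 2)
var = np.var(est)
bias2 = (np.mean(est) - true_p) ** 2

print(mse, var + bias2)   # the two agree: MSE = Var + Bias^2
# Smoothing biases the estimate low at a peak, so the bias is nonzero here:
print(np.mean(est) < true_p)
```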
Mean Squared Error (II)
Since this isn’t quite our familiar parameter estimation, let’s take a
little time to make sure it is understood:
Suppose p̂(x) is an estimator for the PDF p(x), based on data {xi; i =
1, . . . , n}, IID from p(x). Then

    E[p̂(x)] = ∫ · · · ∫ p̂(x; {xi}) Prob({xi}) d({xi})
             = ∫ · · · ∫ p̂(x; {xi}) Π_{i=1}^{n} [p(xi) dxi].
Exercise: Proof of formula for the MSE
    MSE[f̂(x)] = E{[f̂(x) − f(x)]²}
      = ∫ · · · ∫ [f̂(x; {xi}) − f(x)]² Π_{i=1}^{n} [p(xi) dxi]
      = ∫ · · · ∫ [f̂(x; {xi}) − E(f̂) + E(f̂) − f(x)]² Π_{i=1}^{n} [p(xi) dxi]
      = ∫ · · · ∫ { [f̂(x; {xi}) − E(f̂)]² + [E(f̂) − f(x)]²
            + 2 [f̂(x; {xi}) − E(f̂)] [E(f̂) − f(x)] } Π_{i=1}^{n} [p(xi) dxi]
      = Var[f̂(x)] + Bias²[f̂(x)] + 0.

(The cross term integrates to zero, since E[f̂ − E(f̂)] = 0.)
[In typical treatments of parametric statistics, we assume unbiased es-
timators, hence the “Bias” term is zero. That isn’t a good assumption
here.]
The Problem With Smoothing (I)
Thm: [Rosenblatt (1956)] A uniform minimum variance unbiased estimator
for p(x) does not exist.
Unbiased:

    E[p̂(x)] = p(x).

Uniform minimum variance:

    Var[p̂(x) | p(x)] ≤ Var[q̂(x) | p(x)], ∀x,

for all p(x), where q̂(x) is any other estimator of p(x).
The Problem With Smoothing (II)
For example, suppose we have a kernel estimator:
    p̂(x) = (1/n) Σ_{i=1}^{n} k(x − xi; w).

Its expectation is:

    E[p̂(x)] = (1/n) Σ_{i=1}^{n} ∫ k(x − xi; w) p(xi) dxi
             = ∫ k(x − y; w) p(y) dy.

Unless k(x − y; w) = δ(x − y), p̂(x) will be biased for some p(x).
But δ(x − y) has infinite variance.
The Problem with Smoothing (III)
So the nice properties we strive for in parameter estimation (and some-
times achieve) are beyond reach.
Intuition: smoothing lowers peaks and fills in valleys.
[Figure: histogram over x from 2 to 16, Frequency from 0 to 400.]
Red curve: PDF
Histogram: Sampling from PDF
Black curve: Gaussian kernel estimator for PDF
Comment on Number of Bins in Histogram
Note: “Sturges’ rule,” based on optimizing MSE, was used in deciding
how many bins, k, to make in the histogram:
k = 1 + log2 n.
The argument behind this rule has been criticized (1995):
http://www-personal.buseco.monash.edu.au/~hyndman/papers/sturges.pdf
Indeed we see in our example that we would have “by hand” selected
more bins; our histogram is “over-smoothed”. There are other rules for
optimizing the number of bins. For example, “Scott’s rule” for the bin
width is:

    w = 3.5 s n^{−1/3},

where s is the sample standard deviation.
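Both rules are one-liners; a sketch comparing them (Python; the sample is synthetic):

```python
import numpy as np

def sturges_bins(n):
    # Sturges' rule: k = 1 + log2(n) bins (round up)
    return int(np.ceil(1 + np.log2(n)))

def scott_width(data):
    # Scott's rule: w = 3.5 * s * n^(-1/3), s = sample standard deviation
    data = np.asarray(data)
    return 3.5 * data.std(ddof=1) * len(data) ** (-1 / 3)

print(sturges_bins(100))      # -> 8, since 1 + log2(100) ≈ 7.64
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
print(scott_width(x))         # ≈ 0.35 for unit-sigma data with n = 1000
```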
[More later]
Dependence on Smoothing Parameter
Plot showing the effect of the choice of smoothing parameter:

[Figure: histogram over x from 2 to 16, Frequency from 0 to 200, with
kernel estimates overlaid.]

Red: Sampling PDF
Black: Default smoothing (w)
Blue: w/2 smoothing
Turquoise: w/4 smoothing
Green: 2w smoothing
The Curse of Dimensionality
Roger Barlow gave a nice example of the impact of the “Curse of
Dimensionality” in parametric statistics. It is a significant affliction in
density estimation as well.
Difficult to display and visualize as the number of dimensions in-
creases.
“All” the volume (of a bounded region) goes to the boundary (exponentially!)
as the dimension increases. I.e., data become “sparse”.

[Figure: nested intervals/squares/cubes; the central half of each
coordinate contains a fraction 1/2, 1/4, 1/8, . . . , 1/2^d of the volume
in d = 1, 2, 3, . . . dimensions.]
Tendency for exponentially growing computation requirement with
dimensions.
Even worse than parametric statistics.
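The nested-fraction picture is easy to check: the central half of each coordinate of a unit hypercube contains (1/2)^d of its volume, so for even modest d essentially everything sits near the boundary:

```python
# Fraction of a unit hypercube's volume in the central half of every
# coordinate: (1/2)^d; the remainder lies toward the boundary.
central = [0.5 ** d for d in range(1, 21)]
print(central[:3])          # -> [0.5, 0.25, 0.125]
print(1 - central[-1])      # d = 20: more than 99.9999% near the boundary
```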
Summary
We have introduced:
Basic notions in (non-parametric) density estimation
Some simple variations on the theme
A foundation towards optimization
An idea of where and how things will fail
Next: Further sophistication on these ideas; and introduction of other
variations in approach and application.