Applications How to fit a histogram

The Histogram

Histograms were invented for human beings, so that we can visualize the observations as a density (estimated values of a pdf).

The binning most convenient for visualization is not necessarily the best for fitting.

We have already looked in detail at the problem of choosing the optimal binning for estimation and testing using Pearson’s Chi-square statistic.

For other methods of fitting, it is better to have very narrow bins with very few events per bin, since putting events into bins causes us to lose the information about exactly where the event was observed.

For maximum likelihood fitting, there is no binning at all.


How to fit a histogram

Fitting always means minimizing something. The “something” is always called the objective function (that is the language of function minimization).

The function may however take many forms. The most important are:

Pearson’s Chi-square: the common least-squares sum, with σ taken as the square root of the expected number of events.

Neyman’s Chi-square: the same least-squares sum, except that σ is the square root of the observed number of events.

Binned Likelihood: −2 ln λ, where λ is the binned likelihood ratio of Baker and Cousins.

Likelihood: −2 ln L, where L is the usual (unbinned) likelihood of the observed individual events.


Pearson’s Chi-square to fit a histogram

The Pearson test statistic T is the residual sum of weighted squares of differences between the observed number of events and the expected number:

T(\theta) = \sum_{\text{bins}} \frac{(n_i - Np_i(\theta))^2}{Np_i(\theta)},

where Np_i(θ), the expected number of events in bin i, is computed from the model function, the p.d.f. f(x|θ):

p_i(\theta) = \int_{\text{bin } i} f(x|\theta)\, dx

We nearly always approximate the integral over the bin by the value of f in the middle of the bin, f(x_i), times the bin width:

p_i(\theta) = \int_{\text{bin } i} f(x|\theta)\, dx \approx f(x_i|\theta)\, \Delta x
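As a concrete illustration (not from the lecture), here is a minimal Python sketch of minimizing Pearson’s T(θ), using the mid-point approximation above. The Gaussian model, the names, and the toy data are all assumptions for the example.

```python
# A minimal sketch of a Pearson chi-square histogram fit, assuming a
# hypothetical Gaussian model with theta = (mean, sigma).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def pearson_T(theta, n, edges):
    """T(theta) = sum over bins of (n_i - Np_i)^2 / (Np_i)."""
    N = n.sum()                               # total number of events
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin mid-points x_i
    # mid-point approximation: p_i ≈ f(x_i | theta) * Δx
    p = norm.pdf(centers, loc=theta[0], scale=theta[1]) * np.diff(edges)
    mu = N * p                                # expected events Np_i
    return np.sum((n - mu) ** 2 / mu)

# toy data: 1000 events from N(0, 1), binned for the fit
rng = np.random.default_rng(1)
n, edges = np.histogram(rng.normal(size=1000), bins=40, range=(-4, 4))

fit = minimize(pearson_T, x0=[0.1, 1.2], args=(n, edges))
print(fit.x)   # fitted (mean, sigma); sigma > 0 is not enforced here
```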


Pearson’s Chi-square: the integral over the bin

The approximation

p_i(\theta) \approx f(x_i|\theta)\, \Delta x

is in fact a very good approximation, even if f is rising or falling steeply, since the leading term in the error is proportional only to the second derivative of f.

All other methods with binning also use this approximation. You must be careful only when the bins are so wide that there is significant curvature of f inside the bin.

However, you should be sure that x_i is in the middle of the i-th bin.

If the bins are all of the same width, then the bin width Δx is just an overall normalization factor which can usually be absorbed into one of the fit parameters θ, in which case it can simply be dropped.


Pearson’s Chi-square: dependence on the parameters

Pearson proposed his T-statistic for GOF testing, with no parameters θ. For fitting points y to a curve f(x), it is generally written:

T(\theta) = \sum_{\text{points}} \frac{(y_i - f(x_i, \theta))^2}{\sigma_i^2},

and it is assumed that the variances σ_i² are known and do not depend on the parameters θ. If f is a linear function of the θ, the problem is linear and can be solved by matrix inversion without iteration.

However, when we fit histograms by minimizing

T(\theta) = \sum_{\text{bins}} \frac{(n_i - Np_i(\theta))^2}{Np_i(\theta)},

there is an additional dependence on θ in the denominator, which makes a linear problem non-linear and has the additional effect of making the minimum of T occur for larger values of σ_i² = Np_i(θ). That means that it biases the fit toward larger values of f.


Pearson’s Chi-square: dependence on the parameters

The bias can be avoided by removing the dependence on θ in the denominator during the fit. One way to do that is:

1. Fit the histogram using T defined as above, with the unwanted dependence on θ.

2. Remember the values of Np_i for each bin at the minimum of T.

3. Perform the fit again, this time replacing Np_i(θ) in the denominator of the objective function, for each bin, by the value obtained in step 2 above.

4. If necessary, repeat steps 2 and 3 until the fit is stable.

In this way, the denominator is Np_i, the expected number of events based on f(x), but its derivative with respect to θ is zero during the (last) fit.
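A sketch of this iteration, reusing pearson_T and the toy (n, edges) from the previous sketch; the iteration cap and tolerance are arbitrary choices.

```python
# Debiasing iteration: freeze the denominator at the previous fit's Np_i,
# so its derivative with respect to theta is zero during each refit.
def T_frozen(theta, n, edges, denom):
    N = n.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu = N * norm.pdf(centers, loc=theta[0], scale=theta[1]) * np.diff(edges)
    return np.sum((n - mu) ** 2 / denom)

def expected(theta, n, edges):
    """Np_i for all bins at the given theta."""
    N = n.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return N * norm.pdf(centers, loc=theta[0], scale=theta[1]) * np.diff(edges)

# step 1: ordinary fit, with the unwanted theta-dependence in the denominator
theta = minimize(pearson_T, x0=[0.1, 1.2], args=(n, edges)).x
for _ in range(5):                       # steps 2-4: freeze Np_i and refit
    denom = expected(theta, n, edges)    # step 2: Np_i at the current minimum
    new = minimize(T_frozen, x0=theta, args=(n, edges, denom)).x
    if np.allclose(new, theta, rtol=1e-6):
        break                            # fit is stable
    theta = new
```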


Neyman’s Chi-square

Another way to remove the dependence of the denominator on θ is to use not the expected number of events but the observed number, which is of course constant.

This variant is called Neyman’s Chi-square:

T_N(\theta) = \sum_{\text{bins}} \frac{(n_i - Np_i(\theta))^2}{n_i}

This is probably the most common form for fitting histograms, because it is the easiest to program, and it can be shown to be asymptotically equivalent to Pearson’s Chi-square.

But it is obviously a poor approximation when n_i is small, since it then gives an exceptionally high weight to that bin.


Neyman’s Chi-square: the bias for small n_i

There are two problems that arise when n_i is small:

1. When n_i is small, it is often because of a negative fluctuation from a larger expected value, in which case this large fluctuation will be given a larger weight than it should have, and cause a large bias toward smaller values of f.

This may usually be fixed by taking for σ_i², instead of n_i, the average of n_i over three bins:

T_N(\theta) = \sum_{\text{bins}} \frac{(n_i - Np_i(\theta))^2}{(n_{i-1} + n_i + n_{i+1})/3}

2. When n_i = 0, the value of T diverges, so n_i = 0 must be avoided absolutely. So, if after having used the trick above there are still some bins with σ_i² = 0 or 1, join those bins with neighbouring bins until σ_i² exceeds one. (A sketch of both remedies follows below.)
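A sketch of Neyman’s chi-square with the three-bin smoothed denominator, continuing the toy setup above. Padding the edge bins with themselves is an assumption of this sketch, and the bin-joining step for bins still at 0 or 1 is not shown.

```python
# Neyman's chi-square, optionally with sigma_i^2 smoothed over three bins.
def neyman_T(theta, n, edges, smooth=True):
    N = n.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu = N * norm.pdf(centers, loc=theta[0], scale=theta[1]) * np.diff(edges)
    if smooth:
        # sigma_i^2 = average of n_i over the bin and its two neighbours;
        # edge bins reuse their own content (an arbitrary choice here)
        padded = np.concatenate(([n[0]], n, [n[-1]]))
        denom = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    else:
        denom = n.astype(float)   # plain Neyman: sigma_i^2 = n_i
    # bins whose denominator is still <= 1 would have to be merged
    # with neighbours before this division (not shown)
    return np.sum((n - mu) ** 2 / denom)
```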


Binned Likelihood

The best way to avoid all the problems that arise because of binning with Chi-square is to use either binned likelihood, which is insensitive to binning, or likelihood, which requires no binning at all.

Binned Likelihood treats the bin contents as either Poisson-distributed or Multinomial-distributed, whereas the Chi-square statistic implies a Gaussian approximation.

We use here the Poisson form, the most commonly used.

This has been introduced in this course in chapters 2 and 5, but it is not explicitly given in the book.


Binned Likelihood

The statistic to be minimized is what Baker and Cousins call the likelihood chi-square, because it is a likelihood ratio but is distributed under H0 as a χ²(n_bins − 1):

\chi^2_\lambda(\theta) = -2\ln\lambda(\theta) = 2\sum_i \left[\mu_i(\theta) - n_i + n_i \ln\big(n_i/\mu_i(\theta)\big)\right]

where the sum is over bins and μ_i(θ) = Np_i(θ).

This solves the problems of binning encountered when using Chi-square, since there is no minimum number of events per bin. (The last term in the expression for χ²_λ is taken to be zero when n_i = 0.)

However, not much seems to be known about optimal binning for GOF, but clearly some bins (most bins?) must have more than one event.

This method has the area-conserving property: at the minimum, the normalization is automatically correct:

\sum_i \mu_i(\hat{\theta}) = \sum_i n_i
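A minimal sketch of the Baker-Cousins statistic as a Python function; the name and the vectorized handling of n_i = 0 are illustrative choices.

```python
import numpy as np

def likelihood_chisq(n, mu):
    """chi^2_lambda = 2 * sum_i [mu_i - n_i + n_i ln(n_i/mu_i)],
    with the log term taken as zero when n_i = 0."""
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # where n_i = 0 the factor n_i makes the whole term vanish; the inner
    # where() just keeps the log argument finite in that case
    log_term = n * np.log(np.where(n > 0, n, 1.0) / mu)
    return 2.0 * np.sum(mu - n + log_term)

# to fit: minimize likelihood_chisq(n, N * p(theta)) over theta
```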


Unbinned Likelihood

We have seen in point estimation that under very general conditions the optimal estimator is the Maximum Likelihood estimator, and that putting events into bins, as required for Chi-square estimation, always loses some information.

So why doesn’t everyone always use M.L. fitting? Basically, because:

1. The value of the likelihood is not a good GOF test statistic (as is the Chi-square).

2. Normalization of the pdf is often a problem. The likelihood is a product over observations of the pdf of the model (the fitting function). By definition, the pdf should always be normalized, but in practice it is not always so easy.

In fact, absolute normalization is not necessary, since the value of the likelihood function is only defined up to an arbitrary multiplicative constant, but the integral of the pdf over all the data space must be independent of the parameters θ.


Calculating the Likelihood

Since the likelihood is only defined up to an arbitrary multiplicative constant (one can think of this as being the units of L), the values of L may easily overflow or underflow the capacity of the computer arithmetic. Since we are going to use −2 ln L anyway, we therefore accumulate ln L by summing the logarithms of the pdf. Then the multiplicative constant becomes an arbitrary additive constant, which is easier to handle arithmetically.

A more serious problem is the normalization of the pdf, which has to remain constant as the parameters are varied. Sometimes it is possible to work with correctly normalized functions, but more often we must numerically integrate and re-normalize the pdf at each function call.

If the observable space is one-dimensional, the integration can be done to high accuracy using numerical quadrature, so it is not a real problem, just a lot of extra work to make sure it works.
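A sketch of a one-dimensional unbinned −2 ln L with numerical re-normalization at each call, assuming a hypothetical unnormalized shape g on a finite observable space [a, b]; all names are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def g(x, theta):
    """Unnormalized model shape, standing in for whatever f(x|theta) is."""
    return np.exp(-0.5 * ((x - theta[0]) / theta[1]) ** 2)

def neg2logL(theta, data, a=-5.0, b=5.0):
    Z, _ = quad(g, a, b, args=(theta,))   # 1-d numerical quadrature
    # accumulate ln L as a sum of logs: the arbitrary multiplicative
    # constant in L becomes an additive constant and drops out of the fit
    return -2.0 * np.sum(np.log(g(data, theta)) - np.log(Z))
```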


Calculating the Likelihood

But when fitting in higher dimensions, the numerical integration is more complicated.

Already in two or three dimensions you have to know what you are doing, and convergence to high accuracy can be rather slow.

In higher dimensions, the integration may have to be done by Monte Carlo, which converges slowly and is usually not accurate enough for the fitting program to calculate numerical derivatives of −2 ln L with respect to the parameters. This in turn means that you may have considerable difficulty finding an accurate M.L. solution.


Fitting real data

We will look at an example of fitting the energy-loss distribution for pions and protons in thin silicon detectors.

S. Hancock et al., Energy loss and energy straggling of protons and pions in the momentum range 0.7 to 115 GeV/c, Phys. Rev. A 28 (1983) 615.

As many of the methods proposed in this course are valid or optimal only asymptotically, you may get the impression that everything will work just fine if there is a lot of data.

Unfortunately, this example shows that large amounts of high-resolution data can also pose problems.


Applications GOF intervals

Confidence Intervals and Goodness-of-fit

When I introduced the “Five Classes of Problems”, I said that some problems could be approached using the methods of several different classes, and that each different method could give different results, so you must know what problem you are trying to solve.

Now we look at a good example of this effect, looking at a confidence interval on a fitted parameter as determined by:

1. The method of interval estimation: the ensemble of parameter values that includes the true value with probability 90%.

2. The method of goodness-of-fit: the ensemble of parameter values that gives a good fit to the data, with a P-value greater than 0.10.


[Figure: Chi-square as a function of the parameter θ.]

At the 90% level, a good fit is defined by χ² < 63, because there are 50 bins.

But the 90% confidence interval for θ is given by χ² = χ²_min + 2.70.
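Both thresholds can be cross-checked with scipy’s χ² quantiles (a sketch; taking 50 degrees of freedom for the GOF cut, as on the slide):

```python
from scipy.stats import chi2

print(chi2.ppf(0.90, df=50))   # ≈ 63.2: the 90% GOF cut for 50 bins
print(chi2.ppf(0.90, df=1))    # ≈ 2.71: the UP added to chi2_min for a
                               #         90% interval on one parameter
```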


Another example, this one from Astrophysics

In the famous plots of the expanding universe as determined from supernovae observations, the question is whether the expansion is constant (a straight line on the plot) or accelerating (an upward bending of the straight line on the right side).

The data fit a straight line, but when an additional “acceleration” term is added to the model, the fit is more sensitive to the upward curve and the additional term is significantly different from zero.


…lowed up. This approach also made it possible to use the Hubble Space Telescope for follow-up light-curve observations, because we could specify in advance the one-square-degree patch of sky in which our wide-field imager would find its catch of supernovae. Such specificity is a requirement for advance scheduling of the HST. By now, the Berkeley team had grown to include some dozen collaborators around the world, and was called the Supernova Cosmology Project (SCP).

A community effort

Meanwhile, the whole supernova community was making progress with the understanding of relatively nearby supernovae. Mario Hamuy and coworkers at Cerro Tololo took a major step forward by finding and studying many nearby (low-redshift) type Ia supernovae.7 The resulting beautiful data set of 38 supernova light curves (some shown in figure 1) made it possible to check and improve on the results of Branch and Phillips, showing that type Ia peak brightness could be standardized.6,7

The new supernovae-on-demand techniques that permitted systematic study of distant supernovae and the improved understanding of brightness variations among nearby type Ia’s spurred the community to redouble its efforts. A second collaboration, called the High-Z Supernova Search and led by Brian Schmidt of Australia’s Mount Stromlo Observatory, was formed at the end of 1994. The team includes many veteran supernova experts. The two rival teams raced each other over the next few years, occasionally covering for each other with observations when one of us had bad weather, as we all worked feverishly to find and study the guaranteed on-demand batches of supernovae.

At the beginning of 1997, the SCP team presented the results for our first seven high-redshift supernovae.8 These first results demonstrated the cosmological analysis techniques from beginning to end. They were suggestive of an expansion slowing down at about the rate expected for the simplest inflationary Big Bang models, but with error bars still too large to permit definite conclusions.

By the end of the year, the error bars began to tighten, as both groups now submitted papers with a few more supernovae, showing evidence for much less than the expected slowing of the cosmic expansion.9–11 This was beginning to be a problem for the simplest inflationary models with a universe dominated by its mass content.

Finally, at the beginning of 1998, the two groups presented the results shown in figure 3.12,13

What’s wrong with faint supernovae?

The faintness, or distance, of the high-redshift supernovae in figure 3 was a dramatic surprise. In the simplest…

[Source: Physics Today, April 2003, p. 56; http://www.physicstoday.org]

[Figure 3 (plot): observed magnitude versus redshift z for type Ia supernovae, with data from the Supernova Cosmology Project, the High-Z Supernova Search, and Hamuy et al. Curves show accelerating and decelerating universes, with and without vacuum energy, for mass densities ranging from ρ_c down to an empty universe. Axes: observed magnitude (14–26) and relative brightness versus redshift z (0.01–1); upper scale: linear scale of the universe relative to today.]

Exploding White Dwarfs

A plausible, though unconfirmed, scenario would explain how all type Ia supernovae come to be so much alike, given the varied range of stars they start from. A lightweight star like the Sun uses up its nuclear fuel in 5 or 10 billion years. It then shrinks to an Earth-sized ember, a white dwarf, with its mass (mostly carbon and oxygen) supported against further collapse by electron degeneracy pressure. Then it begins to quietly fade away.

But the story can have a more dramatic finale if the white dwarf is in a close binary orbit with a large star that is still actively burning its nuclear fuel. If conditions of proximity and relative mass are right, there will be a steady stream of material from the active star slowly accreting onto the white dwarf. Over millions of years, the dwarf’s mass builds up until it reaches the critical mass (near the Chandrasekhar limit, about 1.4 solar masses) that triggers a runaway thermonuclear explosion: a type Ia supernova.

This slow, relentless approach to a sudden cataclysmic conclusion at a characteristic mass erases most of the original differences among the progenitor stars. Thus the light curves (see figure 1) and spectra of all type Ia supernovae are remarkably similar. The differences we do occasionally see presumably reflect variations on the common theme, including differences, from one progenitor star to the next, of accretion and rotation rates, or different carbon-to-oxygen ratios.

Figure 3. Observed magnitude versus redshift is plotted for well-measured distant12,13 and (in the inset) nearby7 type Ia supernovae. For clarity, measurements at the same redshift are combined. At redshifts beyond z = 0.1 (distances greater than about 10⁹ light-years), the cosmological predictions (indicated by the curves) begin to diverge, depending on the assumed cosmic densities of mass and vacuum energy. The red curves represent models with zero vacuum energy and mass densities ranging from the critical density ρ_c down to zero (an empty cosmos). The best fit (blue line) assumes a mass density of about ρ_c/3 plus a vacuum energy density twice that large, implying an accelerating cosmic expansion.


Applications GOF by bootstrap

GOF Testing by Resampling (Bootstrap) I

The Bootstrap, by Bradley Efron, is used for testing hypotheses with small samples from unknown distributions.

Example: You have a sample of 16 particles with known energies, and you have identified 7 of them as antiprotons and 9 as protons. Now you want to test whether the protons are more energetic than the antiprotons.

The traditional approach would require assuming that both energy distributions were Gaussian, then making the standard test for equal means, based on the t-statistic

t = \frac{\bar{E}_p - \bar{E}_{\bar{p}}}{\sigma_{p\bar{p}}}

where Ē denotes the average energy.


GOF Testing by Resampling (Bootstrap) II

The bootstrap method uses the data itself.

• Put all 16 energies into one sample.

• Consider all the ways of dividing 16 into 7 + 9 (there are 16!/(7! 9!) = 11,440 different ways).

• For each combination, calculate x_i = Ē_p − Ē_p̄.

• Plot all these values of x in a histogram.

One of the 11,440 values of x_i will be x_d, the one corresponding to the data. To get the P-value of the test, simply count the number of values of x_i which are greater than x_d (divided by the total 11,440, of course).
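A sketch of this exhaustive resampling computation; the 16 energies below are made-up numbers, and which 7 indices are labelled antiprotons is arbitrary.

```python
from itertools import combinations
import numpy as np

energies = np.array([1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 2.4, 1.1,
                     1.3, 0.7, 1.9, 2.2, 1.0, 1.6, 0.6, 1.4])
observed_pbar = (0, 1, 2, 3, 4, 5, 6)    # indices identified as antiprotons

def x_stat(pbar_idx):
    """x = mean proton energy minus mean antiproton energy."""
    mask = np.zeros(len(energies), dtype=bool)
    mask[list(pbar_idx)] = True
    return energies[~mask].mean() - energies[mask].mean()

x_d = x_stat(observed_pbar)              # the value given by the data
xs = [x_stat(c) for c in combinations(range(16), 7)]   # all divisions
p_value = sum(x > x_d for x in xs) / len(xs)
print(len(xs), p_value)                  # 11440 and the one-sided P-value
```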


Applications Coverage in Practice

Coverage in Practice

Frequentist coverage is mainly a mathematical concept, in the sense that, if we use the Neyman procedure, we are mathematically guaranteed to have exact coverage (or overcoverage in the case of discrete data).

But of course it should also be observable in the real world. The problem is that we don’t know the true value when we do the experiment, so we don’t know how many experimental error bars actually cover the true value.

But if we wait a little while, the true value may become known, or at least known much better than when the experiments were performed.

So let us consider coverage from the practical side. We will see, using examples from the recent lectures by Giulio D’Agostini [CERN Academic Training 2005], that there is considerable misunderstanding about the meaning of coverage.


ErrorDef (UP) to get desired Coverage in Minuit

Coverage probability of hypercontour χ² = χ²_min + UP

Number of
Parameters    50%     70%     90%     95%     99%
     1        0.46    1.07    2.70    3.84    6.63
     2        1.39    2.41    4.61    5.99    9.21
     3        2.37    3.67    6.25    7.82   11.36
     4        3.36    4.88    7.78    9.49   13.28
     5        4.35    6.06    9.24   11.07   15.09
     6        5.35    7.23   10.65   12.59   16.81
     7        6.35    8.38   12.02   14.07   18.49
     8        7.34    9.52   13.36   15.51   20.09
     9        8.34   10.66   14.68   16.92   21.67
    10        9.34   11.78   15.99   18.31   23.21
    11       10.34   12.88   17.29   19.68   24.71

Table of UP for multi-parameter confidence regions from χ² or −2 ln L.
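The UP values are quantiles of a χ² distribution with npar degrees of freedom, so the table can be cross-checked with scipy (it agrees up to rounding):

```python
from scipy.stats import chi2

for npar in range(1, 12):
    row = [round(chi2.ppf(cl, df=npar), 2)
           for cl in (0.50, 0.70, 0.90, 0.95, 0.99)]
    print(npar, row)
# npar = 1 gives [0.45, 1.07, 2.71, 3.84, 6.63],
# matching the table's first row up to rounding
```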


. . . or

• Why do we insist on using the ‘frequentistic coverage’ that, apart from the high-sounding names and attributes (‘exact’, ‘classical’, “guarantees…”, …), manifestly does not cover?

• In January 2000 I was answered that the reason “is because people have been flip-flopping. Had they used a unified approach, this would not have happened” (G. Feldman).

• After six years the production of 90–95% C.L. bounds has continued steadily, and in many cases the so-called ‘unified approach’ has been used, but still coverage does not do its job.

• What will be the next excuse? ⇒ I do not know what the so-called ‘flip-flopping’ is, but we can honestly acknowledge the flop of that reasoning.

G. D’Agostini, CERN Academic Training 21-25 February 2005 – p.63/72


. . . or

• Why do we insist on using the ‘frequentistic coverage’ that, apart from the high-sounding names and attributes (‘exact’, ‘classical’, “guarantees…”, …), manifestly does not cover? More precisely (and besides the ‘philosophical quibbles’ of the interval that covers the value with a given probability, and not the value being in the interval with that probability):
◦ many thousands of C.L. upper/lower bounds have been published in the past years ⇒ but never has a value shown up in the 5% or 10% side that, by complementarity, the method should cover in 5% or 10% of the cases.

G. D’Agostini, CERN Academic Training 21-25 February 2005 – p.63/72


How does coverage work?

Consider some set of 68% Confidence Intervals from different experiments. For example, look at the figure in the Introduction to the RPP (the same figure has appeared for several editions). For our purposes, it doesn’t matter what quantity is being measured here, but it happens to be Re(x), where x is a constant parameterizing the violation of the ΔS = ΔQ rule in leptonic K⁰ decays.

Because physics has made some progress in 30 years, we now know the true value: x = 0.

In the figure, there are 17 Confidence Intervals of 68% CL, and 12 of them include the true value (zero).

Coverage works well here.

Physicists generally consider coverage a required property of confidence intervals. (See R. D. Cousins, Am. J. Phys. 63, 5, May 1995.)


…other values of lower accuracy), the scaled-up error δx is approximately half the interval between the two discrepant values.

We emphasize that our scaling procedure for errors in no way affects central values. And if you wish to recover the unscaled error δx, simply divide the quoted error by S.

(b) If the number M of experiments with an error smaller than δ0 is at least three, and if χ²/(M − 1) is greater than 1.25, we show in the Particle Listings an ideogram of the data. Figure 1 is an example. Sometimes one or two data points lie apart from the main body; other times the data split into two or more groups. We extract no numbers from these ideograms; they are simply visual aids, which the reader may use as he or she sees fit.

[Figure 1 (ideogram): WEIGHTED AVERAGE 0.006 ± 0.018 (Error scaled by 1.3); total χ² = 22.0 (Confidence Level = 0.107); horizontal axis from −0.4 to 0.6. χ² contributions by experiment: FRANZINI 65 HBC 0.2; BALDO-… 65 HLBC; AUBERT 65 HLBC 0.1; FELDMAN 67B OSPK 0.3; JAMES 68 HBC 0.9; LITTENBERG 69 OSPK 0.3; BENNETT 69 CNTR 1.1; CHO 70 DBC 1.6; WEBBER 71 HBC 7.4; MANN 72 HBC 3.3; GRAHAM 72 OSPK 0.4; BURGUN 72 HBC 0.2; MALLARY 73 OSPK 4.4; HART 73 OSPK 0.3; FACKLER 73 OSPK 0.1; NIEBERGALL 74 ASPK 1.3; SMITH 75B WIRE 0.3.]

Figure 1: A typical ideogram. The arrow at the top shows the position of the weighted average, while the width of the shaded pattern shows the error in the average after scaling by the factor S. The column on the right gives the χ² contribution of each of the experiments. Note that the next-to-last experiment, denoted by the incomplete error flag (⊥), is not used in the calculation of S (see the text).

Each measurement in an ideogram is represented by a Gaussian with a central value x_i, error δx_i, and area proportional to 1/δx_i. The choice of 1/δx_i for the area is somewhat arbitrary. With this choice, the center of gravity of the ideogram corresponds to an average that uses weights 1/δx_i rather than the (1/δx_i)² actually used in the averages. This may be appropriate when some of the experiments have seriously underestimated…


Effects which can interfere with coverage

In practice, there are several effects that can make the apparent coverage wrong:

1. The file-drawer effect. If several experiments look for an unexpected new result, the ones that don’t observe any effect will not publish their results, so we expect an overabundance of significant P-values in published papers.

2. Flip-flopping and other mistakes can cause physicists to publish incorrect confidence intervals that do not cover.

3. Embarrassing results, such as a signal greater than physically allowed or predicted by any theory, would most likely be “massaged” before publication, even if they only resulted from an unusual fluctuation. Suppressing such results modifies the global coverage.

4. The stopping rule for corrections. Most experimental results require several corrections (for systematic errors, calibration, etc.) before they can be published. There is a tendency to stop applying corrections as soon as one attains the expected result. [W mass at LEP]


You have measured M_top = 175 ± 5 GeV.

What statement can you make about the 68% confidence interval (170, 180)?

1. The method I have used produces confidence intervals, 68% of which include the true value of M_top.

2. The probability that M_top will lie inside my confidence interval is 0.68.

3. The probability that M_top lies inside the interval (170, 180) is 0.68.

4. The probability that the interval (170, 180) includes M_top is 0.68.

5. The probability that M_top lies inside the interval (170, 180) is 0.68, in the sense of Neyman.

Note that if you make any of the above statements to a journalist, he will report No. 3 in his newspaper.


One last detail

Is there a difference between:

1. The probability that the true value lies inside an interval, and

2. The probability that an interval covers the true value?

Formally, NO, since P(θ ∈ I) = P(I ∋ θ). But I think most people prefer to say “the interval covers the true value” rather than “the true value lies inside the interval”.
