Statistics Summary Notes



    Statistics of Measurement - Summary Handout

    Dr Roberto Trotta

    January 23, 2010

This is a summary of the notes I used for the 9-lecture course Statistics of Measurement for the 2nd year physics degree at Imperial College London. While I hope the material will be useful for revision, you should not rely solely on it, but complement it with your own notes from

the lectures (in particular, covering examples and derivations, which are mostly omitted from this summary) and other textbook sources.

Every effort has been made to correct any typos, but invariably some will remain. I would be grateful if you could point them out to me by e-mailing your corrections to: [email protected]. A list of errata corrige will be posted and maintained on Blackboard.

    Contents

1 Probabilities
2 Random variables, parent distributions and samples
3 Discrete probability distributions
4 Properties of discrete distributions: expectation value and variance
5 Properties of continuous distributions
6 The Gaussian (or Normal) distribution
7 The Central Limit Theorem
8 The likelihood function
9 The Maximum Likelihood Principle
10 Confidence intervals
11 Propagation of errors
12 Bayesian statistics



    1 Probabilities

Let A, B, C, . . . denote propositions (e.g., that a coin toss gives tails). Let Ω describe the sample space of the experiment, i.e., Ω is a list of all the possible outcomes of the experiment (for the coin tossing example, Ω = {T, H}, where T denotes tails and H denotes heads).

When the experiment is performed, the outcome O (e.g., coin lands T) is selected with probability

P(O) = n(O)/n, (1)

where n(O) is the number of possibilities in Ω favourable to O and n is the total number of possibilities in Ω.

Frequentist definition of probability: the number of times an event occurs divided by the total number of events, in the limit of an infinite series of equiprobable trials.

The joint probability of A and B is the probability of A and B happening together, and is denoted by P(A, B).

The conditional probability of A given B is the probability of A happening given that B has happened, and is denoted by P(A|B).

The sum rule:

P(A) + P(Ā) = 1, (2)

where Ā denotes the proposition "not A".

The product rule:

P(A, B) = P(A|B)P(B). (3)

By inverting the order of A and B we obtain

P(B, A) = P(B|A)P(A), (4)

and because P(A, B) = P(B, A), we obtain Bayes theorem by equating Eqs. (3) and (4):

P(A|B) = P(B|A)P(A) / P(B). (5)

The marginalisation rule (follows from the two rules above):

P(A) = P(A, B_1) + P(A, B_2) + · · · = Σ_i P(A, B_i) = Σ_i P(A|B_i)P(B_i), (6)

where the sum is over all possible outcomes for proposition B.

Two propositions (or events) are said to be independent if and only if

P(A, B) = P(A)P(B). (7)
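These rules can be checked mechanically on a tiny sample space. The sketch below is our own illustration (not from the handout): two fair coin tosses, with probabilities computed by brute-force enumeration as in Eq. (1).

```python
from fractions import Fraction

# Sample space for two fair coin tosses; all four outcomes equiprobable.
omega = [(a, b) for a in "HT" for b in "HT"]

def P(event):
    """Probability as favourable/total possibilities, Eq. (1)."""
    return Fraction(sum(1 for o in omega if event(o)), len(omega))

A = lambda o: o[0] == "H"          # first toss is heads
B = lambda o: o[1] == "H"          # second toss is heads
P_AB = P(lambda o: A(o) and B(o))  # joint probability P(A, B)

P_A_given_B = P_AB / P(B)          # product rule, Eq. (3), rearranged
P_B_given_A = P_AB / P(A)

# Bayes theorem, Eq. (5): P(A|B) = P(B|A) P(A) / P(B)
bayes_lhs = P_A_given_B
bayes_rhs = P_B_given_A * P(A) / P(B)

# Independence, Eq. (7): P(A, B) = P(A) P(B)
independent = P_AB == P(A) * P(B)
```

Here the two tosses are independent by construction, so Eq. (7) holds exactly in rational arithmetic.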

Statistics of Measurement © Roberto Trotta 2010


Figure 1: Left panel: uniform discrete distribution for n = 6. Right panel: the corresponding cdf.

It is plotted in Fig. 1 alongside its cdf for the case of the tossing of a fair die (n = 6).

The binomial distribution: the binomial describes the probability of obtaining r successes in a sequence of n trials, each of which has probability p of success. Here, success can be defined as one specific outcome in a binary process (e.g., H/T, blue/red, 1/0, etc). The binomial distribution B(n, p) is given by:

P(r|n, p) ≡ B(n, p) = (n choose r) p^r (1 − p)^(n−r), (11)

where the "choose" symbol is defined as

(n choose r) ≡ n! / ((n − r)! r!) (12)

for 0 ≤ r ≤ n (remember, 0! = 1). Some examples of the binomial for different choices of n, p are plotted in Fig. 2.

The derivation of the binomial distribution proceeds from considering the probability of obtaining r successes in n trials (p^r), while at the same time obtaining n − r failures ((1 − p)^(n−r)). The combinatorial factor in front is derived from considerations of the number of permutations that lead to the same total number of successes.
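As a quick numerical check, Eq. (11) can be evaluated directly with the Python standard library (a sketch; the helper name `binomial_pmf` is ours, not from the handout):

```python
from math import comb

def binomial_pmf(r, n, p):
    """Eq. (11): P(r|n, p) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# A pmf must sum to one over all possible values of r.
total = sum(binomial_pmf(r, 10, 0.3) for r in range(11))

# One head in five tosses of a fair coin: C(5, 1) * (1/2)^5 = 5/32.
p_one_head = binomial_pmf(1, 5, 0.5)
```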

The Poisson distribution: the Poisson distribution describes the probability of the number of events in a process where events occur with a fixed average rate and independently of each other. E.g.: number of galaxies in the sky, number of murders in London, number of planes landing at Heathrow, number of photons arriving at a photomultiplier, etc.

Assume that λ is the probability of an event occurring per unit time (with λ = constant: this is the definition of a Poisson process!). The probability of r events happening in a time


Figure 2: Some examples of the binomial distribution, Eq. (11), for different choices of n, p, and its corresponding cdf.


t is given by the Poisson distribution:

P(r|λ, t) ≡ Poisson(λ) = ((λt)^r / r!) e^(−λt). (13)

Notice that this is a discrete pmf in the number of events r, and not a continuous pdf in t (which is a fixed parameter). The probability of getting r events in a unit time interval is obtained from the above equation by setting t = 1.

The Poisson distribution of Eq. (13) is plotted in Fig. 3 as a function of r for a few choices of λ (notice that in the figure t = 1 has been assumed, in the appropriate units). The derivation of the Poisson distribution follows from considering the probability of 1 event taking place in a small time interval Δt, then taking the limit Δt → dt → 0. It can also be shown that the Poisson distribution arises from the Binomial in the limit Np → λ for N → ∞.
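The Binomial-to-Poisson limit just stated is easy to verify numerically (our own sketch; `poisson_pmf` and `binomial_pmf` are hypothetical helper names):

```python
from math import comb, exp, factorial

def poisson_pmf(r, lam):
    """Eq. (13) with t = 1: lam^r / r! * exp(-lam)."""
    return lam**r / factorial(r) * exp(-lam)

def binomial_pmf(r, n, p):
    """Eq. (11)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Binomial with N large and N*p = lam held fixed approaches Poisson(lam).
lam, N = 3.0, 100_000
gap = max(abs(binomial_pmf(r, N, lam / N) - poisson_pmf(r, lam))
          for r in range(10))
```

The largest pointwise difference `gap` shrinks as N grows, illustrating the limit.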

4 Properties of discrete distributions: expectation value and variance

The discrete distributions above depend on parameters (such as p for the Binomial, λ for the Poisson), which control the shape of the distribution. If we know the value of the parameters, we can deduce what we will observe when we obtain samples from the distributions via the measurement process. This is the subject of probability theory, which concerns itself with the theoretical properties of the distributions. The inverse problem of making inductions about the parameters from the observed samples is the subject of statistical inference, addressed later.

Two important properties of distributions are the expectation value (which controls the location of the distribution) and the variance or dispersion (which controls how much the distribution is spread out). Expectation value and variance are functions of a RV.

The expectation value E(X) (often called mean, or expected value¹) of the discrete RV X is defined as

E(X) = Σ_i x_i P_i. (14)

For example, for the tossing of a fair die (which follows the uniform discrete distribution, Eq. (10)), the expectation value is given by E(X) = Σ_i i · (1/6) = 21/6.

The variance or dispersion Var(X) of the discrete RV X is defined as

Var(X) ≡ E[(X − E(X))²] = E(X²) − E(X)². (15)

The square root of the variance is often called the standard deviation and is usually denoted by the symbol σ, so that Var(X) = σ². For the above example of die tossing, the variance is given by

Var(X) = Σ_i (x_i − ⟨X⟩)² P_i = Σ_i x_i² P_i − (Σ_i x_i P_i)² = Σ_i i² · (1/6) − (21/6)² = 105/36. (16)

¹ We prefer not to use the term mean to avoid confusion with the sample mean.
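The die-tossing numbers above can be reproduced exactly with rational arithmetic (our own sketch, not part of the original notes):

```python
from fractions import Fraction

# Fair die: outcomes x_i = 1..6, each with probability P_i = 1/6.
p = Fraction(1, 6)
E = sum(x * p for x in range(1, 7))          # Eq. (14): 21/6
E_x2 = sum(x * x * p for x in range(1, 7))   # second moment: 91/6
var = E_x2 - E**2                            # Eq. (15): 105/36
```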


Figure 3: Some examples of the Poisson distribution, Eq. (13), for different choices of λ, and its corresponding cdf.


For the Binomial distribution of Eq. (11), the expectation value and variance are given by:

E(X) = np, Var(X) = np(1 − p). (17)

For the Poisson distribution of Eq. (13), the expectation value and variance are given by:

E(X) = λ, Var(X) = λ. (18)

5 Properties of continuous distributions

As we did above for the discrete distributions, we now define the following properties for continuous distributions.

The expectation value E(X) of the continuous RV X is defined as

E(X) = ∫ x p(x) dx. (19)

The variance or dispersion Var(X) of the continuous RV X is defined as

Var(X) ≡ E[(X − E(X))²] = E(X²) − E(X)² = ∫ x² p(x) dx − (∫ x p(x) dx)². (20)

The exponential distribution: the exponential distribution describes the time one has to wait between two consecutive events in a Poisson process, e.g. the waiting time between two radioactive particle decays, or the time between cars passing by a certain point on a road, or (swapping time for length) the distance between galaxies in the sky.

To derive the exponential distribution, one can consider the arrival time of Poisson-distributed counts, for example the arrival time of customers in a queue, then derive the probability density that the first person arrives at time t by considering the probability (which is Poisson distributed) that nobody arrives in the interval [0, t] and then that one person arrives during the interval [t, t + Δt]. Taking the limit Δt → 0 it follows that the probability density for observing precisely 1 event at time t is given by

P(1 event at time t|λ) = λ e^(−λt), (21)

where λ is the mean number of events per unit time. This is the exponential distribution.

If we have already waited for a time s for the first event to occur (and no event has occurred), then the probability that we have to wait for another time t before the first event happens satisfies

P(T > t + s|T > s) = P(T > t). (22)

This means that having waited for time s without the event occurring, the time we can expect to have to wait has the same distribution as the time we have to wait from the beginning. The exponential distribution has no "memory" of the fact that a time s has already elapsed.

For the exponential distribution of Eq. (21), the expectation value and variance are given by

E(t) = 1/λ, Var(t) = 1/λ². (23)
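The memoryless property of Eq. (22) can be checked directly from the survival function P(T > t) = exp(−λt), which follows by integrating Eq. (21) from t to infinity (a sketch with arbitrary illustrative numbers):

```python
from math import exp

def survival(t, lam):
    """P(T > t) for an exponential waiting time: exp(-lam * t)."""
    return exp(-lam * t)

# Memorylessness, Eq. (22): P(T > t + s | T > s) = P(T > t).
lam, s, t = 0.5, 2.0, 3.0
conditional = survival(t + s, lam) / survival(s, lam)
```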


Figure 4: Two examples of the Gaussian distribution, Eq. (24), for different choices of μ, σ, and its corresponding cdf. It is clear that the expectation value μ controls the location of the pdf, while σ controls its width.

    6 The Gaussian (or Normal) distribution

The Gaussian pdf (often called the Normal distribution) is perhaps the most important distribution. It is used as default in many situations involving continuous RVs (the reason becomes clear once we have studied the Central Limit Theorem, section 7). A heuristic derivation of how the Gaussian arises follows from the example of darts throwing (given in the lecture).

The Gaussian pdf is a continuous distribution with mean μ and standard deviation σ, and is given by

p(x|μ, σ) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), (24)

and it is plotted in Fig. 4 for two different choices of {μ, σ}. The Gaussian is the famous "bell-shaped" curve.

For the Gaussian distribution of Eq. (24), the expectation value and variance are given by:

E(X) = μ, Var(X) = σ². (25)

It can be shown that the Gaussian arises from the Binomial in the limit n → ∞ and from the Poisson distribution in the limit λ → ∞. As shown in Fig. 5, the Gaussian approximation to either the Binomial or the Poisson distribution is very good even for fairly moderate values of n and λ.

    The probability content of the Gaussian for a given symmetric interval around the mean


Figure 5: Gaussian approximation to the Binomial (left panel) and the Poisson distribution (right panel). The solid curve gives in each case the Gaussian approximation to each pmf.

of width σ on each side is given by

P(μ − σ < x < μ + σ) = ∫_{μ−σ}^{μ+σ} (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)) dx (26)
= (2/√π) ∫_0^{1/√2} exp(−y²) dy (27)
= erf(1/√2), (28)

where the error function erf is defined as

erf(x) = (2/√π) ∫_0^x exp(−y²) dy, (29)

and can be found by numerical integration (it is also often tabulated and available as a built-in function in most mathematical software). Also recall the useful integral:

∫ exp(−(x − μ)²/(2σ²)) dx = √(2π) σ. (30)

Eq. (26) allows us to find the probability content of the Gaussian pdf for any symmetric interval around the mean. Some commonly used values are given in Table 1.

In particular, the usual notation, e.g. for a measurement of a temperature of the form T = (100 ± 1) K, means that ±1 K is the 1σ errorbar. This means that 68.3% of the probability is contained within ±1 K of the mean.

The discovery threshold in particle physics is traditionally set at 5σ. This means that one needs to have a probability in excess of 1 − 5.7 × 10⁻⁷ before being able to claim the discovery of a new effect.


Half-width (number of σ)   Probability content   Usually called
1                          0.683                 1σ
2                          0.954                 2σ
3                          0.997                 3σ
4                          0.9993                4σ
5                          1 − 5.7 × 10⁻⁷        5σ
1.64                       0.90                  90% probability interval
1.96                       0.95                  95% probability interval
2.57                       0.99                  99% probability interval
3.29                       0.999                 99.9% probability interval

Table 1: Relationship between the size of the interval around the mean and the probability content for a Gaussian distribution.
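The entries of Table 1 follow from Eq. (28), generalised to a half-width of kappa standard deviations, using the erf function built into most languages (a sketch using Python's math.erf; `prob_within` is our own helper name):

```python
from math import erf, sqrt

def prob_within(kappa):
    """Probability content of mu +/- kappa*sigma for a Gaussian, cf. Eq. (28)."""
    return erf(kappa / sqrt(2))

# Reproduce the familiar 1-, 2- and 3-sigma rows of Table 1.
sigma_rows = [round(prob_within(k), 3) for k in (1, 2, 3)]
```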

    7 The Central Limit Theorem

The Central Limit Theorem (CLT) is a very important result justifying why the Gaussian distribution is ubiquitous.

Simple formulation of the CLT: Let X_1, X_2, . . . , X_N be a collection of independent RVs with finite expectation value μ and finite variance σ². Then, for N → ∞, their sum is Gaussian distributed with mean Nμ and variance Nσ².

Note: it does not matter what the detailed shape of the underlying pdf for the individual RVs is!

Consequence: whenever a RV arises as the sum of several independent effects (e.g., noise in a temperature measurement), we can be confident that it will be very nearly Gaussian distributed.

More rigorous (and more general) formulation of the CLT: Let X_1, X_2, . . . , X_N be a collection of independent RVs, each with finite expectation value μ_i and finite variance σ_i². Then the variable

Y = (Σ_{i=1}^N X_i − Σ_{i=1}^N μ_i) / √(Σ_{i=1}^N σ_i²) (31)

is distributed as a Gaussian with expectation value 0 and unit variance.

    Proof: not required. Very simple using characteristic functions.
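A quick simulation makes the theorem tangible (our own sketch, not from the notes: sums of uniform variates, whose individual pdf looks nothing like a Gaussian):

```python
import random
from math import sqrt

# Each X_i is uniform on [0, 1): mu_i = 1/2, sigma_i^2 = 1/12.
random.seed(0)
N, trials = 50, 20_000
mu, var = 0.5, 1 / 12

ys = []
for _ in range(trials):
    s = sum(random.random() for _ in range(N))
    # Standardised sum, Eq. (31): should be ~ N(0, 1) for large N.
    ys.append((s - N * mu) / sqrt(N * var))

mean_y = sum(ys) / trials
var_y = sum(y * y for y in ys) / trials - mean_y**2
```

The sample mean and variance of the standardised sums come out close to 0 and 1, as Eq. (31) predicts.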

    8 The likelihood function

The problem of inference can be stated as follows: given a collection of samples, {x_1, x_2, . . . , x_N}, and a generating random process, what can be said about the properties of the underlying probability distribution?


Figure 6: The likelihood function for the probability of heads (θ) for the coin tossing example.

Schematically, we have that:

pdf (e.g., Gaussian with a given (μ, σ)) → Probability of observation
Underlying (μ, σ) → Observed events (32)

    The connection between the two domains is given by the likelihood function.

Given a pdf or a pmf p(X|θ), where X represents a random variable and θ a collection of parameters describing the shape of the pdf (e.g., for a Gaussian θ = {μ, σ}), and the observed data x = {x_1, x_2, . . . , x_N}, the likelihood function L (or likelihood for short) is defined as

L(θ) ≡ p(X = x|θ), (33)

i.e., the probability, as a function of the parameters θ, of observing the data that have been obtained. Notice that the likelihood is not a pdf in θ.

Example: in tossing a coin, let θ be the probability of obtaining heads in one throw. Suppose we make n = 5 flips and obtain the sequence x = {H, T, T, T, T}. The likelihood is obtained by taking the Binomial, Eq. (11), and replacing for r the number of heads obtained (r = 1) in n = 5 trials. Thus

L(θ) = (5 choose 1) θ¹(1 − θ)⁴ = 5θ(1 − θ)⁴, (34)

which is plotted as a function of θ in Fig. 6.
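The likelihood of Eq. (34) can be evaluated and maximised numerically on a grid (our own sketch; the peak sits at θ = r/n = 0.2, the fraction of heads observed):

```python
from math import comb

def likelihood(theta, r=1, n=5):
    """Eq. (34): L(theta) = C(5, 1) theta (1 - theta)^4."""
    return comb(n, r) * theta**r * (1 - theta) ** (n - r)

# Brute-force maximisation over a fine grid of theta values.
grid = [i / 1000 for i in range(1001)]
theta_ml = max(grid, key=likelihood)
```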

This example leads to the formulation of the Maximum Likelihood Principle (see below): if we are trying to determine the value of θ given what we have observed (the sequence


of H/T), we should choose the value that maximises the likelihood. Notice that this is not necessarily the same as maximising the probability of θ. Doing so requires the use of Bayes theorem, see section 12.

A common problem is how to estimate the mean μ and the standard deviation σ of a Gaussian.

Given a list of samples x = {x_1, x_2, . . . , x_N}, the estimator for the (unknown) mean μ of the underlying Gaussian they are drawn from is given by

x̄ = (1/N) Σ_{i=1}^N x_i, (35)

i.e., the sample mean. The law of large numbers implies that

lim_{N→∞} x̄ = μ. (36)

This means that for large samples, the estimated sample mean converges to the true mean of the distribution. An estimator with this property is said to be unbiased.

The estimator for the standard deviation σ of the Gaussian is given by

σ̂² = (1/(N − 1)) Σ_{i=1}^N (x_i − x̄)². (37)

Notice the 1/(N − 1) factor in front, which ensures that the estimator is unbiased. Indeed

lim_{N→∞} σ̂² = σ², (38)

i.e., the above estimator converges to the true value for a large number of samples.
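The effect of the 1/(N − 1) factor is easy to see in a simulation (our own sketch, drawing small samples from a unit-variance Gaussian):

```python
import random

# Compare the naive 1/N variance estimator with the 1/(N-1) one of Eq. (37).
# True distribution: Gaussian with mu = 0 and sigma^2 = 1.
random.seed(1)
N, trials = 5, 50_000
naive_avg, unbiased_avg = 0.0, 0.0

for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(N)]
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)
    naive_avg += ss / N           # biased low by the factor (1 - 1/N)
    unbiased_avg += ss / (N - 1)  # Eq. (37)

naive_avg /= trials
unbiased_avg /= trials
```

Averaged over many repetitions, the 1/(N − 1) estimator lands near the true σ² = 1, while the naive 1/N version sits near (1 − 1/N)σ² = 0.8.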

    9 The Maximum Likelihood Principle

The Maximum Likelihood Principle (MLP): given the likelihood function L(θ) and seeking to determine the parameter θ, we should choose the value of θ in such a way that the value of the likelihood is maximised. The Maximum Likelihood Estimator (MLE) for θ is thus

θ_ML ≡ argmax_θ L(θ). (39)

Properties of the MLE: it is asymptotically unbiased (i.e., θ_ML → θ for N → ∞) and it is asymptotically the minimum variance estimator, i.e. the one with the smallest errors.

To find the MLE, we maximise the likelihood by requiring its first derivative to be zero and the second derivative to be negative:

∂L(θ)/∂θ |_{θ_ML} = 0, and ∂²L(θ)/∂θ² |_{θ_ML} < 0. (40)


Application of the MLP to a Gaussian likelihood: for N independent samples from a Gaussian distribution, the joint likelihood function is given by

L(θ) = p(x|θ) = Π_{i=1}^N (1/(σ√(2π))) exp(−(x_i − μ)²/(2σ²)), (42)

where θ = {μ, σ} are the mean and standard deviation of the distribution. Note: often the Gaussian above is written as

L(θ) = L_0 exp(−χ²/2), (43)

where the so-called chi-squared is defined as

χ² = Σ_i (x_i − μ)²/σ². (44)

The MLE for the mean is obtained by solving

∂ln L/∂μ = 0 ⟹ μ_ML = (1/N) Σ_{i=1}^N x_i, (45)

i.e., the MLE for the mean is just the sample mean that we already encountered above.

The MLE for σ works out to be

∂ln L/∂σ = 0 ⟹ σ²_ML = (1/N) Σ_{i=1}^N (x_i − μ)², (46)

which however is biased, because we have that E(σ²_ML) = (1 − 1/N)σ² ≠ σ² (for finite N). In order to obtain an unbiased estimator we replace the factor 1/N by 1/(N − 1). Also, because the true μ is usually unknown, we replace it in Eq. (47) by the MLE estimator, μ_ML. Thus an unbiased estimator for the variance is

σ̂² = (1/(N − 1)) Σ_{i=1}^N (x_i − μ_ML)². (47)

MLE recipe:

1. Write down the likelihood. This depends on the kind of random process you are considering.

2. Find the best-fit value of the parameter θ by maximising the likelihood L as a function of θ. This is your MLE, θ_ML.

3. Evaluate the uncertainty on θ_ML (see next section).
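Applying the recipe to simulated Gaussian data (our own sketch; here steps 1 and 2 reduce to the closed forms of Eqs. (45) and (47)):

```python
import random

# Step 1: the likelihood is the Gaussian of Eq. (42). Step 2 then has
# closed-form answers: the sample mean (Eq. 45) for mu, and the
# unbiased 1/(N-1) variance (Eq. 47) for sigma^2.
random.seed(2)
true_mu, true_sigma, N = 10.0, 2.0, 100_000
xs = [random.gauss(true_mu, true_sigma) for _ in range(N)]

mu_ml = sum(xs) / N
sigma2_hat = sum((x - mu_ml) ** 2 for x in xs) / (N - 1)
```

With this many samples the estimates land very close to the true values μ = 10 and σ² = 4.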


    10 Confidence intervals

Consider a general likelihood function, L(θ), and let us expand ln L around its maximum:

ln L(θ) = ln L(θ_ML) + ∂ln L(θ)/∂θ |_{θ_ML} (θ − θ_ML) + (1/2) ∂²ln L(θ)/∂θ² |_{θ_ML} (θ − θ_ML)² + . . . (48)

The second term on the RHS vanishes (by definition of the Maximum Likelihood value), hence we can approximate the likelihood as

L(θ) ≈ L(θ_ML) exp(−(θ − θ_ML)²/(2σ_θ²)), (49)

with

1/σ_θ² = − ∂²ln L(θ)/∂θ² |_{θ_ML}. (50)

So a general likelihood function can be approximated as a Gaussian around its peak, as shown by Eq. (49).

Application: going back to the example given by the likelihood of Eq. (42), we can use the above result to estimate the width of the likelihood function around the peak. This expresses the uncertainty in our estimation of the mean, Eq. (45). Applying Eq. (50) to the likelihood of Eq. (42) we obtain

σ_μ² = σ²/N (51)

(this result can also be derived directly by manipulating the likelihood function). This means that the standard deviation (i.e., the uncertainty) on our ML estimate for μ is proportional to 1/√N, with N being the number of measurements.

As the likelihood function can be approximated as a Gaussian (at least around the peak), we can use the results for a Gaussian distribution to approximate the probability content of an interval around the ML estimate for the mean. The interval [μ_min, μ_max] is called a 100α% confidence interval for the mean if P(μ_min < μ < μ_max) = α.

So, for example, the interval [μ_ML − σ_μ < μ < μ_ML + σ_μ] is a 68.3% confidence interval for the mean (a so-called 1σ interval), while [μ_ML − 2σ_μ < μ < μ_ML + 2σ_μ] is a 95.4% confidence interval (a 2σ interval).

One has to be careful with the interpretation of confidence intervals, as this is often misunderstood! Interpretation: if we were to repeat an experiment many times, and each time report the observed 100α% confidence interval, we would be correct 100α% of the time. This means that (ideally) a 100α% confidence interval contains the true value of the parameter 100α% of the time.

In a frequentist sense, it does not make sense to talk about the probability of μ. This is because every time the experiment is performed we get a different realization (different samples), hence a different numerical value for the confidence interval. Each time, either the true value of μ is inside the reported confidence interval (in which case, the probability of it being inside is 1) or the true value is outside (in which case its probability of being inside is 0). Confidence intervals do not give the probability of the parameter! In order to do that, you need Bayes theorem.
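This repeated-experiment interpretation can be demonstrated by simulation (our own sketch: repeat the experiment many times and count how often the reported 1σ interval covers the truth):

```python
import random
from math import sqrt

# Repeatedly draw N Gaussian samples and report mu_ML +/- sigma/sqrt(N)
# (Eqs. 45 and 51); the true mean should fall inside ~68.3% of the time.
random.seed(3)
true_mu, sigma, N, trials = 0.0, 1.0, 20, 10_000
half_width = sigma / sqrt(N)

hits = 0
for _ in range(trials):
    xs = [random.gauss(true_mu, sigma) for _ in range(N)]
    mu_ml = sum(xs) / N
    if mu_ml - half_width < true_mu < mu_ml + half_width:
        hits += 1

coverage = hits / trials
```

Note that what varies from repetition to repetition is the interval, not the true value, exactly as the frequentist reading requires.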


    11 Propagation of errors

Suppose we have measured a quantity x, obtaining a measurement x̄ ± σ_x. How do we propagate the measurement onto another variable y = y(x)?

Taylor expanding y(x) around x̄ we obtain:

y(x) ≈ y(x̄) + (x − x̄) ∂y/∂x |_{x=x̄} + . . . (52)

Truncating the expansion at linear order, the expectation value of y is given by:

E(y) = E(y(x̄)) + ∂y/∂x |_{x=x̄} E(x − x̄) = y(x̄), (53)

because E(x − x̄) = 0. The variance of y is given by:

V(y) = E([y(x) − E(y(x))]²) = E([y(x) − y(x̄)]²) = (∂y/∂x |_{x=x̄})² σ_x². (54)

So the variance on y is related to the variance on x by

σ_y² = (∂y/∂x |_{x=x̄})² σ_x². (55)

Generalization to functions of several variables: if y = y(x_1, . . . , x_N) then

σ_y² = Σ_{i=1}^N (∂y/∂x_i |_{x=x̄})² σ_{x_i}². (56)

Special cases:

1. Linear relationship: y = ax. Then σ_y = a σ_x.

2. Product or ratio: e.g. y(x_1, x_2) = x_1 x_2 or y(x_1, x_2) = x_1/x_2. Then

σ_y²/y² = σ_{x_1}²/x_1² + σ_{x_2}²/x_2². (57)
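Eq. (55) can be checked numerically for a non-linear function (our own sketch: y = x² with a small σ_x, so that the linearisation is accurate):

```python
import random
from math import sqrt

# Propagate x = 3.0 +/- 0.01 through y = x^2.
# Eq. (55) predicts sigma_y = |dy/dx| * sigma_x = 2 * 3.0 * 0.01 = 0.06.
random.seed(4)
xbar, sigma_x, trials = 3.0, 0.01, 100_000

ys = [random.gauss(xbar, sigma_x) ** 2 for _ in range(trials)]
mean_y = sum(ys) / trials
sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / trials)

predicted = 2 * xbar * sigma_x   # 0.06
```

The simulated scatter of y agrees with the linear-propagation prediction to within sampling noise; the agreement would degrade if σ_x were made large compared to the scale over which y(x) curves.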

Systematic vs random errors: errors are often divided into these two categories. Any measurement is subject to statistical fluctuations, which means that if we repeat the same measurement we will obtain every time a slightly different outcome. This is a statistical (or random) error. Random errors manifest themselves as noise in the measurement, which leads to variability in the data each time a measurement is made.

On the other hand, systematic errors do not lead to variability in the measurement, but are the cause for data to be systematically off all the time (e.g., measuring a current in A while the apparatus really gives mA would lead to a factor of 1000 systematic error all the time). Systematic errors are usually more difficult to track down. They might arise from experimental mistakes, or because of unmodelled (or unrecognized) effects in the system you are measuring.


    12 Bayesian statistics

Bayes theorem, Eq. (5), encapsulates the notion of probability as degree of belief. The Bayesian outlook on probability is more general than the frequentist one, as the former can deal with unrepeatable situations that the latter cannot address.

We replace in Bayes theorem, Eq. (5), A → θ (the parameters) and B → d (the observed data, or samples), obtaining

P(θ|d) = P(d|θ)P(θ) / P(d). (58)

On the LHS, P(θ|d) is the posterior probability for θ (or "posterior" for short), and it represents our degree of belief about the value of θ after we have seen the data d.

On the RHS, P(d|θ) = L(θ) is the likelihood we already encountered. It is the probability of the data given a certain value of the parameters. The quantity P(θ) is the prior probability distribution (or "prior" for short). It represents our degree of belief in the value of θ before we see the data. This is an essential ingredient of Bayesian statistics. In the denominator, P(d) is a normalizing constant (often called "the evidence"), that ensures that the posterior is normalized to unity:

P(d) = ∫ dθ P(d|θ)P(θ). (59)

    The evidence is important for Bayesian model selection (not covered in this course).

Interpretation: Bayes theorem relates the posterior probability for θ (i.e., what we know about the parameter after seeing the data) to the likelihood. It can be thought of as a general rule to update our knowledge about a quantity (here, θ) from the prior to the posterior. A result known as Cox's theorem shows that Bayes theorem is the unique generalization of boolean algebra in the presence of uncertainty.

Remember that in general P(θ|d) ≠ P(d|θ) (see the example of the pregnant woman), i.e. the posterior and the likelihood are two different quantities with different meanings!

Bayesian inference works by updating our state of knowledge about a parameter (or hypothesis) as new data flow in. The posterior from a previous cycle of observations becomes the prior for the next. The price we have to pay is that we have to start somewhere by specifying an initial prior, which is not determined by the theory, but needs to be given by the user. The prior should represent fairly the state of knowledge of the user about the quantity of interest. Eventually, the posterior will converge to a unique (objective) result even if different scientists start from different priors (provided their priors are non-zero in regions of parameter space where the likelihood is large). See Fig. 7 for an illustration.

There is a vast literature about how to select a prior in an appropriate way. Some aspects are fairly obvious: if your parameter describes a quantity that has e.g. to be strictly positive (such as the number of photons in a detector, or an amplitude), then the prior will be 0 for values θ < 0.


Figure 7: Converging views in Bayesian inference. Two scientists having different prior beliefs p(θ) about the value of a quantity θ (panel (a), the two curves representing two different priors) observe one datum with likelihood L(θ) (panel (b)), after which their posteriors p(θ|d) (panel (c), obtained via Bayes theorem, Eq. (5)) represent their updated states of knowledge on the parameter. This posterior then becomes the prior for the next observation. After observing 100 data points, the two posteriors have become essentially indistinguishable (panel (d)).

A standard (but by no means trivial) choice is to take a uniform prior (also called "flat prior") on θ, defined as:

P(θ) = 1/(θ_max − θ_min) for θ_min ≤ θ ≤ θ_max, and 0 otherwise. (60)

With this choice of prior in Bayes theorem, Eq. (58), the posterior becomes functionally identical to the likelihood, up to a proportionality constant:

P(θ|d) ∝ P(d|θ) = L(θ). (61)

In this case, all of our previous results about the likelihood carry over (but with a different interpretation). In particular, the probability content of an interval around the mean for the posterior should be interpreted as a statement about our degree of belief in the value of θ (differently from confidence intervals for the likelihood).
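For the coin example of Eq. (34), a flat prior on [0, 1] makes the posterior proportional to θ(1 − θ)⁴, which can be normalised numerically on a grid (our own sketch; the analytic posterior mean, from the resulting Beta(2, 5) distribution, is 2/7):

```python
from math import comb

# Posterior on a grid of theta values under a flat prior, Eq. (61):
# the posterior weights are just the likelihood values of Eq. (34).
grid = [i / 1000 for i in range(1001)]
weights = [comb(5, 1) * t * (1 - t) ** 4 for t in grid]

# Normalising (the discrete analogue of the evidence, Eq. 59)
# turns the weights into posterior probabilities on the grid.
total = sum(weights)
post_mean = sum(t * w for t, w in zip(grid, weights)) / total
```

Unlike the confidence intervals of section 10, intervals read off this posterior are direct degree-of-belief statements about θ.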

Under a change of variable, ψ = ψ(θ), the prior transforms according to:

P(ψ) = P(θ) |dθ/dψ|. (62)

In particular, a flat prior on θ is no longer flat in ψ if the variable transformation is non-linear.
