50
Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer Modern Methods of Data Analysis Introduction SS 2010 – Universität Heidelberg Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis Introduction

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis

Introduction

SS 2010 – Universität Heidelberg

Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

What Do You Learn in this Lecture?● How to extract PHYSICS knowledge from

measurements in (particle) physics

● Acquire competence in

– understanding the statistical tools needed for data analysis

– understanding the role of uncertainties and probabilities in relating experimental data and theory

● Beeing able to perform analysis of real data applying these techniques

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

What you NOT learn in this Lecture?

● How experimental detectors work

● How the standard model is build

● How to calculate Feynman diagrams

● How to design large software projects

● Mathematical regirous proof of all tools

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

The Particle Physics case ...

Why would one need to know “statistical methods for data analysis” in physics?

Why should I need to learn such methods?

Let's just consider the case of the Standard Model in particle physics ...

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Components of the Standard Model

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

THE STANDARD MODEL!!!

Experiments confirm Standard Model to incredible accuracy !

Everything is great -

We found THE THEORY!

... is this really all ???

Example: Z cross-section:

uncertainties smaller than symbol size, data confirm hypothesis: 3 light neutrinos

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Methodology of Particle Physics

Here statistical methods are crucial!

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

How do we reach those goals?

● Introduction to statistical concepts in the lectures with integrated exercises

● Hands-on work in the computer exercises optional at end of semester: reproduce physics analysis on real (LHC) data -> presentation

● Slides of lecture available Monday afternoon before the lecture http://www.physi.uniheidelberg.de/~menzemer/statistik_vorl_10.html

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Computer Exercises:● Place: CIP-Pool, Albert-Ueberle-Str. 3-5

● Potential dates:

– Wednesday: 14.00-17.00h

– Thursday: 14.00-17.00h

● Which programming skills do you have?

– C++, ROOT, other languages

● Do you plan to attend computer course?

– Do you have a Rechenzentrums-account?

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Credit Points

● Lecture: 2 CP - requirements: short oral exam at end of the term

(can be decided shortly before end of the term)

● Computer Exercises: 2 CP - requirement: (very) frequent presence

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Outline of the Lecture

● Basics Concepts & Definitions ● Characteristics of distribution ● Monte Carlo generators ● Important distributions

● Error propagation ● Estimators

– Maximum Likelihood– Least Square method

● Confidence Level and Limits ● Hypothesis Tests

● Add. Material

– multivariant systems, analysis bias, numerical methods, function minimization

introduction/bring everyoneto same level

core part of the lecture

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

The Cheating Baker● Once upon a time, in a holiday resort the landlord L. ran a

profitable B&B, and every morning bought 30 rolls for breakfast. By law the mass of a single roll was required to be 75 g. One fine day the owner of the bakery changed, and L. suspected that the new baker B. might be cheating. So he decided to check the mass of what he bought, using a kitchen scales with a resolution of 1g. After one month he had collected a fair amount of data:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Data Reduction● the raw list of number is not very useful

– need some kind of data reduction● assume that all measurements are equivalent

– the sequence of individual data does not matter– all relevant information is contained in the number of

counts per reading

- much improved presentation of the collected information- the above numbers cover the entire data set- most of the measurements are lower than 75g ...

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Visualization● an even better presentation of the available information

– bar-chart of the tables given before● example for the concept of a histogram

– define bins for the possible values of a variable– plot the number of entries in each bin

● immediate grasp of center and width of the distribution

The rolls produced by baker B. definitely show a lack of doe. So L.was right in his suspicion, that B tried to make some extra profit by cheating ....

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

and the Conclusion● As a consequence of his findings, L. complained to B. and even

threatened to inform the police about his doings. B. apologized and claimed that the low mass of the rolls was an accident which will be corrected in the future. L., however, continues to monitor the quality delivered by the baker. One month later, B. inquired again about the quality of his products, asking whether now everything was all right. L., for his part, acknowledged that the weight of the rolls now matched his expectations, but he also voiced the opinion that B. was still cheating ...

B. simply selected the heaviest rolls for L. !

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Literature● Statistische und numerische Methoden der

Datenanalyse (V. Blobel/E.Lohrmann)

● Statistical Data Analysis (Glen Cowan)

● Statistics: A guide to the use of statistical methods in Physical Sciences (R. J. Barlow)

Recommendation for a rainy Sunday afternoon:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Many good lectures on statistical methods around

● http://elbanet.ethz.ch/wikifarm/cgrab/index.php?n=Stamet.StametMain Lecture SS07 Christoph Grab & Christian Regenfus Uni Zuerich

● Michael Feindt, Guenther Quast, IEKP Karlsruhe, many different lectures and many good exercise examples!

● Guenter Duckeck, Madjid Boutemeur Lecture 7.10. - 11.10.2002http://www-alt.physik.uni-muenchen.de/kurs/Computing/stat/stat.html

● Ian C. Brock, WS 2000/01, Universitaet Bonn

http://www-zeus.physik.uni-bonn.de/~brock/teaching/stat_ws0001/index.html

● Michael Schmelling, Heidelberger Graduiertentag 2006

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis

Lecture I - 20.04.2010

Content:

- Basic Concepts of Probability

- Definition of random variables, Probability density, mean, variance ...

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Predictable

For easy classical physical processes the result isprecisely determined→ determinism “Determinisums”

Examples are :

pendular, orbit of planets, billard ...

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Contingency

Random events are as a matter of principle not predictable (even with precise knowledge of the starting conditions)

Examples:

► Lotto (depend on many quantities, deterministic chaos)

► radioactive decay (quantum mechanics)

► electronic noise

► measurement uncertainties

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Quantum MechanicsEach time something different happens ....

events of the CDF experiment

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Measurements

Experiments measure frequency distributions:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Probability vs. Statistics● Probability: from theory to data

– start with a well-defined problem and calculate all possible outcomes of a specific experiment

● Statistics: from data to theory

– Try to solve the inverse problem: starting from a data set, want to deduce the rules/laws → this is analyzing experimental data

– deals with parameter estimation ● determine parameters and uncertainties in

unbiased and efficient way● hypothesis testing (agreement,confidence …)

- most challenging: to match question and answers

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Definition/Interpretation of Probability

● mathematical probability (axiomatic) all definitions of probability need to follow these axioms

● interpretation as rel. Frequency(Frequentists)● subjective probability (Bayesian)

Beware of different “names” used in literature!

Lucky enough, in all useful cases they use the same combinatorial rules of probabilities!

Interpretation of results differ sometimes ...Not go in detail about difference at this stage of the lecture.

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Kolmogorov Axioms (1931)

Formal approach, however from pragmatic point of view rather useless ...

Elementary events are considered, which exclude each other:

elementary event

set of all elements

probability of event

positive definite

additive

normalized

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Combination of Probabilities

Can be deduced from Kolmogorov axioms:

: A or B

: A and B

: not A

: P(A) if is reduced to B (conditional “bedingte” probability)

A and B are called independent if:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Frequentist Probability● empirical definition via frequency of occurrence

(R. von Mises)

– perform experiment N times in identical trials, assume event E occurs k times, then: P(E) = lim k/N

– intuitive definition for particle physics (many repetitions)

● very useful, but has some problems

– limes in strict mathematical sense does not exist, – i.e. can not be proven that it converges ...

how large is N? When does it converge?– What means repeatable under identical conditions?

Is similar OK as well? This IS difficult, if not impossible ...

– Not applicable for single events.

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Subjective (Bayesian) Probability

Reverend Thomas Bayes (1702-1761)

Probability is the degree of believe, that an experiment has a specific result.

Subjective probability compatible with Kolmogorov axioms

Essay Towards Solving a Problem in the Doctrine of Chances (1763), published postum inPhilosophical Transactions of the Royal Society of London

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Examples for Bayesian Probability

Frequentists interpretation often not applicable.

Than Bayesian interpretation only possible one.

Probability is the degree of belief that a hypothesis is true:

- The particle in this event is a positron.

- Nature is supersymmetric.

- It will rain tomorrow.

- Germany will win the soccer championship this year.

Often criticized as “subjective” and “non scientific”.

However based on simple probability computation (axioms).

Not contradicting the Frequentists approach, but include it.

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Comments on Probability● There are strong rivaling schools of these approaches

● Be beware of names:

– Bayesians call frequentists the “classical ones”– some call the frequency def. of probability “objective”

● Most scientist would claim to belong to the frequency school

● But we often talk in Bayesian probability terms; and we think

of probabilites as objective numbers

● In particular, any attempt of interpreating results of an experiment falls into the trap of repeatability.

● Bayes' Theorem will highlight this more ....

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Bayes' Theorem (1)

Conditional (“bedingte”) probability

Due to follows:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Examples: Rare Disease (1)

Probability of disease A: P(A) = 0.001 P(not A) = 0.999

Test for disease:

P(+| A) = 0.98 P(+| not A) = 0.03P(- | A) = 0.02 P(- | not A) = 0.97

Do you need to be worried if you get “+” as test result?What is the posteriori probability P(A|+)?

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Example: Rare Disease (2)

The posterior probability is only 3.2%, due to thetiny prior probability and the non negligible misidentification rate.

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Exercise: Particle ID

● proton-antiproton collision: 90% pions, 10% kaons

● kaon ID 95% efficient, pion mis-ID 6%

Example particle ID detector:

Question:

If the particle ID indicates a kaon, what is the probabilitythat it is a real kaon/a real pion?

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Bayes' Theorem (2)

Important through the interpretation A=theory,B=data

Posterior

Likelihood

Evidence

Prior

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Confirmation and Rejection

If I believe the SM is true, what is the probability to see no Higgs below 200 GeV:

P(no Higgs below 200 GeV | SM is true) < 0.000001

no Higgs below 200 GeV → SM is ruled out

If I see a Higgs below 200 GeV, how large is the probability that the SM is true?

P(SM is true | Higgs below 200 GeV exists)

depends on P(Higgs below 200 GeV in any model) and on the a priori probability that P(SM is true)

LikelihoodP

osteriori Probability

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Some Definitions

● experiment – e.g. throwing 2 times a dice

● realization: outcome of one specific experiment

● event space/sample space:

all possible outcomes

● random variable “Zufallsvariable”– function of outcome of experiment (e.g. sum

of dices)– event space and realization defined

accordingly for random variables– random variable can be either discrete or

continues

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Example● experiment: throwing two dices● event space of experiment

{11,12,13,14,15,16,21,22,23,...,64,65,66}● random variable x = sum of dices

– event space of x: {2,3,4,5,6,7,8,9,10,11,12}● realization of experiment e.g. 25 accordingly

realization of x=7

cumulated distribution u(x): probability to observe x or smaller value

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Probability density function (pdf)● Repeat an experiment with outcome characterized by

single continuous variable x● Definition: the probability to measure a value x in the

interval [x,x+dx] is give by probability density function f(x) (pdf) “Wahrscheinlichkeitsdichte”

pdf f(x) >=0 and normalized

pdf f(x) is NOT a probability, it has dimension 1/x!

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Question on PDFs

● List some examples of a probability density function f(x)– continuous ones– discrete ones

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Cumulative distribution function (cdf)

● Cumulative distribution function F(x), also know as probability distribution function. “Wahrscheinlichkeitsverteilung = Verteilungsfunktion”

● F(x') is interpreted as probability to find value x <= x'● F(x) is continuously non-decreasing function● F(-∞) = 0 and F(+∞)=1● is directly related to the probability density

function f(x) by:

● f(x) is given (for well-behaved distributions) by:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Example: cdf

● f(X) =

● Compute the function F(x)

● What is the probability to find x > 1.5?

● What is the probability to find x in [0.5,1]

|1-x| for x in [0,2]

0 elsewhere

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Arithmetic Mean

● Average number of children per family in Germany is 2.3

● Average lifetime expectation for men is 74 for women 78

● Average amount of semester for physics studies inHeidelberg is 11.2

● ...

Definition:

Examples:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Weighted Mean

● 5 measurements { } with different uncertainties { }

● arithmetic mean is special case of weighted means (same weight for each measurement)

Definition:

Example:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

... even more averages● Mode/Modus:

most probable value (highest bin in distribution)definition not unique, unimodal, bimodal distributions

● Median: smallest value which is ≥ than 50% of the events (median more robust against outliers than artithm .mean)

For unimodal, symmetric functions centered around :median = mode = arithmetic mean

else, for unimodal functions, empiric rulemean-mode = 3 x (mean-median)

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Examples:

● Give median, mode, arithmetic mean of:

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Energy loss in Material● Assume tracking stations with 12 layers

● The energy loss per traversed material (dE/dx)

of the particle traversing a layer follows a Landau distribution

mode

dE/dx

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Energy loss in Material● The mode of the dE/dx distribution as a function of

particle momentum is used to seperate different particle species

1 10[GeV/c]momentum

How to get estimate of mode from 12 measurements?

Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Truncated Mean● Discard 10/20% measurements with larges value

(symmetrize the function)

● Take arithmetic mean of remaining ones

● All about estimating the true mode/median/mean of distributions from given data set. More about estimators later.