Statistical Issues at LHCb

Yuehong Xie, University of Edinburgh
(on behalf of the LHCb Collaboration)

PHYSTAT-LHC Workshop, CERN, Geneva, June 27-29, 2007



LHCb: a dedicated B physics experiment at the LHC


Physicists know where to look for new physics in the flavour sector

- LHCb will look for effects of new physics in CP violation and rare phenomena in B meson decays
  - CMS and ATLAS will search for new particles produced directly
- In the Standard Model, quark flavour mixing and CP violation are fully determined by the CKM matrix with four parameters
  - Over-constraining the CKM matrix is a stringent test of the SM
  - Any inconsistency will mean a new source of flavour mixing and CP violation
- FCNC (flavour-changing neutral current) processes are forbidden at tree level in the SM, but new physics can have significant effects in them
  - Comparing asymmetry and/or rate measurements in FCNC processes with their SM predictions is a sensitive test of the SM
  - Any discrepancy will indicate the presence of new physics particles in FCNC processes


Statisticians know how to quantify new physics effects

- In the language of statistics, LHCb will perform hypothesis testing
- The null hypothesis: the SM is valid at the energy scale relevant to B meson decays
- No alternative hypothesis is given explicitly, but rejecting the SM means new physics is needed to describe B meson decays
- What LHCb needs to do is (see the sketch below):
  - Identify a test statistic with high power to separate the SM from potential NP models
  - Measure the test statistic from data
  - Evaluate the probability of an outcome at least as extreme under the null hypothesis, the p-value
  - If the p-value is too small, reject the null hypothesis
  - Otherwise go for another test statistic and repeat the test
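To make the procedure concrete, here is a minimal sketch (not LHCb code) of estimating a p-value with pseudo-experiments; the choice of test statistic, the expected SM yield and the "observed" value are all illustrative assumptions.

```python
# Minimal sketch: p-value of the SM null hypothesis from pseudo-experiments.
# The test statistic here is simply an event count in a sensitive region;
# all numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

expected_sm = 100.0          # expected yield under the SM-only hypothesis (assumed)
observed = 127               # hypothetical measured value of the test statistic

# Distribution of the test statistic under the null hypothesis (SM only)
toys = rng.poisson(expected_sm, size=200_000)

# p-value: probability, under the SM, of an outcome at least as extreme
p_value = np.mean(toys >= observed)
print(f"p-value under the SM null hypothesis: {p_value:.4f}")
# If the p-value is below the chosen threshold, the SM-only hypothesis is
# rejected for this observable; otherwise another test statistic is tried.
```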


Where can statistics/statisticians help B physics/physicists?

- B flavour tagging on a statistical basis
- Separating signal and background events
- Data modeling and fitting
- Setting confidence intervals and limits
- Controlling and treating systematics
- Optimizing analyses
- Testing the SM
- Providing analysis tools


Flavour tagging

- CP violation measurements with neutral B decays need to know the flavour of the B at production
- Information is carried by taggers:
  - Charge of the particles accompanying the signal B at production: same-side tagging
  - Charge of the μ±, e± or K± from the decay of the opposite B hadron: lepton and kaon tagging
  - Weighted sum of the charges of all particles found to be compatible with coming from the opposite B decay: vertex-charge tagging
- The tagging result for a signal B is a decision on a statistical basis, with
  - Average tagging efficiency ε_tag
  - Average mistag probability ω = N_W/(N_W + N_R), typically 30-35%
  - Average statistical power ε_tag(1 - 2ω)², typically 4-10%
- How to maximize the statistical power?
  - Requires appropriate statistical methods (see the sketch below)
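As an illustration of the ε(1-2ω)² figure, the sketch below evaluates the effective tagging power for a few taggers; the efficiencies and mistag rates are assumed toy values, not LHCb performance numbers.

```python
# Minimal sketch: effective tagging power eps * (1 - 2*omega)^2 per tagger.
# The (eps, omega) pairs are assumed toy values, not LHCb performance figures.
taggers = {
    "opposite-side lepton": (0.10, 0.30),
    "opposite-side kaon":   (0.15, 0.31),
    "same-side kaon":       (0.20, 0.33),
    "vertex charge":        (0.30, 0.35),
}

for name, (eps, omega) in taggers.items():
    dilution = 1.0 - 2.0 * omega          # D = 1 - 2*omega
    power = eps * dilution ** 2           # effective tagging power eps * D^2
    print(f"{name:22s} eps*D^2 = {100.0 * power:.2f}%")

# Naively adding the per-tagger powers overstates the combined performance when
# taggers are correlated (e.g. vertex charge vs. kaon tagging) -- exactly the
# correlation issue discussed on the next slide.
```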


Statistical issues in flavour tagging

- A neural net is used to get the event-by-event mistag of each tagger. Performance depends on the way the NNs work and the way they are trained
- Treatment of the correlation between vertex charge and the other taggers is non-trivial
  - The other taggers may be included in the vertex
  - Compromise between correct handling of correlations and the statistics available to obtain the properties of sub-samples
- Hard assignment of particles to vertices causes a loss of statistical power. Need to investigate probability-based assignment of particles to vertices


Separating signal and background events

- A demanding task in a hadron-machine experiment
  - After the trigger, the ratio of inclusive bbbar background to signal in a typical channel is at the million level, and even bigger for very rare decays
  - In each bbbar event there are not only the two B hadrons but also ~100 tracks from pp interactions
- Information available:
  - PID
  - Kinematical: momentum, invariant masses
  - Geometrical: vertex χ², event topology
  - More …
- Typically 10-20 variables to look at, each alone with limited separation power
  - A cut-based analysis is not optimal for statistical precision
  - A multivariate analysis is more powerful but also more difficult for understanding systematics
  - Need a trade-off between precision and accuracy


Multivariate analysis

- Essentially about how to construct the best test statistic from many input variables for a hypothesis test
- Neyman-Pearson lemma: the likelihood ratio is the best
  - The method of representing the PDFs by multi-dimensional histograms filled with Monte Carlo data becomes impractical when the dimension of the PDFs is too big (see the sketch below)
- The alternative is to construct estimators that approach the likelihood ratio:
  - Decorrelated likelihood method
  - Linear estimators: Fisher's discriminants, …
  - Nonlinear estimators: neural networks, …
  - Boosted decision trees
  - …
- An implementation: TMVA (Toolkit for MultiVariate data Analysis) (arXiv:physics/0703039)
- Application in LHCb: the SM-forbidden Bs → e±μ∓ analysis
  - High efficiency has higher priority than small systematics
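For illustration, here is a toy sketch (synthetic Gaussian samples, assumed shapes) of the brute-force histogram likelihood ratio: with d variables and k bins per axis the binned PDFs need k^d cells, which is why the method breaks down as the dimension grows.

```python
# Minimal sketch (synthetic data): a histogram-based likelihood ratio.
# With d variables and k bins per axis the PDF estimate needs k**d cells,
# which quickly becomes impractical as d grows.
import numpy as np

rng = np.random.default_rng(2)
d, k = 2, 20                                 # 2 variables, 20 bins each -> 400 cells
edges = [np.linspace(-5.0, 5.0, k + 1)] * d

sig_mc = rng.normal(+1.0, 1.0, size=(50_000, d))   # toy signal MC
bkg_mc = rng.normal(-1.0, 1.5, size=(50_000, d))   # toy background MC

h_sig, _ = np.histogramdd(sig_mc, bins=edges, density=True)
h_bkg, _ = np.histogramdd(bkg_mc, bins=edges, density=True)

def likelihood_ratio(x):
    """Approximate L_sig / L_bkg for one event from the binned MC PDFs."""
    idx = tuple(np.clip(np.digitize(x[j], edges[j]) - 1, 0, k - 1) for j in range(d))
    return h_sig[idx] / max(h_bkg[idx], 1e-12)

print(likelihood_ratio(np.array([0.5, 0.2])))
```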


Application of TMVA in Bs → e±μ∓

- Two phases:
  - Training
  - Application
- Leave variables with clear separation power outside TMVA and cut on them
- 5 input variables in TMVA: no linear correlation expected
- Winner: the decorrelated likelihood method
  - As the Neyman-Pearson lemma tells us?

[Figure: expected background yield N_bg per fb⁻¹ versus signal efficiency (%) for the different classifiers, with the region of interest indicated.]


Overtraining with TMVA

- TMVA has a mechanism to monitor overtraining, using two independent training and testing samples
- A way to control overtraining in an early phase is desirable


Data modeling and fitting

- Maximum likelihood fits are generally used in B physics measurements
- Modeling and fitting are made easy with RooFit (http://roofit.sourceforge.net)
- LHCb can benefit from this package and wishes to
  - Have a goodness-of-fit for unbinned maximum likelihood fits implemented
  - Understand how to speed up toy event generation for complicated PDFs
  - Understand how to make a fit converge if a non-factorizable multi-dimensional PDF has no analytical normalization and can only rely on numerical integration (see the sketch below)
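As a toy illustration of the last point (not a RooFit example), the sketch below performs an unbinned maximum-likelihood fit in which the PDF has no closed-form normalization, so the normalization integral is computed numerically inside the NLL; the model and all numbers are assumptions.

```python
# Minimal sketch: unbinned ML fit with a numerically evaluated normalization.
# The PDF (decay law times an acceptance-like factor) has no closed-form
# integral; everything here is a toy assumption, not an LHCb model.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
T_MAX = 10.0

def unnormalised_pdf(t, tau):
    return np.exp(-t / tau) * (1.0 - np.exp(-0.5 * t ** 2))

def sample(n, tau):
    """Accept-reject generation from the toy model on [0, T_MAX]."""
    out = np.empty(0)
    while out.size < n:
        t = rng.uniform(0.0, T_MAX, size=2 * n)
        keep = rng.uniform(0.0, 1.0, size=t.size) < unnormalised_pdf(t, tau)
        out = np.concatenate([out, t[keep]])
    return out[:n]

data = sample(2000, tau=1.5)

def nll(tau):
    norm, _ = quad(unnormalised_pdf, 0.0, T_MAX, args=(tau,))  # numerical normalization
    return -np.sum(np.log(unnormalised_pdf(data, tau) / norm))

fit = minimize_scalar(nll, bounds=(0.1, 10.0), method="bounded")
print("fitted tau:", fit.x)   # should come out close to the generated 1.5
```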


Confidence intervals/limits

- As in all other experiments, we need to quote confidence intervals/limits for all measurements
- The issue is especially important when working on very rare decays with a small signal and huge background
  - Significant signals: intervals to establish a discrepancy with the SM
  - Insignificant signals: limits to exclude some new physics models
- An example: Bs → μ⁺μ⁻ sensitivity
  - BR(Bs → μ⁺μ⁻)_SM = 3.4 × 10⁻⁹, but enhanced in some new physics models
  - Exclusion limit: the confidence limit on the signal branching ratio when the N events generated in a Monte Carlo experiment are all background; a measure of sensitivity


Exclusion limit for BR(Bs → μ⁺μ⁻)

- Step 1: construct geometrical, muon-id and invariant-mass likelihood ratios between the signal and background hypotheses for each event
  - decorrelated likelihood method
- Step 2: divide the 3D space into a number of bins and count the events d_i in each bin
  - no cut, and N-counting
- Step 3: estimate the number of expected background events b_i and signal events s_i (for each assumed branching ratio) in each bin
- Step 4: construct a total likelihood ratio between the signal+background and background-only hypotheses for the whole configuration:

  X = Π_i Poisson(d_i; s_i + b_i) / Π_i Poisson(d_i; b_i)


Exclusion limit for BR(Bs → μ⁺μ⁻) (continued)

- Step 5: evaluate the p-value of each hypothesis
  - Signal+background: probability(X < X_obs)
  - Background-only: probability(X > X_obs)
- Step 6: compute CLs (Thomas Junk, CERN-EP/99-041):

  CLs = (p-value of the signal-plus-background hypothesis) / (1 − p-value of the background-only hypothesis)

- Step 7: make the statistical statement:
  - If CLs(BR) < α, the assumed BR is excluded at the 1−α confidence level

A sketch of steps 4-7 on a toy configuration is given below.
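The sketch below runs the binned likelihood-ratio test statistic X of step 4, the pseudo-experiment p-values of step 5, and the CLs of step 6 on a toy configuration. The per-bin expectations s_i, b_i and the "observed" counts are assumed numbers, not the LHCb sensitivity inputs.

```python
# Minimal sketch: binned likelihood-ratio test statistic X (step 4), p-values
# from pseudo-experiments (step 5) and CLs (step 6).  All inputs are toy
# assumptions for one hypothesised branching ratio.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(4)

s = np.array([0.4, 1.1, 2.0])       # expected signal per bin (assumed)
b = np.array([30.0, 8.0, 1.5])      # expected background per bin (assumed)
d_obs = np.array([29, 9, 2])        # "observed" counts (toy)

def log_X(d):
    """ln X = sum_i [ln Poisson(d_i; s_i+b_i) - ln Poisson(d_i; b_i)]."""
    return np.sum(poisson.logpmf(d, s + b) - poisson.logpmf(d, b))

x_obs = log_X(d_obs)
n_toys = 50_000
toys_sb = np.array([log_X(rng.poisson(s + b)) for _ in range(n_toys)])
toys_b = np.array([log_X(rng.poisson(b)) for _ in range(n_toys)])

p_sb = np.mean(toys_sb < x_obs)            # p-value of signal+background: P(X < X_obs)
one_minus_p_b = np.mean(toys_b <= x_obs)   # 1 - P(X > X_obs) for background-only
cls = p_sb / one_minus_p_b

alpha = 0.10
verdict = "excluded" if cls < alpha else "not excluded"
print(f"CLs = {cls:.3f}: this BR is {verdict} at the {1 - alpha:.0%} confidence level")
```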


BR(Bs → μ⁺μ⁻) exclusion limit results

[Figure: excluded BR (× 10⁻⁹, scale from 1 to 100) at CLs = 10% as a function of the integrated luminosity (0.0-0.5 fb⁻¹), comparing the counting and N-counting analyses.]


Exclusion limit for BR(Bs → μ⁺μ⁻)

- The combination of the various techniques is shown to work better than the cut-and-count method
- The "Statistics Review" of the PDG 2006 claims that the CLs limit is conservative and that the confidence level is under-estimated
  - The usual procedure requires the p-value, not the CLs, of a hypothesis to be smaller than α to exclude it at the 1−α confidence level
- What is the common understanding of how to set better limits in the circumstance of an insignificant signal?


Controlling systematics

- Systematics arise from incorrect modeling of detector and/or background effects
- Delicate statistical methods are needed to extract these effects from real data and model them
- Example: two methods to deal with the efficiency as a function of decay time, ε(t), in a time-dependent analysis
  - Per-event ε_i(t): not covered in this talk
  - Normalization trick: described by Stéphane T'Jampens, https://oraweb.slac.stanford.edu/pls/slacquery/BABAR_DOCUMENTS.DetailedIndex?P_BP_ID=3629 (French thesis)


Normalization trick

- Factorized signal PDF:

  p(t, Ω; A) = h(A, t, Ω) f(t) g(Ω) ε(t, Ω) / ∫ h(A, t, Ω) f(t) g(Ω) ε(t, Ω) dt dΩ

  - A: physical parameters
  - t: decay time
  - Ω: position in phase space

- Likelihood maximization:

  ln L = Σ_j ln p(t_j, Ω_j; A)

  d ln L/dA = Σ_j ∂/∂A ln[h(A, t_j, Ω_j) f(t_j) g(Ω_j) ε(t_j, Ω_j)] − N [Σ_i ∂h(A, t_i, Ω_i)/∂A] / [Σ_i h(A, t_i, Ω_i)]

  where the i-sums run over accepted Monte Carlo events and replace the normalization integral and its derivative.

  - Integrated factors ξ_i obtained using MC simulation
  - No need for the specific shape of ε(t, Ω)

A sketch of this trick on a toy model is given below.
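Below is a toy sketch of the normalization trick in one dimension (decay time only): the acceptance enters the data implicitly, and in the likelihood its effect is absorbed into a Monte Carlo sum of h(A, t_i) ratios over accepted MC events, so ε(t) never has to be parametrized. The model, the reference lifetime and all numbers are assumptions.

```python
# Minimal sketch of the normalization trick in 1D (decay time only).
# The acceptance is never parametrized in the fit: the normalization integral
# is replaced by a sum over accepted Monte Carlo events generated at a
# reference parameter value.  All shapes and numbers are toy assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
TAU_GEN = 1.3                               # lifetime used to generate the MC sample

def h(tau, t):
    """Physics part of the PDF (toy exponential decay law)."""
    return np.exp(-t / tau) / tau

def acceptance(t):
    """'Unknown' acceptance -- used only to generate the toys here."""
    return 1.0 - np.exp(-2.0 * t)

def generate(n, tau):
    t = rng.exponential(tau, size=5 * n)
    t = t[rng.uniform(size=t.size) < acceptance(t)]
    return t[:n]

mc_accepted = generate(200_000, tau=TAU_GEN)   # accepted reference MC
data = generate(5_000, tau=1.5)                # toy data, true lifetime 1.5

def nll(tau):
    # int h(tau,t)*eps(t)*f_gen(t) dt is proportional to the MC average of
    # h(tau, t_i) / h(TAU_GEN, t_i) over accepted MC events.
    norm = np.mean(h(tau, mc_accepted) / h(TAU_GEN, mc_accepted))
    return -(np.sum(np.log(h(tau, data))) - data.size * np.log(norm))

fit = minimize_scalar(nll, bounds=(0.5, 5.0), method="bounded")
print("fitted lifetime:", fit.x)   # close to 1.5 despite the unmodelled acceptance
```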


Fitting with background

- We cannot know the "ideal" background distributions, without detector effects, from physical law
- Solution: use a pseudo-log-likelihood to avoid the need for background distributions (Phys. Rev. D71 (2005) 032005):

  ln L = Σ_{i=1..N_sig} ln p(t_i, Ω_i; A) − (N_sb/N_b) Σ_{j=1..N_b} ln p(t_j, Ω_j; A)

  - N_sig/N_b: number of events in the signal/background region
  - N_sb: number of expected background events in the signal region

A toy sketch of this pseudo-likelihood is given below.
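The sketch below implements the sideband-subtraction pseudo-log-likelihood on a toy setup: the signal decay-time PDF p is a simple exponential, the background time distribution is never modelled, and sideband events enter with weight N_sb/N_b. All sample sizes and shapes are assumptions.

```python
# Minimal sketch (toy setup) of the sideband-subtraction pseudo-log-likelihood:
# the background shape is never modelled; sideband events are subtracted
# inside the likelihood with weight N_sb / N_b.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)

# Toy decay times: signal region (signal + background leakage) and sideband
# (background only).  Numbers are assumed.
t_signal_region = np.concatenate([rng.exponential(1.5, 900),    # true signal
                                  rng.exponential(0.4, 100)])   # background in signal region
t_sideband = rng.exponential(0.4, 2000)                         # background region
n_sb_expected = 100.0                                           # expected bkg in signal region

def p_sig(t, tau, t_max=20.0):
    """Signal decay-time PDF, normalised on [0, t_max]."""
    return np.exp(-t / tau) / (tau * (1.0 - np.exp(-t_max / tau)))

def pseudo_nll(tau):
    w = n_sb_expected / t_sideband.size
    return -(np.sum(np.log(p_sig(t_signal_region, tau)))
             - w * np.sum(np.log(p_sig(t_sideband, tau))))

fit = minimize_scalar(pseudo_nll, bounds=(0.5, 5.0), method="bounded")
print("fitted signal lifetime:", fit.x)
# Note (next slide): the parabolic error from such a fit is too optimistic;
# the variance needs the extra background terms.
```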


Fitting with background (continued)

- The errors returned by Minuit are too optimistic for this pseudo-log-likelihood
- The true variance of a fit parameter is the sum of three terms:
  - One from signal events in the signal region
  - One from background events in the signal region
  - One from events in the background region, which vanishes as N_b → ∞
- No need for a background PDF
- (Private communication with Joe Boudreau)


Estimating systematics

- Every effort has been made to model the detector/background effects correctly
- However, some assumptions have to be used
  - E.g., that background events in the sideband and in the signal region have the same properties
- Good agreement between MC data and real data is also required
  - Not everything can be obtained from real data
- What is the proper procedure to estimate systematic errors in the treatment of detector/background effects, so that at least different people come to consistent estimates of the systematic uncertainties if the same analysis method is used?
  - What is the rule to set, e.g., a "1σ systematic error"? Which quantities should be varied to obtain it? By how much?
  - What is the statistical meaning of a "1σ systematic error"?


Analysis optimization

- What is the target function to optimize in order to obtain the best statistical precision for CP measurements?
  - Signal events do not carry equal weights
  - Events with a smaller mistag probability and better time resolution contribute more (see the sketch below)
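One possible way to phrase such a target function (an assumption on my part, not an LHCb-endorsed figure of merit) is an effective event count in which the tagging dilution (1 − 2ω_i) and a Gaussian time-resolution dilution exp(−Δm²σ_t,i²/2) enter quadratically, so well-tagged, well-measured events count more:

```python
# Minimal sketch of a possible optimization target: an effective event count
# weighted by the squares of the tagging dilution (1 - 2*omega_i) and of a
# Gaussian time-resolution dilution exp(-(dm*sigma_t_i)^2 / 2).  The figure of
# merit and all numbers are illustrative assumptions, not an LHCb prescription.
import numpy as np

rng = np.random.default_rng(7)
dm = 17.7                                   # Bs oscillation frequency in ps^-1 (assumed)

omega = rng.uniform(0.25, 0.45, 10_000)     # per-event mistag probabilities (toy)
sigma_t = rng.uniform(0.03, 0.07, 10_000)   # per-event time resolutions in ps (toy)

weights = (1.0 - 2.0 * omega) ** 2 * np.exp(-(dm * sigma_t) ** 2)
print(f"effective statistics: {weights.sum():.0f} of {omega.size} tagged events")
# Maximizing this weighted sum, rather than the raw event count, is one way to
# make "events with smaller mistag and better resolution count more" explicit.
```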


SM test: CKM fit

- Different measurements of the sides and angles of the unitarity triangle should be consistent
  - UT Fit: Bayesian
  - CKM Fitter: frequentist
  - Which one better serves this purpose?
- Questions:
  - How to deal with theoretical uncertainties? Do they have frequentist properties?
  - How do we know whether an inconsistency is due to NP or to underestimated systematics and theoretical uncertainties?
- Should give a global χ² as a measure of agreement with the SM


SM test: rare decays?

- LHCb will measure many rare decays
- Individually they are all good probes of NP
- How to combine them to get the best sensitivity?
- Not an easy job
  - The SM relations between these quantities are not explicitly given
  - The SM prediction for each of them carries a big uncertainty
- Needs a lot of work in physics
  - Understand the correlations between the SM predictions from physics and quantify the correlations with an error matrix
- Also some thinking in statistics
  - Construct a test statistic using the measurements and their SM predictions, the latter (or both) with correlated errors (see the sketch below)
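One simple form such a test statistic could take (a sketch with placeholder numbers, not an LHCb result) is a χ² built from several rare-decay observables and their SM predictions with a combined covariance V = V_exp + V_theory carrying the correlated theory errors:

```python
# Minimal sketch: chi^2 test statistic for several observables against their
# SM predictions, with correlated theory uncertainties folded into a combined
# covariance matrix.  All numbers are placeholders.
import numpy as np
from scipy.stats import chi2

measured = np.array([1.05, 0.92, 1.10])      # measurements in SM-normalised units (toy)
predicted = np.array([1.00, 1.00, 1.00])     # SM predictions (toy)

v_exp = np.diag([0.08, 0.10, 0.12]) ** 2     # experimental covariance (toy, uncorrelated)
v_theory = 0.05 ** 2 * np.array([[1.0, 0.6, 0.3],   # correlated theory errors (toy)
                                 [0.6, 1.0, 0.4],
                                 [0.3, 0.4, 1.0]])

residual = measured - predicted
cov = v_exp + v_theory
chi2_value = residual @ np.linalg.solve(cov, residual)
# df = number of observables, since no parameters are fitted in this toy
p_value = chi2.sf(chi2_value, df=len(measured))
print(f"chi2 = {chi2_value:.2f}, p-value under the SM = {p_value:.3f}")
```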


Analysis tools

- Tools we have used:
  - Data storage and general processing / Root
  - Minimization / Minuit
  - Data modeling and fitting / RooFit
  - Multivariate analysis / TMVA
  - Data unfolding / sPlot
  - Neural networks
- Tools we may need:
  - A frequentist limit-setting tool
  - A Bayesian analysis tool
  - A reliable numerical multi-dimensional integrator


Wish-list

- A well-supported tool for data modeling and fitting, which can handle general multi-dimensional problems numerically
- Better understanding of how to do multivariate analysis
- Better understanding of how to treat systematics and theoretical uncertainties in SM tests
- New statistical methods to control systematics using real data
- New statistical methods to improve flavour tagging
- Better understanding of how to set confidence limits in the case of an insignificant signal
- Statisticians' recommendations on statistical procedures in data analysis

And most importantly, a successful LHC(b)!


Spare slides


Decorrelated likelihood method used for Bs → μ⁺μ⁻

- Decorrelate the input variables for signal and background separately
- A very similar method is described by Dean Karlen, Computers in Physics Vol. 12, No. 4, Jul/Aug 1998
- The n input variables x1, x2, …, xn (IP, DOCA, …) are transformed into n variables s1, s2, …, sn that are independent and Gaussian-distributed for signal, giving χ²_S = Σ s_i²; the same is done for background, giving b1, b2, …, bn and χ²_B = Σ b_i²
- Discriminating variable: −2 ln(L_sig/L_bg) = χ²_S − χ²_B (a sketch is given below)
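Here is a toy sketch of the linear-decorrelation part of this discriminant: each event is whitened with a transform derived from signal MC and with one derived from background MC, and the discriminant is χ²_S − χ²_B. The Gaussianization of the individual variables (part of the full Karlen-style method) is omitted, and all inputs are synthetic.

```python
# Minimal sketch (toy data) of the decorrelated likelihood discriminant:
# each event is whitened twice -- with a transform from signal MC and one from
# background MC -- and the discriminant is the difference of the two chi^2.
import numpy as np

rng = np.random.default_rng(8)

# Toy MC: 3 correlated input variables for signal and background.
sig_mc = rng.multivariate_normal([1.0, 0.5, 2.0],
                                 [[1.0, 0.4, 0.1], [0.4, 1.0, 0.2], [0.1, 0.2, 1.0]], 50_000)
bkg_mc = rng.multivariate_normal([0.0, 0.0, 0.0],
                                 [[2.0, -0.5, 0.0], [-0.5, 1.5, 0.3], [0.0, 0.3, 1.0]], 50_000)

def whitening(sample):
    """Return mean and a matrix W such that W @ (x - mean) has unit covariance."""
    mean = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    w = np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return mean, w

mu_s, w_s = whitening(sig_mc)
mu_b, w_b = whitening(bkg_mc)

def discriminant(x):
    """chi2_S - chi2_B, an approximation of -2 ln(L_sig / L_bkg)."""
    s = w_s @ (x - mu_s)
    b = w_b @ (x - mu_b)
    return s @ s - b @ b

print(discriminant(np.array([1.2, 0.4, 1.8])))   # signal-like event -> negative value
```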


Lessons learned with TMVA

- Overtraining is a general problem in this kind of analysis when the training sample has low statistics
- TMVA splits the sample into two
  - One for training and one for testing and evaluation
- Would it make more sense to use three independent samples?
  - Sample A for training
  - Sample B for deciding when to stop training, by looking at the performance difference between A and B
  - Sample C for evaluating the performance
  - This is because a correlation may be introduced between A and B when B is used to decide when to stop training