Application of statistical methods for the comparison of data distributions

Barbara Mascialino, IEEE-NSS, October 21st, 2004


Susanna Guatelli, Barbara Mascialino, Andreas Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo Viarengo


• The comparison of two data distributions is fundamental in experimental practice

• Many algorithms are available for the comparison of two data distributions (the two-sample problem)

Aim of this study: compare the algorithms available in the statistics literature to select the most appropriate one in every specific case

Outline

• Detector monitoring (current versus reference data)

• Simulation validation (experiment versus simulation)

• Reconstruction versus expectation

• Regression testing (two versions of the same software)

• Physics analysis (measurement versus theory, experiment A versus experiment B)

• Parametric statistics

• Non-parametric statistics (goodness-of-fit testing)


The two-sample problem

Which is the most suitable goodness-of-fit test?

EXAMPLE 1: binned data (X-ray fluorescence spectrum)

EXAMPLE 2: unbinned data (dosimetric distribution from a medical LINAC)


Chi-squared test

• Applies to binned distributions

• It can also be useful for unbinned distributions, but the data must first be grouped into classes

• Cannot be applied if the theoretical (expected) frequency in any class is < 5

– When this is the case, one could try to merge contiguous classes until the minimum theoretical frequency is reached

– Otherwise one could use Yates’ formula
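The class-merging rule for the chi-squared test can be sketched as follows. This is a minimal illustration, not the GoF Toolkit implementation: the function names `merge_low_bins` and `chi2_statistic`, the greedy left-to-right merging strategy, and the choice to fold a leftover tail into the last class are all assumptions. The optional `yates` flag applies the usual continuity correction of 0.5, which is normally reserved for a single degree of freedom.

```python
def merge_low_bins(observed, expected, min_expected=5.0):
    """Merge contiguous classes until each has expected frequency >= min_expected.

    Greedy left-to-right merging (an assumed strategy); a leftover tail that
    is still too small is folded into the last completed class.
    """
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    pending = False  # True while an unfinished class is being accumulated
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        pending = True
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
            pending = False
    if pending:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi2_statistic(observed, expected, yates=False):
    """Pearson chi-squared statistic, optionally with Yates' continuity correction."""
    shift = 0.5 if yates else 0.0
    return sum(
        max(abs(o - e) - shift, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )


if __name__ == "__main__":
    obs, exp = merge_low_bins([2, 1, 4, 9, 5, 3], [1, 2, 3, 10, 4, 2])
    print(obs, exp, chi2_statistic(obs, exp))
```

The number of degrees of freedom must be computed from the merged classes, not from the original binning.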


Tests based on the supremum statistics (unbinned distributions)

• Kolmogorov-Smirnov test

$D_{mn} = \sup_x |F_n(x) - G_m(x)|$

• Goodman approximation of KS test

$\chi^2 = 4 D_{mn}^2 \frac{mn}{m+n}$

• Kuiper test

$D^*_{mn} = \max_x [F_n(x) - G_m(x)] + \max_x [G_m(x) - F_n(x)]$

[Figure: original distributions and their empirical distribution functions (vertical axis from 0 to 1). Supremum statistics: $D_{mn}$ is the largest distance between the two empirical distribution functions.]
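As a hedged illustration of the three supremum statistics listed on this slide, the sketch below evaluates both empirical distribution functions at every pooled data point, where the extreme distances of two step functions must occur. The function name `supremum_statistics` is an assumption made for this example and is not part of the GoF Toolkit. Sample 1 has size n and sample 2 has size m.

```python
from bisect import bisect_right


def edf(sorted_sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def supremum_statistics(sample1, sample2):
    """Return (KS distance D, Goodman chi-squared, Kuiper distance D*)."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n, m = len(s1), len(s2)
    d_plus = 0.0   # largest value of F - G
    d_minus = 0.0  # largest value of G - F
    for x in s1 + s2:
        diff = edf(s1, x) - edf(s2, x)
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    d_ks = max(d_plus, d_minus)                          # Kolmogorov-Smirnov
    goodman_chi2 = 4.0 * d_ks ** 2 * m * n / (m + n)     # Goodman approximation
    d_kuiper = d_plus + d_minus                          # Kuiper
    return d_ks, goodman_chi2, d_kuiper


if __name__ == "__main__":
    print(supremum_statistics([1, 2, 3, 4], [3, 4, 5, 6]))
```

The Kuiper distance adds the largest deviation in each direction, so it is never smaller than the Kolmogorov-Smirnov distance.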


Tests containing a weighting function (binned/unbinned distributions)

• Fisz-Cramér-von Mises test

$t = \frac{n_1 n_2}{(n_1 + n_2)^2} \sum_i [F_1(x_i) - F_2(x_i)]^2$

• k-sample Anderson-Darling test

$A^2_{kn} = \frac{n-1}{n^2 (k-1)} \sum_{i=1}^{k} \frac{1}{n_i} \sum_j h_j \frac{(n F_{ij} - n_i H_j)^2}{H_j (n - H_j) - n h_j / 4}$

[Figure: original distributions and their empirical distribution functions (vertical axis from 0 to 1). Quadratic statistics + weighting function: sum/integral of all the distances.]
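A minimal sketch of the Fisz-Cramér-von Mises statistic follows, assuming the form t = n1 n2 / (n1 + n2)^2 times the sum, over all pooled observations, of the squared distance between the two empirical distribution functions. The function name `cramer_von_mises_statistic` is hypothetical, and the p-value calculation is deliberately left out.

```python
from bisect import bisect_right


def edf(sorted_sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def cramer_von_mises_statistic(sample1, sample2):
    """Two-sample statistic t = n1*n2/(n1+n2)^2 * sum_i [F1(x_i) - F2(x_i)]^2.

    The sum runs over every observation of the pooled sample, so all the
    distances between the two EDFs contribute, not only the largest one.
    """
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    total = sum((edf(s1, x) - edf(s2, x)) ** 2 for x in s1 + s2)
    return n1 * n2 / (n1 + n2) ** 2 * total


if __name__ == "__main__":
    print(cramer_von_mises_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Unlike the supremum statistics, this quantity changes whenever any pooled observation moves across an observation of the other sample, which is the sense in which it uses all the distances.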


G.A.P. Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo, “A Goodness-of-Fit Statistical Toolkit”, IEEE Transactions on Nuclear Science 51 (5), October 2004.

http://www.ge.infn.it/geant4/analysis/HEPstatistics/


Power evaluation

The power of a test is the probability of rejecting the null hypothesis correctly.

Pseudoexperiment: a random drawing of two samples from two parent distributions (sample 1 of size n from parent distribution 1, sample 2 of size m from parent distribution 2), followed by a GoF test on the two samples.

N = 1000 Monte Carlo replications

Significance level = 0.05 (confidence level CL = 0.95)

For each test, the p-value computed by the GoF Toolkit derives from analytical calculation of the asymptotic distribution, often depending on the sample sizes.

Power = (# pseudoexperiments with p-value < (1 - CL)) / (# pseudoexperiments)
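The pseudoexperiment loop can be sketched as below. This is an illustration under stated assumptions, not the GoF Toolkit code: the names `ks_statistic`, `ks_pvalue` and `estimate_power` are hypothetical, the only test included is Kolmogorov-Smirnov, and its p-value uses the standard asymptotic Kolmogorov series, which may differ from the calculation used in the study.

```python
import math
import random


def ks_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov distance between the two EDFs."""
    a, b = sorted(sample1), sorted(sample2)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value: Q(lam) = 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 lam^2)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here and Q is ~1 anyway
        return 1.0
    q = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(max(q, 0.0), 1.0)


def estimate_power(draw1, draw2, n, m, alpha=0.05, replications=1000, seed=1):
    """Fraction of pseudoexperiments with p-value < alpha (alpha = 1 - CL)."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(replications):
        s1 = [draw1(rng) for _ in range(n)]  # sample 1 from parent 1
        s2 = [draw2(rng) for _ in range(m)]  # sample 2 from parent 2
        if ks_pvalue(ks_statistic(s1, s2), n, m) < alpha:
            rejected += 1
    return rejected / replications


if __name__ == "__main__":
    normal = lambda rng: rng.gauss(0.0, 1.0)
    shifted = lambda rng: rng.gauss(1.0, 1.0)
    print(estimate_power(normal, shifted, n=20, m=20))
```

With N replications the estimate carries a binomial uncertainty of about sqrt(p (1 - p) / N), which is roughly 1.6 percentage points for p near 0.5 and N = 1000.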


Parent distributions

Uniform: $f_1(x) = 1$

Gaussian: $f_2(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$

Double exponential: $f_3(x) = \frac{1}{2} e^{-|x|}$

Cauchy: $f_4(x) = \frac{1}{\pi} \frac{1}{1 + x^2}$

Exponential: $f_5(x) = e^{-x}$

Contaminated Normal Distribution 1: $f_6(x) = 0.9\,N(0,1) + 0.1\,N(0,9)$

Contaminated Normal Distribution 2: $f_7(x) = 0.5\,N(1,4) + 0.5\,N(-1,1)$
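For readers who want to reproduce pseudoexperiments, the sketch below draws one value from each of the seven parent distributions. Several points are assumptions and not stated on the slide: the uniform is taken on [0, 1), the second parameter of N(·,·) is read as a variance, the second component of contaminated normal 2 is taken as N(-1, 1), and the function name `draw` and its string labels are hypothetical.

```python
import math
import random


def draw(parent, rng):
    """Draw one value from the named parent distribution using rng (random.Random)."""
    if parent == "uniform":              # f1, assumed support [0, 1)
        return rng.random()
    if parent == "gaussian":             # f2, standard normal
        return rng.gauss(0.0, 1.0)
    if parent == "double_exponential":   # f3, exponential magnitude with random sign
        magnitude = rng.expovariate(1.0)
        return magnitude if rng.random() < 0.5 else -magnitude
    if parent == "cauchy":               # f4, inverse-CDF method
        return math.tan(math.pi * (rng.random() - 0.5))
    if parent == "exponential":          # f5, unit rate
        return rng.expovariate(1.0)
    if parent == "cn1":                  # f6 = 0.9 N(0,1) + 0.1 N(0,9), variance 9 -> sigma 3
        sigma = 1.0 if rng.random() < 0.9 else 3.0
        return rng.gauss(0.0, sigma)
    if parent == "cn2":                  # f7 = 0.5 N(1,4) + 0.5 N(-1,1), variance 4 -> sigma 2
        if rng.random() < 0.5:
            return rng.gauss(1.0, 2.0)
        return rng.gauss(-1.0, 1.0)
    raise ValueError(f"unknown parent distribution: {parent}")


if __name__ == "__main__":
    rng = random.Random(2004)
    for name in ("uniform", "gaussian", "double_exponential", "cauchy",
                 "exponential", "cn1", "cn2"):
        print(name, [round(draw(name, rng), 3) for _ in range(3)])
```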


Skewness and tailweight

Skewness: $S = \frac{x_{0.975} - x_{0.5}}{x_{0.5} - x_{0.025}}$

Tailweight: $T = \frac{x_{0.975} - x_{0.025}}{x_{0.875} - x_{0.125}}$

Parent                            S        T
f1(x) Uniform                     1        1.267
f2(x) Gaussian                    1        1.704
f3(x) Double exponential          1        2.161
f4(x) Cauchy                      1        5.263
f5(x) Exponential                 4.486    1.883
f6(x) Contaminated normal 1       1        1.991
f7(x) Contaminated normal 2       1.769    1.693
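The tabulated values can be checked from the quantile definitions, taking S as (x_0.975 - x_0.5) / (x_0.5 - x_0.025) and T as (x_0.975 - x_0.025) / (x_0.875 - x_0.125). The sketch below does this for the five parents whose quantile functions have a simple closed form; the dictionary `QUANTILES` and the function names are hypothetical, the uniform is assumed to live on [0, 1], and the two contaminated normals are left out because their quantiles need a numerical inversion.

```python
import math
from statistics import NormalDist

# Quantile functions (inverse CDFs) of five parent distributions.
QUANTILES = {
    "Uniform": lambda p: p,
    "Gaussian": NormalDist().inv_cdf,
    "Double exponential": lambda p: math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p)),
    "Cauchy": lambda p: math.tan(math.pi * (p - 0.5)),
    "Exponential": lambda p: -math.log(1 - p),
}


def skewness(q):
    """S = (x_0.975 - x_0.5) / (x_0.5 - x_0.025) for a quantile function q."""
    return (q(0.975) - q(0.5)) / (q(0.5) - q(0.025))


def tailweight(q):
    """T = (x_0.975 - x_0.025) / (x_0.875 - x_0.125) for a quantile function q."""
    return (q(0.975) - q(0.025)) / (q(0.875) - q(0.125))


if __name__ == "__main__":
    for name, q in QUANTILES.items():
        print(f"{name:20s} S = {skewness(q):.3f}  T = {tailweight(q):.3f}")
```

Both quantities are ratios of quantile differences, so they do not change under a shift or rescaling of the variable; only the shape of the distribution matters.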


The “location-scale problem”

Case Parent 1 = Parent 2

Power increases as a function of the sample size (analytical calculation of the asymptotic distribution).

[Figure: power of the Kolmogorov-Smirnov test (significance level 0.05) versus sample size N, for small sized and moderate sized samples, with one curve per parent distribution: Uniform, Normal, Exponential, Double Exponential, Contaminated Normal 1, Contaminated Normal 2, Cauchy.]


The “general shape problem”

Case Parent 1 ≠ Parent 2

A) Symmetric versus symmetric (S1 = S2 = 1)

[Figure: power (significance level 0.05) versus the tailweight T2 of distribution 2 for the Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling tests; distribution 1 is the double exponential (T1 = 2.161).]

For short-medium tailed distributions: KS ~ CVM ~ AD

For very long tailed distributions: KS ~ CVM < AD

B) Skewed versus symmetric (power in %)

Distribution 1 - Distribution 2     KS          CVM         AD
CN2 - Normal                        55.6±1.8    15.2±1.1    86.1±1.1
CN2 - CN1                           24.9±1.4    25.2±1.1    44.8±1.6
CN2 - Double Exponential            37.6±1.5    40.2±1.6    51.6±1.6


Comparative evaluation of tests

                Tailweight
Skewness        Short (T < 1.5)    Medium (1.5 < T < 2)    Long (T > 2)
S ~ 1           KS                 KS - CVM                CVM - AD
S > 1.5         KS - AD            AD                      CVM - AD

$\chi^2$ < supremum statistics tests < tests containing a weight function
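The comparative table can be read as a lookup from (S, T) to the suggested tests, as in the sketch below. The slide leaves the boundaries open, so several choices here are assumptions: T = 1.5 and T = 2 are assigned to the medium class, any S up to 1.5 is treated as "S ~ 1", and values of S below 1 (left-skewed data) are folded to 1/S, which corresponds to mirroring the distribution. The names `recommend_tests`, `skew_class` and `tail_class` are hypothetical.

```python
def tail_class(t):
    """Classify tailweight T as short, medium or long (boundaries assumed)."""
    if t < 1.5:
        return "short"
    if t <= 2.0:
        return "medium"
    return "long"


def skew_class(s):
    """Classify skewness S; S < 1 is folded to 1/S (assumed mirror symmetry)."""
    if s <= 0:
        raise ValueError("S must be positive")
    folded = max(s, 1.0 / s)
    return "skewed" if folded > 1.5 else "symmetric"


# Entries transcribed from the comparative table on the slide.
RECOMMENDED = {
    ("symmetric", "short"): ("KS",),
    ("symmetric", "medium"): ("KS", "CVM"),
    ("symmetric", "long"): ("CVM", "AD"),
    ("skewed", "short"): ("KS", "AD"),
    ("skewed", "medium"): ("AD",),
    ("skewed", "long"): ("CVM", "AD"),
}


def recommend_tests(s, t):
    """Return the tests suggested by the table for skewness s and tailweight t."""
    return RECOMMENDED[(skew_class(s), tail_class(t))]


if __name__ == "__main__":
    print(recommend_tests(1.0, 1.267))    # uniform-like shape
    print(recommend_tests(4.486, 1.883))  # exponential-like shape
```

When two samples fall into different classes, the table alone does not say which row or column to use, so the lookup should be treated as a starting point, not a rule.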


Results for the data examples

EXAMPLE 1: binned data

Extremely skewed, medium tail

X-variable: Ŝ = 4, T̂ = 1.43

Y-variable: Ŝ = 4, T̂ = 1.50

ANDERSON-DARLING TEST: A² = 0.085, p > 0.05

EXAMPLE 2: unbinned data

Moderately skewed, medium tail

X-variable: Ŝ = 1.53, T̂ = 1.36

Y-variable: Ŝ = 1.27, T̂ = 1.34

KOLMOGOROV-SMIRNOV TEST: D = 0.27, p > 0.05


Conclusions

• Studied several goodness-of-fit tests for location-scale alternatives and general alternatives

• There is no clear winner for all the considered distributions in general

• To select one test in practice:

1. first classify the type of the distributions in terms of skewness S and tailweight T

2. choose the most appropriate test for the classified type of distribution

Topic still subject to research activity in the domain of statistics
