7/23/2019 Distribution Fitting Fix

FINDING A PROPER STATISTICAL DISTRIBUTION FOR DATASET
from http://www.mathwave.com/articles/distribution_fitting_faq.html


What is Distribution Fitting?

• Distribution fitting is the procedure of selecting a statistical distribution that best fits a data set generated by some random process. In other words, if you have some random data available, and would like to know what particular distribution can be used to describe your data, then distribution fitting is what you are looking for.


Who and Why Should Use Distributions?

• Random factors affect all areas of our life, and businesses motivated to succeed in today's highly competitive environment need a tool to deal with the risk and uncertainty involved. Using probability distributions is a scientific way of dealing with uncertainty and making informed business decisions.

• In practice, probability distributions are applied in such diverse fields as actuarial science and insurance, risk analysis, investment, market research, business and economic research, customer support, mining, reliability engineering, chemical engineering, hydrology, image processing, physics, medicine, sociology, demography, etc.


Why is it Important to Select The Best Fitting Distribution?

Probability distributions can be viewed as a tool for dealing with uncertainty: you use distributions to perform specific calculations, and apply the results to make well-grounded business decisions. However, if you use a wrong tool, you will get wrong results. If you select and apply an inappropriate distribution (the one that doesn't fit your data well), your subsequent calculations will be incorrect, and that will certainly result in wrong decisions.

In many industries, the use of incorrect models can have serious consequences such as inability to complete tasks or projects in time leading to substantial time and money loss, wrong engineering design resulting in damage of expensive equipment, etc. In some specific areas such as hydrology, using appropriate distributions can be even more critical.

Distribution fitting allows you to develop valid models of random processes you deal with, protecting you from potential time and money loss which can arise due to invalid model selection, and enabling you to make better business decisions.


Can't I Just Assume The Normal Distribution?

• The Normal distribution was developed more than 250 years ago, and is probably one of the oldest and most frequently used distributions out there. So why not just use it?

It Is Symmetric

• The probability density function of the Normal distribution is symmetric about its mean value, so this distribution cannot be used to model right-skewed or left-skewed data.

It Is Unbounded

• The Normal distribution is defined on the entire real axis (-Infinity, +Infinity), and if the nature of your data is such that it is bounded or non-negative (can only take on positive values), then this distribution is almost certainly not a good fit.

Its Shape Is Constant

• The shape of the Normal distribution does not depend on the distribution parameters. Even if your data is symmetric by nature, it is possible that it is best described by one of the heavy-tailed models such as the Cauchy distribution.
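As a rough illustration of the unboundedness point, the sketch below fits a Normal model to a small set of non-negative values and asks how much probability that model places below zero. The data values are invented for illustration, and Python's standard `statistics.NormalDist` is only a convenient stand-in for whatever fitting tool you actually use.

```python
from statistics import NormalDist

# Hypothetical non-negative measurements (e.g. times to failure); not from the slides.
times = [0.3, 0.5, 0.8, 1.1, 1.2, 1.9, 2.4, 3.7, 5.2, 9.8]

# Fit a Normal model by matching the sample mean and standard deviation.
fitted = NormalDist.from_samples(times)

# Probability the fitted model assigns to impossible (negative) values.
p_negative = fitted.cdf(0.0)
print(f"P(X < 0) under the fitted Normal: {p_negative:.3f}")
```

A noticeable share of probability mass below zero is one sign that a bounded, non-negative model may describe such data better.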


Which Distribution Should I Choose?

You cannot "just guess" and use any other particular distribution without testing several alternative models, as this can result in analysis errors.

In most cases, you need to fit two or more distributions, compare the results, and select the most valid model. The "candidate" distributions you fit should be chosen depending on the nature of your probability data. For example, if you need to analyze the time between failures of technical devices, you should fit non-negative distributions such as Exponential or Weibull, since the failure time cannot be negative.

You can also apply some other identification methods based on properties of your data. For example, you can build a histogram and determine whether the data are symmetric, left-skewed, or right-skewed, and use the distributions which have the same shape.
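One way to put a number on the "symmetric, left-skewed, or right-skewed" judgement is the sample skewness. The sketch below is only illustrative: the data are invented, `describe_shape` is a hypothetical helper, and its 0.5 cut-off is an arbitrary assumption rather than a standard rule.

```python
from statistics import fmean

def sample_skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(data)
    mean = fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def describe_shape(data, tol=0.5):
    """Rough label for the histogram shape; tol is an illustrative cut-off."""
    g1 = sample_skewness(data)
    if g1 > tol:
        return "right-skewed"
    if g1 < -tol:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical times between failures.
failure_times = [0.3, 0.5, 0.8, 1.1, 1.2, 1.9, 2.4, 3.7, 5.2, 9.8]
print(describe_shape(failure_times))
```

A right-skewed, non-negative sample like this one would point toward candidates such as Exponential or Weibull rather than Normal.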


• To actually fit the "candidate" distributions you selected, you need to employ statistical methods that estimate the distribution parameters from your sample data. Solving this problem involves the use of certain algorithms implemented in specialized software.

• After the distributions are fitted, it is necessary to determine how well the distributions you selected fit your data. This can be done using specific goodness of fit tests, or visually by comparing the empirical (based on sample data) and theoretical (fitted) distribution graphs. As a result, you will select the most valid model describing your data.
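The fit-then-compare workflow can be sketched with two candidates whose maximum-likelihood estimates have closed forms. This is a minimal sketch under stated assumptions, not the method of any particular package: the data are invented, `ks_distance` and `fit_and_compare` are hypothetical helpers, and only the Kolmogorov-Smirnov distance is computed (no p-value).

```python
import math
from statistics import NormalDist, fmean, pstdev

def ks_distance(data, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def fit_and_compare(data):
    """Fit Normal and Exponential by maximum likelihood; return KS distances."""
    mean = fmean(data)
    normal = NormalDist(mean, pstdev(data))  # MLE: sample mean, population sd
    rate = 1.0 / mean                        # MLE of the exponential rate

    def exponential_cdf(x):
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

    candidates = {"Normal": normal.cdf, "Exponential": exponential_cdf}
    return {name: ks_distance(data, cdf) for name, cdf in candidates.items()}

# Hypothetical times between failures.
times = [0.3, 0.5, 0.8, 1.1, 1.2, 1.9, 2.4, 3.7, 5.2, 9.8]
distances = fit_and_compare(times)
best = min(distances, key=distances.get)
print(distances, "-> smallest distance:", best)
```

A smaller distance means the fitted curve tracks the empirical distribution more closely; a visual comparison of the two curves is still worth doing before settling on a model.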


Exploratory Data Analysis (EDA)

EDA includes:

Descriptive statistics (numerical summaries): mean, median, range, variance, standard deviation, etc. In SPSS choose Analyze > Descriptive Statistics > Descriptives.

Kolmogorov-Smirnov & Shapiro-Wilk tests: These methods test whether one distribution (e.g. your dataset) is significantly different from another (e.g. a normal distribution) and produce a numerical answer, yes or no. Use the Shapiro-Wilk test if the sample size is between 3 and 2000 and the Kolmogorov-Smirnov test if the sample size is greater than 2000. Unfortunately, in some circumstances, both of these tests can produce misleading results, so "real" statisticians prefer graphical plots to tests such as these.

Graphical methods:
• frequency distribution histograms
• stem & leaf plots
• scatter plots
• box & whisker plots
• Normal probability plots: PP and QQ plots
• Graphs with error bars (Graphs: Error Bar)
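Outside SPSS, the same numerical summaries can be computed in a few lines. The sketch below uses Python's standard `statistics` module; the scores are invented and `describe` is a hypothetical helper name.

```python
import statistics

def describe(data):
    """Numerical summaries named on the slide."""
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(data),      # sample standard deviation
    }

# Hypothetical scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(scores)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```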


Probability Plots

• The assumption of a normal model for a population of responses will be required in order to perform certain inference procedures. A histogram can be used to get an idea of the shape of a distribution. However, there is a more sensitive tool for checking whether the shape is close to a normal model: the Q-Q plot.

• A Q-Q plot is a plot of the percentiles (or quantiles) of a standard normal distribution (or any other specific distribution) against the corresponding percentiles of the observed data. If the observations follow approximately a normal distribution, the resulting plot should be roughly a straight line with a positive slope.


Probability plots

• Q-Q plot: Quantile-quantile plot

• Graph of the q_i-quantile of a fitted (model) distribution versus the q_i-quantile of the sample distribution.

• If F̂(x) is the correct distribution that is fitted, then for a large sample size F̂(x) and F_n(x) will be close together and the Q-Q plot will be approximately linear with intercept 0 and slope 1.

• For a small sample, even if F̂(x) is the correct distribution, there will be some departure from the straight line.

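\[
x_{q_i}^{M} = \hat{F}^{-1}(q_i), \qquad
x_{q_i}^{S} = F_n^{-1}(q_i) = X_{(i)}, \qquad i = 1, 2, \ldots, n
\]

Here the superscripts M and S mark the model and sample quantiles, and X_{(i)} is the i-th smallest observation.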
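The Q-Q construction can be sketched numerically by pairing model quantiles with the ordered sample and checking that a least-squares line through the points has slope near 1 and intercept near 0. The plotting positions q_i = (i - 0.5)/n are an assumption (one common choice, not stated on the slide), the sample is simulated, and `qq_points` and `least_squares_line` are hypothetical helpers.

```python
import random
from statistics import NormalDist, fmean

def qq_points(sample):
    """Return (model quantiles, sample quantiles) for a fitted Normal model."""
    xs = sorted(sample)                       # X_(1) <= ... <= X_(n)
    n = len(xs)
    model = NormalDist.from_samples(xs)       # fitted distribution F-hat
    probs = [(i - 0.5) / n for i in range(1, n + 1)]   # assumed q_i
    model_q = [model.inv_cdf(q) for q in probs]
    return model_q, xs

def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

random.seed(0)
sample = [random.gauss(5.0, 2.0) for _ in range(200)]  # simulated Normal data
model_q, sample_q = qq_points(sample)
slope, intercept = least_squares_line(model_q, sample_q)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Plotting `model_q` against `sample_q` gives the Q-Q plot itself; the fitted line is just a quick numeric check of its straightness.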


Probability plots

• P-P plot: Probability-probability plot.

• It is valid for both continuous and discrete data sets.

• If F̂(x) is the correct distribution that is fitted, then for a large sample size F̂(x) and F_n(x) will be close together and the P-P plot will be approximately linear with intercept 0 and slope 1.

• A graph of the model probability \( \hat{F}(X_{(i)}) \) against the sample probability \( F_n(X_{(i)}) = q_i \), for \( i = 1, 2, \ldots, n \).
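A P-P plot can be sketched the same way, pairing the fitted CDF evaluated at each ordered observation with the sample probability. As before, q_i = (i - 0.5)/n is an assumed plotting position, the data are simulated, and `pp_points` is a hypothetical helper.

```python
import random
from statistics import NormalDist

def pp_points(sample):
    """Return (model probabilities, sample probabilities) for a fitted Normal."""
    xs = sorted(sample)
    n = len(xs)
    model = NormalDist.from_samples(xs)                    # fitted F-hat
    model_p = [model.cdf(x) for x in xs]                   # F-hat(X_(i))
    sample_p = [(i - 0.5) / n for i in range(1, n + 1)]    # assumed q_i
    return model_p, sample_p

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]      # simulated Normal data
model_p, sample_p = pp_points(sample)

# Largest vertical gap between the P-P points and the 45-degree line.
max_gap = max(abs(m - s) for m, s in zip(model_p, sample_p))
print(f"largest departure from the diagonal: {max_gap:.3f}")
```

Both axes run from 0 to 1, so departures from the diagonal are directly comparable across data sets.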


Probability plots

• The Q-Q plot will amplify the differences between the tails of the model distribution and the sample distribution.

• The P-P plot, by contrast, will amplify the differences in the middle portion of the model and sample distributions.


Probability plots: Dataset


The Detrended Normal QQ plot shows the differences between the observed and expected values of a normal distribution. If the distribution is normal, the points should cluster in a horizontal band around zero with no pattern.
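The detrended values can be sketched as the standardized observation minus the Normal quantile expected at the same rank. This is an assumption about the construction (SPSS's exact plotting positions are not given on the slide); the data are invented and `detrended_qq` is a hypothetical helper.

```python
from statistics import NormalDist, fmean, stdev

def detrended_qq(sample):
    """Return (observed value, observed z minus expected Normal z) pairs."""
    xs = sorted(sample)
    n = len(xs)
    mean, sd = fmean(xs), stdev(xs)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        observed_z = (x - mean) / sd
        expected_z = std_normal.inv_cdf((i - 0.5) / n)  # assumed plotting position
        points.append((x, observed_z - expected_z))
    return points

# Hypothetical right-skewed data.
times = [0.3, 0.5, 0.8, 1.1, 1.2, 1.9, 2.4, 3.7, 5.2, 9.8]
deviations = [d for _, d in detrended_qq(times)]
print([round(d, 2) for d in deviations])
```

For this skewed sample the deviations are positive at both ends and negative in the middle, which is the kind of pattern that signals a departure from normality.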


QQ Plot

The graphs below are examples for which a normal model for the response is not reasonable.

1. The Q-Q plot above left indicates the existence of two clusters of observations.

2. The Q-Q plot above right shows an example where the shape of the distribution appears to be skewed right.

3. The Q-Q plot below left shows evidence of an underlying distribution that has heavier tails compared to those of a normal distribution.


• The Q-Q plot below right shows evidence of an underlying distribution which is approximately normal except for one large outlier that should be further investigated.


• It is most important that you can see the departures in the above graphs, and not as important to know whether the departure implies skewed left versus skewed right and so on. A histogram would allow you to see the shape and type of departure from normality.


Data Transformations and Their Uses

Data Transformation                      Can Correct For
Log Transformation (log(X))              Positive Skew, Unequal Variances
Square Root Transformation (sqrt(X))     Positive Skew, Unequal Variances
Reciprocal Transformation (1/X)          Positive Skew, Unequal Variances
Reverse Score Transformation             Negative Skew

Reverse Score Transformation: all of the above can correct for negative skew, but you must first reverse the scores. Just subtract each score from the highest score in the data set + 1.
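The transformations in the table are one-liners, and their effect on skewness can be checked directly. The sketch below uses invented positive-valued data; all function names are hypothetical, and the skewness measure (m3 / m2^1.5) is a choice made here, not one named on the slide.

```python
import math
from statistics import fmean

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(data)
    mean = fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def log_transform(data):
    return [math.log(x) for x in data]     # requires x > 0

def sqrt_transform(data):
    return [math.sqrt(x) for x in data]    # requires x >= 0

def reciprocal_transform(data):
    return [1.0 / x for x in data]         # requires x != 0; reverses the ordering

def reverse_scores(data):
    """Subtract each score from (highest score + 1)."""
    top = max(data)
    return [top + 1 - x for x in data]

# Hypothetical positively skewed data.
times = [0.3, 0.5, 0.8, 1.1, 1.2, 1.9, 2.4, 3.7, 5.2, 9.8]
print(f"raw:      {skewness(times):.2f}")
print(f"sqrt:     {skewness(sqrt_transform(times)):.2f}")
print(f"log:      {skewness(log_transform(times)):.2f}")
print(f"reversed: {skewness(reverse_scores(times)):.2f}")
```

Reversing the scores flips the sign of the skew, which is why a negatively skewed variable is reversed first and then given one of the positive-skew transformations.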


Outliers and Extreme Scores


SPSS - Explore BoxPlot

• The top of the box is the upper fourth or 75th percentile.

• The bottom of the box is the lower fourth or 25th percentile.

• 50% of the scores fall within the box or interquartile range.

• The horizontal line is the median.

• The ends of the whiskers represent the largest and smallest values that are not outliers.

• An outlier, O, is defined as a value that lies more than 1.5 box-lengths below the bottom or above the top of the box.

• An extreme value, E, is defined as a value that lies more than 3 box-lengths below the bottom or above the top of the box.

• Normally distributed scores typically have whiskers that are about the same length, and the box is typically smaller than the whiskers.
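The box-length rule can be sketched as follows. This is a minimal sketch, assuming distances are measured from the nearest edge of the box and using the "inclusive" quartile method of Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS computes; the scores and the helper name are invented.

```python
from statistics import quantiles

def classify_boxplot_points(data):
    """Label values as outliers (> 1.5 box-lengths) or extremes (> 3 box-lengths)."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1                     # interquartile range
    outliers, extremes = [], []
    for x in sorted(data):
        # Distance beyond the nearest box edge (negative when inside the box).
        distance = max(q1 - x, x - q3)
        if distance > 3 * box_length:
            extremes.append(x)
        elif distance > 1.5 * box_length:
            outliers.append(x)
    return {"outliers": outliers, "extremes": extremes}

# Hypothetical scores with two unusually large values.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
flags = classify_boxplot_points(scores)
print(flags)
```

Here the box runs from 3.5 to 8.5, so values beyond 16 are flagged as outliers and values beyond 23.5 as extremes.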


Choosing a Z Score to Define Outliers

Z Score    Proportion Above    Proportion Above or Below (+/-)
3.0        0.0013              0.0026
3.1        0.0010              0.0020
3.2        0.0007              0.0014
3.3        0.0005              0.0010
3.4        0.0003              0.0006
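The tail proportions for a chosen cut-off follow from the standard Normal CDF. The sketch below assumes the second column of the table is the upper-tail proportion and the third is the two-tailed proportion; `tail_proportions` is a hypothetical helper.

```python
from statistics import NormalDist

def tail_proportions(z):
    """Proportion of a standard Normal above +z, and beyond +/- z."""
    upper = 1.0 - NormalDist().cdf(z)
    return upper, 2.0 * upper

for z in (3.0, 3.1, 3.2, 3.3, 3.4):
    upper, both = tail_proportions(z)
    print(f"{z:.1f}    {upper:.4f}    {both:.4f}")
```

Computed two-tailed values can differ from the table in the last digit, since the table appears to double the already rounded one-tailed figure.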


Decisions for Extremes and Outliers

1. Check your data to verify all numbers are entered correctly.

2. Verify your devices (data testing machines) are working within manufacturer specifications.

3. Use non-parametric statistics; they don't require a normal distribution.

4. Develop a criterion to use to label outliers and remove them from the data set. You must report these in your methods section.

   1. If you remove outliers, consider including a statistical analysis of the results with and without the outlier(s). In other words, report both; see Stevens (1990), Detecting outliers.

5. Do a log transformation.

   1. If your data have negative numbers you must shift the numbers to the positive scale (e.g. add a constant to each).

   2. Try a natural log transformation first; in SPSS use LN().

   3. Try a log base 10 transformation; in SPSS use LG10().
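The shift-then-log step can be sketched as below. The choice of shift (just enough to make the smallest value equal to 1) is an assumption for illustration; the slide only says to move the data onto the positive scale. The data and the helper name are invented.

```python
import math

def shifted_logs(data):
    """Shift data so the minimum is 1 if needed, then take natural and base-10 logs."""
    lowest = min(data)
    shift = 1 - lowest if lowest <= 0 else 0   # assumed rule: smallest value becomes 1
    shifted = [x + shift for x in data]
    natural = [math.log(x) for x in shifted]   # counterpart of SPSS LN()
    base10 = [math.log10(x) for x in shifted]  # counterpart of SPSS LG10()
    return shift, natural, base10

# Hypothetical data containing zero and a negative value.
shift, natural, base10 = shifted_logs([-3, 0, 6, 96])
print("shift added:", shift)
print("ln:   ", [round(v, 3) for v in natural])
print("log10:", [round(v, 3) for v in base10])
```

Whatever constant is used should be reported along with the transformation, since it affects how the transformed values are interpreted.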