

W.R. Wilcox, Clarkson University. Last revised September 17, 2012

Definitions of descriptive statistics of a single variable generated by the Descriptive Statistics tool in Excel’s Data Analysis

Background

Imagine that we want to know the distance from the front wall of this room to its back wall. We measure it. We measure it again and obtain a slightly different result. We might guess that the average of these two measurements would be closer to the true (unknown) value, and that the more measurements we make, the closer the average will be to the true value. In principle, the number of possible measurements is unlimited.

We might also measure the diameter of pistons being produced in an automotive plant. Each of these will be somewhat different, reflecting not only errors in our method of measuring but also real variations in the actual diameter. Again, in principle, there is no limit to the number of pistons that could be produced and measured.

In both our examples, we define the “population” as the number of measurements that could be made and “samples” as the actual measurements made. The challenge of statistics is to use the samples to estimate characteristics of the population. Often, we use different symbols for these characteristics, depending on whether they are for the population or for the samples. For example, the population mean (average) is generally given the Greek letter mu, $\mu$, and the sample mean is written $\bar{x}$. The square root of the average square of the deviation of individual values of the population from $\mu$ is the population standard deviation, and is given the Greek letter sigma, $\sigma$. The sample standard deviation, $s$, is defined below and is an estimate of $\sigma$. As the sample size $n$ is increased, $\bar{x}$ becomes closer to $\mu$ and $s$ closer to $\sigma$.

In the following, we denote the individual value of the sample or measurement as $x_i$, where $i$ goes from 1 to $n$. The terms below appear in the order they are produced by Excel’s Descriptive Statistics. Each term is followed in capital letters by the Excel function that produces the same value, a definition or explanation of the statistic, and then the relevant equation.

Note that the mean, standard error, median, mode, standard deviation, range, minimum, maximum, sum and confidence level all have the same units as the sample values $x_i$.

Mean (AVERAGE): The sum of all samples divided by the number of values:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Standard Error: The population standard deviation of many measurements of a mean of $n$ samples. It is estimated by the sample standard deviation $s$ from a single set of $n$ samples, divided by the square root of $n$:

$$\frac{s}{\sqrt{n}} = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n(n-1)}}$$

Median (MEDIAN): If $n$ is odd, the value of $x_i$ for which half of the remaining values are larger and half are smaller. If $n$ is even, the average of the two values in the middle.

Mode (MODE): The most frequently occurring value, if any.
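
As a rough cross-check outside Excel, the first four statistics can be reproduced with Python's standard library. This is only a sketch: the `samples` list is made-up illustrative data, and the variable names are our own rather than anything produced by Excel.

```python
import math
import statistics

# Hypothetical sample of n = 8 measurements (illustrative values only).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(samples)
mean = sum(samples) / n               # same definition as Excel's AVERAGE
s = statistics.stdev(samples)         # sample standard deviation (divisor n - 1)
standard_error = s / math.sqrt(n)     # s divided by the square root of n
median = statistics.median(samples)   # average of the two middle values when n is even
mode = statistics.mode(samples)       # most frequently occurring value

print(mean, standard_error, median, mode)
```

One difference worth keeping in mind: when no value repeats, Excel's MODE returns an error (#N/A), whereas `statistics.mode` in Python 3.8 and later simply returns the first value it encounters.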


Standard Deviation (STDEV): From Excel’s Help on this function, “The standard deviation is a measure of how widely values are dispersed from the average value (the mean).”

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}}$$

Sample variance (VAR): Square of the standard deviation:

$$s^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}$$
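
The sample standard deviation and sample variance can be computed directly from their definitions and compared against a library routine. The sketch below uses made-up data, and the names `x_bar`, `sum_sq_dev`, `variance` and `std_dev` are ours.

```python
import math
import statistics

# Hypothetical sample (illustrative values only).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(samples)
x_bar = sum(samples) / n

# Sum of squared deviations from the sample mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in samples)

variance = sum_sq_dev / (n - 1)   # sample variance, divisor n - 1 (Excel VAR)
std_dev = math.sqrt(variance)     # sample standard deviation (Excel STDEV)

# The standard library uses the same n - 1 divisor.
print(variance, statistics.variance(samples))
print(std_dev, statistics.stdev(samples))
```

The divisor is $n-1$ rather than $n$ because these are sample estimates of the population values.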

Kurtosis (KURT): From Excel’s Help on this function, “Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.” The kurtosis of a sample is consistent with a normal distribution for a population if its absolute value is small, e.g. less than 0.3.

Skewness (SKEW): “Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.” The skewness of a sample is consistent with a normal distribution for a population if its absolute value is small, e.g. less than 0.3.
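
No equations are given above for kurtosis and skewness. The sketch below uses the sample-corrected formulas that, to our understanding, Excel's Help documents for KURT and SKEW. Treat the function names `excel_skew` and `excel_kurt` as our own labels, and the match with Excel as an assumption to verify against your own spreadsheet.

```python
import statistics

def excel_skew(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divisor n - 1)
    total = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def excel_kurt(values):
    """Sample excess kurtosis, corrected for sample size."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((x - x_bar) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction

# Hypothetical data: a symmetric sample and a right-skewed one.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(excel_skew(symmetric), excel_kurt(symmetric))
print(excel_skew(right_tailed), excel_kurt(right_tailed))
```

The symmetric sample has zero skewness and a negative kurtosis (flatter than a normal distribution), while the second sample, with its longer tail toward large values, has positive skewness.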

Range: Maximum value minus minimum value. (Usually increases as $n$ increases, making it a poor measure of the dispersion or spread of the population values.)

Minimum (MIN): Minimum value.

Maximum (MAX): Maximum value.
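
The remark about the range can be illustrated with a small simulation. This is a sketch on synthetic data: the seed, the normal distribution with mean 10 and standard deviation 1, and the helper name `sample_range` are all arbitrary choices of ours.

```python
import random

def sample_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(10.0, 1.0) for _ in range(1000)]

# Take progressively larger samples from the same population.
ranges = []
for n in (10, 100, 1000):
    subset = data[:n]
    ranges.append(sample_range(subset))
    print(n, min(subset), max(subset), sample_range(subset))
```

Because each larger sample contains the smaller one, its range can never be smaller, even though the spread of the underlying population has not changed. The standard deviation does not share this drift, which is why it is the preferred measure of dispersion.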

Sum (SUM): Sum of all values, $\sum_{i=1}^{n} x_i$

Count (COUNT): Number of values, $n$

Confidence Level (chosen %): If the population is normally distributed and you choose the default of 95% ($\alpha = 0.05$), then the probability is 95% that $\mu = \bar{x} \pm \text{Confidence Level}$. The Confidence Level $= ts/\sqrt{n}$, where $t$ is Student’s t (or, often, just t). Thus the probability is $1 - \alpha$ that

$$\mu = \bar{x} \pm \frac{ts}{\sqrt{n}},$$

or $\alpha$ that the true value of $\mu$ lies outside these confidence limits. The value of $t$ can be calculated by Excel’s TINV function, in which $n-1$ is the degrees of freedom and $\alpha$ is the probability (the chance that the confidence limits do not include the true $\mu$). There are several important things to note:


• The Excel function CONFIDENCE does not give the same results unless $n$ is greater than about 100. The reason is that the Descriptive Statistics tool correctly uses the Student’s t distribution for a finite-sized sample, while CONFIDENCE uses the normal distribution, which is for an infinite population. See “normally distributed” for a more detailed explanation and for MATLAB programs to calculate Student’s t and descriptive statistics.

• The more the absolute values of skewness or kurtosis exceed 1, the greater is the probability that the population is not normally distributed, and the less chance that the confidence level calculated by Excel is correct.

• Exercise Ha shows how Excel can provide a graphical test of normality.

• The probability $\alpha$ that $|\bar{x} - \mu| > a$ can be found using Excel as follows. Calculate $t = a\sqrt{n}/s$. Then $\alpha$ = TDIST(t, n-1, 2), where the second argument is the degrees of freedom, $n-1$, as for TINV above. This is called a two-tailed test. The probability that $\bar{x} - \mu > a$ is ½ of TDIST(t, n-1, 2), or TDIST(t, n-1, 1). This is called a one-tailed test.
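
The difference between the Student's t and normal-distribution confidence levels noted above can be seen in a short sketch. The Python standard library has no Student's t quantile, so the t value here is hard-coded from a standard table (2.262 for $\alpha = 0.05$ and 9 degrees of freedom, which is what TINV(0.05, 9) should return). The data and the function names are illustrative assumptions of ours.

```python
import math
import statistics

def confidence_half_width(samples, critical_value):
    """Half-width of the confidence interval: critical_value * s / sqrt(n)."""
    n = len(samples)
    s = statistics.stdev(samples)
    return critical_value * s / math.sqrt(n)

def t_statistic(a, samples):
    """t = a * sqrt(n) / s, the value passed to TDIST for a deviation a."""
    n = len(samples)
    s = statistics.stdev(samples)
    return a * math.sqrt(n) / s

# Hypothetical sample of n = 10 measurements.
samples = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0, 10.1, 9.9]

T_95_DF9 = 2.262  # tabulated two-tailed Student's t, alpha = 0.05, 9 degrees of freedom
z_95 = statistics.NormalDist().inv_cdf(1 - 0.05 / 2)  # normal-distribution equivalent

t_half = confidence_half_width(samples, T_95_DF9)  # Student's t, as the tool uses
z_half = confidence_half_width(samples, z_95)      # normal distribution, as CONFIDENCE uses

print(t_half, z_half)
print(t_statistic(t_half, samples))  # recovers the t value used above
```

For a small sample the t-based interval is noticeably wider than the normal-based one. The two critical values converge as $n$ grows, which is consistent with the observation that CONFIDENCE only agrees for $n$ greater than about 100.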

Outliers

Outliers are values $x_i$ which differ significantly from the mean $\bar{x}$. The most modern criterion seems to be Grubbs’ Test (the t discussed on that page is Student’s t). If an outlier is so identified, you should look at the source of the data to see if there is any reason why this value might be invalid. If so, it is permissible to throw it out and recalculate all of the statistics. But it should not be thrown out simply because it is an outlier.
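
The text above does not spell out Grubbs' Test. As a sketch, and assuming the usual two-sided form of the test statistic, $G = \max_i |x_i - \bar{x}| / s$, the most extreme value can be located as follows. The data and the function name `grubbs_statistic` are hypothetical, and the critical value that $G$ must exceed (which depends on $n$, $\alpha$ and Student's t) still has to be taken from a table or a statistics package.

```python
import statistics

def grubbs_statistic(samples):
    """Return (G, index) for the value farthest from the sample mean.

    G = max |x_i - mean| / s, with s the sample standard deviation.
    """
    x_bar = statistics.mean(samples)
    s = statistics.stdev(samples)
    index, value = max(enumerate(samples), key=lambda pair: abs(pair[1] - x_bar))
    return abs(value - x_bar) / s, index

# Hypothetical measurements with one suspicious value.
samples = [10.0, 10.1, 9.9, 10.2, 9.8, 12.5]

g, outlier_index = grubbs_statistic(samples)
print(g, outlier_index, samples[outlier_index])
```

Even when $G$ exceeds the critical value, the advice above still applies: check the source of the data before discarding the point.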

Return to the Excel tutorial home.

Comments and suggestions always welcome. Email to wilcox@clarkson.edu.
