descdefnstat


W.R. Wilcox, Clarkson University. Last revised September 17, 2012

Definitions of descriptive statistics of a single variable generated by the Descriptive Statistics tool in Excel's Data Analysis

Background

Imagine that we want to know the distance from the front wall of this room to its back wall. We measure it. We measure it again and obtain a slightly different result. We might guess that the average of these two measurements would be closer to the true (unknown) value, and that the more measurements we make the closer the average will be to the true value. In principle, the number of possible measurements is unlimited.

We might also measure the diameter of pistons being produced in an automotive plant. Each of these will be somewhat different, reflecting not only errors in our method of measuring but also real variations in the actual diameter. Again, in principle, there is no limit to the number of pistons that could be produced and measured.

In both our examples, we define the "population" as the set of all measurements that could be made and "samples" as the actual measurements made. The challenge of statistics is to use the samples to estimate characteristics of the population. Often, we use different symbols for these characteristics, depending on whether they are for the population or for the samples. For example, the population mean (average) is generally given the Greek letter mu, $\mu$, and the sample mean is written $\bar{x}$. The square root of the average square of the deviation of individual values of the population from $\mu$ is the population standard deviation, and is given the Greek letter sigma, $\sigma$. The sample standard deviation, s, is defined below and is an estimate of $\sigma$. As the sample size n is increased, $\bar{x}$ becomes closer to $\mu$ and s closer to $\sigma$.
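The way the sample mean and sample standard deviation approach the population values as n grows can be illustrated with a short simulation. This is a minimal sketch and not part of the original handout: the "true" distance (5.0), the measurement scatter (0.02), the normal distribution of errors, and the function name `sample_stats` are all assumptions chosen only for illustration.

```python
import random
import statistics

def sample_stats(n, mu=5.0, sigma=0.02, seed=0):
    """Simulate n measurements and return (sample mean, sample std dev).

    mu and sigma are hypothetical population values, not data from the text.
    """
    rng = random.Random(seed)
    values = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(values), statistics.stdev(values)

# Larger samples should give estimates closer to mu = 5.0 and sigma = 0.02.
for n in (10, 1000, 100000):
    x_bar, s = sample_stats(n)
    print(f"n={n:>6}  x_bar={x_bar:.5f}  s={s:.5f}")
```

Any single small sample can land close to the population values by luck, so the trend is clearest when comparing the smallest and largest n.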

In the following, we denote the individual value of the sample or measurement as $x_i$, where i goes from 1 to n. The terms below appear in the order they are produced by Excel's Descriptive Statistics. Each term is followed in capital letters by the Excel function that produces the same value, a definition or explanation of the statistic, and then the relevant equation.

Note that the mean, standard error, median, mode, standard deviation, range, minimum, maximum, sum and confidence level all have the same units as the sample values $x_i$.

Mean (AVERAGE): The sum of all samples divided by the number of values:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Standard Error: The standard deviation that would be found among many measurements of the mean, each made with n samples. It is estimated from a single set of n samples as the sample standard deviation s divided by the square root of n:

$$\frac{s}{\sqrt{n}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n-1)}}$$
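The mean and standard error formulas can be checked outside Excel with a few lines of Python. This is a sketch under stated assumptions: the function name `mean_and_standard_error` and the five length measurements are hypothetical, introduced only for illustration.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (x_bar, s / sqrt(n)) computed directly from the definitions."""
    n = len(values)
    x_bar = sum(values) / n
    sum_sq_dev = sum((x - x_bar) ** 2 for x in values)
    std_err = math.sqrt(sum_sq_dev / (n * (n - 1)))
    return x_bar, std_err

lengths = [10.02, 9.98, 10.05, 9.97, 10.01]  # hypothetical measurements
x_bar, std_err = mean_and_standard_error(lengths)
print(x_bar, std_err)

# The same standard error via the library's sample standard deviation.
print(statistics.stdev(lengths) / math.sqrt(len(lengths)))
```

Both routes should agree to within floating-point rounding, since `statistics.stdev` uses the same n-1 divisor.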

Median (MEDIAN): If n is odd, the value of $x_i$ for which half of the remaining values are larger and half are smaller. If n is even, the average of the two values in the middle.

Mode (MODE): The most frequently occurring value, if any.
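The odd and even cases of the median, and the "if any" qualifier on the mode, can be illustrated as follows. This is a minimal sketch: `mode_if_any` is a hypothetical helper, and returning the first-encountered value when several values tie is an assumption made here, not a statement about how Excel's MODE breaks ties.

```python
from collections import Counter
import statistics

def mode_if_any(values):
    """Most frequent value, or None when no value occurs more than once.

    Assumes a non-empty list. Ties go to the value seen first (an assumption).
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None

print(statistics.median([3, 1, 2]))     # odd n: the middle value
print(statistics.median([4, 1, 3, 2]))  # even n: average of the two middle values
print(mode_if_any([1, 2, 2, 3]))        # a repeated value exists
print(mode_if_any([1, 2, 3]))           # no value repeats, so there is no mode
```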

http://people.clarkson.edu/~wilcox/


Standard Deviation (STDEV): From Excel's Help on this function, "The standard deviation is a measure of how widely values are dispersed from the average value (the mean)."

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Sample variance (VAR): Square of the standard deviation:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
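The sample variance and standard deviation can be computed directly from these definitions. The sketch below is illustrative only: the function names and the eight data values are hypothetical, and the comparison with Python's `statistics` module is included simply as a cross-check of the n-1 divisor.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
print(sample_variance(data), statistics.variance(data))
print(sample_std_dev(data), statistics.stdev(data))
```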

Kurtosis (KURT): From Excel's Help on this function, "Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution." The kurtosis of a sample is consistent with a normal distribution for a population if it is small, e.g. less than 0.3.

Skewness (SKEW): "Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values." The skewness of a sample is consistent with a normal distribution for a population if its absolute value is small, e.g. less than 0.3.
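The handout gives no equations for kurtosis and skewness. The sketch below uses the sample-corrected formulas that Excel's Help gives for KURT and SKEW, as best we understand them; treat the formulas, the function names, and the example data as assumptions to be checked against Excel itself rather than as a definitive implementation.

```python
import math

def _n_mean_s(values):
    """Return (n, sample mean, sample standard deviation with n - 1 divisor)."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return n, x_bar, s

def excel_style_skew(values):
    """n / ((n-1)(n-2)) times the sum of cubed standardized deviations (n >= 3)."""
    n, x_bar, s = _n_mean_s(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)

def excel_style_kurt(values):
    """Excess kurtosis with small-sample correction (n >= 4)."""
    n, x_bar, s = _n_mean_s(values)
    fourth = sum(((x - x_bar) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]   # hypothetical data, symmetric about its mean
long_tail = [1, 1, 1, 10]     # hypothetical data with a tail toward larger values
print(excel_style_skew(symmetric), excel_style_kurt(symmetric))
print(excel_style_skew(long_tail))
```

The symmetric set should give a skewness of essentially zero, while the second set, with its tail toward larger values, should give a positive skewness.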

Range: Maximum value minus minimum value. (Usually increases as n increases, making it a poor measure of the dispersion or spread of the population values.)
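The remark that the range grows with n can be seen by computing the range of the first 10, 100 and 1000 values of one simulated data set. This is only a sketch: the simulated piston diameters (mean 10.0, scatter 0.1) and the name `value_range` are assumptions made for illustration.

```python
import random

def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)

rng = random.Random(1)
# Hypothetical measurements drawn from one fixed population.
measurements = [rng.gauss(10.0, 0.1) for _ in range(1000)]

# Adding values to a sample can only keep or widen its range, never shrink it.
ranges = [value_range(measurements[:n]) for n in (10, 100, 1000)]
print(ranges)
```

Because each longer sample contains the shorter one, the printed ranges cannot decrease, even though the spread of the underlying population is unchanged.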

Minimum (MIN): Minimum value.

Maximum (MAX): Maximum value.

Sum (SUM): Sum of all values, $\sum_{i=1}^{n} x_i$

Count (COUNT): Number of values, n

Confidence Level (chosen %): If the population is normally distributed and you choose the default of 95% ($\alpha = 0.05$), then the probability is 95% that $\mu$ lies within $\bar{x} \pm$ Confidence Level. The Confidence Level is $\frac{ts}{\sqrt{n}}$, where t is Student's t (or, often, just t). Thus the probability is $1 - \alpha$ that $\mu = \bar{x} \pm \frac{ts}{\sqrt{n}}$, or $\alpha$ that the true value of $\mu$ lies outside these confidence limits. The value of t can be calculated by Excel's TINV function, in which n-1 is the degrees of freedom and $\alpha$ is the probability (chance that the confidence limits do not include the true $\mu$). There are several important things to note:

http://people.clarkson.edu/~wilcox/ES100/datadesc.doc




• The Excel function CONFIDENCE does not give the same results unless n is greater than about 100. The reason is that the Descriptive Statistics tool correctly uses the Student's t distribution for a finite sized sample, while CONFIDENCE uses the normal distribution, which is for an infinite population. See normally distributed for a more detailed explanation and for MATLAB programs to calculate Student's t and descriptive statistics.

• The more the absolute values of skewness or kurtosis exceed 1, the greater is the probability that the population is not normally distributed, and the less chance that the confidence level calculated by Excel is correct.

• Exercise 4a shows how Excel can provide a graphical test of normality.

• The probability $\alpha$ that $|\bar{x} - \mu| > a$ can be found using Excel as follows. Calculate $t = \frac{a\sqrt{n}}{s}$. Then $\alpha$ = TDIST(t,n-1,2), where n-1 is the degrees of freedom. This is called a two-tailed test. The probability that $\bar{x} - \mu > a$ is 1/2 of TDIST(t,n-1,2), or TDIST(t,n-1,1). This is called a one-tailed test.
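The confidence-level notes above can be explored numerically without Excel. The sketch below is an approximation under stated assumptions, not Excel's algorithm: `two_tailed_prob` plays the role of TDIST(t, dof, 2) by integrating the Student's t density with Simpson's rule, `t_critical` plays the role of TINV by bisection, and `normal_confidence` uses the normal distribution in the way the text attributes to CONFIDENCE. All function names and input values are hypothetical, and n-1 degrees of freedom is used throughout, following the statement about TINV.

```python
import math
from statistics import NormalDist

def t_pdf(x, dof):
    """Student's t probability density with dof degrees of freedom."""
    log_c = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    c = math.exp(log_c) / math.sqrt(dof * math.pi)
    return c * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_tailed_prob(t, dof, steps=2000):
    """Approximate P(|T| > t) using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, dof)
    return 1 - 2 * total * h / 3

def t_critical(alpha, dof):
    """Find t such that two_tailed_prob(t, dof) is close to alpha, by bisection."""
    lo, hi = 0.0, 1.0
    while two_tailed_prob(hi, dof) > alpha:  # widen the bracket until it contains t
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_prob(mid, dof) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_level(s, n, alpha=0.05):
    """Half-width t*s/sqrt(n) based on Student's t with n - 1 degrees of freedom."""
    return t_critical(alpha, n - 1) * s / math.sqrt(n)

def normal_confidence(s, n, alpha=0.05):
    """Half-width based on the normal distribution instead of Student's t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * s / math.sqrt(n)

s, n, a = 0.03, 10, 0.02  # hypothetical sample std dev, sample size, and deviation a
print(confidence_level(s, n), normal_confidence(s, n))
t = a * math.sqrt(n) / s
alpha_two_tailed = two_tailed_prob(t, n - 1)
print(alpha_two_tailed, alpha_two_tailed / 2)  # two-tailed and one-tailed probabilities
```

For small n the t-based half-width should be noticeably wider than the normal-based one, and the two should nearly coincide once n reaches the hundreds, which is the behavior the first note describes.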

Outliers

Outliers are values $x_i$ which differ significantly from the mean $\bar{x}$. The most modern criterion seems to be Grubbs' Test (the t discussed on that page is Student's t). If an outlier is so identified, you should look at the source of the data to see if there is any reason why this value might be invalid. If so, it is permissible to throw it out and recalculate all of the statistics. But it should not be thrown out simply because it is an outlier.
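As a starting point for such a check, the Grubbs' test statistic is the largest absolute deviation from the sample mean divided by the sample standard deviation. The sketch below computes only that statistic and the position of the suspect value. The critical value it must be compared with depends on Student's t and on n, and is left to the referenced page. The function name and data are hypothetical.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index), where G = max|x_i - x_bar| / s and index locates that value."""
    x_bar = statistics.fmean(values)
    s = statistics.stdev(values)
    index = max(range(len(values)), key=lambda i: abs(values[i] - x_bar))
    return abs(values[index] - x_bar) / s, index

readings = [10.0, 10.1, 9.9, 10.0, 12.0]  # hypothetical data with one suspect value
g, suspect = grubbs_statistic(readings)
print(g, readings[suspect])
```

A large statistic only flags a candidate. As the text says, the value should be discarded only if the source of the data gives a reason to consider it invalid.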

Return to the Excel tutorial home.

Comments and suggestions always welcome. Email to wilcox@clarkson.edu.




http://people.clarkson.edu/~wilcox/ES100/nrmprbpt.xls

http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm


http://people.clarkson.edu/~wilcox/extut.htm

