© 2007 Thomson South-Western. All Rights Reserved


Chapter 3: Descriptive Statistics: Numerical Measures

Measures of Location

Measures of Variability

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Measures of Association Between Two Variables

Weighted Mean


Measures of Location

If the measures are computed for data from a sample, they are called sample statistics.

If the measures are computed for data from a population, they are called population parameters.

A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean

Median

Mode

Percentiles

Quartiles


Mean

The mean of a data set is the average of all the data values.

The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.


Sample Mean $\bar{x}$

$\bar{x} = \dfrac{\sum x_i}{n}$

where $\sum x_i$ is the sum of the values of the n observations and n is the number of observations in the sample.


Population Mean $\mu$

$\mu = \dfrac{\sum x_i}{N}$

where $\sum x_i$ is the sum of the values of the N observations and N is the number of observations in the population.


Median

Whenever a data set has extreme values, the median is the preferred measure of central location.

A few extremely large incomes or property values can inflate the mean.

The median is the measure of location most often reported for annual income and property value data.

The median of a data set is the value in the middle when the data items are arranged in ascending order.


Median

For an odd number of observations:

26 28 27 22 24 27 12   (7 observations)

12 22 24 26 27 27 28   (in ascending order)

The median is the middle value.

Median = 26


Median

For an even number of observations:

26 28 27 22 24 27 12 30   (8 observations)

12 22 24 26 27 27 28 30   (in ascending order)

The median is the average of the middle two values.

Median = (26 + 27)/2 = 26.5
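The odd and even cases can be sketched in a few lines of Python. This is a minimal illustration using the two data sets from the slides; the function name `median` is chosen here for the example and is not part of the original notes.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)      # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [26, 28, 27, 22, 24, 27, 12]    # 7 observations from the slide
even_data = odd_data + [30]                # 8 observations from the slide

print(median(odd_data))    # 26
print(median(even_data))   # 26.5
```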


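Mean vs. Median

The mean IS affected by outliers (extreme observations).

The median is NOT pulled toward outliers: it depends only on the middle of the ordered data, so one extreme value shifts it little or not at all.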
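A small numerical comparison makes the contrast concrete. The income figures below are hypothetical values invented for this sketch (they do not come from the slides), and the calculation relies on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands of dollars (illustrative only).
incomes = [31, 35, 38, 40, 42]
with_outlier = incomes + [900]     # add one extreme observation

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

With these numbers the mean jumps from 37.2 to 181, while the median only moves from 38 to 39.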


Mode

The mode of a data set is the value that occurs with greatest frequency.

The greatest frequency can occur at two or more different values.

If the data have exactly two modes, the data are bimodal.

If the data have more than two modes, the data are multimodal.
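One way to find every mode is to count how often each value occurs and keep the values that reach the highest count. The sketch below assumes that approach; the helper name `modes` and the second data set are made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


slide_data = [26, 28, 27, 22, 24, 27, 12]   # data from the median slide
two_peaks = [1, 2, 2, 3, 3, 4]              # hypothetical bimodal data

print(modes(slide_data))   # [27]
print(modes(two_peaks))    # [2, 3]
```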


Percentiles

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

Admission test scores for colleges and universities are frequently reported in terms of percentiles.


Percentiles

The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.


Percentiles

Arrange the data in ascending order.

Compute index i, the position of the pth percentile.

i = (p/100)n

If i is not an integer, round up. The pth percentile is the value in the ith position.

If i is an integer, the pth percentile is the average of the values in positions i and i+1.
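The procedure above translates directly into code. The following is a minimal sketch of the textbook rule, assuming p is strictly between 0 and 100; the function name `percentile` is an illustrative choice, and the data are the observations from the median slides.

```python
import math


def percentile(values, p):
    """Textbook rule: i = (p/100)n, round up or average positions i and i+1."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    i = p * n / 100                   # step 2: index i = (p/100)n
    if not float(i).is_integer():
        position = math.ceil(i)       # i is not an integer: round up
        return ordered[position - 1]  # positions are 1-based
    position = int(i)                 # i is an integer: average i and i+1
    return (ordered[position - 1] + ordered[position]) / 2


eight = [26, 28, 27, 22, 24, 27, 12, 30]
seven = [26, 28, 27, 22, 24, 27, 12]

print(percentile(eight, 50))   # 26.5, matches the median slide
print(percentile(seven, 50))   # 26
print(percentile(eight, 80))   # i = 6.4, round up to position 7
```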


Note on Excel’s Percentile Function

The formula that Excel uses is different from the one used in the textbook!

To find the location of the pth percentile, Excel uses the following formula:

Lp = (p/100)n + (1 – p/100)

Once the location is identified:

1. If Lp is a whole number (e.g. 12), Excel’s result will be the same as the textbook’s.

2. If Lp is not a whole number (e.g. 12.3), Excel’s result may be different from the textbook’s.
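The location formula can be tried out in code. The slide gives only the formula for Lp; the sketch below additionally assumes that a fractional Lp is resolved by linear interpolation between the two neighbouring ordered values, which is how Excel's PERCENTILE function is generally documented. The function name `excel_style_percentile` is hypothetical.

```python
import math


def excel_style_percentile(values, p):
    """Locate Lp = (p/100)n + (1 - p/100), interpolating if Lp is fractional."""
    ordered = sorted(values)
    n = len(ordered)
    lp = (p / 100) * n + (1 - p / 100)    # 1-based location from the slide
    lower = math.floor(lp)
    fraction = lp - lower
    if fraction == 0:                     # whole-number location
        return ordered[lower - 1]
    # Assumption: linear interpolation between positions lower and lower + 1.
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


seven = [26, 28, 27, 22, 24, 27, 12]
eight = seven + [30]

print(excel_style_percentile(seven, 50))   # Lp = 4, a whole number
print(excel_style_percentile(eight, 80))   # Lp is about 6.6
```

For the 80th percentile of the eight observations this gives roughly 27.6, whereas the textbook rule rounds i = 6.4 up to position 7 and returns 28.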


Quartiles

Quartiles are specific percentiles.

First Quartile = 25th Percentile

Second Quartile = 50th Percentile = Median

Third Quartile = 75th Percentile



Measures of Variability

It is often desirable to consider measures of variability (dispersion), as well as measures of location.

For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.



Range

Interquartile Range

Variance

Standard Deviation

Coefficient of Variation


Range

The range of a data set is the difference between the largest and smallest data values.

It is the simplest measure of variability.

It is very sensitive to the smallest and largest data values.


Interquartile Range

The interquartile range of a data set is the difference between the third quartile and the first quartile.

It is the range for the middle 50% of the data.

It overcomes the sensitivity to extreme data values.
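The range and interquartile range can be compared on the eight observations from the median slide. This sketch assumes the quartiles are found with the textbook percentile rule (i = (p/100)n); the function names are illustrative choices.

```python
import math


def percentile(values, p):
    """Textbook percentile rule for 0 < p < 100."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if not float(i).is_integer():
        return ordered[math.ceil(i) - 1]           # round up, 1-based position
    position = int(i)
    return (ordered[position - 1] + ordered[position]) / 2


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Third quartile (75th percentile) minus first quartile (25th)."""
    return percentile(values, 75) - percentile(values, 25)


data = [26, 28, 27, 22, 24, 27, 12, 30]

print(value_range(data))           # 30 - 12 = 18
print(interquartile_range(data))   # 27.5 - 23.0 = 4.5
```

The single low value 12 accounts for most of the range, while the interquartile range ignores it.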


Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).


Variance

The variance is the average of the squared differences between each data value and the mean.

The variance is computed as follows:

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$   for a sample

$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$   for a population
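The only difference between the two variance formulas is the divisor: n - 1 for a sample, N for a population. A minimal sketch, using hypothetical data chosen so the arithmetic is easy to check by hand:

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values; the mean is 5

print(population_variance(data))  # 32 / 8 = 4.0
print(sample_variance(data))      # 32 / 7, about 4.57
```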


Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, making it more easily interpreted than the variance.


Standard Deviation

The standard deviation is computed as follows:

$s = \sqrt{s^2}$   for a sample

$\sigma = \sqrt{\sigma^2}$   for a population


Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

The coefficient of variation is computed as follows:

$\left(\dfrac{s}{\bar{x}} \times 100\right)\%$   for a sample

$\left(\dfrac{\sigma}{\mu} \times 100\right)\%$   for a population
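The standard deviation and the coefficient of variation follow from the variance in one step each. The sketch below uses hypothetical delivery times (in days) purely for illustration; the function names are not from the notes.

```python
import math


def sample_std(values):
    """Square root of the sample variance (n - 1 divisor)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_std(values):
    """Square root of the population variance (N divisor)."""
    big_n = len(values)
    mu = sum(values) / big_n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / big_n)


def coefficient_of_variation(std, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std / mean * 100


delivery_days = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data
mean_days = sum(delivery_days) / len(delivery_days)

s = sample_std(delivery_days)
sigma = population_std(delivery_days)
print(s, coefficient_of_variation(s, mean_days))
print(sigma, coefficient_of_variation(sigma, mean_days))
```

Treating the data as a population gives a standard deviation of 2 days on a mean of 5 days, so the coefficient of variation is 40%.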


Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shape

z-Scores

Chebyshev’s Theorem

Empirical Rule

Detecting Outliers


Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for computing skewness for a data set is somewhat complex.

• Skewness can be easily computed using statistical software.

• Excel’s SKEW function can be used to compute the skewness of a data set.



Symmetric (not skewed)

• Skewness is zero.

• Mean and median are equal.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = 0]


Moderately Skewed Left

• Skewness is negative.

• Mean will usually be less than the median.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = -.31]



Moderately Skewed Right

• Skewness is positive.

• Mean will usually be more than the median.

[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = .31]


z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value $x_i$ is from the mean.

$z_i = \dfrac{x_i - \bar{x}}{s}$


z-Scores

A data value less than the sample mean will have a z-score less than zero.

A data value greater than the sample mean will have a z-score greater than zero.

A data value equal to the sample mean will have a z-score of zero.

An observation’s z-score is a measure of the relative location of the observation in a data set.
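These sign rules are easy to confirm numerically. The sketch below computes z-scores with the sample mean and sample standard deviation from Python's `statistics` module; the three-value data set is hypothetical and chosen so the results are whole numbers.

```python
import statistics


def z_scores(values):
    """Return (x_i - x_bar) / s for every value, using the sample std dev."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    return [(x - x_bar) / s for x in values]


simple = [10, 20, 30]                     # hypothetical: mean 20, s = 10
print(z_scores(simple))                   # below, equal to, above the mean

slide_data = [26, 28, 27, 22, 24, 27, 12]
for x, z in zip(slide_data, z_scores(slide_data)):
    print(x, round(z, 2))
```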



Chebyshev’s Theorem

At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.


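At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.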
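The 75%, 89%, and 94% figures come from evaluating 1 - 1/z² at z = 2, 3, and 4. A minimal sketch (the function name is an illustrative choice):

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(z, f"{chebyshev_lower_bound(z):.1%}")
```

The exact values are 75%, 88.9%, and 93.75%; the slide rounds the last two to 89% and 94%.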



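Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.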
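For an exactly normal distribution these percentages can be computed rather than looked up: the probability of falling within k standard deviations of the mean is erf(k/√2). The sketch below assumes exact normality, which real bell-shaped data only approximate.

```python
import math


def normal_within(k):
    """P(|X - mu| <= k * sigma) for a normal random variable."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{normal_within(k):.2%}")
```

The computed values are about 68.27%, 95.45%, and 99.73%. They agree with the slide's figures to within roughly 0.01 percentage point, a gap consistent with the slide values having been taken from a rounded normal table.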


Empirical Rule

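[Figure: normal curve with the x axis marked at μ – 3σ, μ – 2σ, μ – 1σ, μ, μ + 1σ, μ + 2σ, and μ + 3σ; the nested bands cover 68.26%, 95.44%, and 99.72% of the values]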


Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

It might be:

• an incorrectly recorded data value

• a data value that was incorrectly included in the data set

• a correctly recorded data value that belongs in the data set
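The z-score screen can be sketched as a simple filter. The data below are hypothetical (twenty values near 50 plus one value of 120), and the threshold of 3 follows the rule of thumb on the slide. Flagged values still need to be reviewed by hand, since an outlier may be a perfectly valid observation.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation
    return [x for x in values if abs((x - x_bar) / s) > threshold]


readings = [48, 49, 50, 51, 52] * 4 + [120]   # hypothetical data
print(flag_outliers(readings))                # [120]

slide_data = [26, 28, 27, 22, 24, 27, 12]
print(flag_outliers(slide_data))              # [] (12 has a z-score near -2)
```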


Measures of Association Between Two Variables

Covariance

Correlation Coefficient


Covariance

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

The covariance is a measure of the linear association between two variables.


Covariance

The covariance is computed as follows:

$s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$   for samples

$\sigma_{xy} = \dfrac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$   for populations
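Both covariance formulas sum the products of paired deviations and differ only in the divisor. A minimal sketch with hypothetical paired data:

```python
def sample_covariance(xs, ys):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def population_covariance(xs, ys):
    """Sum of (x_i - mu_x)(y_i - mu_y), divided by N."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    big_n = len(xs)
    mu_x, mu_y = sum(xs) / big_n, sum(ys) / big_n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / big_n


xs = [1, 2, 3, 4, 5]          # hypothetical paired observations
ys = [2, 4, 5, 4, 5]

print(sample_covariance(xs, ys))       # 6 / 4 = 1.5
print(population_covariance(xs, ys))   # 6 / 5, about 1.2
```

The positive result reflects that y tends to rise as x rises in this data set.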



Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.


The correlation coefficient is computed as follows:

$r_{xy} = \dfrac{s_{xy}}{s_x s_y}$   for samples

$\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$   for populations
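Dividing the covariance by the two standard deviations removes the units and bounds the result between -1 and +1. The sketch below implements the sample version with hypothetical data; the second data set is an exact straight line to show the upper limit.

```python
import math


def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), all computed with the n - 1 divisor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


xs = [1, 2, 3, 4, 5]                 # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
on_a_line = [2 * x + 1 for x in xs]  # exact positive linear relationship

print(round(sample_correlation(xs, ys), 4))         # about 0.7746
print(round(sample_correlation(xs, on_a_line), 4))  # 1.0
```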




Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Correlation is a measure of linear association and not necessarily causation.


Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.

When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.


Weighted Mean

$\bar{x} = \dfrac{\sum w_i x_i}{\sum w_i}$

where:

$x_i$ = value of observation i

$w_i$ = weight for observation i
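The GPA case makes a convenient worked example. The grade points and credit hours below are hypothetical values invented for this sketch.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


grade_points = [4.0, 3.0, 2.0]   # hypothetical grades: A, B, C
credit_hours = [3, 4, 2]         # hypothetical weights

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))             # (12 + 12 + 4) / 9, about 3.11
```

With equal weights the weighted mean reduces to the ordinary mean, here 3.0.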
