© 2007 Thomson South-Western. All Rights Reserved

Chapter 3: Descriptive Statistics: Numerical Measures

Measures of Location

Measures of Variability

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Measures of Association Between Two Variables

Weighted Mean

Measures of Location

If the measures are computed for data from a sample, they are called sample statistics.

If the measures are computed for data from a population, they are called population parameters.

A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean

Median

Mode

Percentiles

Quartiles

Mean

The mean of a data set is the average of all the data values.

The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean $\bar{x}$

$$\bar{x} = \frac{\sum x_i}{n}$$

where $\sum x_i$ is the sum of the values of the $n$ observations and $n$ is the number of observations in the sample.

Population Mean $\mu$

$$\mu = \frac{\sum x_i}{N}$$

where $\sum x_i$ is the sum of the values of the $N$ observations and $N$ is the number of observations in the population.

Median

Whenever a data set has extreme values, the median is the preferred measure of central location.

A few extremely large incomes or property values can inflate the mean.

The median is the measure of location most often reported for annual income and property value data.

The median of a data set is the value in the middle when the data items are arranged in ascending order.

Median

For an odd number of observations, the median is the middle value.

7 observations: 26 28 27 22 24 27 12

In ascending order: 12 22 24 26 27 27 28

Median = 26

Median

For an even number of observations, the median is the average of the middle two values.

8 observations: 26 28 27 22 24 27 12 30

In ascending order: 12 22 24 26 27 27 28 30

Median = (26 + 27)/2 = 26.5
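The odd and even cases can be checked with a minimal Python sketch. The helper name `median_of` is an illustrative choice, not something from the slides; the data are the observations used in the two median examples.

```python
def median_of(data):
    """Middle value for odd n, average of the two middle values for even n."""
    values = sorted(data)              # arrange the data in ascending order
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]             # odd n: the single middle value
    return (values[mid - 1] + values[mid]) / 2   # even n: average the middle two

odd_sample = [26, 28, 27, 22, 24, 27, 12]        # 7 observations
even_sample = [26, 28, 27, 22, 24, 27, 12, 30]   # 8 observations

print(median_of(odd_sample))    # 26
print(median_of(even_sample))   # 26.5
```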

Mean vs. Median

The mean is affected by outliers (extreme observations).

The median is largely unaffected by outliers.
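A short Python sketch can make the contrast concrete. The data reuse the seven observations from the median example; replacing the largest value with 280 is a made-up change, introduced here only to create an outlier.

```python
import statistics

data = [12, 22, 24, 26, 27, 27, 28]
with_outlier = data[:-1] + [280]   # hypothetical: largest value replaced by 280

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)       # the mean shifts upward noticeably
print(median_before, median_after)   # the median stays at 26
```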

Mode

The mode of a data set is the value that occurs with greatest frequency.

The greatest frequency can occur at two or more different values.

If the data have exactly two modes, the data are bimodal.

If the data have more than two modes, the data are multimodal.
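As a rough illustration, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later) to list every value tied for the greatest frequency. The helper name `describe_modes` and the label strings are illustrative assumptions.

```python
import statistics

def describe_modes(data):
    """Return the list of modes and a label based on how many there are."""
    modes = statistics.multimode(data)   # all values tied for greatest frequency
    if len(modes) == 1:
        label = "one mode"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        # Note: if every value occurs once, all values tie and land here too.
        label = "multimodal"
    return modes, label

print(describe_modes([26, 28, 27, 22, 24, 27, 12]))   # 27 occurs twice
print(describe_modes([1, 2, 2, 3, 3]))                # 2 and 3 tie
```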

Percentiles

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

Admission test scores for colleges and universities are frequently reported in terms of percentiles.

Percentiles

The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.

Percentiles

Arrange the data in ascending order.

Compute index i, the position of the pth percentile.

i = (p/100)n

If i is not an integer, round up. The pth percentile is the value in the ith position.

If i is an integer, the pth percentile is the average of the values in positions i and i+1.
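The steps above can be sketched in Python as follows. The function name `textbook_percentile` is an illustrative assumption, and the sketch restricts p to values strictly between 0 and 100 so that positions i and i+1 always exist.

```python
import math

def textbook_percentile(data, p):
    """Return the pth percentile (0 < p < 100) using the index rule above."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)               # arrange the data in ascending order
    n = len(values)
    i = p * n / 100                     # index i = (p/100)n
    if i != int(i):
        # i is not an integer: round up and take the value in that position
        return values[math.ceil(i) - 1]     # positions are 1-based
    i = int(i)
    # i is an integer: average the values in positions i and i + 1
    return (values[i - 1] + values[i]) / 2

seven = [26, 28, 27, 22, 24, 27, 12]
eight = seven + [30]

print(textbook_percentile(seven, 50))   # i = 3.5, round up to position 4: 26
print(textbook_percentile(eight, 50))   # i = 4, average of positions 4 and 5: 26.5
```

Both results agree with the medians found for the same data in the median examples.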

Note on Excel’s Percentile Function

The formula that Excel uses is different from the one used in the textbook!

To find the position of the pth percentile, Excel uses the following formula:

Lp = (p/100)n + (1 - p/100)

Once the position is identified:

1. If Lp is a whole number (e.g. 12), Excel's result will be the same as the textbook's.

2. If Lp is not a whole number (e.g. 12.3), Excel interpolates between the two neighboring values, so its result may differ from the textbook's.
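A hedged sketch of the position formula follows. It assumes that, when Lp is not a whole number, the result is obtained by linear interpolation between the two neighboring sorted values, which is how Excel's PERCENTILE function is generally documented to behave. The function name `excel_style_percentile` is illustrative.

```python
import math

def excel_style_percentile(data, p):
    """Percentile at position Lp = (p/100)n + (1 - p/100), for 0 <= p <= 100."""
    values = sorted(data)
    n = len(values)
    # Algebraically the same as (p/100)n + (1 - p/100)
    lp = 1 + p * (n - 1) / 100
    lower = math.floor(lp)
    frac = lp - lower
    if frac == 0:
        return values[lower - 1]        # whole-number position (1-based)
    # Assumed behavior: interpolate between positions lower and lower + 1
    return values[lower - 1] + frac * (values[lower] - values[lower - 1])

seven = [26, 28, 27, 22, 24, 27, 12]
eight = seven + [30]

print(excel_style_percentile(seven, 50))   # Lp = 4, value in position 4: 26
print(excel_style_percentile(eight, 25))   # Lp = 2.75, between 22 and 24: 23.5
```

For the eight observations and p = 25, the textbook rule gives i = 2 and averages positions 2 and 3 to get 23, while this position formula gives 23.5.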

Quartiles

Quartiles are specific percentiles.

First Quartile = 25th Percentile

Second Quartile = 50th Percentile = Median

Third Quartile = 75th Percentile

Measures of Variability

It is often desirable to consider measures of variability (dispersion), as well as measures of location.

For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability

Range

Interquartile Range

Variance

Standard Deviation

Coefficient of Variation

Range

The range of a data set is the difference between the largest and smallest data values.

It is the simplest measure of variability.

It is very sensitive to the smallest and largest data values.

Interquartile Range

The interquartile range of a data set is the difference between the third quartile and the first quartile.

It is the range for the middle 50% of the data.

It overcomes the sensitivity to extreme data values.
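The range and the interquartile range can be compared on the seven observations from the median example. This is a sketch: the helper `percentile` restates the textbook index rule for already-sorted data, and its name is an illustrative choice.

```python
import math

def percentile(values, p):
    """Textbook index rule applied to already-sorted values (0 < p < 100)."""
    i = p * len(values) / 100
    if i != int(i):
        return values[math.ceil(i) - 1]        # round up, 1-based position
    i = int(i)
    return (values[i - 1] + values[i]) / 2     # average positions i and i + 1

data = sorted([26, 28, 27, 22, 24, 27, 12])

data_range = max(data) - min(data)             # largest minus smallest: 28 - 12
q1 = percentile(data, 25)                      # first quartile
q3 = percentile(data, 75)                      # third quartile
iqr = q3 - q1                                  # spread of the middle 50%

print(data_range, q1, q3, iqr)                 # 16 22 27 5
```

The low value 12 drives the range up to 16, while the interquartile range of 5 reflects only the middle half of the data.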

Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).

Variance

The variance is the average of the squared differences between each data value and the mean.

The variance is computed as follows:

for a sample:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

for a population:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
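The two variance formulas differ only in the divisor, n - 1 for a sample and N for a population. The sketch below computes both by hand and compares them with the standard library's `statistics.variance` and `statistics.pvariance`. The data values are hypothetical, chosen so that the mean is a whole number.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the slides

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

print(sample_variance(data), statistics.variance(data))        # both 32/7
print(population_variance(data), statistics.pvariance(data))   # both 4
```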

Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, making it more easily interpreted than the variance.

Standard Deviation

The standard deviation is computed as follows:

for a sample:

$$s = \sqrt{s^2}$$

for a population:

$$\sigma = \sqrt{\sigma^2}$$
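As a quick check of the square-root relationship, this sketch takes the square root of each variance and compares it with `statistics.stdev` and `statistics.pstdev` from the standard library. The data values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the slides

sample_sd = math.sqrt(statistics.variance(data))        # s = sqrt(s^2)
population_sd = math.sqrt(statistics.pvariance(data))   # sigma = sqrt(sigma^2)

print(sample_sd, statistics.stdev(data))        # the two values should agree
print(population_sd, statistics.pstdev(data))   # both 2.0 for this data
```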

Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

The coefficient of variation is computed as follows:

for a sample:

$$\left( \frac{s}{\bar{x}} \times 100 \right)\%$$

for a population:

$$\left( \frac{\sigma}{\mu} \times 100 \right)\%$$
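A minimal sketch of the coefficient of variation, assuming hypothetical data and illustrative function names. It expresses the standard deviation as a percentage of the mean, so it is only meaningful when the mean is not zero.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the slides

def sample_cv(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

def population_cv(values):
    """Population standard deviation as a percentage of the population mean."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)

print(round(sample_cv(data), 2))   # about 42.76
print(population_cv(data))         # standard deviation 2 is 40% of the mean 5
```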

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shape

z-Scores

Chebyshev’s Theorem

Empirical Rule

Detecting Outliers

Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for computing skewness for a data set is somewhat complex.

• Skewness can be easily computed using statistical software.

• Excel's SKEW function can be used to compute the skewness of a data set.
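For readers working outside Excel, the sketch below computes a sample skewness in Python. The slides do not state the formula, so the one used here is an assumption: the adjusted sample skewness n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, which is the formula commonly documented for Excel's SKEW. The function name and data are illustrative.

```python
import statistics

def sample_skewness(data):
    """Adjusted sample skewness (assumed formula; requires n >= 3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)                      # sample standard deviation
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

print(sample_skewness([1, 2, 3, 4, 5]))    # symmetric data: approximately zero
print(sample_skewness([1, 2, 3, 4, 10]))   # long right tail: positive
print(sample_skewness([1, 7, 8, 9, 10]))   # long left tail: negative
```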

Distribution Shape: Skewness

Symmetric (not skewed)

• Skewness is zero.

• Mean and median are equal.

[Figure: relative frequency histogram of a symmetric distribution, vertical axis from 0 to .35; Skewness = 0]

Distribution Shape: Skewness

Moderately Skewed Left

• Skewness is negative.

• Mean will usually be less than the median.

[Figure: relative frequency histogram of a distribution moderately skewed left, vertical axis from 0 to .35; Skewness = -.31]

Distribution Shape: Skewness

Moderately Skewed Right

• Skewness is positive.

• Mean will usually be more than the median.

[Figure: relative frequency histogram of a distribution moderately skewed right, vertical axis from 0 to .35; Skewness = .31]

z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value $x_i$ is from the mean.

$$z_i = \frac{x_i - \bar{x}}{s}$$

z-Scores

A data value less than the sample mean will have a z-score less than zero.

A data value greater than the sample mean will have a z-score greater than zero.

A data value equal to the sample mean will have a z-score of zero.

An observation's z-score is a measure of the relative location of the observation in a data set.
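A small Python sketch of the z-score calculation, using the sample mean and the sample standard deviation. The five data values are hypothetical and were picked so that the mean (44) and standard deviation (8) are whole numbers.

```python
import statistics

data = [46, 54, 42, 46, 32]       # hypothetical values, not from the slides

mean = statistics.mean(data)      # sample mean: 44
s = statistics.stdev(data)        # sample standard deviation: 8

# z_i = (x_i - mean) / s for each observation
z_scores = [(x - mean) / s for x in data]

for x, z in zip(data, z_scores):
    print(x, z)   # values below the mean get negative z-scores
```

With these values, 32 lies 1.5 standard deviations below the mean and 54 lies 1.25 standard deviations above it.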

Chebyshev’s Theorem

At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.

Chebyshev's Theorem

At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.
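The three percentages follow directly from the bound 1 - 1/z². A minimal sketch, with an illustrative function name:

```python
def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    # Prints the bound as a percentage; the slides round these to 75%, 89%, 94%
    print(z, round(100 * chebyshev_bound(z), 2))
```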

Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
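These proportions can be reproduced approximately with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `within_k_sd` is illustrative. Values computed this way can differ from the slide's figures in the last digit, presumably because the slide's figures come from a rounded normal table.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within_k_sd(k):
    """Probability that a normal random variable is within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```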

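Empirical Rule

[Figure: normal curve over x with the horizontal axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - 1\sigma$, $\mu$, $\mu + 1\sigma$, $\mu + 2\sigma$, and $\mu + 3\sigma$; 68.26% of values lie within 1 standard deviation of the mean, 95.44% within 2, and 99.72% within 3]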

Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

It might be:

• an incorrectly recorded data value

• a data value that was incorrectly included in the data set

• a correctly recorded data value that belongs in the data set
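The z-score screen can be sketched as below. The data are hypothetical: nineteen values near 50 plus one value of 120. The function name and the default threshold of 3 are illustrative, and a flagged value is only a candidate for review, since it may be a correct observation.

```python
import statistics

def flag_outliers(data, threshold=3):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)     # sample standard deviation
    return [x for x in data if abs((x - mean) / s) > threshold]

# Hypothetical data, not from the slides
data = [50, 52, 49, 51, 50, 48, 53, 50, 49, 51,
        52, 50, 49, 51, 50, 48, 52, 50, 51, 120]

print(flag_outliers(data))   # only 120 has a z-score beyond +/- 3
```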

Measures of Association Between Two Variables

Covariance

Correlation Coefficient

Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

Covariance

The covariance is computed as follows:

for samples:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

for populations:

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
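A minimal sketch of both covariance formulas, written by hand so the divisor is explicit. The paired data and the function names are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def population_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by N."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired observations, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))       # 6 / 4 = 1.5
print(population_covariance(x, y))   # 6 / 5 = 1.2
```

The positive result indicates a positive relationship: y tends to be above its mean when x is above its mean.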

Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near +1 indicate a strong positive linear relationship.

Values near -1 indicate a strong negative linear relationship.

Correlation Coefficient

The correlation coefficient is computed as follows:

for samples:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

for populations:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
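The sample formula can be sketched by dividing the sample covariance by the product of the two sample standard deviations. The paired data and the function name are hypothetical.

```python
import statistics

def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), using sample (n - 1) statistics throughout."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical paired observations, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(round(sample_correlation(x, y), 4))     # about 0.7746
print(sample_correlation(x, [2 * v for v in x]))   # exact linear relationship
```

An exactly linear, increasing relationship such as y = 2x gives a coefficient of +1, up to floating-point rounding.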

Correlation Coefficient

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Correlation is a measure of linear association and not necessarily causation.

Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.

When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.

Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where:

$x_i$ = value of observation i

$w_i$ = weight for observation i
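A GPA-style calculation illustrates the formula. The grade points and credit hours below are hypothetical, and the function name is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical transcript: grade points earned and credit hours per course
grade_points = [4, 3, 2]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)   # (3*4 + 4*3 + 3*2) / 10 = 3.0
```

With equal weights the result reduces to the ordinary mean.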