
Numerical Descriptive Measures - California State …mihaylofaculty.fullerton.edu/sites/zgoldstei… · PPT file · Web view · 2009-08-18About half the scores lie between 512 and


Descriptive Statistics:

Numerical Methods

Chapter 4

Introduction

In this chapter we use numerical measures to describe data sets that represent populations or samples.

Usually, we focus our attention on two types of measures when describing population characteristics:
• Measures of central location.
• Measures of dispersion.

Why are both the central location and the variability used to describe a set of numbers?

Observe the following example.

Think of a sample portfolio composed of three stocks:
• 100 shares, ARR = 10%
• 200 shares, ARR = 15%
• 100 shares, ARR = 20%

A central measure of this portfolio's ARR is 15%.

Now observe the following portfolio:
• 100 shares, ARR = 5%
• 200 shares, ARR = 15%
• 100 shares, ARR = 25%

A central measure of this portfolio's ARR is 15% too.

Considering the average ARR only, the two portfolios are equal. But are they really? Is the dispersion of ARR the same for the two portfolios?

The dispersion (variability) is an important property when describing a set of numbers, at least as important as the central location.

We'll have more detailed discussions on these two important measures later.
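The comparison of the two portfolios can be checked with a short Python sketch. It assumes that the "central measure" is the share-weighted mean ARR and uses the largest ARR minus the smallest ARR as a crude stand-in for dispersion; the function names are illustrative only.

```python
# (number of shares, ARR in percent) per stock, as listed in the two portfolios.
portfolio_1 = [(100, 10.0), (200, 15.0), (100, 20.0)]
portfolio_2 = [(100, 5.0), (200, 15.0), (100, 25.0)]


def weighted_mean_arr(portfolio):
    """Share-weighted mean ARR (an assumed reading of 'central measure')."""
    total_shares = sum(shares for shares, _ in portfolio)
    return sum(shares * arr for shares, arr in portfolio) / total_shares


def arr_spread(portfolio):
    """Largest ARR minus smallest ARR, a deliberately crude dispersion figure."""
    rates = [arr for _, arr in portfolio]
    return max(rates) - min(rates)


for name, portfolio in (("Portfolio 1", portfolio_1), ("Portfolio 2", portfolio_2)):
    print(name, weighted_mean_arr(portfolio), arr_spread(portfolio))
```

Both portfolios share the same central value, while the second one has twice the spread.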

4.1 Measures of Central Location

The central data point reflects the locations of all the actual data points. How?

With one data point, clearly the central location is at the point itself.

With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

If a third data point appears in the center, the measure of central location will remain in the center.

But if the third data point appears on the left-hand side of the midrange, it should "pull" the central location to the left.

As more and more data points are added, the central location moves (left and right) as required in order to reflect the effects of all the points.

The Arithmetic Mean

This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

Sample mean (n is the sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (N is the population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Example 1
Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%.

Solution
$$\bar{x} = \frac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \frac{59.52}{5} = 11.904\%$$
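As a quick check of Example 1, the sketch below applies "sum of the measurements divided by the number of measurements" to the five rates of return and compares the result with Python's standard `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

# Annual rates of return (in percent) of the five stocks in Example 1.
returns = [11.2, 8.07, 5.55, 13.7, 21.0]

# Mean = sum of the measurements / number of measurements.
manual_mean = sum(returns) / len(returns)

print(round(manual_mean, 3))    # 11.904
print(round(mean(returns), 3))  # 11.904
```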

The Median

The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

When determining the median, pay attention to the number of observations (k).
• k is odd: Median = the number at the ((k+1)/2)th location of the ordered array.
• k is even: Median = the average of the two numbers in the middle (the numbers at the (k/2)th and the ((k/2)+1)th locations of the ordered array).

Example 2
The salaries of seven employees were recorded (in $1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.

Odd number of observations
Ordered array: 26, 26, 28, 29, 30, 32, 60
There are seven salaries (k = 7). The ((k+1)/2)th salary of the ordered array is the number at the (7+1)/2 = 4th location. The median is 29.

Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.

Even number of observations
Ordered array: 26, 26, 28, 29, 30, 31, 32, 60
There are eight salaries (k = 8). The two salaries in the middle are 29 (in the (k/2)th = 4th location) and 30 (in the ((k/2)+1)th = 5th location). The median is their average, 29.5.
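The odd and even rules can be written out directly. This is a minimal sketch; `median_by_location` is a hypothetical helper name, and its results are compared with the standard `statistics.median`.

```python
from statistics import median


def median_by_location(values):
    """Median via the (k+1)/2 rule for odd k and the two middle values for even k."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # Location (k+1)/2 is 1-based, so subtract 1 for Python's 0-based index.
        return ordered[(k + 1) // 2 - 1]
    # Average of the values at locations k/2 and (k/2)+1.
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2


salaries = [28, 60, 26, 32, 30, 26, 29]    # in $1000s
print(median_by_location(salaries))         # 29
print(median_by_location(salaries + [31]))  # 29.5
```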

The Mode

The mode of a set of measurements is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two or more modes.

The modal class
For large data sets the modal class is much more relevant than a single-value mode.

Example 3
The manager of a men's clothing store observes the waist size (in inches) of trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.

The mode of this data set is 34 in.

This information seems to be valuable (for example, for the design of a new display in the store), much more than "the median is 33.5 in."
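The mode and median quoted in Example 3 can be reproduced with the standard library. `multimode` returns every most-frequent value, so it also covers data sets with two or more modes.

```python
from collections import Counter
from statistics import median, multimode

# Waist sizes (in inches) of trousers sold last week, from Example 3.
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

counts = Counter(waist_sizes)    # frequency of each size
modes = multimode(waist_sizes)   # list of all most-frequent values

print(modes, counts[34])         # [34] 3
print(median(waist_sizes))       # 33.5
```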


Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and mode coincide.

If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.

• A positively skewed distribution ("skewed to the right"): Mode < Median < Mean
• A negatively skewed distribution ("skewed to the left"): Mean < Median < Mode

Using the Mean, Median, and Mode

When to use (or not use) each measure of central location:
• The mean is very sensitive to extreme values, and thus should not be used when a few extreme values, residing away from most of the observations, are present. The mean is used in most statistical analyses.

• The median is not affected by extreme values, and therefore can be used in their presence. Yet the median does not reflect all the values included in the data set, but rather the location of the observation in the middle.

• The mode should be used mainly for categorical data.

Summary Examples

Example 4
A professor of statistics wants to report the results of a midterm exam, taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.

The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his/her mark relative to other students in the class.

The mode must be used when data is qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

Example 5
The following sample represents the lateness of arriving flights in a certain domestic flight airport (in minutes): 22, 12, 4, -3,… (the data is found in Lateness.xls)
(a) Find the mean, median, and mode of this sample. Do these data form a skewed distribution? Negative or positive?
(b) Which measure should not be used? Change the largest lateness to 34 minutes (rather than 67). Which central location measures are affected?
(c) A person is waiting for the arrival of a certain flight. He is told the flight will probably be late not more than 10 minutes. Should he believe this is a reliable estimate? Use the distribution of data requested in part (b).

Example 5 - solution
We run the data on Excel using the 'Descriptive Statistics' tool.

Lateness
Mean                 10.8709677
Standard Error       2.6436135
Median               6
Mode                 4
Standard Deviation   14.719017
Sample Variance      216.649462
Kurtosis             6.39059859
Skewness             2.17922953
Range                75
Minimum              -8
Maximum              67
Sum                  337
Count                31

[Figure: histogram of Lateness; horizontal axis from -8 to 70, vertical axis from 0 to 20.]

The distribution of these data shows a positive skewness.

Do not use the mean, because an 'outlier' of 67 minutes lateness affects (increases) the mean value to be almost 11 minutes.

Example 5 - solution continued
When changing the largest observation from 67 to 34, the mean reduces to 9.81 minutes, but the median and mode do not change.

Lateness
Mean                 9.806451613
Standard Error       2.034339265
Median               6
Mode                 4
Standard Deviation   11.32672166
Sample Variance      128.2946237
Kurtosis             0.919374432
Skewness             1.051857781
Range                48
Minimum              -8
Maximum              40
Sum                  304
Count                31

[Figure: histogram and ogive (cumulative relative frequency) of Lateness; bins -1, 8, 17, 26, 35, More; frequency axis from 0 to 20, cumulative axis from 0% to 100%.]

• It is reasonable to believe that the lateness will not exceed 10 minutes. From the ogive we see that about 60% of the flights arrive within 10 minutes of the scheduled arrival time.

Problems

P4-1: Consider the following sample of measurements: 27, 32, 30, 28, 31, 32, 35, 28, 28, 29. Compute the mean, median, and mode. Does it appear that the mode is a good measure of central location for this set of numbers?

P4-2: The manager at a local supermarket (facing tough competition) tries to improve service to customers waiting to pay by adding a second cashier. The goal is to have customers wait at most 4.5 minutes before leaving the cashier area. From the data presented in P4-02.xls, was the manager successful in achieving this goal? Use Excel and numerical descriptive measures.


4.2 Measures of Variability

Measures of central location fail to tell the whole story about the distribution.

A question of interest still remains unanswered:

How much are the values of a given set spread out around the mean value?


Why do we need measures of variability?

Observe two hypothetical data sets:

Set 1: Small variability. The mean provides a good representation of the values in the data set.

Set 2: Larger variability. The mean is the same as before but no longer represents the set values as well as before.

The Range

The range of a set of measurements is the difference between the largest and smallest measurements.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.

[Figure: a line from the smallest measurement to the largest measurement, labeled "Range", with question marks in between.]

But how do all the measurements spread out? The range cannot assist in answering this question.

The Variance

This measure reflects the dispersion of all the measurement values.

The variance of a population of N measurements $x_1, x_2, \ldots, x_N$ having a mean $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

The variance of a sample of n measurements $x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16

The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

Can the sum of deviations from the mean be a good measure of dispersion?

A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0

The sum of deviations is zero for both populations. Therefore it is not a good measure of dispersion, since clearly their dispersion is not equal.

Let us calculate the variance of the two populations:

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead?

After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
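A small sketch makes the point about populations A and B concrete: the deviations from the mean always sum to zero, while the average squared deviation separates the two populations. The `deviations` helper is illustrative; `pvariance` is the standard-library population variance.

```python
from statistics import mean, pvariance

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def deviations(values):
    """Deviation of every measurement from the mean of the set."""
    center = mean(values)
    return [x - center for x in values]


for name, values in (("A", population_a), ("B", population_b)):
    devs = deviations(values)
    avg_squared = sum(d * d for d in devs) / len(values)
    # Columns: sum of deviations, average squared deviation, library result.
    print(name, sum(devs), avg_squared, pvariance(values))
```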

Which data set has a larger dispersion?

[Figure: data set A has five measurements at 1 and five at 3 (mean 2); data set B has one measurement at 1 and one at 5 (mean 3).]

Data set B is more dispersed around the mean.

Let us calculate the sum of squared deviations for both data sets.

$$\text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$$
$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$

Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.

However, when calculated on a "per observation" basis (variance), the dispersions are properly ordered.

$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$

$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$
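The same comparison in code, as a sketch. Set A is taken to be five measurements at 1 and five at 3, which is the only split of ten such values consistent with a mean of 2 and Sum_A = 10; the helper name is illustrative.

```python
set_a = [1] * 5 + [3] * 5   # ten measurements, mean 2
set_b = [1, 5]              # two measurements, mean 3


def sum_squared_deviations(values):
    """Sum of squared deviations from the mean of the set."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values)


for name, values in (("A", set_a), ("B", set_b)):
    total = sum_squared_deviations(values)
    # The total grows with the number of points; dividing by N removes that effect.
    print(name, total, total / len(values))
```

The raw sums rank A above B, while the per-observation figures rank B above A, in line with the picture.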

Example 6
Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10

Solution
Assuming a sample:

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{-2 + 4 + 5 + 6.9 + 10}{5} = \frac{23.9}{5} = 4.78$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{5-1}\left[(-2-4.78)^2 + (4-4.78)^2 + \cdots + (10-4.78)^2\right] = 19.59\ \text{percent}^2$$

Example 6 - solution continued
Assuming a population:

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{1}{5}\left[(-2-4.78)^2 + (4-4.78)^2 + \cdots + (10-4.78)^2\right] = 15.6736\ \text{percent}^2$$
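Example 6 can be verified with the standard library, where `variance` divides by n - 1 (sample) and `pvariance` divides by n (population). This is only a check of the arithmetic above.

```python
from statistics import mean, pvariance, variance

# Annual rates of return (in percent) from Example 6.
returns = [-2, 4, 5, 6.9, 10]

x_bar = mean(returns)
sample_var = variance(returns)        # sum of squared deviations / (n - 1)
population_var = pvariance(returns)   # sum of squared deviations / n

print(round(x_bar, 2))            # 4.78
print(round(sample_var, 2))       # 19.59
print(round(population_var, 4))   # 15.6736
```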

Standard Deviation

The standard deviation of a set of measurements is the square root of the set variance.

Sample standard deviation: $s = \sqrt{s^2}$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Example 7
The daily percentages of defective items in two weeks of production (10 working days) were calculated for two production lines. Which line provides good items more consistently?

Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4

Example 7, Solution
Let us use the Excel printout obtained from the "Descriptive Statistics" sub-menu.

                     Line 1    Line 2
Mean                 16        12
Standard Error       5.295     3.152
Median               14.6      11.75
Mode                 #N/A      #N/A
Standard Deviation   16.74     9.969
Sample Variance      280.3     99.37
Kurtosis             -1.34     -0.46
Skewness             0.217     0.107
Range                49.1      30.6
Minimum              -6.2      -2.8
Maximum              42.9      27.8
Sum                  160       120
Count                10        10

Line 1 should be considered less consistent because the standard deviation of its defective proportion is larger (and therefore the standard deviation of its good item proportion is also larger).

Interpreting the Standard Deviation

The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.

When describing the shape of a distribution we refer to
• A distribution with any shape
• A mound shaped distribution

The Empirical Rule - Describing a Mound Shaped Data Set

If a sample of measurements has a mound-shaped distribution, the interval
• $(\bar{x} - s,\ \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s,\ \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s,\ \bar{x} + 3s)$ contains approximately 99.7% of the measurements

Example 10
Describe the set of data provided in Data 10 using numerical descriptive measures.

[Figure: histogram of the measurements; bins 17, 17.4, 17.8, 18.2, 18.6, More; frequency axis from 0 to 15.]

Solution
From the histogram it appears that the distribution is approximately mound shaped. We'll use the empirical rule to describe the data.

Example 10 - solution continued
Running the Descriptive Statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556

From the empirical rule we get:
• Approximately 68% of the data lie between 17.403 and 18.515 [17.959-1(.556), 17.959+1(.556)]. Actual count: 19 (73%)
• Approximately 95% of the data lie between 16.847 and 19.071 [17.959-2(.556), 17.959+2(.556)]. Actual count: 25 (96%)
• Approximately 99.7% of the data lie between 16.291 and 19.627 [17.959-3(.556), 17.959+3(.556)]. Actual count: 26 (100%)
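The three empirical-rule intervals follow mechanically from the mean and standard deviation reported for Example 10. A minimal sketch (the function name is illustrative, and the raw data of Data 10 are not used here):

```python
mean_value = 17.959   # sample mean from the Excel output
std_dev = 0.556       # sample standard deviation from the Excel output


def interval(center, spread, k):
    """Interval reaching k standard deviations on either side of the mean."""
    return (center - k * spread, center + k * spread)


for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = interval(mean_value, std_dev, k)
    print(f"about {share} of the data between {low:.3f} and {high:.3f}")
```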

The Chebyshev Theorem - Describing Any Data Set

The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for any k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!

k   Interval                                Minimum %
2   $(\bar{x} - 2s,\ \bar{x} + 2s)$         at least 75%   $(1 - 1/2^2)$
3   $(\bar{x} - 3s,\ \bar{x} + 3s)$         at least 89%   $(1 - 1/3^2)$
4   $(\bar{x} - 4s,\ \bar{x} + 4s)$         at least 94%   $(1 - 1/4^2)$

Example 9
Employee salaries were recorded and a histogram was created. Describe this data using the correct numerical measures.

[Figure: histogram of Salary; bins 155, 200, 245, 290, 335, 380, 425; frequency axis from 0 to 20.]

Solution
Creating the histogram we realize that the distribution is positively skewed. The Chebyshev Theorem needs to be used to describe the data.

Example 9 - solution continued
From Excel we have:
Mean = 243.2
Standard deviation = 58.354

Applying the Chebyshev Theorem:
• At least 75% of the salaries lie within [243.2-2(58.354), 243.2+2(58.354)] = [126.492, 359.908]. Actual count: 39 (97.5%)
• At least 88.9% of the salaries lie within [243.2-3(58.354), 243.2+3(58.354)] = [68.138, 418.262]. Actual count: All (100%)
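The Chebyshev bound and the salary intervals of Example 9 can be reproduced as follows. This is a sketch: `chebyshev_bound` is a hypothetical helper, and only the mean and standard deviation quoted from Excel are used, not the salary data.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


mean_salary = 243.2
sd_salary = 58.354

for k in (2, 3, 4):
    low = mean_salary - k * sd_salary
    high = mean_salary + k * sd_salary
    print(f"k={k}: at least {chebyshev_bound(k):.1%} within [{low:.3f}, {high:.3f}]")
```

The bound is a guaranteed minimum, which is why the actual counts in Example 9 (97.5% and 100%) are well above it.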

The Coefficient of Variation

The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

Sample coefficient of variation: $cv = \dfrac{s}{\bar{x}}$

Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$

This coefficient provides a proportionate measure of variation.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only as moderately large when the mean value is 500.
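The remark about a standard deviation of 10 can be put in numbers. A minimal sketch with an illustrative function name:

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a proportion of the mean."""
    return std_dev / mean_value


print(coefficient_of_variation(10, 100))  # 0.1, i.e. 10% of the mean
print(coefficient_of_variation(10, 500))  # 0.02, i.e. 2% of the mean
```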


4.3 Measures of Relative Location and Box Plots

Additional information on the general shape of a data set can be obtained by describing the relative location of 5 values within the data set.

We use percentiles to describe these 5 relative locations. What is a percentile?

Percentile
The pth percentile of a set of measurements is the value for which
• At most p% of the measurements are less than that value
• At most (100-p)% of all the measurements are greater than that value.

Example
Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

Here are two possible approaches commonly used to describe a set of values.

The five number summary:
• Smallest value
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Largest value

- OR -
• The first decile (the 10th percentile)
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• The ninth decile (90th percentile)

A demonstration of commonly used percentiles

• First (lower) decile = 10th percentile: 10% of the measurements lie below it and 90% lie above it.
• First (lower) quartile, Q1, = 25th percentile: 25% lie below it and 75% lie above it.
• Median = 50th percentile: 50% lie below it and 50% lie above it.
• Third quartile, Q3, = 75th percentile
• Ninth (upper) decile = 90th percentile

And so on…

Determining Percentiles and their Location

There are two general cases to consider:
• The percentile is a member of the data set.
• The percentile is not a member of the data set; it falls in between two values of the data set.

Let us demonstrate the two cases with two examples.

Example 11
Find the quartiles for the data set of flight lateness presented in example 4.5.
Data: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Example 11 - Solution
Sort the measurements (10 measurements):
2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

The first quartile:
• At most (.25)(10) = 2.5 measurements should appear below the first quartile. Check the smallest 2 measurements on the left hand side.
• At most (.75)(10) = 7.5 measurements should appear above the first quartile. Check the largest 7 measurements on the right hand side.

The one measurement left between these two groups, 5.2, is the first quartile.

Example 11 - solution continued
The second quartile (Median):
• At most (.5)(10) = 5 numbers lie below and above Q2
• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

Q2 = (8.3 + 20.9)/2 = 14.6

Example 11 - solution continued
The third quartile:
• At most (.75)10 = 7.5 numbers lie below Q3
• At most (.25)10 = 2.5 numbers lie above Q3
• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

Q3 = 30.05 (7 numbers lie below it and 2 lie above it)

Example 12
Find the 20th percentile for the data set of flight lateness presented in example 11.

Solution
Following the procedure applied to the previous example,
• At most (.20)10 = 2 numbers should fall below the 20th percentile.
• At most (.80)10 = 8 numbers should fall above the 20th percentile.
• The sorted data set is: 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
• From the sorted data set we see that every number greater than 3.1 and smaller than 5.2 meets these two conditions.
• We show next how to determine the location and value of a percentile whose value is not one of the data set points.


Find the location of any percentile using the formula

$L_P = (n + 1)\frac{P}{100}$

where $L_P$ is the location of the $P^{th}$ percentile.

Determining Percentiles and their Location


Example 12 – solution continued

Finding the location of the 20th percentile:

$L_{20} = (n + 1)\frac{P}{100} = (10 + 1)\frac{20}{100} = 2.2$

Finding the value of the 20th percentile:

2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

The 20th percentile is located at location 2.2, that is, at .2 of the distance from 3.1 (location 2) to 5.2 (location 3). Therefore,

P20 = 3.1 + .2(5.2 – 3.1) = 3.52

Determining Percentiles and their Location
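The location rule and the interpolation step can be sketched in a few lines of Python. This is only an illustration of the (n + 1)P/100 rule used on these slides: the function name `percentile` and the clamping of locations that fall outside the data are assumptions made for the sketch, and statistical packages often use other percentile conventions.

```python
import math

def percentile(sorted_values, p):
    """Return the p-th percentile of an ascending list using the
    location rule L_P = (n + 1) * P / 100 with linear interpolation."""
    n = len(sorted_values)
    location = (n + 1) * p / 100          # 1-based, possibly fractional
    lower = math.floor(location)          # whole part of the location
    fraction = location - lower           # distance toward the next value
    # Assumption for this sketch: clamp locations outside 1..n.
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]      # value at location `lower`
    above = sorted_values[lower]          # value at location `lower + 1`
    return below + fraction * (above - below)

# Flight lateness data from Example 11, already sorted.
lateness = [2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9]

# The median sits at location 5.5, halfway between 8.3 and 20.9.
print(round(percentile(lateness, 50), 3))
print(round(percentile(lateness, 20), 3))
```

The same call with p = 50 gives the median found in Example 11, which is a quick check that the rule and the interpolation agree with the hand calculation.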


Quartiles and Variability

Quartiles can provide an idea about the shape of a histogram.

[Figure: positively skewed histogram with Q1, Q2 and Q3 marked]

[Figure: negatively skewed histogram with Q1, Q2 and Q3 marked]


Inter-quartile Range

Interquartile range = Q3 – Q1

This is a measure of the spread of the middle 50% of the observations.

A large value indicates a large spread of the observations.


Box Plot

A box plot is a pictorial display that provides the main descriptive measures of the measurement set:

• L - The largest measurement
• Q3 - The upper quartile
• Q2 - The median
• Q1 - The lower quartile
• S - The smallest measurement

[Figure: box plot with S, Q1, Q2, Q3 and L marked; each whisker extends at most 1.5(Q3 – Q1) from the box]

An outlier is defined as any value that is more than 1.5(Q3 – Q1) away from the box.


Box Plot

Example 13 Create a box plot for the data regarding the GMAT scores of 200 applicants (see Data13.xls).

GMAT: 512, 531, 461, 515, ...

Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )

[Figure: box plot with smallest = 449, Q1 = 512, median = 537, Q3 = 575 and largest = 788; lower limit 512 – 1.5(IQR) = 417.5, upper limit 575 + 1.5(IQR) = 669.5]
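The outlier rule is simple arithmetic on the quartiles, and a short Python sketch makes the limits explicit. The function names `outlier_limits` and `find_outliers` are hypothetical, and the short `scores` list contains only a handful of the GMAT values named on the slide, not the full file of 200 scores.

```python
def outlier_limits(q1, q3):
    """Return the lower and upper limits: 1.5 * IQR away from the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the two limits."""
    low, high = outlier_limits(q1, q3)
    return [v for v in values if v < low or v > high]

# Quartiles reported in Example 13.
low, high = outlier_limits(512, 575)
print(low, high)

# A few scores that appear on the slide (not the complete data set).
scores = [512, 531, 461, 515, 449, 675, 788]
print(find_outliers(scores, 512, 575))
```

With Q1 = 512 and Q3 = 575 the limits are 417.5 and 669.5, so the smallest score (449) is not an outlier while every score above 669.5 is.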


Box Plot

Example 13 - continued

Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.

[Figure: box plot with Q1 = 512, Q2 = 537 and Q3 = 575; whiskers from 449 to 669.5; 25% of the scores lie below the box, 50% inside it and 25% above it]


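Box Plot

Example 13 - continued

The data set is positively skewed.

[Figure: box plot with Q1 = 512, Q2 = 537 and Q3 = 575; whiskers from 449 to 669.5; 25% of the scores lie below the box, 50% inside it and 25% above it]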


4.4 Measures of Linear Relationship

The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.

• The Covariance answers the question: Is there any pattern to the way two variables move together?
• The Correlation Coefficient answers the question: How strong is the linear relationship between two variables?


Covariance

Population covariance:

$COV(X,Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.

Sample covariance:

$cov(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$

$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.
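The two definitions differ only in the divisor, which a small Python sketch can make concrete. The function names and the four-point data set below are made up for illustration and do not come from the slides.

```python
def population_covariance(xs, ys):
    """Average product of deviations from the means, divided by N."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def sample_covariance(xs, ys):
    """Same sum of products, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: y rises with x, so the covariance is positive.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(population_covariance(xs, ys))   # sum of products is 10, divided by 4
print(sample_covariance(xs, ys))       # sum of products is 10, divided by 3

# Reversing y makes the variables move in opposite directions.
print(sample_covariance(xs, ys[::-1]))
```

The sign of the result shows the direction of the relationship. The magnitude depends on the units of X and Y, which is why the coefficient of correlation is used to judge strength.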


Covariance

If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.

[Figure: scatter plot of Y against X]


Covariance

If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.

[Figure: scatter plot of Y against X]


Covariance

If the two variables are unrelated, the covariance will be close to zero.

[Figure: scatter plot of Y against X]


The coefficient of correlation

The coefficient of correlation measures the strength of the linear relationship between two variables.

Population coefficient of correlation:

$\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$

Sample coefficient of correlation:

$r = \frac{cov(X,Y)}{s_x s_y}$


If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

The coefficient of correlation


The coefficient of correlation

If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).


A weak linear relationship is indicated by a coefficient close to zero.

Also, a non-linear relationship can produce a coefficient close to zero, because the coefficient measures only the linear part of the relationship.

The coefficient of correlation
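A small Python sketch can illustrate why a perfectly predictable but non-linear relationship may still show no linear correlation. The function name `correlation` and both data sets are made up for this illustration. The function uses sums of deviations directly, because the (n - 1) divisors in the covariance and in the two standard deviations cancel.

```python
import math

def correlation(xs, ys):
    """Sample coefficient of correlation r = cov(x, y) / (s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]

# Hypothetical straight-line data: y = 2x + 1.
line = [2 * x + 1 for x in xs]
r_line = correlation(xs, line)

# Hypothetical curved data: y = x squared, symmetric around x = 0.
curve = [x * x for x in xs]
r_curve = correlation(xs, curve)

print(r_line, r_curve)
```

For the straight line r is 1, while for the symmetric parabola the positive and negative products of deviations cancel and r is 0, even though y is completely determined by x.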


The coefficient of correlation and the covariance

Example 14 Compute the covariance and the coefficient of correlation to measure how car speed (miles per hour) and gas consumption (miles per gallon) are related to one another (see data next).

Solution We believe speed affects gas consumption. Thus
• Speed is labeled X
• Miles per gallon is labeled Y


The coefficient of correlation and the covariance

Example 14 – solution continued

Shortcut formulas:

$cov(x,y) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$

$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$

Car     x      y       x^2      y^2       xy
1       15     7.1     225      50.41     106.5
2       35     15.5    1225     240.25    542.5
3       35     18.5    1225     342.25    647.5
4       40     19.7    1600     388.09    788
5       40     22.4    1600     501.76    896
6       45     21.3    2025     453.69    958.5
7       45     22.8    2025     519.84    1026
8       45     23.1    2025     533.61    1039.5
9       50     22.8    2500     519.84    1140
10      50     21.3    2500     453.69    1065
Total   400    194.5   16950    4003.43   8209.5

Using the shortcut formula:

$cov(x,y) = \frac{1}{10-1}\left[8209.5 - \frac{(400)(194.5)}{10}\right] = 47.7$


To calculate the coefficient of correlation we first compute the standard deviation of X and Y. From the shortcut formula we have:

$s_x = \sqrt{\frac{1}{10-1}\left[16950 - \frac{400^2}{10}\right]} = 10.27$

$s_y = \sqrt{\frac{1}{10-1}\left[4003.43 - \frac{194.5^2}{10}\right]} = 4.948$


The coefficient of correlation and the covariance

Example 14 – solution continued

$r = \frac{cov(X,Y)}{s_x s_y} = \frac{47.7}{(10.27)(4.948)} = .9387$

Interpretation: Speed and mileage per gallon are strongly positively linearly related for the speed range of 15 to 50 miles per hour.
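The whole of Example 14 can be checked with a short Python sketch that applies the shortcut formulas to the ten (speed, miles per gallon) pairs from the table. The function names `shortcut_covariance` and `shortcut_std` are hypothetical. The data are copied from the slide.

```python
import math

# Speed (x, miles per hour) and gas consumption (y, miles per gallon)
# for the ten cars in Example 14.
speed = [15, 35, 35, 40, 40, 45, 45, 45, 50, 50]
mpg = [7.1, 15.5, 18.5, 19.7, 22.4, 21.3, 22.8, 23.1, 22.8, 21.3]

def shortcut_covariance(xs, ys):
    """cov = [sum(xy) - sum(x) * sum(y) / n] / (n - 1)."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)

def shortcut_std(xs):
    """s = sqrt([sum(x^2) - (sum(x))^2 / n] / (n - 1))."""
    n = len(xs)
    sum_sq = sum(x * x for x in xs)
    return math.sqrt((sum_sq - sum(xs) ** 2 / n) / (n - 1))

cov_xy = shortcut_covariance(speed, mpg)
s_x = shortcut_std(speed)
s_y = shortcut_std(mpg)
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 3), round(r, 4))
```

Working from the raw data rather than from rounded intermediate values avoids carrying rounding error into r.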