MEASURES OF VARIABILITY · Measures of central tendency alone cannot completely characterize a set...

Preview:

Citation preview

MEASURES OF

VARIABILITY

Variance and Standard Deviation

Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency.

Measures of variability or dispersion are used to describe the spread, or variability, of a distribution

Common measures of dispersion: range, variance, and standard deviation

Measures of Variability or

Dispersion

2

Why Study Dispersion?

For example, if your nature guide told you that

the river ahead averaged 3 feet in depth,

would you want to wade across on foot without

additional information? Probably not. You

would want to know something about the

variation in the depth.

A second reason for studying the dispersion

in a set of data is to compare the spread in two

or more distributions.

3

Observe two hypothetical

data sets:

The average value provides

a good representation of the

observations in the data set.

Small variability

This data set is now

changing to...

Why Study Dispersion?

4

Observe two hypothetical

data sets:

The average value provides

a good representation of the

observations in the data set.

Small variability

Larger variability

The same average value does not

provide as good representation of the

observations in the data set as before.

Why Study Dispersion?

5

Samples of Dispersions

6

Measures of Dispersion

Range

Mean Deviation

Variance and

Standard Deviation

7

EXAMPLE – Range

The number of cappuccinos sold at the Starbucks location in the Orange Country Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.

Range = Largest – Smallest value

= 80 – 20 = 60

8

range H L

Range: The difference in value between the highest-

valued (H) and the lowest-valued (L) pieces of data:

Example: Consider the sample {12, 23, 17, 15, 18}.

Find the range

Solutions:

range H L 23 12 11

Range

9

Skor (X)

Kekerapan (f )

fX

50

1

50

58

1

58

63

2

126

65

3

195

67

2

134

74

5

370

78

2

156

80

2

160

86

1

86

89

1

89

N = f = 20

fX= 1424

Ungrouped Data

10

Range for Ungrouped Data

R = H – L H = highest value

L = lowest value

R = 89 – 50 = 39

Range for Ungrouped Data

11

For Grouped Data Class

Interval

Class Limit Class Mid-

point (m)

Frequency (less

than UCL) (f)

mf

50 – 54

49.5 – 54.5

52

1

52

55 – 59

54.5 – 59.5

57

1

57

60 – 64

59.5 – 64.5

62

2

124

65 – 69

64.5 – 69.5

67

5

335

70 – 74

69.5 – 74.5

72

5

360

75 – 79

74.5 – 79.5

77

2

154

80 – 85

79.5 – 84.5

82

2

164

85 – 89

84.5 – 89.5

87

2

174

N = f = 20

mf =

1420

12

Range for Grouped Data

For Grouped Data in a Frequency Distribution

Range = UCL of Highest Class – LCL of Lowest Class

Range = 89.5 – 49.5 = 40

Range for Grouped Data

13

The range of a set of observations is the

difference between the largest and

smallest observations.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to

provide information on the dispersion of the

observations between the two end points.

? ? ?

But, how do all the observations spread out?

Smallest

observation

Largest

observation

The range cannot assist in answering this question Range

Range

14

Semi Inter-Quartile Range

There are three quartiles:

First Quartile, Q1, at the 25% frequency

Second Quartile, Q2, at the 50% frequency, equals the median

Third Quartile, Q3, at the 75% frequency

Semi Inter-Quartile Range, Q = ( Q3 - Q1 )/2

Semi Inter-Quartile Range

15

Semi Inter-Quartile Range

Semi Inter-Quartile Range, Q = ( Q3 - Q1 )/2

Q = (77.0 – 65.5) = 5.75

2

Semi Inter-Quartile Range

16

• Other measures of dispersion are based on the following

quantity

Deviation from the Mean: A deviation from the

mean, ,is the difference between the value of x and

the mean .

x xx

Deviation from the Mean

17

Example: Consider the sample {12, 23, 17, 15, 18}.

Find each deviation from the mean.

Solutions:

-5

6

0

-2

1

Data Deviation from Mean

_________________________

12

23

17

15

18

x x x

x 1

512 23 17 15 18 17( )

Deviation from the Mean

18

Consider two small populations:

10 9 8

7 4 10

11 12

13 16

8-10= -2

9-10= -1

11-10= +1

12-10= +2

4-10 = - 6

7-10 = -3

13-10 = +3

16-10 = +6

Sum = 0

Sum = 0

The mean of both

populations is 10...

…but measurements in B

are more dispersed

then those in A.

A measure of dispersion

Should agrees with this

observation.

Can the sum of deviations

Be a good measure of dispersion?

A

B

The sum of deviations is

zero for both populations,

therefore, is not a good

measure of dispersion.

Why not use the sum of deviations

from the mean?

19

Mean Deviation

The number of cappuccinos sold at the Starbucks location in the Orange Country Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.

20

This measure reflects the dispersion of all the

observations

The variance of a population of size N x1, x2,…,xN

whose mean is m is defined as

The variance of a sample of n observations

x1, x2, …,xn whose mean is is defined as x

N

)x( 2i

N1i2 m

1n

)xx(s

2i

n1i2

The Variance

21

Population Variance, 2 = (x - m)2 ,

N

m = population mean

N = population size

Population Standard Deviation, = 2

Population Variance

22

Sample Variance: The sample variance, s2, is the mean

of the squared deviations, calculated using n 1 as the

divisor:

s n

x x 2 2 1

1

( ) where n is the sample size

s xn

2

1

SS( ) ( ) SS ( ) ( ) x x x x

n x 2 2 2 1

Note: The numerator for the sample variance is called the

sum of squares for x, denoted SS(x):

where

Sample Variance

23

Let us calculate the variance of the two populations

185

)1016()1013()1010()107()104( 222222B

25

)1012()1011()1010()109()108( 222222A

Why is the variance defined as

the average squared deviation?

Why not use the sum of squared

deviations as a measure of

variation instead?

After all, the sum of squared

deviations increases in

magnitude when the variation

of a data set increases!!

Population Variance – Example

24

Example:

The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance

Solution:

2

2222

in

1i2

jobs2.33

)1413...()1415()1417(16

1

1n

)xx(s

jobs146

84

6

1397231517

6

xx

i6

1i

Sample Variance – Example

25

( )

1

2

2

2

n

n

xx

s

The shortcut formula for the sample variance:

Sample Variance – Shortcut

Formula

26

( ) ( )

2

2222

2i

n1i2

i

n

1i

2

jobs2.33

6

13...151713...1517

16

1

n

)x(x

1n

1s

Sample Variance – Shortcut

Formula

27

Sample Variance – Example 2

The hourly wages for a sample of part-time employees at Home Depot are: $12, $20, $16, $18, and $19. What is the sample variance?

28

The standard deviation of a set of

observations is the positive square root of

the variance .

2

2

:deviationandardstPopulation

ss:deviationstandardSample

The unit of measure for the standard deviation

is the same as the unit of measure for the data

Standard Deviation

29

Example: Find the (1) variance and (2) standard

deviation for the data {5, 7, 1, 3, 8}:

2 . 8 ) 8 . 32 ( 4

1 2 s 1) 86 . 2 2 . 8 s 2)

x x x ( ) x x 2

0.2

2.2

-3.8

-1.8

3.2

0.04

4.84

14.44

3.24

10.24

5

7

1

3

8

24 0 32.08 Sum:

Solutions:

x 1 5

5 7 1 3 8 4 8 ( ) . First:

Standard Deviation

30

The number of traffic citations issued during the last

five months in Beaufort County, South Carolina, is 38,

26, 13, 41, and 22. What is the population variance?

EXAMPLE – Variance and

Standard Deviation

31

EXAMPLE – Sample Variance

The hourly

wages for a

sample of part-

time employees

at Home Depot

are: $12, $20,

$16, $18, and

$19. What is

the sample

variance?

32

If the data is given in the form of a

frequency distribution, we need to make

a few changes to the formulas for the

mean, variance, and standard deviation

Complete the extension table in order to

find these summary statistics

Calculation of Mean, Variance and

Standard Deviation

33

Population variance, 2 =

Sample variance, s2 =

Population standard deviation, =

Sample standard deviation, s =

Ν

Ν

ΧΧ

22 )(

1

)( 22

n

n

ΧΧ

2

2s

Variance and Standard Deviation

for Ungrouped Data

34

X

X2

50

58

63

63

65

65

65

67

67

74

74

74

74

74

78

78

80

80

86

89

2 500

3 364

3 969

3 969

4 225

4 225

4 225

4 489

4 489

5 476

5 476

5 476

5 476

5 476

6 084

6 084

6 400

6 400

7 396

7 921

X = 1 424

X2 = 103 120

X2 = 103120; (X)2 = 2 027 776

2 =

= 86.56

= = 9.30

2 027 776103120

20

20

56.86

For Ungrouped Data

35

Example: A survey of students in the first grade at a local school asked for the number of brothers and/or sisters for each child. The results are summarized in the table below. Find (1) the mean, (2) the variance, and (3) the standard deviation:

s2

2239 93

6262 1

163

( )

.2) s 163 128. .3) x 93 62 15/ .1)

Solutions:

First

: 17

23

5

2

17

46

20

10

1

2

4

5

62 93 Sum:

15 0 0

x f xf x f2

17

92

80

50

239

0

For a Frequency Distribution

36

Standard Deviation of Grouped

Data

37

Standard Deviation of Grouped

Data - Example Refer to the frequency distribution for the Whitner

Autoplex data used earlier. Compute the standard deviation of the vehicle selling prices

38

The standard deviation can be used to

compare the variability of several distributions

make a statement about the general shape of a distribution.

The empirical rule: If a sample of observations has a mound-shaped distribution, the interval

tsmeasuremen the of 68%ely approximat contains )sx,sx(

tsmeasuremen the of 95%ely approximat contains )s2x,s2x(

tsmeasuremen the of 99.7%ely approximat contains )s3x,s3x(

Interpreting Standard Deviation

39

Standard deviation is a measure of

variability, or spread

Two rules for describing data rely on the

standard deviation:

Empirical rule: applies to a variable that is

normally distributed

Chebyshev’s theorem: applies to any

distribution

Interpreting Standard Deviation

40

The Empirical Rule

41

Notes:

The empirical rule is more informative than Chebyshev’s theorem since we know more about the distribution (normally distributed)

Also applies to populations

Can be used to determine if a distribution is normally distributed

1. Approximately 68% of the observations lie within 1

standard deviation of the mean

2. Approximately 95% of the observations lie within 2

standard deviations of the mean

3. Approximately 99.7% of the observations lie within 3

standard deviations of the mean

Empirical Rule: If a variable is normally distributed, then:

The Empirical Rule

42

xx s x sx s2 x s2x s3 x s3

68%

95%

99.7%

The Empirical Rule

43

1) What percentage of weights fall between 5.7 and 7.3?

2) What percentage of weights fall above 7.7?

Example: A random sample of plum tomatoes was selected from a local grocery store and their weights recorded. The mean weight was 6.5 ounces with a standard deviation of 0.4 ounces. If the weights are normally distributed:

Solutions:

( , ) ( . (0. ), . (0. )) ( . , . ) x s x s 2 2 6 5 2 4 6 5 2 4 5 7 7 3 Approximat ely 95% of the weigh ts fall be tween 5.7 and 7.3

1)

( , ) ( . (0. ), . (0. )) ( . , . ) x s x s 3 3 6 5 3 4 6 5 3 4 5 3 7 7 Approximat ely 99.7% of the wei ghts fall between 5. 3 and 7.7 Approximat ely 0.3% of the weigh ts fall out side (5.3, 7.7) Approximat ely (0.3 / 2) =0.15% of t he weights fall abov e 7.7

2)

Empirical Rule – Example

44

1. Find the mean and standard deviation for the data

2. Compute the actual proportion of data within 1, 2, and 3 standard deviations from the mean

3. Compare these actual proportions with those given by the empirical rule

4. If the proportions found are reasonably close to those of the empirical rule, then the data is approximately normally distributed

Note: The empirical rule may be used to determine whether or not a set of data is approximately normally distributed

Empirical Rule

45

Example: A statistics practitioner wants

to describe the way returns on

investment are distributed.

The mean return = 10%

The standard deviation of the return = 8%

The histogram is bell shaped.

More Examples

46

Example: – solution

The empirical rule can be applied (bell shaped histogram)

Describing the return distribution Approximately 68% of the returns lie between 2% and

18% [10 – 1(8), 10 + 1(8)]

Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]

Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

More Examples

47

Chebyshev’s Theorem

The arithmetic mean biweekly amount contributed by the Dupree Paint employees to the company’s profit-sharing plan is $51.54, and the standard deviation is $7.51. At least what percent of the contributions lie within plus 3.5 standard deviations and minus 3.5 standard deviations of the mean?

48

Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least 1 (1/k2), where k is any positive number larger than 1. This theorem applies to all distributions of data.

at least

1 12

k

xx ks x ks

Illustration:

Chebyshev’s Theorem

49

The proportion of observations in any sample that lie within

k standard deviations of the mean is at least

1-1/k2 for k > 1.

This theorem is valid for any set of measurements (sample,

population) of any shape!!

K Interval Chebyshev Empirical Rule

1 at least 0% approximately 68%

2 at least 75% approximately 95%

3 at least 89% approximately 99.7%

s2x,s2x

sx,sx

s3x,s3x

(1-1/12)

(1-1/22)

(1-1/32)

Chebyshev’s Theorem

50

Chebyshev’s theorem is very conservative and

holds for any distribution of data

Chebyshev’s theorem also applies to any

population

The two most common values used to describe a

distribution of data are k = 2, 3

1.7 2 2.5 3

0.65 0.75 0.84 0.89

k

1 1 2( / )k

The table below lists some values for k and 1 - (1/k2):

Chebyshev’s Theorem

51

Example: At the close of trading, a random sample of 35 technology stocks was selected. The mean selling price was 67.75 and the standard deviation was 12.3. Use Chebyshev’s theorem (with k = 2, 3) to describe the distribution.

)104.65 ,85.30()3.12(375.67 ),3.12(375.67()3 ,3( sxsx

Using k=3: At least 89% of the observations lie within 3

standard deviations of the mean:

Solutions:

)35.92 ,15.43()3.12(275.67 ),3.12(275.67()2 ,2( sxsx

Using k=2: At least 75% of the observations lie within 2 standard

deviations of the mean:

Chebyshev’s Theorem – Example

52

Example: The annual salaries of the employees of a chain of computer

stores produced a positively skewed histogram. The mean and

standard deviation are $28,000 and $3,000, respectively. What

can you say about the salaries at this chain?

Solution:

At least 75% of the salaries lie between $22,000 and $34,000

28000 – 2(3000) 28000 + 2(3000)

At least 88.9% of the salaries lie between $19,000 and $37,000

28000 – 3(3000) 28000 + 3(3000)

Chebyshev’s Theorem – Example

53

Normal Distribution

mean

median

mode

Frequency

X

Normal Distribution

54

Normal Distribution

Frequency

X

Normal Distribution

55

Recommended