MEASURES OF VARIABILITY Variance and Standard Deviation

MEASURES OF VARIABILITY · Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency


Measures of Variability or Dispersion

Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency.

Measures of variability or dispersion are used to describe the spread, or variability, of a distribution.

Common measures of dispersion: range, variance, and standard deviation.

Why Study Dispersion?

For example, if your nature guide told you that the river ahead averaged 3 feet in depth, would you want to wade across on foot without additional information? Probably not. You would want to know something about the variation in the depth.

A second reason for studying the dispersion in a set of data is to compare the spread in two or more distributions.


Why Study Dispersion?

Observe two hypothetical data sets:

Small variability: The average value provides a good representation of the observations in the data set.

Larger variability: The same average value does not provide as good a representation of the observations in the data set as before.

Samples of Dispersions


Measures of Dispersion

  • Range
  • Mean Deviation
  • Variance and Standard Deviation

EXAMPLE – Range

The numbers of cappuccinos sold at the Starbucks location in the Orange County Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the range for the number of cappuccinos sold.

Range = Largest value – Smallest value = 80 – 20 = 60

Range

Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data:

range = H – L

Example: Consider the sample {12, 23, 17, 15, 18}. Find the range.

Solution:

range = H – L = 23 – 12 = 11
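The range calculation can be sketched in a few lines of Python. The helper name `value_range` is illustrative only and does not come from the slides; the two data sets are the sample and the cappuccino sales used in the range examples.

```python
def value_range(data):
    """Return the range: highest value (H) minus lowest value (L)."""
    return max(data) - min(data)

sample = [12, 23, 17, 15, 18]
cappuccinos = [20, 40, 50, 60, 80]

print(value_range(sample))       # 23 - 12
print(value_range(cappuccinos))  # 80 - 20
```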

Ungrouped Data

Score (X)   Frequency (f)   fX
50          1               50
58          1               58
63          2               126
65          3               195
67          2               134
74          5               370
78          2               156
80          2               160
86          1               86
89          1               89

N = Σf = 20      ΣfX = 1424

Range for Ungrouped Data

R = H – L, where H = highest value and L = lowest value

R = 89 – 50 = 39

For Grouped Data

Class Interval   Class Limit    Class Midpoint (m)   Frequency (less than UCL) (f)   mf
50 – 54          49.5 – 54.5    52                   1                               52
55 – 59          54.5 – 59.5    57                   1                               57
60 – 64          59.5 – 64.5    62                   2                               124
65 – 69          64.5 – 69.5    67                   5                               335
70 – 74          69.5 – 74.5    72                   5                               360
75 – 79          74.5 – 79.5    77                   2                               154
80 – 84          79.5 – 84.5    82                   2                               164
85 – 89          84.5 – 89.5    87                   2                               174

N = Σf = 20      Σmf = 1420

Range for Grouped Data

For grouped data in a frequency distribution:

Range = UCL of highest class – LCL of lowest class

(UCL = upper class limit, LCL = lower class limit)

Range = 89.5 – 49.5 = 40

Range

The range of a set of observations is the difference between the largest and smallest observations.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

[Figure: a line running from the smallest observation to the largest observation, labeled "Range", with "? ? ?" marking the unknown spread in between.]

But how do all the observations spread out? The range cannot assist in answering this question.

Semi Inter-Quartile Range

There are three quartiles:

  • First Quartile, Q1, at the 25% frequency
  • Second Quartile, Q2, at the 50% frequency, equals the median
  • Third Quartile, Q3, at the 75% frequency

Semi Inter-Quartile Range, Q = (Q3 – Q1)/2

Semi Inter-Quartile Range

Semi Inter-Quartile Range, Q = (Q3 – Q1)/2

Q = (77.0 – 65.5)/2 = 5.75
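The slides do not show how Q1 = 65.5 and Q3 = 77.0 were obtained. One common approach, assumed here, is linear interpolation inside the class that contains the quartile, using the class boundaries and frequencies from the grouped-data table. The function name `grouped_quantile` is hypothetical. This sketch reproduces the quoted values under that assumption.

```python
# Lower class boundaries and frequencies from the grouped-data table.
lower_bounds = [49.5, 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5]
freqs = [1, 1, 2, 5, 5, 2, 2, 2]
width = 5.0  # every class spans 5 units between its boundaries

def grouped_quantile(p):
    """Interpolate the value below which a fraction p of the data falls."""
    target = p * sum(freqs)
    cumulative = 0
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= target:
            # Assume observations are spread evenly within the class.
            return lower + (target - cumulative) / f * width
        cumulative += f
    return lower_bounds[-1] + width

q1 = grouped_quantile(0.25)
q3 = grouped_quantile(0.75)
semi_iqr = (q3 - q1) / 2
print(q1, q3, semi_iqr)
```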

Deviation from the Mean

  • Other measures of dispersion are based on the following quantity.

Deviation from the Mean: A deviation from the mean, $x - \bar{x}$, is the difference between the value of $x$ and the mean $\bar{x}$.

Deviation from the Mean

Example: Consider the sample {12, 23, 17, 15, 18}. Find each deviation from the mean.

Solution:

$\bar{x} = \frac{1}{5}(12 + 23 + 17 + 15 + 18) = 17$

Data (x)   Deviation from Mean (x – x̄)
12         –5
23         6
17         0
15         –2
18         1

Why not use the sum of deviations from the mean?

Consider two small populations:

A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16

The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

Can the sum of deviations be a good measure of dispersion?

Deviations in A: 8 – 10 = –2, 9 – 10 = –1, 10 – 10 = 0, 11 – 10 = +1, 12 – 10 = +2. Sum = 0
Deviations in B: 4 – 10 = –6, 7 – 10 = –3, 10 – 10 = 0, 13 – 10 = +3, 16 – 10 = +6. Sum = 0

The sum of deviations is zero for both populations and is therefore not a good measure of dispersion.
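A short Python check of this point, using the two populations A and B. The function name `deviations` is illustrative. The deviations differ in size between the two populations, yet both sums come out to zero.

```python
population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

def deviations(data):
    """Return each value's deviation from the mean of the data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

dev_a = deviations(population_a)
dev_b = deviations(population_b)

print(dev_a, sum(dev_a))
print(dev_b, sum(dev_b))
```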

Mean Deviation

The numbers of cappuccinos sold at the Starbucks location in the Orange County Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.
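The slide poses the question without a worked solution. Taking the mean deviation to be the average of the absolute deviations from the mean (the usual textbook definition, assumed here), the calculation can be sketched as follows. Variable names are illustrative.

```python
cappuccinos = [20, 40, 50, 60, 80]

mean = sum(cappuccinos) / len(cappuccinos)

# Absolute deviations avoid the cancellation that makes the plain sum zero.
abs_devs = [abs(x - mean) for x in cappuccinos]
mean_deviation = sum(abs_devs) / len(cappuccinos)

print(mean, abs_devs, mean_deviation)
```

Under that definition the daily sales deviate from the mean of 50 by 16 cappuccinos on average.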

The Variance

This measure reflects the dispersion of all the observations.

The variance of a population of size $N$, $x_1, x_2, \ldots, x_N$, whose mean is $\mu$, is defined as

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

The variance of a sample of $n$ observations $x_1, x_2, \ldots, x_n$, whose mean is $\bar{x}$, is defined as

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Population Variance

Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$, where $\mu$ = population mean and $N$ = population size

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample Variance

Sample Variance: The sample variance, $s^2$, is the mean of the squared deviations, calculated using $n - 1$ as the divisor:

$s^2 = \frac{1}{n - 1} \sum (x - \bar{x})^2$, where $n$ is the sample size

Note: The numerator for the sample variance is called the sum of squares for x, denoted SS(x):

$s^2 = \frac{\text{SS}(x)}{n - 1}$, where $\text{SS}(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$

Population Variance – Example

Let us calculate the variance of the two populations:

$\sigma_A^2 = \frac{(8 - 10)^2 + (9 - 10)^2 + (10 - 10)^2 + (11 - 10)^2 + (12 - 10)^2}{5} = 2$

$\sigma_B^2 = \frac{(4 - 10)^2 + (7 - 10)^2 + (10 - 10)^2 + (13 - 10)^2 + (16 - 10)^2}{5} = 18$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
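The two population variances can be checked with a small Python sketch. The function `population_variance` is written out by hand to mirror the definition (divide by N); Python's standard-library `statistics.pvariance` is used only as a cross-check.

```python
import statistics

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

def population_variance(data):
    """Average squared deviation from the population mean (divisor N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

var_a = population_variance(population_a)
var_b = population_variance(population_b)

print(var_a, statistics.pvariance(population_a))
print(var_b, statistics.pvariance(population_b))
```

Unlike the sum of deviations, the variance separates the two populations: B's value is nine times A's.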

Sample Variance – Example

Example: The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.

Solution:

$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17 - 14)^2 + (15 - 14)^2 + \cdots + (13 - 14)^2\right] = 33.2 \text{ jobs}^2$

Sample Variance – Shortcut Formula

The shortcut formula for the sample variance:

$s^2 = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1}$

Sample Variance – Shortcut Formula

$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{6 - 1}\left[\left(17^2 + 15^2 + \cdots + 13^2\right) - \frac{(17 + 15 + \cdots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$
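To see that the shortcut formula agrees with the definitional formula, both can be evaluated on the job-application sample. This is a minimal sketch; the variable names `definitional` and `shortcut` are illustrative.

```python
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)
mean = sum(jobs) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
definitional = sum((x - mean) ** 2 for x in jobs) / (n - 1)

# Shortcut form: uses only the sum of x and the sum of x squared.
shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(mean, definitional, shortcut)
```

The shortcut form needs only two running totals, which made it convenient for hand calculation.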

Sample Variance – Example 2

The hourly wages for a sample of part-time employees at Home Depot are: $12, $20, $16, $18, and $19. What is the sample variance?
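No solution is shown for this example. A minimal check using Python's standard-library `statistics.variance`, which uses the n – 1 divisor for samples:

```python
import statistics

wages = [12, 20, 16, 18, 19]  # hourly wages in dollars

mean_wage = statistics.mean(wages)
sample_variance = statistics.variance(wages)  # divisor is n - 1
sample_sd = statistics.stdev(wages)

print(mean_wage, sample_variance, sample_sd)
```

The variance is in squared dollars; the standard deviation is back in dollars.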


Standard Deviation

The standard deviation of a set of observations is the positive square root of the variance.

Sample standard deviation: $s = \sqrt{s^2}$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

The unit of measure for the standard deviation is the same as the unit of measure for the data.

Standard Deviation

Example: Find the (1) variance and (2) standard deviation for the data {5, 7, 1, 3, 8}.

Solutions:

First: $\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$

x       x – x̄     (x – x̄)²
5       0.2       0.04
7       2.2       4.84
1       –3.8      14.44
3       –1.8      3.24
8       3.2       10.24
Sum:
24      0         32.80

1) $s^2 = \frac{1}{4}(32.8) = 8.2$

2) $s = \sqrt{8.2} = 2.86$

EXAMPLE – Variance and Standard Deviation

The number of traffic citations issued during the last five months in Beaufort County, South Carolina, is 38, 26, 13, 41, and 22. What is the population variance?
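The question treats the five months as the whole population, so the divisor is N rather than n – 1. No worked answer is given; the following sketch computes it step by step (variable names are illustrative).

```python
citations = [38, 26, 13, 41, 22]
N = len(citations)

mu = sum(citations) / N
squared_devs = [(x - mu) ** 2 for x in citations]

population_variance = sum(squared_devs) / N  # divide by N, not N - 1
population_sd = population_variance ** 0.5

print(mu, squared_devs, population_variance, population_sd)
```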


Calculation of Mean, Variance and Standard Deviation

If the data is given in the form of a frequency distribution, we need to make a few changes to the formulas for the mean, variance, and standard deviation.

Complete the extension table in order to find these summary statistics.

Variance and Standard Deviation for Ungrouped Data

Population variance: $\sigma^2 = \frac{\sum X^2 - \frac{\left(\sum X\right)^2}{N}}{N}$

Sample variance: $s^2 = \frac{\sum X^2 - \frac{\left(\sum X\right)^2}{n}}{n - 1}$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$

For Ungrouped Data

X       X²
50      2 500
58      3 364
63      3 969
63      3 969
65      4 225
65      4 225
65      4 225
67      4 489
67      4 489
74      5 476
74      5 476
74      5 476
74      5 476
74      5 476
78      6 084
78      6 084
80      6 400
80      6 400
86      7 396
89      7 921

ΣX = 1 424      ΣX² = 103 120

$\sum X^2 = 103\,120$; $\left(\sum X\right)^2 = 2\,027\,776$

$\sigma^2 = \frac{103\,120 - \frac{2\,027\,776}{20}}{20} = 86.56$

$\sigma = \sqrt{86.56} = 9.30$
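The same totals can be rebuilt from the score/frequency table rather than by listing all 20 scores. A minimal sketch, with the dictionary `scores` (score mapped to frequency) introduced here for illustration:

```python
# Score -> frequency, from the ungrouped-data table.
scores = {50: 1, 58: 1, 63: 2, 65: 3, 67: 2, 74: 5, 78: 2, 80: 2, 86: 1, 89: 1}

n = sum(scores.values())
sum_x = sum(x * f for x, f in scores.items())
sum_x2 = sum(x * x * f for x, f in scores.items())

# Population formulas from the slide: divide by N.
pop_var = (sum_x2 - sum_x ** 2 / n) / n
pop_sd = pop_var ** 0.5

print(n, sum_x, sum_x2, round(pop_var, 2), round(pop_sd, 2))
```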

For a Frequency Distribution

Example: A survey of students in the first grade at a local school asked for the number of brothers and/or sisters for each child. The results are summarized in the table below. Find (1) the mean, (2) the variance, and (3) the standard deviation.

Solutions:

First:

x       f       xf      x²f
0       15      0       0
1       17      17      17
2       23      46      92
4       5       20      80
5       2       10      50
Sum:
        62      93      239

1) $\bar{x} = 93/62 = 1.5$

2) $s^2 = \frac{239 - \frac{(93)^2}{62}}{62 - 1} = 1.63$

3) $s = \sqrt{1.63} = 1.28$
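The extension-table method can be wrapped in a small function. The name `frequency_stats` and its return layout are assumptions made for this sketch; the formulas are the sample versions used in the siblings example.

```python
def frequency_stats(values, freqs):
    """Return (mean, sample variance, sample sd) from a frequency table."""
    n = sum(freqs)
    sum_xf = sum(x * f for x, f in zip(values, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(values, freqs))
    mean = sum_xf / n
    variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
    return mean, variance, variance ** 0.5

siblings = [0, 1, 2, 4, 5]
counts = [15, 17, 23, 5, 2]

mean, variance, sd = frequency_stats(siblings, counts)
print(round(mean, 2), round(variance, 2), round(sd, 2))
```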

Standard Deviation of Grouped Data
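No formula is shown for the grouped case. A common textbook approach, assumed here, treats every observation in a class as if it sat at the class midpoint m and computes $s = \sqrt{\sum f (m - \bar{x})^2 / (n - 1)}$. The sketch below applies that assumption to the grouped-data table from the range section; the result is an approximation, not a value taken from the slides.

```python
# Class midpoints and frequencies from the grouped-data table.
midpoints = [52, 57, 62, 67, 72, 77, 82, 87]
freqs = [1, 1, 2, 5, 5, 2, 2, 2]

n = sum(freqs)
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Weighted squared deviations of the midpoints from the grouped mean.
ss = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs))

sample_sd = (ss / (n - 1)) ** 0.5
print(grouped_mean, ss, round(sample_sd, 2))
```

Because the midpoints stand in for the raw scores, the result differs slightly from a standard deviation computed on the ungrouped data.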

Standard Deviation of Grouped Data – Example

Refer to the frequency distribution for the Whitner Autoplex data used earlier. Compute the standard deviation of the vehicle selling prices.

Interpreting Standard Deviation

The standard deviation can be used to

  • compare the variability of several distributions
  • make a statement about the general shape of a distribution.

The empirical rule: If a sample of observations has a mound-shaped distribution, the interval

  • $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
  • $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
  • $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements

Interpreting Standard Deviation

Standard deviation is a measure of variability, or spread.

Two rules for describing data rely on the standard deviation:

  • Empirical rule: applies to a variable that is normally distributed
  • Chebyshev’s theorem: applies to any distribution


The Empirical Rule

Empirical Rule: If a variable is normally distributed, then:

1. Approximately 68% of the observations lie within 1 standard deviation of the mean
2. Approximately 95% of the observations lie within 2 standard deviations of the mean
3. Approximately 99.7% of the observations lie within 3 standard deviations of the mean

Notes:

  • The empirical rule is more informative than Chebyshev’s theorem since we know more about the distribution (normally distributed)
  • Also applies to populations
  • Can be used to determine if a distribution is normally distributed

[Figure: The Empirical Rule. Bell curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$; the nested central regions contain 68%, 95%, and 99.7% of the observations.]

Empirical Rule – Example

Example: A random sample of plum tomatoes was selected from a local grocery store and their weights recorded. The mean weight was 6.5 ounces with a standard deviation of 0.4 ounces. If the weights are normally distributed:

1) What percentage of weights fall between 5.7 and 7.3?

2) What percentage of weights fall above 7.7?

Solutions:

1) $(\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3)$

Approximately 95% of the weights fall between 5.7 and 7.3.

2) $(\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7)$

Approximately 99.7% of the weights fall between 5.3 and 7.7. Approximately 0.3% of the weights fall outside (5.3, 7.7). Approximately (0.3 / 2) = 0.15% of the weights fall above 7.7.
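The empirical-rule percentages are rounded values. As a rough cross-check, the sketch below uses Python's standard-library `statistics.NormalDist`, treating the sample mean and standard deviation as if they were the exact parameters of a normal distribution (an assumption made for illustration only).

```python
from statistics import NormalDist

mean, sd = 6.5, 0.4
weights = NormalDist(mu=mean, sigma=sd)

# 1) Proportion within two standard deviations: (5.7, 7.3).
low2, high2 = mean - 2 * sd, mean + 2 * sd
within_2s = weights.cdf(high2) - weights.cdf(low2)

# 2) Proportion above three standard deviations: above 7.7.
above_3s = 1 - weights.cdf(mean + 3 * sd)

print(round(within_2s, 4), round(above_3s, 5))
```

The exact normal proportions are close to, but not identical with, the rounded 95% and 0.15% figures.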

Empirical Rule

Note: The empirical rule may be used to determine whether or not a set of data is approximately normally distributed:

1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and 3 standard deviations from the mean
3. Compare these actual proportions with those given by the empirical rule
4. If the proportions found are reasonably close to those of the empirical rule, then the data is approximately normally distributed
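Steps 1 and 2 of this check can be sketched in Python. The function name `proportions_within` is hypothetical, and the five-value sample is reused from the earlier deviation example purely to show the mechanics; a sample that small cannot support a real judgement about normality.

```python
from statistics import mean, stdev

def proportions_within(data):
    """Proportion of data within 1, 2, and 3 sample sds of the mean."""
    m, s = mean(data), stdev(data)
    result = {}
    for k in (1, 2, 3):
        inside = sum(1 for x in data if m - k * s <= x <= m + k * s)
        result[k] = inside / len(data)
    return result

sample = [12, 23, 17, 15, 18]
props = proportions_within(sample)
print(props)
```

Steps 3 and 4 then compare these proportions with roughly 0.68, 0.95, and 0.997.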

More Examples

Example: A statistics practitioner wants to describe the way returns on investment are distributed.

  • The mean return = 10%
  • The standard deviation of the return = 8%
  • The histogram is bell shaped.

More Examples

Example – solution

The empirical rule can be applied (bell-shaped histogram).

Describing the return distribution:

  • Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
  • Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
  • Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

Chebyshev’s Theorem

The arithmetic mean biweekly amount contributed by the Dupree Paint employees to the company’s profit-sharing plan is $51.54, and the standard deviation is $7.51. At least what percent of the contributions lie within plus 3.5 standard deviations and minus 3.5 standard deviations of the mean?
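No worked answer accompanies this question. Chebyshev's theorem gives a lower bound of 1 – 1/k² on the proportion within k standard deviations, so the calculation for k = 3.5 can be sketched as follows. The function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

mean, sd, k = 51.54, 7.51, 3.5

bound = chebyshev_lower_bound(k)
interval = (mean - k * sd, mean + k * sd)

print(round(bound * 100, 1), interval)
```

So at least about 92% of the contributions lie within 3.5 standard deviations of the mean, whatever the shape of the distribution.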


Chebyshev’s Theorem

Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least $1 - (1/k^2)$, where k is any positive number larger than 1. This theorem applies to all distributions of data.

Illustration: [Figure: distribution curve with the horizontal axis marked at $\bar{x} - ks$, $\bar{x}$, and $\bar{x} + ks$; the region between the outer marks contains at least $1 - \frac{1}{k^2}$ of the data.]

Chebyshev’s Theorem

The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!

k    Interval                            Chebyshev                       Empirical Rule
1    $(\bar{x} - s, \bar{x} + s)$        at least 0%   ($1 - 1/1^2$)     approximately 68%
2    $(\bar{x} - 2s, \bar{x} + 2s)$      at least 75%  ($1 - 1/2^2$)     approximately 95%
3    $(\bar{x} - 3s, \bar{x} + 3s)$      at least 89%  ($1 - 1/3^2$)     approximately 99.7%
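The comparison in the table can be reproduced numerically. In this sketch the Chebyshev column is the bound 1 – 1/k², and the empirical-rule column is approximated by the exact proportion of a standard normal distribution within k standard deviations, computed with the standard-library `statistics.NormalDist`. The dictionary names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

chebyshev = {}
normal = {}
for k in (1, 2, 3):
    chebyshev[k] = 1 - 1 / k ** 2
    normal[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(chebyshev[k], 3), round(normal[k], 4))
```

The gap between the two columns shows how conservative the Chebyshev bound is when the data really are bell shaped.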

Chebyshev’s Theorem

  • Chebyshev’s theorem is very conservative and holds for any distribution of data
  • Chebyshev’s theorem also applies to any population
  • The two most common values used to describe a distribution of data are k = 2, 3

The table below lists some values for k and $1 - (1/k^2)$:

k               1.7     2       2.5     3
1 – (1/k²)      0.65    0.75    0.84    0.89

Chebyshev’s Theorem – Example

Example: At the close of trading, a random sample of 35 technology stocks was selected. The mean selling price was 67.75 and the standard deviation was 12.3. Use Chebyshev’s theorem (with k = 2, 3) to describe the distribution.

Solutions:

Using k = 2: At least 75% of the observations lie within 2 standard deviations of the mean:

$(\bar{x} - 2s, \bar{x} + 2s) = (67.75 - 2(12.3), 67.75 + 2(12.3)) = (43.15, 92.35)$

Using k = 3: At least 89% of the observations lie within 3 standard deviations of the mean:

$(\bar{x} - 3s, \bar{x} + 3s) = (67.75 - 3(12.3), 67.75 + 3(12.3)) = (30.85, 104.65)$

Chebyshev’s Theorem – Example

Example: The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?

Solution:

At least 75% of the salaries lie between $22,000 and $34,000 [28000 – 2(3000), 28000 + 2(3000)]

At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 – 3(3000), 28000 + 3(3000)]

Normal Distribution

[Figure: bell-shaped curve of Frequency against X, with the mean, median, and mode coinciding at the center.]