Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
MEASURES OF
VARIABILITY
Variance and Standard Deviation
Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency.
Measures of variability or dispersion are used to describe the spread, or variability, of a distribution
Common measures of dispersion: range, variance, and standard deviation
Measures of Variability or
Dispersion
2
Why Study Dispersion?
For example, if your nature guide told you that
the river ahead averaged 3 feet in depth,
would you want to wade across on foot without
additional information? Probably not. You
would want to know something about the
variation in the depth.
A second reason for studying the dispersion
in a set of data is to compare the spread in two
or more distributions.
3
Observe two hypothetical
data sets:
The average value provides
a good representation of the
observations in the data set.
Small variability
This data set is now
changing to...
Why Study Dispersion?
4
Observe two hypothetical
data sets:
The average value provides
a good representation of the
observations in the data set.
Small variability
Larger variability
The same average value does not
provide as good representation of the
observations in the data set as before.
Why Study Dispersion?
5
Samples of Dispersions
6
Measures of Dispersion
Range
Mean Deviation
Variance and
Standard Deviation
7
EXAMPLE – Range
The number of cappuccinos sold at the Starbucks location in the Orange Country Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.
Range = Largest – Smallest value
= 80 – 20 = 60
8
range H L
Range: The difference in value between the highest-
valued (H) and the lowest-valued (L) pieces of data:
Example: Consider the sample {12, 23, 17, 15, 18}.
Find the range
Solutions:
range H L 23 12 11
Range
9
Skor (X)
Kekerapan (f )
fX
50
1
50
58
1
58
63
2
126
65
3
195
67
2
134
74
5
370
78
2
156
80
2
160
86
1
86
89
1
89
N = f = 20
fX= 1424
Ungrouped Data
10
Range for Ungrouped Data
R = H – L H = highest value
L = lowest value
R = 89 – 50 = 39
Range for Ungrouped Data
11
For Grouped Data Class
Interval
Class Limit Class Mid-
point (m)
Frequency (less
than UCL) (f)
mf
50 – 54
49.5 – 54.5
52
1
52
55 – 59
54.5 – 59.5
57
1
57
60 – 64
59.5 – 64.5
62
2
124
65 – 69
64.5 – 69.5
67
5
335
70 – 74
69.5 – 74.5
72
5
360
75 – 79
74.5 – 79.5
77
2
154
80 – 85
79.5 – 84.5
82
2
164
85 – 89
84.5 – 89.5
87
2
174
N = f = 20
mf =
1420
12
Range for Grouped Data
For Grouped Data in a Frequency Distribution
Range = UCL of Highest Class – LCL of Lowest Class
Range = 89.5 – 49.5 = 40
Range for Grouped Data
13
The range of a set of observations is the
difference between the largest and
smallest observations.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to
provide information on the dispersion of the
observations between the two end points.
? ? ?
But, how do all the observations spread out?
Smallest
observation
Largest
observation
The range cannot assist in answering this question Range
Range
14
Semi Inter-Quartile Range
There are three quartiles:
First Quartile, Q1, at the 25% frequency
Second Quartile, Q2, at the 50% frequency, equals the median
Third Quartile, Q3, at the 75% frequency
Semi Inter-Quartile Range, Q = ( Q3 - Q1 )/2
Semi Inter-Quartile Range
15
Semi Inter-Quartile Range
Semi Inter-Quartile Range, Q = ( Q3 - Q1 )/2
Q = (77.0 – 65.5) = 5.75
2
Semi Inter-Quartile Range
16
• Other measures of dispersion are based on the following
quantity
Deviation from the Mean: A deviation from the
mean, ,is the difference between the value of x and
the mean .
x xx
Deviation from the Mean
17
Example: Consider the sample {12, 23, 17, 15, 18}.
Find each deviation from the mean.
Solutions:
-5
6
0
-2
1
Data Deviation from Mean
_________________________
12
23
17
15
18
x x x
x 1
512 23 17 15 18 17( )
Deviation from the Mean
18
Consider two small populations:
10 9 8
7 4 10
11 12
13 16
8-10= -2
9-10= -1
11-10= +1
12-10= +2
4-10 = - 6
7-10 = -3
13-10 = +3
16-10 = +6
Sum = 0
Sum = 0
The mean of both
populations is 10...
…but measurements in B
are more dispersed
then those in A.
A measure of dispersion
Should agrees with this
observation.
Can the sum of deviations
Be a good measure of dispersion?
A
B
The sum of deviations is
zero for both populations,
therefore, is not a good
measure of dispersion.
Why not use the sum of deviations
from the mean?
19
Mean Deviation
The number of cappuccinos sold at the Starbucks location in the Orange Country Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.
20
This measure reflects the dispersion of all the
observations
The variance of a population of size N x1, x2,…,xN
whose mean is m is defined as
The variance of a sample of n observations
x1, x2, …,xn whose mean is is defined as x
N
)x( 2i
N1i2 m
1n
)xx(s
2i
n1i2
The Variance
21
Population Variance, 2 = (x - m)2 ,
N
m = population mean
N = population size
Population Standard Deviation, = 2
Population Variance
22
Sample Variance: The sample variance, s2, is the mean
of the squared deviations, calculated using n 1 as the
divisor:
s n
x x 2 2 1
1
( ) where n is the sample size
s xn
2
1
SS( ) ( ) SS ( ) ( ) x x x x
n x 2 2 2 1
Note: The numerator for the sample variance is called the
sum of squares for x, denoted SS(x):
where
Sample Variance
23
Let us calculate the variance of the two populations
185
)1016()1013()1010()107()104( 222222B
25
)1012()1011()1010()109()108( 222222A
Why is the variance defined as
the average squared deviation?
Why not use the sum of squared
deviations as a measure of
variation instead?
After all, the sum of squared
deviations increases in
magnitude when the variation
of a data set increases!!
Population Variance – Example
24
Example:
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance
Solution:
2
2222
in
1i2
jobs2.33
)1413...()1415()1417(16
1
1n
)xx(s
jobs146
84
6
1397231517
6
xx
i6
1i
Sample Variance – Example
25
( )
1
2
2
2
n
n
xx
s
The shortcut formula for the sample variance:
Sample Variance – Shortcut
Formula
26
( ) ( )
2
2222
2i
n1i2
i
n
1i
2
jobs2.33
6
13...151713...1517
16
1
n
)x(x
1n
1s
Sample Variance – Shortcut
Formula
27
Sample Variance – Example 2
The hourly wages for a sample of part-time employees at Home Depot are: $12, $20, $16, $18, and $19. What is the sample variance?
28
The standard deviation of a set of
observations is the positive square root of
the variance .
2
2
:deviationandardstPopulation
ss:deviationstandardSample
The unit of measure for the standard deviation
is the same as the unit of measure for the data
Standard Deviation
29
Example: Find the (1) variance and (2) standard
deviation for the data {5, 7, 1, 3, 8}:
2 . 8 ) 8 . 32 ( 4
1 2 s 1) 86 . 2 2 . 8 s 2)
x x x ( ) x x 2
0.2
2.2
-3.8
-1.8
3.2
0.04
4.84
14.44
3.24
10.24
5
7
1
3
8
24 0 32.08 Sum:
Solutions:
x 1 5
5 7 1 3 8 4 8 ( ) . First:
Standard Deviation
30
The number of traffic citations issued during the last
five months in Beaufort County, South Carolina, is 38,
26, 13, 41, and 22. What is the population variance?
EXAMPLE – Variance and
Standard Deviation
31
EXAMPLE – Sample Variance
The hourly
wages for a
sample of part-
time employees
at Home Depot
are: $12, $20,
$16, $18, and
$19. What is
the sample
variance?
32
If the data is given in the form of a
frequency distribution, we need to make
a few changes to the formulas for the
mean, variance, and standard deviation
Complete the extension table in order to
find these summary statistics
Calculation of Mean, Variance and
Standard Deviation
33
Population variance, 2 =
Sample variance, s2 =
Population standard deviation, =
Sample standard deviation, s =
Ν
Ν
ΧΧ
22 )(
1
)( 22
n
n
ΧΧ
2
2s
Variance and Standard Deviation
for Ungrouped Data
34
X
X2
50
58
63
63
65
65
65
67
67
74
74
74
74
74
78
78
80
80
86
89
2 500
3 364
3 969
3 969
4 225
4 225
4 225
4 489
4 489
5 476
5 476
5 476
5 476
5 476
6 084
6 084
6 400
6 400
7 396
7 921
X = 1 424
X2 = 103 120
X2 = 103120; (X)2 = 2 027 776
2 =
= 86.56
= = 9.30
2 027 776103120
20
20
56.86
For Ungrouped Data
35
Example: A survey of students in the first grade at a local school asked for the number of brothers and/or sisters for each child. The results are summarized in the table below. Find (1) the mean, (2) the variance, and (3) the standard deviation:
s2
2239 93
6262 1
163
( )
.2) s 163 128. .3) x 93 62 15/ .1)
Solutions:
First
: 17
23
5
2
17
46
20
10
1
2
4
5
62 93 Sum:
15 0 0
x f xf x f2
17
92
80
50
239
0
For a Frequency Distribution
36
Standard Deviation of Grouped
Data
37
Standard Deviation of Grouped
Data - Example Refer to the frequency distribution for the Whitner
Autoplex data used earlier. Compute the standard deviation of the vehicle selling prices
38
The standard deviation can be used to
compare the variability of several distributions
make a statement about the general shape of a distribution.
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
tsmeasuremen the of 68%ely approximat contains )sx,sx(
tsmeasuremen the of 95%ely approximat contains )s2x,s2x(
tsmeasuremen the of 99.7%ely approximat contains )s3x,s3x(
Interpreting Standard Deviation
39
Standard deviation is a measure of
variability, or spread
Two rules for describing data rely on the
standard deviation:
Empirical rule: applies to a variable that is
normally distributed
Chebyshev’s theorem: applies to any
distribution
Interpreting Standard Deviation
40
The Empirical Rule
41
Notes:
The empirical rule is more informative than Chebyshev’s theorem since we know more about the distribution (normally distributed)
Also applies to populations
Can be used to determine if a distribution is normally distributed
1. Approximately 68% of the observations lie within 1
standard deviation of the mean
2. Approximately 95% of the observations lie within 2
standard deviations of the mean
3. Approximately 99.7% of the observations lie within 3
standard deviations of the mean
Empirical Rule: If a variable is normally distributed, then:
The Empirical Rule
42
xx s x sx s2 x s2x s3 x s3
68%
95%
99.7%
The Empirical Rule
43
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?
Example: A random sample of plum tomatoes was selected from a local grocery store and their weights recorded. The mean weight was 6.5 ounces with a standard deviation of 0.4 ounces. If the weights are normally distributed:
Solutions:
( , ) ( . (0. ), . (0. )) ( . , . ) x s x s 2 2 6 5 2 4 6 5 2 4 5 7 7 3 Approximat ely 95% of the weigh ts fall be tween 5.7 and 7.3
1)
( , ) ( . (0. ), . (0. )) ( . , . ) x s x s 3 3 6 5 3 4 6 5 3 4 5 3 7 7 Approximat ely 99.7% of the wei ghts fall between 5. 3 and 7.7 Approximat ely 0.3% of the weigh ts fall out side (5.3, 7.7) Approximat ely (0.3 / 2) =0.15% of t he weights fall abov e 7.7
2)
Empirical Rule – Example
44
1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and 3 standard deviations from the mean
3. Compare these actual proportions with those given by the empirical rule
4. If the proportions found are reasonably close to those of the empirical rule, then the data is approximately normally distributed
Note: The empirical rule may be used to determine whether or not a set of data is approximately normally distributed
Empirical Rule
45
Example: A statistics practitioner wants
to describe the way returns on
investment are distributed.
The mean return = 10%
The standard deviation of the return = 8%
The histogram is bell shaped.
More Examples
46
Example: – solution
The empirical rule can be applied (bell shaped histogram)
Describing the return distribution Approximately 68% of the returns lie between 2% and
18% [10 – 1(8), 10 + 1(8)]
Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]
More Examples
47
Chebyshev’s Theorem
The arithmetic mean biweekly amount contributed by the Dupree Paint employees to the company’s profit-sharing plan is $51.54, and the standard deviation is $7.51. At least what percent of the contributions lie within plus 3.5 standard deviations and minus 3.5 standard deviations of the mean?
48
Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least 1 (1/k2), where k is any positive number larger than 1. This theorem applies to all distributions of data.
at least
1 12
k
xx ks x ks
Illustration:
Chebyshev’s Theorem
49
The proportion of observations in any sample that lie within
k standard deviations of the mean is at least
1-1/k2 for k > 1.
This theorem is valid for any set of measurements (sample,
population) of any shape!!
K Interval Chebyshev Empirical Rule
1 at least 0% approximately 68%
2 at least 75% approximately 95%
3 at least 89% approximately 99.7%
s2x,s2x
sx,sx
s3x,s3x
(1-1/12)
(1-1/22)
(1-1/32)
Chebyshev’s Theorem
50
Chebyshev’s theorem is very conservative and
holds for any distribution of data
Chebyshev’s theorem also applies to any
population
The two most common values used to describe a
distribution of data are k = 2, 3
1.7 2 2.5 3
0.65 0.75 0.84 0.89
k
1 1 2( / )k
The table below lists some values for k and 1 - (1/k2):
Chebyshev’s Theorem
51
Example: At the close of trading, a random sample of 35 technology stocks was selected. The mean selling price was 67.75 and the standard deviation was 12.3. Use Chebyshev’s theorem (with k = 2, 3) to describe the distribution.
)104.65 ,85.30()3.12(375.67 ),3.12(375.67()3 ,3( sxsx
Using k=3: At least 89% of the observations lie within 3
standard deviations of the mean:
Solutions:
)35.92 ,15.43()3.12(275.67 ),3.12(275.67()2 ,2( sxsx
Using k=2: At least 75% of the observations lie within 2 standard
deviations of the mean:
Chebyshev’s Theorem – Example
52
Example: The annual salaries of the employees of a chain of computer
stores produced a positively skewed histogram. The mean and
standard deviation are $28,000 and $3,000, respectively. What
can you say about the salaries at this chain?
Solution:
At least 75% of the salaries lie between $22,000 and $34,000
28000 – 2(3000) 28000 + 2(3000)
At least 88.9% of the salaries lie between $19,000 and $37,000
28000 – 3(3000) 28000 + 3(3000)
Chebyshev’s Theorem – Example
53
Normal Distribution
mean
median
mode
Frequency
X
Normal Distribution
54
Normal Distribution
Frequency
X
Normal Distribution
55