Chapter 3: Statistics for describing, exploring, and comparing data
Chapter Problem: A common belief is that women talk more than men.
Is that belief founded in fact, or is it a myth?
Data Set 8 in Appendix B includes word counts from different sample groups. The researchers' results show a sample mean of 15,668.5 words per day for males (sample size 186) and 16,215.0 words per day for females (sample size 210).
Chapter 3-1 Overview and 3-2.1 Measures of Center
3-1 Overview:
Discuss the characteristics of a data set: CVDOT (center, variation, distribution, outliers, time)
Statistics:
Descriptive statistics: summarize or describe the characteristics of a data set; Chapters 2 and 3 discuss the fundamental principles of descriptive statistics
Inferential statistics: use sample data to make inferences (or generalizations) about a population; focus of later chapters
3-2 Measures of Center:
Objective: discuss the characteristic “center”, the mean and median of a data set, and the effect of outliers on the mean and median
Definition: a measure of center is a value at the center or middle of a data set
Definition: The arithmetic mean of a set of values is the sum of the data values divided by the total number n of values. We call it the “mean” (meaning the arithmetic mean), and it is denoted by x-bar:
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{\text{sum of all sample values}}{\text{number of sample values}}$
3-2.2 Mean, Median & Notations
The mean is relatively reliable: because it takes every data value into account, means of samples drawn from the same population don't vary as much as other measures of center. But the mean is sensitive to every value, especially when there are outliers (a disadvantage).
The Greek letter sigma (Σ) denotes the sum of a set of values
x is the variable for individual data value
n is the number of values in a sample
N represents the number of values in a population
The median (x-tilde, $\tilde{x}$) of a data set is the measure of center that is the middle value when the original data values are arranged in ascending order
If the number of values is odd, the median is exactly the middle value of the list
If the number of values is even, the median is the mean (average) of the two middle numbers; Note: the median is largely unaffected by outliers
$\bar{x} = \dfrac{\sum x}{n}$ is the mean of a set of sample values
$\mu = \dfrac{\sum x}{N}$ is the mean of all values in a population
3-2.3 Example of mean and median
Monitoring Lead in Air
Data taken (in μg/m³) are 5.4, 1.10, 0.42, 0.73, 0.48, 1.10
Mean is 1.538 μg/m³
To find the median, first arrange the data in ascending order
0.42, 0.48, 0.73, 1.10, 1.10, 5.4 (the number of values is 6); the median is (0.73 + 1.10)/2 = 0.915 μg/m³
With one more value: 0.42, 0.48, 0.66, 0.73, 1.10, 1.10, 5.4 (the number of values is 7); the median is 0.73 μg/m³
These examples show that median is not sensitive to extreme values
and median is often used for data sets with a few extremes.
Example 2
Find the mean and median for the word counts from 5 men: 27,531; 15,684;
5,638; 27,997; and 25,433. (20,456.6; 25,433)
Find the mean and median if including the additional 8,077 words. (18,393.3,
20,558.5)
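The answers to Example 2 can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Word counts from the five men in Example 2
word_counts = [27531, 15684, 5638, 27997, 25433]
mean_5 = mean(word_counts)
median_5 = median(word_counts)

# Same data with the additional count of 8,077 words
with_extra = word_counts + [8077]
mean_6 = mean(with_extra)
median_6 = median(with_extra)

print(round(mean_5, 1), median_5)
print(round(mean_6, 1), median_6)
```

With an even number of values, `median` averages the two middle values, which matches the rule stated in 3-2.2.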
3-2.3 Mode and Midrange
The mode of a data set is the value that occurs most frequently
A data set is called bimodal when two values occur with the same greatest frequency; each one is a mode
A data set is called multimodal when more than two values occur with the same greatest frequency; each one is a mode
A data set has no mode when no value is repeated
The midrange is the measure of center calculated as the average of the minimum and maximum values of the data set: Midrange = (max + min)/2
Example from lead in the air: Midrange = (5.4+0.42)/2 = 2.910 μg/m³
Example from word counts: 27,531; 15,684; 5,638; 27,997; 25,433; Midrange = ?
The midrange is rarely used, since it uses only the max and min and is too sensitive to those extremes. It is easy to compute and is one of the values that can define the “center” of the data set. The “midrange” is different from the “mean”
Round-off rule – Carry one more decimal place than is present in the original set of values
3-2.4 Examples
Example 1: Comparison of ages of best actress and best actors
What do these data tell us? The measures of center suggest that best actresses are younger than best actors.
Comparison of Ages of Best Actresses and Best Actors
            Best Actresses   Best Actors
Mean        35.7             43.9
Median      33.5             42
Mode        35               41 and 42
Midrange    50.5             52.5
In Ch 9, we will discuss the methods for determining whether such differences are statistically significant.
Example 2: Find the mean, median, mode, and midrange of the randomly selected cans of Coke: 12.3, 12.1, 12.2, 12.3, 12.2 (mean 12.22, median 12.20, modes 12.2 and 12.3, midrange 12.2)
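The four measures of center for the Coke data can be computed together. The sketch below assumes Python 3.8 or later (for `statistics.multimode`); the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Volumes of the randomly selected cans of Coke from Example 2
volumes = [12.3, 12.1, 12.2, 12.3, 12.2]

coke_mean = mean(volumes)
coke_median = median(volumes)
# multimode returns every value tied for the greatest frequency
coke_modes = sorted(multimode(volumes))
coke_midrange = (max(volumes) + min(volumes)) / 2

print(round(coke_mean, 2), coke_median, coke_modes, round(coke_midrange, 2))
```

Because two values are tied for the greatest frequency, this data set is bimodal.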
3-2.5 More Examples
The following examples identify a major reason why the mean and median are not always meaningful statistics that accurately and effectively serve as measures of center.
Find the mean and median of the following
Zip codes: 12601, 90210, 02116, 76177, 19102
Ranks of stress level from different jobs: 2 3 1 7 9
Surveyed respondents coded as 1 (for Democrat), 2 (for Republican), 3 (for liberal), 4 (for conservative), or 5 (for others)
Mean salary of secondary school teachers: from 50 states, $37,200, $49,400, $40,000, …, $37,800. The mean is $42,210, but is this the mean salary of all secondary school teachers in the U.S.? Why or why not?
The above example did not take into consideration the number of secondary school teachers in each state. The mean for all secondary school teachers in the U.S. is $45,200, not $42,210.
3-2.6 Mean from a Frequency Distribution
The mean from a frequency distribution is defined as (Formula 3-2)
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f}$
where x is the class midpoint and f is the frequency for that class
Example:
Age of actress   Frequency f   Class midpoint x   f · x
21 - 30          28            25.5               714
31 - 40          30            35.5               1065
41 - 50          12            45.5               546
51 - 60          2             55.5               111
61 - 70          2             65.5               131
71 - 80          2             75.5               151
Totals           76                               2718
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f} = \dfrac{2718}{76} = 35.8$
You can use a TI calculator, with midpoints in L1 and f in L2, then calculate the mean
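Formula 3-2 can be sketched in a few lines of Python. The `(midpoint, frequency)` pairs below are copied from the table of actress ages; the names are illustrative.

```python
# (class midpoint x, frequency f) pairs from the frequency table
classes = [(25.5, 28), (35.5, 30), (45.5, 12), (55.5, 2), (65.5, 2), (75.5, 2)]

total_f = sum(f for _, f in classes)        # sum of frequencies
total_fx = sum(f * x for x, f in classes)   # sum of f * x
grouped_mean = total_fx / total_f

print(total_f, total_fx, round(grouped_mean, 1))
```

The result is an approximation of the true mean, since every value in a class is treated as if it were equal to the class midpoint.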
3-2.7 Weighted Mean
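Weighted mean – used when the values have different degrees of importance; it is defined as (Formula 3-3)
$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$
Example: mean of 3 test scores (85, 90, 75)
Test 1: 20%
Test 2: 30%
Test 3: 50%
$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w} = \dfrac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50} = \dfrac{8150}{100} = 81.5$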
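The weighted mean of the three test scores can be verified with a minimal sketch of Formula 3-3 (variable names are illustrative):

```python
scores = [85, 90, 75]     # test scores x
weights = [20, 30, 50]    # weights w, in percent

weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
print(weighted_mean)
```

The weights do not have to sum to 100; dividing by the sum of the weights normalizes them.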
3-2.8 Best measure of Center?
Measures of center:
Mean: find the sum of all values, then divide by the number of values
(1) Sensitive to extreme values
(2) Sample means tend to vary less than other measures of center
Median: sort the data
Odd number of values: the median is the value in the exact middle
Even number of values: add the 2 middle numbers, then divide by 2
The median is a good choice if there are some extreme values
Mode: the value that occurs most frequently
The mode is good for data at the nominal level of measurement
Midrange: (max + min)/2
The midrange is rarely used
Use the mean most, then the median
3-2.9 Skewness of data
Definition: a distribution of data is skewed if it is not symmetric and extends more to one side than the other
Skewed to the left (negatively skewed; has a longer left tail): the mean and the median are to the left of the mode
Skewed to the right (positively skewed; has a longer right tail): the mean and the median are to the right of the mode
A distribution is symmetric (zero skewness) if the left half of its histogram is roughly a mirror image of its right half; then mean = median = mode
3-2.10 Summary and Homework #8 Section 3-2
We have learned the types of measures of center of a data set, the mean from a frequency distribution, the weighted mean, the best measure of center, and skewness.
The mean and median cannot always be used to identify the shape of the distribution.
Question: What is the highest point of the graph, whether it is symmetric or skewed?
HW #8, Pages 94-96, #5-17 odd, 33-34 (answer for 34: mean = 84.8, grade = B)
3-3.1 Measures of Variation
Objective 3-3:
Learn the characteristic of variation, such as standard deviation and variance.
Learn how to use a data set to find the values of the range and standard deviation;
Interpret values of standard deviations and understand the reasoning behind the standard deviation
Definition: The range of a set of data is the difference between the maximum value and the minimum value
Range = (maximum value) – (minimum value)
Not as useful as other measures, since it depends only on the max and min (i.e. it is extremely sensitive to the extreme values)
Definition: The standard deviation of a set of sample values is a measure of variation of values about the mean.
Formula 3-4, sample standard deviation:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Formula 3-5, shortcut formula for sample standard deviation (formula used by calculators and computer programs):
$s = \sqrt{\dfrac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$
3-3.2 Properties of Standard Deviation (S.D.)
S.D. is a measure of variation of all values from the mean
S.D. is always ≥ 0. It is zero only when all of the data values are the same; larger S.D. values indicate greater amounts of variation
S.D. can increase dramatically with the inclusion of one or more outliers
The units of the S.D. s are the same as the units of the original values, e.g. minutes, feet, pounds, etc.
Procedure for finding the standard deviation with Formula 3-4:
1. Compute the mean $\bar{x}$
2. Subtract the mean from each value: $x - \bar{x}$
3. Square each difference: $(x - \bar{x})^2$
4. Add all the squares: $\sum (x - \bar{x})^2$
5. Divide the total by n-1 (i.e. one less than the number of values): $\dfrac{\sum (x - \bar{x})^2}{n - 1}$
6. Find the square root of the result of step 5: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Find the standard deviation of these waiting times (in minutes): 1, 3, 14.
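The six steps can be sketched directly in Python and compared against the shortcut formula (Formula 3-5). The function names are illustrative, not from the text.

```python
import math

def sample_sd(values):
    """Sample standard deviation via the six-step procedure (Formula 3-4)."""
    n = len(values)
    x_bar = sum(values) / n                        # step 1: mean
    squared = [(x - x_bar) ** 2 for x in values]   # steps 2-3: squared deviations
    return math.sqrt(sum(squared) / (n - 1))       # steps 4-6: sum, divide, root

def sample_sd_shortcut(values):
    """Sample standard deviation via the shortcut formula (Formula 3-5)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

waiting_times = [1, 3, 14]
print(sample_sd(waiting_times), sample_sd_shortcut(waiting_times))
```

For the waiting times, the mean is 6, the squared deviations are 25, 9, and 64, and their sum divided by n - 1 = 2 is 49, so both formulas give s = 7.0 min.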
3-3.3 Standard Deviation of a Population
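The standard deviation of a population uses the same form as the sample standard deviation, except that deviations are taken from the population mean μ and the sum is divided by N (the population size) instead of n - 1
The population standard deviation is defined as
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Since we generally deal with sample data, we usually use Formula 3-4:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$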
3-3.4 Variance of a Sample and Population
Definition – The variance of a set of values is a measure of variation equal to the square of the standard deviation
Sample variance: $s^2$, the square of the sample standard deviation
Population variance: $\sigma^2$, the square of the population standard deviation
$s^2$ is called an unbiased estimator of the population variance $\sigma^2$
Example: Use the waiting times of 1 min, 3 min, and 14 min to find the variance of the waiting times
Q: Is smaller variance better?
Note: The units of variance are different from the units of the original data set (they are squared units); the standard deviation has the same units as the data set
Notations:
s = sample standard deviation
$s^2$ = sample variance
σ = population standard deviation
$\sigma^2$ = population variance
SD – standard deviation
VAR – variance
Round-Off Rule – carry one more decimal place than the original set of data for the final answer (don't round off in the middle of a calculation)
3-3.5 Why learn Standard Deviation and interpretation
Standard deviation measures the variation among values
A small standard deviation means values are close together, while a large standard deviation means values are spread farther apart
Range Rule of Thumb: used to roughly estimate the standard deviation, based on the principle that for most data sets the vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean; s ≈ range/4 (range = max – min)
If the standard deviation s is known, we can use it to estimate the min and max “usual” sample values
Minimum “usual” value = (mean) – 2 * (standard deviation)
Maximum “usual” value = (mean) + 2 * (standard deviation)
Example 1: IQ test, mean is 100, S.D. is 15; min is 100 – 2(15) = 70, max is 100 + 2(15) = 130
Interpretation: Based on these results, we expect that typical IQ scores fall between 70 and 130. How do you interpret IQ 65 or IQ 135?
Example 2: Pulse rate of women: mean is 76, S.D. is 12.5; min is 51 beats/min, and max is 101 beats/min
Interpretation: Typical women's pulse rates are from 51 to 101 beats/min
A pulse rate of 110 would be unusual, since 110 is outside these limits
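The range rule of thumb translates into a small helper. This is a sketch with hypothetical function names, applied to the IQ and pulse-rate examples.

```python
def usual_range(mean, sd):
    """Return (minimum usual value, maximum usual value) = mean -/+ 2 * sd."""
    return mean - 2 * sd, mean + 2 * sd

def is_unusual(value, mean, sd):
    """A value is 'unusual' if it falls outside the usual range."""
    low, high = usual_range(mean, sd)
    return value < low or value > high

print(usual_range(100, 15))         # IQ scores
print(usual_range(76, 12.5))        # women's pulse rates
print(is_unusual(110, 76, 12.5))    # pulse rate of 110
```

Going the other way, s ≈ range/4 gives a rough estimate of the standard deviation when only the minimum and maximum are known.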
3-3.6 Empirical (or 68-95-99.7) Rule for data with normal distribution
Empirical rule – a data set having a distribution that is approximately bell-shaped has the following properties:
About 68% of all values fall within 1 standard deviation of the mean, i.e. between (mean - s) and (mean + s)
About 95% of all values fall within 2 standard deviations of the mean, i.e. between (mean - 2s) and (mean + 2s)
About 99.7% of all values fall within 3 standard deviations of the mean, i.e. between (mean - 3s) and (mean + 3s)
Example of IQ scores: mean is 100, standard deviation is 15. What percentage of IQ scores are between 70 and 130?
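The 68-95-99.7 figures are rounded values for a normal distribution. The sketch below assumes IQ scores follow an exactly normal distribution with mean 100 and standard deviation 15 (an idealization), and uses `statistics.NormalDist` from Python 3.8 or later.

```python
from statistics import NormalDist

# Assumed model: IQ scores are exactly normal with mean 100 and S.D. 15
iq = NormalDist(mu=100, sigma=15)

for k in (1, 2, 3):
    low, high = 100 - k * 15, 100 + k * 15
    share = iq.cdf(high) - iq.cdf(low)   # proportion between low and high
    print(f"within {k} S.D. ({low} to {high}): {share:.1%}")

within_2sd = iq.cdf(130) - iq.cdf(70)
```

Under this model the proportion between 70 and 130 is slightly above 95%, which is why the empirical rule says "about 95%".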
3-3.7 Chebyshev's Theorem
The proportion (or fraction) of any data set lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K > 1
When K=2, we can interpret that at least 3/4 (75%) of all values lie within 2 standard deviations of the mean
When K=3, we can interpret that at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean
Example – IQ score:
At least 75% of people have IQ between 70 and 130 (2 SD from mean)
At least 89% of people have IQ between 55 and 145 (3 SD from mean)
Comparison: Example – IQ score using empirical rule
About 68% of people have IQ between 85 and 115 (1 SD from mean)
About 95% of people have IQ between 70 and 130 (2 SD from mean)
About 99.7% of people have IQ between 55 and 145 (3 SD from mean)
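Chebyshev's lower bound is simple to compute for any K > 1. A minimal sketch (the function name is hypothetical):

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"K = {k}: at least {chebyshev_bound(k):.1%} of values")
```

The bound holds for any data set, which is why it is weaker than the empirical rule's figures for bell-shaped data.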
3-3.8 Coefficient of Variation in Different Populations
The coefficient of variation (CV) for a set of nonnegative sample or population data (expressed as %) describes the standard deviation relative to the mean:
Sample: $CV = \dfrac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \dfrac{\sigma}{\mu} \cdot 100\%$
Example: Heights and Weights of Men (Data Set 1 in Appendix B)
For heights: mean $\bar{x}$ = 68.34 in, s.d. = 3.02 in
For weights: mean $\bar{x}$ = 172.55 lb, s.d. = 26.33 lb
We want to compare variation among heights to variation among weights.
Heights: CV = 4.42%; weights: CV = 15.26%
Interpretation: The heights have considerably less variation than the weights. Does that make sense?
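The two CV values can be reproduced from the stated means and standard deviations. A minimal sketch with an illustrative function name:

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the mean, as a percentage."""
    return sd / mean * 100

cv_height = coefficient_of_variation(3.02, 68.34)     # heights in inches
cv_weight = coefficient_of_variation(26.33, 172.55)   # weights in pounds

print(round(cv_height, 2), round(cv_weight, 2))
```

Because the CV has no units, it allows a comparison between inches and pounds that the raw standard deviations do not.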
3-3.9 Summary and HW #9 (3-3)
Range rule of thumb: s ≈ range/4
Min (usual) = mean – 2*s
Max (usual) = mean + 2*s
Empirical rule (only applicable to normal (bell-shaped) distributions)
68% within 1 S.D. means data values are within (mean - s) and (mean + s)
95% within 2 S.D. means data values are within (mean - 2s) and (mean + 2s)
99.7% within 3 S.D. means data values are within (mean - 3s) and (mean + 3s)
Chebyshev's Theorem gives a lower bound on the proportion of values within K standard deviations of the mean (applicable to any data set, but has limited usefulness)
HW #9 (3-3) Pp. 110-113, #5-11 odd, 17, 31-35 odd
3-4.1 Measure of Relative Standing and Boxplots
Objective: To learn the measures that can be used to compare values from the same or different data sets: z scores (including converting data values to z scores), quartiles, percentiles, and boxplots
Definition: a z score (or standardized value) is the number of standard deviations that a given value x is above or below the mean
For sample data: $z = \dfrac{x - \bar{x}}{s}$
For population data: $z = \dfrac{x - \mu}{\sigma}$
Round z to two decimal places
A man is 76.2 in tall and weighs 237.1 lb. Find the z scores for his height and weight (mean height = 68.34 in, s.d. = 3.02 in; mean weight = 172.55 lb, s.d. = 26.33 lb). His z score is 2.60 for height and 2.45 for weight.
Interpretation: The man's height is 2.60 standard deviations above the mean height, and his weight is 2.45 standard deviations above the mean weight. The height is more extreme than the weight.
Example: Lyndon Johnson 75” (mean 71.5”, S.D. 2.1”), Shaquille O'Neal 85” (mean 80”, S.D. 3.3”)
Interpretation: ?
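The z-score formula can be sketched as a one-line function. The height of 76.2 in used below is an assumption: it is the one-decimal value consistent with the stated z score of 2.60.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Height assumed to be 76.2 in (consistent with the stated z score of 2.60)
z_height = z_score(76.2, 68.34, 3.02)
z_weight = z_score(237.1, 172.55, 26.33)

print(round(z_height, 2), round(z_weight, 2))
```

The same function can be applied to the Johnson and O'Neal heights, each with its own group mean and standard deviation, to compare them on a common scale.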
3-4.2 z-score and unusual values
Using the range rule of thumb, a value is “unusual” if it is more than 2 S.D. from the mean:
min (usual) = mean – 2*s, and
max (usual) = mean + 2*s
Using z scores, a value is “unusual” if its z score is less than -2 or greater than +2
Ordinary values: -2 ≤ z score ≤ +2
Unusual values: z score < -2 or z score > +2
z scores: measures of position relative to the mean; a z score of +2 means 2 standard deviations above the mean, a z score of -3 means 3 standard deviations below the mean
Example: Over the past 30 years, heights of basketball players at Newport University have a mean of 74.5 in and a s.d. of 2.5 in. The latest recruit has a height of 79.0 in. Find the z score.
Is the height of 79.0 in unusual among the heights of players over the past 30 years? Why or why not?
3-4.3 Percentiles
Definition:
Percentiles are one type of quantiles (fractiles), which partition data into groups with roughly the same number of values in each group
Percentiles are measures of location. There are 99 percentiles, denoted by P1, P2, P3, …, P99, which divide a set of data into 100 groups with about 1% of the values in each group
Example: the 50th percentile, denoted by P50, has about 50% of the data values below it and about 50% of the data values above it; the 50th percentile is the same as the median.
To find the percentile of a value x (round the result to the nearest whole number):
$\text{percentile of value } x = \dfrac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
To find the location L of the kth percentile:
$L = \dfrac{k}{100} \cdot n$
n = total number of values in the data set
k = percentile being used
L = location that gives the position of a value
Find the percentile for the value of $29 million. Table 3-4 in the text:
4.5 5 6.5 7 20 20 29 30 35 40
40 41 50 52 60 65 68 68 70 70
70 72 74 75 80 100 113 116 120 125
132 150 160 200 225
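The percentile-of-a-value formula can be applied to the 35 listed values. This is a minimal sketch; the names `budgets` and `percentile_of_value` are illustrative.

```python
# The 35 sorted data values listed above
budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40,
           40, 41, 50, 52, 60, 65, 68, 68, 70, 70,
           70, 72, 74, 75, 80, 100, 113, 116, 120, 125,
           132, 150, 160, 200, 225]

def percentile_of_value(data, x):
    """Percentile of x: (number of values less than x) / (total) * 100, rounded."""
    below = sum(1 for v in data if v < x)
    return round(below / len(data) * 100)

print(percentile_of_value(budgets, 29))
```

Six of the 35 values are less than 29, so the result is (6/35)(100) ≈ 17, i.e. the 17th percentile.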
3-4.4 Converting from kth percentile to the corresponding data value
1. Start: sort the data (arrange the data from low to high)
2. Compute L = (k/100)n, where n = # of values and k = percentile
3. Is L a whole number?
Yes: the value of Pk is the mean of the values at the Lth location and the (L+1)th location
No: change L by rounding it up to the next whole number; the value of Pk is the Lth value, counting from the lowest
Example: Find the 17th percentile of the previous data set.
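The procedure for converting the kth percentile to a data value can be sketched as follows. The function assumes k is a whole number from 1 to 99 and uses integer arithmetic to test whether L is whole, which avoids floating-point surprises; the names are illustrative.

```python
budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40,
           40, 41, 50, 52, 60, 65, 68, 68, 70, 70,
           70, 72, 74, 75, 80, 100, 113, 116, 120, 125,
           132, 150, 160, 200, 225]

def percentile_value(data, k):
    """Value of Pk using L = (k/100) * n; assumes integer k with 1 <= k <= 99."""
    ordered = sorted(data)            # sort from low to high
    product = k * len(ordered)        # equals 100 * L
    if product % 100 == 0:            # L is a whole number
        loc = product // 100
        # mean of the Lth and (L+1)th values (lists are 0-indexed)
        return (ordered[loc - 1] + ordered[loc]) / 2
    loc = product // 100 + 1          # round L up to the next whole number
    return ordered[loc - 1]           # the Lth value from the lowest

print(percentile_value(budgets, 17))
```

For k = 17 and n = 35, L = 5.95, which rounds up to 6, so P17 is the 6th value in the sorted list.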
3-4.5 Example: Setting Speed Limits
The table lists randomly selected recorded speeds (in miles/hour) on Highway 405
Find the 85th percentile of the listed speeds
Given that speed limits are usually rounded to a multiple of 5, what speed limit is suggested by these data? Explain your choice
Does the existing speed limit on Highway 405 conform to the 85th percentile rule (i.e. the speed limit is set so that 85% of drivers are at or below the speed limit)?
68 68 72 73 65 74 73 72 68 65
65 73 66 71 68 74 66 71 65 73
59 75 70 56 66 75 68 75 62 72
60 73 61 75 58 74 60 73 58 75
3-4.6 Quartiles
Definition
Quartiles are measures of location, denoted by Q1, Q2, and Q3, which divide a data set into four groups with about 25% of the values in each group (percentiles divide the data into 100 groups)
The three quartiles Q1, Q2, Q3 divide the sorted data values into 4 equal parts
Q1 (first quartile) separates the bottom 25% from the rest
Q2 (second quartile) separates the bottom 50% from the rest (Q2 is also the median and the 50th percentile)
Q3 (third quartile) separates the bottom 75% from the rest
Interquartile range (IQR) = Q3 - Q1
Semi-interquartile range = (Q3 - Q1)/2
Midquartile = (Q3 + Q1)/2
10-90 percentile range = P90 - P10
Example: find the values of Min, Q1, Q2, Q3, Max, and IQR of the movie budgets
3-4.8 5-Number Summary and Boxplot
Definition – for a set of data, the 5-number summary consists of:
(1) Minimum, (2) Q1, (3) Q2 (the median), (4) Q3, (5) Maximum
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the min to the max and a box with lines drawn at Q1, the median, and Q3 (summary and example next page)
A boxplot is useful for revealing
The center of the data
The spread of the distribution of the data
The presence of outliers
An outlier is a value that is located far away from almost all of the other values; an extreme value that falls outside the general pattern
A data value x is an outlier if x – Q3 > 1.5 IQR or Q1 – x > 1.5 IQR
An outlier can have a dramatic effect on
The mean
The standard deviation
The scale of the histogram, so that the true nature of the distribution is totally obscured
3-4.9 Procedures for Construct a Boxplot, HW #10
Find the 5-number summary: min, Q1, median, Q3, and max
Construct a scale with values that include the min and max data values
Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median value
Draw lines extending outward from the box to the min and max data values
Example on the board: use the 5-number summary of the movie budgets
Boxplots don't show as much detailed information as histograms or stem-and-leaf plots, so they are not the best choice when dealing with a single data set; but they are great for comparing different data sets (use the same scale)
Do women really talk more than men? Use the 5-number summary:
        Min    Q1      Q2      Q3      Max
Men     695    10009   14290   20565   47016
Women   1674   11010   15917   20571   40055
Read Table 3-3, comparison of word counts of men and women, for “mean”, “median”, “midrange”, “range”, “S.D.”
Example: Here are measured reaction times (in seconds) in a test of driving skills: 2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6; find the five-number summary.
HW #10: P. 127-128, #1, 5-7, 9, 13, 15, 19, 23, 27
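For the reaction-time example, the five-number summary and the 1.5 IQR outlier rule can be sketched as below. Quartiles are taken as P25, P50, and P75 using the percentile-location procedure from 3-4.4; software packages often use slightly different quartile conventions, so their output may differ. All function names are illustrative.

```python
def percentile_value(data, k):
    """Value of Pk using L = (k/100) * n; assumes integer k with 1 <= k <= 99."""
    ordered = sorted(data)
    product = k * len(ordered)        # equals 100 * L
    if product % 100 == 0:            # L is whole: average Lth and (L+1)th values
        loc = product // 100
        return (ordered[loc - 1] + ordered[loc]) / 2
    return ordered[product // 100]    # L rounded up, converted to a 0-based index

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) with Q1 = P25, Q2 = P50, Q3 = P75."""
    return (min(data), percentile_value(data, 25), percentile_value(data, 50),
            percentile_value(data, 75), max(data))

def outliers(data):
    """Values x with x - Q3 > 1.5 * IQR or Q1 - x > 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return [x for x in data if x - q3 > 1.5 * iqr or q1 - x > 1.5 * iqr]

times = [2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6]
summary = five_number_summary(times)
print([round(v, 2) for v in summary])
print(outliers(times))
```

The values in `summary` are the five numbers needed to draw the boxplot, and `outliers` lists any values the 1.5 IQR rule flags.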
Review (1)
Ch 1 – You should be able to do the following:
Distinguish between a population and a sample, and between a parameter and a statistic
Understand the importance of good experimental design, including the control of variable effects, replication, and randomization
Recognize the importance of good sampling methods in general, and of a simple random sample in particular
Understand that if sample data are not collected in an appropriate way, the data may be completely useless
Ch 2 – You should be able to do the following:
Summarize data by constructing a frequency distribution or relative frequency distribution
Visually display the nature of the distribution by constructing a histogram or relative frequency histogram
Investigate important characteristics of a data set by creating visual displays, such as a frequency polygon, dotplot, stemplot, Pareto chart, pie chart, scatterplot, or time-series graph
Understand and interpret those results
Review (3) - Continued
You should be able to do the following:
Calculate measures of center by finding the mean and median
Calculate measures of variation by finding the standard deviation, variance, and range
Understand and interpret the standard deviation by using tools such as the range rule of thumb
Compare individual values by using z scores, quartiles, or percentiles, and identify outliers
Investigate and explore the spread of data, the center of the data, and the range of values by constructing a boxplot
Understand and interpret those results, such as the standard deviation as a measure of how much data vary, and use the standard deviation to distinguish between values that are usual and unusual
Examples
Always consider certain key factors:
• Context of the data
• Source of the data
• Sampling method
• Measures of center
• Measures of variation
• Distribution
• Outliers
• Changing patterns over time
• Conclusion
• Practical implications