Chapter 2
Describing, Exploring, and Comparing Data
Important Characteristics of Data
Describes the overall pattern of a distribution:
Center
Divides the data in half
Spread
Differences between the data
Shape
Skewness of the data
Outlier
Data that falls outside of the pattern
Data Distributions
Graphs display the distribution
Numbers describe the distribution
Lesson 2-2
Frequency Distribution
Frequency Distribution
Grades Frequency
A (100 – 90) 5
B ( 89 – 80) 8
C ( 79 – 70) 4
D ( 69 – 60) 5
F (59 – 50) 3
A frequency distribution lists the number of occurrences
for each category of data.
Lower Class Limits
Upper Class Limits
Example – Page 44, #2
Systolic Blood
Pressure of Women Frequency
80 – 99 9
100 – 119 24
120 – 139 5
140 – 159 1
160 – 179 0
180 – 199 1
Identify the class width, class midpoints, and class
boundaries for the given frequency distribution.
Example – Page 44, #2
Blood Pressure Frequency
80 – 99 9
100 – 119 24
120 – 139 5
140 – 159 1
160 – 179 0
180 – 199 1
Find the class width.
$100 - 80 = 20$
Example – Page 44, #2
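Blood Pressure    Class Midpoints    Class Boundaries
80 – 99           89.5               79.5 – 99.5
100 – 119         109.5              99.5 – 119.5
120 – 139         129.5              119.5 – 139.5
140 – 159         149.5              139.5 – 159.5
160 – 179         169.5              159.5 – 179.5
180 – 199         189.5              179.5 – 199.5
Midpoint of the first class: $\frac{80 + 99}{2} = 89.5$
Boundary adjustment: $\frac{100 - 99}{2} = 0.50$, so each boundary lies 0.5 below a lower class limit or 0.5 above an upper class limit.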
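The class width, midpoints, and boundaries above can be checked with a short script. This is a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
# Class limits (lower, upper) from the blood pressure example.
classes = [(80, 99), (100, 119), (120, 139), (140, 159), (160, 179), (180, 199)]

# Class width: difference between two consecutive lower class limits.
class_width = classes[1][0] - classes[0][0]

# Class midpoint: average of the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Half the gap between one class's upper limit and the next class's lower limit.
half_gap = (classes[1][0] - classes[0][1]) / 2

# Class boundaries: move each limit outward by half the gap.
boundaries = [(lo - half_gap, hi + half_gap) for lo, hi in classes]

print(class_width)
print(midpoints)
print(boundaries)
```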
Relative Frequency Distribution
The relative frequency is the proportion or percent of
observations within a category and is found using the formula
$\text{Relative Frequency} = \dfrac{\text{Class Frequency}}{\text{Sum of all Frequencies}}$
Reasons for Constructing Frequency Distributions
Large data sets can be summarized.
Can gain some insight into the nature of data.
Have a basis for constructing graphs.
Example – Page 44, #6
Construct the relative frequency distribution for the data in #2.
Blood Pressure    Frequency    Relative Frequency
80 – 99           9            22.5%
100 – 119         24           60.0%
120 – 139         5            12.5%
140 – 159         1            2.5%
160 – 179         0            0.0%
180 – 199         1            2.5%
Total             40           100%
First class: $\frac{9}{40} = 0.225 = 22.5\%$
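As a rough sketch, the relative frequencies can be computed by dividing each class frequency by the total. The list name `frequencies` is an illustrative choice.

```python
# Class frequencies for the systolic blood pressure example.
frequencies = [9, 24, 5, 1, 0, 1]

# Sum of all frequencies (the sample size).
total = sum(frequencies)

# Relative frequency = class frequency / sum of all frequencies.
relative = [f / total for f in frequencies]

# The same values expressed as percents, rounded to one decimal place.
percents = [round(100 * r, 1) for r in relative]

print(total)
print(percents)
```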
Cumulative Frequency Distribution
Discrete Data
It displays the total number of observations less
than or equal to the category.
Continuous Data
It displays the total number of observations less
than or equal to the upper class limit of a class.
Example – Page 44, #10
Construct the cumulative frequency distribution for the data in #2.
Frequency    Relative Frequency    Cumulative Frequency
9            22.5%                 9
24           60.0%                 9 + 24 = 33
5            12.5%                 33 + 5 = 38
1            2.5%                  39
0            0.0%                  39
1            2.5%                  40
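A cumulative frequency column is a running total of the frequency column. One way to sketch this, assuming Python's standard `itertools.accumulate`:

```python
from itertools import accumulate

# Class frequencies for the systolic blood pressure example.
frequencies = [9, 24, 5, 1, 0, 1]

# Running total: each entry counts the observations at or below that class.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.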
Example – Page 45, #16
In “Tobacco and Alcohol Use in G-Rated Children’s
Animated Films,” by Goldstein, Sobel, and Newman (Journal
of American Medical Association, Vol 281, No. 12), the length
(in seconds) of scenes showing tobacco use and alcohol use
were recorded for animated children’s movies. Refer to
Data set 7 in Appendix B. Construct a separate frequency
distribution for the lengths of time for tobacco use and alcohol
use. In both cases, use the classes of 0 – 99, 100 – 199,
and so on. Compare the results and determine whether
there appears to be a significant difference.
Example – Page 45, #16
Time (Sec) Tobacco Alcohol
0 – 99 39 46
100 – 199 6 3
200 – 299 4 0
300 – 399 0 0
400 – 499 0 1
500 – 599 1 0
Example – Page 45, #16
There does not appear to be a significant difference.
Lesson 2-3
Visualizing Data
Displaying Distributions
Categorical Data (Qualitative)
Bar Graphs
Pie Charts
Measurement Data (Quantitative)
Histograms
Dotplots
Stem-and-leaf plots
Ogive
Frequency Polygon
Pie Chart
When to use:
The categorical data has a small number of possible
categories.
Are most useful for illustrating proportions of the whole
data set for various categories.
What to look for:
Categories that form large or small proportions of the data
set.
Don’t forget to title the graph, label the categories
and include all categories that make up the whole.
Example – Pie Chart
Education of People 25 to 34 Years Old, 2000
Education                Number of Persons (thousands)    Relative Frequency
Less than High School    4,459                            11.8%
High School Graduate     11,562                           30.6%
Some College             10,693                           28.3%
Bachelor’s Degree        8,577                            22.7%
Advanced Degree          2,494                            6.6%
Total                    37,786                           100%
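Example – Pie Charts
[Figure: pie chart, "Education of People 25 to 34 Years Old, 2000"; slices: Not HS Grad 11.8%, HS Grad 30.6%, Some College 28.3%, Bachelor's Degree 22.7%, Advanced Degree 6.6%]
Central angle for the HS Grad slice: $0.306 \times 360^\circ \approx 110^\circ$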
Bar Graph
When to use:
The categorical data has a large number of possible
categories.
What to look for:
Frequently or infrequently occurring categories.
Don’t forget to include labels for the axes as well as
a title for the graph.
Example – Bar Graphs
[Figure: bar graph, "Education of People 25 to 34 Years Old, 2000"; x-axis: Education (Not HS Grad, HS Grad, Some College, Bachelor's Degree, Advanced Degree); y-axis: Percent (0 to 35)]
Dot Plot
When to use:
Numerical data sets with small number of observations.
What to look for:
Conveys information about a typical value in the data set.
The extent to which the data values are spread out.
The nature of the distribution of values along the number line.
The presence of unusual values in the data set.
Don’t forget to title the graph and label the axis.
Example – Dotplot
Here are the numbers of home runs that Babe Ruth hit in his
15 years with the New York Yankees, 1920 to 1934:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
[Figure: dotplot of the home run counts; axis from 20 to 60 in steps of 5]
Stem Plot
When to use:
Numerical data sets with a small to moderate number of
observations
What to look for:
Conveys information about a typical value in the data set.
The extent to which the data values are spread out.
The presence of any gaps in the data.
The symmetry in the distribution of values
The number and location of peaks.
The presence of unusual (outlier) values in the data set.
Don’t forget to title the graph
Example – Stem Plot (Babe Ruth)
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
Stem | Leaves
2 | 2, 5
3 | 4, 5
4 | 1, 1, 6, 6, 6, 7, 9
5 | 4, 4, 9
6 | 0
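The stem plot above can be reproduced with a few lines of code. This is a minimal sketch that assumes two-digit values, with the tens digit as the stem and the ones digit as the leaf; the names are illustrative.

```python
from collections import defaultdict

# Babe Ruth's home run counts from the example.
home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Group the sorted values by stem (tens digit); leaves are the ones digits.
stems = defaultdict(list)
for value in sorted(home_runs):
    stems[value // 10].append(value % 10)

# Build one line per stem, including any stems that have no leaves.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem} | {leaves}")

print("\n".join(lines))
```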
Displaying Distributions
Categorical Data
Bar Graphs
Pie Charts
Quantitative Data
Dotplots
Stem-and-leaf plots
Histograms
Ogive
Frequency Polygon
Histogram
When to use:
Continuous numerical data sets with a moderate to large
number of observations
What to look for:
Conveys information about a typical value in the data set.
The extent to which the data values are spread out.
The general shape, location and number of peaks
The presence of gaps.
The presence of unusual (outlier) values in the data set.
Don’t forget to title the graph and label axes.
Example – Histogram (Discrete Data)
The manager of Wendy’s fast-food restaurant is interested in
studying the typical number of customers who arrive during
the lunch hour. The data in the following table represent the
number of customers who arrive at Wendy’s for 40 randomly
selected 15-minute intervals of time during lunch
7 6 6 6 4 5 6 6 11 4
2 7 1 2 4 6 5 5 3 7
2 2 9 7 5 6 2 6 5 7
4 6 9 8 5 6 8 2 6 5
Number of Arrivals at Wendy’s
Step 1 – Construct a frequency distribution table
How many categories are there?
11
Example – Histogram (Discrete Data)
Number of Customers Tally Frequency Relative Frequency
1 1 0.025
2 6 0.15
3 1 0.025
4 4 0.1
5 7 0.175
6 11 0.275
7 5 0.125
8 2 0.05
9 2 0.05
10 0 0
11 1 0.025
Example – Histogram (Discrete Data)
[Figure: frequency histogram, "Arrivals at Wendy’s"; x-axis: Number of Customers (1 to 11); y-axis: Frequency (0 to 12)]
Example – Histogram (Discrete Data)
[Figure: relative frequency histogram, "Arrivals at Wendy’s"; x-axis: Number of Customers (1 to 11); y-axis: Relative Frequency (0 to 0.3)]
Example – Histogram (Continuous Data)
27.4 12.7 22.6 32.1 18.2 23.7 18.4 14.7
16.7 28.5 29.6 47.7 32.0 14.7 21.3 37.0
10.8 22.2 11.6 10.9 25.5 12.8 27.0 19.2
24.1 18.4 45.9 18.4 23.7 31.1 19.6 18.5
35.9 17.4 16.6 23.3 38.1 21.9 18.5 29.1
Suppose you are considering investing in a Roth IRA.
You collect the data in the table, which represent the three-year
rate of return (in percent) for 40 small capitalization growth
mutual funds.
Example – Histogram (Continuous Data)
A) Construct a frequency distribution to display these data.
Record your class intervals and counts
Step 1 – Find the class intervals
Locate the smallest number (10.8) and the largest
number (47.7)
Lower class limit will be 10.0 with a class width of 5
Example – Histogram (Continuous Data)
3-yr Rate of Return    Frequency
10.0 – 14.9
15.0 – 19.9
20.0 – 24.9
25.0 – 29.9
30.0 – 34.9
35.0 – 39.9
40.0 – 44.9
45.0 – 49.9
Example – Histogram (Continuous Data)
3-yr Rate of Return    Frequency
10.0 – 14.9            7
15.0 – 19.9            11
20.0 – 24.9            8
25.0 – 29.9            6
30.0 – 34.9            3
35.0 – 39.9            3
40.0 – 44.9            0
45.0 – 49.9            2
Total                  40
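The counts in this table can be reproduced by assigning each return to a class of width 5 starting at 10.0. The sketch below is one way to do that; `lower`, `width`, and `n_classes` are illustrative names for the choices made in Step 1.

```python
# Three-year rates of return (percent) for the 40 mutual funds.
returns = [
    27.4, 12.7, 22.6, 32.1, 18.2, 23.7, 18.4, 14.7,
    16.7, 28.5, 29.6, 47.7, 32.0, 14.7, 21.3, 37.0,
    10.8, 22.2, 11.6, 10.9, 25.5, 12.8, 27.0, 19.2,
    24.1, 18.4, 45.9, 18.4, 23.7, 31.1, 19.6, 18.5,
    35.9, 17.4, 16.6, 23.3, 38.1, 21.9, 18.5, 29.1,
]

lower = 10.0     # lower class limit of the first class
width = 5        # class width
n_classes = 8    # classes 10.0 – 14.9 through 45.0 – 49.9

# Count how many returns fall into each class.
counts = [0] * n_classes
for r in returns:
    index = int((r - lower) // width)
    counts[index] += 1

print(counts)
```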
Example – Histogram
Step 2 – Graph it using the TI
[Figure: histogram, "3 – Year Rate of Return of Mutual Funds"; x-axis: Rate of Return (10 to 50, class width 5); y-axis: Frequency (0 to 12)]
Example – Histogram
B) Describe the distribution of the 3-year rate of return.
The distribution is skewed to the right with a peak at the
class 15.0 – 19.9, so 27.5% (11/40) of the small-cap growth
funds had a 3-year return between 15% and 19.9%.
The class 45.0 – 49.9 is separated from the rest of the data
by a gap, so the two returns in it are potential outliers.
Histogram – Too few categories
[Figure: histogram, "Age of Spring 1998 Stat 250 Students" (n = 92 students); x-axis: Age (in years); y-axis: Frequency (Count)]
Histogram – Too many categories
[Figure: histogram, "GPAs of Spring 1998 Stat 250 Students" (n = 92 students); x-axis: GPA; y-axis: Frequency (Count)]
Ogive
A relative cumulative frequency graph (ogive) is
used to find the relative standing of an individual
observation.
Example – Relative Cumulative Frequency
Class          Freq    Relative Frequency    Cumulative Frequency    Relative Cumulative Frequency
10.0 – 14.9    7       7/40 = 0.175          7                       0.175
15.0 – 19.9    11      0.275                 7 + 11 = 18             0.175 + 0.275 = 0.45
20.0 – 24.9    8       0.20                  18 + 8 = 26             0.45 + 0.20 = 0.65
25.0 – 29.9    6       0.15                  32                      0.8
30.0 – 34.9    3       0.075                 35                      0.875
35.0 – 39.9    3       0.075                 38                      0.95
40.0 – 44.9    0       0                     38                      0.95
45.0 – 49.9    2       0.05                  40                      1
Total          40
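The last two columns of the table follow mechanically from the frequency column. A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate

# Class frequencies for the mutual fund returns (classes of width 5 from 10.0).
frequencies = [7, 11, 8, 6, 3, 3, 0, 2]
total = sum(frequencies)

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Relative cumulative frequency: cumulative frequency divided by the total.
relative_cumulative = [round(c / total, 3) for c in cumulative]

print(cumulative)
print(relative_cumulative)
```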
Example – Relative Cumulative Frequency
Class Freq Rel Freq Cum Freq Rel Cum Freq
20.0 – 24.9 8 0.2 26 0.65
45.0 – 49.9 2 0.05 40 1
26 of the 40 mutual funds had a 3-year rate of return of 24.9%
or less.
65% of the mutual funds had a 3-year rate of return of 24.9% or
less.
A mutual fund with a 3-year rate of return of 45% or higher is
outperforming 95% of its peers.
Example – Relative Cumulative Frequency
L3 – Upper Class Limits
L4 – Relative Cumulative Frequency
Example – Relative Cumulative Frequency
[Figure: ogive, "3 Year Rate of Return for Small Capitalization Mutual Funds"; x-axis: Rate of Return (upper class limits 14.9 to 49.9); y-axis: Cumulative Relative Frequency (0 to 1)]
80% of the mutual funds had a 3-year rate of return
less than or equal to 29.9%.
Example – Frequency Polygon
Class          Freq    Class Midpoints
10.0 – 14.9    7       12.45
15.0 – 19.9    11      17.45
20.0 – 24.9    8       22.45
25.0 – 29.9    6       27.45
30.0 – 34.9    3       32.45
35.0 – 39.9    3       37.45
40.0 – 44.9    0       42.45
45.0 – 49.9    2       47.45
Total          40
Midpoint of the first class: $\frac{10.0 + 14.9}{2} = 12.45$
Example – Frequency Polygon
L3 – Class Midpoints
L4 – Frequency
[Figure: frequency polygon, "3 Year Rate of Return"; x-axis: Rate of Return (class midpoints 12.45 to 47.45); y-axis: Frequency (0 to 12)]
Lesson 2-4
Measure of Center
Measuring the Center
Mean
Median
Mode
Midrange
Mean or Arithmetic Mean
Find the sum of all values and then divide by the
number of values
Sample: $\bar{x} = \dfrac{\sum x}{n}$
Population: $\mu = \dfrac{\sum x}{N}$
Median
Arrange the data in order.
Odd number of values – the median is the value
in the exact middle.
Even number of values – add the two middle
numbers, then divide by 2.
Mode
Value that occurs most frequently.
Bimodal is when two values occur with the
same greatest frequency.
Multimodal is when more than two values
occur with the same greatest frequency.
When no value is repeated, we say there is no
mode.
Midrange
Is the value halfway between the highest and lowest values.
$\text{Midrange} = \dfrac{\text{Highest Value} + \text{Lowest Value}}{2}$
Example, Page 70, #10
Find the mean, median, mode and midrange for each of the
two samples, then compare the two sets of results.
Regular: 0.8192 0.8150 0.8163 0.8211 0.8181 0.8247
Diet: 0.7773 0.7758 0.7896 0.7868 0.7844 0.7861
Example, Page 70, #10
                    Regular       Diet
Mean ($\bar{x}$)    0.81907 lb    0.78333 lb
Median              0.81865 lb    0.78525 lb
Mode                None          None
Midrange            0.81985 lb    0.7827 lb
Midrange (regular): $\frac{0.8150 + 0.8247}{2} = 0.81985$ lb
Midrange (diet): $\frac{0.7758 + 0.7896}{2} = 0.7827$ lb
Example, Page 70, #10
Diet Coke appears to weigh less, likely because it contains less
sugar than regular Coke.
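The four measures of center for both samples can be checked with Python's standard `statistics` module. In this sketch the helper functions `midrange` and `modes` are illustrative and not part of any library; `modes` returns an empty list when no value repeats, matching the "no mode" convention used above.

```python
import statistics
from collections import Counter

# Weights (lb) of the regular and diet samples from the example.
regular = [0.8192, 0.8150, 0.8163, 0.8211, 0.8181, 0.8247]
diet = [0.7773, 0.7758, 0.7896, 0.7868, 0.7844, 0.7861]

def midrange(values):
    """Value halfway between the highest and lowest values."""
    return (max(values) + min(values)) / 2

def modes(values):
    """Most frequent value(s); an empty list when no value is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

for name, sample in (("regular", regular), ("diet", diet)):
    print(
        name,
        round(statistics.mean(sample), 5),
        round(statistics.median(sample), 5),
        modes(sample),
        round(midrange(sample), 5),
    )
```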
Mean from a Frequency Distribution
Use the class midpoints as the values of the variable x:
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f}$
where f is the class frequency, x is the class midpoint, and $\sum f = n$.
Example, Page 71, #20
The accompanying frequency distribution summarizes a sample
of human body temperatures. How does the mean compare
to the value of 98.6°F, which is the value assumed to be the
mean by most people?
Example, Page 71, #20
Temperature (°F)    Frequency f    Midpoint x    f · x
96.5 – 96.8         1              96.65         96.65
96.9 – 97.2         8              97.05         776.4
97.3 – 97.6         14             97.45         1364.3
97.7 – 98.0         22             97.85         2152.7
98.1 – 98.4         19             98.25         1866.8
98.5 – 98.8         32             98.65         3156.8
98.9 – 99.2         6              99.05         594.3
99.3 – 99.6         4              99.45         397.8
Sum                 106                          10405.7
Example, Page 71, #20
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f} = \dfrac{10405.7}{106} \approx 98.17$
The mean appears to be substantially lower than 98.6°F.
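The calculation above amounts to a weighted average of the class midpoints. A minimal sketch, with illustrative names:

```python
# Class limits (°F) and frequencies from the body temperature example.
classes = [
    (96.5, 96.8), (96.9, 97.2), (97.3, 97.6), (97.7, 98.0),
    (98.1, 98.4), (98.5, 98.8), (98.9, 99.2), (99.3, 99.6),
]
frequencies = [1, 8, 14, 22, 19, 32, 6, 4]

# Each class is represented by its midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# n is the sum of the frequencies; the numerator is the sum of f * x.
n = sum(frequencies)
weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))

mean = weighted_sum / n
print(n, round(weighted_sum, 1), round(mean, 2))
```

Because every value in a class is replaced by the class midpoint, the result is an approximation of the mean of the raw data.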
Skewed To The Left (Negatively)
Symmetric
Skewed To The Right (Positively)
How to Choose the Best Average
Choose mode if there are two or more “trends” in the
data
Two or more areas of high frequency values
Report one mode for each trend
Choose the median if the distribution is skewed
A small number of outliers are heavily
influencing the mean.
Choose the mean if the distribution is fairly
symmetric with one mode.
Lesson 2-5
Measures of Variation
Measuring the Spread
Range
Quartiles
Boxplots
Standard Deviation
Variance
Range
The range is the difference between the
largest and smallest observation.
$R = x_{\max} - x_{\min}$
Standard Deviation
The standard deviation (s) measures the
average distance of observations from their
mean.
Variance and Standard Deviation
Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Notation
Sample: s = standard deviation, s² = variance
Population: σ = standard deviation, σ² = variance
Example - Variance
The levels of various substances in the blood influence
our health. Here are measurements of the level of
phosphate in the blood of a patient, in milligrams
of phosphate per deciliter of blood, made on 6
consecutive visits to a clinic.
5.6 5.2 4.6 4.9 5.7 6.4
Example - Variance
5.6 5.2 4.6 4.9 5.7 6.4
A. Find the mean.
$\bar{x} = \dfrac{5.6 + 5.2 + 4.6 + 4.9 + 5.7 + 6.4}{6} = \dfrac{32.4}{6} = 5.4$
Example - Variance
[Figure: number line from 4.5 to 6.5 marking $\bar{x} = 5.4$, $x = 4.6$ (deviation $-0.8$), and $x = 6.4$ (deviation 1)]
Example - Variance
Observation $x$    Deviation $x - \bar{x}$    Squared Deviation $(x - \bar{x})^2$
5.6                5.6 − 5.4 = 0.2            (0.2)² = 0.04
5.2                5.2 − 5.4 = −0.2           0.04
4.6                4.6 − 5.4 = −0.8           0.64
4.9                4.9 − 5.4 = −0.5           0.25
5.7                5.7 − 5.4 = 0.3            0.09
6.4                6.4 − 5.4 = 1              1
SUM                0                          2.06
Example – Variance
B) Find the standard deviation (s) from its definition.
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{2.06}{6 - 1} = \dfrac{2.06}{5} = 0.412$
$s = \sqrt{s^2} = \sqrt{0.412} = 0.64187 \approx 0.6419$
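The same steps (mean, deviations, squared deviations, variance, square root) can be sketched in code. The comparison with `statistics.stdev` at the end is an added cross-check, not part of the original slides.

```python
import math
import statistics

# Phosphate levels (mg/dL) from six consecutive clinic visits.
phosphate = [5.6, 5.2, 4.6, 4.9, 5.7, 6.4]

n = len(phosphate)
mean = sum(phosphate) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in phosphate]

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum(squared_deviations) / (n - 1)
std_dev = math.sqrt(variance)

print(round(mean, 1), round(variance, 3), round(std_dev, 4))

# Cross-check against the standard library's sample standard deviation.
print(round(statistics.stdev(phosphate), 4))
```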
Example – Variance
C) Use your TI-83 to find $\bar{x}$ and $s$. Do the results agree with
part B?
Example – Page 88, #6
Listed below are ages of motorcyclists when they were
fatally injured in traffic crashes. How does the variation
of these ages compare to the variation of ages of
licensed drivers in the general population?
17 38 27 14 18 34 16 42 28
24 40 20 23 31 37 21 30 25
Example – Page 88, #6
$s = 8.7$ years
$s^2 = 75.0$ years²
Since motorcycle drivers tend to come from a particular age
group, their ages would vary less than those in the
general population.
Standard Deviation
Standard deviation (s) is the square root of the variance (s² )
Units are the original units
Measures the spread about the mean and should only be
used when the mean is chosen as the center
If s = 0 then there is no spread. Observations are the same
value
As s gets larger the observations are more spread out.
Highly affected by outliers
Variance
Variance (s²) measures the average squared deviation
of observations from the mean
Units are squared
Highly affected by outliers. Best for symmetric data
Example – Page 89, #16
In “Tobacco and Alcohol Use in G-Rated Children’s
Animated Films,” by Goldstein, Sobel, and Newman (Journal
of American Medical Association, Vol. 281, No. 12), the length
(in seconds) of scenes showing tobacco use and alcohol use
were recorded for animated children’s movies. Refer to Data
set 7 in Appendix B. Find the standard deviation for the
lengths of time for tobacco use and alcohol use.
Example – Page 89, #16
Tobacco: 104.0 sec; alcohol: 66.3 sec. Although there was slightly
more variability among the lengths of tobacco usage in this
particular sample, there does not appear to be a significant
difference between the products.
Standard Deviation from a
Frequency Distribution
$s = \sqrt{\dfrac{n \sum (f \cdot x^2) - \left[\sum (f \cdot x)\right]^2}{n(n - 1)}}$
Use the class midpoints as the x values
Example – Page 89, #19
The given frequency distribution describes the speeds
of drivers ticketed by the Town of Poughkeepsie
police. These drivers were traveling through a 30 mi/hr
speed zone on Creek Road.
Example – Page 89, #19
Speed      Frequency f    Class Midpoint x    f · x     f · x²
42 – 45    25             43.5                1087.5    47306
46 – 49    14             47.5                665       31588
50 – 53    7              51.5                360.5     18566
54 – 57    3              55.5                166.5     9240.8
58 – 61    1              59.5                59.5      3540.3
Sum        50                                 2339      110240.5
Example – Page 89, #19
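$s = \sqrt{\dfrac{n \sum (f \cdot x^2) - \left[\sum (f \cdot x)\right]^2}{n(n - 1)}} = \sqrt{\dfrac{50(110240.5) - (2339)^2}{50(50 - 1)}} = \sqrt{\dfrac{41104}{2450}} = \sqrt{16.78} \approx 4.1$ mi/hr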
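The shortcut formula can be evaluated directly from the midpoints and frequencies. A minimal sketch, with illustrative names:

```python
import math

# Class midpoints (mi/hr) and frequencies from the speeding ticket example.
midpoints = [43.5, 47.5, 51.5, 55.5, 59.5]
frequencies = [25, 14, 7, 3, 1]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

# s = sqrt( [n * sum(f x^2) - (sum(f x))^2] / [n (n - 1)] )
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(variance)

print(n, sum_fx, sum_fx2)
print(round(variance, 2), round(s, 1))
```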
Empirical (68 – 95 – 99.7) Rule
For data sets that have a distribution that is approximately
bell shaped, the following properties apply:
About 68% of all values fall within 1 standard deviation of the mean.
About 95% of all values fall within 2 standard deviations of the mean.
About 99.7% of all values fall within 3 standard deviations of the mean.
Example, Page 90, #26
Using the weights of regular Coke listed in data set 17 from
Appendix B, we find that the mean is 0.81682 lb, the
standard deviation is 0.00751 lb, and the distribution
is approximately bell shaped. Using the empirical rule, what
is the approximate percentage of cans of regular Coke with
weights between:
A. 0.80931 lb and 0.82433 lb
B. 0.80180 lb and 0.83184 lb
Example, Page 90, #26
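mean = 0.81682 lb and standard deviation = 0.00751 lb
A. 0.80931 lb and 0.82433 lb?
$\bar{x} - s = 0.80931$ and $\bar{x} + s = 0.82433$, so this interval is within 1 standard deviation of the mean: about 68%
B. 0.80180 lb and 0.83184 lb?
$\bar{x} - 2s = 0.80180$ and $\bar{x} + 2s = 0.83184$, so this interval is within 2 standard deviations of the mean: about 95%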
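To see where the interval endpoints come from, the following sketch computes the mean plus or minus 1, 2, and 3 standard deviations and pairs each interval with its empirical rule percentage. The dictionary names are illustrative.

```python
# Mean and standard deviation (lb) of the regular Coke weights.
mean = 0.81682
s = 0.00751

# Approximate share of values within k standard deviations (bell-shaped data).
empirical_rule = {1: 68.0, 2: 95.0, 3: 99.7}

# Interval endpoints mean - k*s and mean + k*s, rounded to 5 decimal places.
intervals = {
    k: (round(mean - k * s, 5), round(mean + k * s, 5))
    for k in empirical_rule
}

for k, (low, high) in intervals.items():
    print(f"within {k} sd: {low} to {high} lb, about {empirical_rule[k]}%")
```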
Lesson 2-6
Measure of Relative Standing
Z-Scores (Standard Score)
Z-Score (Standard Score)
The standard score or z score, is the number of standard
deviations that a given value x is above or below the mean.
It is found using the following expressions:
Sample: $z = \dfrac{x - \bar{x}}{s}$
Population: $z = \dfrac{x - \mu}{\sigma}$
Round z to two decimal places.
Unusual Values/Ordinary Values
Ordinary values: $-2 \le z \le 2$
Unusual values: $z < -2$ or $z > 2$
Example – Page 99, #2
Assume that adults have pulse rates (beats per minute)
with a mean of 72.9 and a standard deviation of 12.3.
When this question was written, the author’s pulse rate
was 48.
A. What is the difference between the author’s pulse
and the mean.
$72.9 - 48 = 24.9$ beats per minute
Example – Page 99, #2
The mean is 72.9 and the standard deviation is 12.3;
the author’s pulse rate was 48.
B. How many standard deviations is that [the difference
found in part (a)]?
$\dfrac{24.9}{12.3} \approx 2.02$ standard deviations
C. Convert a pulse rate of 48 to a z score.
$z = \dfrac{x - \mu}{\sigma} = \dfrac{48 - 72.9}{12.3} \approx -2.02$
Example – Page 99, #2
The mean is 72.9 and the standard deviation is 12.3;
the author’s pulse rate was 48.
D. If we consider “usual” pulse rates to be those that
convert to z scores between -2 and 2, is a pulse rate
of 48 usual or unusual?
Unusual, since $-2.02 < -2$
Example – Page 100, #8
For men ages between 18 and 24 years, serum cholesterol
levels (in mg/100 ml) have a mean of 178.1 and a
standard deviation of 40.7. Find the z score corresponding
to a male, aged 18 – 24 years, who has serum
cholesterol of 259.0 mg/100 ml. Is this level unusually high?
$x = 259.0$, $\mu = 178.1$, $\sigma = 40.7$
$z = \dfrac{x - \mu}{\sigma} = \dfrac{259 - 178.1}{40.7} \approx 1.99$
No, this level is not considered unusually high
since 1.99 < 2.00
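Both z-score examples follow the same two steps: standardize, then compare against ±2. A minimal sketch, where `z_score` and `is_unusual` are illustrative helper names:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd

def is_unusual(z):
    """Unusual values have a z score below -2 or above 2."""
    return z < -2 or z > 2

# Pulse rate example: mean 72.9, standard deviation 12.3, observed 48.
z_pulse = z_score(48, 72.9, 12.3)
print(round(z_pulse, 2), is_unusual(z_pulse))

# Cholesterol example: mean 178.1, standard deviation 40.7, observed 259.0.
z_chol = z_score(259.0, 178.1, 40.7)
print(round(z_chol, 2), is_unusual(z_chol))
```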
Quartiles
Quartiles divide the observations into fourths, or four equal
parts.
[Figure: number line from the smallest data value to the largest data value, divided at $Q_1$, $Q_2$, and $Q_3$ into four segments, each containing 25% of the data]
Percentiles
Just as there are three quartiles separating a data set into
four parts, there are also 99 percentiles, denoted
$P_1, P_2, \ldots, P_{99}$, which partition the data into 100 groups.
The quartiles are percentiles: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$
Finding the Percentile of a Given Score
$\text{Percentile of value } x = \dfrac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
Example – Page 100, #14
Find the percentile corresponding to the given cotinine level
of 210. Use Table 2-13.
0 1 1 3 17 32 35 44 48 86
87 103 112 121 123 130 131 149 164 167
173 173 198 208 210 222 227 234 245 250
253 265 266 277 284 289 290 313 477 491
$\dfrac{24}{40} \cdot 100 = 60$, so a cotinine level of 210 is the 60th percentile ($P_{60}$)
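The percentile of a given score can be sketched as a count of the values below it. The function name `percentile_of` is illustrative.

```python
# Cotinine levels from the example (40 values).
cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

def percentile_of(value, data):
    """Percent of the data values that are less than `value`, rounded."""
    below = sum(1 for v in data if v < value)
    return round(100 * below / len(data))

print(percentile_of(210, cotinine))
```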
Converting from the kth Percentile to the
Corresponding Data Value
Notation
n = total number of values in the data set
k = percentile being used
L = locator that gives the position of a value
$P_k$ = kth percentile
$L = \dfrac{k}{100} \cdot n$
Example – Page 100, #22
0 1 1 3 17 32 35 44 48 86
87 103 112 121 123 130 131 149 164 167
173 173 198 208 210 222 227 234 245 250
253 265 266 277 284 289 290 313 477 491
Find $P_{21}$
$L = \dfrac{21}{100} \cdot 40 = 8.4$, which rounds up to 9
The 9th score is 48
$P_{21} = 48$
Example – Page 100, #20
0 1 1 3 17 32 35 44 48 86
87 103 112 121 123 130 131 149 164 167
173 173 198 208 210 222 227 234 245 250
253 265 266 277 284 289 290 313 477 491
Find $Q_2$
$L = \dfrac{50}{100} \cdot 40 = 20$
L is a whole number, so take the mean of the 20th and
21st scores
$P_{50} = \dfrac{167 + 173}{2} = 170$
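The two worked examples illustrate the rule used here: when the locator L is not a whole number, round it up and take that score; when L is a whole number, average the Lth and (L + 1)th scores. A minimal sketch of that rule, assuming 1 ≤ k ≤ 99 and with the illustrative function name `percentile_value`:

```python
import math

# Cotinine levels from the example (40 values).
cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

def percentile_value(k, data):
    """Data value at the kth percentile, using the locator L = (k / 100) * n."""
    ordered = sorted(data)
    n = len(ordered)
    if (k * n) % 100 == 0:
        # L is a whole number: average the Lth and (L + 1)th scores.
        position = k * n // 100
        return (ordered[position - 1] + ordered[position]) / 2
    # L is not a whole number: round up and take that score.
    position = math.ceil(k * n / 100)
    return ordered[position - 1]

print(percentile_value(21, cotinine))
print(percentile_value(50, cotinine))
```

Statistical software often uses other interpolation conventions, so results from a calculator or library may differ slightly from this rule.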
Lesson 2-7
Exploratory Data Analysis
Exploratory Data Analysis
Is the process of using statistical tools (such
as graphs, measures of center, and measures
of variation) to investigate data sets in order
to understand their important characteristics
Outlier
An outlier can have a dramatic effect on
the mean
An outlier can have a dramatic effect on the
standard deviation
An outlier can have a dramatic effect on
the scale of the histogram so that the true
nature of the distribution is totally
obscured
Five Number Summary
Smallest observation (minimum)
Quartile 1
Quartile 2 (median)
Quartile 3
Largest observation (maximum)
Box plots
A boxplot (or box-and-whisker diagram) is a graph of a
data set that consists of a line extending from the
minimum value to the maximum value, and a box with
lines drawn at the first quartile, Q1; the median; and the
third quartile, Q3
Interquartile Range (IQR)
$IQR = Q_3 - Q_1$
The interquartile range (IQR) is the distance between
the first and third quartiles
Outliers
Upper Cutoff: $Q_3 + 1.5(IQR)$
Lower Cutoff: $Q_1 - 1.5(IQR)$
Example – Page 109, #7
In “Tobacco and Alcohol Use in G-Rated Children’s
Animated Films,” by Goldstein, Sobel, and Newman (Journal
of American Medical Association, Vol. 281, No. 12), the length
(in seconds) of scenes showing alcohol use was recorded for
animated children’s movies. Refer to Data set 7 in Appendix
B. Find the 5 – number summary and construct a box plot.
Based on the box plots, does the distribution appear to be
symmetric or is it skewed?
Example – Page 109, #7
$\min = x_1 = 0$
$Q_1 = x_{13} = 0$
$Q_2 = x_{25.5} = \dfrac{0 + 3}{2} = 1.5$
$Q_3 = x_{38} = 39$
$\max = x_{50} = 414$
Example – Page 109, #7
Based on the box plot, the
distribution appears to be
extremely right skewed
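As a rough check on the skewness, the outlier cutoffs defined earlier in this lesson can be applied to the quartiles of this example (Q1 = 0, Q3 = 39, maximum = 414). The function name `outlier_cutoffs` is illustrative.

```python
def outlier_cutoffs(q1, q3):
    """Lower and upper cutoffs: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles and maximum for the alcohol-use scene lengths (seconds).
q1, q3, maximum = 0, 39, 414

lower_cutoff, upper_cutoff = outlier_cutoffs(q1, q3)
print(lower_cutoff, upper_cutoff)

# The maximum lies far above the upper cutoff, so it is flagged as an outlier.
print(maximum > upper_cutoff)
```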