Chapter 4: Displaying and Summarizing Quantitative Data
CHAPTER OBJECTIVES
At the conclusion of this chapter you should be able to:
1) Construct graphs that appropriately describe quantitative data
2) Calculate and interpret numerical summaries of quantitative data.
3) Combine numerical methods with graphical methods to analyze a data set.
4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
5) Apply software and/or calculators to automate graphical and numerical summary procedures.
Displaying Quantitative Data
Histograms
Stem and Leaf Displays
[Figure: Relative Frequency Histogram of Exam Grades; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]
Frequency Histogram
Histograms
A histogram shows three general types of information:
It provides visual indication of where the approximate center of the data is.
We can gain an understanding of the degree of spread, or variation, in the data.
We can observe the shape of the distribution.
All 200 m Races 20.2 secs or less
Histograms Showing Different Centers
Histograms Showing Different Centers (football head coach salaries)
Histograms - Same Center, Different Spread (football head coach salaries)
Excel Example: 2012-13 NFL Salaries
Statcrunch Example: 2012-13 NFL Salaries
Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55
Frequency Distribution of Grades
Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30
Relative Frequency Distribution of Grades
Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
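The two grade tables can be reproduced with a few lines of code. The sketch below is only an illustration: the function name `frequency_table` and its parameters are hypothetical, and it assumes each class "a up to b" includes a and excludes b.

```python
from collections import Counter

# Exam grades from the data slide (n = 30).
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(values, start=40, width=10, stop=100):
    """Return (lower, upper, frequency, relative frequency) for each class.

    Assumption: a class 'lower up to upper' includes lower and excludes upper.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range((stop - start) // width):
        lower = start + k * width
        freq = counts.get(k, 0)
        table.append((lower, lower + width, freq, freq / len(values)))
    return table

for lower, upper, freq, rel in frequency_table(grades):
    print(f"{lower} up to {upper}: frequency {freq}, relative frequency {rel:.3f}")
```

The relative frequencies are the bar heights of the relative frequency histogram, so they should sum to 1.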
[Figure: Relative Frequency Histogram of Grades; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]
Based on the histogram, about what percent of the values are between 47.5 and 52.5?
1. 50%
2. 5%
3. 17%
4. 30%
Stem and leaf displays have the following general appearance:
stem leaf
1 8 9
2 1 2 8 9 9
3 2 3 8 9
4 0 1
5 6 7
6 4
Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.
Constructing a stem and leaf display:
1) determine the stem and leaf partition (aim for 5-20 stems)
2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
3) use only 1 digit in the leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps
Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39; stem: 10’s digit; leaf: 1’s digit
18: stem=1; leaf=8; 18 = 1 | 8
stem leaf
1 8 9
2 1 2 8 9 9
3 2 3 8 9
4 0 1
5 6 7
6 4
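The construction steps can be sketched in code for the employee ages. This is a minimal illustration, assuming two-digit-style data where the stem is the 10's digit and the leaf is the 1's digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Employee ages from the example.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with stem = 10's digit and leaf = 1's digit."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Include every stem in the range of the data, even stems with no leaves.
    return {s: sorted(rows[s]) for s in range(min(rows), max(rows) + 1)}

for stem, leaves in stem_and_leaf(ages).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because every stem in the range is kept, adding a much larger value (for example an age of 95) produces empty rows for stems 7 and 8, which makes the gap in the data visible.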
Suppose a 95 yr. old is hired
stem leaf
1 8 9
2 1 2 8 9 9
3 2 3 8 9
4 0 1
5 6 7
6 4
7
8
9 5
Number of TD passes by NFL teams: 2012-2013 season (stems are 10’s digit)
stem leaf
4 03
3 247
2 6677789
2 01222233444
1 13467889
0 8
Pulse Rates n = 138
#   Stem  Leaves
    4*
3   4.    588
9   5*    001233444
10  5.    5556788899
23  6*    00011111122233333344444
23  6.    55556666667777788888888
16  7*    00000112222334444
23  7.    55555666666777888888999
10  8*    0000112224
10  8.    5555667789
4   9*    0012
2   9.    58
4   10*   0223
    10.
1   11*   1
Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)
Disadvantages
display becomes unwieldy for large data sets
Population of 185 US cities with between 100,000 and 500,000
Multiply stems by 100,000
Back-to-back stem-and-leaf displays. TD passes by NFL teams: 1999-2000, 2012-13 (multiply stems by 10)
  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8
Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?
Stems are 10’s digits
1. 4
2. 6
3. 8
4. 10
5. 12
Interpreting Graphical Displays: Shape
A distribution is symmetric if the right and left
sides of the histogram are approximately mirror
images of each other.
Symmetric distribution
Complex, multimodal distribution
Not all distributions have a simple overall shape,
especially when there are few observations.
Skewed distribution
A distribution is skewed to the right if the right
side of the histogram (side with larger values)
extends much farther out than the left side. It is
skewed to the left if the left side of the histogram
extends much farther out than the right side.
Heights of Students in Recent Stats Class
Shape (cont.)
Female heart attack patients in New York state
Age: left-skewed; Cost: right-skewed
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetrical except for 2
states clearly not belonging
to the main trend. Alaska
and Florida have unusual
representation of the
elderly in their population.
A large gap in the
distribution is typically a
sign of an outlier.
Center: typical value of frozen personal pizza? ~$2.65
Spread: fuel efficiency 4, 8 cylinders
4 cylinders: more spread 8 cylinders: less spread
Other Graphical Methods for Economic Data
Time plots
plot observations in time order, with time on the horizontal axis and the variable on the vertical axis
Time series
measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
Unemployment Rate, by Educational Attainment
Water Use During Super Bowl
Winning Times 100 M Dash
Numerical Summaries of Quantitative Data
Numerical and More Graphical Methods to Describe Univariate
Data
2 characteristics of a data set to measure
center
measures where the “middle” of the data is located
variability
measures how “spread out” the data is
The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even
Ex. 2, 4, 6, 8, 10; n=5; median=6
Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5
Student Pulse Rates (n=62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
Median = (75+76)/2 = 75.5
Medians are used often
Year 2011 baseball salaries
Median $1,450,000 (max=$32,000,000 Alex Rodriguez; min=$414,000)
Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
Median existing home sales price: May 2011 $166,500; May 2010 $174,600
Median household income (2008 dollars) 2009 $50,221; 2008 $52,029
The median splits the histogram into 2 halves of equal area
Examples
Example: n = 7
17.5 2.8 3.2 13.9 14.1 25.3 45.8
Ordered: 2.8 3.2 13.9 14.1 17.5 25.3 45.8
m = 14.1
Example: n = 8
17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
Ordered: 2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8
m = (14.1+17.5)/2 = 15.8
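The odd/even rule for the median can be written as a short function. This is a sketch for illustration (the helper name `median` is ours; Python's standard `statistics.median` follows the same rule).

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# The n = 7 and n = 8 examples, given in unsorted order.
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]))
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8]))
```

Sorting first matters: the middle entry of the unsorted list is generally not the median.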
Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 4960 4971 5245 5546 7586
1. 5245
2. 4965.5
3. 4960
4. 4971
Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 5245 5546 4971 5587 7586
1. 5245
2. 4965.5
3. 5546
4. 4971
Measures of Spread
The range and interquartile range
Ways to measure variability
range=largest-smallest OK sometimes; in general, too crude;
sensitive to one large or small data value
The range measures spread by examining the ends of the data
A better way to measure spread is to examine the middle portion of the data
m = median = 3.4
Q1= first quartile = 2.3
Q3= third quartile = 4.2
1   1  0.6
2   2  1.2
3   3  1.6
4   4  1.9
5   5  1.5
6   6  2.1
7   7  2.3  (Q1)
8   6  2.3
9   5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4  (m)
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2  (Q3)
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1
Quartiles: Measuring spread by examining the middle
The first quartile, Q1, is the value in the
sample that has 25% of the data at or
below it (Q1 is the median of the lower
half of the sorted data).
The third quartile, Q3, is the value in the
sample that has 75% of the data at or
below it (Q3 is the median of the upper
half of the sorted data).
Quartiles and median divide data into 4 pieces
Q1 M Q3
1/4 1/4 1/4 1/4
Quartiles are common measures of spread
http://www2.acs.ncsu.edu/UPA/admissions/fresprof.htm
http://www2.acs.ncsu.edu/UPA/peers/current/ncsu_peers/sat.htm
University of Southern California
Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half)
Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3
Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.
Example 2 4 6 8 10 12 14 16 18 20 n = 10
Median m = (10+12)/2 = 22/2 = 11
Q1 : median of lower half 2 4 6 8 10
Q1 = 6
Q3 : median of upper half 12 14 16 18 20
Q3 = 16
Quartile example: odd no. of data values
HR’s hit by Babe Ruth in each season as a Yankee 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
Ordered values: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
Median: value in ordered position 8. median = 46
Lower half (including overall median): 22 25 34 35 41 41 46 46
Upper half (including overall median): 46 46 47 49 54 54 59 60
lower quartile Q1 = (35+41)/2 = 38
upper quartile Q3 = (49+54)/2 = 51.5
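The quartile rules from this chapter (include the overall median in both halves when n is odd, in neither half when n is even) can be sketched as follows. The function name `quartiles` is hypothetical, and note that calculators and software packages often use slightly different quartile conventions, so their answers may differ a little from this rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the chapter's rule for splitting the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the two halves do not overlap.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

# Even example from the slides (n = 10).
print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))

# Odd example: Babe Ruth's home runs as a Yankee (n = 15).
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(quartiles(ruth))
```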
Pulse Rates n = 138
#   Stem  Leaves
    4*
3   4.    588
9   5*    001233444
10  5.    5556788899
23  6*    00011111122233333344444
23  6.    55556666667777788888888
16  7*    00000112222334444
23  7.    55555666666777888888999
10  8*    0000112224
10  8.    5555667789
4   9*    0012
2   9.    58
4   10*   0223
    10.
1   11*   1
Median: mean of pulses in locations 69 & 70: median= (70+70)/2=70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
Below are the weights of 31 linemen on the NCSU football team. What is the
value of the first quartile Q1?
#    stem  leaf
2    22    55
4    23    57
6    24    26
7    25    7
10   26    257
12   27    59
(4)  28    1567
15   29    35599
10   30    333
7    31    45
5    32    155
2    33    6
1    34    0
1. 287
2. 257.5
3. 263.5
4. 262.5
Interquartile range
lower quartile: Q1
middle quartile: median
upper quartile: Q3
interquartile range (IQR): IQR = Q3 – Q1
measures spread of middle 50% of the data
Example: beginning pulse rates
Q3 = 78; Q1 = 63
IQR = 78 – 63 = 15
Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?
#    stem  leaf
2    22    55
4    23    57
6    24    26
7    25    7
10   26    257
12   27    59
(4)  28    1567
15   29    35599
10   30    333
7    31    45
5    32    155
2    33    6
1    34    0
1. 23.5
2. 39.5
3. 46
4. 69.5
5-number summary of data
Minimum Q1 median Q3 maximum
Pulse data
45 63 70 78 111
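A 5-number summary and the IQR can be computed together. The sketch below is illustrative only: it uses the Babe Ruth home run data from earlier in the chapter (the full pulse data set is not listed), the function name `five_number_summary` is hypothetical, and the halves are split by the chapter's quartile rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the chapter's quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[:(n + 1) // 2]  # includes the overall median when n is odd
    upper = ordered[n // 2:]        # includes the overall median when n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
minimum, q1, m, q3, maximum = five_number_summary(ruth)
iqr = q3 - q1  # spread of the middle 50% of the data
print(minimum, q1, m, q3, maximum, "IQR =", iqr)
```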
End of General Numerical Summaries
Next: Numerical Summaries of Symmetric Data
Numerical Summaries of Symmetric Data.
Measure of Center: Mean
Measure of Variability: Standard Deviation
Symmetric Data
Body temp. of 93 adults
Recall: 2 characteristics of a data set to measure
center
measures where the “middle” of the data is located
variability
measures how “spread out” the data is
Measure of Center When Data Approx. Symmetric
mean (arithmetic mean)
notation
$x_i$: $i$th measurement in a set of observations $x_1, x_2, x_3, \ldots, x_n$
$n$: number of measurements in data set; sample size
Sample mean: $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean (value typically not known): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
$N$ = population size
Connection Between Mean and Histogram
A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6
[Figure: Histogram of Absences from Work; vertical axis: Frequency]
Mean: balance point
Median: 50% area each half
right histo: mean 55.26 yrs, median 57.7yrs
Properties of Mean, Median
1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).
2. The mean uses the value of every number in the data set; the median does not.
Ex. 2, 4, 6, 8: $\bar{x} = 20/4 = 5$; $m = (4+6)/2 = 5$
Ex. 2, 4, 6, 9: $\bar{x} = 21/4 = 5\tfrac{1}{4}$; $m = (4+6)/2 = 5$
Example: class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$; $\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$
$m$: location: 12th obs.; $m = 85$
2010, 2014 baseball salaries
2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000
2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000
Disadvantage of the mean
Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
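To see this sensitivity in numbers, the sketch below recomputes the mean and median of the class pulse rates with and without the largest value (140). It is only an illustration; dropping an observation is done here to show the effect, not as a recommended way to treat outliers.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89, 90,
          90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single largest observation removed.
without_largest = sorted(pulses)[:-1]

print("all data:    mean", round(mean(pulses), 2), "median", median(pulses))
print("without 140: mean", round(mean(without_largest), 2),
      "median", median(without_largest))
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.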
Mean, Median, Maximum Baseball Salaries 1985 - 2014
Skewness: comparing the mean and median
Skewed to the right (positively skewed) mean>median
[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1,000's); vertical axis: Frequency]
Skewed to the left; negatively skewed
Mean < median mean=78; median=87;
[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency]
Symmetric data
mean, median approx. equal
[Figure: histogram of Bank Customers: 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency]
DESCRIBING VARIABILITY OF SYMMETRIC DATA
Describing Symmetric Data (cont.)
Measure of center for symmetric data:
Sample mean $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Measure of variability for symmetric data?
Example
2 data sets:
x1=49, x2=51; $\bar{x}$=50
y1=0, y2=100; $\bar{y}$=50
On average, they’re both comfortable
[Figure: number line marking 0, 49, 51, and 100]
Ways to measure variability
1. range = largest - smallest
ok sometimes; in general, too crude; sensitive to one large or small obs.
2. measure spread from the middle, where the middle is the mean $\bar{x}$;
deviation of $x_i$ from the mean: $(x_i - \bar{x})$;
sum the deviations of all the $x_i$'s from $\bar{x}$:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ always; tells us nothing
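Why the deviations from the mean always sum to zero follows directly from the definition of the sample mean. A short derivation, with $\bar{x}$ the sample mean and $n$ the sample size:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\;\Longrightarrow\; \sum_{i=1}^{n} x_i = n\bar{x}.
```

The positive and negative deviations cancel exactly, which is why the deviations are squared before they are averaged.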
Previous Example
sum of deviations from mean:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$
The Sample Standard Deviation, a measure of spread around the mean
Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations:
$(x_i - \bar{x})^2$; find the “average” $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$, then take the square root of the average:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$, called the sample standard deviation
Calculations …
Mean = 63.4
Sum of squared deviations from
mean = 85.2
(n − 1) = 13; (n − 1) is called the degrees of freedom (df)
$s^2$ = variance = 85.2/13 = 6.55 inches squared
$s$ = standard deviation = √6.55 = 2.56 inches
Women height (inches)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Mean ± 1 s.d.
We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$: population standard deviation
value of $\sigma$ typically not known;
use $s$ to estimate value of $\sigma$
Remarks
1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement
Remarks (cont.)
2. Note that $s$ and $\sigma$ are always greater than or equal to zero.
3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.
When does $s = 0$? When does $\sigma = 0$? When all data values are the same.
Remarks (cont.)
4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.)
5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data (square dollars, square gallons ??)
Remarks 6): Why divide by n-1 instead of n?
degrees of freedom: each observation has 1 degree of freedom; however, when you estimate an unknown population parameter like $\mu$, you lose 1 degree of freedom
In the formula for $s$, we use $\bar{x}$ to estimate the unknown value of $\mu$:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Remarks 6) (cont.): Why divide by n-1 instead of n? Example
Suppose we have 3 numbers whose average is 9
x1= x2=
then x3 must be
once we selected x1 and x2, x3 was determined since the average was 9
3 numbers but only 2 “degrees of freedom”
Since the average (mean) is 9, x1 + x2 + x3 must equal 9*3 = 27, so x3 = 27 – (x1 + x2)
Choose ANY values for x1 and x2
Computational Example
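observations: 1, 3, 5, 9; $\bar{x} = 18/4 = 4.5$
$s^2 = \frac{(1-4.5)^2 + (3-4.5)^2 + (5-4.5)^2 + (9-4.5)^2}{4-1}$
$= \frac{(-3.5)^2 + (-1.5)^2 + (.5)^2 + (4.5)^2}{3}$
$= \frac{12.25 + 2.25 + .25 + 20.25}{3} = \frac{35}{3} = 11.67$
$s = \sqrt{11.67} = 3.42$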
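The same computation can be followed step by step in code. This is a minimal sketch for the four observations 1, 3, 5, 9; the function name `sample_variance` is hypothetical.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

observations = [1, 3, 5, 9]
s2 = sample_variance(observations)  # variance
s = sqrt(s2)                        # standard deviation
print(round(s2, 2), round(s, 2))
```

Dividing by n instead of n - 1 would give 35/4 = 8.75, a smaller value; the n - 1 divisor is what the degrees-of-freedom remark refers to.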
class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$; $\bar{x} = 84.48$; $m = 85$
$s^2 = 290.26$ (beats per minute)$^2$
$s = 17.037$ beats per minute
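In practice these values come from software. As one possible check, Python's standard `statistics` module reproduces the class pulse rate results; `variance` and `stdev` divide by n - 1 (sample versions), while `pstdev` divides by N (population version).

```python
from statistics import mean, median, variance, stdev, pstdev

# Class pulse rates (n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89, 90,
          90, 90, 90, 91, 96, 98, 103, 140]

print("n =", len(pulses))
print("mean =", round(mean(pulses), 2))
print("median =", median(pulses))
print("s^2 =", round(variance(pulses), 2))  # sample variance, divisor n - 1
print("s =", round(stdev(pulses), 3))       # sample standard deviation
print("divisor N instead:", round(pstdev(pulses), 3))
```

The version with divisor N is always a little smaller than s for the same data, and the difference shrinks as n grows.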
Review: Properties of $s$ and $\sigma$
$s$ and $\sigma$ are always greater than or equal to 0
when does $s = 0$? $\sigma = 0$?
The larger the value of $s$ (or $\sigma$), the greater the spread of the data
the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement
Summary of Notation
SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.
POPULATION
$\mu$: population mean
$m$: population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.
End of Chapter 4