Chapter 2
Graphical Methods for Describing Data
Distributions
Created by Kathy Fritz
Variable • any characteristic whose value may
change from one individual to another
Political affiliation
Number of textbooks purchased
Distance from home to college
Data
• The values for a variable from individual observations
Political affiliation:
Democrat, Republican,
etc.
Number of textbooks purchased:
1, 2, 3, 4, . . .
Distance from home to college:
25 miles, 53.5 miles, 347.2 miles,
etc.
Suppose that a PE coach records the height of each student in his class.
Univariate - data that consist of observations on a single variable made on individuals in a sample or population
This is an example of univariate data.
Suppose that the PE coach records the height and weight of each student in his class.
Bivariate - data that consist of pairs of numbers from two variables for each individual in a sample or population
This is an example of bivariate data.
Suppose that the PE coach records the height, weight, number of sit-ups, and number of push-ups for each student in his class.
Multivariate - data that consist of observations on two or more variables
This is an example of multivariate data.
Two types of variables
categorical numerical
Categorical variables
• Qualitative
• Consist of categorical responses
1. Car model
2. Birth year
3. Type of cell phone
4. Your zip code
5. Which club you have joined
Which of these variables are NOT categorical variables?
They are all categorical variables!
Numerical variables
• quantitative
• observations or measurements take on numerical values
It makes sense to perform math operations on these values.
1. GPAs
2. Height of students
3. Codes to combination locks
4. Number of text messages per day
5. Weight of textbooks
Which of these variables are NOT numerical?
Does it make sense to find an average code for combination locks?
There are two types of numerical variables -
discrete and continuous
Two types of variables
categorical numerical
discrete continuous
Discrete (numerical)
• Isolated points along a number line
• usually counts of items
• Example: number of textbooks purchased
Continuous (numerical)
• Variable that can be any value in a given interval
• usually measurements of something
• Example: GPAs
Identify the following variables:
1. the color of cars in the teacher’s lot
2. the number of calculators owned by students at your college
3. the zip code of an individual
4. the amount of time it takes students to drive to school
5. the appraised value of homes in your city
1. Categorical
2. Discrete numerical
3. Categorical
4. Continuous numerical
5. Discrete numerical
Is money a measurement or a count?
Graphical Display            Variable Type                     Data Type     Purpose
Bar Chart                    Univariate                        Categorical   Display data distribution
Comparative Bar Chart        Univariate for 2 or more groups   Categorical   Compare 2 or more groups
Dotplot                      Univariate                        Numerical     Display data distribution
Comparative dotplot          Univariate for 2 or more groups   Numerical     Compare 2 or more groups
Stem-and-leaf display        Univariate                        Numerical     Display data distribution
Comparative stem-and-leaf    Univariate for 2 groups           Numerical     Compare 2 or more groups
Histogram                    Univariate                        Numerical     Display data distribution
Scatterplot                  Bivariate                         Numerical     Investigate relationship between 2 variables
Time series plot             Univariate, collected over time   Numerical     Investigate trend over time
Use this table to determine an appropriate graphical display for a data set.
What types of graphs can be used with categorical data?
In section 2.3, we will see how the various graphical
displays for univariate, numerical
data compare.
Displaying Categorical Data
Bar Charts
Comparative Bar Charts
When to Use: Univariate, Categorical data
To comply with new standards from the U. S. Department of Transportation, helmets should reach the bottom of the motorcyclist’s ears. The report “Motorcycle Helmet Use in 2005 – Overall Results” (National Highway Traffic Safety Administration, August 2005) summarized data collected by observing 1700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet (N), a noncompliant helmet (NC), or a compliant helmet (C).
The data are summarized in this table:
Bar Chart
Helmet Use   Frequency
N            731
NC           153
C            816
Total        1700
This is called a frequency distribution.
A frequency distribution is a table that displays the possible categories along
with the associated frequencies or relative frequencies.
The frequency for a particular category is the number of times that
category appears in the data set.
The frequencies should sum to the total number of observations.
A bar chart is a graphical display for categorical data.
Helmet Use   Relative Frequency
N            0.430
NC           0.090
C            0.480
Total        1.000
The relative frequencies should sum to 1 (allowing for rounding).
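The step from the frequency table to the relative frequency table can be sketched in a few lines of Python. This is only an illustration using the helmet counts from the slides; the function name `relative_frequencies` is a hypothetical helper, not something from the source.

```python
# Helmet-use frequencies taken from the frequency distribution in the slides.
frequencies = {"N": 731, "NC": 153, "C": 816}


def relative_frequencies(freqs):
    """Divide each category's frequency by the total number of observations."""
    total = sum(freqs.values())
    return {category: count / total for category, count in freqs.items()}


rel = relative_frequencies(frequencies)
for category, value in rel.items():
    print(f"{category:<6}{value:.3f}")
print(f"{'Total':<6}{sum(rel.values()):.3f}")
```

Dividing by the total (1700 here) is what makes the column sum to 1, apart from rounding.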
Bar Chart
How to construct
1. Draw a horizontal line; write the categories or labels below the line at regularly spaced intervals
2. Draw a vertical line; label the scale using frequency or relative frequency
3. Place a rectangular bar above each category label with a height determined by its frequency or relative frequency
All bars should have the same width so that both the height and the area of the bar are proportional to the frequency or relative frequency of the corresponding category.
What to Look For: Frequently or infrequently occurring categories
Here is the completed bar chart for the motorcycle helmet data.
Describe this graph.
Bar Chart
Comparative Bar Charts
Bar charts can also be used to provide a visual comparison of two or more groups.
When to Use: Univariate, Categorical data for two or more groups
How to construct
• Constructed by using the same horizontal and vertical axes for the bar charts of two or more groups
• Usually color-coded to indicate which bars correspond to each group
• Should use relative frequencies on the vertical axis
Why? You use relative frequency rather than frequency on the vertical axis so that you can make meaningful comparisons even if the sample sizes are not the same.
Each year the Princeton Review conducts a survey of students applying to college and of parents of college applicants. In 2009, 12,715 high school students responded to the question “Ideally how far from home would you like the college you attend to be?”
Also, 3007 parents of students applying to college responded to the question “How far from home would you like the college your child attends to be?” The data are displayed in the frequency table below.
Frequency
Ideal Distance          Students   Parents
Less than 250 miles     4450       1594
250 to 500 miles        3942       902
500 to 1000 miles       2416       331
More than 1000 miles    1907       180
Create a comparative bar chart with these data.
What should you do first?
Relative Frequency
Ideal Distance          Students   Parents
Less than 250 miles     .35        .53
250 to 500 miles        .31        .30
500 to 1000 miles       .19        .11
More than 1000 miles    .15        .06
Student values are found by dividing the frequency by the total number of students.
Parent values are found by dividing the frequency by the total number of parents.
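As a rough sketch of that first step, the conversion from frequencies to relative frequencies for both groups could look like this in Python. The counts come from the frequency table; the helper name `to_relative` and the two-decimal rounding are choices made for this illustration.

```python
distances = ["Less than 250 miles", "250 to 500 miles",
             "500 to 1000 miles", "More than 1000 miles"]
students = [4450, 3942, 2416, 1907]   # 12,715 student responses in total
parents = [1594, 902, 331, 180]       # 3007 parent responses in total


def to_relative(counts):
    """Divide each count by its own group's total, rounded to two decimals."""
    total = sum(counts)
    return [round(count / total, 2) for count in counts]


student_rel = to_relative(students)
parent_rel = to_relative(parents)

print(f"{'Ideal Distance':<22}{'Students':>9}{'Parents':>9}")
for label, s, p in zip(distances, student_rel, parent_rel):
    print(f"{label:<22}{s:>9.2f}{p:>9.2f}")
```

Each group is divided by its own total, which is what puts the two groups on a common scale even though far more students than parents responded.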
What does this graph show about the ideal distance college should be from home?
Displaying Numerical Data
Dotplots
Stem-and-leaf Displays
Histograms
Dotplot
When to Use: Univariate, Numerical data
How to construct
1. Draw a horizontal line and mark it with an appropriate numerical scale
2. Locate each value in the data set along the scale and represent it by a dot. If there are two or more observations with the same value, stack the dots vertically
What to Look For
• A representative or typical value (center) in the data set
• The extent to which the data values spread out
• The nature of the distribution (shape) along the number line
• The presence of unusual values (gaps and outliers)
What we look for in univariate numerical data sets is similar for dotplots, stem-and-leaf displays, and histograms.
An outlier is an unusually large or small data value.
A precise rule for deciding when an observation is an outlier is given in Chapter 3.
Professor Norm gave a 10-question quiz last week in his introductory statistics class. The number of correct answers for each student is recorded below.
6 8 6 5 4 7 9 4 5
8 5 4 6 7 7 3 8 7
6 7 6 6 6 5 5 9
First draw a horizontal line with an appropriate scale.
[Figure: horizontal axis labeled "Number of correct answers," scaled from 2 to 10]
The first three observations are plotted; note that you stack the points if values are repeated.
This is the completed dotplot.
Write a few sentences describing this distribution.
The center for the distribution of the number of correct answers is about 6. There is not a lot of variability in the observations. The distribution is approximately symmetrical with no unusual observations.
A symmetrical distribution is one that has a vertical line of symmetry where the left half
is a mirror image of the right half.
If we draw a curve, smoothing out this
dotplot, we will see that there is ONLY one peak.
Distributions with a single peak are said to
be unimodal.
Distributions with two peaks are bimodal, and
with more than two peaks are multimodal.
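The dotplot for Professor Norm's quiz data can be approximated in plain text. The sketch below counts how often each score occurs and stacks one mark per observation; `text_dotplot` is a hypothetical helper written for this illustration, and the letter "o" stands in for a dot.

```python
from collections import Counter

# Number of correct answers for Professor Norm's class (26 students).
norm_scores = [6, 8, 6, 5, 4, 7, 9, 4, 5,
               8, 5, 4, 6, 7, 7, 3, 8, 7,
               6, 7, 6, 6, 6, 5, 5, 9]


def text_dotplot(values, low, high):
    """Return the rows of a text dotplot: stacked marks, then the axis."""
    counts = Counter(values)
    scale = range(low, high + 1)
    rows = []
    # Build from the tallest stack down so the plot prints top to bottom.
    for level in range(max(counts.values()), 0, -1):
        marks = ["o" if counts[v] >= level else " " for v in scale]
        rows.append("".join(f"{m:>3}" for m in marks).rstrip())
    rows.append("".join(f"{v:>3}" for v in scale))
    return rows


plot_rows = text_dotplot(norm_scores, 2, 10)
print("\n".join(plot_rows))
print("Number of correct answers")
```

The tallest stack sits over 6 and the stacks fall away on both sides, which is the single-peak (unimodal), roughly symmetric shape described above.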
Comparative Dotplots
When to Use: Univariate, numerical data with observations from 2 or more groups
How to construct
• Constructed using the same numerical scale for two or more dotplots
• Be sure to include group labels for the dotplots in the display
What to Look For: Comment on the same four attributes, but comparing the dotplots displayed.
In another introductory statistics class, Professor Skew also gave a 10-question quiz. The number of correct answers for each student is recorded below.
6 8 10 8 8 7 9 8 10
8 7 8 9 7 7 3 8 7
8 7 6 6 6 5 5 9 8
Create a comparative dotplot with the data sets from the two statistics classes, Professor Norm's and Professor Skew's.
Write a few sentences comparing these distributions.
[Figure: comparative dotplot with one row for Prof. Skew and one for Prof. Norm; horizontal axis: Number of correct answers, 2 to 10]
The center of the distribution for the number of correct answers in Prof. Skew’s class is larger than the center for Prof. Norm’s class. There is also more variability in Prof. Skew’s distribution. Prof. Skew’s distribution appears to have an unusual observation, where one student had only 2 answers correct, while there were no unusual observations in Prof. Norm’s class. The distribution for Prof. Skew is negatively skewed, while Prof. Norm’s distribution is more symmetrical.
Is the distribution for Prof. Skew’s class symmetric? Why or why not?
Notice that the left side (or lower tail) of the distribution is longer than the right side (or upper
tail). This distribution is said to be negatively skewed (or skewed to the left).
A distribution in which the right tail is longer than the left is said to be positively skewed (or skewed to the right).
The direction of skewness is always in the direction of the longer tail.
Stem-and-Leaf Displays
When to Use: Univariate, Numerical data
How to construct
• Select one or more of the leading digits for the stem
• List the possible stem values in a vertical column
• Record the leaf for each observation beside the corresponding stem value
• Indicate the units for stems and leaves someplace in the display
Stem-and-leaf displays are an effective way to summarize univariate numerical data when the data set is not too large.
Each observation is split into two parts:
Stem - consists of the first digit(s)
Leaf - consists of the final digit(s)
Be sure to list every stem from the smallest to the largest value.
What to Look For
• A representative or typical value (center) in the data set
• The extent to which the data values spread out
• The presence of unusual values (gaps and outliers)
• The extent of symmetry in the data distribution
• The number and location of peaks
The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated percentage of households with only wireless phone service (no landline) for the 50 U.S. states and the District of Columbia. Data for the 19 Eastern states are given here.
5.6 5.7 20.0 16.8 16.5 13.4 10.8 9.3 11.6 8.0
11.4 16.3 14.0 10.8 7.8 20.6 10.8 5.1 11.6
What is the variable of interest?
Wireless percent
A stem-and-leaf display is an appropriate way to summarize these data.
(A dotplot would also be a reasonable choice.)
Let 5.6% be represented as 05.6% so that all the numbers have two digits in front of the decimal. If we used both digits, we would have stems from 05 to 20, which is far too many stems! So let’s just use the first digit (tens) as our stems.
0
1
2
So the leaf will be the last two digits.
With 05.6%, the leaf is 5.6 and it will be written behind the stem 0. The second number, 5.7, is also written behind the stem 0 (with a comma between).
0 5.6, 5.7
1
2
What is the leaf for 20.0% and where should that leaf be written?
0 5.6, 5.7
1
2 0.0
The completed stem-and-leaf display is shown below.
0 5.6, 5.7, 9.3, 8.0, 7.8, 5.1
1 6.8, 6.5, 3.4, 0.8, 1.6, 1.4, 6.3, 4.0, 0.8, 0.8, 1.6
2 0.0, 0.6
However, it is somewhat difficult to read due to the 2-digit leaves.
A common practice is to drop all but the first digit in the leaf.
0 5 5 9 8 7 5
1 6 6 3 0 1 1 6 4 0 0 1
2 0 0
This makes the display easier to read, but
DOES NOT change the overall distribution of
the data set.
While it is not necessary to write the leaves in order from smallest to largest, by doing so, the center of the distribution is more easily seen.
0 5 5 5 7 8 9
1 0 0 0 1 1 1 3 4 6 6 6
2 0 0
Stem: tens
Leaf: ones
Write a few sentences describing this distribution.
The center of the distribution for the estimated percentage of households with only wireless phone service is approximately 11%. There does not appear to be much variability. This display appears to be a unimodal, symmetric distribution with no outliers.
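The construction used for the Eastern states (tens digit as the stem, ones digit as the leaf, tenths dropped) can be sketched as follows. The function name `stem_and_leaf` is a hypothetical helper for this illustration; it assumes non-negative values below 100.

```python
from collections import defaultdict

# Wireless-only percentages for the 19 Eastern states.
eastern = [5.6, 5.7, 20.0, 16.8, 16.5, 13.4, 10.8, 9.3, 11.6, 8.0,
           11.4, 16.3, 14.0, 10.8, 7.8, 20.6, 10.8, 5.1, 11.6]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digit)."""
    leaves = defaultdict(list)
    for value in values:
        whole = int(value)            # drop the tenths digit (truncate)
        leaves[whole // 10].append(whole % 10)
    # List every stem from smallest to largest, even one with no leaves.
    return {stem: sorted(leaves[stem])
            for stem in range(min(leaves), max(leaves) + 1)}


display = stem_and_leaf(eastern)
for stem, stem_leaves in display.items():
    print(stem, " ".join(str(leaf) for leaf in stem_leaves))
print("Stem: tens")
print("Leaf: ones")
```

Sorting the leaves is optional, as noted above, but it makes the center of the distribution easier to see.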
Comparative Stem-and-Leaf Displays
When to Use: Univariate, numerical data with observations from 2 or more groups
How to construct
• List the leaves for one data set to the right of the stems
• List the leaves for the second data set to the left of the stems
• Be sure to include group labels to identify which group is on the left and which is on the right
The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated percentage of households with only wireless phone service (no landline) for the 50 U.S. states and the District of Columbia. Data for the 13 Western states are given here.
11.7 18.9 9.0 16.7 8.0 22.1 9.2 10.8
21.1 17.7 25.5 16.3 11.4
Create a comparative stem-and-leaf display comparing
the distributions of the Eastern and Western states.
Western States         Eastern States
        9 9 8 | 0 | 5 5 5 7 8 9
8 7 6 6 1 1 0 | 1 | 0 0 0 1 1 1 3 4 6 6 6
        5 2 1 | 2 | 0 0
Stem: tens
Leaf: ones
Write a few sentences comparing these distributions.
The center of the distribution of the estimated percentage of households with only wireless phone service for the Western states is a little larger than the center for the Eastern states. Both distributions are symmetrical with approximately the same amount of variability.
Histograms
Constructed differently for discrete versus continuous data
When to Use: Univariate numerical data
How to construct (discrete data)
• Draw a horizontal scale and mark it with the possible values for the variable
• Draw a vertical scale and mark it with frequency or relative frequency
• Above each possible value, draw a rectangle centered at that value with a height corresponding to its frequency or relative frequency
What to look for: Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers
Dotplots and stem-and-leaf displays are not effective ways to summarize numerical data when the data set contains a large number of data
values.
Histograms are displays that don’t work well for small data sets but do work well for larger numerical data
sets.
Discrete numerical data almost always result from counting. In such cases, each observation is a whole number.
Queen honey bees mate shortly after they become adults. During a mating flight, the queen usually takes multiple partners, collecting sperm that she will store and use throughout the rest of her life. A paper, “The Curious Promiscuity of Queen Honey Bees” (Annals of Zoology [2001]: 255-265), provided the following data on the number of partners for 30 queen bees.
12 2 4 6 6 7 8 7 8 11 8 3 5 6 7 10 1 9 7 6 9 7 5 4 7 4 6 7 8 10
Here is a dotplot of these data.
[Figure: dotplot of the data; horizontal axis: Number of partners, 0 to 12]
Queen honey bees continued
The variable, number of partners, is discrete.
To create a histogram: we already have a horizontal axis, so we need to add a vertical axis for frequency.
[Figure: histogram drawn over the dotplot; horizontal axis: Number of partners; vertical axis: Frequency]
The bars should be centered over the discrete data values and have heights corresponding to the frequency of each data value.
In practice, histograms for discrete data ONLY show the rectangular bars. We built the histogram on top
of the dotplot to show that the bars are centered over the discrete data values and that heights of the bars are the frequency of each data value.
The distribution for the number of partners of queen honey bees is approximately symmetric with a center at 7 partners and a somewhat large amount of variability. There don’t appear to be any outliers.
Here are two histograms showing the “queen bee data set”. One uses frequency on the vertical axis, while the other uses relative frequency.
What do you notice about the shapes of these two histograms?
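One way to see how the two vertical scales are related is to tabulate both for the queen bee data. The sketch below is an illustration only; it prints a row of "#" marks as a stand-in for each bar.

```python
from collections import Counter

# Number of partners for the 30 queen bees.
partners = [12, 2, 4, 6, 6, 7, 8, 7, 8, 11, 8, 3, 5, 6, 7,
            10, 1, 9, 7, 6, 9, 7, 5, 4, 7, 4, 6, 7, 8, 10]

counts = Counter(partners)
n = len(partners)

print(f"{'Value':>5} {'Bar':<8}{'Freq':>5}{'Rel freq':>10}")
for value in range(min(partners), max(partners) + 1):
    freq = counts[value]              # height on the frequency scale
    rel_freq = freq / n               # height on the relative frequency scale
    bar = "#" * freq
    print(f"{value:>5} {bar:<8}{freq:>5}{rel_freq:>10.3f}")
```

Every relative frequency is the frequency divided by the same number (30), so switching scales relabels the vertical axis without changing the relative heights of the bars.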
Histograms with equal width intervals
When to Use: Univariate numerical data
How to construct (continuous data)
• Mark the boundaries of the class intervals on the horizontal axis
• Use either frequency or relative frequency on the vertical axis
• Draw a rectangle for each class interval directly above that interval. The height of each rectangle is the frequency or relative frequency of the corresponding interval
What to look for: Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers
Consider the following data on carry-on luggage weight for 25 airline passengers.
25.0 17.9 10.1 27.6 30.0 18.0 28.7 28.2 27.8
28.0 31.4 20.9 33.8 27.6 21.9 19.9 20.8 28.5
22.4 24.9 26.4 22.0 34.5 22.7 25.3
This is a continuous numerical data set. Here is a dotplot of this data set.
With continuous data, the rectangular bars cover an interval of data values (not just one value). Looking at this dotplot, it is easy to see that we could use intervals with a width
of 5.
This interval includes 10 and all values up to but not including 15. The next intervals will include 15 and all values up to but not
including 20, and so on.
The top dotplot shows all the data values in each interval stacked in
the middle of the interval.
From the dotplot, it is easy to see how the continuous histogram is created.
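The grouping into class intervals of width 5 can be sketched like this. The starting point of 10 and the width of 5 come from the slides; `class_counts` is a hypothetical helper for this illustration.

```python
import math

# Carry-on luggage weights for 25 airline passengers.
weights = [25.0, 17.9, 10.1, 27.6, 30.0, 18.0, 28.7, 28.2, 27.8,
           28.0, 31.4, 20.9, 33.8, 27.6, 21.9, 19.9, 20.8, 28.5,
           22.4, 24.9, 26.4, 22.0, 34.5, 22.7, 25.3]

START = 10   # lower endpoint of the first class interval
WIDTH = 5    # common width of every class interval


def class_counts(values, start, width):
    """Count values per interval; each includes its lower endpoint only."""
    counts = {}
    for value in values:
        lower = start + width * math.floor((value - start) / width)
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


interval_counts = class_counts(weights, START, WIDTH)
for lower, freq in interval_counts.items():
    print(f"{lower} to < {lower + WIDTH}: {freq}")
```

A value that falls exactly on a boundary, such as 25.0 or 30.0, is counted in the interval that starts at that boundary, matching the "up to but not including" rule described above.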
Comparative Histograms
• Must use two separate histograms with the same horizontal axis and relative frequency on the vertical axis
[Figure: two histograms, one labeled 1-yr-olds and one labeled 3-yr-olds]
The article “Early Television Exposure and Subsequent Attention Problems in Children”
(Pediatrics, April 2004) investigated the television viewing habits of U.S. children. These graphs show the viewing habits of 1-year old and
3-year old children.
The biggest difference between the two histograms is at the low end, with a much higher proportion of 3-year-old children falling in the 0-2
TV hours interval than 1-year-old children.
Histograms with unequal width intervals
When to use: when you have a concentration of data in the middle with some extreme values
How to construct: construct similar to histograms with continuous data, but with density on the vertical axis
$$\text{density} = \frac{\text{relative frequency for interval}}{\text{width of interval}}$$
When people are asked for values such as age or weight, they sometimes shade the truth in their responses. The article “Self-Report of Academic Performance” (Social Methods and Research [November 1981]: 165-185) focused on SAT scores and grade point average (GPA). For each student in the sample, the difference between reported GPA and actual GPA was determined. Positive differences resulted from individuals reporting GPAs larger than the correct value.
Class Interval     Relative Frequency
-2.0 to < -0.4     0.023
-0.4 to < -0.2     0.055
-0.2 to < -0.1     0.097
-0.1 to < 0        0.210
0 to < 0.1         0.189
0.1 to < 0.2       0.139
0.2 to < 0.4       0.116
0.4 to < 2.0       0.171
When using relative frequency on the vertical axis, the proportional area principle is violated.
Notice the relative frequency for the interval 0.4 to < 2.0 is smaller than the relative frequency for the interval -0.1 to < 0, but the area of the bar is
MUCH larger.
GPAs continued
To fix this problem, we need to find the density of each interval.
$$\text{density} = \frac{\text{relative frequency for interval}}{\text{width of interval}}$$
Class Interval     Relative Frequency   Width   Density
-2.0 to < -0.4     0.023                1.6     0.014
-0.4 to < -0.2     0.055                0.2     0.275
-0.2 to < -0.1     0.097                0.1     0.970
-0.1 to < 0        0.210                0.1     2.100
0 to < 0.1         0.189                0.1     1.890
0.1 to < 0.2       0.139                0.1     1.390
0.2 to < 0.4       0.116                0.2     0.580
0.4 to < 2.0       0.171                1.6     0.107
This is a correct histogram with unequal widths.
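The density column can be reproduced with a short calculation. The sketch below assumes the class intervals are the contiguous ones implied by the Width column (for example, -0.2 to < -0.1 for the third row); the helper name `densities` is hypothetical.

```python
# (lower endpoint, upper endpoint, relative frequency) for each class interval.
intervals = [
    (-2.0, -0.4, 0.023),
    (-0.4, -0.2, 0.055),
    (-0.2, -0.1, 0.097),
    (-0.1, 0.0, 0.210),
    (0.0, 0.1, 0.189),
    (0.1, 0.2, 0.139),
    (0.2, 0.4, 0.116),
    (0.4, 2.0, 0.171),
]


def densities(rows):
    """density = relative frequency for interval / width of interval."""
    return [rel_freq / (upper - lower) for lower, upper, rel_freq in rows]


density_values = densities(intervals)
for (lower, upper, rel_freq), density in zip(intervals, density_values):
    width = upper - lower
    # Bar area = density * width, which recovers the relative frequency.
    print(f"{lower:>5.1f} to < {upper:<4.1f} width {width:.1f}  "
          f"density {density:.3f}  area {density * width:.3f}")
```

With density on the vertical axis, each bar's area equals its relative frequency, so the wide end intervals no longer dominate the picture.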
Cumulative Relative Frequency Plots
When to use: when you want to show the approximate proportion of data at or below any given value
How to construct
1. Mark the boundaries of the class intervals on a horizontal axis
2. Add a vertical axis with a scale that goes from 0 to 1
3. For each class interval, plot the point that is represented by (upper endpoint of interval, cumulative relative frequency)
4. Add the point represented by (lower endpoint of first interval, 0)
5. Connect consecutive points in the display with line segments
What to Look For: Proportion of data falling at or below any given value along the x axis
The cumulative relative frequency of a given interval is the sum of the current relative frequency and all the previous relative frequencies.
The National Climatic Data Center has been collecting weather data for many years. A frequency distribution for annual rainfall totals for Albuquerque, New Mexico, from 1950 to 2008 is shown in the table below.
Annual Rainfall (inches)   Frequency   Relative Frequency   Cumulative Relative Frequency
4 to < 5                   3           0.052                0.052
5 to < 6                   6           0.103                0.155
6 to < 7                   5           0.086                0.241
7 to < 8                   6           0.103                0.344
8 to < 9                   10          0.172                0.516
9 to < 10                  4           0.069                0.585
10 to < 11                 12          0.207                0.792
11 to < 12                 6           0.103                0.895
12 to < 13                 3           0.052                0.947
13 to < 14                 3           0.052                0.999
Relative frequency = frequency/58
Cumulative relative frequency = current relative frequency + sum of all previous relative frequencies
To create the cumulative relative frequency plot:
Plot the point (upper value of the interval, cumulative relative frequency of the interval)
Plot the point:
(smallest value of the first interval, 0)
The National Climatic Data Center has been collecting weather data for many years. The annual rainfall data for Albuquerque, New Mexico, from 1950 to 2008 were used to construct the cumulative relative frequency plot below.
What percent of the years had rainfall of 7.5 inches or less?
About 30%
Which interval has the most observations in it, 9 to < 10 or 10 to < 11? Why?
10 to < 11, because it has a steeper slope
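The cumulative relative frequencies, and the reading taken from the plot at 7.5 inches, can be sketched as follows. The running total adds up the rounded relative frequencies, which is how the table in the slides appears to have been built (it ends at 0.999 rather than 1). The helper `proportion_at_or_below` is hypothetical and does the same straight-line interpolation that reading the plot does.

```python
frequencies = [3, 6, 5, 6, 10, 4, 12, 6, 3, 3]   # rainfall classes 4 to < 5, ..., 13 to < 14
upper_endpoints = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
n = sum(frequencies)                             # 58 observations

relative = [round(f / n, 3) for f in frequencies]

cumulative = []
running = 0.0
for rel_freq in relative:
    running = round(running + rel_freq, 3)       # current + all previous
    cumulative.append(running)

# Points to plot: (lower endpoint of first interval, 0), then
# (upper endpoint of interval, cumulative relative frequency).
points = [(4, 0.0)] + list(zip(upper_endpoints, cumulative))


def proportion_at_or_below(x, pts):
    """Read the plot at x by interpolating along the connecting segment."""
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the plotted range")


print(cumulative)
print(proportion_at_or_below(7.5, points))
```

The value read at 7.5 inches is a little under 0.30, consistent with the "about 30%" answer. The segment over 10 to < 11 rises by 0.207 while the one over 9 to < 10 rises by only 0.069, which is why the steeper segment marks the interval with more observations.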
Displaying Bivariate Numerical Data
Scatterplots
Time Series Plots
Scatterplots
When to Use: Bivariate Numerical data
How to construct
1. Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable. Label the vertical axis and include an appropriate scale for the y-variable.
2. For each (x, y) pair in the data set, add a dot in the appropriate location in the display.
What to look for: Relationship between x and y
The accompanying table gives the cost (in dollars) and an overall quality rating for 10 different brands of men’s athletic shoes (www.consumerreports.org).
Is there a relationship between x = cost and y = quality rating?
Cost     65   45   45   80   110   110   30   80   110   70
Rating   71   70   62   59   58    57    56   52   51    51
A scatterplot can help answer this question.
First, draw and label appropriate horizontal and vertical axes.
Next, plot each (x, y) pair. Here is the completed scatterplot.
[Figure: scatterplot; horizontal axis: Cost, 20 to 100; vertical axis: Rating, 50 to 70]
Is there a relationship between x = cost and y = quality rating?
There appears to be a negative relationship
between cost of athletic shoes and
their quality rating – does that surprise
you?
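As a numerical cross-check on what the scatterplot shows, one could compute the correlation coefficient r for the shoe data. This is a sketch only: r is not developed in this chapter (it appears only in the rum example under Common Mistakes), and `pearson_r` is a hypothetical helper implementing the standard Pearson formula.

```python
import math

cost = [65, 45, 45, 80, 110, 110, 30, 80, 110, 70]
rating = [71, 70, 62, 59, 58, 57, 56, 52, 51, 51]


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y over their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(cost, rating)
print(f"r = {r:.2f}")
```

The result is negative (roughly -0.4), which agrees with the direction seen in the plot, though with only 10 brands the relationship is far from tight.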
Time Series Plots
When to Use: Bivariate data with time and another variable
How to construct
1. Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable. Label the vertical axis and include an appropriate scale for the y-variable.
2. For each (x, y) pair in the data set, add a dot in the appropriate location in the display.
3. Connect each dot in order
What to look for: trends or patterns over time
The Christmas Price Index is computed each year by PNC Advisors. It is a humorous look at the cost of giving all the gifts described in the popular Christmas song “The Twelve Days of Christmas” (www.pncchristmaspriceindex.com).
Describe any trends or patterns that you see.
Why is there a downward trend between 1993 &
1995?
Graphical Displays in the Media
Pie Charts
Segmented Bar Charts
Pie (Circle) Chart
When to Use: Categorical data
How to construct
• A circle is used to represent the whole data set.
• “Slices” of the pie represent the categories
• The size of a particular category’s slice is proportional to its frequency or relative frequency.
• Most effective for summarizing data sets when there are not too many categories
The article “Fred Flintstone, Check Your Policy” (The Washington Post, October 2, 2005) summarized a survey of 1014 adults conducted by the Life and Health Insurance Foundation for Education. Each person surveyed was asked to select which of five fictional characters had the greatest need for life insurance: Spider-Man, Batman, Fred Flintstone, Harry Potter, and Marge Simpson. The data are summarized in the pie chart. The survey results were quite different from the assessment of an insurance expert.
The insurance expert felt that Batman, a wealthy bachelor, and Spider-Man did not need life insurance as much as Fred Flintstone, a married man with dependents!
Segmented (or Stacked) Bar Charts
When to Use: Categorical data
How to construct
• Use a rectangular bar rather than a circle to represent the entire data set.
• The bar is divided into segments, with different segments representing different categories.
• The area of the segment is proportional to the relative frequency for the particular category.
A pie chart can be difficult to construct by hand. The circular shape sometimes makes it difficult to compare areas for different categories, particularly when the relative frequencies are similar.
So, we could use a segmented bar chart.
Each year, the Higher Education Research Institute conducts a survey of college seniors. In 2008, approximately 23,000 seniors participated in the survey (“Findings from the 2008 Administration of the College Senior Survey,” Higher Education Research Institute, June 2009).
This segmented bar chart summarizes student responses to the question: “During the past year, how much time did you spend studying and doing homework in a typical week?”
Common Mistakes
Avoid these Common Mistakes
1. Areas should be proportional to frequency, relative frequency, or magnitude of the number being represented.
The eye is naturally drawn to large areas in graphical displays. Sometimes, in an effort to make the graphical displays more interesting, designers lose sight of this important principle. Consider this graph (USA Today, October 3, 2002).
By replacing the bars of a bar chart with milk buckets, areas are
distorted.
The two buckets for 1980 represent 32 cows,
whereas the one bucket for 1970 represents 19 cows.
Another common distortion occurs when a third dimension is added to bar charts or pie charts. This distorts the areas and makes it much more difficult to interpret.
2. Be cautious of graphs with broken axes (axes that don’t start at 0).
• The use of broken axes in a scatterplot does not result in a misleading picture of the relationship of bivariate data.
• In time series plots, broken axes can sometimes exaggerate the magnitude of change over time.
• In bar charts and histograms, the vertical axis should NEVER be broken. This violates the “proportional area” principle.
This bar chart is similar to one in an advertisement for a software product designed to raise student test scores. Areas of the bars are not proportional to the magnitude of the numbers represented: the area of the rectangle representing 68 is more than three times the area of the rectangle representing 55!
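The size of the distortion depends on where the vertical axis starts. The sketch below compares the ratio of bar heights with an axis starting at 0 and with a broken axis. The baseline of 50 is an assumption chosen purely for illustration; the slides do not say where the advertisement's axis began.

```python
def bar_height_ratio(larger, smaller, baseline=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at `baseline`."""
    return (larger - baseline) / (smaller - baseline)


# Axis starting at 0: heights are proportional to the values themselves.
honest_ratio = bar_height_ratio(68, 55)

# Hypothetical broken axis starting at 50 (an assumed value, not from the source).
broken_ratio = bar_height_ratio(68, 55, baseline=50)

print(f"Axis from 0:  bar for 68 is {honest_ratio:.2f} times as tall")
print(f"Axis from 50: bar for 68 is {broken_ratio:.2f} times as tall")
```

Because the bars share a width, the ratio of heights is also the ratio of areas, so a score only about 24% higher can be drawn with a bar several times larger.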
3. Watch out for unequal time spacing in time series plots.
If observations over time are not made at regular time intervals, special care must be taken in constructing the time series plot.
Notice that the intervals between observations are irregular, yet the points in the plot are equally spaced along the time axis. This makes it difficult to assess the rate of change over time.
Here is a correct time series plot.
4. Be careful how you interpret patterns in scatterplots.
Consider the following scatterplot showing the relationship between the number of Methodist ministers in New England and the amount of Cuban rum imported into Boston from 1860 to 1940 (Education.com).
r = .999973
A strong pattern in a scatterplot means that the two variables tend to vary together in a predictable way, BUT it does not mean that there is a cause-and-effect relationship.
[Figure: scatterplot; horizontal axis: Number of Methodist Ministers, 0 to 300; vertical axis: Number of Barrels of Imported Rum, 5000 to 35000]
Does an increase in the number of Methodist ministers CAUSE the increase in imported rum?
5. Make sure that a graphical display creates the right first impression.
Consider the following graph from USA Today (June 25, 2001). Although this graph does not violate the proportional area principle, the way the “bar” for the none category is displayed makes this graph difficult to read. A quick glance at this graph may leave the reader with an incorrect impression.