Upload
joel-anthony
View
220
Download
0
Tags:
Embed Size (px)
Citation preview
Objectives (IPS chapter 1.1)Displaying distributions with graphs
Labels/Variables
Two types of variables
Ways to chart categorical data Bar graphs
Pie charts
Ways to chart quantitative data Line graphs: time plots
Scales matter
Histograms
Stemplots
Stemplots versus histograms
Interpreting histograms
Variables
In a study, we collect information—data—from individuals. Individuals
can be people, animals, plants, or any object of interest.
A variable is a characteristic that varies among individuals in a
population or in a sample (a subset of a population).
Example: age, height, blood pressure, ethnicity, leaf length, first language
The distribution of a variable tells us what values the variable takes
and how often it takes these values.
Two types of variables Variables can be either quantitative (or numerical)…
Something that can be counted or measured for each individual and then
added, subtracted, averaged, etc. across individuals in the population.
Example: How tall you are, your age, your blood cholesterol level, the
number of credit cards you own
… or categorical.
Something that falls into one of several categories. What can be counted
is the count or proportion of individuals in each category.
Example: Your blood type (A, B, AB, O), your hair color, your ethnicity,
whether you paid income tax last tax year or not
Cases: are the objects described by a set of data. Cases may be customers,
companies, subjects in a study, or other subjects.
A label: a special variable used in some data sets to distinguish different cases.
A variable is a characteristic of a case.
Different cases can have different values for the variables.
Example:(1) The cases are the individual students;
(2) The first three (Student identification number, last name, first name) are labels.
(3) Gender is a categorical variable;
(4) Test 1 to Final are numerical variables.
Cases, Labels, Variables, and Values
Eg: How do you know if a variable is categorical or quantitative?Ask: What are the n individuals/units in the sample (of size “n”)? What is being recorded about those n individuals/units? Is that a number ( quantitative) or a statement ( categorical)?
Individualsin sample
DIAGNOSIS AGE AT DEATH
Patient A Heart disease 56
Patient B Stroke 70
Patient C Stroke 75
Patient D Lung cancer 60
Patient E Heart disease 80
Patient F Accident 73
Patient G Diabetes 69
QuantitativeEach individual is
attributed a numerical value.
CategoricalEach individual is assigned to one of several categories.
HW 1.14(b), 1.16
LabelEach individual is assigned to one
label.
Definition: quantitative data are discrete if the possible values are isolated
points on the number line. quantitative data are continuous if the set of possible values
forms an entire interval on the number line.Question: Are peoples’ heights continuous? What about ages? What about family size?
(1) For categorical data: i) Frequency / Relative Frequency distribution; ii) Bar graph
iii) Pie charts.(2) For numerical data: i) Frequency distribution; ii) Stemplot (stem-and-leaf plot); iii) Histogram.(3) For bivariate numerical data: Scatter plot
Summary of variables
Ways to chart categorical dataBecause the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.)
Bar graphsEach category isrepresented by a bar.
Pie chartsPeculiarity: The slices must represent the parts of one whole.
For frequency distribution, the most important definition is relative frequency, which is defined by
Example 1: For seven students with gender: M, F, M, M, F, M, F. (M=male; F=Female).
Q1: What is frequency distribution?
Q2: Find the relative frequency for Example 1.
Q3: Make a bar graph and pie chart for Example 1.
set data in the obs of #
frequencyfrequency relative
HW 1.21(a)
Example 1:
For the following frequency distribution:
Q1: Find the relative frequency. (Eg: in the format of 12.56%)
Q2: Make a bar graph for Example 2.
Q3: Make a pie chart for Example 2.
set data in the obs of #
frequencyfrequency relative
Grade FreqFreq
Freshman 12
Sophomore 25
Junior 16
Senior 10
Others 1
TotalTotal 64
Relative FreqRelative Freq
12/64=18.75%
25/64=39.06%
16/64=25%
10/64=15.63%
1/64= 1.56%
1=100%
HW 1.21(a)
Example 2:
Example 2: Students percentage (continue)
Example: Top 10 causes of death in the United States 2006
Rank Causes of death Counts% of top
10s% of total deaths
1 Heart disease 631,636 34% 26%
2 Cancer 559,888 30% 23%
3 Cerebrovascular 137,119 7% 6%
4 Chronic respiratory 124,583 7% 5%
5 Accidents 121,599 7% 5%
6 Diabetes mellitus 72,449 4% 3%
7 Alzheimer’s disease 72,432 4% 3%
8 Flu and pneumonia 56,326 3% 2%
9 Kidney disorders 45,344 2% 2%
10 Septicemia 34,234 2% 1%
All other causes 570,654 24%
For each individual who died in the United States in 2006, we record what was
the cause of death. The table above is a summary of that information.
Top 10 causes of deaths in the United States 2006
Bar graphsEach category is represented by one bar. The bar’s height shows the count (or
sometimes the percentage) for that particular category.
The number of individuals who died of an accident in
2006 is approximately 121,000.
HW 1.21(b, c)
Percent of people dying fromtop 10 causes of death in the United States in 2006
Pie chartsEach slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.
HW 1.22(b)
Percent of deaths from top 10 causes
Percent of deaths from
all causes
Make sure your labels match
the data.
Make sure all percents
add up to 100.
Child poverty before and after government intervention—UNICEF, 2005
What does this chart tell you?
•The United States and Mexico have the highest
rate of child poverty among OECD (Organization
for Economic Cooperation and Development)
nations (22% and 28% of under 18).
•Their governments do the least—through taxes
and subsidies—to remedy the problem (size of
orange bars and percent difference between
orange/blue bars).
Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
R program for Students percentage example## The route of your data files
route<-"//bearsrv/classrooms/Math/cchen/STT215/LAB_Fall2011/"
## Read in the dataset
dat=read.csv(paste(route,'In_Class_exercises1.csv',sep=''), header=TRUE)
## To plot pie graph
pie(dat$Percent, labels=dat$Students)
## To plot pie graph with a title and make it colorful
pie(dat$Percent, labels=dat$Students, main='Student percentage', col = rainbow(5))
## To create a barplot
barplot(dat$Percent, names.arg= dat$Students, main='Student percentage')
Overview: Ways to chart quantitative data Stemplots
Also called a stem-and-leaf plot. Each observation is represented by a
stem, consisting of all digits except the final one, which is the leaf.
Histograms
A histogram breaks the range of values of a variable into classes and
displays only the count or percent of the observations that fall into each
class.
Line graphs: time plots
A time plot of a variable plots each observation against the time at
which it was measured.
Stem plots: Example 1How to make a stemplot:
Separate each observation into a stem, consisting of all but the final (rightmost) digit,
and a leaf, which is that remaining final digit. Stems may have as many digits as
needed, but each leaf contains only a single digit.
Eg: how to find the stem and leaf for: 25, 135, 6.
Write the stems in a vertical column with the smallest value at the top, and draw a
vertical line at the right of this column.
Write each leaf in the row to the right of its stem, in increasing order out from the
stem.
Dataset: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70.
Q: Make a stem-leaf-plot for this dataset.
Stem plots: Example 2How to make a stemplot:
Separate each observation into a stem, consisting of all but the final
(rightmost) digit, and a leaf, which is that remaining final digit. Stems may
have as many digits as needed, but each leaf contains only a single digit.
Write the stems in a vertical column with the smallest value at the top, and
draw a vertical line at the right of this column.
Write each leaf in the row to the right of its stem, in increasing order out
from the stem.
Dataset: 1, 3, 3, 12, 15, 17, 21, 25, 45, 49, 62, 67, 69. Q: Make a stem-leaf-plot for this dataset.
Stem and leaf Notes:To compare two related distributions, a back-to-back stem plot
with common stems is useful.
Stem-and-leaf plot works best for small numbers of observations that are all greater than 0. But it does not work well for large datasets.
Stem-and-leaf plot display the actual values of the observations.
Stem and leaf Notes: Eg: Make a stemplot for the data: 115, 143, 162, 198, 267, 279, 302. Trim and also split stems. That means: trimming numbers means dropping
the last digit.
Eg: Original data 141, by dropping the last digit, it gives 14.
Original data 255, by dropping the last digit, it gives 25.
Splitting stems. Eg: your dataset are:
1, 7, 10, 11, 12, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25, 26, 28, 29.
Then by “splitting stem”, it gives:
0 | 1
0 | 7
1 | 0 1 2 3
1 | 5 6 7 8 9
2 | 1 2 3
2 | 5 6 8 9
Here “splitting stem” says, if you dataset is of median size, then even when some numbers share same stem, we separate them into two parts with same stem, one part with 0-4, and another part with 5-9.
Histogram A histogram breaks the range of values of a variable into classes
and displays only the count or percentage of the observations that fall into each class.
You can choose any convenient number of classes, but you should always choose classes of equal width.
Table 1.3Introduction to the Practice of Statistics, Sixth Edition
© 2009 W.H. Freeman and Company
Steps to draw a histogram Step 1: Divide the range of the data into classes of equal width. (Be sure to
specify the classes precisely so that each individual falls into exactly one class.)
EX: IQ Scores 75<= IQ Scores < 85; 85<= IQ Scores < 95; 95<= IQ Scores < 105; 105<= IQ Scores < 115; 115<= IQ Scores < 125; 125<= IQ Scores < 135;135<= IQ Scores < 145; 145<= IQ Scores < 155;
Step 2: Count the number of individual in each class. The counts are called frequencies, and a table of frequencies for all class is a frequency table.
Step 3: Draw the histogram.
First, on the horizontal axis mark the scale for the variable whose distribution you are displaying. The vertical axis contains the scale of counts. Each bar represents a class. The base covers the class. The bar height is the class count.
classes [75, 85) [85, 95) [95, 105) [105,115) [115,125) [125,135) [135,145) [145,155)
counts 2 3 10 16 13 10 5 1
Q: How many percent of those chose fifth-grade students have IQ scores of 105 or less?
Q: How many percent of those chose fifth-grade students have IQ scores of 105 or less?
In Summary…
Important property of a density curve is that areas under the curve correspond to relative frequencies
Stemplots are quick and dirty histograms that can easily be done by
hand, and therefore are very convenient for back of the envelope
calculations. However, they are rarely found in scientific or laymen
publications.
Stemplots versus histograms
Once a graph of the variable is made, we can begin to understand its distribution by looking at the following: look at the overall pattern in the graph and for striking
deviations from that overall pattern. Peaks? Gaps? Symmetric? Skewed?
describe the overall pattern of the distribution by talking about its shape, center, and spread (or variation).
look for possible outliers in the distribution; i.e., those values of the variable that seem to fall outside the overall pattern you see.
These features will be important for all types of graphs…
Stemplots versus histograms Examining distributions of Quantitative Variables
Contiunue 1. one or several peaks (called modes)? If unique major peak, then
call unimodal.
2. Center/Midpoint: the value with roughly half the observations taking smaller values and half taking larger values.
3. symmetric or skewed to the right / left? Skewed to the right if the right tail (larger value) is much longer than the left tail (smaller value).
4. Outlier: a point that is clearly apart from the body of the data, not just the most extreme observation in a distribution.
Complex, multimodal distribution
Symmetric distribution
Skewed distribution
Alaska Florida
Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetrical except for 2
states that clearly do not
belong to the main trend.
Alaska and Florida have
unusual representation of
the elderly in their
population.
A large gap in the
distribution is typically a
sign of an outlier.
Symmetric
Skewed to the right
Skewed to the left
Outlier
Double peaks
Q: 1.37(p 28) HWQ: 1.25, 1.26, 1.27, 1.42
How to create a histogram
It is an iterative process – try and try again.
What bin size should you use?
Not too many bins with either 0 or 1 counts
Not overly summarized that you lose all the information
Not so detailed that it is no longer summary
rule of thumb: start with 5 to 10 bins
Look at the distribution and refine your bins
(There isn’t a unique or “perfect” solution)
Not summarized enough
Too summarized
Same data set
Death rates from cancer (US, 1945-95)
0
50
100
150
200
250
1940 1950 1960 1970 1980 1990 2000
Years
Death
rate
(per
thousand)
Death rates from cancer (US, 1945-95)
0
50
100
150
200
250
1940 1960 1980 2000
Years
Dea
th r
ate
(per
thou
sand
)
Death rates from cancer (US, 1945-95)
0
50
100
150
200
250
1940 1960 1980 2000
Years
Death
rate
(per
thousand)
A picture is worth a thousand words,
BUT
There is nothing like hard numbers.
Look at the scales.
Scales matterHow you stretch the axes and choose your scales can give a different impression.
Death rates from cancer (US, 1945-95)
120
140
160
180
200
220
1940 1960 1980 2000
Years
Death
rate
(pe
r th
ousan
d)
IMPORTANT NOTE:
Your data are the way they are.
Do not try to force them into a
particular shape.
It is a common misconception
that if you have a large enough
data set, the data will eventually
turn out nice and symmetrical.
Histogram of dry days in 1995
Line graphs: time plots
A trend is a rise or fall that persist over time, despite small irregularities.
In a time plot, time always goes on the horizontal, x axis.
We describe time series by looking for an overall pattern and for striking
deviations from that pattern. In a time series:
A pattern that repeats itself at regular intervals of time is
called seasonal variation.
Retail price of fresh oranges over time
This time plot shows a regular pattern of yearly variations. These are seasonal
variations in fresh orange pricing most likely due to similar seasonal variations in
the production of fresh oranges.
There is also an overall upward trend in pricing over time. It could simply be
reflecting inflation trends or a more fundamental change in this industry.
Time is on the horizontal, x axis.
The variable of interest—here
“retail price of fresh oranges”—
goes on the vertical, y axis.
1918 influenza epidemicDate # Cases # Deaths
week 1 36 0week 2 531 0week 3 4233 130week 4 8682 552week 5 7164 738week 6 2229 414week 7 600 198week 8 164 90week 9 57 56week 10 722 50week 11 1517 71week 12 1828 137week 13 1539 178week 14 2416 194week 15 3148 290week 16 3465 310week 17 1440 149
0100020003000400050006000700080009000
10000
0100200300400500600700800
0100020003000400050006000700080009000
10000
# c
ase
s d
iag
no
sed
0
100
200
300
400
500
600
700
800
# d
ea
ths
rep
ort
ed
# Cases # Deaths
A time plot can be used to compare two or more
data sets covering the same time period.
The pattern over time for the number of flu diagnoses closely resembles that for the
number of deaths from the flu, indicating that about 8% to 10% of the people
diagnosed that year died shortly afterward from complications of the flu.
Go to http://www.statcrunch.com/ and log in with your user name and password.
From tool bar, click on MyStatCrunch:
Then click on My Groups:
Then click on “join a group” In the search box on the left panel, search for: STT215 Then you are able to find Now select and join our UNCW-STT215 group.
How to run StatCrunch
Searchengines.xls (Categorical variable with summary of %).
Ruthmcgwire.xls (Numerical variable)
Beer.xls (Complex dataset)
How to run StatCrunch
After you make a graph, now click on
the “Options” icon.
If you select “COPY”, then a new window
Will pop up. Right-click the image to
copy it.
Now use “Ctr+Alt+v” to past the image.
You can choose the bitmap format.
Otherwise you can save the image and cut and paste later on.
How to run StatCrunch