Upload
florence-montgomery
View
34
Download
0
Tags:
Embed Size (px)
DESCRIPTION
1.2 Displaying and Describing Categorical & Quantitative Data. You should be able to:. Recognize when a variable is categorical or quantitative Choose an appropriate display for a categorical variable and a quantitative variable - PowerPoint PPT Presentation
Citation preview
1.2 Displaying and Describing Categorical & Quantitative Data
You should be able to:• Recognize when a variable is categorical or quantitative• Choose an appropriate display for a categorical variable and a
quantitative variable• Summarize the distribution with a bar, pie chart, stem-leaf plot,
histogram, dot plot, box plots• Know how to make a contingency table• Describe the distribution of categorical variables in terms of relative
frequencies• Be able to describe the distribution of quantitative variables in terms
of its shape, center and spread• Describe abnormalities or extraordinary features of distribution• Discuss outliers and how they deviate from the overall pattern
3 RULES of EXPLORATORY DATA ANALYSIS
1. MAKE A PICTURE – find patterns in difficult to see from a chart
2. MAKE A PICTURE – show important features in graph
3. MAKE A PICTURE- communicates your data to others
Concepts to know!
• Bar graph• Histogram• Dot plot• Stem leaf plot• Scatterplots• Boxplots
Which graph to use?
• Depends on type of data
• Depends on what you want to illustrate
Categorical Data
• The objects being studied are grouped into categories based on some qualitative trait.
• The resulting data are merely
labels or categories.
Categorical Data(Single Variable)
Eye Color BLUE BROWN GREEN
Frequency
(COUNTS)
20 50 5
Relative Frequency
20/75 =
.27
50/75=
.66
5/75=
.07
Pie Chart(Data is Counts or Percentages)
Eye Color
Blue , 20, 27%
Brown, 50, 66%
Green, 5, 7%
Blue Brown Green
Bar Graph(Shows distribution of data)
Eye Color
0
10
20
30
40
50
60
Blue Brown Green
Color
Fre
qu
en
cy
Blue
Brown
Green
Bar Graph
• Summarizes categorical data.• Horizontal axis represents categories, while
vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”).
• Used to illustrate the differences in percentages (or counts) between categories.
Contingency Table(How data is distributed
across multiple variables)
Class
Survival
First Second Third Crew Total
ALIVE 203 118 178 212 711
DEAD 122 167 528 673 1490
Total 325 285 706 885 2201
What can go wrong when working with
categorical data?
• Pay attention to the variables and what the percentages represent(9.4% of passengers who were in first class survived is different from 67% of survivors were first class passengers!!!)
• Make sure you have a reasonably large data set
(67% of the rats tested died and 1 lived)
Analogy
Bar chart is to categorical data as histogram is to ...
quantitative data.
Histogram
18 19 20 21 22 23 24 25 26 27
0
10
20
30
40
50
Age (in years)
Fre
quency (
Count)
Age of Spring 1998 Stat 250 Students
n=92 students
Histogram
• Divide measurement up into equal-sized categories (BIN WIDTH)
• Determine number (or percentage) of measurements falling into each category.
• Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories.
• Label and title appropriately.
• http://www.stat.sc.edu/~west/javahtml/Histogram.html
Use common sense in determining number of
categories to use.
Between 6 & 15 intervals is preferable
(Trial-and-error works fine, too.)
Histogram
Too few categories
18 23 28
0
10
20
30
40
50
60
Age (in years)
Fre
quency (
Count)
Age of Spring 1998 Stat 250 Students
n=92 students
Too many categories
2 3 4
0
1
2
3
4
5
6
7
GPA
Fre
quen
cy (
Co
unt)
GPAs of Spring 1998 Stat 250 Students
n=92 students
Dot Plot
160150140130120110100908070Speed
Fastest Ever Driving Speed
Women126
Men100
226 Stat 100 Students, Fall '98
Dot Plot
• Summarizes quantitative data.• Horizontal axis represents measurement
scale.• Plot one dot for each data point.
Stem-and-Leaf PlotStem-and-leaf of Shoes N = 139 Leaf Unit = 1.0
12 0 223334444444 63 0 555555555555566666666677777778888888888888999999999 (33) 1 000000000000011112222233333333444 43 1 555555556667777888 25 2 0000000000023 12 2 5557 8 3 0023 4 3 4 4 00 2 4 2 5 0 1 5 1 6 1 6 1 7 1 7 5
Stem-and-Leaf Plot
• Summarizes quantitative data.• Each data point is broken down into a “stem”
and a “leaf.”• First, “stems” are aligned in a column.• Then, “leaves” are attached to the stems.
Box Plot
0
1
2
3
4
5
6
7
8
9
10
Hours
of sl
eep
Amount of sleep in past 24 hours
of Spring 1998 Stat 250 Students
Box Plot
• Summarizes quantitative data.• Vertical (or horizontal) axis represents
measurement scale.• Lines in box represent the 25th percentile (“first
quartile”), the 50th percentile (“median”), and the 75th percentile (“third quartile”), respectively.
5 Number Summary
• Minimum• Q1 (25th percentile)• Median (50th percentile)• Q3 (75th percentile)• Maximum
An aside...
• Roughly speaking:– The “25th percentile” is the number such
that 25% of the data points fall below the number.
– The “median” or “50th percentile” is the number such that half of the data points fall below the number.
– The “75th percentile” is the number such that 75% of the data points fall below the number.
Box Plot (cont’d)
• “Outliers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile(IQR).
IQR = Q3 - Q1
Outliers(upper) > Q3+1.5 * IQR
Outliers(lower)<Q1-1.5*IQR
Using Box Plots to Compare
female male
60
110
160
Gender
Fast
est
Speed (
mph)
Fastest Ever Driving Speed
226 Stat 100 Students, Fall 1998 Outliers
Strengths and Weaknesses of Graphs for Quantitative Data
• Histograms– Uses intervals– Good to judge the “shape” of a data– Not good for small data sets
• Stem-Leaf Plots– Good for sorting data (find the median)– Not good for large data sets
Strengths and Weaknesses of Graphs for Quantitative Data
• Dotplots– Uses individual data points– Good to show general descriptions of
center and variation– Not good for judging shape for large data sets
• Boxplots– Good for showing exact look at center,
spread and outliers– Not good for judging shape
Analogy
Contingency table is to categorical data with two
variables as
scatterplot is to ..
quantitative data with two
variables.
Scatter Plots
22 23 24 25 26 27 28 29 30 31
22
23
24
25
26
27
28
29
30
31
Left foot (in cm)
Rig
ht fo
ot (in c
m)
Foot sizes of Spring 1998 Stat 250 students
n=88 students
Scatter Plots
• Summarizes the relationship between two quantitative variables.
• Horizontal axis represents one variable and vertical axis represents second variable.
• Plot one point for each pair of measurements.
No relationship
52 57 62
22
23
24
25
26
27
28
29
30
31
32
Head circumference (in cm)
Left fore
arm
(in
cm
)Lengths of left forearms and head circumferences
of Spring 1998 Stat 250 Students
n=89 students
Summary
• Many possible types of graphs.• Use common sense in reading graphs.• When creating graphs, don’t summarize your
data too much or too little.• When creating graphs, label everything for
others. Remember you are trying to communicate something to others!
GRAPHICAL ANALYSIS
• CENTER
• SPREAD
Graphical Analysis
Center Center
(Location)(Location)
SpreadSpread
(Variation)(Variation)
ShapeShape
Interesting Features Identified by Graphs
• Center (Location)• Spread (Variability)• Shape • Individual Values• Compare Groups• Identify Outliers• LOOK FOR PATTERNS, CLUSTERS, GAPS!!!!!• LOOK FOR DEVIATIONS FROM THE GENERAL
PATTERN!!!!!
http://bcs.whfreeman.com/bps3e/
Shape of Graph
Shape of Graph
• Symmetry-Skewness• Modes(peaks)
SYMMETRY & SKEWNESS
• Skewness - a measure of symmetry, or more precisely, the lack of symmetry.
• A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis
• Kurtosis - a measure of whether the data are peaked or flat relative to a normal distribution.
• High kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. L
• Low kurtosis tend to have a flat top near the mean rather than a sharp peak.
Symmetric -- Mesokurtic
Symmetric -- Platykurtic
Symmetric -- Leptokurtic
Nonsymmetric – Skewed Positive -Skewed Right
Nonsymmetric – Skewed Negative -- Skewed Left
Symmetric -- Bimodal
•
Describing Distributions –
SOCS Rock!• Shape
• Outliers
• Center
• Spread
Graphical Analysis - Scale
375 400 425 450 475 500 525 350 400 450 500 550 600 650
SOL Scores – Class 1 SOL Scores – Class 2
Graphical Analysis - Scale
400 450 500 550 600 400 450 500 550 600
SOL Scores – Class 1 SOL Scores – Class 2
400 450 500 550 600 400 450 500 550 600
375 400 425 450 475 500 525 350 400 450 500 550 600 650
SOL Scores – Class 1 SOL Scores – Class 2
• Comparing Histograms