Chapter 7 (part 2)
Scatterplots, Association, and Correlation
Word of Caution in Correlation: Beware of Outliers
Outliers can greatly impact correlation.
For all n=10 data points, r = 0.880
For the n=9 clustered data points, r = 0.000
Don’t just remove outliers
It is not appropriate (or ethical) to just remove a data point because it does not follow the general trend (or the expected trend).
Start by making sure it is not a mistake. Often, outliers can show us an interesting phenomenon that leads to further research.
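To see how much a single point can move r, here is a minimal sketch using made-up data (not the data behind the slide's plot). Nine clustered points are arranged so that they show no linear trend, and one hypothetical outlier is then added far from the cluster. The helper name pearson_r and all data values are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data (NOT the slide's data): a 3x3 grid with no linear trend.
cluster_x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
cluster_y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

# One made-up outlier far away from the cluster.
all_x = cluster_x + [10]
all_y = cluster_y + [10]

r_cluster = pearson_r(cluster_x, cluster_y)
r_all = pearson_r(all_x, all_y)

print(f"n=9 clustered points: r = {r_cluster:.3f}")
print(f"n=10 with outlier:    r = {r_all:.3f}")
```

In this toy data set the cluster alone has r = 0, while adding the one outlier pushes r to roughly 0.9, the same kind of jump the slide describes.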
Perhaps the most important caution about interpreting correlation is: Correlation does not necessarily imply causality.
Correlation does not imply Causality
Possible Explanations for a Correlation
1. The correlation may be a coincidence.
2. Both correlated variables might be directly influenced by some common underlying cause.
3. One of the correlated variables may actually be a cause of the other. But note that, even in this case, it may be just one of several causes.
Correlation and Causality (Concepts & Applications)
For each example that follows, state whether the correlation is most likely due to …
(Choose which reason is most likely)
Coincidence
A common underlying cause
A direct cause (i.e. changing X causes a change in Y)
Correlation and Causality (Concepts & Applications)
Greater damage was observed when more firefighters were at a fire.
(Which reason is most likely?)
Coincidence
A common underlying cause
A direct cause (i.e. changing X causes a change in Y)
[Figure: scatterplot of fire damage (dollars $) vs. number of firefighters at scene]
Correlation and Causality (Concepts & Applications)
There is a positive correlation between practice time and skill among piano players; that is, those who practice more tend to be more skilled.
(Which reason is most likely?)
Coincidence
A common underlying cause
A direct cause (i.e. changing X causes a change in Y)
Correlation and Causality (Concepts & Applications)
The amount of ice cream bought at the beach and the number of swimmers requiring help from the lifeguards were found to be positively correlated.
(Which reason is most likely?)
Coincidence
A common underlying cause
A direct cause (i.e. changing X causes a change in Y)
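One way to explore the "common underlying cause" explanation is a small simulation. The sketch below is hypothetical: it assumes a lurking variable (daily temperature) drives both ice cream sales and lifeguard rescues, with neither affecting the other. All coefficients, noise levels, and variable names are invented for illustration and are not estimates from real data.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(ys, xs):
    """Residuals from a least-squares line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable

# Assumed lurking variable: temperature on 2000 simulated beach days.
temperature = [random.uniform(15, 35) for _ in range(2000)]

# Both outcomes depend on temperature only (made-up coefficients and noise).
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
rescues = [0.5 * t + random.gauss(0, 2) for t in temperature]

# Raw correlation between the two outcomes.
r_raw = pearson_r(ice_cream, rescues)

# Correlation after removing the linear effect of temperature from each.
r_adjusted = pearson_r(residuals(ice_cream, temperature),
                       residuals(rescues, temperature))

print(f"raw correlation:            {r_raw:.2f}")
print(f"after adjusting for temp.:  {r_adjusted:.2f}")
```

Under these assumptions the raw correlation is clearly positive even though neither variable causes the other, and it falls to near zero once temperature is accounted for.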
Example: SAT score vs. public $$$ spent on education
In setting public policy, we may often hear something like…
“State spending on education is positively correlated with SAT scores and therefore we should increase our state’s spending on education.”
Example: SAT score vs. public $$$ spent on education
Data was taken from the 1997 Digest of Education Statistics, an annual publication of the U.S. Department of Education.
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; r = −0.3805]
Are you surprised by the relationship in this plot? What could be going on here?
Example: SAT score vs. public $$$ spent on education
First, just looking at this scatterplot, would we interpret it as “Increasing expenditures causes a decrease in SAT scores”?
NO.
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; r = −0.3805]
Simpson’s Paradox
A statistical relationship between two variables can be reversed by including additional factors in the analysis.
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; labeled states: Alaska, Iowa, Minnesota, New Jersey, North Dakota, Utah, Wisconsin, Connecticut, Idaho, New York, South Carolina]
Example: SAT score vs. public $$$ spent on education
Let’s ask… do ALL students in these states take the SAT? It turns out that the answer is ‘no’ and this REALLY matters…
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; r = −0.3805; legend: % of students taking SAT (<9%, 9-10%, 11%-20%, 21%-40%, 41%-60%, 61%-69%, 70%-81%)]
Example: SAT score vs. public $$$ spent on education
Does the percent of students taking the SAT in a state help explain the paradox?
YES!
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; legend: % of students taking SAT (<9%, 9-10%, 11%-20%, 21%-40%, 41%-60%, 61%-69%, 70%-81%); annotations: Low percentage, high percentage]
Example: SAT score vs. public $$$ spent on education
Within states with the same % taking the SAT, we actually see a positive relationship!!
[Figure annotations: Low percentage, high percentage]
Example: SAT score vs. public $$$ spent on education
When only a few students take the SAT in a state, who are these students? (best students)
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; legend: % of students taking SAT (<9%, 9-10%, 11%-20%, 21%-40%, 41%-60%, 61%-69%, 70%-81%)]
If you have ALL students in a state taking the SAT, then you won’t be grabbing just the ‘good students’… and the average SAT will be lower compared to states with a small percentage of their ‘best’ students taking the SAT (is this a fair state-to-state comparison?)
Example: SAT score vs. public $$$ spent on education
Within similar states (i.e. similar % taking the SAT) we see that spending more $$$ is associated with higher SAT scores.
[Figure: scatterplot “SAT score vs. money spent on students by state”; x-axis: Expenditures per Pupil in $1000s; y-axis: Total SAT score; legend: % of students taking SAT (<9%, 9-10%, 11%-20%, 21%-40%, 41%-60%, 61%-69%, 70%-81%)]
Simpson’s Paradox
A statistical relationship between two variables can be reversed by including additional factors in the analysis.
By including the variable called “% of students taking the SAT”, we saw a reversal of the relationship shown in the original scatterplot between SAT score and $$$ spent per student.
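The reversal can be reproduced on a toy data set. The sketch below uses nine hypothetical "states" (made-up numbers, not the 1997 Digest of Education Statistics data), grouped by an assumed participation level. Within each group, higher spending goes with higher scores, yet the pooled correlation is negative because high-participation groups both spend more and score lower on average.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical states (invented values):
# (participation group, spending per pupil in $1000s, average SAT score)
states = [
    ("low", 4, 1050), ("low", 5, 1075), ("low", 6, 1100),
    ("mid", 6, 950), ("mid", 7, 975), ("mid", 8, 1000),
    ("high", 8, 850), ("high", 9, 875), ("high", 10, 900),
]

# Pooled correlation, ignoring the participation group.
r_overall = pearson_r([spend for _, spend, _ in states],
                      [score for _, _, score in states])

# Correlation computed separately within each participation group.
r_within = {}
for group in ("low", "mid", "high"):
    xs = [spend for g, spend, _ in states if g == group]
    ys = [score for g, _, score in states if g == group]
    r_within[group] = pearson_r(xs, ys)

print(f"overall: r = {r_overall:.2f}")
for group, r in r_within.items():
    print(f"{group:>4} participation: r = {r:.2f}")
```

The pooled r is negative while every within-group r is positive, which is the pattern the slides describe for spending and SAT scores once % of students taking the SAT is included.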
Interpret correlation with caution
Remember that correlation is a simple summary of a sometimes complex situation.
Scatterplots are useful, but they do have limitations when many variables impact each other in a complex manner.
The SAT score and expenditure information was a modification of material available at:
www.stat.ucla.edu/labs/pdflabs/sat.pdf
Food for thought…
Should we reward school teachers based on students’ standardized test scores?
[Figure: scatterplot “Test Scores vs. teaching ability and effort”; x-axis: Teacher's genuine ability and effort (0 to 100); y-axis: Student's national test scores (0 to 1200)]
What might this scatterplot look like? If students score high, was the teacher good? If students score low, was the teacher bad?