Apr 20, 2023 1
Chapter 14: Correlation and Regression
2
In Chapter 14:
14.1 Data
14.2 Scatterplots
14.3 Correlation
14.4 Regression
3
§14.1 Data
• Quantitative response variable Y (“dependent variable”)
• Quantitative explanatory variable X (“independent variable”)
• Historically important public health data set used to illustrate techniques (Doll, 1955)
– n = 11 countries
– Explanatory variable = per capita cigarette consumption in 1930 (CIG1930)
– Response variable = lung cancer mortality per 100,000 (LUNGCA)
4
Data, cont.
5
§14.2 Scatterplot
Bivariate (x_i, y_i) points plotted as a scatterplot.
6
Inspect the scatterplot’s
• Form: Can the relation be described with a straight line or some other type of line?
• Direction: Do points trend upward or downward?
• Strength of association: Do points adhere closely to an imaginary trend line?
• Outliers (if any): Are there any striking deviations from the overall pattern?
7
Judging Correlational Strength
• Correlational strength refers to the degree to which points adhere to a trend line.
• The eye is not a good judge of strength.
• The top plot appears to show a weaker correlation than the bottom plot. However, both plots show the same data set. (The perceived difference is an artifact of axis scaling.)
8
§14.3 Correlation
• Correlation coefficient r quantifies the linear relationship with a number between −1 and 1.
• When all points fall on a line with an upward slope, r = 1. When all points fall on a line with a downward slope, r = −1.
• When data points trend upward, r is positive; when data points trend downward, r is negative.
• The closer r is to 1 or −1, the stronger the correlation.
9
Examples of correlations
10
Calculating r
• Formula: $r = \dfrac{1}{n-1}\sum z_X z_Y$
• The correlation coefficient tracks the degree to which X and Y “go together.”
• Recall that z scores quantify how far a value lies above or below its mean in standard deviation units.
• When z scores for X and Y track in the same direction, their products are positive and r is positive (and vice versa).
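The z-score description above can be sketched in a few lines. This is a minimal illustration, assuming the usual sample form r = Σ(z_X · z_Y)/(n − 1) with sample standard deviations. The data values and the function name `pearson_r` are hypothetical; they are not the Doll (1955) data.

```python
import math

def pearson_r(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Product of the two z scores for each (x, y) pair
    z_products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```

Pairs in which both values sit on the same side of their means contribute positive products, which is why an upward trend yields a positive r.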
11
Calculating r, Example
12
Calculating r
In practice, we rely on computers and calculators to calculate r. I encourage my students to use these tools whenever possible.
13
Calculating r
SPSS output for Analyze > Correlate > Bivariate using the illustrative data:

Correlations
                                                              Cig Consumption     Lung Cancer Mortality
                                                              per capita, 1930    per 100000, 1950
Cig Consumption per capita, 1930        Pearson Correlation   1                   .737**
                                        Sig. (2-tailed)                           .010
                                        N                     11                  11
Lung Cancer Mortality per 100000, 1950  Pearson Correlation   .737**              1
                                        Sig. (2-tailed)       .010
                                        N                     11                  11
**. Correlation is significant at the 0.01 level (2-tailed).
14
Interpretation of r
1. Direction. The sign of r indicates the direction of the association: positive (r > 0), negative (r < 0), or no association (r ≈ 0).
2. Strength. The closer r is to 1 or −1, the stronger the association.
3. Coefficient of determination. The square of the correlation coefficient (r²) is called the coefficient of determination. This statistic quantifies the proportion of the variance in Y [mathematically] “explained” by X. For the illustrative data, r = 0.737 and r² = 0.54. Therefore, 54% of the variance in Y is explained by X.
15
Notes, cont.
4. Reversible relationship. With correlation, it does not matter whether variable X or Y is specified as the explanatory variable; calculations come out the same either way. [This will not be true for regression.]
5. Outliers. Outliers can have a profound effect on r. This figure has an r of 0.82 that is fully accounted for by the single outlier.
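Notes 4 and 5 can be checked with a short sketch. The points are made up (they are not the data behind the slide's figure), and `pearson_r` is a hypothetical helper: four points with no association, then the same points plus one extreme value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical square of four points: no linear association
x = [1, 1, 2, 2]
y = [1, 2, 1, 2]
r_without = pearson_r(x, y)

# Same points plus a single extreme observation at (10, 10)
r_with = pearson_r(x + [10], y + [10])

# Note 4: swapping the roles of X and Y leaves r unchanged
r_swapped = pearson_r(y + [10], x + [10])

print(r_without, round(r_with, 3), round(r_swapped, 3))
```

In this toy example the entire correlation comes from the one added point, which is the situation note 5 warns about.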
16
Notes, cont.
6. Linear relations only. Correlation applies only to linear relationships. This figure shows a strong non-linear relationship, yet r = 0.00.
7. Correlation does not necessarily mean causation. Beware lurking variables (next slide).
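Note 6 can be illustrated with a minimal sketch. It assumes a made-up, perfectly quadratic relationship (y = x² on x values symmetric around zero), not the data in the slide's figure; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Y is completely determined by X, but the relationship is a parabola
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
r = pearson_r(x, y)
print(r)
```

Here r is zero even though Y is an exact function of X, so a near-zero r rules out only a linear trend, not a relationship.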
17
Confounded Correlation
A near perfect negative correlation (r = −.987) was seen between cholera mortality and elevation above sea level during a 19th century epidemic.
We now know that cholera is transmitted by water. The observed relationship between cholera and elevation was confounded by the lurking variable proximity to polluted water.
18
Hypothesis Test
We conduct the hypothesis test to guard against identifying too many random correlations.
Random selection from a random scatter can result in an apparent correlation.
19
Hypothesis Test
A. Hypotheses. Let ρ represent the population correlation coefficient. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided) [or Ha: ρ > 0 (right-sided) or Ha: ρ < 0 (left-sided)]
B. Test statistic. $t_{\text{stat}} = \dfrac{r}{SE_r}$, where $SE_r = \sqrt{\dfrac{1 - r^2}{n - 2}}$ and $df = n - 2$
C. P-value. Convert $t_{\text{stat}}$ to a P-value with software or Table C.
20
Hypothesis Test – Illustrative Example
A. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided)
B. Test statistic. $SE_r = \sqrt{\dfrac{1 - 0.737^2}{11 - 2}} = 0.2253$, so $t_{\text{stat}} = \dfrac{0.737}{0.2253} = 3.27$ with $df = 11 - 2 = 9$
C. .005 < P < .01 by Table C. P = .0097 by computer. The evidence against H0 is highly significant.
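The illustrative test can be reproduced with the standard library alone. This is a sketch under stated assumptions: r = 0.737 and n = 11 come from the slides, the function names are hypothetical, and the P-value is approximated by numerically integrating the t density with Simpson's rule rather than by calling a statistics package.

```python
import math

def correlation_t_test(r, n):
    """Return (SE_r, t_stat, df) for testing H0: rho = 0."""
    df = n - 2
    se_r = math.sqrt((1 - r ** 2) / df)
    return se_r, r / se_r, df

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P-value: 2 * (0.5 - area under the density from 0 to |t|)."""
    t = abs(t_stat)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)

se_r, t_stat, df = correlation_t_test(0.737, 11)
p_value = two_sided_p(t_stat, df)
print(round(se_r, 4), round(t_stat, 2), df, round(p_value, 4))
```

The printed values should agree with the hand calculation (SE_r, t_stat, df) and fall inside the Table C bracket for P.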
21
Confidence Interval for ρ
22
Confidence Interval for ρ
23
Conditions for Inference
• Independent observations
• Bivariate Normality (r can still be used descriptively when data are not bivariate Normal)
24
§14.4 Regression
• Regression describes the relationship in the data with a line that predicts the average change in Y per unit X.
• The best-fitting line is found by minimizing the sum of squared residuals, as shown in this figure.
25
Regression Line, cont.
• The regression line equation is: $\hat{y} = a + bx$
where ŷ ≡ predicted value of Y, a ≡ the intercept of the line, and b ≡ the slope of the line
• Equations to calculate a and b
SLOPE: $b = r \dfrac{s_Y}{s_X}$
INTERCEPT: $a = \bar{y} - b\bar{x}$
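A minimal sketch of the slope and intercept calculation, assuming the standard least-squares forms b = r·s_Y/s_X and a = ȳ − b·x̄. The data and the names `least_squares_line` and `sum_sq_residuals` are hypothetical; the illustrative cigarette data are not reproduced here.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    s_x = math.sqrt(sxx / (n - 1))  # sample SD of X
    s_y = math.sqrt(syy / (n - 1))  # sample SD of Y
    b = r * s_y / s_x               # slope
    a = mean_y - b * mean_x         # intercept
    return a, b

def sum_sq_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares_line(x, y)
print(round(a, 3), round(b, 3))

# Nudging the slope away from b increases the sum of squared residuals
print(sum_sq_residuals(x, y, a, b) < sum_sq_residuals(x, y, a, b + 0.1))
```

The last line checks the defining property: any other line through these points has a larger sum of squared residuals.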
26
Regression Line, cont.
Slope b is the key statistic produced by the regression.
27
Regression Line, illustrative example
Here’s the output from SPSS:
28
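Inference
• Let α represent the population intercept, β represent the population slope, and ε_i represent the residual “error” for point i. The population regression model is $y_i = \alpha + \beta x_i + \varepsilon_i$
• The estimated standard error of the regression is $s_{Y|x} = \sqrt{\dfrac{\sum \text{residuals}^2}{n - 2}}$
• A (1−α)100% CI for population slope β is $b \pm t_{n-2,\,1-\alpha/2} \cdot SE_b$, where $SE_b = \dfrac{s_{Y|x}}{\sqrt{n-1}\; s_X}$ (the α in the confidence level is the significance level, not the intercept)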
29
Confidence Interval for β – Example

                    95% Confidence Interval for B
Model               Lower Bound    Upper Bound
1   (Constant)      -4.342         17.854
    cig1930         .007           .039
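The slope interval in this output can be approximated by hand. The sketch below assumes b = .023 and SE_b = .007 as reported (rounded) in the SPSS coefficients output for the illustrative data, and takes the critical value t = 2.262 for 9 df at 95% confidence from a standard t table. SPSS works from unrounded values, so small discrepancies are possible.

```python
# Rounded values from the SPSS coefficients output for cig1930
b = 0.023       # estimated slope
se_b = 0.007    # standard error of the slope
t_crit = 2.262  # t table value for df = 11 - 2 = 9, 95% two-sided confidence

margin = t_crit * se_b
lower, upper = b - margin, b + margin
print(f"{lower:.3f} to {upper:.3f}")
```

Because the interval excludes 0, it leads to the same conclusion as a two-sided test of H0: β = 0 at the 5% level.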
30
t Test of Slope Coefficient
A. Hypotheses. H0: β = 0 against Ha: β ≠ 0
B. Test statistic. $t_{\text{stat}} = \dfrac{b}{SE_b}$, where $SE_b = \dfrac{s_{Y|x}}{\sqrt{n-1}\; s_X}$ and $df = n - 2$
C. P-value. Convert the $t_{\text{stat}}$ to a P-value.
31
t Test: Illustrative Example

                    Unstandardized Coefficients    Standardized Coefficients
Model               B         Std. Error           Beta                         t        Sig.
1   (Constant)      6.756     4.906                                             1.377    .202
    cig1930         .023      .007                 .737                         3.275    .010
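Each t statistic in this output is the coefficient divided by its standard error. The sketch below recomputes them from the rounded B and Std. Error values; the dictionary layout is only an illustration. The slope's t comes out near 3.29 rather than SPSS's 3.275 because .023 and .007 are rounded.

```python
# (B, Std. Error) pairs as printed in the SPSS coefficients output
rows = {
    "(Constant)": (6.756, 4.906),
    "cig1930": (0.023, 0.007),
}

# t statistic for each coefficient: estimate divided by its standard error
t_stats = {name: b / se for name, (b, se) in rows.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```

SPSS's slope t of 3.275 also matches the t statistic from the correlation test (3.27); in simple regression, testing β = 0 and testing ρ = 0 give the same t.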
32
Analysis of Variance of the Regression Model
An ANOVA technique equivalent to the t test can also be used to test H0: β = 0. This technique is covered on pp. 321–324 in the text but is not included in this presentation.
33
Conditions for Inference
Inference about the regression line requires these conditions
• Linearity
• Independent observations
• Normality at each level of X
• Equal variance at each level of X
34
Conditions for Inference
This figure illustrates Normal and equal variation around the regression line at all levels of X.
35
Assessing Conditions
• The scatterplot should be visually inspected for linearity, Normality, and equal variance.
• Plotting the residuals from the model can be helpful in this regard.
• The table lists residuals for the illustrative data.
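Residuals are the observed Y values minus the values predicted by the fitted line. A minimal sketch follows, using made-up data rather than the illustrative data set; `fit_line` and `residuals` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """Observed minus predicted value for each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
res = residuals(x, y, a, b)

# Plot these against x (or put them in a stemplot) to check the conditions
for xi, ei in zip(x, res):
    print(xi, round(ei, 2))
```

Least-squares residuals always sum to zero (up to rounding), so what matters for checking conditions is their pattern and spread across X, not their average.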
36
Assessing Conditions, cont.
• A stemplot of the residuals shows no major departures from Normality.
• This residual plot shows more variability at higher X values (but the data are very sparse).

Stemplot of residuals:
-1|6
-0|2336
 0|01366
 1|4
×10
37
Residual Plots
With a little experience, you can get good at reading residual plots. Here’s an example of linearity with equal variance.
38
Residual Plots
Example of linearity with unequal variance
39
Example of Residual Plots
Example of non-linearity with equal variance