CIVL 7012/8012
Simple Linear Regression
Lecture 2
Correlation
• Correlation is the degree to which two continuous variables are linearly associated.
• It is most often represented by a scatterplot and the Pearson correlation coefficient, denoted by $r$.
• The scatterplot shows visually how the two continuous variables are related.
• The coefficient measures the strength and direction of the linear association between the two variables.
Correlation
• If there is no correlation between the two variables, the points form a horizontal line, a vertical line, or a random cloud with no obvious pattern.
• It does not matter which variable is on the x-axis and which is on the y-axis.
• The pattern the points form indicates the strength and direction of the correlation.
Correlation
• The stronger the correlation, the more closely the points follow a straight line.
• The coefficient lies between -1 and +1:
  +1 indicates a perfect positive correlation
  -1 indicates a perfect negative correlation
  0 indicates no linear correlation
• There are no strict rules for interpretation, but as a guideline:
  $0 < |r| < 0.3$: weak correlation
  $0.3 < |r| < 0.7$: moderate correlation
  $|r| > 0.7$: strong correlation
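The guideline above can be written as a small lookup. This is only an illustrative Python sketch: the function name `correlation_strength` is made up for this example, and because the slide leaves the boundary values 0.3 and 0.7 unassigned, the sketch assumes they fall into the higher band.

```python
def correlation_strength(r):
    """Label |r| using the rule-of-thumb bands from the slide.

    Assumption: the slide does not assign |r| = 0.3 or |r| = 0.7 to a band;
    here they are placed in the higher band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size == 0:
        return "none"
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


print(correlation_strength(0.892))   # bill vs. tip example
print(correlation_strength(0.698))   # Ozone vs. Temp example
print(correlation_strength(-0.25))   # the sign only gives the direction
```

The label depends only on the absolute value; the sign of $r$ gives the direction of the association.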
Correlation
Snapshot from Multivariate Lecture 6
$\rho$ is the correlation notation for the entire population.
The Pearson correlation coefficient ($r$) is computed from our sample, which represents the population.
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Correlation calculation
Meal  Bill ($)  Tip ($)  Bill deviation   Tip deviation    Deviation product                 Bill deviation squared  Tip deviation squared
      $x_i$     $y_i$    $x_i - \bar{x}$  $y_i - \bar{y}$  $(x_i - \bar{x})(y_i - \bar{y})$  $(x_i - \bar{x})^2$     $(y_i - \bar{y})^2$
1     35        6        -37.5            -4               150                               1406.25                 16
2     110       18       37.5             8                300                               1406.25                 64
3     66        11       -6.5             1                -6.5                              42.25                   1
4     75        7        2.5              -3               -7.5                              6.25                    9
5     100       14       27.5             4                110                               756.25                  16
6     49        4        -23.5            -6               141                               552.25                  36
Sum                                                        687                               4169.5                  142
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{687}{\sqrt{(4169.5)(142)}} = 0.892$$
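As a cross-check of the hand calculation, the short Python sketch below recomputes the column sums and the coefficient from the six meals. Python is used here only for illustration, and the variable names are made up for this example.

```python
import math

# Bill and tip amounts ($) for the six meals in the table.
bills = [35, 110, 66, 75, 100, 49]
tips = [6, 18, 11, 7, 14, 4]

x_bar = sum(bills) / len(bills)   # mean bill
y_bar = sum(tips) / len(tips)     # mean tip

# Sum of deviation products and the two sums of squared deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
s_xx = sum((x - x_bar) ** 2 for x in bills)
s_yy = sum((y - y_bar) ** 2 for y in tips)

r = s_xy / math.sqrt(s_xx * s_yy)

print(x_bar, y_bar)       # 72.5 10.0
print(s_xy, s_xx, s_yy)   # 687.0 4169.5 142.0
print(round(r, 4))        # 0.8928 (the slide reports 0.892)
```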
Correlation significance test (t-test)
• Is the correlation statistically significant? Conduct a t-test.
• $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$ at $\alpha = 0.05$
• $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with $df = n - 2$
• With $r = 0.892$ and $n = 6$: $t = \dfrac{0.892\sqrt{6-2}}{\sqrt{1-0.892^2}} = 3.947$
• $t_{calc} > t_{crit} \rightarrow$ reject the null hypothesis
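The test statistic can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the critical value 2.776 is the two-tailed t-table value for $\alpha = 0.05$ and 4 degrees of freedom that the lecture uses for the slope test, and it is typed in by hand rather than computed.

```python
import math

r = 0.892   # sample correlation as reported on the slide
n = 6       # number of meals

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumption: critical value taken from a t-table (alpha = 0.05, two-tailed, df = 4).
t_crit = 2.776

reject_h0 = abs(t_calc) > t_crit
print(round(t_calc, 3), reject_h0)   # 3.947 True
```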
SLR Lecture 1 Recap
Recap - Quick Review
• SLR is a comparison of two models:
  • one in which the independent variable does not exist, and
  • one that uses the best-fit regression line.
• If there is only one variable (the dependent variable), the best prediction for any new value is the mean of the dependent variable.
• The vertical distance between the best-fit line and an observed value is called a residual (or error).
• The residuals are squared and added together to give the sum of squared residuals/errors (SSE).
• SLR finds the line through the data that minimizes the SSE.
Recap - Example
[Figure: "Tips for service ($)". x-axis: Meal #; y-axis: Tip ($). Best-fit line with tips only: $\bar{y} = 10$.]
Meal # Tip ($)
1 6
2 18
3 11
4 7
5 14
6 4
Recap - Residuals (Errors)
[Figure: "Tips for service ($)" with each meal's residual from the line $\bar{y} = 10$ (-4, +8, +1, -3, +4, -6) and the corresponding squared residuals (16, 64, 1, 9, 16, 36).]
Squared Residuals (Errors)
Meal #  Residual  Residual squared
1       -4        16
2       +8        64
3       +1        1
4       -3        9
5       +4        16
6       -6        36
Sum of squared errors (SSE) = 142
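The mean-only model is easy to check numerically. The Python sketch below (illustrative variable names) predicts every tip with the mean and sums the squared residuals.

```python
tips = [6, 18, 11, 7, 14, 4]

# With no independent variable, the best prediction for every meal is the mean tip.
y_bar = sum(tips) / len(tips)

residuals = [y - y_bar for y in tips]
sse_mean_only = sum(e ** 2 for e in residuals)

print(residuals)       # [-4.0, 8.0, 1.0, -3.0, 4.0, -6.0]
print(sse_mean_only)   # 142.0
```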
Recap – Population vs. Sample Eq.
• If we knew the "population" parameters $\beta_0$ and $\beta_1$, we could use the SLR equation as is:
  $E(y) = \beta_0 + \beta_1 x$
• In reality, we almost never have the population parameters, so we have to estimate them from sample data. With sample data, the SLR equation changes slightly:
  $\hat{y} = b_0 + b_1 x$
• Here $\hat{y}$ ("y-hat") is the point estimator of $E(y)$.
• In other words, $\hat{y}$ is the estimated mean value of $y$ for a given $x$.
Recap – OLS criterion
$y_i$ = observed value of the dependent variable (tip amount).
$\hat{y}_i$ = estimated (predicted) value of the dependent variable (predicted tip amount based on the regression model).
$$\min \sum (y_i - \hat{y}_i)^2$$
[Figure: scatter plot of observed and predicted values.]
Recap - SLR parameter equations
$\hat{y}_i = b_0 + b_1 x_i$
Slope: $b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
$\bar{x}$ = mean of the independent variable (bill amount, in dollars)
$\bar{y}$ = mean of the dependent variable (tip amount, in dollars)
$x_i$ = value of the independent variable
$y_i$ = value of the dependent variable
Recap - OLS Calculations
Meal  Bill ($)  Tip ($)  Bill deviation   Tip deviation    Deviation product                 Bill deviation squared
      $x_i$     $y_i$    $x_i - \bar{x}$  $y_i - \bar{y}$  $(x_i - \bar{x})(y_i - \bar{y})$  $(x_i - \bar{x})^2$
1     35        6        -37.5            -4               150                               1406.25
2     110       18       37.5             8                300                               1406.25
3     66        11       -6.5             1                -6.5                              42.25
4     75        7        2.5              -3               -7.5                              6.25
5     100       14       27.5             4                110                               756.25
6     49        4        -23.5            -6               141                               552.25
Sum                                                        687                               4169.5
$\bar{x} = 72.5$, $\bar{y} = 10$
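Recap - OLS Calculations
Sum of deviation products: $\sum (x_i - \bar{x})(y_i - \bar{y}) = 150 + 300 - 6.5 - 7.5 + 110 + 141 = 687$
Sum of squared bill deviations: $\sum (x_i - \bar{x})^2 = 1406.25 + 1406.25 + 42.25 + 6.25 + 756.25 + 552.25 = 4169.5$
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{687}{4169.5} = 0.1648$$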
Recap - OLS Calculations
With $\bar{x} = 72.5$, $\bar{y} = 10$ and $b_1 = 0.1648$:
$b_0 = \bar{y} - b_1 \bar{x}$
$b_0 = 10 - 0.1648(72.5)$
$b_0 = 10 - 11.9457$ (using the unrounded slope, 0.164768)
$b_0 = -1.9457$
Recap – New Best-Fit Line & Parameters
$\hat{y}_i = b_0 + b_1 x_i$
Intercept: $b_0 = -1.9457$
Slope: $b_1 = 0.1648$
$\hat{y}_i = -1.9457 + 0.1648x_i$, or equivalently $\hat{y}_i = 0.1648x_i - 1.9457$
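The slope and intercept formulas translate directly into code. The following Python sketch (variable names are illustrative) applies them to the bill and tip data and reproduces the two parameters.

```python
bills = [35, 110, 66, 75, 100, 49]   # x: bill ($)
tips = [6, 18, 11, 7, 14, 4]         # y: tip ($)

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)

# Slope: sum of deviation products over the sum of squared x deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
s_xx = sum((x - x_bar) ** 2 for x in bills)
b1 = s_xy / s_xx

# Intercept: the fitted line passes through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))   # 0.1648 -1.9457
```

Carrying the unrounded slope into the intercept calculation is what gives -1.9457; using the rounded 0.1648 would give a slightly different fourth decimal.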
Recap - Final SLR line
[Figure: "Bill vs. Tip Amount ($)". x-axis: Bill ($); y-axis: Tip ($). Fitted line $\hat{y}_i = -1.9457 + 0.1648x_i$, with intercept $b_0 = -1.9457$ and slope $b_1 = 0.1648$.]
Recap - SLR Model Interpretation
$\hat{y}_i = -1.9457 + 0.1648x_i$
For every $1 increase in the bill amount (x), we would expect the tip amount to increase by $0.1648, or about 16 cents (positive coefficient).
If the bill amount (x) is zero, the predicted tip amount is -$1.9457, or about negative $1.95!
Does this make any sense? No. In real-world problems, the intercept may or may not have a meaningful interpretation.
SLR – Lecture 2
Model fit and Coefficient of Determination
[Figure, two panels: "Tips ($)" (tip by meal number, with the mean line) and "Bills vs Tips ($)" (tip against bill, with the regression line).]
Tips only: $SSE = 142$ and $SSE = SST$.
With only the DV, the only sum of squares is due to error. Therefore, it is also the total, and maximum, sum of squares for this data sample: $SST = 142$.
Bills vs. tips: $SST = 142$, $SSE = ?$, and $SST - SSE = SSR$.
With both the IV and DV, SST remains the same, but the SSE is reduced significantly. The difference between the SST and the SSE is due to regression (SSR).
Estimate regression values
Meal  Bill ($)  Tip ($)  $\hat{y}_i = -1.9457 + 0.1648x_i$  $\hat{y}_i$ (predicted tip, $)
      $x_i$     $y_i$
1     35        6        -1.9457 + 0.1648(35)               3.8212
2     110       18       -1.9457 + 0.1648(110)              16.1788
3     66        11       -1.9457 + 0.1648(66)               8.9290
4     75        7        -1.9457 + 0.1648(75)               10.4119
5     100       14       -1.9457 + 0.1648(100)              14.5311
6     49        4        -1.9457 + 0.1648(49)               6.1280
$\bar{x} = 72.5$, $\bar{y} = 10$
The predicted tips are computed with the unrounded coefficients (-1.94568 and 0.164768).
OLS criterion: $\min \sum (y_i - \hat{y}_i)^2$
Regression errors (residuals)
Meal  Bill ($)  Tip ($)  $\hat{y}_i$ (predicted tip, $)  Error $(y_i - \hat{y}_i)$ (observed - predicted)
      $x_i$     $y_i$
1     35        6        3.8212                          6 - 3.8212 = 2.1788
2     110       18       16.1788                         18 - 16.1788 = 1.8212
3     66        11       8.9290                          11 - 8.9290 = 2.0710
4     75        7        10.4119                         7 - 10.4119 = -3.4119
5     100       14       14.5311                         14 - 14.5311 = -0.5311
6     49        4        6.1280                          4 - 6.1280 = -2.1280
$\bar{x} = 72.5$, $\bar{y} = 10$
Regression errors (residuals) - SSE
Meal  Bill ($)  Tip ($)  $\hat{y}_i$ (predicted tip, $)  Error $(y_i - \hat{y}_i)$  $(y_i - \hat{y}_i)^2$
      $x_i$     $y_i$
1     35        6        3.8212                          2.1788                     4.7472
2     110       18       16.1788                         1.8212                     3.3168
3     66        11       8.9290                          2.0710                     4.2890
4     75        7        10.4119                         -3.4119                    11.6412
5     100       14       14.5311                         -0.5311                    0.2821
6     49        4        6.1280                          -2.1280                    4.5282
$\bar{x} = 72.5$, $\bar{y} = 10$, $SSE = 28.8044$
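The predicted tips, residuals and SSE in these tables can be regenerated from the raw data. The Python sketch below is illustrative only (names are made up) and keeps the coefficients unrounded, which is why its values agree with the tables to four decimals.

```python
bills = [35, 110, 66, 75, 100, 49]
tips = [6, 18, 11, 7, 14, 4]

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
      / sum((x - x_bar) ** 2 for x in bills))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in bills]                 # y-hat for each meal
residuals = [y - y_hat for y, y_hat in zip(tips, predicted)]
sse = sum(e ** 2 for e in residuals)                     # sum of squared errors

for meal, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(meal, round(y_hat, 4), round(e, 4), round(e ** 2, 4))
print("SSE =", round(sse, 4))
```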
SSE comparison
Sum of squared error (SSE) Comparison
D.V. (tip $) ONLY:
SSE = 16 + 64 + 1 + 9 + 16 + 36 = 142
D.V. & I.V. (tip $ as a function of bill $):
SSE = 4.7472 + 3.3168 + 4.2890 + 11.6412 + 0.2821 + 4.5282 ≈ 28.8044
Comparison of two lines
• When we conducted the regression, the SSE decreased from 142 to 28.8044.
• The remaining 28.8044 is still unexplained; it is allocated to ERROR.
• What happened to the difference (113.1956)?
• 113.1956 is the sum of squares due to REGRESSION (SSR).
• $SST = SSR + SSE$
• In this case: 142 = 113.1956 + 28.8044
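The partition can be verified by computing each sum of squares from its own definition rather than by subtraction. A minimal Python sketch with illustrative names:

```python
bills = [35, 110, 66, 75, 100, 49]
tips = [6, 18, 11, 7, 14, 4]

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
      / sum((x - x_bar) ** 2 for x in bills))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in bills]

sst = sum((y - y_bar) ** 2 for y in tips)                       # total
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(tips, predicted))  # error
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)          # regression

print(round(sst, 4), round(ssr, 4), round(sse, 4))
print(abs(sst - (ssr + sse)) < 1e-9)   # the identity SST = SSR + SSE holds
```

The identity holds exactly for a least-squares line with an intercept; it is not guaranteed for an arbitrary line drawn through the data.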
Comparison of two lines
[Figure, two panels: "Tips ($)" (tip by meal number, with the mean line) and "Bills vs Tips ($)" (tip against bill, with the regression line).]
Tips only: $SSE = 142$, $SSE = SST$, $SST = 142$.
Bills vs. tips: $SST = 142$, $SSE = 28.8044$, $SST - SSE = SSR = 113.1956$.
Coefficient of Determination ($r^2$)
• How well does the estimated regression equation fit our data?
• This is where regression starts to look a lot like ANOVA, where the SST is partitioned into SSE and SSR.
• For a fixed SST, the larger the SSR, the smaller the SSE.
• The coefficient of determination expresses the share of SST that is due to regression, usually as a percentage (%).
[Diagram: SST partitioned into SSR and SSE.]
Coefficient of determination: $r^2 = \dfrac{SSR}{SST}$
ANOVA
            df  SS        MS        F        Significance F
Regression  1   113.1956  113.1956  15.7192  0.016611541
Residual    4   28.80441  7.201103
Total       5   142
$r^2$ Interpretation
• Coefficient of determination: $r^2 = \dfrac{SSR}{SST} = \dfrac{113.1956}{142} = 0.7972$, or 79.72%
• We can conclude that 79.72% of the total sum of squares is explained by using the regression equation to predict the tip amount. The remainder (20.28%) is error.
• This is a "good fit"!
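A quick numeric check of the ratio, written as a Python sketch with illustrative names. It also shows that, in simple linear regression, $r^2$ equals the square of the correlation coefficient; the value 0.892834 is the "Multiple R" figure from the Excel output in this lecture.

```python
ssr = 113.1956   # sum of squares due to regression
sst = 142.0      # total sum of squares

r_squared = ssr / sst
unexplained = 1 - r_squared   # share of SST left as error

multiple_r = 0.892834         # "Multiple R" from the Excel summary output

print(round(r_squared, 4), round(unexplained, 4))   # 0.7972 0.2028
print(round(multiple_r ** 2, 4))                    # 0.7972
```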
3 squared differences
[Figure: "Bills vs. Tips ($)". x-axis: Bill ($); y-axis: Tip ($). Shows the regression line $\hat{y}_i = -1.9457 + 0.1648x_i$ and the mean line $\bar{y} = 10$.]
$SSE = \sum (y_i - \hat{y}_i)^2$
$SST = \sum (y_i - \bar{y})^2$
$SSR = \sum (\hat{y}_i - \bar{y})^2$
Model fit
$\hat{y}_i = -1.9457 + 0.1648x_i$
Questions:
• Once a regression line is calculated, how much better is it than using only the mean of the dependent variable? (coefficient of determination, $r^2$)
• How confident are we in the significance of the relationship between x and y? (t-test of the slope)
Regression with Excel
• Produce the SLR model in Excel.
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.892834
R Square            0.797152
Adjusted R Square   0.74644
Standard Error      2.683487
Observations        6
ANOVA
            df  SS        MS        F        Significance F
Regression  1   113.1956  113.1956  15.7192  0.016611541
Residual    4   28.80441  7.201103
Total       5   142
              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%    Lower 95.0%   Upper 95.0%
Intercept     -1.94568      3.205964        -0.60689  0.576683  -10.84685887  6.955504991  -10.84685887  6.955504991
X Variable 1  0.164768      0.041558        3.964745  0.016612  0.049383684   0.280152232  0.049383684   0.280152232
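Several entries in the Excel summary follow from the ANOVA sums of squares alone. The Python sketch below (illustrative names, standard textbook formulas) rebuilds the mean squares, the F statistic, the "Standard Error" line and the adjusted R Square from SSR, SSE and the number of observations.

```python
import math

n = 6              # observations
k = 1              # number of independent variables
ssr = 113.1956     # regression sum of squares
sse = 28.80441     # residual sum of squares
sst = ssr + sse    # total sum of squares

msr = ssr / k                  # mean square, regression (df = k)
mse = sse / (n - k - 1)        # mean square, residual (df = n - k - 1)
f_stat = msr / mse
std_error = math.sqrt(mse)     # the "Standard Error" in Regression Statistics

r_squared = ssr / sst
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(mse, 4), round(f_stat, 4), round(std_error, 4))
print(round(r_squared, 4), round(adj_r_squared, 4))
```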
Testing slope -1
• Is the relationship between $y$ and $x$ significant?
• Test the slope $\beta_1$ (two-tailed t-test).
• Remember that $b_1$ is for our sample and $\beta_1$ is for the population.
• We will use our sample slope $b_1$ to test whether the true population slope $\beta_1$ is significantly different from 0.
$\hat{y}_i = -1.9457 + 0.1648x_i$
Testing slope -2
Steps to conduct a t-test on the slope $\beta_1$:
• Step 1: Specify the hypotheses:
  • $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at $\alpha = 0.05$
• Step 2: Determine the test statistic:
  $t = \dfrac{b_1 - \beta_1}{SE_{b_1}}$
  • where $\beta_1$ is the true coefficient for the whole population (0 under $H_0$)
  • where $SE_{b_1} = \dfrac{\sqrt{SSE/(n-2)}}{\sqrt{\sum (x_i - \bar{x})^2}}$ is the standard error of the slope $b_1$
Testing slope -3
• Step 2 calculation:
  • $SE_{b_1} = \dfrac{\sqrt{SSE/(n-2)}}{\sqrt{\sum (x_i - \bar{x})^2}} = \dfrac{\sqrt{28.8044/(6-2)}}{\sqrt{4169.5}} = 0.0416$
  • $t = \dfrac{b_1 - \beta_1}{SE_{b_1}} = \dfrac{0.1648 - 0}{0.0416} = 3.9615$
• Step 3: Quantify the evidence of the test
  • Method 1: Critical value method
  • Compare the calculated $t$ to the critical $t$
  • $\pm t_{1-\alpha/2,\, n-2} = \pm t_{0.975,\, 4}$
$\hat{y}_i = -1.9457 + 0.1648x_i$
Testing slope -4
• Step 3: Quantify the evidence of the test
  • Method 1: Critical value method
  • Compare the calculated $t$ to the critical $t$ (remember $\alpha = 0.05$)
  • $\pm t_{1-\alpha/2,\, n-2} = \pm t_{0.975,\, 4} = \pm 2.776$
Testing slope -5
• Step 3: Method 1: Critical value method
  • Compare the calculated $t$ to the critical $t$ (remember $\alpha = 0.05$)
  • $t_{calculated} = 3.9615 > t_{critical} = 2.776$
  • The calculated $t$ is in the critical region, so we reject the null hypothesis $H_0: \beta_1 = 0$. We conclude that $\beta_1 \neq 0$ and that there is a statistically significant relationship between $x$ and $y$.
[Figure: t distribution with a central area of 0.95 and 0.025 in each tail.]
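The slope test can be reproduced numerically. In this Python sketch the names are illustrative and the critical value 2.776 is typed in from the t-table rather than computed. Because the sketch keeps the slope and its standard error unrounded, it lands on Excel's t Stat (3.9647) rather than the hand-calculated 3.9615, which uses the rounded 0.1648 and 0.0416.

```python
import math

n = 6
sse = 28.8044        # sum of squared errors from the regression
s_xx = 4169.5        # sum of squared bill deviations
b1 = 687 / 4169.5    # sample slope, unrounded
beta1_h0 = 0.0       # slope value under the null hypothesis

# Standard error of the slope.
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)

t_calc = (b1 - beta1_h0) / se_b1

# Assumption: two-tailed critical value for alpha = 0.05, df = 4, from a t-table.
t_crit = 2.776
reject_h0 = abs(t_calc) > t_crit

print(round(se_b1, 4), round(t_calc, 4), reject_h0)   # 0.0416 3.9647 True
```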
Testing slope -6
• Step 3: Method 2: p-value method
  • Compare the calculated $p$-value to the desired significance level (remember $\alpha = 0.05$).
  • $p_{calculated} = 2P(t > t_{calculated}) = 2P(t > 3.9615) \approx 0.017$ (Excel reports 0.0166 for the slope)
  • The $p$-value of 0.017 is less than $\alpha = 0.05$, so we reject the null hypothesis $H_0: \beta_1 = 0$. We conclude that $\beta_1 \neq 0$ and that there is a statistically significant relationship between $x$ and $y$.
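The two-tailed p-value can be approximated without a statistics package by integrating the density of Student's t distribution numerically. The Python sketch below is one possible approach, not the method used in the lecture: the helper names `t_pdf` and `two_tailed_p` are made up, and Simpson's rule stands in for a proper t-distribution routine. For $t = 3.9615$ with 4 degrees of freedom it gives a value close to the P-value Excel reports for the slope (0.016612).

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value 2 * P(T > |t_stat|), using Simpson's rule on [0, |t_stat|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The density is symmetric, so P(T > t) = 0.5 - area between 0 and t.
    return 2 * (0.5 - area_0_to_t)


p_value = two_tailed_p(3.9615, df=4)
print(round(p_value, 3))   # 0.017
print(p_value < 0.05)      # True, so H0 is rejected at alpha = 0.05
```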
SLR Example with R
• Start an R session.
• Load the dataset "airquality", which is included with base R.
• Explore and plot the data.
• Run a simple linear regression model with
  "Ozone" as the DV ($y$)
  "Temp" as the IV ($x$)
• Follow along in the R session; the model results are as follows:
SLR Example with R
• Dataset = airquality: 153 obs. of 6 variables
• Start an R session and follow the instructions in the code.
• Use simple linear regression to predict ozone levels ("Ozone") based on the temperature ("Temp").
First 10 rows:
ID  Ozone  Solar.R  Wind  Temp  Month  Day
1   41     190      7.4   67    5      1
2   36     118      8     72    5      2
3   12     149      12.6  74    5      3
4   18     313      11.5  62    5      4
5   NA     NA       14.3  56    5      5
6   28     NA       14.9  66    5      6
7   23     299      8.6   65    5      7
8   19     99       13.8  59    5      8
9   8      19       20.1  61    5      9
10  NA     194      8.6   69    5      10
Step 1: scatter plot
Ozone Temp
41 67
36 72
12 74
18 62
NA 56
28 66
23 65
19 59
8 61
NA 69
STEP 3: CORRELATION (Ozone vs Temp)
• What is the correlation coefficient (r) for Ozone vs. Temp? (see R session)
  In this case, $r = 0.698$.
• Is the relationship strong?
  MODERATE (just under the 0.7 guideline for a strong correlation). Run the model (see R session).
Model results (model m1)
• $\hat{y} = \beta_0 + \beta_1 x$
• Estimated coefficients: $\beta_0 = -146.996$ (intercept), $\beta_1 = +2.429$ (slope)
• Regression line for this model: $\hat{y} = -146.996 + 2.429x$
Results interpretation (model m1) -1
Residuals:
• Residuals are the differences between the actual observed response values (Ozone levels in our case) and the response values that the model predicted.
• The "Residuals" section of the model output summarizes them with five points (minimum, first quartile, median, third quartile, maximum) to help assess how well the model fits the data.
• A well-fitting model shows residuals that are roughly symmetric, from the minimum to the maximum, around 0.
• We do not have very good symmetry here.
• So, for some observations the model's predictions fall far from the actual observed values.
Results interpretation (model m1) -2
Model Coefficients:
• $\beta_0 = -146.996$ (y-intercept)
  This is the predicted Ozone level when Temp = 0. It has no practical interpretation here, because a negative Ozone level is not meaningful.
• $\beta_1 = +2.429$ (slope)
  For every 1 °F increase in temperature ($x$), the Ozone level is expected to increase by 2.429 units.
• Std. error = 0.2331
  This is the standard error of the slope: the estimated slope would typically vary by about 0.2331 from sample to sample.
• t-value for "Temp" = coefficient / std. error = 2.429 / 0.2331 ≈ 10.42 (R reports 10.418 from the unrounded values)
  The t-value is significant: Pr(>|t|) < 2e-16, which is significant at any conventional level of significance (for example, $\alpha = 0.001$, i.e. the 99.9% level of confidence).
Results interpretation (model m1) -3
• Residual Standard Error = 23.71 on 114 degrees of freedom
• The Residual Standard Error is an estimate of the average amount by which the response "Ozone" deviates from the true regression line.
• In our example, the actual Ozone level deviates from the true regression line by approximately 23.71 units, on average.
• Degrees of freedom are the number of data points (observations) used in the fit minus 2 (one for each estimated parameter: the intercept and the "Temp" slope).
  We started with 153 data points in the "airquality" dataset.
  We removed 37 data points with missing (NA) Ozone values.
  We are left with 116 data points.
  116 data points - 2 parameters = 114 DF.
Results interpretation (model m1) -4
• R-squared = 0.4877 ($R^2$ = coefficient of determination)
  $R^2$ varies from 0 to 1; in this case, 48.77% of the variation in ($y$) is explained by ($x$).
• Adjusted $R^2$ = 0.4832
  Adjusted $R^2$ accounts for the number of independent variables in the model. It is typically lower than $R^2$, depending on how much the additional independent variables ($x$'s) contribute to explaining ($y$).
  A sharp drop from $R^2$ to adjusted $R^2$ suggests that the model includes variables that add little explanatory power.
F-Test (the F-value measures the overall significance of the model).
• At the desired level of significance (say $\alpha = 0.05$, i.e. 95% confidence), the F-test shows whether the model as a whole explains a significant share of the variation in ($y$).
• In this model, the F-statistic = 108.5 on 1 and 114 degrees of freedom.
• The p-value of the F-statistic is Pr(>F) < 2.2e-16; that is, the F-statistic is significant at any reasonable level of significance (or you could say at the 99.99% level).
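The figures in the model summary are tied together by a few standard relationships, which the Python sketch below checks using only the rounded values quoted on these slides. It does not fit the model (that is done in the R session), and the variable names are illustrative.

```python
# Values quoted on the slides for model m1 (Ozone ~ Temp).
n_total = 153        # rows in airquality
n_missing = 37       # rows dropped because of NA values
coef_temp = 2.429    # slope estimate
se_temp = 0.2331     # standard error of the slope
r_squared = 0.4877
k = 1                # number of independent variables

n_used = n_total - n_missing     # observations used in the fit
df_resid = n_used - k - 1        # residual degrees of freedom

t_value = coef_temp / se_temp
adj_r_squared = 1 - (1 - r_squared) * (n_used - 1) / df_resid
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print(n_used, df_resid)                         # 116 114
print(round(t_value, 2))                        # 10.42
print(round(adj_r_squared, 4))                  # 0.4832
print(round(f_stat, 1), round(t_value ** 2, 1)) # 108.5 108.6
```

With a single independent variable, the F statistic equals the square of the slope's t-value; the small gap in the last line comes from rounding in the inputs.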
SLR – R code