Upload
tyler-poole
View
218
Download
0
Tags:
Embed Size (px)
Citation preview
June 30, 2008 Stat 111 - Lecture 16 - Regression
1
Inference for relationships between variables
Statistics 111 - Lecture 16
June 30, 2008 Stat 111 - Lecture 16 - Regression
2
Administrative Notes
• Homework 5 due on Wednesday, July 1st
• Final is this Thursday – Will start at exactly 10:40 – Will end at exactly 12:10– Get here a bit early or you will be wasting time on
your exam
• Final will focus on material from after midterm• There will be at least two questions that require
knowing material from before midterm
• Calculators are definitely needed!• Single 8.5 x 11 cheat sheet (two-sided) allowed
June 30, 2008 Stat 111 - Lecture 16 - Regression
3
Final Exam
June 30, 2008 Stat 111 - Lecture 16 - Regression
4
Inference Thus Far
• Tests and intervals for a single variable
• Tests and intervals to compare a single variable between two samples
• For the last couple of classes, we have looked at count data and inference for population proportions
• Before that, we looked at continuous data and inference for population means
• Next couple of classes: inference for a relationship between two continuous variables
June 30, 2008 Stat 111 - Lecture 16 - Regression
5
Two Continuous Variables
• Remember linear relationships between two continuous variables?
• Scatterplots
• Correlation
• Best Fit Lines
June 30, 2008 Stat 111 - Lecture 16 - Regression
6
Scatterplots and Correlation
• Visually summarize the relationship between two continuous variables with a scatterplot
• If our X and Y variables show a linear relationship, we can calculate a best fit line between Y and X
Education and Mortality: r = -0.51 Draft Order and Birthdayr = -0.22
June 30, 2008 Stat 111 - Lecture 16 - Regression
7
Linear Regression
• Best fit line is called Simple Linear Regression Model:
• Coefficients:is the intercept and is the slope
• Other common notation: 0 for intercept, 1 for slope
• Our Y variable is a linear function of the X variable but we allow for error (εi) in each prediction
• We approximate the error by using the residualObserved Yi
Predicted Yi = + Xi
June 30, 2008 Stat 111 - Lecture 16 - Regression
8
Residuals and Best Fit Line
0 and 1 that give the best fit line are the values that give smallest sum of squared residuals:
• Best fit lineis also called the least-squares line
ei = residuali
June 30, 2008 Stat 111 - Lecture 16 - Regression
9
Best values for Regression Parameters
• The best fit line has these values for the regression coefficients:
• Also can estimate the average squared residual:
Best estimate of slope
Best estimate of intercept
June 30, 2008 Stat 111 - Lecture 16 - Regression
10
Example: Education and Mortality
Mortality = 1353.16 - 37.62 · Education
• Negative association means negative slope b
June 30, 2008 Stat 111 - Lecture 16 - Regression
11
Interpreting the regression line
• The estimated slope b is the average change you get in the Y variable if you increased the X variable by one• Eg. increasing education by one year means an
average decrease of ≈ 38 deaths per 100,000
• The estimated intercept a is the average value of the Y variable when the X variable is equal to zero• Eg. an uneducated city (Education = 0) would have
an average mortality of 1353 deaths per 100,000
June 30, 2008 Stat 111 - Lecture 16 - Regression
12
Example: Vietnam Draft Order
Draft Order = 224.9 - 0.226 · Birthday
• Slightly negative slope means later birthdays have a lower draft order
June 30, 2008 Stat 111 - Lecture 16 - Regression
13
Significance of Regression Line
• Does the regression line show a significant linear relationship between the two variables?• If there is not a linear relationship, then we would
expect zero correlation (r = 0)• So the estimated slope b should also be close to
zero
• Therefore, our test for a significant relationship will focus on testing whether our true slope is significantly different from zero:
H0 : = 0 versus Ha : 0
• Our test statistic is based on the estimated slope b
June 30, 2008 Stat 111 - Lecture 16 - Regression
14
Test Statistic for Slope
• Our test statistic for the slope is similar in form to all the test statistics we have seen so far:
• The standard error of the slope SE(b) has a complicated formula that requires some matrix algebra to calculate• We will not be doing this calculation manually
because the JMP software does this calculation for us!
June 30, 2008 Stat 111 - Lecture 16 - Regression
15
Example: Education and Mortality
June 30, 2008 Stat 111 - Lecture 16 - Regression
16
p-value for Slope Test
• Is T = -4.53 significantly different from zero?• To calculate a p-value for our test statistic T, we use
the t distribution with n-2 degrees of freedom• For testing means, we used a t distribution as well,
but we had n-1 degrees of freedom before• For testing slopes, we use n-2 degrees of freedom
because we are estimating two parameters (intercept and slope) instead of one (a mean)
• For cities dataset, n = 60, so we have d.f. = 58• Looking at a t-table with 58 df, we discover that the
P(T < -4.53) < 0.0005
June 30, 2008 Stat 111 - Lecture 16 - Regression
17
Conclusion for Cities Example
• Two-sided alternative: p-value < 2 x 0.0005 = 0.001
• We could get the p-value directly from the JMP output, which is actually more accurate than t-table
• Since our p-value is far less than the usual -level of 0.05, we reject our null hypothesis
• We conclude that there is a statistically significant linear relationship between education and mortality
June 30, 2008 Stat 111 - Lecture 16 - Regression
18
Next Class - Lecture 17
• More Inference for Regression
• Moore, McCabe and Craig: 10.1, 2.4-2.5