Announcements:
• Homework 10: Due next Thursday (4/25). Assignment will be on the web by tomorrow night.
[Figure: plot of Burn Time (y-axis, 9 to 18) versus Fabric type (x-axis, 1 to 4); an oval encloses the burn times for each fabric.]
Vertical spread of data points within each oval is one type of variability.
Vertical spread of the ovals is another type of variability.
Suppose there are k treatments and n data points. ANOVA table:

Source of Variation   df    Sum of Squares   Mean Square        F         P
Treatment             k-1   SST              MST = SST/(k-1)    MST/MSE
Error                 n-k   SSE              MSE = SSE/(n-k)
Total                 n-1   total SS
ESTIMATE OF "WITHIN FABRIC TYPE" VARIABILITY (this is MSE)
ESTIMATE OF "ACROSS FABRIC TYPE" VARIABILITY (this is MST)
"SUM OF SQUARES" IS WHAT GOES INTO THE NUMERATOR OF s²: (X1 − X̄)² + … + (Xn − X̄)²
P-VALUE FOR TEST OF "all means equal" (REJECT IF LESS THAN α)
One-way ANOVA: Burn Time versus Fabric

Analysis of Variance for Burn Time
Source   DF       SS      MS      F      P
Fabric    3   109.81   36.60  27.15  0.000
Error    12    16.18    1.35
Total    15   125.99
Explaining why ANOVA is an analysis of variance:
MST = 109.81 / 3 = 36.60. Sqrt(MST) describes the standard deviation among the fabric means.
MSE = 16.18 / 12 = 1.35. Sqrt(MSE) describes the standard deviation of burn time within each fabric type. (MSE is an estimate of the variance of each burn time.)
F = MST / MSE = 27.15. It makes sense that this is large, and that the p-value = Pr(F4−1,16−4 > 27.15) ≈ 0 is small, because the variance "among treatments" is much larger than the variance within the units that get each treatment.
(Note that the F test assumes the burn times are independent and normal with the same variance.)
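The hand computation above can be sketched in code. This is a minimal illustration with made-up burn-time numbers (the actual class data are not reproduced here); it builds MST and MSE directly from the definitions and checks the result against scipy's built-in one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Hypothetical burn times for 4 fabric types, 4 observations each
# (illustrative values only, not the data from the slides).
fabrics = [
    np.array([10.1, 11.2, 10.8, 11.5]),
    np.array([12.0, 13.1, 12.5, 13.4]),
    np.array([15.2, 16.0, 15.5, 16.3]),
    np.array([17.0, 17.8, 16.9, 17.5]),
]
k = len(fabrics)                    # number of treatments
n = sum(len(g) for g in fabrics)    # total number of data points
grand_mean = np.concatenate(fabrics).mean()

# SST: "across fabric type" variability (treatment sum of squares)
sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in fabrics)
# SSE: "within fabric type" variability (error sum of squares)
sse = sum(((g - g.mean()) ** 2).sum() for g in fabrics)

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse
p_value = stats.f.sf(f_stat, k - 1, n - k)   # Pr(F_{k-1, n-k} > F)

# Same answer as scipy's built-in one-way ANOVA:
f_scipy, p_scipy = stats.f_oneway(*fabrics)
```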
For the test: H0: μ1 = μ2 = μ3 = μ4 versus HA: not all means are equal.
It turns out that ANOVA is a special case of regression. We’ll come back to that in a
class or two. First, let’s learn about regression (chapters 12 and 13).
• Simple Linear Regression example:
Ingrid is a small business owner who wants to buy a fleet of Mitsubishi Sigmas. To save money, she decides to buy second-hand cars and wants to estimate how much to pay. To do this, she asks one of her employees to collect data on how much people have paid for these cars recently. (From Matt Wand)
[Regression Plot: Price ($), 0 to 9000, versus Age (years), 6 to 15. Data: each point is a car.]
• Plot suggests a simple model:
Price of car = intercept + slope × car's age + error, or
yi = β0 + β1 xi + εi,  i = 1, …, 39.
Estimate β0 and β1.
Outline for Regression:
1. Estimating the regression parameters and ANOVA tables for regression
2. Testing and confidence intervals
3. Multiple regression models & ANOVA
4. Regression diagnostics
• Plot suggests a model:
Price of car = intercept + slope × car's age + error, or
yi = β0 + β1 xi + εi,  i = 1, …, 39.
Estimate β0 and β1 with b0 and b1. Find these with "least squares".
In other words, find b0 and b1 to minimize the sum of squared errors:
SSE = {y1 − (b0 + b1 x1)}² + … + {yn − (b0 + b1 xn)}²
See green line on next page.
Each term is the squared difference between an observed y and the regression line (b0 + b1 xi). The vertical line from a point to the fitted line has length yi − b0 − b1 xi, and its squared length contributes one term to the Sum of Squared Errors (SSE).
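The least-squares minimization has a well-known closed form: b1 = Sxy/Sxx and b0 = ȳ − b1 x̄. A minimal sketch, using simulated (age, price) data standing in for the 39 cars (the real class data set is not reproduced here), and checked against numpy's least-squares fit:

```python
import numpy as np

# Hypothetical used-car data: 39 cars, ages roughly 6-15 years.
rng = np.random.default_rng(0)
age = rng.uniform(6, 15, size=39)
price = 8200 - 385 * age + rng.normal(0, 1075, size=39)

# Closed-form least squares: minimizes SSE = sum_i {y_i - (b0 + b1 x_i)}^2
xbar, ybar = age.mean(), price.mean()
b1 = ((age - xbar) * (price - ybar)).sum() / ((age - xbar) ** 2).sum()
b0 = ybar - b1 * xbar

# numpy's degree-1 least-squares fit gives the same line
# (polyfit returns the slope first, then the intercept):
b1_np, b0_np = np.polyfit(age, price, 1)
```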
[Regression Plot: Price ($), 0 to 9000, versus Age (years), 6 to 15, with fitted line:
Price = 8198.25 − 385.108 Age
S = 1075.07   R-Sq = 43.8%   R-Sq(adj) = 42.2%]
[Regression Plot: Price ($) versus Age (years), as above.]
General model: Price = β0 + β1 Age + error
Fitted model: Price = 8198.25 − 385.108 Age
S = 1075.07   R-Sq = 43.8%   R-Sq(adj) = 42.2%
Do Minitab example
Regression parameter estimates, b0 and b1, minimize
SSE = {y1 − (b0 + b1 x1)}² + … + {yn − (b0 + b1 xn)}²
Full model is yi = β0 + β1 xi + εi.
Suppose the errors (εi's) are independent N(0, σ²). What do you think a good estimate of σ² is?
MSE = SSE/(n−2) is an estimate of σ². Note how SSE looks like the numerator in s².
(I divided price by $1000. Think about why this doesn't matter.)

Source           DF      SS      MS      F      P
Regression        1  33.274  33.274  28.79  0.000
Residual Error   37  42.763   1.156
Total            38  76.038

Sum of Squares Total = {y1 − mean(y)}² + … + {y39 − mean(y)}² = 76.038
Sum of Squared Errors = {y1 − (b0 + b1 x1)}² + … + {y39 − (b0 + b1 x39)}² = 42.763
Sum of Squares for Regression = SSTotal − SSE
What do these mean?

[Regression Plot: Price ($) versus Age (years), showing both the regression line Price = 8198.25 − 385.108 Age and the overall mean of $3,656. S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%]
Source               DF        SS      MS      F      P
Regression      1 = p−1    33.274  33.274  28.79  0.000
Residual Error 37 = n−p    42.763   1.156
Total          38 = n−1    76.038

p is the number of regression parameters (2 for now).

SSTotal = {y1 − mean(y)}² + … + {y39 − mean(y)}² = 76.038
SSTotal / 38 is an estimate of the variance around the overall mean (i.e., the variance in the data without doing regression).
SSE = {y1 − (b0 + b1 x1)}² + … + {y39 − (b0 + b1 x39)}² = 42.763
MSE = SSE / 37 is an estimate of the variance around the line (i.e., the variance that is not explained by the regression).
SSR = SSTotal − SSE. MSR = SSR / 1 is the variance in the data that is "explained by the regression".
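The decomposition SSTotal = SSR + SSE holds exactly for a least-squares fit with an intercept. A minimal check, again with simulated data standing in for the 39 cars (prices in $1000s, as on the slide; not the real class data):

```python
import numpy as np

# Hypothetical data: 39 cars, price in $1000s.
rng = np.random.default_rng(1)
x = rng.uniform(6, 15, size=39)
y = 8.2 - 0.385 * x + rng.normal(0, 1.075, size=39)

n, p = len(x), 2                  # p = number of regression parameters
b1, b0 = np.polyfit(x, y, 1)      # least-squares fit
fitted = b0 + b1 * x

ss_total = ((y - y.mean()) ** 2).sum()      # variation around the overall mean
sse = ((y - fitted) ** 2).sum()             # variation around the line (unexplained)
ssr = ((fitted - y.mean()) ** 2).sum()      # variation explained by the regression

mse = sse / (n - p)   # estimate of the error variance sigma^2
msr = ssr / (p - 1)
```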
Source               DF        SS      MS      F      P
Regression      1 = p−1    33.274  33.274  28.79  0.000
Residual Error 37 = n−p    42.763   1.156
Total          38 = n−1    76.038

p is the number of regression parameters.
A test of H0: β1 = 0 versus HA: β1 ≠ 0.
Reject if the variance explained by the regression is high compared to the unexplained variability in the data, i.e., reject if F is large.
F = MSR / MSE
p-value is Pr(F(p−1, n−p) > MSR / MSE)
Reject H0 if the p-value is less than α.
(Assuming errors are independent and normal.)
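For simple regression, this F test is equivalent to the two-sided t test of β1 = 0 (F = t²), so the F p-value should match the slope p-value that standard software reports. A sketch with simulated data (not the class data), checked against scipy's linregress:

```python
import numpy as np
from scipy import stats

# Hypothetical car data: 39 cars, price in $1000s.
rng = np.random.default_rng(2)
x = rng.uniform(6, 15, size=39)
y = 8.2 - 0.385 * x + rng.normal(0, 1.075, size=39)

n, p = len(x), 2
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
ss_total = ((y - y.mean()) ** 2).sum()
sse = ((y - fitted) ** 2).sum()
msr = (ss_total - sse) / (p - 1)
mse = sse / (n - p)

f_stat = msr / mse
p_value = stats.f.sf(f_stat, p - 1, n - p)   # Pr(F_{p-1, n-p} > F)

# linregress reports the two-sided t-test p-value for the slope,
# which equals the F-test p-value in simple regression:
res = stats.linregress(x, y)
```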
R²
• Another summary of a regression is:
R² = (Sum of Squares for Regression) / (Sum of Squares Total)
0 ≤ R² ≤ 1
This is the proportion of the variation in the data that is explained by the regression.
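In simple regression, R² = SSR/SSTotal is also the square of the sample correlation between x and y. A minimal check with simulated data (not the class data):

```python
import numpy as np

# Hypothetical car data: 39 cars, price in $1000s.
rng = np.random.default_rng(3)
x = rng.uniform(6, 15, size=39)
y = 8.2 - 0.385 * x + rng.normal(0, 1.075, size=39)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

ss_total = ((y - y.mean()) ** 2).sum()
sse = ((y - fitted) ** 2).sum()
r_squared = (ss_total - sse) / ss_total   # SSR / SSTotal

# Squared correlation gives the same number in simple regression:
r = np.corrcoef(x, y)[0, 1]
```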
Two different ways to assess the "worth" of a regression:
1. R² close to one
2. Large F statistic
(Both reward a large absolute slope — bigger is better — and a small error variance — smaller is better.)