This Week
• Continue with linear regression
• Begin multiple regression
– Le 8.2
– C & S 9:A-E
• Handout: Class examples and assignment 3
Linear Regression
• Investigate the relationship between two variables
• Dependent variable
– The variable that is being predicted or explained
• Independent variable
– The variable that is doing the predicting or explaining
• Think of data in pairs $(x_i, y_i)$
Linear Regression - Purpose
• Is there an association between the two variables?
– Is BP change related to weight change?
• Estimation of impact
– How much BP change occurs per pound of weight change?
• Prediction
– If a person loses 10 pounds, how much of a drop in blood pressure can be expected?
Assumptions for Linear Regression
• For each value of X there is a population of Y’s that are normally distributed
• The population means form a straight line
• Each population has the same variance $\sigma^2$
• Note: The X's do not need to be normally distributed; in fact, the researcher can select them before data collection
Simple Linear Regression Equation
The simple linear regression equation is:
$y = \beta_0 + \beta_1 x$
• $\beta_0$ is the mean of y when x = 0
• The mean of y increases by $\beta_1$ for each increase of x by 1
Simple Linear Regression Model
The equation that describes how individual y values relate to x and an error term is called the regression model.
$y = \beta_0 + \beta_1 x + \varepsilon$
$\varepsilon$ reflects how individuals deviate from others with the same value of x
Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is:
$\hat{y} = b_0 + b_1 x$
• $b_0$ is the estimate for $\beta_0$
• $b_1$ is the estimate for $\beta_1$
• $\hat{y}$ is the estimated (predicted) value of y for a given x value. It is the estimated mean for that x.
Least Squares Method
Least Squares Criterion: Choose $\beta_0$ and $\beta_1$ to minimize
$S = \sum (y_i - \beta_0 - \beta_1 x_i)^2$
Of all possible lines, pick the one that minimizes the sum of the squared vertical distances of each point from that line. The minimizing values are the estimates $b_0$ and $b_1$.
The Least Squares Estimates
Slope:
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Intercept:
$b_0 = \bar{y} - b_1 \bar{x}$
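The slope and intercept formulas can be checked by hand with a few lines of Python. This is a minimal sketch, not part of the handout. The ten (x, y) pairs are an assumption: the notes do not list raw data, and these values were chosen because they reproduce the estimates in the PROC REG output at the end of these notes (intercept 60, slope 5).

```python
# Assumed illustrative data (not given in the notes):
# x = student population, y = quarterly sales.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of cross-products divided by sum of squared x deviations
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx

# Intercept: makes the fitted line pass through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")  # b1 = 5.0000, b0 = 60.0000
```

Because $b_0 = \bar{y} - b_1 \bar{x}$, the fitted line always passes through the point of means, here (14, 130).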
Estimating the Variance
An Estimate of $\sigma^2$
The mean square error (MSE) provides the estimate of $\sigma^2$, and the notation $s^2$ is also used.
$s^2 = \text{MSE} = \text{SSE}/(n-2)$
where:
$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$
If points are close to the regression line then SSE will be small
If points are far from the regression line then SSE will be large
Estimating $\sigma$
An Estimate of $\sigma$
To estimate $\sigma$ we take the square root of $s^2$. The resulting $s$ is called the root mean square error.
$s = \sqrt{\text{MSE}} = \sqrt{\dfrac{\text{SSE}}{n-2}}$
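As a rough check on the SSE, MSE, and root MSE definitions, the sketch below computes all three in Python. The data are an assumption (the notes do not list raw data); they were chosen because they reproduce SSE = 1530 and Root MSE = 13.82932 from the PROC REG output at the end of these notes. The estimates 60 and 5 are taken from that output.

```python
import math

# Assumed illustrative data (not given in the notes)
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# Estimates reported in the PROC REG output
b0, b1 = 60.0, 5.0
n = len(x)

# SSE: squared vertical distances between each point and the fitted line
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# MSE divides by n - 2 because two parameters (b0, b1) were estimated
mse = sse / (n - 2)

# Root MSE estimates sigma, in the same units as y
s = math.sqrt(mse)

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, s = {s:.4f}")
```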
Hypothesis Testing for $\beta_1$
$H_0$: $\beta_1 = 0$ (no relation between x and y)
$H_a$: $\beta_1 \neq 0$ (relation between x and y)
Test Statistic: $t = b_1 / SE(b_1)$
$SE(b_1)$ depends on
• Sample size
• How well the estimated line fits the points
• How spread out the x values are
Testing for Significance: t Test
Rejection Rule
Reject $H_0$ if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
where: $t_{\alpha/2}$ is based on a t distribution with n - 2 degrees of freedom
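The t test can be traced step by step in a short Python sketch. Two things here are assumptions rather than content from the notes: the data (chosen to reproduce the PROC REG output at the end of these notes), and the formula $SE(b_1) = s / \sqrt{\sum (x_i - \bar{x})^2}$, which is the standard expression and shows why the standard error depends on sample size, fit, and spread of x. The critical value 2.306 is the t-table value for $\alpha = 0.05$ (two-sided) with 8 degrees of freedom.

```python
import math

# Assumed illustrative data (not given in the notes)
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Root MSE: how well the line fits the points
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard formula (assumed, not stated in the notes): a better fit (smaller s)
# or more spread in x (larger s_xx) gives a smaller standard error
se_b1 = s / math.sqrt(s_xx)

t_stat = b1 / se_b1
T_CRIT = 2.306  # t table: alpha/2 = 0.025, n - 2 = 8 df
reject_h0 = abs(t_stat) > T_CRIT

print(f"SE(b1) = {se_b1:.5f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```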
Confidence Interval for $\beta_1$
$b_1 \pm t_{\alpha/2} \, SE(b_1)$
$t_{\alpha/2}$ is the cutoff value from the t-distribution with n-2 df
CLB option in SAS on model statement
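A quick worked example of the interval, as a sketch: the slope estimate and its standard error are taken from the PROC REG output at the end of these notes, and 2.306 is the t-table cutoff for a 95% interval with 8 degrees of freedom (n = 10).

```python
# Values from the PROC REG output in these notes
b1 = 5.0
se_b1 = 0.58027

# t table cutoff: alpha/2 = 0.025, n - 2 = 8 degrees of freedom
T_CRIT = 2.306

margin = T_CRIT * se_b1
lower = b1 - margin
upper = b1 + margin

print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")  # (3.662, 6.338)
```

The interval excludes 0, which agrees with rejecting $H_0$: $\beta_1 = 0$ at the 5% level.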
Estimating the Mean for a Particular X
Simply plug in your value of x in the estimated regression equation
Want to estimate the mean BP for persons aged 50
Suppose $b_0$ = 100 and $b_1$ = 0.80
Estimate = 100 + 0.80*50 = 140 mmHg
Can compute 95% CI for the estimate using SAS
CLM option on model statement
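For readers who want to see roughly what a confidence interval for the mean involves, here is a hedged sketch. It relies on assumptions not stated in the notes: the standard formula $\hat{y} \pm t_{\alpha/2} \, s \sqrt{1/n + (x_0 - \bar{x})^2 / \sum (x_i - \bar{x})^2}$, illustrative data chosen to reproduce the PROC REG output at the end of these notes, and an arbitrary value $x_0 = 10$. The BP example above cannot be used because its data are not given.

```python
import math

# Assumed illustrative data (not given in the notes)
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 10                 # hypothetical x value of interest
y_hat = b0 + b1 * x0    # plug-in estimate of the mean at x0

T_CRIT = 2.306          # t table: alpha/2 = 0.025, 8 df
# The interval widens as x0 moves away from x_bar
margin = T_CRIT * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)

print(f"estimated mean = {y_hat:.1f}, 95% CI = ({y_hat - margin:.2f}, {y_hat + margin:.2f})")
```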
The Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
The coefficient of determination is:
$r^2$ = SSR/SST
$r^2$ = proportion of variability in y explained by x (must be between 0 and 1)
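The decomposition SST = SSR + SSE and the resulting $r^2$ can be verified numerically. The sketch below uses assumed data (the notes do not list raw data) chosen to reproduce the sums of squares in the PROC REG output at the end of these notes.

```python
# Assumed illustrative data (not given in the notes)
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# Estimates reported in the PROC REG output
b0, b1 = 60.0, 5.0
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variability in y
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left unexplained

r_squared = ssr / sst

print(f"SST = {sst:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}, r^2 = {r_squared:.4f}")
```

With these values about 90% of the variability in y is explained by x.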
Residuals
How far off (distance) an individual point is from the estimated regression line
residual = observed value – predicted value
SAS CODE FOR REGRESSION;
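PROC REG DATA=datasetname SIMPLE;  /* SIMPLE prints descriptive statistics for each variable */
  MODEL depvar = indvar(s);        /* dependent variable = independent variable(s) */
  PLOT depvar * indvar;            /* plots depvar (vertical axis) against indvar */
RUN;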
Several options on model and plot statements.
OPTIONS ON MODEL STATEMENT;
MODEL depvar = indvar(s) / options;

Option    What it does
clb       95% CI for $\beta_1$
p         Predicted values
r         Residuals
clm       95% CI for the mean at value of x
OUTPUT FROM PROC REG
Dependent Variable: quarsales
Analysis of Variance
                           Sum of         Mean
Source             DF     Squares       Square    F Value    Pr > F
Model               1       14200        14200      74.25    <.0001
Error               8        1530    191.25000
Corrected Total     9       15730

Root MSE          13.82932    R-Square     0.9027
Dependent Mean   130.00000    Coeff Var   10.63794

SSR = 14200 (Model), SSE = 1530 (Error), SST = 15730 (Corrected Total)
MSE = 191.25 (Error mean square)
Coefficient of Determination: R-Square = SSR/SST = 14200/15730
Parameter Estimates
                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1    60.00000     9.22603       6.50      0.0002
studentpop    1     5.00000     0.58027       8.62      <.0001

In the studentpop row, Parameter Estimate is $b_1$ and Standard Error is $SE(b_1)$.
REGRESSION EQUATION:
Y = 60.0 + 5.0*X
QUARSALES = 60 + 5*STUDENTPOP