Basics of Regression Analysis

Determination of three performance measures:
• Estimation of the effect of each factor
• Explanation of the variability
• Forecasting error
Two Predictor Variables

Population Regression Model:
Y = β0 + β1X1 + β2X2 + e,  e following N(0, σe)

Unknown parameters: β0, β1, β2, σe
From Data to Estimates of Coefficients

Principle: Least Squares
→ (mathematics) Normal Equation System
→ (computing algorithm) Estimates of Coefficients
Least Squares Method

[Figure: scatter plots for simple regression (y vs. x) and multiple regression (y vs. x1, x2); the residual e is the vertical distance from each data point (*) to the fitted line or plane.]

Simple Regression: Ŷ = b0 + b1X
Multiple Regression: Ŷ = b0 + b1X1 + b2X2

Criterion: Minimize Σ (Yi − Ŷi)², sum over i = 1, …, n
Matrix Computation for b

• Normal Equation System: (XᵀX) b = XᵀY (see Text Appendix D.3)
• Solution for b: b = (XᵀX)⁻¹ (XᵀY)
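As a sketch of the normal-equation solution (assuming NumPy; the data below are made up for illustration and generated from an exact linear relation, so least squares recovers the coefficients exactly):

```python
import numpy as np

# Illustrative data from the exact relation Y = 1 + 2*X1 + 3*X2 (no noise),
# so the least-squares fit should recover b = (1, 2, 3).
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = 1 + 2 * X1 + 3 * X2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the normal equation system (X'X) b = X'Y.
# np.linalg.solve is preferred over forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(b)  # → approximately [1. 2. 3.]
```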
Standardized Regression Coefficients, bk′

• Definition (the beta coefficient):
  – b0′ = 0
  – bk′ = (sXk / sY) · bk for k = 1, 2
• Used to show relative weights of predictors.
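A minimal sketch of the beta-coefficient formula using the standard library (the slopes and data here are hypothetical, chosen only to illustrate the scaling):

```python
import statistics

# Hypothetical fitted slopes from a two-predictor regression (made-up numbers).
b = {"X1": 2.0, "X2": 3.0}

# Illustrative sample data for the predictors and the response.
X1 = [0, 1, 2, 3, 4, 5]
X2 = [1, 0, 2, 1, 3, 2]
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in zip(X1, X2)]

sY = statistics.stdev(Y)

# bk' = (sXk / sY) * bk -- unit-free, so predictors can be compared directly.
beta = {
    "X1": statistics.stdev(X1) / sY * b["X1"],
    "X2": statistics.stdev(X2) / sY * b["X2"],
}
print(beta)
```

Comparing |bk′| across predictors then shows their relative weights, regardless of the original measurement units.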
Estimation of se - Standard Deviation of Disturbance e
• Forecasting equation: Ŷ = b0 + b1X1 + b2X2
• SS of residuals: SSE = Σ (Yi − Ŷi)², sum over i = 1, …, n
• Mean SS: MSE = se² = SSE / (n − 3)
Standard Error of Coefficients

• The variance matrix of b, a (K+1) × (K+1) matrix, is Var(b) = se² (XᵀX)⁻¹
• s_bk = se · √(k-th diagonal element of (XᵀX)⁻¹)
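The residual-based estimates se and s_bk can be sketched as follows (assuming NumPy; the noisy Y values are made up for illustration):

```python
import numpy as np

# Small illustrative dataset; Y has some noise around a linear relation.
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y

resid = Y - X @ b
SSE = float(resid @ resid)
MSE = SSE / (n - 3)           # n - 3 degrees of freedom with two predictors
se = MSE ** 0.5               # estimated std. deviation of the disturbance

# s_bk = se * sqrt(k-th diagonal element of (X'X)^-1)
s_b = se * np.sqrt(np.diag(XtX_inv))
print(se, s_b)
```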
The Variability Explained

• First, determine the base variability for explanation by the regression.

Unconditional mean model: Y = μy + e,  e follows N(0, σy)
LS fit of the model: Pred_Y = Ȳ
SS of residuals: SST = Σ (Yi − Ȳ)², sum over i = 1, …, n
MSS (DF = n − 1): Sy² = SST / (n − 1)
The Variability Explained – cont.

• Second, subtract to find the variability still left unexplained.
• In SS: SST = Σ (Yi − Ȳ)²  and  SSE = Σ (Yi − Ŷi)²
• In variance: Sy² = SST / (n − 1)  and  Se² = SSE / (n − 3)
Creating ANOVA Table

Regression Model | Unexplained Variability in SS | DF    | Unexplained Variability in Variance (MSE)
Unconditional    | SST                           | n − 1 | Sy² = MST
Conditional      | SSE                           | n − 3 | Se² = MSE

Variability explained: SSR = SST − SSE, DF = 2
Proportion explained:
  R² = 1 − SSE / SST
  adjusted R² = 1 − Se² / Sy²
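The ANOVA quantities above can be sketched directly (the SST, SSE, and n values below are made up for illustration):

```python
# Building the ANOVA quantities from SST and SSE (illustrative values).
n = 30
SST = 500.0   # SS of residuals of the unconditional mean model, DF = n - 1
SSE = 120.0   # SS of residuals of the regression, DF = n - 3 (two predictors)

SSR = SST - SSE                # variability explained, DF = 2
MST = SST / (n - 1)
MSE = SSE / (n - 3)

R2 = 1 - SSE / SST             # proportion of SS explained
adj_R2 = 1 - MSE / MST         # proportion of variance explained

print(R2, adj_R2)
```

Note that adjusted R² is always at most R², since MSE is divided by the smaller DF.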
Test of Significance

• F-test of significance
• t-test of significance
  – Two-sided alternative
  – One-sided alternative

F-Test of Significance of the Variability Explained by the Regression

H0: β1 = β2 = 0
Ha: At least one coefficient is not 0

F-stat = [(SST − SSE) / 2] / [SSE / (n − 3)] = MSR / MSE
       = [R² / 2] / [(1 − R²) / (n − 3)]

P-value of F-stat = P{F(2, n − 3) > F-stat}
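The two forms of the F statistic (from the SS decomposition and from R²) are algebraically identical, which the following sketch verifies on made-up values:

```python
# F statistic computed two equivalent ways (illustrative values).
n = 30
SST = 500.0
SSE = 120.0

# From the SS decomposition: MSR / MSE with DF 2 and n - 3.
F_from_SS = ((SST - SSE) / 2) / (SSE / (n - 3))

# From R^2: the same quantity after dividing numerator and denominator by SST.
R2 = 1 - SSE / SST
F_from_R2 = (R2 / 2) / ((1 - R2) / (n - 3))

print(F_from_SS, F_from_R2)
```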
t-Test of Significance of a Variable, X1 – Two-Sided

H0: β1 = 0
Ha: β1 ≠ 0

t-stat of X1 = b1 / s_b1

P-value of t-stat = 2 · P{t(n − 3) > |t-stat|}
One-Sided Test of Significance of a Variable, X1

H0: β1 = 0
Ha: β1 > 0 (using the prior knowledge)

t-stat of X1 = b1 / s_b1

P-value of t-stat = P{t(n − 3) > t-stat}
Forecasting
• Point forecasting
• Sources of forecasting error
• Interval forecasting
Forecasting at xm

Data of X for regression: the n × 3 design matrix

X = [ 1  X11  X12
      ⋮   ⋮    ⋮
      1  Xn1  Xn2 ]

Value of X for prediction: the vector xm = (1, Xm1, Xm2)ᵀ
Sources of Forecasting Error

• Data: Y|xm = β0 + β1Xm1 + β2Xm2 + em
• Forecast: Ŷ|xm = b0 + b1Xm1 + b2Xm2
• Forecast error:
  Y|xm − Ŷ|xm = (β0 − b0) + (β1 − b1)Xm1 + (β2 − b2)Xm2 + em
  (error in the estimated mean, plus the disturbance em)
Computing Standard Errors

s_m = se · √( xmᵀ (XᵀX)⁻¹ xm )   (standard error of the estimated mean at xm)
s_p = √( se² + s_m² )            (standard error of prediction)
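A sketch of the two standard errors at a new point xm (assuming NumPy; the data and the forecast point are made up for illustration):

```python
import numpy as np

# Two-predictor setup with illustrative noisy data.
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y

resid = Y - X @ b
se = (resid @ resid / (n - 3)) ** 0.5

xm = np.array([1.0, 2.5, 1.5])          # (1, Xm1, Xm2), a hypothetical point

# s_m: standard error of the estimated mean at xm
s_m = se * np.sqrt(xm @ XtX_inv @ xm)
# s_p: standard error of prediction, adding the disturbance variance se^2
s_p = (se**2 + s_m**2) ** 0.5

y_hat = float(xm @ b)
print(y_hat, s_m, s_p)
```

An approximate prediction interval is then y_hat ± t(n − 3) · s_p.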
Forecasting Performance Analysis

• R²_pred = 1 − PRESS / SST
  PRESS = SS of {yi − ŷi(i)} (deleted residuals)
• Sample splitting:
  – Analysis sample (n1)
  – Validation sample (n2)
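PRESS does not require n separate refits: a standard identity gives the deleted residual as e_i / (1 − h_ii), where h_ii is the i-th diagonal of the hat matrix. A sketch (assuming NumPy, with made-up data):

```python
import numpy as np

X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b

# Diagonal of the hat matrix H = X (X'X)^-1 X'; the deleted residual is
# e_i / (1 - h_ii), so PRESS needs no leave-one-out refitting.
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
press = float(np.sum((resid / (1 - h)) ** 2))

SST = float(np.sum((Y - Y.mean()) ** 2))
R2_pred = 1 - press / SST
print(press, R2_pred)
```

Since 0 < 1 − h_ii ≤ 1, PRESS is always at least SSE, so R²_pred never exceeds R².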
Generalization to K Independent Variables

• Use n − K − 1 in place of n − 3 for the DF of t.
• Use K for the numerator DF and n − K − 1 for the denominator DF of F.
Diagnostics
• Assumptions for Disturbance
• Multi-collinearity
• Outliers and Influential Observations
Problematic Data Conditions
• Regression Coefficients Are Sensitive to:
–Highly Collinear Independent Variables
–Contamination By Outliers and Influential Observations
Detecting Outliers and Influential Data

• Outliers
  – Leverage (X-space): distance from the mean
  – Studentized residual (Y-space): forecasting error
• Influential data (idea: with / without comparison)
  – Cook's D
  – DFBETAS
  – DFFITS
Modeling Techniques

• Transformation of variables
  – Log
  – Others
• Using dummy variables
  – Symbolic representation
  – Dummy variables for qualitative variables
• Using scores for ordinal variables
• Selection of independent variables
  – Forecasting
  – Computer-intensive
  – Analysis of the correlation structure of independent variables
Dummy Variables

• Dk = "If (X = k, 1, 0)"
• Can be used for nominal and also for ordinal variables
• # of Dk = c − 1, where c is the number of categories.
Using Scores for Ordinal Variables

• Scoring systems:
  – 1, 2, 3, …, c
  – −2, −1, 0, 1, 2 (when c is odd)
Implications of Variable Selection
Purposes ofRegression
MissingEssentialPredictors
Including Non-essentialPredictors
Prediction ofthe DependentVariable
Increase in theMean SquaredError of thePrediction
Increase in theMean SquaredError of thePrediction
Estimation ofthe Effect ofthe Predictors
Bias in theEstimates
Increase in theStandardErrors of theCoefficients
Selection of Variables - 1

• Backward elimination: start with all X's and drop variables by t-test until the final regression.
• Stepwise (forward) inclusion: best simple → best two-variable → best … variables, adding at each step the variable giving the max increase in R².
Selection of Variables - 2

• All possible regressions: with K independent variables, fit the K simple regressions, the K(K − 1)/2 two-variable regressions, …, and the single K-variable regression, then choose the final regression by the selection criteria.
Selection Criteria

• R²
• Adj. R²
• R²_PRED
• Se
• Cp

Mallows' Cp (p = # of coefficients):
Cp = SSE_p / MSE_F − (n − 2p), where MSE_F is the MSE of the full model.
Select a combination with Cp close to p.
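One useful property of this criterion is that the full model itself always gets Cp = p exactly, which makes a handy sanity check. A sketch (the n, K, and SSE values are made up):

```python
# Mallows' Cp = SSE_p / MSE_F - (n - 2p), where MSE_F is the full-model MSE.
def mallows_cp(sse_p, p, mse_full, n):
    return sse_p / mse_full - (n - 2 * p)

n = 30
K = 4                        # predictors in the full model (illustrative)
SSE_full = 100.0
MSE_full = SSE_full / (n - K - 1)

# For the full model, SSE_p / MSE_F = n - K - 1, so Cp reduces to K + 1 = p.
p_full = K + 1               # number of coefficients incl. the intercept
print(mallows_cp(SSE_full, p_full, MSE_full, n))  # → 5.0
```

Subset models with little bias also have Cp near p; Cp much larger than p signals a poorly fitting subset.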
What to Look for in a Good Regression?

• Remember the three functions of regression:
  – Estimation of the effect of each X
  – Explaining the variability of Y
  – Forecasting
• Population regressions are assumptions; they need testing.
• Data might be contaminated.
Extensions
For Other Variable Types of Y

Types of Variable:
• Quantitative
  – Continuous
  – Discrete (counting)
• Qualitative
  – Ordinal
  – Nominal
Generalized Linear Models (GLM)

• Regression model: Y = β0 + β1X1 + β2X2 + e,  e following N(0, σ)
• GLM formulation:
  1. Model for Y: Y is N(μ, σ)
  2. Model for predictors (link function): μ = β0 + β1X1 + β2X2
Forecasting Counting Data

• Model for Y: Poisson distribution (λ)
  P(Yi = yi | λi) = exp(−λi) λi^yi / yi!
• Link function:
  log λi = β0 + β1X1i + β2X2i + … + βK XKi
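The log link guarantees a positive forecast rate λ, and the Poisson PMF then gives probabilities for each count. A sketch with a single predictor and made-up coefficients:

```python
import math

# Poisson forecasting with a log link (coefficients are hypothetical):
# log(lambda_i) = b0 + b1 * x_i, so lambda_i = exp(b0 + b1 * x_i) > 0.
b0, b1 = 0.5, 0.3

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

x = 2.0
lam = math.exp(b0 + b1 * x)   # forecast rate at x

# Probabilities over counts; they should sum to (nearly) 1.
probs = [poisson_pmf(y, lam) for y in range(50)]
print(lam, sum(probs))
```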