Basics of Regression Analysis

Determination of three performance measures:
• Estimation of the effect of each factor
• Explanation of the variability
• Forecasting error
Two Predictor Variables

Population Regression Model:
Y = β0 + β1X1 + β2X2 + e,  e following N(0, σe)

Unknown parameters: β0, β1, β2, σe
From Data to Estimates of Coefficients

Principle: Least Squares
→ (mathematics) Normal Equation System
→ (computing algorithm) Estimates of Coefficients
Least Squares Method

[Figure: scatter plots for simple regression (y vs. x) and multiple regression (y vs. x1, x2); the residual e is the vertical distance from each data point (*) to the fitted line or plane.]

Simple Regression: Ŷ = b0 + b1X
Multiple Regression: Ŷ = b0 + b1X1 + b2X2

Criterion: Minimize Σ (Yi − Ŷi)², sum over i = 1, …, n
Matrix Computation for b

• Normal Equation System: (XᵀX) b = XᵀY (see Text Appendix D.3)
• Solution for b: b = (XᵀX)⁻¹ (XᵀY)
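As a sketch of the normal-equation solution (assuming NumPy; the data below are made up for illustration and generated from an exact linear relation, so least squares recovers the coefficients exactly):

```python
import numpy as np

# Illustrative data from the exact relation Y = 1 + 2*X1 + 3*X2 (no noise),
# so the least-squares fit should recover b = (1, 2, 3).
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = 1 + 2 * X1 + 3 * X2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the normal equation system (X'X) b = X'Y.
# np.linalg.solve is preferred over forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(b)  # → approximately [1. 2. 3.]
```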
Standardized Regression Coefficients, bk′

• Definition (the beta coefficient):
  – b0′ = 0
  – bk′ = (sXk / sY) · bk for k = 1, 2
• Used to show relative weights of predictors.
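A minimal sketch of the beta-coefficient formula using the standard library (the slopes and data here are hypothetical, chosen only to illustrate the scaling):

```python
import statistics

# Hypothetical fitted slopes from a two-predictor regression (made-up numbers).
b = {"X1": 2.0, "X2": 3.0}

# Illustrative sample data for the predictors and the response.
X1 = [0, 1, 2, 3, 4, 5]
X2 = [1, 0, 2, 1, 3, 2]
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in zip(X1, X2)]

sY = statistics.stdev(Y)

# bk' = (sXk / sY) * bk -- unit-free, so predictors can be compared directly.
beta = {
    "X1": statistics.stdev(X1) / sY * b["X1"],
    "X2": statistics.stdev(X2) / sY * b["X2"],
}
print(beta)
```

Comparing |bk′| across predictors then shows their relative weights, regardless of the original measurement units.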
Estimation of se - Standard Deviation of Disturbance e
• Forecasting equation: Ŷ = b0 + b1X1 + b2X2
• SS of residuals: SSE = Σ (Yi − Ŷi)², sum over i = 1, …, n
• Mean SS: MSE = se² = SSE / (n − 3)
Standard Error of Coefficients

• The variance matrix of b, a (K+1) × (K+1) matrix, is Var(b) = se² (XᵀX)⁻¹
• s_bk = se · √(k-th diagonal element of (XᵀX)⁻¹)
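The residual-based estimates se and s_bk can be sketched as follows (assuming NumPy; the noisy Y values are made up for illustration):

```python
import numpy as np

# Small illustrative dataset; Y has some noise around a linear relation.
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y

resid = Y - X @ b
SSE = float(resid @ resid)
MSE = SSE / (n - 3)           # n - 3 degrees of freedom with two predictors
se = MSE ** 0.5               # estimated std. deviation of the disturbance

# s_bk = se * sqrt(k-th diagonal element of (X'X)^-1)
s_b = se * np.sqrt(np.diag(XtX_inv))
print(se, s_b)
```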
The Variability Explained

• First, determine the base variability for explanation by the regression.

Unconditional mean model: Y = μy + e,  e follows N(0, σy)
LS fit of the model: Pred_Y = Ȳ
SS of residuals: SST = Σ (Yi − Ȳ)², sum over i = 1, …, n
MSS (DF = n − 1): Sy² = SST / (n − 1)
The Variability Explained – cont.

• Second, subtract to find the variability still left unexplained.
• In SS: SST = Σ (Yi − Ȳ)²  and  SSE = Σ (Yi − Ŷi)²
• In variance: Sy² = SST / (n − 1)  and  Se² = SSE / (n − 3)
Creating ANOVA Table

Regression Model | Unexplained Variability in SS | DF    | Unexplained Variability in Variance (MSE)
Unconditional    | SST                           | n − 1 | Sy² = MST
Conditional      | SSE                           | n − 3 | Se² = MSE

Variability explained: SSR = SST − SSE, DF = 2
Proportion explained:
  R² = 1 − SSE / SST
  adjusted R² = 1 − Se² / Sy²
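The ANOVA quantities above can be sketched directly (the SST, SSE, and n values below are made up for illustration):

```python
# Building the ANOVA quantities from SST and SSE (illustrative values).
n = 30
SST = 500.0   # SS of residuals of the unconditional mean model, DF = n - 1
SSE = 120.0   # SS of residuals of the regression, DF = n - 3 (two predictors)

SSR = SST - SSE                # variability explained, DF = 2
MST = SST / (n - 1)
MSE = SSE / (n - 3)

R2 = 1 - SSE / SST             # proportion of SS explained
adj_R2 = 1 - MSE / MST         # proportion of variance explained

print(R2, adj_R2)
```

Note that adjusted R² is always at most R², since MSE is divided by the smaller DF.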
Test of Significance

• F-test of significance
• t-test of significance
  – Two-sided alternative
  – One-sided alternative

F-Test of Significance of the Variability Explained by the Regression

H0: β1 = β2 = 0
Ha: At least one coefficient is not 0

F-stat = [(SST − SSE) / 2] / [SSE / (n − 3)] = MSR / MSE
       = [R² / 2] / [(1 − R²) / (n − 3)]

P-value of F-stat = P{F(2, n − 3) > F-stat}
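The two forms of the F statistic (from the SS decomposition and from R²) are algebraically identical, which the following sketch verifies on made-up values:

```python
# F statistic computed two equivalent ways (illustrative values).
n = 30
SST = 500.0
SSE = 120.0

# From the SS decomposition: MSR / MSE with DF 2 and n - 3.
F_from_SS = ((SST - SSE) / 2) / (SSE / (n - 3))

# From R^2: the same quantity after dividing numerator and denominator by SST.
R2 = 1 - SSE / SST
F_from_R2 = (R2 / 2) / ((1 - R2) / (n - 3))

print(F_from_SS, F_from_R2)
```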
t-Test of Significance of a Variable, X1 – Two-Sided

H0: β1 = 0
Ha: β1 ≠ 0

t-stat of X1 = b1 / s_b1

P-value of t-stat = 2 · P{t(n − 3) > |t-stat|}
One-Sided Test of Significance of a Variable, X1

H0: β1 = 0
Ha: β1 > 0 (using the prior knowledge)

t-stat of X1 = b1 / s_b1

P-value of t-stat = P{t(n − 3) > t-stat}
Forecasting
• Point forecasting
• Sources of forecasting error
• Interval forecasting
Forecasting at xm

Data of X for regression: the n × 3 design matrix

X = [ 1  X11  X12
      ⋮   ⋮    ⋮
      1  Xn1  Xn2 ]

Value of X for prediction: the vector xm = (1, Xm1, Xm2)ᵀ
Sources of Forecasting Error

• Data: Y|xm = β0 + β1Xm1 + β2Xm2 + em
• Forecast: Ŷ|xm = b0 + b1Xm1 + b2Xm2
• Forecast error:
  Y|xm − Ŷ|xm = (β0 − b0) + (β1 − b1)Xm1 + (β2 − b2)Xm2 + em
  (error in the estimated mean, plus the disturbance em)
Computing Standard Errors

s_m = se · √( xmᵀ (XᵀX)⁻¹ xm )   (standard error of the estimated mean at xm)
s_p = √( se² + s_m² )            (standard error of prediction)
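A sketch of the two standard errors at a new point xm (assuming NumPy; the data and the forecast point are made up for illustration):

```python
import numpy as np

# Two-predictor setup with illustrative noisy data.
X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y

resid = Y - X @ b
se = (resid @ resid / (n - 3)) ** 0.5

xm = np.array([1.0, 2.5, 1.5])          # (1, Xm1, Xm2), a hypothetical point

# s_m: standard error of the estimated mean at xm
s_m = se * np.sqrt(xm @ XtX_inv @ xm)
# s_p: standard error of prediction, adding the disturbance variance se^2
s_p = (se**2 + s_m**2) ** 0.5

y_hat = float(xm @ b)
print(y_hat, s_m, s_p)
```

An approximate prediction interval is then y_hat ± t(n − 3) · s_p.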
Forecasting Performance Analysis

• R²_pred = 1 − PRESS / SST
  PRESS = SS of {yi − ŷi(i)} (deleted residuals)
• Sample splitting:
  – Analysis sample (n1)
  – Validation sample (n2)
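PRESS does not require n separate refits: a standard identity gives the deleted residual as e_i / (1 − h_ii), where h_ii is the i-th diagonal of the hat matrix. A sketch (assuming NumPy, with made-up data):

```python
import numpy as np

X1 = np.array([0., 1., 2., 3., 4., 5.])
X2 = np.array([1., 0., 2., 1., 3., 2.])
Y = np.array([4.2, 2.9, 11.3, 9.8, 18.1, 16.7])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b

# Diagonal of the hat matrix H = X (X'X)^-1 X'; the deleted residual is
# e_i / (1 - h_ii), so PRESS needs no leave-one-out refitting.
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
press = float(np.sum((resid / (1 - h)) ** 2))

SST = float(np.sum((Y - Y.mean()) ** 2))
R2_pred = 1 - press / SST
print(press, R2_pred)
```

Since 0 < 1 − h_ii ≤ 1, PRESS is always at least SSE, so R²_pred never exceeds R².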
Generalization to K Independent Variables

• Use n − K − 1 in place of n − 3 for the DF of t.
• Use K for the numerator DF and n − K − 1 for the denominator DF of F.
Diagnostics
• Assumptions for Disturbance
• Multi-collinearity
• Outliers and Influential Observations
Problematic Data Conditions
• Regression Coefficients Are Sensitive to:
–Highly Collinear Independent Variables
–Contamination By Outliers and Influential Observations
Detecting Outliers and Influential Data

• Outliers
  – Leverage (X-space): distance from the mean
  – Studentized residual (Y-space): forecasting error
• Influential data (idea: with / without comparison)
  – Cook's D
  – DFBETAS
  – DFFITS
Modeling Techniques

• Transformation of variables
  – Log
  – Others
• Using dummy variables
  – Symbolic representation
  – Dummy variables for qualitative variables
• Using scores for ordinal variables
• Selection of independent variables
  – Forecasting
  – Computer-intensive
  – Analysis of the correlation structure of independent variables
Dummy Variables

• Dk = "If (X = k, 1, 0)"
• Can be used for nominal and also for ordinal variables
• # of Dk = c − 1, where c is the number of categories.
Using Scores for Ordinal Variables

• Scoring systems:
  – 1, 2, 3, …, c
  – −2, −1, 0, 1, 2 (when c is odd)
Implications of Variable Selection
Purposes ofRegression
MissingEssentialPredictors
Including Non-essentialPredictors
Prediction ofthe DependentVariable
Increase in theMean SquaredError of thePrediction
Increase in theMean SquaredError of thePrediction
Estimation ofthe Effect ofthe Predictors
Bias in theEstimates
Increase in theStandardErrors of theCoefficients
Selection of Variables - 1

• Backward elimination: start with all X's and drop variables by t-test until the final regression.
• Stepwise (forward) inclusion: best simple → best two-variable → best … variables, adding at each step the variable giving the max increase in R².
Selection of Variables - 2

• All possible regressions: with K independent variables, fit the K simple regressions, the K(K − 1)/2 two-variable regressions, …, and the single K-variable regression, then choose the final regression by the selection criteria.
Selection Criteria

• R²
• Adj. R²
• R²_PRED
• Se
• Cp

Mallows' Cp (p = # of coefficients):
Cp = SSE_p / MSE_F − (n − 2p), where MSE_F is the MSE of the full model.
Select a combination with Cp close to p.
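One useful property of this criterion is that the full model itself always gets Cp = p exactly, which makes a handy sanity check. A sketch (the n, K, and SSE values are made up):

```python
# Mallows' Cp = SSE_p / MSE_F - (n - 2p), where MSE_F is the full-model MSE.
def mallows_cp(sse_p, p, mse_full, n):
    return sse_p / mse_full - (n - 2 * p)

n = 30
K = 4                        # predictors in the full model (illustrative)
SSE_full = 100.0
MSE_full = SSE_full / (n - K - 1)

# For the full model, SSE_p / MSE_F = n - K - 1, so Cp reduces to K + 1 = p.
p_full = K + 1               # number of coefficients incl. the intercept
print(mallows_cp(SSE_full, p_full, MSE_full, n))  # → 5.0
```

Subset models with little bias also have Cp near p; Cp much larger than p signals a poorly fitting subset.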
What to Look for in a Good Regression?

• Remember the three functions of regression:
  – Estimation of the effect of each X
  – Explaining the variability of Y
  – Forecasting
• Population regressions are assumptions; they need testing.
• Data might be contaminated.
Extensions
For Other Variable Types of Y

Types of Variable:
• Quantitative
  – Continuous
  – Discrete (counting)
• Qualitative
  – Ordinal
  – Nominal
Generalized Linear Models (GLM)

• Regression model: Y = β0 + β1X1 + β2X2 + e,  e following N(0, σ)
• GLM formulation:
  1. Model for Y: Y is N(μ, σ)
  2. Model for predictors (link function): μ = β0 + β1X1 + β2X2
Forecasting Counting Data

• Model for Y: Poisson distribution (λ)
  P(Yi = yi | λi) = exp(−λi) λi^yi / yi!
• Link function:
  log λi = β0 + β1X1i + β2X2i + … + βK XKi
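The log link guarantees a positive forecast rate λ, and the Poisson PMF then gives probabilities for each count. A sketch with a single predictor and made-up coefficients:

```python
import math

# Poisson forecasting with a log link (coefficients are hypothetical):
# log(lambda_i) = b0 + b1 * x_i, so lambda_i = exp(b0 + b1 * x_i) > 0.
b0, b1 = 0.5, 0.3

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

x = 2.0
lam = math.exp(b0 + b1 * x)   # forecast rate at x

# Probabilities over counts; they should sum to (nearly) 1.
probs = [poisson_pmf(y, lam) for y in range(50)]
print(lam, sum(probs))
```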