Linear Models Alan Lee Sample presentation for STATS 760

Linear Models

Alan Lee

Sample presentation for STATS 760

Contents

• The problem

• Typical data

• Exploratory analysis

• The model

• Estimation and testing

• Diagnostics

• Software

• A worked example

The Problem

• To model the relationship between a continuous variable Y and several explanatory variables x1,… xk.

• Given values of x1,… xk , predict the value of Y.

Typical Data

• Data on 5000 motor vehicle insurance policies having at least one claim

• Variables are

– Y: log(amount of claim)

– x1: sex of policy holder

– x2: age of policy holder

– x3: age of car

– x4: car type (1-20 score, 1=Toyota Corolla, 20 = Porsche)

Exploratory Analysis

• Plot Y against other variables

• Scatterplot matrix

• Smooth as necessary
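
The three steps above might look as follows in R. This is only a minimal sketch on simulated data: the data frame `claims` and its columns (`logclaim`, `carage`, `age`) are hypothetical stand-ins, not the insurance data used in this presentation, and `lowess` is just one possible choice of smoother.

```r
# Simulated stand-in data (hypothetical; NOT the presentation's data set)
set.seed(1)
n <- 200
claims <- data.frame(
  carage = runif(n, 0, 20),   # age of car
  age    = runif(n, 18, 80)   # age of policy holder
)
claims$logclaim <- 6 - 0.02 * claims$carage + 0.004 * claims$age + rnorm(n)

# Plot Y against one explanatory variable
plot(claims$carage, claims$logclaim,
     xlab = "Car age", ylab = "log(claim amount)")

# Smooth as necessary: overlay a lowess curve on the scatterplot
smooth <- lowess(claims$carage, claims$logclaim)
lines(smooth)

# Scatterplot matrix of all variables
pairs(claims)
```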

Log claims vs car age

[Figure: plot of log claims against car age]

The Model

• The relationship is modelled using the conditional distribution of Y given the covariates x1, …, xk.

• Assume the conditional distribution of Y is $N(\mu, \sigma^2)$, where $\mu$ depends on the covariates.

The Model (2)

• If all covariates are “continuous”, then

$$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

• In addition, all Y’s are assumed independent.
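
Putting the normality, linearity and independence assumptions together, the model for n observations can be written in one line. The error term $\varepsilon_i$ and the double subscript $x_{ij}$ (value of covariate j for observation i) are notation introduced here for illustration; they do not appear on the slides.

```latex
Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n.
```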

Estimation and Testing

• Estimate the $\beta$’s

• Estimate the error variance $\sigma^2$

• Test if the $\beta$’s are 0

• Check goodness-of-fit

Least Squares

Estimate the $\beta$’s by the values that minimize the sum of squares (least squares estimates, LSEs):

$$SS(b) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})^2$$

The minimizing values are the solution of the Normal Equations. The minimum value is the residual sum of squares (RSS).

$\sigma^2$ is estimated by RSS/(n-k-1).
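
As a hedged illustration of the statements above, the following R sketch solves the normal equations directly on simulated data and computes RSS/(n-k-1). The variable names (`x1`, `x2`, `b_hat`, `sigma2_hat`) and the simulated coefficients are assumptions made for the example; in practice one would simply call `lm`, which is used here only as a cross-check.

```r
# Simulated data with k = 2 covariates (hypothetical values)
set.seed(2)
n <- 100
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.7)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Normal equations: (X'X) b = X'y
b_hat <- solve(crossprod(X), crossprod(X, y))

# Residual sum of squares and the estimate of the error variance
rss        <- sum((y - X %*% b_hat)^2)
sigma2_hat <- rss / (n - k - 1)

# Cross-check against lm
fit <- lm(y ~ x1 + x2)
```

Comparing `b_hat` with `coef(fit)` and `sigma2_hat` with `summary(fit)$sigma^2` should show agreement up to rounding error.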

Goodness of Fit

• Goodness of fit is measured by $R^2$:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\text{Total SS}}, \qquad \text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$0 \le R^2 \le 1$ (why?)

$R^2 = 1$ iff perfect fit (data all on a plane)
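
One way to see the bounds: RSS is a sum of squares, so it cannot be negative, and because the model contains an intercept, the fit with every slope set to zero (which has sum of squares equal to Total SS) is one of the candidates the least squares criterion minimizes over, so RSS cannot exceed Total SS. The R sketch below computes $R^2$ by hand on simulated data (all names and values are hypothetical) and compares it with the value reported by `summary`.

```r
# Simulated data (hypothetical values)
set.seed(3)
n <- 80
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)    # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
r2  <- 1 - rss / tss

# Value reported by R, for comparison
r2_reported <- summary(fit)$r.squared
```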

Prediction

• Y is predicted by

$$\hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k$$

where the hat indicates the LSE

• Standard errors: 2 kinds, one for the mean value of Y for a set of x’s, the other for an individual y for a particular set of x’s
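
The two kinds of standard error correspond to the two interval types offered by R's `predict` for `lm` fits. The sketch below is illustrative only: the data are simulated and the names `d`, `new_x`, `mean_int` and `indiv_int` are hypothetical.

```r
# Simulated data (hypothetical values)
set.seed(4)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)

# A particular set of x's at which to predict
new_x <- data.frame(x1 = 0.5, x2 = -1)

# Interval for the MEAN value of Y at these x's
mean_int <- predict(fit, newdata = new_x, interval = "confidence")

# Interval for an INDIVIDUAL y at these x's
indiv_int <- predict(fit, newdata = new_x, interval = "prediction")
```

Both calls return the same point prediction; the interval for an individual y is wider because it also accounts for the error variance of a single observation.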

Interpretation of Coefficients

• The LSE for variable xj is the amount we expect y to increase if xj is increased by a unit amount, assuming all the other x’s are held fixed

• The test of $\beta_j = 0$ tests the hypothesis that variable $x_j$ makes no contribution to the fit, given that all other variables are in the model

Checking Assumptions (1)

• Tools are residuals, fitted values and hat matrix diagonals

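• Fitted values: $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k$

• Residuals: $e = y - \hat{y}$

• Hat matrix diagonals $h_{ii}$, where $\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \dots + h_{ii} y_i + \dots + h_{in} y_n$

(Measure the effect of an observation on its fitted value)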
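
In R the three diagnostic tools can be extracted with `fitted`, `residuals` and `hatvalues`. The sketch below, on simulated data with hypothetical names, also builds the hat matrix explicitly from the design matrix to show that the fitted values are linear combinations of the observed y's.

```r
# Simulated data (hypothetical values)
set.seed(5)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + 2 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

y_hat <- fitted(fit)      # fitted values
e     <- residuals(fit)   # residuals y - y_hat
h     <- hatvalues(fit)   # hat matrix diagonals h_ii

# Explicit hat matrix H = X (X'X)^{-1} X', so that y_hat = H y
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)
```

The diagonal of `H` reproduces `h`, and the hat matrix diagonals sum to the number of fitted coefficients (here 3).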

Checking Assumptions (2)

Assumptions are

– Mean linear in the x’s (plot residuals v fitted values, partial residual plot, CERES plots)

– Constant variance (plot squared residuals v fitted values)

– Independence (time series plot, residuals v preceding)

– Normality/outliers (normal plot)
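
Most of the plots listed above need only base R graphics. The following is a minimal sketch on simulated data (all names are hypothetical, and the observations are assumed to be stored in time order for the independence checks). Partial residual and CERES plots are not shown; they are available in add-on packages such as `car`.

```r
# Simulated data (hypothetical values)
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)

fit   <- lm(y ~ x1 + x2)
e     <- residuals(fit)
y_hat <- fitted(fit)

# Mean linear in the x's: residuals v fitted values
plot(y_hat, e, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Constant variance: squared residuals v fitted values
plot(y_hat, e^2, xlab = "Fitted values", ylab = "Squared residuals")

# Independence: time series plot, then residuals v preceding residuals
plot(e, type = "l", xlab = "Observation order", ylab = "Residuals")
plot(e[-n], e[-1], xlab = "Preceding residual", ylab = "Residual")

# Normality/outliers: normal plot
qqnorm(e)
qqline(e)
```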

Remedial Action

• Transform variables

• Delete outliers

• Weighted least squares
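
As an illustration of the last remedy, the sketch below fits ordinary and weighted least squares to simulated data whose error standard deviation is proportional to x. The weights `1 / x^2` are an assumption that is valid only because the data were simulated that way; with real data the variance function has to be estimated or argued for.

```r
# Simulated data with non-constant variance (hypothetical values)
set.seed(7)
n <- 100
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error sd proportional to x

# Ordinary least squares ignores the unequal variances
fit_ols <- lm(y ~ x)

# Weighted least squares: weights inversely proportional to the variance
w       <- 1 / x^2
fit_wls <- lm(y ~ x, weights = w)

# The same estimates from the weighted normal equations (X'WX) b = X'Wy
X     <- cbind(1, x)
b_wls <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
```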

Software

• SAS: PROC REG, PROC GLM

• S-Plus, R: lm

• Usage: lm(model formula, dataframe, weights, …)

Model Formula

• Assume k=3

• If x1, x2, x3 are all continuous, fit a plane: Y ~ x1 + x2 + x3

• If x1 is categorical (e.g. gender) and x2, x3 are continuous, fit a different plane/curve in x2, x3 for each level of x1:

Y ~ x1 + x2 + x3 (planes parallel)

Y ~ x1 + x2 + x3 + x1:x2 + x1:x3 (planes different)
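
The two formulas can be compared directly in R. The sketch below uses simulated data in which the slope on x3 really does differ between the two levels of x1; the data frame `d`, the level labels and the coefficients are all hypothetical.

```r
# Simulated data: x1 categorical, x2 and x3 continuous (hypothetical values)
set.seed(8)
n <- 120
d <- data.frame(
  x1 = factor(sample(c("F", "M"), n, replace = TRUE)),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
d$y <- 1 + 0.5 * (d$x1 == "M") + d$x2 +
  ifelse(d$x1 == "M", 2, 0.5) * d$x3 + rnorm(n)

# Parallel planes: only the intercept depends on the level of x1
fit_parallel <- lm(y ~ x1 + x2 + x3, data = d)

# Different planes: intercept and both slopes depend on the level of x1
fit_different <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3, data = d)

# F test of the parallel model against the more general one
cmp <- anova(fit_parallel, fit_different)
```

The second formula can also be written more compactly as `y ~ x1 * (x2 + x3)`, which expands to the same terms.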

Insurance Example (1)

> cars.lm<-lm(logad~poly(CARAGE,2)+PRIMAGEN+gender)
> summary(cars.lm)

Call:
lm(formula = logad ~ poly(CARAGE, 2) + PRIMAGEN + gender)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9713 -0.4610  0.2376  0.8092  3.9767

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       5.986329   0.077533  77.210  < 2e-16 ***
poly(CARAGE, 2)1 -7.308946   1.229095  -5.947 2.92e-09 ***
poly(CARAGE, 2)2 -8.038865   1.232416  -6.523 7.58e-11 ***
PRIMAGEN          0.004014   0.001339   2.999  0.00272 **
gender            0.015633   0.041474   0.377  0.70624
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.226 on 4995 degrees of freedom
Multiple R-Squared: 0.01611,  Adjusted R-squared: 0.01532
F-statistic: 20.45 on 4 and 4995 DF,  p-value: < 2.2e-16

Insurance Example (2)

> plot(cars.lm)