Linear Models
Alan Lee
Sample presentation for STATS 760
Contents
• The problem
• Typical data
• Exploratory Analysis
• The Model
• Estimation and testing
• Diagnostics
• Software
• A Worked Example
The Problem
• To model the relationship between a continuous variable Y and several explanatory variables x1,… xk.
• Given values of x1,… xk , predict the value of Y.
Typical Data
• Data on 5000 motor vehicle insurance policies having at least one claim
• Variables are
– Y: log(amount of claim)
– x1: sex of policy holder
– x2: age of policy holder
– x3: age of car
– x4: car type (1-20 score, 1=Toyota Corolla, 20 = Porsche)
Exploratory Analysis
• Plot Y against other variables
• Scatterplot matrix
• Smooth as necessary
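A minimal sketch of these exploratory steps in R. The insurance data are not reproduced here, so the data frame below is simulated; the names logclaim, carage and holderage and the simulated relationship are assumptions made purely for illustration.

```r
# Simulated stand-in for the insurance data (hypothetical names and values)
set.seed(1)
n <- 200
carage    <- runif(n, 0, 20)
holderage <- runif(n, 18, 80)
logclaim  <- 6 + 0.05 * carage - 0.004 * carage^2 + rnorm(n)
dat <- data.frame(logclaim, carage, holderage)

# Scatterplot matrix of all variables, with a smooth drawn in each panel
pairs(dat, panel = panel.smooth)

# Plot Y against a single variable and overlay a lowess smooth
plot(dat$carage, dat$logclaim, xlab = "car age", ylab = "log claim")
sm <- lowess(dat$carage, dat$logclaim)
lines(sm)
```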
Log claims vs car age
The Model
• Relationship is modelled using the conditional distribution of Y given the covariates x1,… xk.
• Assume conditional distribution of Y is N(μ, σ²), where μ depends on the covariates.
The Model (2)
• If all covariates are “continuous”, then
$$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
• In addition, all Y’s are assumed independent.
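One way to make the model assumptions concrete is to simulate from them. The sketch below assumes two continuous covariates and made-up values for the coefficients and the error standard deviation; none of these numbers come from the slides.

```r
# Simulate from the model: Y given x is Normal with a mean linear in the covariates
set.seed(2)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
beta  <- c(1, 2, -3)  # hypothetical intercept and two slopes
sigma <- 0.5          # hypothetical error standard deviation

mu <- beta[1] + beta[2] * x1 + beta[3] * x2  # conditional mean of Y
y  <- rnorm(n, mean = mu, sd = sigma)        # independent draws, one per observation
```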
Estimation and Testing
• Estimate the β’s
• Estimate the error variance σ²
• Test if β’s = 0
• Check goodness-of-fit
Least Squares
Estimate the β’s by the values that minimize the sum of squares (least squares estimates, LSEs):
$$SS(b) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik} \right)^2$$
The minimizing values are the solution of the Normal Equations. The minimum value is the residual sum of squares (RSS).
σ² is estimated by RSS/(n-k-1).
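The least squares recipe can be checked numerically. The sketch below, on simulated data with assumed coefficients, solves the normal equations (X'X)b = X'y directly and compares the result with lm, which uses a QR decomposition internally rather than this formula.

```r
set.seed(3)
n <- 50
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # hypothetical true coefficients

X <- cbind(1, x1, x2)                      # design matrix with an intercept column
b <- solve(crossprod(X), crossprod(X, y))  # solve the normal equations (X'X) b = X'y

rss        <- sum((y - X %*% b)^2)         # minimum value of the sum of squares
sigma2_hat <- rss / (n - k - 1)            # estimate of the error variance

fit <- lm(y ~ x1 + x2)                     # same estimates from lm
```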
Goodness of Fit
• Goodness of fit measured by R²:
$$R^2 = 1 - \frac{RSS}{\text{Total SS}}, \qquad \text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
0 ≤ R² ≤ 1 (why?)
R² = 1 iff perfect fit (data all on a plane)
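A short sketch, on simulated data with assumed coefficients, that computes R² from its definition and compares it with the value reported by summary. It also hints at the bounds: the intercept-only model has RSS equal to the total SS, and adding covariates to a model with an intercept can only lower the RSS.

```r
set.seed(4)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)   # hypothetical true coefficients
fit <- lm(y ~ x1 + x2)

rss      <- sum(resid(fit)^2)        # residual sum of squares
total_ss <- sum((y - mean(y))^2)     # total sum of squares
r2       <- 1 - rss / total_ss

# RSS of the intercept-only model is the total SS, so 0 <= rss <= total_ss
rss0 <- sum(resid(lm(y ~ 1))^2)
```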
Prediction
• Y predicted by
$$\hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$$
where the hat indicates the LSE
• Standard errors: 2 kinds, one for the mean value of Y for a set of x’s, the other for an individual y for a particular set of x’s
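In R the two kinds of standard error correspond to the two interval types offered by predict for lm fits. The sketch below uses simulated data and a made-up new observation; both are assumptions for illustration.

```r
set.seed(5)
n   <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)   # hypothetical true coefficients
fit <- lm(y ~ x1 + x2, data = dat)

new <- data.frame(x1 = 0.5, x2 = -1)          # hypothetical set of x's

# Interval for the mean value of Y at these x's
conf_int <- predict(fit, newdata = new, interval = "confidence")
# Interval for an individual y at these x's (adds the error variance)
pred_int <- predict(fit, newdata = new, interval = "prediction")
```

Both calls return the same point prediction; the interval for an individual y is the wider of the two.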
Interpretation of Coefficients
• The LSE for variable xj is the amount we expect y to increase if xj is increased by a unit amount, assuming all the other x’s are held fixed
• The test of βj = 0 tests whether variable xj makes no contribution to the fit, given that all the other variables are in the model
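The "unit increase with the others held fixed" reading can be verified directly. In this sketch (simulated data, assumed coefficients, hypothetical covariate values) the difference between two predictions that differ by one unit in x1 equals the fitted coefficient of x1.

```r
set.seed(10)
n   <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)   # hypothetical true coefficients
fit <- lm(y ~ x1 + x2, data = dat)

base   <- data.frame(x1 = 0.3, x2 = 1.2)      # hypothetical covariate values
bumped <- data.frame(x1 = 1.3, x2 = 1.2)      # x1 increased by one unit, x2 held fixed
diff1  <- predict(fit, bumped) - predict(fit, base)

# The t value and Pr(>|t|) columns give the test of each coefficient being zero
coef_table <- summary(fit)$coefficients
```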
Checking Assumptions (1)
• Tools are residuals, fitted values and hat matrix diagonals
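• Fitted values
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$$
• Residuals
$$e = y - \hat{y}$$
• Hat matrix diagonals $h_{ii}$
$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \dots + h_{ii} y_i + \dots + h_{in} y_n$$
(Measure the effect of an observation on its fitted value)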
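A sketch of these three quantities in R on simulated data (the coefficients are assumptions). It builds the hat matrix from the standard formula H = X(X'X)^(-1)X', which the slides do not state explicitly, and compares the results with the built-in extractor functions.

```r
set.seed(6)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)   # hypothetical true coefficients
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                    # design matrix, including the intercept
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix: yhat = H y

yhat <- as.numeric(H %*% y)   # fitted values as weighted sums of the y's
e    <- y - yhat              # residuals
h    <- diag(H)               # hat matrix diagonals h_ii
```

The diagonals sum to the number of fitted coefficients (here 3), so an observation with h_ii well above the average has an unusually large effect on its own fitted value.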
Checking Assumptions (2)
Assumptions are
– Mean linear in the x’s (plot residuals v fitted values, partial residual plot, CERES plots)
– Constant variance (plot squared residuals v fitted values)
– Independence (time series plot, residuals v preceding)
– Normality/outliers (normal plot)
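The plots listed above can be sketched with base R as follows. The data are simulated with assumed coefficients, and the time series plot assumes the rows are stored in time order. CERES plots are not in base R and are left out of this sketch.

```r
set.seed(7)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 + dat$x2 + rnorm(n)   # hypothetical true coefficients
fit <- lm(y ~ x1 + x2, data = dat)

e <- resid(fit)
f <- fitted(fit)

plot(f, e, xlab = "fitted", ylab = "residual")             # mean linear in the x's
termplot(fit, partial.resid = TRUE)                        # partial residual plots
plot(f, e^2, xlab = "fitted", ylab = "squared residual")   # constant variance
plot(e, type = "l", ylab = "residual")                     # time series plot
plot(e[-n], e[-1], xlab = "preceding residual", ylab = "residual")  # residuals v preceding
qqnorm(e)                                                  # normal plot
qqline(e)
```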
Remedial Action
• Transform variables
• Delete outliers
• Weighted least squares
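As an illustration of the last remedy, here is a minimal weighted least squares sketch. It assumes, purely for the example, that the error standard deviation is proportional to x, so weights proportional to 1/x² are appropriate; in practice the weights have to be chosen from knowledge of the data.

```r
set.seed(8)
n <- 100
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error sd proportional to x (an assumption)
dat <- data.frame(x, y)

ols <- lm(y ~ x, data = dat)                    # ordinary least squares
wls <- lm(y ~ x, data = dat, weights = 1 / x^2) # weights = 1 / variance, up to a constant
```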
Software
• SAS: PROC REG, PROC GLM
• S-Plus, R: lm
• Usage: lm(model formula, dataframe, weights, …)
Model Formula
• Assume k=3
• If x1, x2, x3 all continuous, fit a plane
Y ~ x1 + x2 + x3
• If x1 categorical (e.g. gender) and x2, x3 continuous, fit a different plane/curve in x2, x3 for each level of x1:
Y ~ x1 + x2 + x3 (planes parallel)
Y ~ x1 + x2 + x3 + x1:x2 + x1:x3 (planes different)
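The two formulas can be compared on simulated data. In the sketch below x1 is a two-level factor and the data-generating coefficients are assumptions; the parallel-planes model has one extra intercept shift, while the different-planes model also has a separate slope adjustment for x2 and x3.

```r
set.seed(9)
n   <- 120
dat <- data.frame(
  x1 = factor(sample(c("F", "M"), n, replace = TRUE)),  # categorical covariate
  x2 = rnorm(n),
  x3 = rnorm(n)
)
dat$Y <- 1 + (dat$x1 == "M") + 2 * dat$x2 - dat$x3 + rnorm(n)  # hypothetical truth

parallel  <- lm(Y ~ x1 + x2 + x3, data = dat)                  # planes parallel
different <- lm(Y ~ x1 + x2 + x3 + x1:x2 + x1:x3, data = dat)  # planes different

# F test of whether the interaction terms improve the fit
cmp <- anova(parallel, different)
```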
Insurance Example (1)
> cars.lm<-lm(logad~poly(CARAGE,2)+PRIMAGEN+gender)
> summary(cars.lm)

Call:
lm(formula = logad ~ poly(CARAGE, 2) + PRIMAGEN + gender)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9713 -0.4610  0.2376  0.8092  3.9767

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       5.986329   0.077533  77.210  < 2e-16 ***
poly(CARAGE, 2)1 -7.308946   1.229095  -5.947 2.92e-09 ***
poly(CARAGE, 2)2 -8.038865   1.232416  -6.523 7.58e-11 ***
PRIMAGEN          0.004014   0.001339   2.999  0.00272 **
gender            0.015633   0.041474   0.377  0.70624
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.226 on 4995 degrees of freedom
Multiple R-Squared: 0.01611, Adjusted R-squared: 0.01532
F-statistic: 20.45 on 4 and 4995 DF, p-value: < 2.2e-16
Insurance Example (2)
> plot(cars.lm)