MRC © Hastie & Tibshirani, January 23, 2000
Generalized Additive Models
Outline
• Review of generalized additive models
• Example: Dynamic CAPM model
• Example: Kyphosis in young children
• Example: Additive Multinomial Choice Model
• Fitting GAMs in Splus
Additive Models
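$$Y = \alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p) + \varepsilon$$
• $Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (linear model)
• $Y = \alpha + \{\beta_{11} x_1 + \beta_{12} x_1^2\} + \beta_2 \log(x_2) + \sum_j \beta_{3j}\, 1(c_j < x_3 \le c_{j+1}) + \varepsilon$
• $Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + f_3(x_3) + \varepsilon$ (semiparametric model)
• $Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + f_3(x_3) + f_4(x_4) + \varepsilon$
• $Y = \alpha + \beta_1 x_1 + f_2(x_2) + f_{34}(x_3, x_4) + \varepsilon$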
A complicated additive model
[Figure: schematic of the model Y = f1(x1) + f2(x2) + f34(x3, x4) + f5(x5) + f67(x6, x7) + f89(x8, x9) + error, with one panel per term; f34 is drawn separately for x4 = 1 and x4 = 2.]
The price for additivity
[Figure: two surfaces over age and base deficit: the additive fit f1(age) + f2(base deficit) and the general surface f(age, base deficit).]
For each value of age, the function of base deficit has the same shape; only the level differs. And vice versa.
Generalized Additive Models
Logistic Regression
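A model for 0/1 data, as in classification problems with 2 classes. $P(Y \mid X) = \mathrm{Binomial}(1, p(X))$
$$\operatorname{logit} p(x) = \log\left(\frac{p(x)}{1 - p(x)}\right) = \alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)$$
or
$$p(x) = \frac{e^{\alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)}}{1 + e^{\alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)}}$$
Binomial variance:
$$\mathrm{Var}(Y \mid x) = p(x)(1 - p(x))$$
Log-likelihood:
$$\ell(p) = \sum_i \{ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \}$$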
Other Examples
• Poisson regression: $\log(\mu(x)) = \alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)$
• discrete choice models: survey response data
• resistant additive models: tapered likelihoods
• semiparametric additive models for designed experiments
• additive decomposition of time series
• additive autoregression models
• varying coefficient models: $\eta(x, t) = \alpha(t) + x_1 \beta_1(t) + x_2 \beta_2(t)$
Seasonal-Trend Decomposition
[Figure: four stacked panels (Data, Trend, Seasonal, Remainder) plotted against year, 1960 to 1980.]
Data are monthly CO2 measurements recorded at Mauna Loa, Hawaii.
Fitting Additive Models
$$Y = f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p) + \varepsilon$$
Estimating Equations
$$f_1(x_1) = S_1(y - \bullet - f_2(x_2) - \cdots - f_p(x_p))$$
$$f_2(x_2) = S_2(y - f_1(x_1) - \bullet - \cdots - f_p(x_p))$$
$$\vdots$$
$$f_p(x_p) = S_p(y - f_1(x_1) - f_2(x_2) - \cdots - \bullet)$$
where $S_j$ are:
• univariate regression smoothers such as smoothing splines, lowess, kernel
• linear regression operators yielding polynomial fits, piecewise constant fits, parametric spline fits, etc.
• more complicated operators such as surface smoothers for 2nd order interactions
We use Gauss-Seidel or "backfitting" to solve these estimating equations.
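The backfitting cycle above can be sketched in a few lines. This is a minimal illustration in Python, not the S-PLUS implementation: the Gaussian-kernel smoother, its bandwidth, the intercept handling (alpha set to the mean of y, each f_j centred), and the simulated data are all assumptions made for the example.

```python
import math
import random


def kernel_smooth(x, r, bandwidth=0.3):
    """Gaussian-kernel smoother: plays the role of the operator S_j applied to r."""
    fitted = []
    for xi in x:
        weights = [math.exp(-0.5 * ((xi - xk) / bandwidth) ** 2) for xk in x]
        fitted.append(sum(w * rk for w, rk in zip(weights, r)) / sum(weights))
    return fitted


def backfit(columns, y, n_iter=10):
    """Gauss-Seidel backfitting for Y = alpha + sum_j f_j(x_j) + error."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]            # current estimates f_j(x_ij)
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # partial residual: y minus alpha and every function except f_j
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            fj = kernel_smooth(xj, partial)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]    # centre f_j so alpha stays identifiable
    return alpha, f


# Simulated data (hypothetical): y = sin(x1) + 0.5 * x2^2 + noise
random.seed(0)
n = 150
x1 = [random.uniform(-2, 2) for _ in range(n)]
x2 = [random.uniform(-2, 2) for _ in range(n)]
y = [math.sin(a) + 0.5 * b * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
ybar = sum(y) / n
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
tss = sum((yi - ybar) ** 2 for yi in y)
print("fraction of variance left unexplained:", rss / tss)
```

Any of the smoothers listed above could replace `kernel_smooth`; the loop itself does not depend on which operator is used.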
Justification?
Example: penalized least squares
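Minimize
$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int (f_j''(t))^2 \, dt$$
$$\Downarrow$$
$$f_1 = S_1(\lambda_1)\Big( y - \sum_{j \ne 1} f_j \Big)$$
$$f_2 = S_2(\lambda_2)\Big( y - \sum_{j \ne 2} f_j \Big)$$
$$\vdots$$
$$f_p = S_p(\lambda_p)\Big( y - \sum_{j \ne p} f_j \Big)$$
where $S_j(\lambda_j)$ denotes a smoothing spline using variable $x_j$ and penalty coefficient $\lambda_j$. Backfitting converges to the minimizer.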
Fitting generalized additive models
Logistic Regression
• Compute starting values: $f_j^{old}$ and $\eta^{old} = \sum_j f_j^{old}(x_j)$, e.g. using linear logistic regression
• Iterate
– construct adjusted dependent variable
$$z_i = \eta_i^{old} + \frac{y_i - p_i^{old}}{p_i^{old}(1 - p_i^{old})}$$
– construct weights $w_i = p_i^{old}(1 - p_i^{old})$
– compute $\eta^{new} = A_w z$, the weighted additive model fit to $z$.
• Stop when functions don't change
This is a Newton-Raphson algorithm for a penalized log-likelihood problem.
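The iteration above can be sketched as follows. This is an illustrative Python version under stated assumptions: the weighted additive fit $A_w z$ is done by a few backfitting passes with a weighted Gaussian-kernel smoother, the starting value is a constant logit rather than a linear logistic fit, a fixed number of outer iterations stands in for the convergence check, and the data are simulated.

```python
import math
import random


def sigmoid(e):
    e = max(min(e, 20.0), -20.0)                # keep p strictly inside (0, 1)
    return 1.0 / (1.0 + math.exp(-e))


def weighted_smooth(x, r, w, bandwidth=0.5):
    """Weighted Gaussian-kernel smoother of r against x."""
    out = []
    for xi in x:
        k = [wk * math.exp(-0.5 * ((xi - xk) / bandwidth) ** 2) for xk, wk in zip(x, w)]
        out.append(sum(kk * rk for kk, rk in zip(k, r)) / sum(k))
    return out


def weighted_additive_fit(columns, z, w, f, n_backfit=3):
    """A_w z: weighted backfitting of the adjusted dependent variable z."""
    n, sw = len(z), sum(w)
    alpha = sum(wi * zi for wi, zi in zip(w, z)) / sw
    for _ in range(n_backfit):
        for j, xj in enumerate(columns):
            partial = [z[i] - alpha - sum(f[k][i] for k in range(len(f)) if k != j)
                       for i in range(n)]
            fj = weighted_smooth(xj, partial, w)
            centre = sum(wi * v for wi, v in zip(w, fj)) / sw
            f[j] = [v - centre for v in fj]
    return alpha, f


def local_scoring(columns, y, n_outer=6):
    n = len(y)
    f = [[0.0] * n for _ in columns]
    ybar = sum(y) / n
    eta = [math.log(ybar / (1 - ybar))] * n     # assumed start: constant logit
    for _ in range(n_outer):
        p = [sigmoid(e) for e in eta]
        w = [pi * (1 - pi) for pi in p]                                 # weights w_i
        z = [e + (yi - pi) / wi for e, yi, pi, wi in zip(eta, y, p, w)]  # z_i
        alpha, f = weighted_additive_fit(columns, z, w, f)
        eta = [alpha + sum(fj[i] for fj in f) for i in range(n)]
    return eta


def deviance(y, p):
    return -2.0 * sum(math.log(pi) if yi else math.log(1 - pi) for yi, pi in zip(y, p))


# Simulated 0/1 data (hypothetical true logit: 2 sin(1.5 x1) - x2)
random.seed(1)
n = 150
x1 = [random.uniform(-2, 2) for _ in range(n)]
x2 = [random.uniform(-2, 2) for _ in range(n)]
y = [1 if random.random() < sigmoid(2 * math.sin(1.5 * a) - b) else 0
     for a, b in zip(x1, x2)]

eta_hat = local_scoring([x1, x2], y)
dev_fit = deviance(y, [sigmoid(e) for e in eta_hat])
dev_null = deviance(y, [sum(y) / n] * n)
print("null deviance:", dev_null, "fitted deviance:", dev_fit)
```

The outer loop is the Newton-Raphson step; the inner call is an ordinary weighted additive fit, so any backfitting code can be reused there.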
Example: Kyphosis Case Study
This is a medical example, and illustrates the power of additive models in relatively small sample scenarios. Logistic regression models are used frequently with economic and financial data.
84 children at the Toronto Hospital for Sick Children underwent laminectomy, a corrective spinal surgery for a variety of abnormalities under the general heading kyphosis.
Results: 65 successes, 19 kyphosis still present.
Goal: Try to understand/predict whether the operation will be successful.
[Figure: boxplots of age in months, number of levels, start level, and end level, each split by outcome (kyphosis, no kyphosis).]
[Figure: scatterplot matrix of age in months, number of levels, and start level; the two outcome groups are drawn with open and solid symbols.]
Initial Model
$$\log \frac{P(X)}{1 - P(X)} = f_1(\mathrm{Age}) + f_2(\mathrm{Number}) + f_3(\mathrm{Start})$$
[Figure: fitted functions s(age, 3), s(number, 3), and s(start, 3), plotted against age, number, and start.]
Analysis of Deviance
GAM models have to be assessed for overfitting, even more carefully than for parametric models. Do we believe the upswing in the function for Number?
We can compare a series of nested models using the change in deviance, just like we do for linear models.

Analysis of deviance table for kyphosis data

Model                        Dev    df     Δ Dev   Δ df   P-value
null (81 observations)       83.2   80.0
s(age)+s(number)+s(start)    40.8   68.2   42.4    11.8   0.00
s(age)+number+s(start)       46.3   71.1    5.5     2.9   0.13
s(age)+s(start)              48.4   72.1    2.1     1.0   0.14
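The comparison of successive rows can be sketched in Python as below. It takes the printed deviances and degrees of freedom, forms the changes between consecutive models, and refers each change to a chi-square distribution with a possibly non-integer df. The helper `chi2_sf` is written for this example (a series expansion of the incomplete gamma function) and is not part of the GAM software. Because the table entries are rounded, the computed tail probabilities need not agree with the P-value column to the last digit.

```python
import math


def chi2_sf(stat, df, max_terms=1000, tol=1e-14):
    """Upper tail P(chi^2_df > stat), valid for non-integer df."""
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)                     # next term of the series
        total += term
        if term < tol * total:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# (model, residual deviance, residual df) as printed in the table
rows = [
    ("null", 83.2, 80.0),
    ("s(age)+s(number)+s(start)", 40.8, 68.2),
    ("s(age)+number+s(start)", 46.3, 71.1),
    ("s(age)+s(start)", 48.4, 72.1),
]

results = []
for (m0, dev0, df0), (m1, dev1, df1) in zip(rows, rows[1:]):
    d_dev = abs(dev0 - dev1)                    # change in deviance
    d_df = abs(df0 - df1)                       # change in degrees of freedom
    results.append((m1, d_dev, d_df, chi2_sf(d_dev, d_df)))

for model, d_dev, d_df, p in results:
    print(f"{model:28s} dDev={d_dev:5.1f}  ddf={d_df:5.1f}  p={p:.3f}")
```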
Likelihood Inference: Linear Models
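For linear models
$$H_0: \mathrm{Logit}(P(X, Z)) = \beta^T X \qquad (p \text{ parameters})$$
$$H_1: \mathrm{Logit}(P(X, Z)) = \beta^T X + \gamma^T Z \qquad (p + q \text{ parameters})$$
Under $H_0$
$$2\,[LL_{max}(H_1) - LL_{max}(H_0)] \overset{n \to \infty}{\sim} \chi^2_q$$
For Bernoulli data
$$\mathrm{Dev}(Y, H_i) \overset{def}{=} -2\, LL_{max}(H_i)$$
$$\mathrm{Dev}(H_1, H_0) = \mathrm{Dev}(Y, H_0) - \mathrm{Dev}(Y, H_1) \overset{n \to \infty}{\sim} \chi^2_q$$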
Likelihood Inference: Additive Models
$$H_0: \mathrm{Logit}(P(X, Z)) = s_1(X)$$
$$H_1: \mathrm{Logit}(P(X, Z)) = s_1(X) + s_2(Z)$$
$$\mathrm{Dev}(H_1, H_0) \overset{\text{approx.}}{\sim} \chi^2_{df}$$
$$df = E_{H_0} \mathrm{Dev}(H_1, H_0) \approx \mathrm{tr}(S_Z) - 1$$
where $S_Z$ is the weighted smoothing operator used in fitting $s_2(Z)$.
Similar approximations using smoother matrices give standard error estimates for curves.
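The approximation df ≈ tr(S) − 1 can be illustrated with explicit smoother matrices. The Python sketch below is an unweighted illustration only: it builds the matrix of a Gaussian-kernel smoother (an assumed stand-in for the weighted operator $S_Z$) and, for comparison, the hat matrix of a straight-line fit, whose trace is 2, so that it has exactly 1 df. The grid of x values and the bandwidths are made up.

```python
import math


def kernel_smoother_matrix(x, bandwidth):
    """Rows of S: S[i][k] is the weight given to y_k in the smooth at x_i."""
    S = []
    for xi in x:
        k = [math.exp(-0.5 * ((xi - xk) / bandwidth) ** 2) for xk in x]
        total = sum(k)
        S.append([v / total for v in k])
    return S


def linear_fit_matrix(x):
    """Hat matrix of simple linear regression (intercept plus slope)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [[1.0 / n + (xi - xbar) * (xk - xbar) / sxx for xk in x] for xi in x]


def approx_df(S):
    """df = tr(S) - 1; the 1 accounts for the constant term."""
    return sum(S[i][i] for i in range(len(S))) - 1.0


x = [i / 10.0 for i in range(50)]
df_linear = approx_df(linear_fit_matrix(x))            # straight line: 1 df
df_smooth = approx_df(kernel_smoother_matrix(x, 0.5))  # wide bandwidth, few df
df_rough = approx_df(kernel_smoother_matrix(x, 0.1))   # narrow bandwidth, many df
print(df_linear, df_smooth, df_rough)
```

A narrower bandwidth puts more weight on each point's own response, which raises the trace and hence the df charged to the term.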
Final Model
[Figure: top row, nonparametric additive fit, with panels s(age, 3) against age and s(start, 3) against start; bottom row, parametric approximation, with panels poly(age, 2) against age and (start - 12) * I(start > 12) against start.]
Alternative Representation
[Figure: left panel, fitted prevalence (0.0 to 1.0) against age for start < 12 and start = 14; right panel, contours of fitted prevalence (0.5, 0.3, 0.1) over age and start, with the data points overlaid.]
Example: Discrete Choice Models
Binary and Multinomial logit models are used to determine the utility function used by a subset of the population in selecting from two or more choices of services.
For example, the following might be 3 scenarios presented in a survey to assess the utility for electrical services choices. The respondent reads the configurations and makes a choice A, B or C.

     Service Coverage                     Initial Cost   Monthly Fee   Winter Rates (cents/kWh)   Summer Rates (cents/kWh)
A    Service Drop, Meter                  $400           $5            15                         30
B    None                                 $300           $20           12.5                       25
C    Service Drop, Meter, Inside Wiring   $300           $20           15                         25

Each respondent is presented with a different set of three alternatives, mixed and matched in a balanced but random way.
Multinomial Choice Model
Individual $i$ is presented with a set of $K$ alternatives, each described in terms of a vector of $p$ variables $x_{ij}$, $j = 1, \ldots, K$. Based on a utility theory, we model the probability of choice as
$$P(\text{Choice} = k \mid \text{set of } K \text{ alts}) = \frac{\exp(\beta^T x_{ik})}{\sum_{\ell=1}^{K} \exp(\beta^T x_{i\ell})}$$
This is the linear multinomial choice model, and the parameters $\beta$ are fit by maximizing the conditional log-likelihood:
$$\ell(\beta) = \sum_{i=1}^{N} \log\left[ \frac{\exp(\beta^T x_{iC_i})}{\sum_{\ell=1}^{K} \exp(\beta^T x_{i\ell})} \right]$$
where $C_i$ is the choice made by respondent $i$.
See "Discrete Choice Analysis" by Ben-Akiva and Lerman, MIT Press, 1985.
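The choice probability and the conditional log-likelihood above translate directly into code. The Python sketch below is illustrative: the numeric coding of the three services, the subset of variables used, the coefficient vector `beta`, and the two recorded choices are all made up for the example.

```python
import math


def choice_probabilities(beta, alternatives):
    """P(choice = k | set) = exp(beta'x_k) / sum_l exp(beta'x_l)."""
    utilities = [sum(b * v for b, v in zip(beta, x)) for x in alternatives]
    top = max(utilities)                        # subtract the max for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def conditional_log_likelihood(beta, choice_sets, choices):
    """l(beta) = sum_i log P(C_i | alternatives shown to respondent i)."""
    return sum(
        math.log(choice_probabilities(beta, alts)[c])
        for alts, c in zip(choice_sets, choices)
    )


# Hypothetical coding of services A, B, C as
# (initial cost in $100s, monthly fee in $10s, summer rate in 10 cents/kWh).
services = [(4.0, 0.5, 3.0), (3.0, 2.0, 2.5), (3.0, 2.0, 2.5)]
beta = [-0.5, -1.0, -0.8]                       # made-up coefficients

probs = choice_probabilities(beta, services)
# Two hypothetical respondents who saw the same set and chose A and C.
loglik = conditional_log_likelihood(beta, [services, services], [0, 2])
print(probs, loglik)
```

Fitting $\beta$ then amounts to maximizing `conditional_log_likelihood` over `beta` with any numerical optimizer.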
Simulation: Electrical Survey
Our data are proprietary, so we have simulated some survey results. In the figure below, the green curves represent hypothetical true utilities $f_j(x_j)$ for the $j$th variable, and the blue curves those estimated from our linear model.
The total utility for an alternative described by variables $x$ is $\eta(x) = \sum_j f_j(x_j)$. A "respondent" is faced with $K = 3$ alternatives $x_\ell$, $\ell = 1, \ldots, K$, and picks alternative $k$ with probability
$$\exp(\eta(x_k)) \Big/ \sum_{\ell} \exp(\eta(x_\ell))$$
• In practice respondents can see different numbers of alternatives
• Although it is usual practice to offer alternatives for each variable from a very restricted range, it is not necessary, and inhibits the fitting of nonlinear utility functions.
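A simulation of this kind can be sketched in Python as follows. The utility functions `f_fee` and `f_cost`, the two variables used, their ranges, and the sample size are hypothetical; the slides do not give the true utilities used in their simulation.

```python
import math
import random


# Hypothetical "true" utility contributions f_j(x_j).
def f_fee(fee):                                 # monthly fee in dollars
    return -0.1 * fee


def f_cost(cost):                               # initial cost in dollars, deliberately nonlinear
    return -((cost / 200.0) ** 2)


def eta(alt):
    """Total utility eta(x) = sum_j f_j(x_j) of one alternative."""
    fee, cost = alt
    return f_fee(fee) + f_cost(cost)


def simulate_choice(alternatives, rng):
    """Pick alternative k with probability exp(eta(x_k)) / sum_l exp(eta(x_l))."""
    weights = [math.exp(eta(a)) for a in alternatives]
    return rng.choices(range(len(alternatives)), weights=weights, k=1)[0]


rng = random.Random(1)
survey = []
for _ in range(1000):
    # each simulated respondent sees K = 3 randomly configured alternatives,
    # with each variable drawn from a wide range
    alts = [(rng.uniform(5, 20), rng.uniform(0, 400)) for _ in range(3)]
    survey.append((alts, simulate_choice(alts, rng)))

# Share of respondents who picked the alternative with the highest utility.
share_best = sum(
    choice == max(range(3), key=lambda k: eta(alts[k])) for alts, choice in survey
) / len(survey)
print("share choosing the highest-utility alternative:", share_best)
```

Drawing each variable from a wide range, as here, is what makes it possible to recover a nonlinear $f_j$ from the simulated choices.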
Linear Multinomial Choice Model
The green curves represent hypothetical true utilities $f_j(x_j)$ for the $j$th variable, and the blue curves those estimated using the linear model.
[Figure: four panels, one per variable: Monthly Fee ($), Line Service (axis labels none, d/m/wire), Initial Cost ($), and Summer Rate (c/kwh).]
Generalized Additive Models
Using generalized additive models, we can model the utility contributions $f_j(x_j)$ non-parametrically.
The penalized conditional log-likelihood might look like
$$\ell_p(\eta) = \sum_{i=1}^{N} \log\left[ \frac{\exp(\eta(x_{iC_i}))}{\sum_{\ell=1}^{K} \exp(\eta(x_{i\ell}))} \right] + J(\eta)$$
where $C_i$ is the choice made by respondent $i$. Here $\eta(x) = \sum_j f_j(x_j)$ and $J(\eta)$ imposes smoothness constraints on the functions $f_j$ implicit in $\eta$.
Additive Multinomial Choice Model
The green curves represent hypothetical true utilities $f_j(x_j)$ for the $j$th variable, and the blue curves those estimated from the simulated data using our generalized additive model.
[Figure: four panels, one per variable: Monthly Fee ($), Line Service (axis labels none, d/m/wire), Initial Cost ($), and Summer Rate (c/kwh).]
Fitting GAMs in Splus
The Splus language and environment for data analysis, modeling and graphics comes with tools for fitting
• Linear and anova models
• Generalized linear and generalized additive models
• Local regression
• Tree based models
• General nonlinear models
• Time series and spatial models
• Survival models
In addition, a wide variety of contributed software is available for fitting neural network models, nearest neighbor methods, discriminant analysis, and almost any other model used in practice.
What tools are available?
Besides a rich functional programming language for graphics and statistical computing, special tools are available for modeling:
• Symbolic formula language
• Data frames
• Software (lm(), aov(), glm(), gam(), ...)
• Classes and Methods
GAMs and Formulas
gam(NOx ~ C + s(E))
• reads "NOx is modeled as C + s(E)"
• the first term is linear in C
• the second term is nonparametric in E, to be fit using a smoothing spline and a default amount of smoothing
More formulas
NOx ~ s(C) + s(E, df=6)
NOx ~ s(C) + lo(E, degree=2, span=.5)
NOx ~ C + bs(E, knots=c(0.75,0.9,1.0))
log(NOx) ~ C * poly(E,4)
NOx ~ lo(C, E)
Each term in a formula y ~ a + b can refer to:
• numeric vector
• factor or logical vector
• matrix
• an expression that evaluates to one of the above
gam objects
eth1 <- gam(NOx ~ C + lo(E, degree=2))
> eth1
Call:
gam(formula = NOx ~ C + lo(E, degree = 2))
Degrees of Freedom: 88 total; 80.1 Residual
Residual Deviance: 5.2
> plot(eth1, se=T, residuals=T)
[Figure: output of the plot command, one panel per term: the linear term in C plotted against C, and lo(E, degree = 2) plotted against E, each with standard-error bands and residuals.]
Classes and Methods
kyph1 <- gam(Kyphosis ~ s(Age) + s(Start),
family=binomial)
> class(kyph1)
[1] "gam" "glm" "lm"
Functions like print() and plot() are generic. Some other generics are:
• summary(kyph1)
• update(kyph1, . ~ . + Number)
• predict(eth1, newdata, se=T)
• anova(kyph1, kyph2, kyph3)
• step(kyph1, scope)
• and extractors such as fitted(), residuals(), and family()
Practical Features of GAM Software
• Mixes parametric and nonparametric models in a transparent way:
– deviance and degrees of freedom (df) for all terms (for nonparametric terms we use df = tr(S) − 1).
– The df can be used to specify the amount of smoothing for any given term. E.g. s(Age, df=1) implies linear, s(Age, df=6) a rather flexible smooth curve.
– Often parametric functions or transformations will be suggested by the additive terms, and subsequent analysis can be done in parametric mode; e.g. . . .
[Figure: top row, the additive fit s(Age) + s(Start), with panels s(Age) against Age and s(Start) against Start; bottom row, the parametric fit poly(Age, 2) + I((Start - 10) * (Start > 10)), with panels poly(Age, 2) against Age and I((Start - 10) * (Start > 10)) against Start. Points are overlaid on each curve.]