Additive Models, Trees, and Related Methods
2006. 02. 17.
Partly based on Prof. Prem Goel’s Slides
9.1 Generalized Additive Models
Mean function: $\mu(X) = E(Y \mid X)$
$f_j$: unspecified smooth (nonparametric) functions
Relate the conditional mean of Y to an additive function of the X's via a link function g:

$g[\mu(X)] = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
Standard Link Functions
Advanced Link Functions
Fitting Additive Models
Fit each $f_j$ using a scatterplot smoother, estimating all p functions simultaneously; for example, use the cubic smoothing spline as the smoother.
Criterion: the penalized sum of squares (9.7)

$\mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N}\Big(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$

An additive cubic spline model minimizes this: each $f_j$ is a cubic spline in the component $X_j$, with knots at each of the unique values $x_{ij}$.

Model: $Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon$
The backfitting algorithm
• Can accommodate other fitting methods in the same way, by specifying an appropriate smoothing operator $S_j$.
• For a large class of linear smoothers, backfitting is equivalent to a Gauss-Seidel algorithm.
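The backfitting cycle can be sketched in plain Python. Here a k-nearest-neighbour running mean stands in for the smoother $S_j$ (a cubic smoothing spline in the text); `running_mean_smoother` and `backfit` are illustrative names, not library functions.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth residuals r against predictor x with a k-NN running mean."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - k // 2)
        hi = min(n, lo + k)
        neighbours = order[lo:hi]
        fitted[i] = sum(r[j] for j in neighbours) / len(neighbours)
    return fitted

def backfit(X, y, n_iter=20, k=5):
    """Backfitting: X is a list of p predictor lists, y the responses.
    Returns the intercept alpha and fitted function values f[j][i]."""
    n, p = len(y), len(X)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals: remove the intercept and all other f_m
            r = [y[i] - alpha - sum(f[m][i] for m in range(p) if m != j)
                 for i in range(n)]
            f[j] = running_mean_smoother(X[j], r, k)
            # centre f_j so the intercept stays identifiable
            mean_fj = sum(f[j]) / n
            f[j] = [v - mean_fj for v in f[j]]
    return alpha, f
```

Each pass smooths the partial residuals against one predictor at a time, which is exactly the Gauss-Seidel-style coordinate update noted above.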
Additive Logistic Regression
For the logistic regression model and other generalized additive models, the appropriate criterion is a penalized log-likelihood.
To maximize it, the backfitting procedure is used in conjunction with a likelihood maximizer.
Local Scoring Algorithm for the Additive Logistic Regression
$\log \dfrac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} = \alpha + \sum_{j=1}^{p} f_j(X_j)$
9.2 Tree-Based Methods
Partition the feature space into a set of rectangles and fit a simple model in each one.
CART and C4.5
Regression Tree
Assume a recursive binary partition. In each partition, Y is modeled with a different constant:

$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$

For each split, choose the variable $j$ and split-point $s$ that minimize the sum of squares:

$\min_{j,\,s}\Big[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\Big]$

Repeat with each subset until a minimum node size is reached.
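The greedy split search amounts to an exhaustive scan over variables and split points, with the optimal constants being the node means. A minimal sketch (the function name and data layout are assumptions, not from any library):

```python
def best_split(X, y):
    """Find the split (j, s) minimising the summed within-node squared
    error. X is a list of N rows with p columns; y the responses.
    Returns (score, j, s)."""
    best = None
    n, p = len(y), len(X[0])
    for j in range(p):
        values = sorted(set(row[j] for row in X))
        for s in values[:-1]:  # exclude the max so both children are non-empty
            left = [y[i] for i in range(n) if X[i][j] <= s]
            right = [y[i] for i in range(n) if X[i][j] > s]
            # the optimal constants c1, c2 are the node means
            c1 = sum(left) / len(left)
            c2 = sum(right) / len(right)
            score = (sum((v - c1) ** 2 for v in left)
                     + sum((v - c2) ** 2 for v in right))
            if best is None or score < best[0]:
                best = (score, j, s)
    return best
```

A real implementation sorts each predictor once and updates the sums incrementally, which is where the $pN\log N$ cost in Section 9.7 comes from.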
Regression Tree
How large should we grow the tree? Use cost-complexity pruning: find the subtree that minimizes

$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|, \qquad Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$

Choose $\alpha$ adaptively by weakest-link pruning:
Successively collapse the internal node that produces the smallest per-node increase in RSS, until reaching the single-node tree.
Among this sequence of trees there exists a tree that minimizes the cost-complexity criterion.
Choose $\alpha$ by cross-validation.
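As a minimal sketch of the criterion, assume each terminal node m is summarized by a pair $(N_m, Q_m)$ (the data layout is an assumption for illustration):

```python
def cost_complexity(leaf_stats, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|, where leaf_stats is
    a list of (N_m, Q_m) pairs, one per terminal node of the subtree T."""
    return sum(n_m * q_m for n_m, q_m in leaf_stats) + alpha * len(leaf_stats)
```

Larger $\alpha$ penalizes tree size more heavily, so the minimizing subtree shrinks as $\alpha$ grows.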
Classification Trees
The only changes are in the criteria used to split nodes and prune the tree.
Node Impurity Measures
• Cross-entropy and Gini index are more sensitive to changes in the node probabilities than the misclassification rate.
• Either the cross-entropy or the Gini index should be used when growing the tree.
• When pruning, any of the three can be used.
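For a two-class node with class-1 proportion p, the three impurity measures can be computed directly (a sketch; the function name is illustrative):

```python
from math import log

def impurities(p):
    """Misclassification rate, Gini index, and cross-entropy for a
    two-class node with class-1 proportion p."""
    misclass = min(p, 1 - p)
    gini = 2 * p * (1 - p)
    entropy = 0.0 if p in (0.0, 1.0) else -(p * log(p) + (1 - p) * log(1 - p))
    return misclass, gini, entropy
```

All three peak at p = 0.5; the Gini index and cross-entropy are differentiable and strictly concave, which is why they respond to small changes in node probabilities that leave the piecewise-linear misclassification rate unchanged.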
Other Issues
Instability: the hierarchical process means an error in an upper split is propagated down; bagging can help.
Lack of smoothness in the prediction surface: can degrade performance in regression; MARS addresses this.
ROC curves: by varying the relative sizes of the losses L01 and L10 in the loss matrix, we can increase or decrease the sensitivity and specificity.
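Sweeping the decision threshold on a score traces the ROC curve, which mirrors varying the L01/L10 ratio. A sketch under the assumption that higher scores indicate class 1 (names are illustrative):

```python
def roc_points(scores, labels):
    """Return (sensitivity, specificity) pairs, one per distinct
    threshold; labels are 0/1, and class 1 is predicted when
    score >= threshold."""
    pts = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((tp / (tp + fn), tn / (tn + fp)))
    return pts
```

Raising the threshold (equivalently, penalizing false positives more via L01) trades sensitivity for specificity.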
9.3 PRIM: Bump Hunting
Patient Rule Induction Method: seeks boxes in which the response average is high. Splits are not binary.
The resulting collection of rules is hard to interpret, although each individual rule is simpler.
"Patient": PRIM does not fragment the data as quickly as binary partitioning, which can help the top-down greedy algorithm find a better solution.
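A one-dimensional caricature of PRIM's peeling step: remove a fraction alpha of points from whichever end of the current interval leaves the higher mean response. The name, stopping rule, and data layout are illustrative; real PRIM peels faces of p-dimensional boxes and follows peeling with a pasting (expansion) pass.

```python
def prim_peel_1d(x, y, alpha=0.1, min_points=5):
    """Peel a 1-D box [lo, hi] patiently: at each step drop the alpha
    fraction of points from the end whose removal raises the box mean
    more. Returns (box_low, box_high, box_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    lo, hi = 0, len(x)

    def box_mean(a, b):
        return sum(y[order[i]] for i in range(a, b)) / (b - a)

    while hi - lo > min_points:
        k = max(1, int(alpha * (hi - lo)))
        if box_mean(lo + k, hi) >= box_mean(lo, hi - k):
            lo += k   # peel from the low end
        else:
            hi -= k   # peel from the high end
    return x[order[lo]], x[order[hi - 1]], box_mean(lo, hi)
```

Because each step removes only a small fraction of the data, the search has many chances to correct course, unlike a single greedy binary split.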
PRIM
9.4 MARS: Multivariate Adaptive Regression Splines
Basic element: a pair of piecewise linear basis functions, $(X_j - t)_+$ and $(t - X_j)_+$.
Form such a reflected pair for each input $X_j$, with knots at each observed value of that input: $2Np$ basis functions in total.

$C = \{\, (X_j - t)_+, \; (t - X_j)_+ \;:\; t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}, \; j = 1, \ldots, p \,\}$
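Generating the candidate set is straightforward; here duplicate knot values within a column are collapsed with a set, and the function names are illustrative:

```python
def reflected_pair(t):
    """Return the hinge pair with knot t: x -> (x - t)_+ and x -> (t - x)_+."""
    pos = lambda u: u if u > 0 else 0.0
    return (lambda x: pos(x - t), lambda x: pos(t - x))

def mars_candidate_basis(X):
    """X: list of N rows with p columns. Returns one (j, t, plus, minus)
    entry per (input, distinct knot) combination -- up to 2*N*p basis
    functions in total."""
    basis = []
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            plus, minus = reflected_pair(t)
            basis.append((j, t, plus, minus))
    return basis
```

The forward pass then greedily multiplies members of this pool into the model, in place of CART's splitting.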
Model Building
Forward Selection
General Basis Selection Rule
Backward Deletion
Effective # of Parameters
Other Issues
MARS for classification:
• Two classes: 0/1 coding and regression.
• More than two classes: optimal scoring (Section 12.5).
MARS vs. CART:
• Piecewise linear basis functions vs. step functions.
• Multiplication vs. splitting; the splits need not be binary.
9.5 Hierarchical Mixtures of Experts
Soft gating network with expert at terminal node.
Estimation of parameters: the EM algorithm
E-step: compute expectations of the gating probabilities.
M-step: estimate the parameters in the expert networks by multiple logistic regression.
HME vs. CART:
• Similar to CART with linear combination splits.
• Soft splits: better for modeling a gradual response transition.
• No method to find a good tree topology for HME.
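With two branches, the softmax gating network reduces to a sigmoid in a linear combination of the inputs. A minimal sketch of such a soft split (names are illustrative):

```python
from math import exp

def soft_gate(x, v, v0):
    """Probability of routing point x to the left branch: a sigmoid in
    the linear combination x.v + v0, instead of CART's hard indicator
    I(x.v + v0 > 0)."""
    z = sum(a * b for a, b in zip(x, v)) + v0
    return 1.0 / (1.0 + exp(-z))
```

Points near the split boundary get routed to both experts with comparable weight, which is what makes the likelihood smooth in the gate parameters and EM applicable.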
9.6 Missing Data
The key question is whether the missing-data mechanism has distorted the observed data.
Missing at random (MAR): the mechanism depends on the data only through its observed components:

$\Pr(R \mid Z, \theta) = \Pr(R \mid Z_{obs}, \theta)$

Missing completely at random (MCAR): the mechanism is independent of the data altogether:

$\Pr(R \mid Z, \theta) = \Pr(R \mid \theta)$
Missing Data
Assuming MCAR:
1. Discard observations with any missing values.
2. Rely on the learning algorithm to deal with missing values in its training phase.
3. Impute all missing values before training.
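Strategy 3 in its simplest form is column-mean imputation, which is justified under MCAR. A sketch where missing entries are encoded as None (the encoding and function name are assumptions):

```python
def mean_impute(X):
    """Replace each missing entry (None) with the mean of the observed
    values in its column. X is a list of rows."""
    p = len(X[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[row[j] if row[j] is not None else means[j] for j in range(p)]
            for row in X]
```

More refined imputations predict each missing value from the other features, but even that is only as good as the MAR/MCAR assumption behind it.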
9.7 Computational Considerations
Additive model fitting: O(mpN + pN log N), where m is the number of iterations.
Trees: O(pN log N) for the initial sorting and the split computations.
MARS: O(NM^2 + pM^2 N), where M is the number of terms.
HME: O(Np^2) for the regressions and O(Np^2 K^2) for a K-class logistic regression at each M-step; the EM algorithm takes long to converge.