Feature Selection for Supervised and Unsupervised Learning
Mário A. T. Figueiredo
Institute of Telecommunications, and Instituto Superior Técnico
Technical University of Lisbon, PORTUGAL
Work herein reported was done in collaboration with:
L. Carin, B. Krishnapuram, and A. Hartemink, Duke University;A. K. Jain and M. Law, Michigan State University.
[email protected] www.lx.it.pt/~mtf
UFL, January 2004 M. Figueiredo, IST
Outline
Part I: Supervised Learning
1. Introduction
2. Review of LASSO Regression
3. The LASSO Penalty for Multinomial Logistic Regression
4. Bound Optimization Algorithms (Parallel and Sequential)
5. Non-linear Feature Weighting/Selection
6. Experimental Results

Part II: Performance Bounds

Part III: Unsupervised Learning
1. Introduction
2. Review of Model-Based Clustering with Finite Mixtures
3. Feature Saliency
4. An EM Algorithm to Estimate Feature Saliency
5. Model Selection
6. Experimental Results
Supervised Learning
Goal: to learn a functional dependency y = f(x, β), where β is a set of parameters,
...from a set of examples (the training data): D = {(x_1, y_1), ..., (x_n, y_n)}.
- Discriminative (non-generative) approach: no attempt to model the joint density p(x, y).
Complexity Control Via Bayes Point Estimation
Good generalization requires complexity control.
Bayesian (point estimation) approach: combine the likelihood function p(D | β) with a prior p(β); the prior controls the "complexity" of f(·, β).
Maximum a posteriori (MAP) point estimate of β: β̂ = arg max_β [log p(D | β) + log p(β)].
Prediction for a new "input" x: ŷ = f(x, β̂).
Bayes Point Estimate Versus Fully Bayesian Approach
Point estimate: β̂ (e.g., the MAP estimate); point prediction for a new "input" x: ŷ = f(x, β̂).
Fully Bayesian prediction: average the predictions f(x, β) over the full posterior p(β | D), rather than plugging in a single point estimate.
We will not consider fully Bayesian approaches here.
Linear (w.r.t. β) Regression

We consider functions which are linear w.r.t. β: f(x, β) = Σ_j β_j φ_j(x),
where {φ_1(·), ..., φ_d(·)} is some dictionary of functions, e.g., radial basis functions, splines, wavelets, polynomials, ...
Notable particular cases: linear regression, φ_j(x) = x_j (the j-th component of x); kernel regression, φ_j(x) = K(x, x_j), as in SVM, RVM, etc.
Likelihood Function for Regression
Likelihood function, for a Gaussian observation model: p(y | β) = N(y | Hβ, σ² I),
where H, with elements H_ij = φ_j(x_i), is the design matrix. Assuming that y and the columns of H are centered, we drop the intercept term w.l.o.g.
Maximum likelihood / ordinary least squares estimate: β̂_ML = (H^T H)^{-1} H^T y,
…undetermined if H^T H is singular (e.g., more basis functions than training points).
Bayesian Point Estimates: Ridge and the LASSO

With a Gaussian prior, p(β) ∝ exp(−λ‖β‖²₂), the MAP estimate is called ridge regression, or weight decay (in neural nets parlance).
With a Laplacian prior, p(β) ∝ exp(−λ‖β‖₁), the MAP estimate is the LASSO regression [Tibshirani, 1996]; see also pruning priors for NNs [Williams, 1995] and basis pursuit [Chen, Donoho, Saunders, 1995].
The Laplacian prior promotes sparseness of β̂, i.e., its components are either significantly large or exactly zero: feature selection.
Algorithms to Compute the LASSO
Special-purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000].
For orthogonal H (H^T H = I), there is a closed-form solution, the "soft threshold": β̂_j = sign(h_j^T y) · max{|h_j^T y| − t, 0}.
For more insight on the LASSO, see [Tibshirani, 1996].
Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]; currently the best approach.
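As a concrete illustration of the orthogonal-design case above, here is a minimal numpy sketch of the soft-threshold rule (the function name, the toy data, and the threshold value t are mine, not the slides'):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-threshold operator: sign(z) * max(|z| - t, 0), applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# For an orthonormal design (H^T H = I), the LASSO estimate is obtained by
# soft-thresholding the ordinary least squares estimate H^T y.
H = np.linalg.qr(np.random.randn(50, 10))[0]          # toy design with orthonormal columns
beta_true = np.array([3.0, 0, 0, 1.5, 0, 0, 0, 2.0, 0, 0])
y = H @ beta_true + 0.1 * np.random.randn(50)
beta_lasso = soft_threshold(H.T @ y, t=0.5)           # t plays the role of the penalty weight
```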
EM Algorithm for the LASSO
The LASSO can be computed by EM, using a hierarchical formulation: β_j | τ_j ~ N(0, τ_j), where the τ_j are independent and exponentially distributed; marginally, each β_j is then Laplacian.
This possibility is mentioned in [Tibshirani, 1996]; it is not very efficient, but…
Treat the τ_j as missing data and apply standard EM. This leads to an iteration in which each step solves a ridge problem with weights proportional to 1/|β_j|,
which can be called an iteratively reweighted ridge regression (IRRR).
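A minimal sketch of an IRRR iteration of this kind, assuming a Gaussian noise model with known variance; the small epsilon guarding the division, the initialization, and all names are my additions:

```python
import numpy as np

def lasso_irrr(H, y, lam, sigma2=1.0, n_iter=100, eps=1e-8):
    """EM / iteratively reweighted ridge regression for the LASSO (sketch).
    Each iteration solves a ridge problem whose diagonal weights are 1/|beta_j|,
    which is what the E-step of the hierarchical formulation produces."""
    beta = np.linalg.lstsq(H, y, rcond=None)[0]        # initialize with the OLS estimate
    HtH, Hty = H.T @ H, H.T @ y
    for _ in range(n_iter):
        D = np.diag(1.0 / (np.abs(beta) + eps))        # E-step: reweighting
        beta = np.linalg.solve(HtH + sigma2 * lam * D, Hty)  # M-step: weighted ridge
    return beta
```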
Other Priors

The previous derivation opens the door to other priors.
For example, a Jeffreys prior on the variances, p(τ_j) ∝ 1/τ_j (parameter-free).
The EM algorithm becomes (still IRRR-type) a reweighted ridge regression, now with weights 1/β_j².
Interestingly, this is similar to the FOCUSS algorithm for regression with an ℓ_p-type penalty [Kreutz-Delgado & Rao, 1998]. Strong sparseness!
Problem: non-convex objective; results depend on initialization. Possibility: initialize with the OLS estimate.
Some Results
Same vectors as in [Tibshirani, 1996]:
Design matrices and experimental procedure as in [Tibshirani, 1996]
Model error (ME) improvement w.r.t. OLS estimate:
Close to best in each case, without any cross-validation; more results in [Figueiredo, NIPS'2001] and [Figueiredo, PAMI'2003].
Classification via Logistic Regression
Binary classification: y ∈ {−1, +1}.
Multi-class, with "1-of-m" encoding: y is a binary vector with a single 1 indicating the class.
Class probabilities follow the multinomial logistic model: P(class c | x) ∝ exp(β^(c)T h(x)).
Recall that h(x) may denote the components of x, or other (nonlinear) functions of x, such as kernels.
Classification via Logistic Regression
Since the class probabilities must sum to one, we can set β^(m) = 0 w.l.o.g.
Parameters to estimate: β = (β^(1), ..., β^(m−1)).
Maximum log-likelihood estimate: β̂ = arg max_β log p(D | β).
If the training data is separable, the maximizer is unbounded (‖β̂‖ → ∞), thus the ML estimate is undefined.
Penalized (point Bayes) Logistic Regression
Penalized (or point Bayes / MAP) estimate: β̂ = arg max_β [log p(D | β) + log p(β)],
where p(β) is the prior.
Gaussian prior: penalized (ridge-type) logistic regression.
Laplacian prior (LASSO prior): favors sparseness, feature selection.
For linear regression it does; what about for logistic regression?
Laplacian Prior for Logistic Regression
Simple test with 2 training points:
[Figure: decision boundaries of linear logistic regression for the two training points (class 1 vs. class −1), with a Gaussian prior and with a Laplacian prior; with the Laplacian prior, as one of the features becomes less relevant, its weight is driven exactly to zero.]
Algorithms for Logistic Regression
Objective: the log-likelihood plus, possibly, a log-prior.
Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS).
IRLS is easily applied without any prior or with a Gaussian prior.
IRLS is not applicable with the Laplacian prior: the ℓ1 penalty is not differentiable.
Alternative: bound optimization algorithms.
For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992].
More general formulations: [de Leeuw & Michailidis, 1993], [Lange, Hunter, & Yang, 2000].
Bound Optimization Algorithms (BOA)
Optimization problem: maximize ℓ(θ).
Bound optimization algorithm: θ^(t+1) = arg max_θ Q(θ; θ^(t)),
where Q(θ; θ') is such that Q(θ; θ') ≤ ℓ(θ), ....with equality if and only if θ = θ'.
Sufficient (in fact more than sufficient) to prove monotonicity:
ℓ(θ^(t+1)) ≥ Q(θ^(t+1); θ^(t)) ≥ Q(θ^(t); θ^(t)) = ℓ(θ^(t)).
Notes: Q should be easy to maximize; EM is a BOA.
Deriving Bound Functions
There are many ways to obtain bound functions.
For example, it is well known that Jensen's inequality underlies EM.
Via a Hessian bound: suppose ℓ(θ) is concave, with Hessian bounded below, ∇²ℓ(θ) ⪰ −B for all θ,
where B is a positive definite matrix. Then
ℓ(θ) ≥ ℓ(θ') + g(θ')^T (θ − θ') − (1/2)(θ − θ')^T B (θ − θ'),
and we can use the r.h.s. as Q(θ; θ'), with g denoting the gradient of ℓ.
Quasi-Newton Monotonic Algorithm
The update equation θ^(t+1) = arg max_θ Q(θ; θ^(t)) is simple to solve and leads to
θ^(t+1) = θ^(t) + B^{-1} g(θ^(t)).
This is a quasi-Newton algorithm, with B replacing the (negated) Hessian.
Unlike the Newton algorithm, it is monotonic.
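To make the update concrete, here is a small sketch for binary ML logistic regression (labels in {0,1}), using the classical bound B = (1/4) X^T X on the (negated) Hessian; the jitter term and all names are mine:

```python
import numpy as np

def logistic_boa(X, y, n_iter=200):
    """Monotonic quasi-Newton (bound optimization) for binary ML logistic regression.
    Since the Hessian of the log-likelihood satisfies  Hessian >= -(1/4) X^T X,
    the matrix B below can be factored once, outside the loop."""
    n, d = X.shape
    beta = np.zeros(d)
    B = 0.25 * X.T @ X + 1e-10 * np.eye(d)             # curvature bound (jitter for safety)
    L = np.linalg.cholesky(B)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
        g = X.T @ (y - p)                              # gradient of the log-likelihood
        beta = beta + np.linalg.solve(L.T, np.linalg.solve(L, g))  # beta <- beta + B^{-1} g
    return beta
```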
Application to ML Logistic Regression

For multinomial logistic regression, it can be shown [Böhning, 1992] that the Hessian of the log-likelihood is bounded below by −B, with
B = (1/2) [I − 1 1^T/m] ⊗ H^T H   (⊗ is the Kronecker product).
It is also easy to compute the gradient, and finally plug both into the quasi-Newton update β^(t+1) = β^(t) + B^{-1} g(β^(t)).
Under a ridge-type Gaussian prior, B simply gains an extra diagonal term;
B (and its inverse or factorization) can be computed off-line.
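The multinomial bound can be assembled directly with a Kronecker product; a sketch (m classes, n x d design matrix H), under the usual convention that the last class's parameters are fixed at zero; the function name is mine:

```python
import numpy as np

def bohning_bound(H, m):
    """Matrix B such that the Hessian of the multinomial logistic log-likelihood
    satisfies  Hessian >= -B  (Bohning, 1992):
        B = (1/2) * (I_{m-1} - 1 1^T / m)  kron  H^T H.
    B depends only on H and m, so it can be computed and factored off-line."""
    A = np.eye(m - 1) - np.ones((m - 1, m - 1)) / m
    return 0.5 * np.kron(A, H.T @ H)
```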
Application to LASSO Logistic Regression

For LASSO logistic regression, the objective is the log-likelihood minus λ‖β‖₁.
The log-likelihood is already bounded via the Hessian bound... we need a bound for the log-prior.
Quadratic bound: it is easy to show that, for any β'_j ≠ 0,
|β_j| ≤ (1/2) (β_j² / |β'_j| + |β'_j|), ...with equality iff |β_j| = |β'_j|.
Application to LASSO Logistic Regression

After dropping additive terms, the surrogate to be maximized is a quadratic function of β.
The update equation is an IRRR: each iteration solves a ridge-type linear system whose diagonal regularization weights depend on 1/|β_j^(t)|,
which can be rewritten so that components that are exactly zero cause no division problems.
Application to LASSO Logistic Regression

The update equation has a computational cost that is cubic in the total number of parameters, roughly O((dm)³) per iteration.
This may not be OK for kernel classification (where d = n) for large n; it is OK for linear classification if d is not too large.
This is the cost of standard IRLS for ML logistic regression ...but now with a Laplacian prior.
Sequential Update Algorithm for LASSO Logistic Regression
Recall that the objective is the log-likelihood minus λ‖β‖₁.
Let us bound only the log-likelihood via the Hessian bound, leaving the ℓ1 term exact.
Then maximize the resulting surrogate only w.r.t. the j-th component of β,
for j = 1, 2, ..., keeping the other components fixed.
Sequential Update Algorithm for LASSO Logistic Regression
The update equation has a simple closed-form expression: a soft threshold applied to a one-dimensional quasi-Newton step.
It can be shown that updating all components has a cost that grows only linearly with the number of parameters, which may be much less than the cubic cost of the parallel update.
It usually also needs fewer iterations, since we do not bound the prior and the update is "incremental".
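A sketch of one cyclic sweep of such updates for the binary case (labels in {0,1}), using the per-coordinate curvature bound (1/4) Σ_i H_ij²; the surrogate along coordinate j is a quadratic plus λ|β_j|, whose maximizer is a soft threshold. The bookkeeping of the linear predictor and all names are mine; the slides treat the multinomial case:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def sequential_lasso_logistic(H, y, lam, n_sweeps=50):
    """Cyclic coordinate-wise bound optimization for L1-penalized binary logistic
    regression (sketch): only the log-likelihood is bounded by a quadratic, the
    L1 term is kept exact, giving a closed-form (soft-threshold) update."""
    n, d = H.shape
    beta = np.zeros(d)
    Bdiag = 0.25 * np.sum(H ** 2, axis=0) + 1e-12      # per-coordinate curvature bound
    eta = H @ beta                                     # linear predictor, kept incrementally
    for _ in range(n_sweeps):
        for j in range(d):
            p = 1.0 / (1.0 + np.exp(-eta))
            g_j = H[:, j] @ (y - p)                    # j-th gradient component
            new = soft(beta[j] + g_j / Bdiag[j], lam / Bdiag[j])
            eta += H[:, j] * (new - beta[j])           # incremental update of H @ beta
            beta[j] = new
    return beta
```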
Sequential Update Algorithm for Ridge Logistic Regression
With a Gaussian prior, the update equation also has a simple closed-form expression.
For λ = 0 (no penalty), we get the update rule for ML logistic regression.
Important issue: how to choose the order of update, i.e., which component to update next?
In all the results presented, we use a simple cyclic schedule.
We’re currently investigating other alternatives (good, but cheap).
Related Work
Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave; results may depend critically on initialization and order of update.
Kernel logistic regression; the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior), but by early stopping of a greedy algorithm.
Efficient algorithm for the SVM with an ℓ1 penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective.
Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet/easily applied to logistic regression.
Experimental Results
Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast.
Penalty weight adjusted by cross-validation.
Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight)
Summary of datasets:
Not exactly CV, but 30 different 50/12 splits
No CV, fixed standard split
Experimental Results
All classifiers are kernel classifiers (Gaussian RBF and linear kernels). For the RBF, the width was tuned by CV with the SVM and then used for all other methods.
Linear classification of AML/ALL (no kernel): 1 error, 81 (of 7129) features (genes) selected.
[Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]
Results: number of errors and number of kernels for each method (table not recovered).
BMSLR: Bayesian multinomial sparse logistic regression; BMGLR: Bayesian multinomial Gaussian logistic regression.
Non-linear Feature Selection
We have considered a fixed dictionary of functions {φ_1(·), ..., φ_d(·)},
i.e., feature selection is done on parameters appearing linearly, or "generalized linearly".
Let us now look at "non-linear parameters", i.e., parameters inside the dictionary: φ_j(x, θ).
Non-linear Feature Selection
For logistic regression, both the "linear" parameters β and the dictionary parameters θ now have to be estimated.
We need to further constrain the problem... consider parameterizations of the following type, for kernels:
Polynomial: k_θ(x, x_i) = (1 + Σ_l θ_l x_l x_il)^r;
Gaussian: k_θ(x, x_i) = exp(−Σ_l θ_l (x_l − x_il)²).
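A sketch of the two scaled kernels just described (one non-negative scale θ_l per original feature); function names are mine:

```python
import numpy as np

def scaled_rbf_kernel(x, z, theta):
    """Gaussian kernel with per-feature scales: k(x, z) = exp(-sum_l theta_l (x_l - z_l)^2).
    Setting theta_l = 0 removes feature l from the kernel entirely."""
    return np.exp(-np.sum(theta * (x - z) ** 2))

def scaled_poly_kernel(x, z, theta, r=2):
    """r-th order polynomial kernel with per-feature scales:
    k(x, z) = (1 + sum_l theta_l * x_l * z_l)^r."""
    return (1.0 + np.sum(theta * x * z)) ** r
```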
Feature Scaling/Selection

This corresponds to a different scaling θ_l of each original feature.
We can also adopt a Laplacian prior for θ: sparseness of θ means feature selection.
Estimation criterion: maximize the logistic log-likelihood plus the (Laplacian) log-priors on β and θ.
EM Algorithm for Feature Scaling/Selection
Optimization problem: maximize the penalized log-likelihood jointly over (β, θ).
We use again a bound optimization algorithm (BOA),
with the Hessian bound for the log-likelihood and the quadratic bound for the Laplacian log-priors.
Maximizing the resulting surrogate can't be done in closed form:
it is easy to maximize w.r.t. β, with θ fixed;
maximization w.r.t. θ is done by conjugate gradient; the necessary gradients are easy to derive.
JCFO: Joint Classifier and Feature Optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].
Experimental Results on Gene Expression Data (Full LOOCV)
Accuracy (%), full LOOCV:

Method                          AML/ALL   Colon
Boosting                          95.8     72.6
SVM (linear kernel)               94.4     77.4
SVM (quadratic kernel)            95.8     74.2
RVM (no kernel)                   97.2     88.7
Logistic (no kernel)              97.2     71.0
Sparse probit (quadr. kernel)     95.8     84.6
Sparse probit (linear kernel)     97.2     91.9
JCFO (quadratic kernel)           98.6     88.7
JCFO (linear kernel)             100       96.8
[Ben-Dor et al, 2000]
[Krishnapuram et al, 2002]
Typically around 25~30 genes selected, i.e., with non-zero weights.
Top 12 Genes for AML/ALL (sorted by mean |θ_i|)
[Gene table not recovered; asterisks mark the genes discussed below.]
* Agree with [Golub et al, 1999]; many others in the top 25. Antibodies to MPO are used in clinical diagnosis of AML.
Top 12 Genes for Colon (sorted by mean θ_i)
[Gene table not recovered; asterisks mark the genes discussed below.]
* Known to be implicated in colon cancer.
PART II
Non-trivial Bounds for Sparse Classifiers
Introduction
Training data (here, we consider only binary problems): D = {(x_1, y_1), ..., (x_n, y_n)}, y_i ∈ {−1, +1},
assumed to be i.i.d. from an underlying distribution P(x, y).
Given a classifier, two quantities are of interest:
True generalization error (not computable, since P is unknown): the probability, under P, that the classifier disagrees with y.
Sample error: the fraction of the training set that the classifier misclassifies.
Key question: how are the two related?
PAC Performance Bounds
PAC (probably approximately correct) bounds are of the form: with probability at least 1 − δ over the draw of the training set, the generalization error does not exceed the sample error plus a complexity-dependent term,
and they hold independently of the underlying distribution P.
Usually, the bounds hold uniformly over the class of classifiers under consideration.
PAC Performance Bounds
There are several ways to derive such bounds:
- Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998])
VC usually leads to trivial bounds (>1, unless n is huge).
- Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]
Compression bounds are not applicable to point sparse classifiers of the type herein presented, or of the RVM type [Herbrich, 2000].
We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].
Some Definitions
β̂: some point "estimate" of β. Let q(β) be a Laplacian centered at β̂;
we'll call this the "posterior", although not in the usual sense.
Point classifier (PC) at β̂: classify with β̂ itself; the one we're interested in.
Gibbs classifier (GC) at q: for each input, classify with a sample β drawn from q.
Bayes voting classifier (BVC) at q: classify by the (weighted) majority vote of the classifiers under q.
Key Lemmas
Lemma 1: for any input x, the decision of the PC with β̂ is the same as that of a BVC based on any symmetric posterior centered on β̂.
Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]).
Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC.
Proof: see [Herbrich, 2002].
Conclusion: we can use PAC-Bayesian bounds for GC for our PC.
PAC-Bayesian Theorem
Let π(β) be our prior (meaning it is independent of the data D).
Let q(β) be our posterior (meaning it may depend on D).
Generalization error for a Gibbs classifier: e_GC = E_{β~q}[generalization error of the classifier with parameters β].
Expected sample/empirical error for a Gibbs classifier: ê_GC = E_{β~q}[sample error of the classifier with parameters β].
McAllester’s PAC-Bayesian theorem relates these two errors.
PAC-Bayesian Theorem
Theorem: with π and q as defined above, the following inequality holds with probability at least 1 − δ over random training samples of size n:
kl(ê_GC ‖ e_GC) ≤ [KL(q ‖ π) + ln(n/δ)] / (n − 1),
where kl(·‖·) is the Kullback-Leibler divergence between two Bernoulli distributions, and KL(q ‖ π)
is the Kullback-Leibler divergence between posterior and prior.
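In practice the theorem is used by inverting the binary KL: given the empirical Gibbs error and the right-hand side, find the largest true error consistent with the inequality. A sketch (the expression for `rhs` follows the form of the theorem as stated above and should be adapted if a different version is used; the bisection itself is generic):

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def invert_kl(emp_err, rhs, tol=1e-9):
    """Largest p >= emp_err such that kl(emp_err || p) <= rhs (bisection;
    kl(emp_err || .) is increasing on [emp_err, 1])."""
    lo, hi = emp_err, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers only: n training points, confidence 1 - delta,
# KL(q || pi) already computed for the Laplacian posterior/prior pair.
n, delta, kl_q_pi, emp_gibbs_err = 500, 0.05, 12.3, 0.08
rhs = (kl_q_pi + np.log(n / delta)) / (n - 1)
bound_on_gibbs_err = invert_kl(emp_gibbs_err, rhs)
```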
Tightening the PAC-Bayesian Theorem
With our Laplacian prior and Laplacian "posterior", KL(q ‖ π) can be computed in closed form, and one can show that it is a convex function of the posterior scale.
Since we can choose the posterior scale freely: due to the convexity of the KLD, it is easy to (numerically) find the scale that minimizes it.
Using the PAC-Bayesian Theorem
Set a prior parameter and choose a confidence level δ.
Using this prior and the training data, find a point estimate β̂.
With this β̂, define the "posterior" q as above, and find the posterior scale for which KL(q ‖ π) is minimized.
Evaluate the corresponding empirical Gibbs error by Monte Carlo.
From these, we know that, with probability at least 1 − δ, the inequality of the theorem holds with the computed quantities.
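The Monte Carlo step can be sketched as follows: draw parameter vectors from the Laplacian "posterior" centered at the point estimate and average the sample error of the resulting linear classifiers (labels in {-1,+1}; all names and the sample size are mine):

```python
import numpy as np

def gibbs_empirical_error(beta_hat, X, y, scale, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected sample error of the Gibbs classifier:
    beta ~ Laplacian(beta_hat, scale), decision sign(X beta), labels y in {-1,+1}."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_samples):
        beta = beta_hat + rng.laplace(0.0, scale, size=beta_hat.shape)
        errs.append(np.mean(np.sign(X @ beta) != y))
    return float(np.mean(errs))
```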
Using the PAC-Bayesian Theorem

To obtain an explicit bound on the generalization error of the Gibbs classifier, we invert the binary KL divergence: the largest error consistent with the inequality can easily be found numerically.
The resulting bound is always non-trivial, i.e., it is strictly below 1.
Using the PAC-Bayesian Theorem
Finally, notice that the PAC-Bayesian bound applies tothe Gibbs classifier, but recall Lemma 2.
Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC.
In practice, we have observed that the BVC usually generalizesas well as (often much better than) the GC. We believe the factor2 can be reduced, but we have not yet been able to show it.
Example of PAC-Bayesian Bound
“Mines” dataset
Maybe tight enough to guide the selection of the regularization (prior) parameter.
Conclusions for Part II
- PAC-Bayesian bound for sparse classifiers.
- The bound (unlike VC bounds) is always non-trivial.
- Tightness still requires large sample sizes.
- Future goals: tightening the bounds, ...as always.
PART III
Feature Selection in Unsupervised Learning
Feature Selection in Unsupervised Learning

FS is a widely studied problem in supervised learning.
FS is not widely studied in unsupervised learning (clustering).
A good reason: in the absence of labels, how do you assess the usefulness of each feature?
Approach: how relevant is each feature (component of x) for the mixture nature of the data?
We address this problem in the context of model-based clustering using finite mixtures: p(x) = Σ_i α_i p(x | θ_i).
Example of Relevant and Irrelevant Features

Example: x1 is relevant for the mixture nature of this data; x2 is irrelevant for the mixture nature of this data.
Any PCA-type analysis of this data would not be useful here.
Interplay Between Number of Clusters and Features
Example:
Using only x1, we find 2 components.
Using x1 and x2, we find 7 components (needed to fit the non-Gaussian marginal density of x2).
Approaches to Feature Selection

Most classical FS methods for supervised learning require combinatorial searches: for d features, there are 2^d possible feature subsets.
Alternative: assign real-valued feature weights and encourage sparseness (as seen above for supervised learning).
[Law, Figueiredo and Jain, TPAMI, 2004 (to appear)]
Maximum Likelihood Estimation and Missing Data
Training data: X = {x_1, ..., x_n}, assumed i.i.d. from the mixture.
Maximum likelihood estimate of the mixture parameters: maximize log p(X | θ), where θ collects the mixing probabilities and the component parameters.
Missing data: Z = {z_1, ..., z_n}, with one-of-k encoding: z_j indicates which component i generated x_j.
"Complete" log-likelihood: log p(X, Z | θ), ...which would be easy to maximize, if we had Z.
EM Algorithm for Maximum Likelihood Estimation
E-step: compute w_ij, the current estimate of the probability that x_j was produced by component i.
M-step: re-estimate the mixing probabilities and the component parameters using the w_ij as soft assignments; this works because the complete log-likelihood is linear in the missing indicators.
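For reference, a compact sketch of these E- and M-steps for a Gaussian mixture with diagonal covariances (a minimal illustration, not the exact implementation behind the results reported later; all names are mine):

```python
import numpy as np

def em_gaussian_mixture(X, k, n_iter=100, seed=0):
    """Standard EM for a k-component Gaussian mixture with diagonal covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # component means
    var = np.tile(X.var(axis=0), (k, 1))               # component variances
    alpha = np.ones(k) / k                             # mixing probabilities
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = P(component i | point j)
        logp = np.stack([np.log(alpha[i])
                         - 0.5 * np.sum((X - mu[i]) ** 2 / var[i]
                                        + np.log(2 * np.pi * var[i]), axis=1)
                         for i in range(k)])
        logp -= logp.max(axis=0)
        w = np.exp(logp)
        w /= w.sum(axis=0)
        # M-step: weighted ML estimates
        Nk = w.sum(axis=1)
        alpha = Nk / n
        mu = (w @ X) / Nk[:, None]
        var = np.stack([(w[i] @ (X - mu[i]) ** 2) / Nk[i] for i in range(k)]) + 1e-6
    return alpha, mu, var
```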
Model Selection for Mixtures
Important issue: how to select k (number of components)?
[Figueiredo & Jain, 2002]: MML-based approach; roughly, an MDL/BIC with a careful definition of the amount of data from which each parameter is estimated.
The resulting criterion leads to a simple modification of EM.
Original M-step expression for the mixing probabilities: α̂_i = (1/n) Σ_j w_ij.
New expression: α̂_i ∝ max{0, Σ_j w_ij − N/2}, where N is the number of parameters of each component.
This update "kills" weak components.
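A sketch of that modified update, following the formula above (w holds the E-step responsibilities, one row per component; the guard against an all-zero numerator is mine):

```python
import numpy as np

def mml_alpha_update(w, N):
    """MML-modified M-step for the mixing probabilities.
    w : (k, n) responsibilities from the E-step; N : parameters per component.
    Components whose support sum_j w[i, j] falls below N/2 get alpha_i = 0."""
    support = np.maximum(w.sum(axis=1) - N / 2.0, 0.0)
    total = support.sum()
    return support / total if total > 0 else support
```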
Feature Selection for Mixtures
Simplifying assumption: in each component, the features are independent: p(x | θ_i) = Π_l p(x_l | θ_il),
where θ_il are the parameters of the density of the l-th feature in component i.
Let some features have a common density in all components. These features are "irrelevant":
x_l ~ p(x_l | θ_il) if feature l is relevant; x_l ~ q(x_l | λ_l) if feature l is irrelevant,
where the q(· | λ_l) are the common (w.r.t. i) densities of the irrelevant features.
Feature Saliency

The likelihood combines, for each feature, the component-specific density (if the feature is relevant) and the common density (if it is irrelevant).
To apply EM, we treat the binary relevance indicators as missing data and define ρ_l = P(feature l is relevant); we call it the "feature saliency".
It can be shown that the resulting likelihood (marginal w.r.t. the indicators) is
p(x) = Σ_i α_i Π_l [ ρ_l p(x_l | θ_il) + (1 − ρ_l) q(x_l | λ_l) ].
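A sketch of this density for the univariate-Gaussian case used in the experiments (component-specific densities p(·|θ_il) and common densities q(·|λ_l) both Gaussian; all names are mine):

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def saliency_mixture_density(x, alpha, rho, mu, var, mu_q, var_q):
    """p(x) = sum_i alpha_i * prod_l [ rho_l p(x_l|theta_il) + (1 - rho_l) q(x_l|lambda_l) ].
    x: (d,);  alpha: (k,);  rho: (d,) feature saliencies;
    mu, var: (k, d) component-specific Gaussians;  mu_q, var_q: (d,) common density."""
    rel = norm_pdf(x, mu, var)                 # (k, d): component-specific densities
    irr = norm_pdf(x, mu_q, var_q)             # (d,): common-density values
    per_feature = rho * rel + (1 - rho) * irr  # broadcasts to (k, d)
    return float(np.sum(alpha * np.prod(per_feature, axis=1)))
```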
Applying EM
We address the maximization of this likelihood by EM, using the component labels and the feature-relevance indicators as missing data.
In addition to the w_ij defined above, the E-step now also involves, for each feature of each point, the probabilities that it was generated by the component-specific density or by the common density;
both are easily computed in closed form.
Applying EM
M-step (omitting the variances):
Assuming that p(x_l | θ_il) and q(x_l | λ_l) are both univariate Gaussian with arbitrary mean and variance,
the mean of p(x_l | θ_il) and the mean of q(x_l | λ_l) are updated as weighted sample means, with weights given by the E-step variables.
Model Selection
To perform feature selection, we want to encourage someof the saliencies to become either 0 or 1.
This can be achieved with the same MML-type criterion used above to select k.
The modified M-step is analogous to the modified update for the α_i: the weight of evidence for "feature l is relevant" is penalized by half the number of parameters in the densities p(x_l | θ_il), and the weight for "irrelevant" by half the number of parameters in q(x_l | λ_l), truncating at zero.
Synthetic Example

Mixture of 4 Gaussians (with identity covariance) with d = 10: 2 relevant features and 8 irrelevant features. The component means are
μ1 = (0, 3, 0, ..., 0)^T,  μ2 = (1, 9, 0, ..., 0)^T,  μ3 = (6, 4, 0, ..., 0)^T,  μ4 = (7, 10, 0, ..., 0)^T.
800 samples, projected on the first two dimensions.
[Figure: initial and final estimates of the common density (not recovered).]
Synthetic Example

Feature saliency values (mean ± 1 s.d.) over 10 runs, shown separately for the 2 relevant and the 8 irrelevant features. [Plot not recovered.]
Real Data

- Several standard benchmark data sets:

Name                       n     d   k
Wine                       178   13  3
Wisconsin breast cancer    569   30  2
Image segmentation         2320  18  7
Texture classification     4000  19  4

(n: number of samples; d: number of features; k: number of classes)

- These are standard data sets for supervised classification.
- We fit mixtures, ignoring the labels.
- We classify the points and compare to the labels.
Real Data: Results
Name                       % error (sd), with FS    % error (sd), without FS
Wine                       6.61 (3.91)              8.06 (3.73)
Wisconsin breast cancer    9.55 (1.99)              10.09 (2.70)
Image segmentation         20.19 (1.54)             32.84 (5.10)
Texture classification     4.04 (0.76)              4.85 (0.98)
- For these data-sets, our approach is able to improve the performance of mixture-based unsupervised classification.
Research Directions
In supervised learning:
- More efficient algorithms for logistic regression with the LASSO prior.
- Investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (ridge).
- Deriving performance bounds for this type of approach.

In unsupervised learning:
- More efficient algorithms.
- Removing the conditional independence assumption.
- Extension to other mixtures (e.g., multinomial for categorical data).