Feature Selection for Supervised and Unsupervised Learning
Mário A. T. Figueiredo
Institute of Telecommunications, and Instituto Superior Técnico
Technical University of Lisbon, PORTUGAL
Work herein reported was done in collaboration with:
L. Carin, B. Krishnapuram, and A. Hartemink, Duke University;A. K. Jain and M. Law, Michigan State University.
[email protected] www.lx.it.pt/~mtf
UFL, January 2004 M. Figueiredo, IST
Outline
Part I: Supervised Learning
1. Introduction
2. Review of LASSO Regression
3. The LASSO Penalty for Multinomial Logistic Regression
4. Bound Optimization Algorithms (Parallel and Sequential)
5. Non-linear Feature Weighting/Selection
6. Experimental Results

Part II: Performance Bounds

Part III: Unsupervised Learning
1. Introduction
2. Review of Model-Based Clustering with Finite Mixtures
3. Feature Saliency
4. An EM Algorithm to Estimate Feature Saliency
5. Model Selection
6. Experimental Results
Supervised Learning
Goal: to learn a functional dependency y = f(x, β), where β is a set of parameters,
...from a set of examples (the training data): D = {(x_1, y_1), ..., (x_n, y_n)}.
- Discriminative (non-generative) approach: no attempt to model the joint density p(x, y).
Complexity Control Via Bayes Point Estimation
Good generalization requires complexity control.
Bayesian (point estimation) approach: combine the likelihood function p(D | β) with a prior p(β); the prior controls the "complexity" of f(·, β).
Maximum a posteriori (MAP) point estimate of β: β̂ = arg max_β [log p(D | β) + log p(β)].
Prediction for a new "input" x: ŷ = f(x, β̂).
Bayes Point Estimate Versus Fully Bayesian Approach
Point estimate: β̂ (e.g., the MAP estimate); point prediction for a new "input" x: ŷ = f(x, β̂).
Fully Bayesian prediction: average the predictions f(x, β) over the full posterior p(β | D), rather than plugging in a single point estimate.
We will not consider fully Bayesian approaches here.
Linear (w.r.t. β) Regression

We consider functions which are linear w.r.t. β: f(x, β) = Σ_j β_j φ_j(x),
where {φ_1(·), ..., φ_d(·)} is some dictionary of functions, e.g., radial basis functions, splines, wavelets, polynomials, ...
Notable particular cases: linear regression, φ_j(x) = x_j (the j-th component of x); kernel regression, φ_j(x) = K(x, x_j), as in SVM, RVM, etc.
Likelihood Function for Regression
Likelihood function, for a Gaussian observation model: p(y | β) = N(y | Hβ, σ² I),
where H, with elements H_ij = φ_j(x_i), is the design matrix. Assuming that y and the columns of H are centered, we drop the intercept term w.l.o.g.
Maximum likelihood / ordinary least squares estimate: β̂_ML = (H^T H)^{-1} H^T y,
…undetermined if H^T H is singular (e.g., more basis functions than training points).
Bayesian Point Estimates: Ridge and the LASSO

With a Gaussian prior, p(β) ∝ exp(−λ‖β‖²₂), the MAP estimate is called ridge regression, or weight decay (in neural nets parlance).
With a Laplacian prior, p(β) ∝ exp(−λ‖β‖₁), the MAP estimate is the LASSO regression [Tibshirani, 1996]; see also pruning priors for NNs [Williams, 1995] and basis pursuit [Chen, Donoho, Saunders, 1995].
The Laplacian prior promotes sparseness of β̂, i.e., its components are either significantly large or exactly zero: feature selection.
Algorithms to Compute the LASSO
Special-purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000].
For orthogonal H (H^T H = I), there is a closed-form solution, the "soft threshold": β̂_j = sign(h_j^T y) · max{|h_j^T y| − t, 0}.
For more insight on the LASSO, see [Tibshirani, 1996].
Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]; currently the best approach.
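As a concrete illustration of the orthogonal-design case above, here is a minimal numpy sketch of the soft-threshold rule (the function name, the toy data, and the threshold value t are mine, not the slides'):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-threshold operator: sign(z) * max(|z| - t, 0), applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# For an orthonormal design (H^T H = I), the LASSO estimate is obtained by
# soft-thresholding the ordinary least squares estimate H^T y.
H = np.linalg.qr(np.random.randn(50, 10))[0]          # toy design with orthonormal columns
beta_true = np.array([3.0, 0, 0, 1.5, 0, 0, 0, 2.0, 0, 0])
y = H @ beta_true + 0.1 * np.random.randn(50)
beta_lasso = soft_threshold(H.T @ y, t=0.5)           # t plays the role of the penalty weight
```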
EM Algorithm for the LASSO
The LASSO can be computed by EM, using a hierarchical formulation: β_j | τ_j ~ N(0, τ_j), where the τ_j are independent and exponentially distributed; marginally, each β_j is then Laplacian.
This possibility is mentioned in [Tibshirani, 1996]; it is not very efficient, but…
Treat the τ_j as missing data and apply standard EM. This leads to an iteration in which each step solves a ridge problem with weights proportional to 1/|β_j|,
which can be called an iteratively reweighted ridge regression (IRRR).
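A minimal sketch of an IRRR iteration of this kind, assuming a Gaussian noise model with known variance; the small epsilon guarding the division, the initialization, and all names are my additions:

```python
import numpy as np

def lasso_irrr(H, y, lam, sigma2=1.0, n_iter=100, eps=1e-8):
    """EM / iteratively reweighted ridge regression for the LASSO (sketch).
    Each iteration solves a ridge problem whose diagonal weights are 1/|beta_j|,
    which is what the E-step of the hierarchical formulation produces."""
    beta = np.linalg.lstsq(H, y, rcond=None)[0]        # initialize with the OLS estimate
    HtH, Hty = H.T @ H, H.T @ y
    for _ in range(n_iter):
        D = np.diag(1.0 / (np.abs(beta) + eps))        # E-step: reweighting
        beta = np.linalg.solve(HtH + sigma2 * lam * D, Hty)  # M-step: weighted ridge
    return beta
```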
Other Priors

The previous derivation opens the door to other priors.
For example, a Jeffreys prior on the variances, p(τ_j) ∝ 1/τ_j (parameter-free).
The EM algorithm becomes (still IRRR-type) a reweighted ridge regression, now with weights 1/β_j².
Interestingly, this is similar to the FOCUSS algorithm for regression with an ℓ_p-type penalty [Kreutz-Delgado & Rao, 1998]. Strong sparseness!
Problem: non-convex objective; results depend on initialization. Possibility: initialize with the OLS estimate.
Some Results
Same vectors as in [Tibshirani, 1996]:
Design matrices and experimental procedure as in [Tibshirani, 1996]
Model error (ME) improvement w.r.t. OLS estimate:
Close to best in each case, without any cross-validation; more results in [Figueiredo, NIPS'2001] and [Figueiredo, PAMI'2003].
Classification via Logistic Regression
Binary classification: y ∈ {−1, +1}.
Multi-class, with "1-of-m" encoding: y is a binary vector with a single 1 indicating the class.
Class probabilities follow the multinomial logistic model: P(class c | x) ∝ exp(β^(c)T h(x)).
Recall that h(x) may denote the components of x, or other (nonlinear) functions of x, such as kernels.
Classification via Logistic Regression
Since the class probabilities must sum to one, we can set β^(m) = 0 w.l.o.g.
Parameters to estimate: β = (β^(1), ..., β^(m−1)).
Maximum log-likelihood estimate: β̂ = arg max_β log p(D | β).
If the training data is separable, the maximizer is unbounded (‖β̂‖ → ∞), thus the ML estimate is undefined.
Penalized (point Bayes) Logistic Regression
Penalized (or point Bayes / MAP) estimate: β̂ = arg max_β [log p(D | β) + log p(β)],
where p(β) is the prior.
Gaussian prior: penalized (ridge-type) logistic regression.
Laplacian prior (LASSO prior): favors sparseness, feature selection.
For linear regression it does; what about for logistic regression?
Laplacian Prior for Logistic Regression
Simple test with 2 training points:
[Figure: decision boundaries of linear logistic regression for the two training points (class 1 vs. class −1), with a Gaussian prior and with a Laplacian prior; with the Laplacian prior, as one of the features becomes less relevant, its weight is driven exactly to zero.]
Algorithms for Logistic Regression
Objective: the log-likelihood plus, possibly, a log-prior.
Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS).
IRLS is easily applied without any prior or with a Gaussian prior.
IRLS is not applicable with the Laplacian prior: the ℓ1 penalty is not differentiable.
Alternative: bound optimization algorithms.
For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992].
More general formulations: [de Leeuw & Michailidis, 1993], [Lange, Hunter, & Yang, 2000].
Bound Optimization Algorithms (BOA)
Optimization problem: maximize ℓ(θ).
Bound optimization algorithm: θ^(t+1) = arg max_θ Q(θ; θ^(t)),
where Q(θ; θ') is such that Q(θ; θ') ≤ ℓ(θ), ....with equality if and only if θ = θ'.
Sufficient (in fact more than sufficient) to prove monotonicity:
ℓ(θ^(t+1)) ≥ Q(θ^(t+1); θ^(t)) ≥ Q(θ^(t); θ^(t)) = ℓ(θ^(t)).
Notes: Q should be easy to maximize; EM is a BOA.
Deriving Bound Functions
There are many ways to obtain bound functions.
For example, it is well known that Jensen's inequality underlies EM.
Via a Hessian bound: suppose ℓ(θ) is concave, with Hessian bounded below, ∇²ℓ(θ) ⪰ −B for all θ,
where B is a positive definite matrix. Then
ℓ(θ) ≥ ℓ(θ') + g(θ')^T (θ − θ') − (1/2)(θ − θ')^T B (θ − θ'),
and we can use the r.h.s. as Q(θ; θ'), with g denoting the gradient of ℓ.
Quasi-Newton Monotonic Algorithm
The update equation θ^(t+1) = arg max_θ Q(θ; θ^(t)) is simple to solve and leads to
θ^(t+1) = θ^(t) + B^{-1} g(θ^(t)).
This is a quasi-Newton algorithm, with B replacing the (negated) Hessian.
Unlike the Newton algorithm, it is monotonic.
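To make the update concrete, here is a small sketch for binary ML logistic regression (labels in {0,1}), using the classical bound B = (1/4) X^T X on the (negated) Hessian; the jitter term and all names are mine:

```python
import numpy as np

def logistic_boa(X, y, n_iter=200):
    """Monotonic quasi-Newton (bound optimization) for binary ML logistic regression.
    Since the Hessian of the log-likelihood satisfies  Hessian >= -(1/4) X^T X,
    the matrix B below can be factored once, outside the loop."""
    n, d = X.shape
    beta = np.zeros(d)
    B = 0.25 * X.T @ X + 1e-10 * np.eye(d)             # curvature bound (jitter for safety)
    L = np.linalg.cholesky(B)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
        g = X.T @ (y - p)                              # gradient of the log-likelihood
        beta = beta + np.linalg.solve(L.T, np.linalg.solve(L, g))  # beta <- beta + B^{-1} g
    return beta
```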
Application to ML Logistic Regression

For multinomial logistic regression, it can be shown [Böhning, 1992] that the Hessian of the log-likelihood is bounded below by −B, with
B = (1/2) [I − 1 1^T/m] ⊗ H^T H   (⊗ is the Kronecker product).
It is also easy to compute the gradient, and finally plug both into the quasi-Newton update β^(t+1) = β^(t) + B^{-1} g(β^(t)).
Under a ridge-type Gaussian prior, B simply gains an extra diagonal term;
B (and its inverse or factorization) can be computed off-line.
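The multinomial bound can be assembled directly with a Kronecker product; a sketch (m classes, n x d design matrix H), under the usual convention that the last class's parameters are fixed at zero; the function name is mine:

```python
import numpy as np

def bohning_bound(H, m):
    """Matrix B such that the Hessian of the multinomial logistic log-likelihood
    satisfies  Hessian >= -B  (Bohning, 1992):
        B = (1/2) * (I_{m-1} - 1 1^T / m)  kron  H^T H.
    B depends only on H and m, so it can be computed and factored off-line."""
    A = np.eye(m - 1) - np.ones((m - 1, m - 1)) / m
    return 0.5 * np.kron(A, H.T @ H)
```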
Application to LASSO Logistic Regression

For LASSO logistic regression, the objective is the log-likelihood minus λ‖β‖₁.
The log-likelihood is already bounded via the Hessian bound... we need a bound for the log-prior.
Quadratic bound: it is easy to show that, for any β'_j ≠ 0,
|β_j| ≤ (1/2) (β_j² / |β'_j| + |β'_j|), ...with equality iff |β_j| = |β'_j|.
Application to LASSO Logistic Regression

After dropping additive terms, the surrogate to be maximized is a quadratic function of β.
The update equation is an IRRR: each iteration solves a ridge-type linear system whose diagonal regularization weights depend on 1/|β_j^(t)|,
which can be rewritten so that components that are exactly zero cause no division problems.
Application to LASSO Logistic Regression

The update equation has a computational cost that is cubic in the total number of parameters, roughly O((dm)³) per iteration.
This may not be OK for kernel classification (where d = n) for large n; it is OK for linear classification if d is not too large.
This is the cost of standard IRLS for ML logistic regression ...but now with a Laplacian prior.
Sequential Update Algorithm for LASSO Logistic Regression
Recall that the objective is the log-likelihood minus λ‖β‖₁.
Let us bound only the log-likelihood via the Hessian bound, leaving the ℓ1 term exact.
Then maximize the resulting surrogate only w.r.t. the j-th component of β,
for j = 1, 2, ..., keeping the other components fixed.
Sequential Update Algorithm for LASSO Logistic Regression
The update equation has a simple closed-form expression: a soft threshold applied to a one-dimensional quasi-Newton step.
It can be shown that updating all components has a cost that grows only linearly with the number of parameters, which may be much less than the cubic cost of the parallel update.
It usually also needs fewer iterations, since we do not bound the prior and the update is "incremental".
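A sketch of one cyclic sweep of such updates for the binary case (labels in {0,1}), using the per-coordinate curvature bound (1/4) Σ_i H_ij²; the surrogate along coordinate j is a quadratic plus λ|β_j|, whose maximizer is a soft threshold. The bookkeeping of the linear predictor and all names are mine; the slides treat the multinomial case:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def sequential_lasso_logistic(H, y, lam, n_sweeps=50):
    """Cyclic coordinate-wise bound optimization for L1-penalized binary logistic
    regression (sketch): only the log-likelihood is bounded by a quadratic, the
    L1 term is kept exact, giving a closed-form (soft-threshold) update."""
    n, d = H.shape
    beta = np.zeros(d)
    Bdiag = 0.25 * np.sum(H ** 2, axis=0) + 1e-12      # per-coordinate curvature bound
    eta = H @ beta                                     # linear predictor, kept incrementally
    for _ in range(n_sweeps):
        for j in range(d):
            p = 1.0 / (1.0 + np.exp(-eta))
            g_j = H[:, j] @ (y - p)                    # j-th gradient component
            new = soft(beta[j] + g_j / Bdiag[j], lam / Bdiag[j])
            eta += H[:, j] * (new - beta[j])           # incremental update of H @ beta
            beta[j] = new
    return beta
```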
Sequential Update Algorithm for Ridge Logistic Regression
With a Gaussian prior, the update equation also has a simple closed-form expression.
For λ = 0 (no penalty), we get the update rule for ML logistic regression.
Important issue: how to choose the order of update, i.e., which component to update next?
In all the results presented, we use a simple cyclic schedule.
We’re currently investigating other alternatives (good, but cheap).
Related Work
Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave; results may depend critically on initialization and order of update.
Kernel logistic regression; the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior), but by early stopping of a greedy algorithm.
Efficient algorithm for the SVM with an ℓ1 penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective.
Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet/easily applied to logistic regression.
Experimental Results
Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast.
Penalty weight adjusted by cross-validation.
Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight)
Summary of datasets:
Not exactly CV, but 30 different 50/12 splits
No CV, fixed standard split
Experimental Results
All classifiers are kernel classifiers (Gaussian RBF and linear kernels). For the RBF, the width was tuned by CV with the SVM and then used for all other methods.
Linear classification of AML/ALL (no kernel): 1 error, 81 (of 7129) features (genes) selected.
[Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]
Results: number of errors and number of kernels for each method (table not recovered).
BMSLR: Bayesian multinomial sparse logistic regression; BMGLR: Bayesian multinomial Gaussian logistic regression.
Non-linear Feature Selection
We have considered a fixed dictionary of functions {φ_1(·), ..., φ_d(·)},
i.e., feature selection is done on parameters appearing linearly, or "generalized linearly".
Let us now look at "non-linear parameters", i.e., parameters inside the dictionary: φ_j(x, θ).
Non-linear Feature Selection
For logistic regression, both the "linear" parameters β and the dictionary parameters θ now have to be estimated.
We need to further constrain the problem... consider parameterizations of the following type, for kernels:
Polynomial: k_θ(x, x_i) = (1 + Σ_l θ_l x_l x_il)^r;
Gaussian: k_θ(x, x_i) = exp(−Σ_l θ_l (x_l − x_il)²).
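A sketch of the two scaled kernels just described (one non-negative scale θ_l per original feature); function names are mine:

```python
import numpy as np

def scaled_rbf_kernel(x, z, theta):
    """Gaussian kernel with per-feature scales: k(x, z) = exp(-sum_l theta_l (x_l - z_l)^2).
    Setting theta_l = 0 removes feature l from the kernel entirely."""
    return np.exp(-np.sum(theta * (x - z) ** 2))

def scaled_poly_kernel(x, z, theta, r=2):
    """r-th order polynomial kernel with per-feature scales:
    k(x, z) = (1 + sum_l theta_l * x_l * z_l)^r."""
    return (1.0 + np.sum(theta * x * z)) ** r
```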
Feature Scaling/Selection

This corresponds to a different scaling θ_l of each original feature.
We can also adopt a Laplacian prior for θ: sparseness of θ means feature selection.
Estimation criterion: maximize the logistic log-likelihood plus the (Laplacian) log-priors on β and θ.
EM Algorithm for Feature Scaling/Selection
Optimization problem: maximize the penalized log-likelihood jointly over (β, θ).
We use again a bound optimization algorithm (BOA),
with the Hessian bound for the log-likelihood and the quadratic bound for the Laplacian log-priors.
Maximizing the resulting surrogate can't be done in closed form:
it is easy to maximize w.r.t. β, with θ fixed;
maximization w.r.t. θ is done by conjugate gradient; the necessary gradients are easy to derive.
JCFO: Joint Classifier and Feature Optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].
Experimental Results on Gene Expression Data (Full LOOCV)
Accuracy (%), full LOOCV:

Method                          AML/ALL   Colon
Boosting                          95.8     72.6
SVM (linear kernel)               94.4     77.4
SVM (quadratic kernel)            95.8     74.2
RVM (no kernel)                   97.2     88.7
Logistic (no kernel)              97.2     71.0
Sparse probit (quadr. kernel)     95.8     84.6
Sparse probit (linear kernel)     97.2     91.9
JCFO (quadratic kernel)           98.6     88.7
JCFO (linear kernel)             100       96.8
[Ben-Dor et al, 2000]
[Krishnapuram et al, 2002]
Typically around 25~30 genes selected, i.e., with non-zero weights.
Top 12 Genes for AML/ALL (sorted by mean |θ_i|)
[Gene table not recovered; asterisks mark the genes discussed below.]
* Agree with [Golub et al, 1999]; many others in the top 25. Antibodies to MPO are used in clinical diagnosis of AML.
Top 12 Genes for Colon (sorted by mean θ_i)
[Gene table not recovered; asterisks mark the genes discussed below.]
* Known to be implicated in colon cancer.
PART II
Non-trivial Bounds for Sparse Classifiers
Introduction
Training data (here, we consider only binary problems): D = {(x_1, y_1), ..., (x_n, y_n)}, y_i ∈ {−1, +1},
assumed to be i.i.d. from an underlying distribution P(x, y).
Given a classifier, two quantities are of interest:
True generalization error (not computable, since P is unknown): the probability, under P, that the classifier disagrees with y.
Sample error: the fraction of the training set that the classifier misclassifies.
Key question: how are the two related?
PAC Performance Bounds
PAC (probably approximately correct) bounds are of the form: with probability at least 1 − δ over the draw of the training set, the generalization error does not exceed the sample error plus a complexity-dependent term,
and they hold independently of the underlying distribution P.
Usually, the bounds hold uniformly over the class of classifiers under consideration.
PAC Performance Bounds
There are several ways to derive such bounds:
- Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998])
VC usually leads to trivial bounds (>1, unless n is huge).
- Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]
Compression bounds are not applicable to point sparse classifiers of the type herein presented, or of the RVM type [Herbrich, 2000].
We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].
Some Definitions
β̂: some point "estimate" of β. Let q(β) be a Laplacian centered at β̂;
we'll call this the "posterior", although not in the usual sense.
Point classifier (PC) at β̂: classify with β̂ itself; the one we're interested in.
Gibbs classifier (GC) at q: for each input, classify with a sample β drawn from q.
Bayes voting classifier (BVC) at q: classify by the (weighted) majority vote of the classifiers under q.
Key Lemmas
Lemma 1: for any input x, the decision of the PC with β̂ is the same as that of a BVC based on any symmetric posterior centered on β̂.
Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]).
Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC.
Proof: see [Herbrich, 2002].
Conclusion: we can use PAC-Bayesian bounds for GC for our PC.
PAC-Bayesian Theorem
Let π(β) be our prior (meaning it is independent of the data D).
Let q(β) be our posterior (meaning it may depend on D).
Generalization error for a Gibbs classifier: e_GC = E_{β~q}[generalization error of the classifier with parameters β].
Expected sample/empirical error for a Gibbs classifier: ê_GC = E_{β~q}[sample error of the classifier with parameters β].
McAllester’s PAC-Bayesian theorem relates these two errors.
PAC-Bayesian Theorem
Theorem: with π and q as defined above, the following inequality holds with probability at least 1 − δ over random training samples of size n:
kl(ê_GC ‖ e_GC) ≤ [KL(q ‖ π) + ln(n/δ)] / (n − 1),
where kl(·‖·) is the Kullback-Leibler divergence between two Bernoulli distributions, and KL(q ‖ π)
is the Kullback-Leibler divergence between posterior and prior.
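In practice the theorem is used by inverting the binary KL: given the empirical Gibbs error and the right-hand side, find the largest true error consistent with the inequality. A sketch (the expression for `rhs` follows the form of the theorem as stated above and should be adapted if a different version is used; the bisection itself is generic):

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def invert_kl(emp_err, rhs, tol=1e-9):
    """Largest p >= emp_err such that kl(emp_err || p) <= rhs (bisection;
    kl(emp_err || .) is increasing on [emp_err, 1])."""
    lo, hi = emp_err, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers only: n training points, confidence 1 - delta,
# KL(q || pi) already computed for the Laplacian posterior/prior pair.
n, delta, kl_q_pi, emp_gibbs_err = 500, 0.05, 12.3, 0.08
rhs = (kl_q_pi + np.log(n / delta)) / (n - 1)
bound_on_gibbs_err = invert_kl(emp_gibbs_err, rhs)
```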
Tightening the PAC-Bayesian Theorem
With our Laplacian prior and Laplacian "posterior", KL(q ‖ π) can be computed in closed form, and one can show that it is a convex function of the posterior scale.
Since we can choose the posterior scale freely: due to the convexity of the KLD, it is easy to (numerically) find the scale that minimizes it.
Using the PAC-Bayesian Theorem
Set a prior parameter and choose a confidence level δ.
Using this prior and the training data, find a point estimate β̂.
With this β̂, define the "posterior" q as above, and find the posterior scale for which KL(q ‖ π) is minimized.
Evaluate the corresponding empirical Gibbs error by Monte Carlo.
From these, we know that, with probability at least 1 − δ, the inequality of the theorem holds with the computed quantities.
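The Monte Carlo step can be sketched as follows: draw parameter vectors from the Laplacian "posterior" centered at the point estimate and average the sample error of the resulting linear classifiers (labels in {-1,+1}; all names and the sample size are mine):

```python
import numpy as np

def gibbs_empirical_error(beta_hat, X, y, scale, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected sample error of the Gibbs classifier:
    beta ~ Laplacian(beta_hat, scale), decision sign(X beta), labels y in {-1,+1}."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_samples):
        beta = beta_hat + rng.laplace(0.0, scale, size=beta_hat.shape)
        errs.append(np.mean(np.sign(X @ beta) != y))
    return float(np.mean(errs))
```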
Using the PAC-Bayesian Theorem

To obtain an explicit bound on the generalization error of the Gibbs classifier, we invert the binary KL divergence: the largest error consistent with the inequality can easily be found numerically.
The resulting bound is always non-trivial, i.e., it is strictly below 1.
Using the PAC-Bayesian Theorem
Finally, notice that the PAC-Bayesian bound applies tothe Gibbs classifier, but recall Lemma 2.
Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC.
In practice, we have observed that the BVC usually generalizesas well as (often much better than) the GC. We believe the factor2 can be reduced, but we have not yet been able to show it.
Example of PAC-Bayesian Bound
“Mines” dataset
Maybe tight enough to guide the selection of the regularization (prior) parameter.
Conclusions for Part II
- PAC-Bayesian bound for sparse classifiers.
- The bound (unlike VC bounds) is always non-trivial.
- Tightness still requires large sample sizes.
- Future goals: tightening the bounds, ...as always.
PART III
Feature Selection in Unsupervised Learning
Feature Selection in Unsupervised Learning

FS is a widely studied problem in supervised learning.
FS is not widely studied in unsupervised learning (clustering).
A good reason: in the absence of labels, how do you assess the usefulness of each feature?
Approach: how relevant is each feature (component of x) for the mixture nature of the data?
We address this problem in the context of model-based clustering using finite mixtures: p(x) = Σ_i α_i p(x | θ_i).
Example of Relevant and Irrelevant Features

Example: x1 is relevant for the mixture nature of this data; x2 is irrelevant for the mixture nature of this data.
Any PCA-type analysis of this data would not be useful here.
Interplay Between Number of Clusters and Features
Example:
Using only x1, we find 2 components.
Using x1 and x2, we find 7 components (needed to fit the non-Gaussian marginal density of x2).
Approaches to Feature Selection

Most classical FS methods for supervised learning require combinatorial searches: for d features, there are 2^d possible feature subsets.
Alternative: assign real-valued feature weights and encourage sparseness (as seen above for supervised learning).
[Law, Figueiredo and Jain, TPAMI, 2004 (to appear)]
Maximum Likelihood Estimation and Missing Data
Training data: X = {x_1, ..., x_n}, assumed i.i.d. from the mixture.
Maximum likelihood estimate of the mixture parameters: maximize log p(X | θ), where θ collects the mixing probabilities and the component parameters.
Missing data: Z = {z_1, ..., z_n}, with one-of-k encoding: z_j indicates which component i generated x_j.
"Complete" log-likelihood: log p(X, Z | θ), ...which would be easy to maximize, if we had Z.
EM Algorithm for Maximum Likelihood Estimation
E-step: compute w_ij, the current estimate of the probability that x_j was produced by component i.
M-step: re-estimate the mixing probabilities and the component parameters using the w_ij as soft assignments; this works because the complete log-likelihood is linear in the missing indicators.
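For reference, a compact sketch of these E- and M-steps for a Gaussian mixture with diagonal covariances (a minimal illustration, not the exact implementation behind the results reported later; all names are mine):

```python
import numpy as np

def em_gaussian_mixture(X, k, n_iter=100, seed=0):
    """Standard EM for a k-component Gaussian mixture with diagonal covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # component means
    var = np.tile(X.var(axis=0), (k, 1))               # component variances
    alpha = np.ones(k) / k                             # mixing probabilities
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = P(component i | point j)
        logp = np.stack([np.log(alpha[i])
                         - 0.5 * np.sum((X - mu[i]) ** 2 / var[i]
                                        + np.log(2 * np.pi * var[i]), axis=1)
                         for i in range(k)])
        logp -= logp.max(axis=0)
        w = np.exp(logp)
        w /= w.sum(axis=0)
        # M-step: weighted ML estimates
        Nk = w.sum(axis=1)
        alpha = Nk / n
        mu = (w @ X) / Nk[:, None]
        var = np.stack([(w[i] @ (X - mu[i]) ** 2) / Nk[i] for i in range(k)]) + 1e-6
    return alpha, mu, var
```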
Model Selection for Mixtures
Important issue: how to select k (number of components)?
[Figueiredo & Jain, 2002]: MML-based approach; roughly, an MDL/BIC with a careful definition of the amount of data from which each parameter is estimated.
The resulting criterion leads to a simple modification of EM.
Original M-step expression for the mixing probabilities: α̂_i = (1/n) Σ_j w_ij.
New expression: α̂_i ∝ max{0, Σ_j w_ij − N/2}, where N is the number of parameters of each component.
This update "kills" weak components.
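A sketch of that modified update, following the formula above (w holds the E-step responsibilities, one row per component; the guard against an all-zero numerator is mine):

```python
import numpy as np

def mml_alpha_update(w, N):
    """MML-modified M-step for the mixing probabilities.
    w : (k, n) responsibilities from the E-step; N : parameters per component.
    Components whose support sum_j w[i, j] falls below N/2 get alpha_i = 0."""
    support = np.maximum(w.sum(axis=1) - N / 2.0, 0.0)
    total = support.sum()
    return support / total if total > 0 else support
```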
Feature Selection for Mixtures
Simplifying assumption: in each component, the features are independent: p(x | θ_i) = Π_l p(x_l | θ_il),
where θ_il are the parameters of the density of the l-th feature in component i.
Let some features have a common density in all components. These features are "irrelevant":
x_l ~ p(x_l | θ_il) if feature l is relevant; x_l ~ q(x_l | λ_l) if feature l is irrelevant,
where the q(· | λ_l) are the common (w.r.t. i) densities of the irrelevant features.
Feature Saliency

The likelihood combines, for each feature, the component-specific density (if the feature is relevant) and the common density (if it is irrelevant).
To apply EM, we treat the binary relevance indicators as missing data and define ρ_l = P(feature l is relevant); we call it the "feature saliency".
It can be shown that the resulting likelihood (marginal w.r.t. the indicators) is
p(x) = Σ_i α_i Π_l [ ρ_l p(x_l | θ_il) + (1 − ρ_l) q(x_l | λ_l) ].
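A sketch of this density for the univariate-Gaussian case used in the experiments (component-specific densities p(·|θ_il) and common densities q(·|λ_l) both Gaussian; all names are mine):

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def saliency_mixture_density(x, alpha, rho, mu, var, mu_q, var_q):
    """p(x) = sum_i alpha_i * prod_l [ rho_l p(x_l|theta_il) + (1 - rho_l) q(x_l|lambda_l) ].
    x: (d,);  alpha: (k,);  rho: (d,) feature saliencies;
    mu, var: (k, d) component-specific Gaussians;  mu_q, var_q: (d,) common density."""
    rel = norm_pdf(x, mu, var)                 # (k, d): component-specific densities
    irr = norm_pdf(x, mu_q, var_q)             # (d,): common-density values
    per_feature = rho * rel + (1 - rho) * irr  # broadcasts to (k, d)
    return float(np.sum(alpha * np.prod(per_feature, axis=1)))
```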
Applying EM
We address the maximization of this likelihood by EM, using the component labels and the feature-relevance indicators as missing data.
In addition to the w_ij defined above, the E-step now also involves, for each feature of each point, the probabilities that it was generated by the component-specific density or by the common density;
both are easily computed in closed form.
Applying EM
M-step (omitting the variances):
Assuming that p(x_l | θ_il) and q(x_l | λ_l) are both univariate Gaussian with arbitrary mean and variance,
the mean of p(x_l | θ_il) and the mean of q(x_l | λ_l) are updated as weighted sample means, with weights given by the E-step variables.
Model Selection
To perform feature selection, we want to encourage someof the saliencies to become either 0 or 1.
This can be achieved with the same MML-type criterion used above to select k.
The modified M-step is analogous to the modified update for the α_i: the weight of evidence for "feature l is relevant" is penalized by half the number of parameters in the densities p(x_l | θ_il), and the weight for "irrelevant" by half the number of parameters in q(x_l | λ_l), truncating at zero.
Synthetic Example

Mixture of 4 Gaussians (with identity covariance) with d = 10: 2 relevant features and 8 irrelevant features. The component means are
μ1 = (0, 3, 0, ..., 0)^T,  μ2 = (1, 9, 0, ..., 0)^T,  μ3 = (6, 4, 0, ..., 0)^T,  μ4 = (7, 10, 0, ..., 0)^T.
800 samples, projected on the first two dimensions.
[Figure: initial and final estimates of the common density (not recovered).]
Synthetic Example

Feature saliency values (mean ± 1 s.d.) over 10 runs, shown separately for the 2 relevant and the 8 irrelevant features. [Plot not recovered.]
Real Data

- Several standard benchmark data sets:

Name                       n     d   k
Wine                       178   13  3
Wisconsin breast cancer    569   30  2
Image segmentation         2320  18  7
Texture classification     4000  19  4

(n: number of samples; d: number of features; k: number of classes)

- These are standard data sets for supervised classification.
- We fit mixtures, ignoring the labels.
- We classify the points and compare to the labels.
Real Data: Results
Name                       % error (sd), with FS    % error (sd), without FS
Wine                       6.61 (3.91)              8.06 (3.73)
Wisconsin breast cancer    9.55 (1.99)              10.09 (2.70)
Image segmentation         20.19 (1.54)             32.84 (5.10)
Texture classification     4.04 (0.76)              4.85 (0.98)
- For these data-sets, our approach is able to improve the performance of mixture-based unsupervised classification.
Research Directions
In supervised learning:
- More efficient algorithms for logistic regression with the LASSO prior.
- Investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (ridge).
- Deriving performance bounds for this type of approach.

In unsupervised learning:
- More efficient algorithms.
- Removing the conditional independence assumption.
- Extension to other mixtures (e.g., multinomial for categorical data).