Feature Selection for Supervised and Unsupervised Learning

Mário A. T. Figueiredo

Institute of Telecommunications, and Instituto Superior Técnico

Technical University of Lisbon, PORTUGAL

Work herein reported was done in collaboration with:

L. Carin, B. Krishnapuram, and A. Hartemink, Duke University; A. K. Jain and M. Law, Michigan State University.

[email protected] www.lx.it.pt/~mtf

UFL, January 2004 M. Figueiredo, IST

Outline

Part I: Supervised Learning
1. Introduction
2. Review of LASSO Regression
3. The LASSO Penalty for Multinomial Logistic Regression
4. Bound Optimization Algorithms (Parallel and Sequential)
5. Non-linear Feature Weighting/Selection
6. Experimental Results

Part II: Performance Bounds

Part III: Unsupervised Learning
1. Introduction
2. Review of Model-Based Clustering with Finite Mixtures
3. Feature Saliency
4. An EM Algorithm to Estimate Feature Saliency
5. Model Selection
6. Experimental Results


Supervised Learning

Goal: to learn a functional dependency...

...from a set of examples (the training data):

- Discriminative (non-generative) approach: no attempt to model the joint density

The functional dependency is indexed by a set of parameters.
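The slide's formulas did not survive; in standard notation (x, y, n, and β are symbols assumed here, not taken from the original), the setup is

$$\mathcal{D}=\{(x_1,y_1),\ldots,(x_n,y_n)\},\qquad y\approx f(x;\beta),$$

where β is the set of parameters; the discriminative approach models only $p(y\mid x,\beta)$, with no attempt at the joint density $p(x,y)$.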


Complexity Control Via Bayes Point Estimation

Bayesian (point estimation) approach:

The prior controls the "complexity" of the parameter estimate.

Good generalization requires complexity control.

Likelihood function

Maximum a posteriori (MAP) point estimate of the parameters:

Prediction for a new “input” :
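A plausible reconstruction of the missing formulas, in the notation introduced above:

$$\hat{\beta}_{\mathrm{MAP}}=\arg\max_{\beta}\big[\log p(\mathcal{D}\mid\beta)+\log p(\beta)\big],\qquad \hat{y}=f(x_{\mathrm{new}};\hat{\beta}_{\mathrm{MAP}}).$$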


Bayes Point Estimate Versus Fully Bayesian Approach

Point prediction for a new “input” :

Point estimate

We will not consider fully Bayesian approaches here.

Fully Bayesian prediction
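For reference, the two alternatives contrasted here are (reconstructed in standard notation) the point prediction $\hat{y}=f(x_{\mathrm{new}};\hat{\beta})$ versus the fully Bayesian predictive distribution

$$p(y\mid x_{\mathrm{new}},\mathcal{D})=\int p(y\mid x_{\mathrm{new}},\beta)\,p(\beta\mid\mathcal{D})\,d\beta.$$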


Linear (w.r.t. the Parameters) Regression

e.g., radial basis functions, splines, wavelets, polynomials,...

We consider functions which are linear w.r.t. the parameters,

where the features are drawn from some dictionary of functions;

Notable particular cases: linear regression:

kernel regression, as in the SVM, RVM, etc.
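In symbols (a reconstruction; $\phi_j$ denotes a generic dictionary element):

$$f(x;\beta)=\beta_0+\sum_{j=1}^{k}\beta_j\,\phi_j(x),$$

with $\phi_j(x)=x_j$ for linear regression and $\phi_j(x)=K(x,x_j)$, a kernel centered on a training point, for kernel regression.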


Likelihood Function for Regression

Likelihood function for a Gaussian observation model,

where H is the design matrix. Assuming that y and the columns of H are centered, we can drop the intercept term, w.l.o.g.

Maximum likelihood / ordinary least squares (OLS) estimate:

...undetermined if H does not have full column rank (e.g., fewer observations than basis functions).
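A reconstruction of the missing formulas, with $H_{ij}=\phi_j(x_i)$:

$$p(y\mid\beta,\sigma^2)=\mathcal{N}(y\mid H\beta,\,\sigma^2 I),\qquad \hat{\beta}_{\mathrm{OLS}}=(H^{T}H)^{-1}H^{T}y,$$

which is undetermined when $H^{T}H$ is singular.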


Bayesian Point Estimates: Ridge and the LASSO

With a Gaussian prior, the MAP estimate is called ridge regression, or weight decay (in neural nets parlance).

With a Laplacian prior, we obtain LASSO regression [Tibshirani, 1996]; see also pruning priors for NNs [Williams, 1995] and basis pursuit [Chen, Donoho, Saunders, 1995].

The Laplacian prior promotes sparseness of the estimate, i.e., its components are either significantly large or exactly zero: feature selection.
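In equation form (standard, reconstructed):

$$\hat{\beta}_{\mathrm{ridge}}=\arg\min_{\beta}\|y-H\beta\|_2^2+\lambda\|\beta\|_2^2=(H^{T}H+\lambda I)^{-1}H^{T}y,\qquad \hat{\beta}_{\mathrm{LASSO}}=\arg\min_{\beta}\|y-H\beta\|_2^2+\lambda\|\beta\|_1,$$

the latter being the MAP estimate under a Laplacian prior $p(\beta)\propto e^{-\alpha\|\beta\|_1}$ (with $\lambda$ absorbing the noise variance and the prior parameter).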


Algorithms to Compute the LASSO

Special-purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000].

For orthogonal H, there is a closed-form solution: the "soft threshold".

For more insight on the LASSO, see [Tibshirani, 1996].

Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]; currently the best approach.
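For orthogonal $H$ (i.e., $H^{T}H=I$), the closed-form solution is the soft-thresholded OLS estimate; for the criterion written above,

$$\hat{\beta}_j=\operatorname{sign}\big(\hat{\beta}_j^{\mathrm{OLS}}\big)\max\Big\{0,\;\big|\hat{\beta}_j^{\mathrm{OLS}}\big|-\tfrac{\lambda}{2}\Big\}.$$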


EM Algorithm for the LASSO

The LASSO estimate can be computed by EM, using a hierarchical formulation: each coefficient has a zero-mean Gaussian prior with its own variance, and these variances are independent, with exponential hyperpriors (so the marginal prior is Laplacian).

This possibility was mentioned in [Tibshirani, 1996]; it is not very efficient, but...

Treat the per-coefficient variances as missing data and apply standard EM. This leads to an update which can be called an iteratively reweighted ridge regression (IRRR).
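A sketch of the hierarchical model and the resulting update (notation assumed here; see [Figueiredo, PAMI 2003] for the exact expressions): $\beta_j\mid\tau_j\sim\mathcal{N}(0,\tau_j)$ with independent exponential hyperpriors on the $\tau_j$, whose marginal is a Laplacian. The E-step computes $\mathbb{E}[\tau_j^{-1}\mid\beta_j^{(t)}]\propto 1/|\beta_j^{(t)}|$, and the M-step is a ridge-type solve,

$$\beta^{(t+1)}=\big(H^{T}H+\sigma^{2}\Lambda^{(t)}\big)^{-1}H^{T}y,\qquad \Lambda^{(t)}=\operatorname{diag}\big(\ldots,\mathbb{E}[\tau_j^{-1}\mid\beta_j^{(t)}],\ldots\big),$$

hence the name iteratively reweighted ridge regression.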


About

The previous derivation opens the door to other priors.

For example, a Jeffreys prior:

The EM algorithm becomes (still IRRR-type):

Interestingly, this is similar to the FOCUSS algorithm for regression with a sparseness-inducing penalty [Kreutz-Delgado & Rao, 1998]. Strong sparseness!

Problem: the objective is non-convex, so the results depend on the initialization. One possibility: initialize with the OLS estimate.


Some Results

Same coefficient vectors as in [Tibshirani, 1996]:

Design matrices and experimental procedure as in [Tibshirani, 1996]

Model error (ME) improvement w.r.t. OLS estimate:

Close to best in each case, without any cross-validation; more results in [Figueiredo, NIPS 2001] and [Figueiredo, PAMI 2003].


Classification via Logistic Regression

Binary classification:

Multi-class, with “1 of m” encoding:

Class

Recall that the features may be the components of x themselves, or other (nonlinear) functions of x, such as kernels.
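A reconstruction of the multinomial logistic model used here (with $h(x)$ the feature vector and $\beta^{(c)}$ the parameter vector of class $c$):

$$P(y=c\mid x)=\frac{\exp\big(\beta^{(c)T}h(x)\big)}{\sum_{c'=1}^{m}\exp\big(\beta^{(c')T}h(x)\big)},\qquad c=1,\ldots,m,$$

which reduces to the usual binary logistic model for $m=2$.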


Classification via Logistic Regression

Since the class probabilities sum to one, we can set one of the parameter vectors to zero, w.l.o.g.

Parameters to estimate:

Maximum log-likelihood estimate:

If the training data is separable, the log-likelihood is unbounded, and thus the ML estimate is undefined.


Penalized (point Bayes) Logistic Regression

Penalized (or point Bayes MAP) estimate

where

Gaussian prior: penalized (ridge-type) logistic regression.

Laplacian prior (LASSO prior): favors sparseness, i.e., feature selection.

For linear regression it does; what about for logistic regression?


Laplacian Prior for Logistic Regression

Simple test with 2 training points (one from class 1, one from class -1): linear logistic regression with a Gaussian prior versus a Laplacian prior.

With the Laplacian prior, as one of the weights decreases, the corresponding feature becomes less relevant.

[Figure: decision boundaries obtained with the two priors.]


Algorithms for Logistic Regression

where

Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS)

IRLS is easily applied without any prior, or with a Gaussian prior.

IRLS is not applicable with the Laplacian prior: the l1 penalty is not differentiable.

Alternative: bound optimization algorithms.

For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992].

More general formulations [de Leeuw & Michailides, 1993], [Lange, Hunter, & Yang, 2000].


Bound Optimization Algorithms (BOA)

Optimization problem: maximize an objective function of the parameters.

Bound optimization algorithm: iteratively maximize a surrogate function, constructed so that the gap between the objective and the surrogate attains its minimum at the current estimate, with equality if and only if evaluated there.

This condition is sufficient (in fact more than sufficient) to prove monotonicity.

Notes: the surrogate should be easy to maximize; EM is a BOA.
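The missing formulas can be stated as follows (using $l$ for the objective and $Q$ for the surrogate, a notation assumed here): the problem is $\hat{\beta}=\arg\max_{\beta}l(\beta)$, and the algorithm iterates

$$\beta^{(t+1)}=\arg\max_{\beta}Q\big(\beta\mid\beta^{(t)}\big),\quad\text{where}\quad l(\beta)-Q\big(\beta\mid\beta^{(t)}\big)\ \ge\ l\big(\beta^{(t)}\big)-Q\big(\beta^{(t)}\mid\beta^{(t)}\big)\ \text{ for all }\beta,$$

with equality if and only if $\beta=\beta^{(t)}$; monotonicity, $l(\beta^{(t+1)})\ge l(\beta^{(t)})$, follows immediately.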


Deriving Bound Functions

Many ways to obtain bound functions

For example, it is well known that Jensen's inequality underlies EM.

Via a Hessian bound: suppose the objective is concave, with its Hessian bounded below (in the matrix sense) by the negative of a fixed positive definite matrix.

We can then use the right-hand side of the resulting quadratic lower bound as the surrogate.
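In symbols (a standard reconstruction): if $\nabla^{2}l(\beta)\succeq -B$ for all $\beta$, with $B$ positive definite, then

$$l(\beta)\ \ge\ l\big(\beta^{(t)}\big)+\big(\beta-\beta^{(t)}\big)^{T}g^{(t)}-\tfrac{1}{2}\big(\beta-\beta^{(t)}\big)^{T}B\big(\beta-\beta^{(t)}\big),\qquad g^{(t)}=\nabla l\big(\beta^{(t)}\big),$$

and the right-hand side can serve as $Q(\beta\mid\beta^{(t)})$.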


Quasi-Newton Monotonic Algorithm

Update equation

Maximizing the surrogate is simple and leads to the update below.
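In the notation above, maximizing the quadratic surrogate gives (reconstructed)

$$\beta^{(t+1)}=\beta^{(t)}+B^{-1}g^{(t)},$$

a Newton-like step with the fixed matrix $B$ in place of the (varying) Hessian.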

This is a quasi-Newton algorithm, with B replacing the Hessian.

Unlike the Newton algorithm, it is monotonic.


Application to ML Logistic Regression

For logistic regression, it can be shown that [Böhning, 1992] the Hessian is bounded below by a fixed matrix involving a Kronecker product (see the sketch below).

It is also easy to compute the gradient, and finally to plug both into the update.

Under a ridge-type Gaussian prior, the matrix inverse required by the update can be computed off-line.
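A sketch of the bound (reconstructed, up to the ordering conventions of the Kronecker product; $m$ is the number of classes and $H$ the design matrix):

$$-\nabla^{2}l(\beta)\ \preceq\ B=\frac{1}{2}\Big[I_{m-1}-\tfrac{1}{m}\mathbf{1}\mathbf{1}^{T}\Big]\otimes H^{T}H,$$

so $B$, and hence the inverse needed by the quasi-Newton update, does not depend on $\beta$.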


Application to LASSO Logistic Regression

For LASSO logistic regression

The log-likelihood term is already bounded via the Hessian bound; we need a bound for the log-prior.

A quadratic bound works: it is easy to show that the absolute value of each component is upper-bounded by a quadratic built around the current estimate,

...with equality iff the component has the same magnitude as the current value.
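The quadratic bound in question is (a standard reconstruction): for any current value $\beta_j^{(t)}\neq 0$,

$$|\beta_j|\ \le\ \frac{\beta_j^{2}}{2\,|\beta_j^{(t)}|}+\frac{|\beta_j^{(t)}|}{2},$$

with equality iff $|\beta_j|=|\beta_j^{(t)}|$; it follows from $\big(|\beta_j|-|\beta_j^{(t)}|\big)^{2}\ge 0$.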


Application to LASSO Logistic Regression

After dropping additive terms,

where

The update equation is an IRRR

which can be rewritten as

where


Application to LASSO Logistic Regression

The update equation

has computational cost

This may not be acceptable for kernel classification with a large number of training samples.

It is fine for linear classification if the number of features is not too large.

This is the same cost as standard IRLS for ML logistic regression... but now with a Laplacian prior.


Sequential Update Algorithm for LASSO Logistic Regression

Recall that

Let us bound only the log-likelihood via the Hessian bound, leaving the log-prior untouched.

We then maximize the resulting surrogate w.r.t. a single component of the parameter vector at a time, cycling over the components.


Sequential Update Algorithm for LASSO Logistic Regression

The update equation for each component has a simple closed-form expression.

It can be shown that updating all components has a total cost that may be much less than that of the parallel update.

It usually also needs fewer iterations, since we do not bound the prior and the algorithm is "incremental". A sketch of the idea is given below.
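To make the flavour of the sequential scheme concrete, here is a minimal sketch in Python for the binary case, using the simple per-coordinate curvature bound (1/4) h_j^T h_j (the multinomial algorithm in the talk uses Böhning's bound instead); all names are illustrative, not taken from the original work.

import numpy as np

def soft_threshold(v, t):
    # sign(v) * max(|v| - t, 0)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sequential_l1_logistic(H, y, lam, n_sweeps=50):
    # Cyclic coordinate-wise bound optimization for binary L1-penalized
    # logistic regression, with labels y in {0, 1}.  Along coordinate j the
    # log-likelihood curvature is bounded by B_jj = 0.25 * sum_i H[i, j]**2,
    # so each step maximizes a quadratic lower bound minus the L1 penalty,
    # which has the closed-form soft-threshold solution.
    n, d = H.shape
    beta = np.zeros(d)
    Bjj = 0.25 * np.sum(H * H, axis=0)          # per-coordinate curvature bounds
    for _ in range(n_sweeps):
        for j in range(d):
            p = 1.0 / (1.0 + np.exp(-(H @ beta)))   # current P(y = 1 | x)
            g_j = H[:, j] @ (y - p)                 # d(log-lik)/d(beta_j)
            v = beta[j] + g_j / Bjj[j]              # maximizer of the bound
            beta[j] = soft_threshold(v, lam / Bjj[j])
    return beta

# Example usage on synthetic data:
# rng = np.random.default_rng(0)
# H = rng.normal(size=(100, 20))
# y = (H[:, 0] - H[:, 1] + 0.1 * rng.normal(size=100) > 0).astype(float)
# print(sequential_l1_logistic(H, y, lam=1.0))

Note that the gradient is recomputed for every coordinate step here, which keeps the sketch obviously correct but is costlier than the bookkeeping used in the actual algorithm.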


Sequential Update Algorithm for Ridge Logistic Regression

With a Gaussian prior, the update equation

also has a simple closed-form expression.

In the limit of a flat (uninformative) prior, we recover the update rule for ML logistic regression.

Important issue: how to choose the order of update, i.e., which component to update next?

In all the results presented, we use a simple cyclic schedule.

We’re currently investigating other alternatives (good, but cheap).


Related Work

- Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave, so the results may depend critically on the initialization and on the order of the updates.

- Kernel logistic regression: the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior) but by early stopping of a greedy algorithm.

- Efficient algorithm for the SVM with an l1 penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: an efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective.

- Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet/easily applied to logistic regression.


Experimental Results

Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast.

Penalty weight adjusted by cross-validation.

Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight)

Summary of datasets:

Not exactly CV, but 30 different 50/12 splits

No CV, fixed standard split


Experimental Results

All classifiers here are kernel classifiers (Gaussian RBF and linear kernels). For the RBF kernel, the width was tuned by CV for the SVM and then used for all the other methods.

Linear classification of AML/ALL (no kernel): 1 error, 81 (of 7129) features (genes) selected.

[Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]

Results: [table of the number of errors and the number of kernels for each classifier and dataset].

BMSLR = Bayesian multinomial sparse logistic regression; BMGLR = Bayesian multinomial Gaussian logistic regression.


Non-linear Feature Selection

We have considered a fixed dictionary of functions

i.e., feature selection is done on parameters appearing linearly

or “generalized linearly”

Let us now look at “non-linear parameters”, i.e., inside the dictionary:


Non-linear Feature Selection

For logistic regression,

We need to further constrain the problem... consider parameterizations of the type:

For kernels: polynomial or Gaussian, with a scaling parameter for each original feature (a sketch is given below).
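The parameterizations used in the JCFO work can be written as (reconstructed; $\theta_l\ge 0$ is the scale given to original feature $l$, and $r$ the polynomial order):

$$k_{\theta}(x,x')=\Big(1+\sum_{l}\theta_{l}x_{l}x'_{l}\Big)^{r}\ \ \text{(polynomial)},\qquad k_{\theta}(x,x')=\exp\Big(-\sum_{l}\theta_{l}(x_{l}-x'_{l})^{2}\Big)\ \ \text{(Gaussian)},$$

so that $\theta_l=0$ removes feature $l$ from the kernel entirely.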


Feature Scaling/Selection

This corresponds to a different scaling of each original feature.

We can also adopt a Laplacian prior for the scaling parameters; sparseness of the scaling vector then corresponds to feature selection.

Estimation criterion: the logistic log-likelihood plus the log-priors.


EM Algorithm for Feature Scaling/Selection

Optimization problem:

We use again a bound optimization algorithm (BOA),

with the Hessian bound for the log-likelihood and the quadratic bound for the log-prior.

Maximizing the resulting surrogate can't be done in closed form.

It is easy to maximize w.r.t. the linear parameters, with the scaling parameters fixed.

Maximization w.r.t. the scaling parameters is done by conjugate gradient; the necessary gradients are easy to derive.

JCFO – Joint classifier and feature optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].


Experimental Results on Gene Expression Data (Full LOOCV)

Accuracy (%):

Method                             AML/ALL   Colon
Boosting                              95.8    72.6
SVM (linear kernel)                   94.4    77.4
SVM (quadratic kernel)                95.8    74.2
RVM (no kernel)                       97.2    88.7
Logistic (no kernel)                  97.2    71.0
Sparse probit (quadratic kernel)      95.8    84.6
Sparse probit (linear kernel)         97.2    91.9
JCFO (quadratic kernel)               98.6    88.7
JCFO (linear kernel)                 100.0    96.8

See also [Ben-Dor et al., 2000] and [Krishnapuram et al., 2002].

Typically around 25-30 genes are selected (i.e., have non-zero scaling parameters).


Top 12 Genes for AML/ALL (sorted by the mean magnitude of the corresponding scaling parameter)

[Table of the top 12 genes; several entries are starred.]

* Agree with [Golub et al., 1999]; many others in the top 25. Antibodies to MPO are used in the clinical diagnosis of AML.


Top 12 Genes for Colon (sorted by the mean magnitude of the corresponding scaling parameter)

[Table of the top 12 genes; several entries are starred.]

* Known to be implicated in colon cancer.


PART II

Non-trivial Bounds for Sparse Classifiers


Introduction

Training data (here, we consider only binary problems):

assumed to be i.i.d. from an underlying distribution

Key question: how are the two related?

Given a classifier,

True generalization error (not computable, since the underlying distribution is unknown):

Sample error:
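In symbols (a standard reconstruction, with $D$ the underlying distribution and $f$ the classifier):

$$e(f)=P_{(x,y)\sim D}\big[f(x)\neq y\big],\qquad \hat{e}(f)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\big[f(x_{i})\neq y_{i}\big].$$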


PAC Performance Bounds

PAC (probably approximately correct) bounds are of the form:

and hold independently of the underlying distribution.

Usually, bounds have the form below,

uniformly over the class of classifiers under consideration.
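That is, bounds of the (reconstructed) form

$$P\Big(\forall f\in\mathcal{F}:\ e(f)\ \le\ \hat{e}(f)+\varepsilon(n,\delta,\mathcal{F})\Big)\ \ge\ 1-\delta,$$

where $\varepsilon$ grows with the "complexity" of the class $\mathcal{F}$ and shrinks with the sample size $n$.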


PAC Performance Bounds

There are several ways to derive such bounds:

- Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998])

VC usually leads to trivial bounds (>1, unless n is huge).

- Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]

Compression bounds are not applicable to point sparse classifiers of the type herein presented, or of the RVM type [Herbrich, 2000].

We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].


Some Definitions

Consider some point "estimate" of the parameter vector, and let the "posterior" be a Laplacian centered at that estimate; we'll call this the "posterior", although not in the usual sense.

Point classifier (PC): the classifier defined by the point estimate itself; the one we're interested in.

Gibbs classifier (GC): classify each input using a parameter vector sampled from the "posterior".

Bayes voting classifier (BVC): classify by majority vote (averaging decisions) under the "posterior".


Key Lemmas

Lemma 1: for any input, the decision of the PC is the same as that of a BVC based on any symmetric posterior centered on the point estimate.

Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]).

Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC.

Proof: see [Herbrich, 2002].

Conclusion: we can use PAC-Bayesian bounds for the GC to obtain bounds for our PC.


PAC-Bayesian Theorem

Let the prior be a distribution over classifiers chosen independently of the training data.

Let the "posterior" be a distribution over classifiers that may depend on the training data.

Generalization error for a Gibbs classifier:

Expected sample/empirical error for a Gibbs classifier:

McAllester’s PAC-Bayesian theorem relates these two errors.


PAC-Bayesian Theorem

Theorem: with the prior and "posterior" as defined above, the following inequality holds with probability at least 1 - δ over random training samples of size n,

where the left-hand side is the Kullback-Leibler divergence between two Bernoulli distributions, and

the right-hand side involves the Kullback-Leibler divergence between the "posterior" and the prior.
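In the form given by [Seeger, 2002] (reconstructed; $P$ is the prior, $Q$ the "posterior", and $e_{G}$, $\hat{e}_{G}$ the Gibbs generalization and expected sample errors):

$$\mathrm{kl}\big(\hat{e}_{G}(Q)\,\big\|\,e_{G}(Q)\big)\ \le\ \frac{\mathrm{KL}(Q\,\|\,P)+\ln\frac{n+1}{\delta}}{n},$$

where $\mathrm{kl}(\cdot\|\cdot)$ denotes the KL divergence between Bernoulli distributions with the given parameters.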


Tightening the PAC-Bayesian Theorem

With our Laplacian prior and posterior, we have:

and show that

Due to the convexity of the KLD, it is easy to (numerically) find


Since we can choose the prior parameter freely:


Using the PAC-Bayesian Theorem

Set a prior parameter and choose a confidence level

Find such that

Using this prior and the training data, find a point estimate.

With this point estimate, define the "posterior" as above,

and evaluate the corresponding expected sample error of the Gibbs classifier by Monte Carlo.

From these, we know that the bound holds with probability at least the chosen confidence level.


Using the PAC-Bayesian Theorem

To obtain an explicit bound on the generalization error of the Gibbs classifier, we invert the Bernoulli Kullback-Leibler divergence; this can easily be done numerically.

The resulting bound is always non-trivial, i.e., smaller than 1.


Using the PAC-Bayesian Theorem

Finally, notice that the PAC-Bayesian bound applies to the Gibbs classifier, but recall Lemma 2.

Lemma 2: for any "posterior", the generalization error of the BVC is less than twice that of the GC.

In practice, we have observed that the BVC usually generalizes as well as (often much better than) the GC. We believe the factor 2 can be reduced, but we have not yet been able to show it.


Example of PAC-Bayesian Bound

“Mines” dataset

Maybe tight enough to guide model/parameter selection.


Conclusions for Part II

- PAC-Bayesian bound for sparse classifiers.

- The bound (unlike VC bounds) is always non-trivial.

- Tightness still requires large sample sizes.

- Future goals: tightening the bounds... as always.


PART III

Feature Selection in Unsupervised Learning


FS is a widely studied problem in supervised learning.

FS is not widely studied in unsupervised learning (clustering).

A good reason: in the absence of labels, how do you assess the usefulness of each feature?

Feature Selection in Unsupervised Learning

Approach: how relevant is each feature (component of x) for the mixture nature of the data?

We address this problem in the context of model-based clustering using finite mixtures:
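The finite-mixture model referred to here, in standard notation:

$$p(x\mid\theta)=\sum_{i=1}^{k}\alpha_{i}\,p(x\mid\theta_{i}),\qquad \alpha_{i}\ge 0,\quad \sum_{i=1}^{k}\alpha_{i}=1.$$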


Example:

x2 is irrelevant for the mixture nature of this data.

x1 is relevant for the mixture nature of this data.

Any PCA-type analysis of this data would not be useful here.

Example of Relevant and Irrelevant Features


Interplay Between Number of Clusters and Features

Example:

Using only x1, we find 2 components.

Using x1 and x2, we find 7 components (needed to fit the non-Gaussian density of x2)

Marginals


Most classical FS methods for supervised learning require combinatorial searches.

For d features, there are 2^d possible feature subsets.

Alternative: assign real valued feature weights and encourage sparseness (like seen above for supervised learning).

Approaches to Feature Selection

[Law, Figueiredo and Jain, TPAMI, 2004 (to appear)]


Maximum Likelihood Estimation and Missing Data

Missing data

One-of-k encoding: component i

“Complete” log-likelihood

...would be easy to maximize, if we knew the missing component labels.

Training data

Maximum likelihood estimate of

where


EM Algorithm for Maximum Likelihood Estimation

E-step:

Current estimate of the probability that each point was produced by component i.

M-step:

because
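A reconstruction of the standard updates (with $w_{ij}$ the responsibility of component $i$ for point $x_j$):

$$w_{ij}=\frac{\alpha_{i}\,p(x_{j}\mid\theta_{i})}{\sum_{l=1}^{k}\alpha_{l}\,p(x_{j}\mid\theta_{l})}\ \ \text{(E-step)},\qquad \alpha_{i}^{\mathrm{new}}=\frac{1}{n}\sum_{j=1}^{n}w_{ij}\ \ \text{(M-step)},$$

with the usual responsibility-weighted updates for the component parameters $\theta_i$.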


Model Selection for Mixtures

Important issue: how to select k (number of components)?

[Figueiredo & Jain, 2002]: an MML-based approach; roughly, an MDL/BIC criterion with a careful definition of the amount of data from which each parameter is estimated.

The resulting criterion leads to a simple modification of EM:

the original M-step expression for the mixing probabilities is replaced by a new expression

which "kills" weak components,

where N is the number of parameters of each component.
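The modified M-step of [Figueiredo & Jain, 2002] is (reconstructed; $N$ is the number of parameters of each component):

$$\alpha_{i}^{\mathrm{new}}=\frac{\max\Big\{0,\ \sum_{j}w_{ij}-\frac{N}{2}\Big\}}{\sum_{l=1}^{k}\max\Big\{0,\ \sum_{j}w_{lj}-\frac{N}{2}\Big\}},$$

so components whose effective support falls below $N/2$ points get $\alpha_{i}=0$ and are removed.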


Feature Selection for Mixtures

Simplifying assumption: in each component, the features are independent.

parameters of the density of the l-th feature in component i

Let some features have a common density in all components. These features are “irrelevant”.

The density of feature l in component i is component-specific if feature l is relevant, and common to all components if it is irrelevant.

Common (w.r.t. i) densities of the irrelevant features


The likelihood

To apply EM, we treat the binary relevance indicators as missing data and define, for each feature, the probability that it is relevant;

we call this its "feature saliency".

Feature Saliency

It can be shown that the resulting likelihood (marginal w.r.t. the relevance indicators) is
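The marginal likelihood has the form given in [Law, Figueiredo, and Jain, 2004] (reconstructed; $\rho_{l}$ is the saliency of feature $l$, $p(\cdot\mid\theta_{il})$ its component-specific density, and $q(\cdot\mid\lambda_{l})$ its common density):

$$p(x\mid\Theta)=\sum_{i=1}^{k}\alpha_{i}\prod_{l=1}^{d}\Big[\rho_{l}\,p(x_{l}\mid\theta_{il})+(1-\rho_{l})\,q(x_{l}\mid\lambda_{l})\Big].$$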


Applying EM

We address

by EM, using both the component labels and the relevance indicators as missing data.

In addition to the variables defined above, the E-step now also involves the following variables:

both easily computed in closed form.


Applying EM

M-step (omitting the variances):

Assuming that the component-specific and the common densities are both univariate Gaussian with arbitrary mean and variance.

mean in

mean in


Model Selection

To perform feature selection, we want to encourage someof the saliencies to become either 0 or 1.

This can be achieved with the same MML-type criterion used above to select k.

The modified M-step is:

where: number of parameters in

number of parameters in


Synthetic Example

Mixture of 4 Gaussians (with identity covariance) in d = 10 dimensions; 800 samples, projected on the first two dimensions.

2 relevant features, 8 irrelevant features: the component means are non-zero only in the first two features, taking the values (0, 3), (1, 9), (6, 4), and (7, 10).


[Figure: the estimated common density, at initialization ("initial") and after convergence ("final").]


Synthetic Example

Feature saliency values (mean ± 1 s.d.) over 10 runs.

[Figure: saliencies of the 2 relevant features versus the 8 irrelevant features.]


- Several standard benchmark data sets:

Name                      n (samples)   d (features)   k (classes)
Wine                          178             13             3
Wisconsin breast cancer       569             30             2
Image segmentation           2320             18             7
Texture classification       4000             19             4

- These are standard data-sets for supervised classification.

- We fit mixtures, ignoring the labels.

- We classify the points and compare to the labels.

Real Data


Real Data: Results

Name                      % error with FS (s.d.)   % error without FS (s.d.)
Wine                           6.61 (3.91)                8.06 (3.73)
Wisconsin breast cancer        9.55 (1.99)               10.09 (2.70)
Image segmentation            20.19 (1.54)               32.84 (5.10)
Texture classification         4.04 (0.76)                4.85 (0.98)

- For these data-sets, our approach is able to improve the performance of mixture-based unsupervised classification.


Research Directions

In supervised learning:

- More efficient algorithms for logistic regression with the LASSO prior.

- Investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (ridge).

- Deriving performance bounds for this type of approach.

In unsupervised learning:

- More efficient algorithms.

- Removing the conditional independence assumption.

- Extension to other mixtures (e.g., multinomial for categorical data).