Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees


Radford M. Neal and Jianguo Zhang, the winners of the NIPS 2003 feature selection challenge

University of Toronto

The results

• Combination of Bayesian neural networks and classification based on Bayesian clustering with a Dirichlet diffusion tree model.
• A Dirichlet diffusion tree method is used for Arcene.
• Bayesian neural networks (as in BayesNN-large) are used for Gisette, Dexter, and Dorothea.
• For Madelon, the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.

Their General Approach

Use simple techniques to reduce the computational difficulty of the problem, then apply more sophisticated Bayesian methods.
– The simple techniques: PCA and feature selection by significance tests.
– Bayesian neural networks.
– Automatic Relevance Determination.

(I) First-level feature reduction

Feature selection using significance tests (first level)

An initial feature subset was found by simple univariate significance tests (correlation coefficient, symmetrical uncertainty).

Assumption: Relevant variables will be at least somewhat relevant on their own.

For all tests, a p-value was computed by comparing the test statistic to the distribution obtained when the class labels are randomly permuted.
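The permutation idea can be sketched as follows. This is an illustrative implementation (the function name, the use of the correlation coefficient as the statistic, and the toy data are assumptions, not the authors' exact code):

```python
import numpy as np

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """P-value for the association between feature x and labels y:
    compare the observed |correlation| to its distribution under
    random permutations of the class labels."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
    # Add-one smoothing keeps the estimate away from exactly zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# A feature that tracks the labels gets a tiny p-value; noise does not.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
relevant = y + 0.5 * rng.standard_normal(100)
noise = rng.standard_normal(100)
print(permutation_p_value(relevant, y))   # about 1/1001, the smallest possible
print(permutation_p_value(noise, y))
```

Because the null distribution comes from the data itself, no parametric assumption about the test statistic is needed; this is what makes the same recipe work for both the correlation coefficient and symmetrical uncertainty.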

Dimensionality reduction with PCA (an alternative to feature selection)

There are probably better dimensionality reduction methods than PCA, but that's what we used. One reason is that it is feasible even when p is huge, provided n is not too large: the time required is of order min(pn², np²).

PCA was done using all the data (training, validation, and test).
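Note that pooling training, validation, and test data for PCA leaks no label information, since PCA is unsupervised. The min(pn², np²) cost comes from working with whichever of the n×n or p×p matrices is smaller; a minimal sketch using a thin SVD (the function name and toy sizes are illustrative):

```python
import numpy as np

def pca_scores(X, k):
    """Project n cases with p features onto the top k principal
    components. The thin SVD of the centered n x p matrix costs about
    O(p * n^2) when n < p, so this stays feasible for huge p."""
    Xc = X - X.mean(axis=0)
    # Thin SVD: U is n x r, s holds singular values in decreasing order.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]          # n x k component scores

# 600 cases, 10000 features -- n much smaller than p, as in Arcene.
X = np.random.default_rng(0).standard_normal((600, 10000))
Z = pca_scores(X, 5)
print(Z.shape)   # (600, 5)
```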

(II) Building the learning model & second-level feature selection

Bayesian Neural Networks

Conventional neural network learning

Bayesian Neural Network Learning

Based on the statistical interpretation of conventional neural network learning.

Bayesian Neural Network Learning

Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted by averaging over the posterior:

p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ

where D is the training data and θ denotes the network parameters.

Conventional neural networks consider only the parameters with maximum posterior probability.

Bayesian neural networks consider all possible parameters in the parameter space.

Can be implemented by Gaussian approximation or MCMC.
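The contrast between the two approaches can be sketched on a toy logistic model. The posterior draws here are stand-ins for real MCMC output (the function names and the Gaussian scatter around the mode are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_prediction(x, theta_map):
    """Conventional approach: predict with the single best parameter set."""
    return sigmoid(theta_map @ x)

def bayes_prediction(x, theta_samples):
    """Bayesian approach: approximate the predictive integral
        p(y=1 | x, D) = E_{theta ~ p(theta|D)}[ p(y=1 | x, theta) ]
    by averaging network outputs over posterior draws."""
    return np.mean([sigmoid(theta @ x) for theta in theta_samples])

# Stand-in posterior: draws scattered around a mode at (2, -1),
# as MCMC would produce for a model with substantial uncertainty.
rng = np.random.default_rng(0)
theta_map = np.array([2.0, -1.0])
theta_samples = theta_map + 1.5 * rng.standard_normal((500, 2))

x = np.array([1.0, 0.5])
print(map_prediction(x, theta_map))
print(bayes_prediction(x, theta_samples))
```

Averaging the sigmoids over a spread-out posterior pulls the probability toward 0.5, so the Bayesian prediction is less overconfident than the single-mode one.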

ARD Prior

Still remember weight decay?

How? (by optimizing the decay parameters)
– Associate the weights from each input with their own decay parameter.
– There are theories for optimizing the decays.

Result: if an input feature x is irrelevant, its relevance hyper-parameter β = 1/a will tend to be small, forcing the weights from that input to be near zero.
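A minimal sketch of the ARD effect, using MacKay-style hyperparameter re-estimation on a linear model rather than a full neural network (the function name, the fixed noise precision, and the update rule shown are illustrative, not the authors' implementation):

```python
import numpy as np

def ard_linear(X, y, n_iter=50, beta=25.0):
    """Each weight w_i gets its own Gaussian prior N(0, 1/alpha_i); the
    alpha_i are re-estimated from the data. Irrelevant inputs end up
    with large alpha (small relevance 1/alpha), shrinking their weights
    toward zero, while relevant inputs keep a broad prior."""
    n, p = X.shape
    alpha = np.ones(p)
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + beta * X.T @ X)   # posterior covariance
        m = beta * Sigma @ X.T @ y                  # posterior mean weights
        gamma = 1.0 - alpha * np.diag(Sigma)        # how well-determined w_i is
        alpha = gamma / (m ** 2 + 1e-12)            # re-estimate precisions
    return m, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = 2.0 * X[:, 0] + 0.2 * rng.standard_normal(100)  # only input 0 matters

m, alpha = ard_linear(X, y)
print(np.round(m, 2))     # weight for input 0 near 2, others near 0
print(alpha)              # alpha for inputs 1 and 2 much larger than input 0
```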

Some Strong Points of This Algorithm

Bayesian learning integrates over the posterior distribution for the network parameters, rather than picking a single "optimal" set of parameters. This further helps to avoid overfitting.

ARD can be used to adjust the relevance of input features.

We can use priors to incorporate external knowledge.

Dirichlet Diffusion Trees

A Bayesian hierarchical clustering method.

The methods

BayesNN-small: features selected using significance tests.

BayesNN-large: principal components.

BayesNN-DFT-combo: the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
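The combo rule is just probability averaging followed by a threshold; a minimal sketch (the function name and example probabilities are illustrative):

```python
import numpy as np

def combo_predict(p_nn, p_dft, threshold=0.5):
    """Average the class probabilities from the Bayesian neural network
    and the Dirichlet diffusion tree model, then threshold to get
    hard 0/1 predictions."""
    p_avg = (np.asarray(p_nn) + np.asarray(p_dft)) / 2.0
    return (p_avg > threshold).astype(int)

# Averages are 0.8, 0.3, 0.45 -> predictions 1, 0, 0.
print(combo_predict([0.9, 0.2, 0.55], [0.7, 0.4, 0.35]))  # [1 0 0]
```

Averaging in probability space (rather than majority-voting hard labels) lets a confident model outvote an uncertain one on any given case.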

About the datasets

The results

•http://www.nipsfsc.ecs.soton.ac.uk/

Thanks.

Any Questions?
