Gene selection using Random Voronoi Ensembles Stefano Rovetta Department of Computer and Information Sciences, University of Genoa, Italy Francesco Masulli Department of Computer Science, University of Pisa, Italy


Gene selection using Random Voronoi Ensembles

Stefano Rovetta, Department of Computer and Information Sciences, University of Genoa, Italy

Francesco Masulli, Department of Computer Science, University of Pisa, Italy


The input selection problem

Hard: Given d inputs, there are $2^d$ possible subsets and no guarantee that a larger subset performs better or worse than a smaller one (i.e., no monotonicity)

Classic: Many references, dating back to about the mid-seventies

Important: Curse of dimensionality, generalization, cost of measurements, cost of computation ...
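To make the size of the search space concrete, here is a minimal Python sketch that enumerates every subset of a small input set. The helper name `all_subsets` and the input labels are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

def all_subsets(inputs):
    """Yield every subset of the given inputs, including the empty set."""
    for size in range(len(inputs) + 1):
        for subset in combinations(inputs, size):
            yield subset

inputs = ["x1", "x2", "x3"]          # d = 3 hypothetical input variables
subsets = list(all_subsets(inputs))
print(len(subsets))                  # 2**3 = 8 subsets
```

With d in the thousands, as is typical for microarray data, exhaustive enumeration of this kind is clearly out of reach.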


A different perspective

Although old, the input selection problem is being actively studied now

From optimization: Classic approach: improve training speed / generalization ability / computational resource requirements...

...to model analysis: Mainstream approach as of today: find the subset of inputs that accounts the most for the observed phenomenon

A tool for scientific inquiry, not for system design


Gene selection

Bioinformatics is where input selection is a current (hot) topic

DNA microarrays provide large volumes of simultaneous data, e.g., gene expression

We have to find out which genes are the most relevant to a given pathology

(Good candidates to be the true cause)

We are interested in a specific approach: assessing the relative importance of each input variable (gene)


Problem statement

We address:

– Classification problems

– with 2 classes only to simplify the analysis (can be extended to multiclass)

– seeking a saliency ranking

– on a d-dimensional vector space: $x \in \mathbb{R}^d$

A single separating function is assumed, denoted by

g(x)


Outline of the technique

The proposed technique has three components

1 – a local analysis step with a basic classifier

2 – a resampling procedure to iterate step 1

3 – an integration step


Saliency (or importance, or sensitivity, or...)

Many definitions. Intuitively: some attribute of an input variable which measures its influence on the solution of a given (classification) problem

The derivative of the output w.r.t. each input variable is a natural measure of influence

$\nabla g(x) = \left( \partial g(x)/\partial x_1, \ldots, \partial g(x)/\partial x_d \right)$

But...


Finite sample effects

The rule is learned from a training set: random variability

Derivatives and local fluctuations: often it is better to study difference ratios

( f(x+Δ) – f(x) ) / Δ

rather than derivatives

f'(x)
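A small numerical illustration of this point, as a sketch under stated assumptions: the function `f` below is a made-up example (a linear trend of slope 1 plus a small, fast oscillation standing in for sample-induced fluctuations), and the step size 0.5 is arbitrary.

```python
import math

def f(x):
    # Slope-1 trend plus a small, fast fluctuation (hypothetical stand-in for noise).
    return x + 0.01 * math.sin(200.0 * x)

def derivative(x):
    # Exact derivative of f; the fluctuation contributes up to 0.01 * 200 = 2.
    return 1.0 + 2.0 * math.cos(200.0 * x)

def difference_ratio(func, x, delta):
    # ( f(x + delta) - f(x) ) / delta
    return (func(x + delta) - func(x)) / delta

x0 = 0.0
print(derivative(x0))                # 3.0: dominated by the local fluctuation
print(difference_ratio(f, x0, 0.5))  # within 0.02 of the underlying slope 1
```

The pointwise derivative is three times the underlying slope, while the difference ratio over a finite step stays close to it.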


Use of linear separators

If the decision function is of the form

g(x) = w . x

then derivatives w.r.t. inputs are constant and given directly by the coefficient vector w

SVMs can provide the optimum linear separators w.r.t. a given generalization bound

2-norm soft margin optimization: bound on generalization error based on (soft) margin

Such linear separators are robust in terms of sample variations

(they depend on support vectors only)
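The idea of reading saliency off the coefficient vector can be sketched as follows. This is not an SVM solver: it is plain gradient descent on a 2-norm soft-margin (squared hinge) objective for g(x) = w . x without a bias term, used here as a simple stand-in. The function name, the toy data, and the values of `C`, `lr`, and `n_steps` are all assumptions made for illustration.

```python
def train_linear_separator(X, y, C=1.0, lr=0.01, n_steps=2000):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)^2 by gradient descent."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_steps):
        grad = list(w)  # gradient of the regularization term 0.5*||w||^2
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            slack = 1.0 - margin
            if slack > 0.0:  # only margin violators contribute
                for j in range(d):
                    grad[j] -= 2.0 * C * slack * yi * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy data: the class is determined by the first input; the second is noise.
X = [(2.0, 0.3), (1.5, -0.4), (2.5, 0.1), (-2.0, 0.2), (-1.5, -0.3), (-2.5, -0.1)]
y = [1, 1, 1, -1, -1, -1]

w = train_linear_separator(X, y)
saliency = [abs(wj) for wj in w]  # magnitude of each coefficient
print(w, saliency)
```

On this toy set the first coefficient is far larger in magnitude than the second, matching the way the data were constructed.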


Local analysis

The linear separator is applied on a local basis

Nonlinear g(x) can be studied by local linearization

Voronoi partitioning

A Voronoi tessellation is performed on the training set

Linear analysis is applied within each Voronoi polyhedron (a localized subset of training samples)

We obtain a saliency ranking directly as $t = w / \max_i \{ w_i \}$

(signs can be discarded and analyzed separately)
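A minimal sketch of the two operations described above: assigning training points to their nearest Voronoi site, and normalizing a local weight vector into a saliency vector. The function names and sample values are hypothetical, and the normalization uses absolute values, which is an assumption based on the remark that signs can be discarded.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def voronoi_cells(points, sites):
    """Return, for each site, the indices of the points nearest to it."""
    cells = [[] for _ in sites]
    for idx, p in enumerate(points):
        nearest = min(range(len(sites)), key=lambda s: squared_distance(p, sites[s]))
        cells[nearest].append(idx)
    return cells

def normalized_saliency(w):
    """t = |w| / max_i |w_i| (signs discarded; assumed reading of the slide)."""
    magnitudes = [abs(wi) for wi in w]
    largest = max(magnitudes)
    if largest == 0.0:
        raise ValueError("weight vector is all zeros")
    return [m / largest for m in magnitudes]

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
sites = [(0.0, 0.0), (5.0, 5.0)]
print(voronoi_cells(points, sites))           # [[0, 1], [2, 3]]
print(normalized_saliency([2.0, -4.0, 1.0]))  # [0.5, 1.0, 0.25]
```

In the full technique, a local linear separator would be trained on the samples of each cell, and its coefficient vector passed to the normalization step.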


Drawbacks

Several: mainly border effects and small sample size within Voronoi polyhedra

Solution: resampling

The Voronoi tessellation is performed several times

Random Voronoi tessellations are used each time
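The resampling loop might look like the sketch below. The slides do not say how the random sites are drawn; here they are sampled without replacement from the training points, which is an assumption, as are the function names, the number of sites, and the number of rounds.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def voronoi_cells(points, sites):
    """Return, for each site, the indices of the points nearest to it."""
    cells = [[] for _ in sites]
    for idx, p in enumerate(points):
        nearest = min(range(len(sites)), key=lambda s: squared_distance(p, sites[s]))
        cells[nearest].append(idx)
    return cells

def random_voronoi_rounds(points, n_sites, n_rounds, seed=0):
    """Build one random tessellation per round; sites are drawn from the points."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    rounds = []
    for _ in range(n_rounds):
        sites = rng.sample(points, n_sites)
        rounds.append((sites, voronoi_cells(points, sites)))
    return rounds

points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.2), (5.0, 5.0), (5.1, 4.8), (4.0, 4.5)]
rounds = random_voronoi_rounds(points, n_sites=2, n_rounds=3)
for sites, cells in rounds:
    print(sites, cells)
```

Each round partitions the same training set differently, so a sample that lies near a cell border in one tessellation can fall well inside a cell in another.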


An ensemble method

The procedure can be seen as an ensemble of localized linear classifiers

The necessary classifier diversity is provided by random Voronoi tessellations

What we need next: Integration of local analyses


Integrating by clustering

For each Voronoi polyhedron of each resampling step, we obtain a pair of d-dimensional vectors (or a 2d-dimensional combined vector)

$v_i = ( t_i , y_i )$

where $t_i$ is the saliency ranking and $y_i$ is the Voronoi centroid (site)

To integrate the local analyses we perform a c-means type clustering on the vectors $v_i$


Some details on the clustering step

- The clustering technique is the Graded Possibilistic c-Means algorithm

- The dimensionality problem is easily tackled by working only within the subspace spanned by the training set

- Clusters are obtained by merging (averaging) sets of vectors $v_i$ which are close either by their y (location) or by their t (saliency pattern) components

- The number of clusters currently has to be prespecified (as in standard c-means). It is independent of the number of Voronoi sites used
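The integration step can be illustrated with a plain (hard) c-means on the combined vectors. This is a deliberately simpler stand-in, not the Graded Possibilistic c-Means named above, and it does not include the projection onto the subspace spanned by the training set. The function names, the toy (t, y) pairs, and the choice of initial centers are assumptions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_center(v, centers):
    return min(range(len(centers)), key=lambda k: squared_distance(v, centers[k]))

def c_means(vectors, initial_centers, n_iter=20):
    """Hard c-means: alternate nearest-center assignment and averaging."""
    centers = [list(c) for c in initial_centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in vectors:
            groups[nearest_center(v, centers)].append(v)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    labels = [nearest_center(v, centers) for v in vectors]
    return centers, labels

# Hypothetical local results: (saliency t_i, Voronoi site y_i) with d = 2.
local_results = [
    ([1.0, 0.1], [0.0, 0.0]),
    ([0.9, 0.2], [0.2, 0.1]),
    ([0.1, 1.0], [5.0, 5.0]),
    ([0.2, 0.9], [5.1, 4.9]),
]
vectors = [t + y for t, y in local_results]  # 2d-dimensional combined vectors v_i
centers, labels = c_means(vectors, initial_centers=[vectors[0], vectors[2]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # first d entries of each center: averaged saliency pattern
```

Each resulting center pairs an averaged saliency pattern with an averaged location, which is the merged form of the local analyses.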


Results

“Leukemia” data set by Golub et al.


Discussion and future work

The results indicate that some of the genes identified in the original work by Golub et al. are also found to be important by our approach.

Extensive validation (with the help of domain experts or biologists) remains to be done

The direction (sign) of saliency has been found to be always in agreement with statistical correlation as indicated by the original work.

Further experiments: a new data set (still unpublished) is currently being investigated

An interesting tweak: replacing the general c-means-type clustering with a technique specifically tailored to rank data