IPK Gatersleben Pattern Recognition Group Correlation-based Data Processing and its Application to...

IPK Gatersleben Pattern Recognition Group

Correlation-based Data Processing

and its Application to Biology

stricker@ipk-gatersleben.de

Marc Strickert

Osnabrück, 14 January 2005

Pattern Recognition Group

Schloss Dagstuhl

Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben

Goals

1. Attribute rating

2. Clustering

3. Classification

4. Visualization

of biological data,

exploiting properties of

Pearson correlation.

Euclidean distances may be problematic

$d_1^2 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$

$d_2^2 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$

identical despite different shapes

[ John Lee and Michel Verleysen ]
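
The effect can be reproduced in a few lines of Python. This is a minimal sketch with made-up five-point profiles (the values and variable names are assumptions for illustration, not data from the talk): one comparison profile is a shifted copy of `x`, the other zigzags around it, yet both lie at the same Euclidean distance from `x`.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical five-point profiles, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_shifted = [2.0, 3.0, 4.0, 5.0, 6.0]  # same shape as x, offset by +1
y_zigzag = [2.0, 1.0, 4.0, 3.0, 6.0]   # offsets alternate +1 / -1: different shape

d1 = euclidean(x, y_shifted)
d2 = euclidean(x, y_zigzag)
print(d1, d2)  # both equal sqrt(5)
```

Every coordinate differs by exactly 1 in both cases, so the sum of squares cannot tell the two shapes apart.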

Pearson correlation invariant to scaling and shifting

amplitude, vertical offset

same correlations as above!

same profiles, aligned

raw data

Up-regulated gene profiles

Euclidean view

'Pearson' view
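
The invariance is easy to check numerically. The following is a minimal sketch with hypothetical profiles; the amplitude factor 3 and the offset 10 are arbitrary assumptions. Note that the invariance holds for positive amplitude factors; a negative factor flips the sign of the correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression profiles (illustrative values only).
x = [0.1, 0.4, 0.9, 1.6, 2.5]
y = [0.0, 0.5, 0.7, 1.9, 2.2]
y_rescaled = [3.0 * v + 10.0 for v in y]  # amplitude x3, vertical offset +10

r_raw = pearson(x, y)
r_rescaled = pearson(x, y_rescaled)
d_raw = euclidean(x, y)
d_rescaled = euclidean(x, y_rescaled)
print(r_raw, r_rescaled)  # equal up to rounding
print(d_raw, d_rescaled)  # the Euclidean distance changes substantially
```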

Derivatives of squared Euclidean and Pearson correlation

Squared Euclidean:

Pearson correlation:
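
A minimal sketch of both derivatives, taken with respect to a second vector `w`. The symbol `w`, the helper names, and the test vectors are assumptions for illustration. The closed forms follow from direct differentiation: for the squared Euclidean distance the derivative is -2 (x_k - w_k); for Pearson correlation r = B / sqrt(C D), with B = sum (x_l - mx)(w_l - mw), C = sum (w_l - mw)^2 and D = sum (x_l - mx)^2, it is ((x_k - mx) - (B / C)(w_k - mw)) / sqrt(C D). Both are checked against central finite differences.

```python
import math


def sq_euclidean(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))


def pearson(x, w):
    n = len(x)
    mx, mw = sum(x) / n, sum(w) / n
    b = sum((a - mx) * (c - mw) for a, c in zip(x, w))
    cw = sum((c - mw) ** 2 for c in w)
    dx = sum((a - mx) ** 2 for a in x)
    return b / math.sqrt(cw * dx)


def grad_sq_euclidean(x, w):
    """d/dw_k of sum_l (x_l - w_l)^2 = -2 (x_k - w_k)."""
    return [-2.0 * (a - c) for a, c in zip(x, w)]


def grad_pearson(x, w):
    """d r / d w_k = ((x_k - mx) - (B / C) (w_k - mw)) / sqrt(C D)."""
    n = len(x)
    mx, mw = sum(x) / n, sum(w) / n
    b = sum((a - mx) * (c - mw) for a, c in zip(x, w))
    cw = sum((c - mw) ** 2 for c in w)
    dx = sum((a - mx) ** 2 for a in x)
    return [((a - mx) - (b / cw) * (c - mw)) / math.sqrt(cw * dx)
            for a, c in zip(x, w)]


def numeric_grad(f, x, w, h=1e-6):
    """Central finite differences with respect to w, used only as a check."""
    grad = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grad.append((f(x, w_plus) - f(x, w_minus)) / (2.0 * h))
    return grad


# Hypothetical vectors for the check.
x = [0.3, 1.2, 2.0, 1.1, 0.4]
w = [0.5, 0.9, 1.7, 1.6, 0.2]

err_euc = max(abs(a - b) for a, b in
              zip(grad_sq_euclidean(x, w), numeric_grad(sq_euclidean, x, w)))
err_cor = max(abs(a - b) for a, b in
              zip(grad_pearson(x, w), numeric_grad(pearson, x, w)))
print(err_euc, err_cor)
```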

Applications for derivative of similarity measure

1. Attribute rating (variance analogue)

2. Clustering (Neural Gas for Correlation, NG-C)

3. Classification (GRLVQ-C)

4. Visualization (High-Throughput MDS)

Attribute rating

Squared Euclidean distance

Variance as double sum of derivatives
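
One way to read this identity, sketched under assumptions: since the derivative of the squared Euclidean distance between samples x_i and x_j with respect to attribute k of x_i is 2 (x_ik - x_jk), the population variance of attribute k equals the double sum of the squared derivatives divided by 8 n^2. The normalization constant, the function names, and the toy data are assumptions made for this sketch.

```python
def variance(values):
    """Population variance (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def variance_from_derivatives(data, k):
    """Variance of attribute k as a double sum over squared derivatives
    of the squared Euclidean distance between all sample pairs."""
    n = len(data)
    total = 0.0
    for xi in data:
        for xj in data:
            deriv = 2.0 * (xi[k] - xj[k])  # d/dx_ik of sum_l (x_il - x_jl)^2
            total += deriv ** 2
    return total / (8.0 * n * n)


# Hypothetical data: 4 samples (rows) with 3 attributes (columns).
data = [
    [1.0, 10.0, 0.5],
    [2.0, 12.0, 0.5],
    [4.0, 11.0, 0.6],
    [3.0, 15.0, 0.4],
]

direct = [variance([row[k] for row in data]) for k in range(3)]
via_derivatives = [variance_from_derivatives(data, k) for k in range(3)]
print(direct)
print(via_derivatives)
```

Written this way, the rating of an attribute depends only on derivatives of the distance measure, which is what makes it possible to swap in another measure such as Pearson correlation.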

Correlation Analogue to Euclidean Variance

X

W

Clustering: Neural Gas (NG revisited)

NG-C:
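
A rough sketch of how a correlation-based Neural Gas update could look. It assumes that NG-C keeps the rank-based neighborhood cooperation of Neural Gas, ranks prototypes by decreasing Pearson correlation with the drawn sample, and moves each prototype along the gradient of r. The learning rate, the fixed (not annealed) neighborhood range, the function names, and the toy data are all assumptions; this is not the published implementation.

```python
import math
import random


def pearson(x, w):
    n = len(x)
    mx, mw = sum(x) / n, sum(w) / n
    b = sum((a - mx) * (c - mw) for a, c in zip(x, w))
    cw = sum((c - mw) ** 2 for c in w)
    dx = sum((a - mx) ** 2 for a in x)
    return b / math.sqrt(cw * dx)


def grad_pearson(x, w):
    """Gradient of r(x, w) with respect to the prototype w."""
    n = len(x)
    mx, mw = sum(x) / n, sum(w) / n
    b = sum((a - mx) * (c - mw) for a, c in zip(x, w))
    cw = sum((c - mw) ** 2 for c in w)
    dx = sum((a - mx) ** 2 for a in x)
    return [((a - mx) - (b / cw) * (c - mw)) / math.sqrt(cw * dx)
            for a, c in zip(x, w)]


def train_ngc(data, n_prototypes, steps=3000, rate=0.1, lam=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    prototypes = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(n_prototypes)]
    for _ in range(steps):
        x = rng.choice(data)
        # Rank prototypes by decreasing correlation with x (rank 0 = best match).
        order = sorted(range(n_prototypes),
                       key=lambda i: -pearson(x, prototypes[i]))
        for rank, i in enumerate(order):
            h = math.exp(-rank / lam)  # neighborhood cooperation
            g = grad_pearson(x, prototypes[i])
            # Gradient ascent: increase the correlation with the sample.
            prototypes[i] = [wk + rate * h * gk
                             for wk, gk in zip(prototypes[i], g)]
    return prototypes


# Toy data: rising and falling profiles with random amplitude and offset.
data_rng = random.Random(1)
up = [0.0, 1.0, 2.0, 3.0, 4.0]
down = up[::-1]
data = []
for pattern in (up, down):
    for _ in range(20):
        a, b = data_rng.uniform(0.5, 2.0), data_rng.uniform(-5.0, 5.0)
        data.append([a * v + b + data_rng.gauss(0.0, 0.05) for v in pattern])

prototypes = train_ngc(data, 2)
r_up = sorted(pearson(up, p) for p in prototypes)
print(r_up)  # correlation of each prototype with the rising pattern
```

Because the update uses the gradient of r rather than the difference x - w, the prototypes capture profile shape and are unaffected by the amplitude and offset of the individual samples.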

High centroid reproducibility with NG-C

NG-C

k-means

23 gene expression centroids, 10 independent runs

Indeterminate final states.

Crisp final states.

Classification with relevance learning

Used, for example, in Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C)

Adaptive Pearson correlation:
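
One plausible form of such a measure, sketched under assumptions: each attribute k receives a relevance factor lambda_k that enters the covariance and variance sums as lambda_k^2, while the means stay unweighted. The exact weighting scheme, the function name `adaptive_pearson`, and the example vectors are assumptions for illustration and may differ from the formula on the slide.

```python
import math


def adaptive_pearson(x, w, lam):
    """Pearson-like correlation with per-attribute relevance factors lam
    (assumed form: squared factors weight each term, means are unweighted)."""
    n = len(x)
    mx, mw = sum(x) / n, sum(w) / n
    b = sum(l * l * (a - mx) * (c - mw) for a, c, l in zip(x, w, lam))
    cx = sum(l * l * (a - mx) ** 2 for a, l in zip(x, lam))
    cw = sum(l * l * (c - mw) ** 2 for c, l in zip(w, lam))
    return b / math.sqrt(cx * cw)


# Hypothetical vectors: w follows x except for the last attribute.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 2.0, 3.0, 4.0, 0.0]

r_uniform = adaptive_pearson(x, w, [1.0, 1.0, 1.0, 1.0, 1.0])
r_downweighted = adaptive_pearson(x, w, [1.0, 1.0, 1.0, 1.0, 0.1])
print(r_uniform, r_downweighted)
```

With uniform factors the measure reduces to plain Pearson correlation. Lowering the factor of the deviating attribute raises the correlation, which is the mechanism relevance learning can exploit to rank attributes.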

Leukemia cancer data set: AML / ALL separation

GRLVQ-C: Relevance factors top 10 gene ranking.

1 prototype per class + relevance learning.

consistent with Golub et al.

Visualization of high-dimensional data

High-dimensional data (constant source)

Low-dimensional points (variable target)

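[Figure: "embedding" of the points A, B, C in 3D, with pairwise distances d12, d13, d23, as points A', B', C' in 2D with their own distances d12, d13, d23.]

Gradient-based stochastic optimization: HiT-MDS.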

Maximize distance correlations: source ≈ reconstruction

original inter-point distance matrix

reconstructed inter-point distance matrix

Adaptive parameters: point coordinates

Minimize the embedding stress function using the negative Fisher's Z':
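
A minimal sketch of such a stress function, assuming that Fisher's Z' is the usual transform Z'(r) = atanh(r) and that r is the Pearson correlation between the original and the reconstructed inter-point distances. The function names, the clipping constant, and the four example points are assumptions for illustration.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances for all pairs i < j, as a flat list."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    b = sum((a - mu) * (c - mv) for a, c in zip(u, v))
    cu = sum((a - mu) ** 2 for a in u)
    cv = sum((c - mv) ** 2 for c in v)
    return b / math.sqrt(cu * cv)


def stress(source, embedding, eps=1e-12):
    """Negative Fisher Z' of the correlation between both distance lists."""
    r = pearson(pairwise_distances(source), pairwise_distances(embedding))
    r = max(-1.0 + eps, min(1.0 - eps, r))  # keep atanh finite at r = +/-1
    return -math.atanh(r)


# Hypothetical source points in 3D, lying almost in the plane z = 0.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.1)]
# A faithful 2D embedding (z dropped) and one with two points swapped.
good = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
bad = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

stress_good = stress(source, good)
stress_bad = stress(source, bad)
print(stress_good, stress_bad)
```

The faithful embedding yields a strongly negative stress, while the swapped one has negatively correlated distances and therefore a positive stress.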

Iterative gradient descent for stress function minimization

derivative of Fisher's Z'

for Euclidean spaces
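
The gradient can be sketched with the chain rule: the derivative of the stress with respect to one embedded point is the sum, over all other points, of (derivative of the negative Fisher's Z' with respect to r) times (derivative of r with respect to the reconstructed distance) times (derivative of the Euclidean distance with respect to the point coordinates). The code below is an illustrative sketch under these assumptions; all names and example points are hypothetical. It is verified against central finite differences.

```python
import math


def distance_matrix(points):
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]


def correlation_terms(d_src, d_emb):
    """Means and sums over pairs i < j that define r = B / sqrt(C * D)."""
    n = len(d_src)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m_src = sum(d_src[i][j] for i, j in pairs) / len(pairs)
    m_emb = sum(d_emb[i][j] for i, j in pairs) / len(pairs)
    b = sum((d_src[i][j] - m_src) * (d_emb[i][j] - m_emb) for i, j in pairs)
    c = sum((d_emb[i][j] - m_emb) ** 2 for i, j in pairs)
    d = sum((d_src[i][j] - m_src) ** 2 for i, j in pairs)
    return m_src, m_emb, b, c, d


def stress(d_src, embedding):
    _, _, b, c, d = correlation_terms(d_src, distance_matrix(embedding))
    return -math.atanh(b / math.sqrt(c * d))


def stress_gradient(d_src, embedding, i):
    """Gradient of the stress with respect to the coordinates of point i."""
    d_emb = distance_matrix(embedding)
    m_src, m_emb, b, c, d = correlation_terms(d_src, d_emb)
    r = b / math.sqrt(c * d)
    ds_dr = -1.0 / (1.0 - r * r)  # derivative of -atanh(r)
    grad = [0.0] * len(embedding[i])
    for j in range(len(embedding)):
        if j == i:
            continue
        # Derivative of r with respect to the reconstructed distance of pair (i, j).
        dr_dd = ((d_src[i][j] - m_src)
                 - (b / c) * (d_emb[i][j] - m_emb)) / math.sqrt(c * d)
        for k in range(len(grad)):
            # Euclidean distance derivative: (x_ik - x_jk) / d_ij.
            dd_dx = (embedding[i][k] - embedding[j][k]) / d_emb[i][j]
            grad[k] += ds_dr * dr_dd * dd_dx
    return grad


def numeric_gradient(d_src, embedding, i, h=1e-6):
    grad = []
    for k in range(len(embedding[i])):
        plus = [list(p) for p in embedding]
        minus = [list(p) for p in embedding]
        plus[i][k] += h
        minus[i][k] -= h
        grad.append((stress(d_src, plus) - stress(d_src, minus)) / (2.0 * h))
    return grad


# Hypothetical 3D source points and an arbitrary 2D embedding.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2), (0.0, 1.0, 0.5),
          (1.0, 1.0, 0.1), (0.5, 0.3, 1.0)]
embedding = [[0.1, 0.0], [0.9, 0.2], [-0.1, 1.1], [1.2, 0.8], [0.4, 0.5]]
d_src = distance_matrix(source)

analytic = stress_gradient(d_src, embedding, 2)
numeric = numeric_gradient(d_src, embedding, 2)
print(analytic, numeric)
```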

High-Throughput Multi-Dimensional Scaling (HiT-MDS)

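1. Initialize $\hat{X}$ by random projection (or smarter).

2. Calculate the correlation $r(X, \hat{X})$ once.

3. Draw the next pattern $x_i$.

4. Minimize the stress $s$ of $\hat{x}_i$ relative to all $\hat{x}_j$: $\Delta \hat{x}_i^k \propto -\partial s / \partial \hat{x}_i^k$. Then recalculate the distances $\hat{d}_{ij}$ and adapt the affected correlation terms and $r$.

[Diagram "HiT-MDS algorithm": input $x_i \in X$ with distances $d_{ij}$; embedding $\hat{x}_i \in \hat{X}$ with distances $\hat{d}_{ij}$; correlation $r(d_{ij}, \hat{d}_{ij})$; stress $s$. The numbers 1 to 4 mark the steps.]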
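
The steps above can be sketched as a naive loop. This is not the high-throughput implementation: instead of adapting the correlation terms incrementally, the sketch recomputes all distances and sums in every step, which costs O(n^2) per update. The learning rate, the number of steps, the Gaussian random projection, the clipping of r, and all names are assumptions made for this illustration.

```python
import math
import random


def distance_matrix(points):
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]


def correlation_terms(d_src, d_emb):
    """Means and sums over pairs i < j that define r = B / sqrt(C * D)."""
    n = len(d_src)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m_src = sum(d_src[i][j] for i, j in pairs) / len(pairs)
    m_emb = sum(d_emb[i][j] for i, j in pairs) / len(pairs)
    b = sum((d_src[i][j] - m_src) * (d_emb[i][j] - m_emb) for i, j in pairs)
    c = sum((d_emb[i][j] - m_emb) ** 2 for i, j in pairs)
    d = sum((d_src[i][j] - m_src) ** 2 for i, j in pairs)
    return m_src, m_emb, b, c, d


def correlation(d_src, embedding):
    _, _, b, c, d = correlation_terms(d_src, distance_matrix(embedding))
    return b / math.sqrt(c * d)


def stress_gradient(d_src, embedding, i):
    """Gradient of -atanh(r) with respect to the coordinates of point i."""
    d_emb = distance_matrix(embedding)
    m_src, m_emb, b, c, d = correlation_terms(d_src, d_emb)
    r = b / math.sqrt(c * d)
    r = max(-1.0 + 1e-9, min(1.0 - 1e-9, r))  # guard against division by zero
    ds_dr = -1.0 / (1.0 - r * r)
    grad = [0.0] * len(embedding[i])
    for j in range(len(embedding)):
        if j == i:
            continue
        dr_dd = ((d_src[i][j] - m_src)
                 - (b / c) * (d_emb[i][j] - m_emb)) / math.sqrt(c * d)
        for k in range(len(grad)):
            grad[k] += (ds_dr * dr_dd
                        * (embedding[i][k] - embedding[j][k]) / d_emb[i][j])
    return grad


def hit_mds_sketch(source, dim=2, steps=3000, rate=0.02, seed=0):
    rng = random.Random(seed)
    d_src = distance_matrix(source)
    # Step 1: random linear projection of the source points.
    proj = [[rng.gauss(0.0, 1.0) for _ in range(len(source[0]))]
            for _ in range(dim)]
    emb = [[sum(p * v for p, v in zip(row, x)) for row in proj]
           for x in source]
    r_init = correlation(d_src, emb)       # step 2
    for _ in range(steps):
        i = rng.randrange(len(emb))        # step 3: draw a pattern
        g = stress_gradient(d_src, emb, i)
        emb[i] = [x - rate * gk for x, gk in zip(emb[i], g)]  # step 4
    return emb, r_init, correlation(d_src, emb)


# Hypothetical input: 8 random points in the 3D unit cube.
data_rng = random.Random(42)
source = [[data_rng.uniform(0.0, 1.0) for _ in range(3)] for _ in range(8)]
embedding, r_init, r_final = hit_mds_sketch(source)
print(r_init, r_final)  # distance correlation before and after optimization
```

Comparing the two printed correlations shows how much the stochastic gradient steps changed the agreement between source and reconstructed distances for this toy input.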

Applications of dimension reduction (visualization)

1. Gene space browser.

2. Macro-experiment grouping.

[Figure: panels 1 and 2; labels "day 0" and "day 26".]

Embedding 12k Genes (14 time points) in 2D

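[Figure: embedding panels labeled EUC, COR, and SRC for the original ("orig") and spline-fitted ("spline", "FIT") data; regions marked U, I, and D.]

EUC: Euclidean distance

COR: Pearson correlation

SRC: Spearman rank correlation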

Gene browser (4824 high-quality genes)

[Figure: gene expression profiles; x-axis: DAF, 0 to 26 in steps of 2.]

[ visualization: www.ggobi.org ]

Gene browser for powers of correlation: $(1-r)^8$
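
A small sketch of what such a power transform does to the dissimilarity scale, assuming that $(1-r)^8$ is used as the dissimilarity between two gene profiles with Pearson correlation r. The function name and the sample values of r are assumptions for illustration.

```python
def power_dissimilarity(r, p=8):
    """Dissimilarity (1 - r)^p for a Pearson correlation r in [-1, 1]."""
    return (1.0 - r) ** p


for r in (1.0, 0.9, 0.5, 0.0, -1.0):
    print(r, power_dissimilarity(r))
```

The values range from 0 at r = 1 to 256 at r = -1. Highly correlated profiles are pushed very close together (about 1e-8 at r = 0.9), while anti-correlated profiles are spread far apart.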

Gene clustering (k=11), relevant genes in front

3D-View of 62 macroarrays (4824 genes)

Data processing challenges in biology

Data sets from

- metabolite measurements (2D-gels, HPLC),

- QTL LOD-score pattern compression,

- DNA-sequence arrangement.

Missing value imputation (probabilistic models)

Association studies (common latent space, CCA)

Rank-based data analysis (distribution models)

Faithful low-dimensional data representation

Proximity data handling

Common language: R / MATLAB / … ?

Thanks

http://pgrc-16.ipk-gatersleben.de/~stricker/

http://hitmds.webhop.net/

Pattern recognition group (IPK, headed by Udo Seiffert)

Nese Sreenivasulu (IPK, Molecular Biology)

Barbara Hammer (TU-Clausthal)

Thomas Villmann (University of Leipzig)

Some References

Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side publishers, Evere/Belgium, pp. 265-270, 2006.

Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Springer, 2006.

Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).
