
IPK Gatersleben, Pattern Recognition Group

Correlation-based Data Processing

and its Application to Biology


Marc Strickert

Osnabrück, 14 January 2005

Pattern Recognition Group

Schloss Dagstuhl

Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben


Goals

1. Attribute rating

2. Clustering

3. Classification

4. Visualization

of biological data, exploiting properties of Pearson correlation.


Euclidean distances may be problematic

$d_1 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$ (panel 1)

$d_2 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$ (panel 2)

Identical distances despite different shapes.

[ John Lee and Michel Verleysen ]
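
The point of this slide can be reproduced with a small made-up example (the numbers below are illustrative and not taken from the talk): two five-point profiles lie at the same squared Euclidean distance from a reference profile, although one only shifts the reference and the other changes its shape. Pearson correlation separates the two cases.

```python
import numpy as np

# Reference profile and two comparison profiles (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_shift = x + 1.0                                        # same shape, constant offset
y_zigzag = x + np.array([1.0, -1.0, 1.0, -1.0, 1.0])     # different shape

def sq_euclid(a, b):
    """Squared Euclidean distance between two profiles."""
    return float(((a - b) ** 2).sum())

def pearson(a, b):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

d1, d2 = sq_euclid(x, y_shift), sq_euclid(x, y_zigzag)
r1, r2 = pearson(x, y_shift), pearson(x, y_zigzag)

print("squared Euclidean:", d1, d2)   # the two distances coincide
print("Pearson:", r1, r2)             # the correlations differ
```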


Pearson correlation invariant to scaling and shifting

amplitude, vertical offset

same correlations as above!

same profiles, aligned

raw data

Up-regulated gene profiles

Euclidean view

'Pearson' view
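
A minimal numerical check of the invariance claimed here, using synthetic profiles (the data, the scaling factors and the offsets are arbitrary assumptions): rescaling the amplitude by a positive factor and adding a vertical offset leaves the Pearson correlation unchanged, while a negative factor only flips its sign.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=14)               # synthetic 14-point profile
y = 0.5 * x + rng.normal(size=14)     # a second, partly related profile

def pearson(a, b):
    """Pearson correlation: centre both vectors, then normalise the dot product."""
    ac, bc = a - a.mean(), b - b.mean()
    return float(ac @ bc / np.sqrt((ac @ ac) * (bc @ bc)))

r_raw = pearson(x, y)
r_transformed = pearson(3.0 * x + 10.0, 0.2 * y - 4.0)   # new amplitude and offset
r_flipped = pearson(-x, y)                               # negative scaling flips the sign

print(r_raw, r_transformed, r_flipped)
```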


Derivatives of squared Euclidean and Pearson correlation

Squared Euclidean:

Pearson correlation:
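
The formulas of this slide did not survive extraction. As a hedged sketch, the standard derivatives with respect to a prototype vector $w$ are $\partial d/\partial w_k = -2(x_k - w_k)$ for the squared Euclidean distance and $\partial r/\partial w_k = \big((x_k-\mu_x) - \tfrac{B}{D}(w_k-\mu_w)\big)/\sqrt{C D}$ for Pearson correlation, with $B=\sum_i (x_i-\mu_x)(w_i-\mu_w)$, $C=\sum_i (x_i-\mu_x)^2$ and $D=\sum_i (w_i-\mu_w)^2$. The symbols $B$, $C$, $D$ are notation chosen here, not necessarily the talk's. The code checks both expressions against central finite differences.

```python
import numpy as np

def sq_euclid(x, w):
    return float(((x - w) ** 2).sum())

def d_sq_euclid(x, w):
    """Derivative of the squared Euclidean distance with respect to w."""
    return -2.0 * (x - w)

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

def d_pearson(x, w):
    """Derivative of the Pearson correlation with respect to w."""
    xc, wc = x - x.mean(), w - w.mean()
    B, C, D = xc @ wc, xc @ xc, wc @ wc
    return (xc - (B / D) * wc) / np.sqrt(C * D)

def numeric_grad(f, x, w, h=1e-6):
    """Central finite differences of f(x, w) with respect to each w_k."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (f(x, w + e) - f(x, w - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(1)
x, w = rng.normal(size=6), rng.normal(size=6)
print(np.abs(d_sq_euclid(x, w) - numeric_grad(sq_euclid, x, w)).max())
print(np.abs(d_pearson(x, w) - numeric_grad(pearson, x, w)).max())
```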


Applications for derivative of similarity measure

1. Attribute rating (variance analogue)

2. Clustering (Neural Gas for Correlation, NG-C)

3. Classification (GRLVQ-C)

4. Visualization (High-Throughput MDS)


Attribute rating

=

Squared Euclidean distance

Variance as double sum of derivatives
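
The slide's formula is missing from the extraction. One standard identity that fits its title, offered here as an assumption about what was shown, is $\mathrm{var}(x^k) = \frac{1}{2N^2}\sum_i\sum_j (x_i^k - x_j^k)^2$. Because $\partial (x_i^k - x_j^k)^2 / \partial x_i^k = 2(x_i^k - x_j^k)$, the same quantity can be written as $\frac{1}{8N^2}\sum_i\sum_j$ of the squared derivatives. The sketch below verifies this on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # 20 samples, 4 attributes (made-up data)

def variance_by_double_sum(col):
    """Population variance of one attribute via pairwise distance derivatives."""
    n = len(col)
    # Derivative of (x_i - x_j)^2 with respect to x_i, for all pairs (i, j).
    deriv = 2.0 * (col[:, None] - col[None, :])
    return float((deriv ** 2).sum() / (8.0 * n ** 2))

rating = np.array([variance_by_double_sum(X[:, k]) for k in range(X.shape[1])])
print(rating)
print(X.var(axis=0))   # NumPy's population variance, for comparison
```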


Correlation Analogue to Euclidean Variance

X

W


Clustering: Neural Gas (NG revisited)

NG-C:
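
The update rule for NG-C is not legible in this extraction. A minimal sketch of one plausible form, assuming the usual Neural Gas scheme (rank all prototypes for a given pattern, then move each one with a step that decays exponentially with its rank) with the Euclidean attraction replaced by the gradient of the Pearson correlation. The function names, the learning rate `eps` and the neighbourhood range `lam` are hypothetical.

```python
import numpy as np

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

def d_pearson(x, w):
    """Gradient of the Pearson correlation r(x, w) with respect to w."""
    xc, wc = x - x.mean(), w - w.mean()
    B, C, D = xc @ wc, xc @ xc, wc @ wc
    return (xc - (B / D) * wc) / np.sqrt(C * D)

def ngc_step(x, W, eps=0.05, lam=1.0):
    """One correlation-based Neural Gas update of all prototypes for pattern x."""
    r = np.array([pearson(x, w) for w in W])
    ranks = np.argsort(np.argsort(-r))     # rank 0 = most correlated prototype
    h = np.exp(-ranks / lam)               # neighbourhood weight per prototype
    W_new = W.copy()
    for j in range(len(W)):
        # Gradient ascent on the correlation, damped by the rank.
        W_new[j] += eps * h[j] * d_pearson(x, W[j])
    return W_new

rng = np.random.default_rng(0)
x = rng.normal(size=6)                     # one synthetic expression profile
W = rng.normal(size=(3, 6))                # three prototypes
W_next = ngc_step(x, W)
before = [pearson(x, w) for w in W]
after = [pearson(x, w) for w in W_next]
print(before)
print(after)
```

With a small step every prototype becomes slightly more correlated with the presented pattern, the best-ranked one most strongly.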


High centroid reproducibility with NG-C

NG-C

k-means

23 gene expression centroids, 10 independent runs

Indeterminate final states.

Crisp final states.


Classification with relevance learning

Used, for example, in

Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C)

Adaptive Pearson correlation:
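
The formula for the adaptive correlation is missing here. A hedged sketch of one common parametrisation, in which each attribute $i$ enters the correlation sums with a weight $\lambda_i^2$: $r_\lambda(x,w) = \sum_i \lambda_i^2 (x_i-\mu_x)(w_i-\mu_w) \big/ \sqrt{\sum_i \lambda_i^2 (x_i-\mu_x)^2 \, \sum_i \lambda_i^2 (w_i-\mu_w)^2}$. Whether the talk used exactly this form is an assumption. With uniform relevance factors it reduces to the ordinary Pearson correlation.

```python
import numpy as np

def adaptive_pearson(x, w, lam):
    """Pearson correlation with per-attribute relevance factors lam (assumed form)."""
    xc, wc = x - x.mean(), w - w.mean()
    l2 = lam ** 2
    num = (l2 * xc * wc).sum()
    den = np.sqrt((l2 * xc ** 2).sum() * (l2 * wc ** 2).sum())
    return float(num / den)

rng = np.random.default_rng(0)
x, w = rng.normal(size=8), rng.normal(size=8)

uniform = np.ones(8)
focused = np.array([1.0, 1.0, 1.0, 1.0, 0.2, 0.2, 0.2, 0.2])  # down-weights four attributes

r_plain = float(np.corrcoef(x, w)[0, 1])
r_uniform = adaptive_pearson(x, w, uniform)
r_focused = adaptive_pearson(x, w, focused)
print(r_plain, r_uniform, r_focused)
```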


Leukemia cancer data set: AML / ALL separation

GRLVQ-C: top-10 gene ranking by relevance factors.

1 prototype per class + relevance learning.

consistent with Golub et al.


Visualization of high-dimensional data

High-dimensional data (constant source)

Low-dimensional points (variable target)

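[Figure: points A, B, C in 3D with pairwise distances d12, d23, d13 are mapped by an "embedding" to points A', B', C' in 2D with reconstructed distances d12, d23, d13.]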

Gradient-based stochastic optimization: HiT-MDS.


Maximize distance correlations: source ≈ reconstruction

original inter-point distance matrix

reconstructed inter-point distance matrix

Adaptive parameters: point coordinates

Minimize the embedding stress function using negative Fisher's Z':
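
The stress formula itself is not legible in this extraction. A minimal sketch, assuming the stress is $s = -Z'(r)$ with Fisher's $Z'(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{artanh}(r)$ and $r$ the Pearson correlation between the original and the reconstructed inter-point distances (each pair counted once). The data and the two candidate embeddings are synthetic.

```python
import numpy as np

def pairwise_dist(P):
    """Euclidean distance matrix between all rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def fisher_z_stress(D, D_hat):
    """Negative Fisher Z' of the correlation between two distance matrices."""
    iu = np.triu_indices(D.shape[0], k=1)       # each unordered pair once
    r = np.corrcoef(D[iu], D_hat[iu])[0, 1]
    return float(-np.arctanh(r))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[:, 2] *= 0.1                                  # third axis carries little variation

good = X[:, :2]                                 # keeps the two dominant axes
bad = rng.normal(size=(30, 2))                  # unrelated random layout

D = pairwise_dist(X)
s_good = fisher_z_stress(D, pairwise_dist(good))
s_bad = fisher_z_stress(D, pairwise_dist(bad))
print(s_good, s_bad)                            # lower stress = more faithful embedding
```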


Iterative gradient descent for stress function minimization

Annotation: derivative of Fisher's Z'

Annotation: distance derivative for Euclidean spaces
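
The gradient formula that these two annotations refer to is missing. A hedged reconstruction by the chain rule, assuming the stress $s = -Z'(r)$ with $Z'(r) = \operatorname{artanh}(r)$, Euclidean reconstructed distances $\hat{d}_{ij}$, and each unordered pair counted once in $r$ (the hat notation for embedded quantities is chosen here):

```latex
\frac{\partial s}{\partial \hat{x}_i^k}
  = \sum_{j \neq i}
    \frac{\partial s}{\partial r}\,
    \frac{\partial r}{\partial \hat{d}_{ij}}\,
    \frac{\partial \hat{d}_{ij}}{\partial \hat{x}_i^k},
\qquad
\frac{\partial s}{\partial r} = -\frac{1}{1-r^2},
\qquad
\frac{\partial \hat{d}_{ij}}{\partial \hat{x}_i^k}
  = \frac{\hat{x}_i^k - \hat{x}_j^k}{\hat{d}_{ij}}
```

The middle factor $\partial r / \partial \hat{d}_{ij}$ is the derivative of the Pearson correlation with respect to one of its arguments.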


High-Throughput Multi-Dimensional Scaling (HiT-MDS)

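HiT-MDS algorithm

Initialize $\hat{X}$ by random projection (or smarter).

Calculate correlation $r(X, \hat{X})$ once.

Draw next pattern $x_i$.

Minimize stress $s$ to all $x_j$: $\Delta \hat{x}_i^k \sim -\partial s / \partial \hat{x}_i^k$.

Recalculate distances $\hat{d}_{ij}$; adapt …, …, and $r$.

[Diagram: input $x_i \in X$ and embedding $\hat{x}_i \in \hat{X}$; distances $d_{ij}$ and $\hat{d}_{ij}$; correlation $r(d_{ij}, \hat{d}_{ij})$; stress $s$; steps numbered 1 to 4.]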
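
The loop on this slide can be sketched as follows. This is a simplified illustration, not the HiT-MDS implementation: it recomputes the full correlation after every move instead of updating it incrementally, and it uses a step-halving acceptance test so that the stress never increases. All function names, the step size and the epoch count are assumptions.

```python
import numpy as np

def pairwise_dist(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def upper(D):
    """Upper-triangle entries, so each pair of points is counted once."""
    return D[np.triu_indices(D.shape[0], k=1)]

def stress(D, D_hat):
    """Negative Fisher Z' of the distance correlation."""
    return float(-np.arctanh(np.corrcoef(upper(D), upper(D_hat))[0, 1]))

def grad_point(D, Y, i):
    """Gradient of the stress with respect to the coordinates of embedded point i."""
    D_hat = pairwise_dist(Y)
    a, b = upper(D), upper(D_hat)
    ac, bc = a - a.mean(), b - b.mean()
    B, C, Dd = ac @ bc, ac @ ac, bc @ bc
    r = B / np.sqrt(C * Dd)
    j = np.arange(len(Y)) != i                        # all other points
    dr = ((D[i, j] - a.mean()) - (B / Dd) * (D_hat[i, j] - b.mean())) / np.sqrt(C * Dd)
    dd = (Y[i] - Y[j]) / D_hat[i, j][:, None]         # Euclidean distance derivative
    return -(dr[:, None] * dd).sum(axis=0) / (1.0 - r ** 2)

def embed(X, dim=2, epochs=10, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_dist(X)
    Y = X @ rng.normal(size=(X.shape[1], dim))        # initialise by random projection
    history = [stress(D, pairwise_dist(Y))]
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):             # draw patterns in random order
            g = grad_point(D, Y, i)
            s_old = stress(D, pairwise_dist(Y))
            eta = step
            for _ in range(20):                       # halve the step until stress drops
                trial = Y.copy()
                trial[i] -= eta * g
                if stress(D, pairwise_dist(trial)) < s_old:
                    Y = trial
                    break
                eta *= 0.5
        history.append(stress(D, pairwise_dist(Y)))
    return Y, history

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 5))                          # synthetic 5-dimensional data
Y, history = embed(X)
print(history[0], history[-1])                        # stress before and after
```

The acceptance test trades the speed of a purely stochastic update for a guarantee that each move lowers the stress, which keeps the sketch easy to verify on small data.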


Applications of dimension reduction (visualization)

1. Gene space browser.

2. Macro-experiment grouping.

[Figure: panels 1 and 2, with time annotations from day 0 to day 26.]


Embedding 12k Genes (14 time points) in 2D

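[Figure: 2D embeddings of the original ("orig") and spline-fitted ("FIT") data under the EUC, COR and SRC measures; regions labelled U, I and D.]

EUC: Euclidean distance

COR: Pearson correlation

SRC: Spearman rank correlation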


Gene browser (4824 high-quality genes)

[Figure: gene expression profiles over time; x-axis from 0 to 26 DAF in steps of 2.]

[ visualization: www.ggobi.org ]


Gene browser for powers of correlation: $(1-r)^8$


Gene clustering (k=11), relevant genes in front


3D-View of 62 macroarrays (4824 genes)


Data processing challenges in biology

Data sets from

- metabolite measurements (2D-gels, HPLC),

- QTL LOD-score pattern compression,

- DNA-sequence arrangement.

Missing value imputation (probabilistic models)

Association studies (common latent space, CCA)

Rank-based data analysis (distribution models)

Faithful low-dimensional data representation

Proximity data handling

Common language: R / MATLAB / … ?


Thanks

http://pgrc-16.ipk-gatersleben.de/~stricker/

http://hitmds.webhop.net/

Pattern recognition group (IPK, headed by Udo Seiffert)

Nese Sreenivasulu (IPK, Molecular Biology)

Barbara Hammer (TU-Clausthal)

Thomas Villmann (University of Leipzig)


Some References

Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative Study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side publishers, Evere, Belgium, pp. 265-270, 2006.

Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Elsevier, 2006.

Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).