IPK Gatersleben, Pattern Recognition Group
Correlation-based Data Processing
and its Application to Biology
Marc Strickert
Osnabrück, 14 January 2005
Pattern Recognition Group
Schloss Dagstuhl
Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben
Goals
1. Attribute rating
2. Clustering
3. Classification
4. Visualization
of biological data,
exploiting properties of
Pearson correlation.
Euclidean distances may be problematic
Pair 1: $d_1 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$
Pair 2: $d_2 = (x_1 - y_1)^2 + \dots + (x_5 - y_5)^2$
The two distances are identical despite the different profile shapes.
[ John Lee and Michel Verleysen ]
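The point of this slide can be checked numerically. The following is a minimal sketch with made-up 5-point profiles (the values are assumptions chosen for illustration, not the data from the slide): two comparison profiles lie at the same squared Euclidean distance from a reference profile, yet only one of them shares its shape.

```python
import numpy as np

# Hypothetical 5-point profiles, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_shifted = x + 1.0                                    # same shape, constant offset
y_zigzag = x + np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # different shape

def squared_euclidean(a, b):
    """Sum of squared component differences."""
    return float(np.sum((a - b) ** 2))

def pearson(a, b):
    """Pearson correlation coefficient of two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

d1 = squared_euclidean(x, y_shifted)
d2 = squared_euclidean(x, y_zigzag)
r1 = pearson(x, y_shifted)
r2 = pearson(x, y_zigzag)
print(d1, d2)   # both distances are equal
print(r1, r2)   # the correlations differ
```

The Euclidean view cannot tell the two cases apart, while the correlation does.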
Pearson correlation invariant to scaling and shifting
Invariant to: amplitude (scaling) and vertical offset (shifting).
[Figure: up-regulated gene profiles. Euclidean view: raw data. 'Pearson' view: the same profiles, aligned, with the same correlations as in the raw data.]
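The invariance can be illustrated with a short sketch. The random 14-point profiles and the constants 3 and 10 are assumptions chosen for illustration. A positive rescaling and a vertical shift leave the Pearson correlation unchanged, while the squared Euclidean distance changes; a negative scale factor flips the sign of the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=14)   # hypothetical expression profile
y = rng.normal(size=14)   # second hypothetical profile

def pearson(a, b):
    """Pearson correlation coefficient of two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

x_transformed = 3.0 * x + 10.0            # amplitude change plus vertical offset

r_raw = pearson(x, y)
r_transformed = pearson(x_transformed, y)
r_flipped = pearson(-2.0 * x + 1.0, y)    # negative scale flips the sign

d_raw = float(np.sum((x - y) ** 2))
d_transformed = float(np.sum((x_transformed - y) ** 2))
```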
Derivatives of squared Euclidean and Pearson correlation
Squared Euclidean:
Pearson correlation:
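The formulas of this slide did not survive extraction. As a hedged reconstruction from the standard definitions (not necessarily the notation used on the slide), the derivatives with respect to a component $w_k$ are $\partial d^2 / \partial w_k = -2 (x_k - w_k)$ for $d^2 = \sum_i (x_i - w_i)^2$, and $\partial r / \partial w_k = (x_k - \mu_x) / \sqrt{C D} - r \, (w_k - \mu_w) / D$ for the Pearson correlation, with $C = \sum_i (x_i - \mu_x)^2$ and $D = \sum_i (w_i - \mu_w)^2$. The sketch below checks both expressions against central finite differences; all function names are illustrative.

```python
import numpy as np

def d2_grad(x, w):
    """Gradient of sum((x - w)**2) with respect to w."""
    return -2.0 * (x - w)

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

def pearson_grad(x, w):
    """Gradient of r(x, w) with respect to w."""
    xc, wc = x - x.mean(), w - w.mean()
    c, d = xc @ xc, wc @ wc
    r = xc @ wc / np.sqrt(c * d)
    return xc / np.sqrt(c * d) - r * wc / d

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        g[k] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

rng = np.random.default_rng(1)
x, w = rng.normal(size=6), rng.normal(size=6)

err_euclid = np.max(np.abs(
    d2_grad(x, w) - numeric_grad(lambda v: np.sum((x - v) ** 2), w)))
err_pearson = np.max(np.abs(
    pearson_grad(x, w) - numeric_grad(lambda v: pearson(x, v), w)))
```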
Applications for derivative of similarity measure
1. Attribute rating (variance analogue)
2. Clustering (Neural Gas for Correlation, NG-C)
3. Classification (GRLVQ-C)
4. Visualization (High-Throughput MDS)
Attribute rating
Variance as a double sum of derivatives of the squared Euclidean distance
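The slide's formula is not preserved, but the underlying identity is standard: since $\partial d^2(x_i, x_j) / \partial x_{ik} = 2 (x_{ik} - x_{jk})$, the population variance of attribute $k$ equals $\frac{1}{8 n^2} \sum_i \sum_j (\partial d^2(x_i, x_j) / \partial x_{ik})^2$. The following sketch verifies this on random data; the data matrix and the normalization constant as written here are assumptions of this example, not taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))   # 20 hypothetical samples, 4 attributes
n = X.shape[0]

# deriv[i, j, k] = d/dx_ik of sum_m (x_im - x_jm)^2 = 2 * (x_ik - x_jk)
deriv = 2.0 * (X[:, None, :] - X[None, :, :])

# Double sum over all sample pairs (i, j), separately for each attribute k.
var_from_derivs = np.sum(deriv ** 2, axis=(0, 1)) / (8.0 * n ** 2)
var_direct = X.var(axis=0)     # population variance (ddof=0)
```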
Correlation Analogue to Euclidean Variance
[Equation not preserved; surviving symbols: X, W]
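Because the equation on this slide is lost, the following is only a sketch by analogy with the Euclidean case: it rates attribute $k$ by the double sum of squared derivatives $\partial r(x_i, x_j) / \partial x_{ik}$ over all sample pairs. The rating formula, the function names, and the random data are assumptions of this example and may differ from the author's definition.

```python
import numpy as np

def pearson_grad_first(a, b):
    """Gradient of r(a, b) with respect to the components of a."""
    ac, bc = a - a.mean(), b - b.mean()
    saa, sbb = ac @ ac, bc @ bc
    r = ac @ bc / np.sqrt(saa * sbb)
    return bc / np.sqrt(saa * sbb) - r * ac / saa

def correlation_rating(X):
    """Assumed rating: per attribute, sum of squared correlation
    derivatives over all ordered pairs of distinct samples (rows)."""
    n, m = X.shape
    rating = np.zeros(m)
    for i in range(n):
        for j in range(n):
            if i != j:
                rating += pearson_grad_first(X[i], X[j]) ** 2
    return rating

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 6))   # 15 hypothetical profiles, 6 attributes
rating = correlation_rating(X)
ranking = np.argsort(-rating)  # attribute indices, highest rating first
```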
Clustering: Neural Gas (NG revisited)
NG-C:
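The NG-C update rule itself is not preserved in this text. The sketch below shows one plausible reading, stated as an assumption: standard Neural Gas rank-based neighborhood cooperation, with prototypes ranked by decreasing Pearson correlation to the presented sample and moved along the gradient of the correlation. Parameter names, the annealing schedule, and the synthetic data are all hypothetical.

```python
import numpy as np

def pearson_to_all(x, W):
    """Pearson correlation between vector x and every row of W."""
    xc = x - x.mean()
    Wc = W - W.mean(axis=1, keepdims=True)
    return Wc @ xc / np.sqrt((xc @ xc) * np.sum(Wc ** 2, axis=1))

def pearson_grad_rows(x, W):
    """Gradient of r(x, w) with respect to w, for every row w of W."""
    xc = x - x.mean()
    Wc = W - W.mean(axis=1, keepdims=True)
    c = xc @ xc
    d = np.sum(Wc ** 2, axis=1, keepdims=True)
    r = (Wc @ xc)[:, None] / np.sqrt(c * d)
    return xc[None, :] / np.sqrt(c * d) - r * Wc / d

def neural_gas_correlation(X, n_prototypes, epochs=30, eps=0.05,
                           lam_start=2.0, lam_end=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize prototypes with randomly chosen samples.
    W = X[rng.choice(len(X), n_prototypes, replace=False)].copy()
    total, t = epochs * len(X), 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # Exponentially annealed neighborhood range.
            lam = lam_start * (lam_end / lam_start) ** (t / total)
            r = pearson_to_all(X[i], W)
            ranks = np.argsort(np.argsort(-r))   # rank 0 = most correlated
            h = np.exp(-ranks / lam)
            # Gradient ascent on the correlation, weighted by rank.
            W += eps * h[:, None] * pearson_grad_rows(X[i], W)
            t += 1
    return W

# Synthetic profiles: three shapes with random amplitude, offset, and noise.
rng = np.random.default_rng(3)
t_axis = np.linspace(0.0, 1.0, 8)
shapes = np.stack([t_axis, 1.0 - t_axis, np.sin(np.pi * t_axis)])
X = np.array([rng.uniform(0.5, 3.0) * shapes[k % 3] + rng.uniform(-1.0, 1.0)
              + 0.05 * rng.normal(size=t_axis.size) for k in range(60)])
W = neural_gas_correlation(X, n_prototypes=3)
```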
High centroid reproducibility with NG-C
23 gene expression centroids, 10 independent runs.
k-means: indeterminate final states.
NG-C: crisp final states.
Classification with relevance learning
Used, for example, in
Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C)
Adaptive Pearson correlation:
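The formula for the adaptive correlation is missing from this text. One plausible form, given here as an assumption rather than as the author's definition, weights each attribute's contribution by a squared relevance factor: $r_\lambda(x, w) = \sum_i \lambda_i^2 (x_i - \mu_x)(w_i - \mu_w) / \sqrt{\sum_i \lambda_i^2 (x_i - \mu_x)^2 \cdot \sum_i \lambda_i^2 (w_i - \mu_w)^2}$. With all $\lambda_i$ equal, this reduces to the ordinary Pearson correlation. The vectors below are made up for illustration.

```python
import numpy as np

def adaptive_pearson(x, w, lam):
    """Relevance-weighted Pearson correlation (assumed form, see text)."""
    xc, wc = x - x.mean(), w - w.mean()
    l2 = lam ** 2
    num = np.sum(l2 * xc * wc)
    den = np.sqrt(np.sum(l2 * xc ** 2) * np.sum(l2 * wc ** 2))
    return float(num / den)

# Hypothetical data: x and w agree on four attributes, disagree on the fifth.
x = np.array([1.0, 2.0, 3.0, 4.0, 0.0])
w = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

r_plain = adaptive_pearson(x, w, np.ones(5))
r_masked = adaptive_pearson(x, w, np.array([1.0, 1.0, 1.0, 1.0, 0.0]))
r_reference = float(np.corrcoef(x, w)[0, 1])
```

Setting the relevance of the disagreeing attribute to zero raises the correlation, which is the effect relevance learning exploits when it down-weights uninformative attributes.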
Leukemia cancer data set: AML / ALL separation
GRLVQ-C: top-10 gene ranking by relevance factors.
1 prototype per class + relevance learning.
consistent with Golub et al.
Visualization of high-dimensional data
High-dimensional data (constant source)
Low-dimensional points (variable target)
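[Figure: "embedding" of points A, B, C in 3D, with pairwise distances d12, d13, d23, as points A', B', C' in 2D with the corresponding reconstructed distances.]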
Gradient-based stochastic optimization: HiT-MDS.
Maximize distance correlations: source ≈ reconstruction
original inter-point distance matrix
reconstructed inter-point distance matrix
Adaptive parameters: point coordinates
Minimize the embedding stress function using negative Fisher's Z':
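The stress formula itself is not preserved here. Fisher's transformation is $Z' = \frac{1}{2} \ln \frac{1 + r}{1 - r} = \operatorname{artanh}(r)$, so a stress of the form $s = -Z'$ decreases as the correlation $r$ between original and reconstructed distances grows. The sketch below computes such a stress; taking $r$ over the upper triangle of the two distance matrices, and the random data, are assumptions of this example.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def fisher_z_stress(D, D_hat):
    """Return (stress, r) with stress = -artanh(r) over all point pairs."""
    iu = np.triu_indices_from(D, k=1)       # each unordered pair once
    r = float(np.corrcoef(D[iu], D_hat[iu])[0, 1])
    return -float(np.arctanh(r)), r

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))                # hypothetical 3D source points
X_hat = X[:, :2]                            # naive 2D embedding: drop one axis

D = pairwise_distances(X)
D_hat = pairwise_distances(X_hat)
stress, r = fisher_z_stress(D, D_hat)
stress_scaled, _ = fisher_z_stress(D, 5.0 * D_hat)
```

Because the stress depends on the distances only through their correlation, rescaling the embedding leaves it unchanged.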
Iterative gradient descent for stress function minimization
Factors of the gradient:
derivative of Fisher's Z'
derivative of the distance, for Euclidean spaces
High-Throughput Multi-Dimensional Scaling (HiT-MDS)
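HiT-MDS algorithm
1. Initialize $\hat{X}$ by random projection (or smarter).
2. Calculate correlation $r(X, \hat{X})$ once.
3. Draw next pattern $x_i$.
4. Minimize stress $s$ to all $x_j$: $\Delta \hat{x}_{ik} \sim -\partial s / \partial \hat{x}_{ik}$; recalculate distances $\hat{d}_{ij}$; adapt …, …, and $r$.
[Diagram: input $x_i \in X$ with distances $d_{ij}$; embedding $\hat{x}_i \in \hat{X}$ with distances $\hat{d}_{ij}$; correlation $r(d_{ij}, \hat{d}_{ij})$; stress $s$. The step numbers 1 to 4 mark the corresponding parts of the diagram.]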
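The steps of this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the published HiT-MDS implementation: the stress is taken as $s = -\operatorname{artanh}(r)$, all distances and the correlation are recomputed naively at every update instead of being adapted incrementally, and the best embedding seen so far is kept as a safeguard against overshooting. The function name, learning rate, and data are hypothetical.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def hitmds_sketch(X, dim=2, epochs=30, rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)             # each point pair once
    D = pairwise_distances(X)
    d = D[iu]
    mu_d = d.mean()
    C = np.sum((d - mu_d) ** 2)

    def correlation(P):
        return float(np.corrcoef(d, pairwise_distances(P)[iu])[0, 1])

    # Step 1: initialize the embedding by random projection.
    X_hat = X @ rng.normal(size=(X.shape[1], dim))
    # Step 2: initial correlation between source and embedding distances.
    r_init = correlation(X_hat)
    best, best_r = X_hat.copy(), r_init
    for _ in range(epochs):
        for i in rng.permutation(n):         # Step 3: draw next pattern
            D_hat = pairwise_distances(X_hat)    # naive full recomputation
            d_hat = D_hat[iu]
            mu_h = d_hat.mean()
            V = np.sum((d_hat - mu_h) ** 2)
            r = np.sum((d - mu_d) * (d_hat - mu_h)) / np.sqrt(C * V)
            # dr / d(d_hat_ij) for all j, then ds/dr with s = -artanh(r).
            dr_dd = (D[i] - mu_d) / np.sqrt(C * V) - r * (D_hat[i] - mu_h) / V
            ds_dd = -dr_dd / max(1.0 - r ** 2, 1e-12)
            # d(d_hat_ij) / d(x_hat_i) = (x_hat_i - x_hat_j) / d_hat_ij.
            safe = np.where(D_hat[i] > 0.0, D_hat[i], 1.0)
            grad = np.sum((ds_dd / safe)[:, None] * (X_hat[i] - X_hat), axis=0)
            X_hat[i] -= rate * grad          # Step 4: gradient step on stress
        r_now = correlation(X_hat)
        if r_now > best_r:                   # keep the best embedding so far
            best, best_r = X_hat.copy(), r_now
    return best, r_init, best_r

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 5))                 # hypothetical 5D source data
embedding, r_init, r_best = hitmds_sketch(X)
```

The full recomputation costs quadratic time per update; the "adapt …, …, and r" step on the slide suggests that the original method instead maintains running quantities, which this sketch does not attempt to reproduce.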
Applications of dimension reduction (visualization)
1. Gene space browser.
2. Macro-experiment grouping.
[Figure: panels 1 and 2; labels: day 0, day 26.]
Embedding 12k Genes (14 time points) in 2D
[Figure: 2D embeddings; panel labels: orig, spline, FIT; measures: EUC, COR, SRC; region labels: U, I, D.]
EUC: Euclidean distance
COR: Pearson correlation
SRC: Spearman rank correlation
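The difference between the COR and SRC measures in the legend can be shown in a few lines. Spearman rank correlation is the Pearson correlation of the ranks, so it equals 1 for any strictly increasing relationship, while Pearson correlation stays below 1 unless the relationship is linear. The 14-point profiles below are made up, and the simple rank function assumes no tied values.

```python
import numpy as np

def ranks(v):
    """Ranks 0..n-1 of the entries of v; assumes there are no ties."""
    return np.argsort(np.argsort(v)).astype(float)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman rank correlation as Pearson correlation of ranks."""
    return pearson(ranks(a), ranks(b))

t = np.arange(14, dtype=float)   # 14 hypothetical time points
linear = t
exponential = np.exp(0.5 * t)    # monotone but strongly nonlinear

r_cor = pearson(linear, exponential)
r_src = spearman(linear, exponential)
```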
Gene browser (4824 high-quality genes)
[Figure: gene expression profiles; time axis from 0 to 26 DAF in steps of 2.]
[ visualization: www.ggobi.org ]
Gene browser for powers of correlation: $(1-r)^8$
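As a small illustration of what the eighth power does to the dissimilarity $1 - r$, the sketch below evaluates it for a few correlation values. The function name and the sample values are assumptions for this example. Dissimilarities below 1 (positively correlated profiles) are compressed toward 0, while those above 1 (anti-correlated profiles) are stretched, up to $2^8 = 256$.

```python
import numpy as np

def correlation_dissimilarity(r, power=8):
    """Dissimilarity (1 - r)**power for correlations r in [-1, 1]."""
    return (1.0 - np.asarray(r, dtype=float)) ** power

r_values = np.array([1.0, 0.5, 0.0, -1.0])
dissimilarities = correlation_dissimilarity(r_values)
```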
Gene clustering (k=11), relevant genes in front
3D-View of 62 macroarrays (4824 genes)
Data processing challenges in biology
Data sets from
- metabolite measurements (2D gels, HPLC),
- QTL LOD-score pattern compression,
- DNA-sequence arrangement.
Missing value imputation (→ probabilistic models)
Association studies (→ common latent space, CCA)
Rank-based data analysis (→ distribution models)
Faithful low-dimensional data representation
Proximity data handling
Common language: R / MATLAB / … ?
Thanks
http://pgrc-16.ipk-gatersleben.de/~stricker/
http://hitmds.webhop.net/
Pattern recognition group (IPK, headed by Udo Seiffert)
Nese Sreenivasulu (IPK, Molecular Biology)
Barbara Hammer (TU-Clausthal)
Thomas Villmann (University of Leipzig)
Some References
Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.
Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative Study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side Publishers, Evere, Belgium, pp. 265-270, 2006.
Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Elsevier, 2006.
Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.
Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).