1
CS/CBB 545 - Data MiningSpectral Methods (PCA,SVD)
#2 - Application
Mark Gerstein, Yale Universitygersteinlab.org/courses/545
(class 2007,03.08 14:30-15:45)
2
Intuition on interpretation of SVD in terms of genes
and conditions
3
SVD for microarray data(Alter et al, PNAS 2000)
4
5
Notation• m=1000 genes
– row-vectors
– 10 eigengene (vi) of dimension 10 conditions
• n=10 conditions (assays)– column vectors
– 10 eigenconditions (ui) of dimension 1000 genes
6
Understanding Eigengenes (vi) in terms PCA on (large) gene-gene correlation matrix
7
Understanding Eigenconditions (ui) in terms of PCA on (small) condition-condition correlation matrix
Bra - ket notation
8
Plotting Experiments in Low Dimension Subspace
9
Close up on Eigengenes
10Copyright ©2000 by the National Academy of Sciences
Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Genes sorted by correlation with top 2 eigengenes
11Copyright ©2000 by the National Academy of Sciences
Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Same thing different experiment: Genes sorted by relative correlation with first two eigengenes for alpha-factor experiment
12Copyright ©2000 by the National Academy of Sciences
Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Normalized elutriation
expression in the subspace
associated with the cell cycle
13
Biplot Applied to Genes and Conditions
See grouping of arrays and genes on same plot
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
14
Spectral Biclustering
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
15
Biclustering to associate particular genes with certain phenotypes
Conditions
Reo
rder
ed G
enes
(Sor
ted
acco
rdin
g to
a cl
assi
ficat
ion
vect
or)
?
Matrix of raw data
Gen
es
Reordered Conditions(Sorted according to
a classification vector)
Shuffled Matrix(containing checkerboard
“biclusters” of conditions with marker genes)
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
16
Pomeroy et. al. , Nature 415 (2002) 436Prediction of central nervous system embryonal tumor outcome based on gene expression
5 types of brain tumors
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
17
Intuition on Identification of Blocky Matrices
2
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
18
6 6 6 6
6 6
4 4 4 4
4 4 4 4
5 5 5
5 5 56 6
6 6 6
8 8 8 8
8 8 8 8
8 8 8
5
7
5
3 3
54 4 4 4
7 7 7
7 7 7 7
7
3
3 3 3
37
6
8 37 37
E
E
E
a
a
aD
D
D
c
c
c
b
b
b
b
a
Ax yGene partition vector
tumor 1 tumor 2
Gen
e cl
uste
r 1
Gen
e cl
uste
r 2
tumor 3
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
19
5
4 4 4
4 4 4
4
7 7 7 '
7 7 7 '
7 7 7 '
7
3 3 3
6 6 6
6 6 6
6 6 6
6 6 6
'
3 3 3
5 5
5 5 '
3 3 3
4 4
4 4 4
5
5 5 5
8
7
8 8 '
8 8 8 '
8 8 8 '
8 8 8 '
'
'
7
D
D
D
c
c
E
E
E
b
b
a
c
a
a
b
a
b
'TA y x
Tissue partition vector
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
20
6 6 6 6
6 6
4 4 4 4
4 4 4 4
5 5 5
5 5 56 6
6 6 6
8 8 8 8
8 8 8 8
8 8 8
5
7
5
3 3
54 4 4 4
7 7 7
7 7 7 7
7
3
3 3 3
37
6
8 37 37
E
E
E
a
a
aD
D
D
c
c
c
b
b
b
b
a
5
4 4 4
4 4 4
4
7 7 7 '
7 7 7 '
7 7 7 '
7
3 3 3
6 6 6
6 6 6
6 6 6
6 6 6
'
3 3 3
5 5
5 5 '
3 3 3
4 4
4 4 4
5
5 5 5
8
7
8 8 '
8 8 8 '
8 8 8 '
8 8 8 '
'
'
7
D
D
D
c
c
E
E
E
b
b
a
c
a
a
b
a
b
'TA Ax x
Ax y 'TA y x
Biclustering by SVD
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
21
6 6 6 6
6 6
4 4 4 4
4 4 4 4
5 5 5
5 5 56 6
6 6 6
8 8 8 8
8 8 8 8
8 8 8
5
7
5
3 3
54 4 4 4
7 7 7
7 7 7 7
7
3
3 3 3
37
6
8 37 37
E
E
E
a
a
aD
D
D
c
c
c
b
b
b
b
a
Identify checkerboard matrices by their action
on classification vectors: Formulation as “eigenproblem”
Checkerboard Matrix A
Condition Classification Vect. x
Conditions
Gen
es
Gene Classification Vector y
A A x = x’T
A A y = y’T
5
4 4 4
4 4 4
4
7 7 7 '
7 7 7 '
7 7 7 '
7
3 3 3
6 6 6
6 6 6
6 6 6
6 6 6
'
3 3 3
5 5
5 5 '
3 3 3
4 4
4 4 4
5
5 5 5
8
7
8 8 '
8 8 8 '
8 8 8 '
8 8 8 '
'
'
7
D
D
D
c
c
E
E
E
b
b
a
c
a
a
b
a
b
Genes
Con
ditio
ns
x’
yAT
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
22
SVD to Solve Eigenproblem
[Botstein]
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
23
Yuval Kluger et al. Genome Res. 2003; 13: 703-716
Figure 1. Overview of important parts of the biclustering process
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
24
4 8 9 3
8 5
1 6 6 3
7 3 2 4
2 7 6
8 6 12 9
6 5 7
8 9 7 8
9 6 8 9
7 9 9
5
8
2
6 1
84 3 4 5
9 4 7
7 5 8 8
6
2
1 5 3
27
6
7 36 49
E
E
E
a
a
aD
D
D
c
c
c
b
b
b
b
a
Gene partition with noisy data
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
25
.11
.12
.11
.12 .12
.11 .11
.10 .10 .10
.07 .07 .07
.12
.12 .12
.09
.10
.10 .10
.09 .0
.1
.12 .12
.
1 .11 .11
.10 .10
.10
.07
.07 .07
.04 .04 .04
.04 .04 .04
.04 .04 .04
9
.09 .09 .0.11
.11 .11 .11
12 .12 .1
.07 .07
.07 .07 .
2 .
07
.
.
10 .10 .1
9
.09 ..1 0907 .
02
1
1
09
D
D
D
c
E
E
b
b
b
bE
c
a
a
c
a
a
1R Ax y
Normalization Rescales Rows and Columns to Same Means
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
26
.18 .09 .0.16 .32 .16
.16 .32 .16
9
.18 .09
.214 .107 .107
.21
.16 .32 .16
4 .107 .107
.214 .1
.09
.18 .09 .0
.16 .32 .1
.14 .28 .14
.14 .28 .14
.14 .28 .14
.14 .28 .14
.09 .18 .0 .312 .1
07 .10
56 .15
9
.18 .
9
.09 .
7
.214 .10
18 .0
6
.31
09
2 .
7
1
.0
56
6
.
9
1
.
9
107
56
.0 .312 .156 .
'
'
9 .18 .0
'
'
156
'
'
'
'
'
9 '
'
D
D
E
E
E
b
bD
c
c
b
c
a
a
a
b
a
1 'TC A y x Rescale columns
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
27
Representative Cancer Data set
• Lymphoma Data from Dalla-Favera et al. at Columbia
• Informatics from Stolovitzky & Califano at IBM
• Supervised learning some identified characteristic genes associated with different types of lymphoma
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
28
Patients (samples) sorted according to projection onto blocky classification eigenvector
(u2)
Gen
es s
orte
d ac
cord
ing
to
proj
ectio
n on
to b
lock
y cl
assi
ficat
ion
eige
nvec
tor
(v2)
Matrix values represent outer products of two blocky
classification eigenvectors
Results on Representative Cancer Data set
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
29
Actual Data with Normalization and Sorting
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
30
Actual Data just with Sorting
(no normalization)
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
31
Actual Data (no normalization
or sorting)
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
32
Actual Data just with Sorting
(no normalization)
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
33
Actual Data with Normalization and Sorting
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
34
Patients (samples) sorted according to projection onto blocky classification eigenvector
(u2)
Gen
es s
orte
d ac
cord
ing
to
proj
ectio
n on
to b
lock
y cl
assi
ficat
ion
eige
nvec
tor
(v2)
Matrix values represent outer products of two blocky
classification eigenvectors
Just signal from top classification
eigenvectors
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
35
Low Dimension Representation
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
36
Patients (samples) sorted according to projection onto blocky classification eigenvector
(u2)
Gen
es s
orte
d ac
cord
ing
to
proj
ectio
n on
to b
lock
y cl
assi
ficat
ion
eige
nvec
tor
(v2)
Actual Values of Projections onto
Classification Eigenvectors
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
37
Classification of Cancers Based on Projection onto two top classification
eigenvectors: Better with Normalization
Normalized (“bistochastization”)
CLL DLCLFL DLCL
Straight SVD
Four types of Cancer in Della Favera dataset
(
c) M
Ger
stei
n '0
6, g
erst
ein
.info
/tal
ks
38
Golub, TR et. al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999 286
biclustering bistochastization
SVD bi-normalization Normalized cuts
ALL (B) ALL (T)AML