SIB course, 4-8 Feb 2008
Statistical analysis applied to genome and proteome analyses
Sven Bergmann
Department of Medical Genetics, University of Lausanne
Rue de Bugnon 27 - DGM 328, CH-1005 Lausanne, Switzerland
work: ++41-21-692-5452, cell: ++41-78-663-4980
http://serverdgm.unil.ch/bergmann
Part 1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA
Why study a large heterogeneous set of expression data?
Large: Better signals from noisy data!
Heterogeneous: Global view of the transcription program!
Supervised vs. unsupervised approaches
Large genome-wide data may contain answers to questions we do not ask! Need for both hypothesis-driven and exploratory analyses!
Motivations
How to get large-scale expression data?
Pool genome-wide expression measurements from many experiments!
[Figure: expression matrices with genes as rows (about 6000). Sets of specific conditions (stress, cell-cycle) are pooled into large-scale expression data over diverse conditions (about 1000).]
How to make sense of millions of numbers?
New Analysis and Visualization Tools are needed!
Hundreds of samples, thousands of genes
K-means Clustering
“Guess” k = 3 (number of clusters)
1. Start with random positions of centroids
2. Assign each data point to the closest centroid
3. Move centroids to the center of their assigned points
Iterate steps 2-3 until the cost is minimal:
$\sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2$
with k clusters $S_i$, i = 1, 2, ..., k and centroids $\mu_i$ (the mean point of all the points $x_j \in S_i$)
http://en.wikipedia.org/wiki/K-means_algorithm
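The three steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `kmeans` and `cost`, the choice of k random data points as starting centroids, and the stopping rule (stop when assignments no longer change) are assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: random starting centroids (assumption: pick k distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the cost cannot decrease further
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def cost(points, centroids, labels):
    """Sum of squared distances between each point and its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, labels = kmeans(points, k=3)
print(labels, cost(points, centroids, labels))
```

Running it with several values of `seed` and keeping the result with the lowest `cost` is one common way to reduce the dependence on the starting positions.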
K-means Clustering
Plus:
• visual
• intuitive
• relatively fast
Minus:
• have to “guess” number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy
Hierarchical Clustering
Plus:
• Shows (re-ordered) data
• Gives hierarchy
Minus:
• Does not work well for many genes (usually apply cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap
Principal Component Analysis
Principal components (PCs) are the directions of largest variation in the data; projecting onto them gives the subspace that retains the most variance
http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Example: 2 PCs for 3D data
• Raw data points: {a, …, z}
• Normalized data points: zero mean (& unit std)!
• Identification of axes with the most variance: most variance is along PCA1; the direction of most variance perpendicular to PCA1 defines PCA2
• Cluster?
http://ordination.okstate.edu/PCA.htm
Reminder: Matrix multiplication
Definition: $(A \cdot B)_{ij} = \sum_k A_{ik} B_{kj}$
[Figures: scheme, vectorized form, example]
http://en.wikipedia.org/wiki/Matrix_multiplication
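As a small worked example of the row-times-column rule, the sketch below multiplies a 2 × 3 matrix by a 3 × 2 matrix in plain Python. The helper names `matmul` and `transpose` and the example matrices are made up for illustration.

```python
def matmul(A, B):
    """Multiply matrices stored as lists of rows: (A·B)[i][j] = sum over k of A[i][k] * B[k][j]."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must agree")
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]      # 3 x 2

print(matmul(A, B))             # 2 x 2 product
print(matmul(transpose(A), A))  # 3 x 3 product of the form A^T·A
```

Products of the form A^T·A, as in the last line, are always square and symmetric.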
How do we get the PCs?
• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:
$C = E^T \cdot E / (n-1)$
$C \cdot pc = \lambda \cdot pc$
[Diagram: $E^T$ (300 × 6k) · E (6k × 300) / (n-1) = C (300 × 300); C (300 × 300) · pc (300 × 1) = λ · pc (300 × 1)]
PCA: Example deletion mutants
And how to project?
• The projected data is just the product of the original data with the PCs:
$E' = E \cdot PC$
• Principal component (or transformation) matrix: $PC = [pc_1, pc_2, \ldots, pc_n]$ (where n is the number of PCs used)
• The original gene expression profiles are over 300 arrays.
• The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays, shown in red)
[Diagram: E (6k × 300) · PC (300 × n) = E' (6k × n)]
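Both steps, finding the PCs as eigenvectors of the covariance matrix and projecting with E' = E·PC, can be sketched on a toy data set. The code below is an illustration under stated assumptions: n in C = E^T·E/(n-1) is taken to be the number of rows of E, the eigenvectors are found by power iteration with deflation (a simple stand-in for a proper eigen-solver), and all function names are hypothetical.

```python
import math
import random

def center_columns(X):
    """Subtract each column's mean so every column of the result has zero mean."""
    means = [sum(col) / len(col) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def covariance(E):
    """C = E^T·E / (n-1), with n taken as the number of rows of the centered matrix E."""
    n = len(E)
    cols = list(zip(*E))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]

def leading_pcs(C, n_pcs, iters=500, seed=0):
    """Top eigenpairs of the symmetric matrix C by power iteration with deflation."""
    rng = random.Random(seed)
    C = [row[:] for row in C]  # work on a copy
    pcs, eigenvalues = [], []
    for _ in range(n_pcs):
        v = [rng.random() + 0.1 for _ in C]  # generic start vector
        for _ in range(iters):
            w = [sum(c * x for c, x in zip(row, v)) for row in C]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient v^T·C·v gives the eigenvalue for the unit vector v.
        lam = sum(vi * sum(c * x for c, x in zip(row, v)) for vi, row in zip(v, C))
        pcs.append(v)
        eigenvalues.append(lam)
        # Deflation: remove the component just found, C <- C - lam·v·v^T.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(C))] for i in range(len(C))]
    return eigenvalues, pcs

def project(E, pcs):
    """E' = E·PC, where the columns of PC are the principal components."""
    return [[sum(x * p for x, p in zip(row, pc)) for pc in pcs] for row in E]

raw = [[12, 22], [8, 18], [11, 21], [9, 19], [10.5, 19.5], [9.5, 20.5]]
E = center_columns(raw)
eigenvalues, pcs = leading_pcs(covariance(E), n_pcs=2)
E_proj = project(E, pcs)
print(eigenvalues)
print(E_proj[0])
```

In this toy data most of the spread lies along the diagonal direction, so the first PC carries almost all of the variance and the second only a small remainder.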
[Figure: scatter plot of the data projected onto PCA1 (horizontal axis) and PCA2 (vertical axis)]
The first 2 “eigen-genes” separate data into 3 clusters
[Figure: scatter plot of the data projected onto PCA1 (horizontal axis) and PCA3 (vertical axis)]
Third “eigen-gene” (PCA3) reveals little structure!
Singular Value Decomposition
$E = U \cdot D \cdot V^T$
V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)
U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)
D: diagonal matrix
“SVD = bi-PCA”
http://public.lanl.gov/mewall/kluwer2002.html
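SVD: Matrix representation
$E = U \cdot D \cdot V^T$
[Diagram: E (6k × 300) = U (6k × n, columns $u_1, u_2, \ldots, u_n$) · D (n × n, diagonal entries $\lambda_1, \lambda_2, \ldots, \lambda_n$, zeros elsewhere) · $V^T$ (n × 300, rows $v_1, v_2, \ldots, v_n$)]
$u_i$: eigen-arrays
$v_i$: eigen-genes
$\lambda_i$: singular values ($\lambda_i^2$ are the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$)
i = 1, …, n
n: rank(E) = #(independent arrays)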
Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.
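SVD: What is optimized?
$E = U \cdot D \cdot V^T = \sum_i \lambda_i \, u_i \, v_i^T$ (full expansion)
$E_1 = \lambda_1 \, u_1 \, v_1^T$ (rank-1 expansion)
$\Delta = |E - E_1|^2$ (sum of squared residuals)
Minimize Δ for free $u_1$ and $v_1$:
$E \cdot v_1 = \lambda_1 u_1$ and $E^T \cdot u_1 = \lambda_1 v_1$
implying:
$E \cdot E^T \cdot u_1 = \lambda_1^2 u_1$ and $E^T \cdot E \cdot v_1 = \lambda_1^2 v_1$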
Bergmann et al., Phys. Rev. E 67, 031902 (2003)
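The two stationarity conditions, E·v1 = λ1·u1 and E^T·u1 = λ1·v1, suggest a simple way to find the rank-1 term: alternate between them until nothing changes. The sketch below does this in plain Python on a small 3 × 2 matrix. It is an illustration only: the function names, the start vector and the fixed iteration count are assumptions, and a production analysis would normally call a library SVD routine instead.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def rank1_svd(E, iters=200):
    """Leading triplet (lam, u, v) by alternating u ~ E·v and v ~ E^T·u."""
    Et = transpose(E)
    # Unit start vector; assumed not to be orthogonal to the leading v.
    v = [1.0 / math.sqrt(len(Et))] * len(Et)
    for _ in range(iters):
        u = matvec(E, v)
        nu = norm(u)
        u = [x / nu for x in u]
        v = matvec(Et, u)
        lam = norm(v)          # at convergence E^T·u = lam·v with |v| = 1
        v = [x / lam for x in v]
    return lam, u, v

def residual(E, lam, u, v):
    """Delta = |E - lam·u·v^T|^2, the sum of squared residuals of the rank-1 expansion."""
    return sum(
        (E[i][j] - lam * u[i] * v[j]) ** 2
        for i in range(len(E)) for j in range(len(E[0]))
    )

E = [[4.0, 2.0],
     [2.0, 1.0],
     [-1.0, 2.0]]
lam, u, v = rank1_svd(E)
print(lam, u, v, residual(E, lam, u, v))
```

For this matrix E^T·E has eigenvalues 25 and 5, so the leading λ is 5 and the residual left after the rank-1 expansion equals the remaining eigenvalue, 5.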
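SVD: Example deletion mutants
$E_1 = \lambda_1 \, u_1 \, v_1^T$
$u_1 v_1^T = \begin{pmatrix} u_1^{(1)} v_1^{(1)} & \cdots & u_1^{(1)} v_1^{(300)} \\ \vdots & & \vdots \\ u_1^{(6k)} v_1^{(1)} & \cdots & u_1^{(6k)} v_1^{(300)} \end{pmatrix}$
[Diagram: $E_1$ (6k × 300) = $\lambda_1$ · $u_1$ (6k × 1) · $v_1^T$ (1 × 300). Schematic: the product of a (high, low) column profile with a (high, low) row profile is high only where both are high and low elsewhere.]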
[Figures: for n = 1, 2 and 3 components, four panels each: original data (genes × arrays), U (genes × eigen-arrays), $V^T$ (eigen-genes × arrays), and the reconstruction SVD(data) = U·D·$V^T$ (genes × arrays), on a color scale from -1 to 1]
Part 1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA
How to extract biological information from large-scale expression data?
[Figure: large-scale expression matrix, about 6000 genes × 1000 conditions]
Hierarchical clustering and other correlation-based methods may be good for small data sets, but:
Problems with large data:
• Clusters cannot overlap!
• Clustering based on correlations over all conditions:
  - sensitive to noise
  - computation intensive
Search for transcription modules: sets of genes co-regulated under a certain set of conditions
• context specific
• allow for overlaps
Overview of “modular” analysis tools
• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)
• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)
• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)
• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)
• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)
• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003)
… and many more! http://serverdgm.unil.ch/bergmann/Publications/review.pdf
Coupled two-way Clustering
How to “hear” the relevant genes?
[Illustration: Song A, Song B]
Inside CTWC: Iterations
Depth | Genes | Samples
Init | G1 | S1
1 | G1(S1): G2, G3, …, G5 | S1(G1): S2, S3
2 | G1(S2), G1(S3): G6, G7, …, G13; G14, …, G21 | S1(G2) … S1(G5): S4, S5, S6; S10, S11; None
3 | G2(S1) … G2(S3), …, G5(S1) … G5(S3): G22, …, G97 | S2(G1) … S2(G5), S3(G1) … S3(G5): S12, …, S51
4 | G1(S4) … G1(S11): G98, …, G105, …, G151, …, G160 | S1(G6) … S1(G21): S52, …, S67
5 | G2(S4) … G2(S11), …, G5(S4) … G5(S11): G161, …, G216 | S2(G6) … S2(G21), S3(G6) … S3(G21): S68, …, S113
Two-way clustering
One example in more detail: The (Iterative) Signature Algorithm
• No need for correlations!
• decomposes data into “transcription modules”
• integrates external information
• allows for interspecies comparative analysis
J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)
Trip to the “Amazon”: How to find related items?
[Figure: items × customers matrix. Panel labels: your choice, customers with similar choice, recommended items, false positives.]
How to find related genes?
[Figure: genes × conditions expression matrix. Panel labels: your guess, relevant conditions, similarly expressed genes.]
J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)
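Signature Algorithm: Score definitions
Initial guess (genes): $G_I$
Condition scores: $s_c = \langle E^G_{gc} \rangle_{g \in G_I}$
Thresholding: $S_C = \{ c \in C : s_c - \langle s_c \rangle_{c \in C} > t_C \sigma_C \}$
Gene scores: $s_g = \langle s_c E^C_{gc} \rangle_{c \in S_C}$
Thresholding: $S_G = \{ g \in G : s_g - \langle s_g \rangle_{g \in G} > t_G \sigma_G \}$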
How to find related genes? Scores and thresholds!
[Figure: expression matrix with gene scores along one axis and condition scores along the other; thresholding selects the relevant conditions and genes]
Iterative Signature Algorithm
INPUT → OUTPUT, repeated until OUTPUT = INPUT: “Transcription Module”
SB, J Ihmels & N Barkai Physical Review E (2003)
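The iteration can be sketched as follows: average the seed genes to score the conditions, keep conditions scoring well above average, use them to score all genes, keep the high-scoring genes, and repeat until the gene set stops changing. This is a rough sketch and not the published implementation. It assumes that E^G is the expression matrix z-scored across genes for each condition, that E^C is z-scored across conditions for each gene, and that the σ in each threshold is the standard deviation of the corresponding scores; the function names and the toy data are hypothetical.

```python
import statistics

def zscore(values):
    """Shift to zero mean and scale to unit standard deviation (all zeros if constant)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def above_threshold(scores, t):
    """Indices whose score exceeds the mean score by more than t standard deviations."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return {i for i, s in enumerate(scores) if s - mu > t * sd}

def isa(E, seed_genes, t_g=0.5, t_c=0.5, max_iter=50):
    """Iterate gene set -> condition set -> gene set until the gene set is a fixed point."""
    n_genes, n_conds = len(E), len(E[0])
    # Assumed normalizations: E_G per condition across genes, E_C per gene across conditions.
    cols = [zscore([E[g][c] for g in range(n_genes)]) for c in range(n_conds)]
    E_G = [[cols[c][g] for c in range(n_conds)] for g in range(n_genes)]
    E_C = [zscore(row) for row in E]
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        if not genes:
            return set(), set()
        # Condition scores: average of E_G over the current gene set.
        s_c = [statistics.fmean([E_G[g][c] for g in genes]) for c in range(n_conds)]
        conds = above_threshold(s_c, t_c)
        if not conds:
            return set(), set()
        # Gene scores: average of s_c * E_C over the selected conditions.
        s_g = [statistics.fmean([s_c[c] * E_C[g][c] for c in conds]) for g in range(n_genes)]
        new_genes = above_threshold(s_g, t_g)
        if new_genes == genes:  # OUTPUT = INPUT: a fixed point
            break
        genes = new_genes
    return genes, conds

# Toy data: genes 0-2 are jointly induced in conditions 0-2, everything else is flat.
E = [[5.0 if g < 3 and c < 3 else 0.0 for c in range(6)] for g in range(6)]
module_genes, module_conds = isa(E, seed_genes={0, 1, 3})
print(sorted(module_genes), sorted(module_conds))
```

In the toy run the seed contains one unrelated gene (gene 3), which is dropped after the first pass because it scores below the threshold.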
Identification of transcription modules using many random “seeds”
[Diagram: random “seeds” → transcription modules]
Independent identification: Modules may overlap!
New Tools: Module Visualization
http://serverdgm.unil.ch/bergmann/Fibroblasts/visualiser.html
Gene enrichment analysis
The hypergeometric distribution f(M,A,K,T) gives the probability that K out of A genes with a particular annotation match with a module having M genes if there are T genes in total.
http://en.wikipedia.org/wiki/Hypergeometric_distribution
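For small numbers the probability can be computed directly from binomial coefficients, f(M,A,K,T) = C(A,K)·C(T-A, M-K) / C(T,M). The sketch below uses the slide's argument order; the function names and the tail sum used as an enrichment p-value (K or more matches) are choices made for this example.

```python
from math import comb

def hypergeom_pmf(M, A, K, T):
    """Probability that exactly K of the M module genes carry an annotation held by A of the T genes."""
    return comb(A, K) * comb(T - A, M - K) / comb(T, M)

def enrichment_pvalue(M, A, K, T):
    """Probability of seeing K or more annotated genes in the module by chance."""
    return sum(hypergeom_pmf(M, A, k, T) for k in range(K, min(M, A) + 1))

# Toy numbers: 10 genes in total, 4 annotated, a module of 3 genes containing 2 annotated ones.
print(hypergeom_pmf(M=3, A=4, K=2, T=10))
print(enrichment_pvalue(M=3, A=4, K=2, T=10))
```

A small p-value means the overlap between module and annotation is unlikely to arise by chance; when many annotations are tested, a multiple-testing correction is usually applied on top.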
Decomposing expression data into annotated transcriptional modules
identified >100 transcriptional modules in yeast:
high functional consistency!
many functional links “waiting” to be verified experimentally
J Ihmels, SB & N Barkai Bioinformatics 2005
Higher-order structure: Module hierarchies and networks
[Figure: module network with links marking correlated and anti-correlated modules]
The challenge of many datasets: How to integrate all the information?
• Organisms
• Data types: protein expression, tissue-specific expression, interaction data, localization data, …?
• Conditions: developmental, physiological, environmental, experimental, clinical
→ Biological Insight
Mapping Transcription Modules
[Diagram labels: BLAST, signature algorithm]
For distant organisms correlation patterns generally are distinct
SB, J Ihmels & N Barkai PLoS Biology (2004)
What about related organisms?
J Ihmels, SB, J Berman & N Barkai Science (2005)
[Figure: pairwise correlation (over all arrays) between genes]
Promoter analysis: The “Rapid Growth Element” AATTTT
Data Integration: Example NCI60
Our (modular) approach: The model
[Diagram: gene-modules, drug-modules and co-modules built from genes (G), cell lines (C) and drugs (D)]
Modules and Co-modules
The Ping-Pong algorithm!
Iteratively refine genes, cell-lines and drugs to get co-modules
[Diagram: steps 1 to 4 of the iteration]
Co-modules have predictive power for drug-gene associations
Co-module analysis provides biological focus through data integration
Take-home Messages:
• Analysis of large-scale expression data holds great potential for understanding global transcription programs and their evolution
• Innovative analysis tools are needed to extract information from such data
• (Iterative) Signature & Ping-Pong Algorithms:
  - decompose data into “transcription modules”
  - integrate external information
  - allow for interspecies comparative analysis