
New Analysis of large datasets - Part1 · 2013. 6. 28. · Part1: Analysis tools for large datasets • Standard tools k-means, PCA, SVD • Modular analysis tools CTWC, ISA, PPA


SIB course, 4-8 Feb 2008

Statistical analysis applied to genome and proteome analyses

Sven Bergmann
Department of Medical Genetics
University of Lausanne
Rue de Bugnon 27 - DGM 328
CH-1005 Lausanne
Switzerland

work: ++41-21-692-5452
cell: ++41-78-663-4980

http://serverdgm.unil.ch/bergmann

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

Why study a large heterogeneous set of expression data?

Large: Better signals from noisy data!

Heterogeneous: Global view of the transcription program!

Supervised vs. unsupervised approaches

Large genome-wide data may contain answers to questions we do not ask! Need for both hypothesis-driven and exploratory analyses!

Motivations

How to get large-scale expression data?

Pool genome-wide expression measurements from many experiments!

[Figure: expression matrices from sets of specific conditions (stress, cell-cycle) are pooled into one large-scale expression data matrix of genes × diverse conditions.]

How to make sense of millions of numbers?

New Analysis and Visualization Tools are needed!

Hundreds of samples

Thousands of genes

K-means Clustering

http://en.wikipedia.org/wiki/K-means_algorithm

“Guess” k = 3 (number of clusters)

1. Start with random positions of the centroids

2. Assign each data point to the closest centroid

3. Move the centroids to the center of their assigned points

Iterate steps 2-3 until the cost no longer decreases:

$\sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2$

with k clusters $S_i$, i = 1, 2, ..., k, and centroids $\mu_i$ (the mean point of all the points $x_j \in S_i$)
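The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the function names `kmeans` and `sq_dist`, the toy points and the choice to start from k randomly picked data points are assumptions made here, not part of the course material.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, labels, cost) for k clusters of the given points."""
    rng = random.Random(seed)
    # Step 1: start from k randomly chosen data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, the cost cannot decrease further
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    # Cost: sum of squared distances of all points to their centroid.
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, cost


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
# Different starting seeds can end in different local minima, so keep the cheapest run.
best = min((kmeans(points, k=3, seed=s) for s in range(10)), key=lambda run: run[2])
print(best[0], best[2])
```

Running several seeds and keeping the lowest cost is one common way to soften the dependence on the starting positions.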

K-means Clustering

Plus:
• visual
• intuitive
• relatively fast

Minus:
• have to “guess” the number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy

Hierarchical Clustering

Plus:
• Shows (re-ordered) data
• Gives hierarchy

Minus:
• Does not work well for many genes (usually apply a cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap


Principal Component Analysis

Principal components (PCs) are projections onto the subspace with the largest variation in the data

http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2 PCs for 3D data

http://ordination.okstate.edu/PCA.htm

• Raw data points: {a, …, z}

• Normalized data points: zero mean (& unit std)!

• Identification of the axes with the most variance: most variance is along PCA1; the direction of most variance perpendicular to PCA1 defines PCA2

• Cluster?

Reminder: Matrix multiplication

Definition: $(A \cdot B)_{ij} = \sum_k A_{ik} B_{kj}$

[Slide also showed a scheme, a vectorized form and an example; see http://en.wikipedia.org/wiki/Matrix_multiplication]
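As a small worked reminder, the row-times-column rule can be written out directly. The helper name `matmul` and the example matrices are chosen here for illustration only.

```python
def matmul(A, B):
    """Multiply an (n x m) matrix A by an (m x p) matrix B, both lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    # Entry (i, j) is the sum over k of A[i][k] * B[k][j].
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul(A, B))  # [[2, 1], [4, 3]]
```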


How do we get the PCs?

• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:

$C = E^T \cdot E / (n-1)$

$C \cdot pc = \lambda \cdot pc$

[Diagram: E is 6k × 300, so $C = E^T \cdot E / (n-1)$ is 300 × 300; each pc is a 300 × 1 vector with eigenvalue λ.]

PCA: Example deletion mutants

And how to project?

• The projected data is just the product of the original data with the PCs:

$E' = E \cdot PC$

• Principal component (or transformation) matrix: $PC = [pc_1, pc_2, \dots, pc_n]$ (where n is the number of PCs used)

[Diagram: E' (6k × n) = E (6k × 300) · PC (300 × n)]

• The original gene expression profiles are over 300 arrays.

• The transformed data contain projections onto n “eigen-genes” (linear combinations of the 300 arrays, shown in red)
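The two recipes (eigenvectors of the covariance matrix, then projection by a matrix product) can be tried on a toy matrix. The sketch below assumes NumPy, takes the n in the covariance formula to be the number of rows of E, and uses a made-up 5 × 2 data set; the function name `pca` is hypothetical.

```python
import numpy as np


def pca(E, n_pcs):
    """Return (projected data E', PC matrix, variances) for the top n_pcs PCs."""
    E = np.asarray(E, dtype=float)
    E = E - E.mean(axis=0)                 # mean-center every column
    C = E.T @ E / (E.shape[0] - 1)         # covariance matrix C = E^T·E / (n-1)
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric; eigh sorts ascending
    order = np.argsort(eigvals)[::-1][:n_pcs]
    PC = eigvecs[:, order]                 # PC = [pc_1, ..., pc_n] as columns
    return E @ PC, PC, eigvals[order]      # E' = E·PC


# Toy data: 5 points spread mostly along the direction (1, 2).
E = np.array([[-1.8, -4.1], [-1.4, -1.8], [0.4, -0.2], [0.6, 2.2], [2.2, 3.9]])
E_proj, PC, variances = pca(E, n_pcs=2)
print(PC[:, 0], variances)
```

The sign of each PC is arbitrary, so a plot of the projected data may appear mirrored between implementations.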

PCA: Example deletion mutants

[Figure: scatter plot of the data projected onto PCA1 (x-axis) and PCA2 (y-axis)]

The first 2 “eigen-genes” separate data into 3 clusters

PCA: Example deletion mutants

[Figure: scatter plot of the data projected onto PCA1 (x-axis) and PCA3 (y-axis)]

Third “eigen-gene” (PCA3) reveals little structure!

PCA: Example deletion mutants


Singular Value Decomposition

$E = U \cdot D \cdot V^T$

• V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)

• U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)

• D: diagonal matrix

“SVD = bi-PCA”

http://public.lanl.gov/mewall/kluwer2002.html

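SVD: Matrix representation

$E = U \cdot D \cdot V^T$

[Diagram: E (6k × 300) = U (6k × n, columns $u_1, u_2, \dots, u_n$) · D (n × n, diagonal entries $\lambda_1, \lambda_2, \dots, \lambda_n$, zeros elsewhere) · $V^T$ (n × 300, rows $v_1, v_2, \dots, v_n$)]

• $u_i$: eigen-arrays
• $v_i$: eigen-genes
• $\lambda_i$: singular values ($\lambda_i^2$ are the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$)
• i = 1, …, n, where n = rank(E) = #(independent arrays)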

Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.

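SVD: What is optimized?

$E = U \cdot D \cdot V^T = \sum_i \lambda_i \, u_i \, v_i^T$ (full expansion)

$E_1 = \lambda_1 \, u_1 \, v_1^T$ (rank-1 expansion)

$\Delta = |E - E_1|^2$ (sum of squared residuals)

Minimize $\Delta$ for free $u_1$ and $v_1$:

$E \cdot v_1 = \lambda_1 \, u_1$ and $E^T \cdot u_1 = \lambda_1 \, v_1$

implying:

$E \cdot E^T \cdot u_1 = \lambda_1^2 \, u_1$ and $E^T \cdot E \cdot v_1 = \lambda_1^2 \, v_1$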

Bergmann et al., Phys. Rev. E 67, 031902 (2003)
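The pair of conditions E·v1 = λ1·u1 and E^T·u1 = λ1·v1 suggests a simple alternating iteration for the rank-1 term. The sketch below is an illustration under assumptions (NumPy, a made-up 4 × 3 matrix, a hypothetical function name `rank1_svd`); it compares the result with NumPy's own SVD rather than claiming to be the method used in the cited papers.

```python
import numpy as np


def rank1_svd(E, iters=500, seed=0):
    """Alternate u <- E·v and v <- E^T·u (with normalization) to find the leading triplet."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(E.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = E @ v                  # E·v1 is proportional to u1
        u /= np.linalg.norm(u)
        v = E.T @ u                # E^T·u1 = lambda1·v1
        lam = np.linalg.norm(v)
        v /= lam
    return lam, u, v


E = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0], [1.0, 0.0, 1.0]])
lam1, u1, v1 = rank1_svd(E)
E1 = lam1 * np.outer(u1, v1)                     # rank-1 expansion
residual = np.sum((E - E1) ** 2)                 # Delta = |E - E1|^2
singular_values = np.linalg.svd(E, compute_uv=False)
print(lam1, singular_values[0])
print(residual, np.sum(singular_values[1:] ** 2))
```

The two printed pairs should agree: the leading singular value matches λ1, and the residual equals the sum of the remaining squared singular values.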

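SVD: Example deletion mutants

$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$

$E_1 = \lambda_1 \begin{pmatrix} u_1^{(1)} v_1^{(1)} & \cdots & u_1^{(1)} v_1^{(300)} \\ \vdots & & \vdots \\ u_1^{(6k)} v_1^{(1)} & \cdots & u_1^{(6k)} v_1^{(300)} \end{pmatrix}$

[Diagram: $E_1$ (6k × 300) = $u_1$ (6k × 1) · $\lambda_1$ · $v_1^T$ (1 × 300); schematically, $E_1$ is high only where both $u_1$ and $v_1$ are high, and low elsewhere.]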

SVD: Example deletion mutants

[Figure, four panels: original data (genes × arrays); U (n=1) (genes × eigen-arrays); VT (n=1) (eigen-genes × arrays); SVD(data) = U D VT (n=1) (genes × arrays). Color scale from -1 to 1.]

SVD: Example deletion mutants

[Figure, four panels: original data (genes × arrays); U (n=2) (genes × eigen-arrays); VT (n=2) (eigen-genes × arrays); SVD(data) = U D VT (n=2) (genes × arrays). Color scale from -1 to 1.]


SVD: Example deletion mutants

[Figure, four panels: original data (genes × arrays); U (n=3) (genes × eigen-arrays); VT (n=3) (eigen-genes × arrays); SVD(data) = U D VT (n=3) (genes × arrays). Color scale from -1 to 1.]

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

How to extract biological information from large-scale expression data?

[Figure: large-scale expression data matrix (genes × conditions)]

Hierarchical clustering and other correlation-based methods may be good for small data sets, but:

Problems with large data:
• Clusters cannot overlap!
• Clustering is based on correlations over all conditions:
  - sensitive to noise
  - computation intensive

Search for transcription modules:

Set of genes co-regulated under a certain set of conditions
• context specific
• allow for overlaps

Overview of “modular” analysis tools

• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)

• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)

• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)

• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)

• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)

• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)

… and many more! http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering


How to “hear” the relevant genes?

Song A

Song B

Inside CTWC: Iterations

Depth | Genes (operation) | Genes (clusters) | Samples (operation) | Samples (clusters)
Init | | G1 | | S1
1 | G1(S1) | G2, G3, …, G5 | S1(G1) | S2, S3
2 | G1(S2); G1(S3) | G6, G7, …, G13; G14, …, G21 | S1(G2) … S1(G5) | S4, S5, S6; S10, S11; None
3 | G2(S1) … G2(S3), …, G5(S1) … G5(S3) | G22 … G97 | S2(G1) … S2(G5); S3(G1) … S3(G5) | S12, … S51
4 | G1(S4) … G1(S11) | G98, …, G105, …, G151, …, G160 | S1(G6) … S1(G21) | S52, …, S67
5 | G2(S4) … G2(S11), …, G5(S4) … G5(S11) | G161 … G216 | S2(G6) … S2(G21); S3(G6) … S3(G21) | S68 … S113

Two-way clustering

• No need for correlations!

• decomposes data into “transcription modules”

• integrates external information

• allows for interspecies comparative analysis

One example in more detail:

The (Iterative) Signature Algorithm:

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Trip to the “Amazon”: How to find related items?

[Figure: items × customers matrix with labels: your choice, customers with similar choice, recommended items, false positives.]


How to find related genes?

[Figure: genes × conditions matrix with labels: your guess, relevant conditions, similarly expressed genes.]

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

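Signature Algorithm: Score definitions

Initial guess (genes): $G_I$

Condition scores: $s_c = \langle E^G_{gc} \rangle_{g \in G_I}$

Thresholding: $S_C = \{ c \in C : s_c - \langle s_c \rangle_{c \in C} > t_C \sigma_C \}$

Gene scores: $s_g = \langle s_c E^C_{gc} \rangle_{c \in S_C}$

Thresholding: $S_G = \{ g \in G : s_g - \langle s_g \rangle_{g \in G} > t_G \sigma_G \}$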

How to find related genes? Scores and thresholds!

[Figure: gene scores and condition scores, each followed by thresholding.]

Iterative Signature Algorithm

INPUT → OUTPUT; iterate until OUTPUT = INPUT: “Transcription Module”

SB, J Ihmels & N Barkai, Physical Review E (2003)
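The score-and-threshold loop can be sketched on synthetic data. This is a rough illustration only: it assumes NumPy, assumes that the two normalized matrices are z-scores taken per condition and per gene respectively, uses plain standard deviations for the thresholds, and plants an artificial 10 × 10 module; the function names and parameter values are made up here and are not taken from the cited papers.

```python
import numpy as np


def zscore(X, axis):
    """Normalize X to zero mean and unit standard deviation along the given axis."""
    return (X - X.mean(axis=axis, keepdims=True)) / X.std(axis=axis, keepdims=True)


def signature_step(EG, EC, genes, t_c, t_g):
    """One pass: genes -> condition scores -> conditions -> gene scores -> genes."""
    s_c = EG[genes].mean(axis=0)                                # condition scores
    conds = np.flatnonzero(s_c - s_c.mean() > t_c * s_c.std())  # thresholding
    if conds.size == 0:
        return np.array([], dtype=int), conds
    s_g = (EC[:, conds] * s_c[conds]).mean(axis=1)              # gene scores
    new_genes = np.flatnonzero(s_g - s_g.mean() > t_g * s_g.std())
    return new_genes, conds


def isa(E, seed_genes, t_c=1.0, t_g=2.0, max_iter=50):
    """Iterate the signature step until the gene set reproduces itself."""
    EG = zscore(E, axis=0)   # assumed: normalized over genes, per condition
    EC = zscore(E, axis=1)   # assumed: normalized over conditions, per gene
    genes = np.asarray(seed_genes)
    for _ in range(max_iter):
        new_genes, conds = signature_step(EG, EC, genes, t_c, t_g)
        converged = np.array_equal(new_genes, genes)   # OUTPUT = INPUT
        genes = new_genes
        if converged or genes.size == 0:
            break
    return genes, conds


# Synthetic data: 100 genes x 40 conditions of noise with one planted module.
rng = np.random.default_rng(0)
E = 0.5 * rng.standard_normal((100, 40))
E[:10, :10] += 5.0
genes, conds = isa(E, seed_genes=[0, 1, 2, 3, 4, 50])
print(genes, conds)
```

Starting from a partly wrong seed (gene 50 is not in the planted module), the iteration should settle on the planted genes and conditions.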


Identification of transcription modules using many random “seeds”

Random “seeds” → Transcription modules

Independent identification: Modules may overlap!

New Tools: Module Visualization

http://serverdgm.unil.ch/bergmann/Fibroblasts/visualiser.html

Gene enrichment analysis

The hypergeometric distribution f(M,A,K,T) gives the probability that K out of A genes with a particular annotation match with a module having M genes if there are T genes in total.

http://en.wikipedia.org/wiki/Hypergeometric_distribution
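Written out, f(M,A,K,T) = C(A,K)·C(T−A, M−K) / C(T,M), where C(n,k) is the binomial coefficient. A minimal sketch using only Python's standard library follows; the function names and the example numbers are hypothetical, and the tail sum (K or more matches) is one common way to turn the distribution into an enrichment p-value, not necessarily the one used in the course.

```python
from math import comb


def hypergeom_pmf(M, A, K, T):
    """Probability that exactly K of the A annotated genes fall into a module of M genes (T genes in total)."""
    return comb(A, K) * comb(T - A, M - K) / comb(T, M)


def enrichment_pvalue(M, A, K, T):
    """Probability of observing K or more annotated genes in the module by chance."""
    return sum(hypergeom_pmf(M, A, k, T) for k in range(K, min(A, M) + 1))


# Hypothetical numbers: 6000 genes in total, 100 annotated, a module of 50 genes, 10 matches.
print(enrichment_pvalue(M=50, A=100, K=10, T=6000))
```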

Decomposing expression data into annotated transcriptional modules

identified >100 transcriptional modules in yeast:

high functional consistency!

many functional links “waiting” to be verified experimentally

J Ihmels, SB & N Barkai Bioinformatics 2005

Module hierarchies and networks

Higher-order structure

[Figure legend: correlated, anti-correlated]


Organisms

Data types

Conditions

Developmental

Physiological

Environmental

Experimental

Clinical

– Protein expression
– Tissue specific expression
– Interaction data
– Localization data
– …?

Biological Insight

The challenge of many datasets: How to integrate all the information?

BLAST, signature algorithm

Mapping Transcription Modules

For distant organisms correlation patterns generally are distinct

SB, J Ihmels & N Barkai PLoS Biology (2004)

What about related organisms?

J Ihmels, SB, J Berman & N Barkai Science (2005)

[Figure: pairwise correlation (over all arrays) between genes]

Promoter analysis: The “Rapid Growth Element” AATTTT

Data Integration: Example NCI60


Our (modular) approach: The model

[Diagram: co-modules, gene-modules and drug-modules; node labels G1-G4, D1-D4, C1-C6 and F1-F6, annotated [AGF], [BFC] and [CDF].]

Modules and Co-modules

[Diagram: gene-modules (G1-G4), drug-modules (D1-D4) and co-modules (C1-C6), with the data matrices labelled E and R.]

Iteratively refine genes, cell-lines and drugs to get co-modules

The Ping-Pong algorithm!

[Figure: steps 1-4 of the iteration]

Co-modules have predictive power for drug-gene associations

Co-modules analysis provides biological focus through data integration

Take-home Messages:

• Analysis of large-scale expression data bears great potential to understand global transcription programs and their evolution

• Innovative analysis tools are needed to extract information from such data

• (Iterative) Signature & Ping-Pong Algorithms:
  – decompose data into “transcription modules”
  – integrate external information
  – allow for interspecies comparative analysis