Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium

Guy Harari

FABIA: factor analysis for bicluster acquisition

Sepp Hochreiter et al.

University of Linz, Austria

FABIA - Motivation

• Plaid models, for bicluster i: $\theta_{kij} = \mu_i + \alpha_{ki} + \beta_{ij}$

• They use a least-squares fit for model selection.

• Thus they assume Gaussian effects.

• However, microarray datasets are not Gaussian (heavy tails).

FABIA – model

• Biclusters have multiplicative coherent values: a bicluster is the outer product $\lambda z^T$.

• λ – prototype

• z – factors

• Example:

$$\lambda z^T = \begin{pmatrix} 2 & 0 & 3 & 4 \\ 1 & 0 & 1.5 & 2 \\ 0 & 0 & 0 & 0 \\ 4 & 0 & 6 & 8 \end{pmatrix}, \qquad \lambda = (2, 1, 0, 4)^T, \quad z = (1, 0, 1.5, 2)^T$$
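The example can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the talk.

```python
import numpy as np

# Prototype (gene profile) and factor (sample profile) from the slide's example.
lam = np.array([2.0, 1.0, 0.0, 4.0])
z = np.array([1.0, 0.0, 1.5, 2.0])

# A multiplicative bicluster is the outer product lambda * z^T.
X = np.outer(lam, z)
print(X)

# A zero entry in lam (or z) means that gene (or sample) is outside the bicluster.
genes_in_bicluster = np.flatnonzero(lam)
samples_in_bicluster = np.flatnonzero(z)
```

Every row of X is a scaled copy of z and every column a scaled copy of λ, which is what "multiplicative coherent values" means here.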

FABIA – model

• For p biclusters and additive Gaussian noise:

$$X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon = \Lambda Z + \Upsilon$$

• The j-th sample (column in X) is:

$$x_j = \sum_{i=1}^{p} \lambda_i z_{ij} + \epsilon_j = \Lambda \tilde{z}_j + \epsilon_j$$

• where $\tilde{z}_j$ is the j-th column of Z.

• Λ and Z are sparse.

Generative Model for Factor Analysis

• The data are assumed to be produced by:
– Picking values independently from some Gaussian hidden factors, $f_i \sim N(0, 1)$.
– Linearly combining the factors using a factor loading matrix with entries $w_{ij}$.
– Adding Gaussian noise for each input $x_j$ (figure label: $N(\mu_j, \sigma_j^2)$).

Generative Model for Factor Analysis

• Assume factors and noise are independent.

• Assume also $\mathrm{Cov}(F) = I$.

• Select the number of factors by e.g. the Kaiser criterion: the number of eigenvalues of $\mathrm{Cov}(X)$ greater than 1.

• Extract factors using e.g. maximum likelihood.
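The generative steps and the Kaiser criterion can be sketched as follows. This is only an illustration, assuming NumPy; the sizes, the noise level `sigma`, and the helper name `kaiser_count` are hypothetical, and the criterion is applied to standardized data (so the covariance equals the correlation matrix), which is an assumption the slides do not state.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_factors = 2000, 20, 3   # illustrative sizes

# 1. Hidden factors f_i ~ N(0, 1), drawn independently, so Cov(F) = I.
F = rng.standard_normal((n_samples, n_factors))
# 2. Factor loading matrix W with entries w_ij (factor i -> input j).
W = rng.standard_normal((n_factors, n_inputs))
# 3. Linear combination plus independent Gaussian noise on each input.
mu = np.zeros(n_inputs)
sigma = 0.1
X = mu + F @ W + sigma * rng.standard_normal((n_samples, n_inputs))

def kaiser_count(data):
    """Number of eigenvalues of the covariance of standardized data that exceed 1."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))
    return int(np.sum(eigenvalues > 1.0))

print(kaiser_count(X))
```

With a low noise level the count cannot exceed the number of factors used to generate the data, because the remaining eigenvalues only carry noise.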

FABIA – model

• Fix the sample index j.

• The factors are the $z_{ij}$, $1 \le i \le p$.

• $z_{ij} \sim N(0, 1)$.

• Biclusters shouldn't be correlated: $\mathrm{Cov}(\tilde{z}_j) = I$.

• The entries of $\Lambda$ are the factor loadings.

• $\Psi$ is diagonal (independent Gaussian noise): $\epsilon_j \sim N(0, \Psi)$.

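Sparseness

• We want sparse solutions for $\lambda_i$ and $z_i$.

• So use a Laplace distribution for $\tilde{z}$ (the sample index j is dropped):

$$p(\tilde{z}) = \left(\frac{1}{\sqrt{2}}\right)^{p} \prod_{i=1}^{p} e^{-\sqrt{2}\,|z_i|}$$

• For $\lambda_i$ use one of:

1. FABIA:

$$p(\lambda_i) = \left(\frac{1}{\sqrt{2}}\right)^{n} \prod_{k=1}^{n} e^{-\sqrt{2}\,|\lambda_{ki}|}$$

2. FABIAS:

$$p(\lambda_i) = \begin{cases} c & \text{for } \mathrm{sp}(\lambda_i) \le spL \\ 0 & \text{for } \mathrm{sp}(\lambda_i) > spL \end{cases} \qquad \mathrm{sp}(\lambda_i) = \frac{\sqrt{n} - \sum_{k=1}^{n} |\lambda_{ki}| \Big/ \sqrt{\sum_{k=1}^{n} \lambda_{ki}^2}}{\sqrt{n} - 1}$$

where spL is a parameter.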
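The two quantities on this slide are easy to evaluate directly. The sketch below assumes the sparseness measure is Hoyer's (with a square root over the sum of squares) and works on the log scale for the Laplace prior; the function names are hypothetical.

```python
import math

def hoyer_sparseness(v):
    """sp(v) = (sqrt(n) - sum|v_k| / sqrt(sum v_k^2)) / (sqrt(n) - 1)."""
    n = len(v)
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v))
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def laplace_log_prior(v):
    """log of (1/sqrt(2))^n * prod_k exp(-sqrt(2) |v_k|)."""
    return -len(v) * math.log(math.sqrt(2)) - math.sqrt(2) * sum(abs(x) for x in v)

print(hoyer_sparseness([0.0, 0.0, 5.0, 0.0]))   # a single non-zero entry
print(hoyer_sparseness([1.0, 1.0, 1.0, 1.0]))   # all entries equal
```

The measure is 1 for a vector with a single non-zero entry and 0 for a vector whose entries all have the same magnitude, so a threshold on it controls how many genes a prototype may involve.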

Model Selection

• Center the data to zero median.

• Normalization: divide values by the row's standard deviation.

• Use EM, where the parameters are $\Lambda$ and $\Psi$.

• Rank biclusters according to mutual information:

$$I(X; z_i^T \mid Z \setminus z_i^T) = \sum_{j=1}^{l} I(x_j; z_{ij} \mid \tilde{z}_j \setminus z_{ij})$$

• Determine the members of each bicluster using two thresholds, one for the values $\lambda_{ki}$ and one for $z_{ij}$.
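The preprocessing and the thresholding step might look like this. It is a sketch under assumptions: centering is done per gene (row), the sample standard deviation is used, and `members` is a hypothetical helper; the slides do not give the threshold values.

```python
import numpy as np

def preprocess(X):
    """Center each gene (row) to zero median, then divide by the row's std."""
    X = np.asarray(X, dtype=float)
    centered = X - np.median(X, axis=1, keepdims=True)
    std = centered.std(axis=1, ddof=1, keepdims=True)
    std[std == 0.0] = 1.0          # assumption: leave constant rows unscaled
    return centered / std

def members(values, threshold):
    """Indices whose absolute value exceeds a threshold (hypothetical helper)."""
    return np.flatnonzero(np.abs(values) > threshold)
```

Applying `members` once to a column of Λ and once to a row of Z yields the genes and samples of one bicluster.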

Experiments – Simulated Datasets

• n = 1000 genes, l = 100 samples.

• p = 10 multiplicative biclusters.

• Generate $\lambda_i$:
– Choose $N_i^{\lambda}$, the number of genes in bicluster i, uniformly at random from {10,…,210}.
– Choose $N_i^{\lambda}$ genes from {1,…,1000}.
– Set the $\lambda_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set the $\lambda_i$ components in bicluster i to $N(\pm 3, 1)$.

Experiments – Simulated Datasets

• Generate $z_i$:
– Choose $N_i^{z}$, the number of samples in bicluster i, uniformly at random from {5,…,25}.
– Choose $N_i^{z}$ samples from {1,…,100}.
– Set the $z_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set the $z_i$ components in bicluster i to $N(2, 1)$.

• Add random noise $\Upsilon$ to all entries according to $N(0, 3^2)$.

• Compute the dataset with

$$X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon$$
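The simulation recipe can be written down compactly. The following is a sketch, assuming NumPy; the function name is hypothetical, and drawing one random sign per bicluster for the ±3 mean is an assumption, since the slides do not say how the sign is chosen.

```python
import numpy as np

def simulate_multiplicative(n=1000, l=100, p=10, seed=0):
    """Return (X, truth): the noisy data matrix and the planted (genes, samples) sets."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 3.0, size=(n, l))            # additive noise N(0, 3^2)
    truth = []
    for _ in range(p):
        # Prototype lambda_i: background N(0, 0.2^2), members N(+-3, 1).
        n_genes = rng.integers(10, 211)              # uniform on {10,...,210}
        genes = rng.choice(n, size=n_genes, replace=False)
        lam = rng.normal(0.0, 0.2, size=n)
        sign = rng.choice([-1.0, 1.0])               # assumption: one sign per bicluster
        lam[genes] = rng.normal(sign * 3.0, 1.0, size=n_genes)
        # Factor z_i: background N(0, 0.2^2), members N(2, 1).
        n_samples = rng.integers(5, 26)              # uniform on {5,...,25}
        samples = rng.choice(l, size=n_samples, replace=False)
        z = rng.normal(0.0, 0.2, size=l)
        z[samples] = rng.normal(2.0, 1.0, size=n_samples)
        X += np.outer(lam, z)                        # add lambda_i z_i^T
        truth.append((np.sort(genes), np.sort(samples)))
    return X, truth
```

The returned `truth` list is what a consensus score would compare a method's output against.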

Evaluation – consensus score

• For two sets of biclusters:
– Compute the similarity between each pair of biclusters, one from each set.
– Find the maximum assignment using the Munkres (Hungarian) algorithm.
– Penalize different numbers of biclusters: divide the sum of similarities of the assigned biclusters by the number of biclusters in the larger set.

• Use the Jaccard index for computing similarity.
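The score can be sketched with SciPy's assignment solver. This is an illustration under assumptions: a bicluster is represented as a pair (genes, samples), and the Jaccard index is taken over the matrix cells it covers, which is one possible reading of the slide; the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(b1, b2):
    """Jaccard index of two biclusters given as (genes, samples) index collections."""
    cells1 = {(g, s) for g in b1[0] for s in b1[1]}
    cells2 = {(g, s) for g in b2[0] for s in b2[1]}
    union = len(cells1 | cells2)
    return len(cells1 & cells2) / union if union else 0.0

def consensus_score(set_a, set_b):
    """Sum of similarities of the best one-to-one matching, divided by the larger set size."""
    if not set_a or not set_b:
        return 0.0
    sim = np.array([[jaccard(a, b) for b in set_b] for a in set_a])
    rows, cols = linear_sum_assignment(sim, maximize=True)   # Hungarian-style matching
    return sim[rows, cols].sum() / max(len(set_a), len(set_b))
```

Dividing by the size of the larger set means that a method returning too many or too few biclusters cannot reach a score of 1 even if every matched pair agrees perfectly.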

Simulated Datasets - Results

• Average score and STD for each method:

Simulated Datasets - Results

• Average and STD of information content and similarity:

Simulated additive datasets

• Generate biclusters in the same way.

• Use an additive model for each bicluster:

$$\theta_{kij} = \mu_i + \alpha_{ki} + \beta_{ij}$$

• Choose $\alpha_{ki}$ from $N(0.5, 0.2^2)$ and $\beta_{ij}$ from $N(1, 0.5^2)$.

• Choose $\mu_i$ from one of three models:
– Low signal: $N(0, 2^2)$
– Moderate signal: $N(\pm 2, 0.5^2)$
– High signal: $N(\pm 4, 0.5^2)$

Additive Datasets - results

• Low signal:

Additive Datasets - results

• Moderate signal:

Additive Datasets - results

• High signal:

Gene Expression Datasets

• Breast cancer (van 't Veer et al., 2002): 3 classes (clusters) were found in Hoshida et al. (2007).

• Multiple tissue types dataset (Su et al., 2002).

• Diffuse large-B-cell lymphoma dataset (DLBCL) (Rosenwald et al., 2002): 3 classes (clusters) were found in Hoshida et al. (2007).

Gene Expression Datasets - results

Biological Interpretation

• Breast cancer:
– Bicluster 1 is related to cell cycle (GO and KEGG, $p \sim 10^{-9}$) and to the proteins CDC2 (division control) and KIF (mitosis).
– Bicluster 2 is related to immune response (GO, $p \sim 10^{-26}$) and cytokine-cytokine receptor interaction (KEGG, $p \sim 10^{-10}$), and to cytokine-related proteins such as CCR5, CCL4 and CSF2RB.

• Multiple tissue: no biological interpretation.

Biological Interpretation

• DLBCL:
– Bicluster 1 is related to the ribosome (GO $p \sim 10^{-6}$, KEGG $p \sim 10^{-8}$) and to B-cell receptor signaling (KEGG $p \sim 10^{-9}$).
– Bicluster 2 is related to the immune system (GO $p \sim 10^{-6}$, KEGG $p \sim 10^{-9}$).

Drug Design

• Goal: find compounds with similar effects on gene expression.

• Use Affymetrix GeneChip HT HG-U133+ PM array plates with 12 × 8 samples per plate.

• Selected compounds are active on a cancer cell line.

• Each compound was tested in a group of three replicates.

Drug Design

• 3 biclusters were found to have 2-5 replicate sets.

• One of them extracted genes related to mitosis (GO $p \sim 10^{-13}$).

• The compounds of this bicluster are now under investigation by Johnson & Johnson Pharmaceutical R&D.

Biclustering Gene Expression Time Series

Sara C Madeira, Technical University of Lisbon

Introduction

• Input: columns correspond to samples taken at consecutive time points.

• Output: biclusters with contiguous columns.

• Motivation: biological processes start and end within a contiguous time interval, leading to increased or decreased activity of some genes.

• Goal: find all maximal contiguous column coherent (CCC) biclusters, sorted by a statistical score.

Discretization

• Let $A'_{n \times m}$ be the input expression matrix.

• Define

$$A''_{ij} = \begin{cases} \dfrac{A'_{i(j+1)} - A'_{ij}}{|A'_{ij}|}, & \text{if } A'_{ij} \neq 0, \\ 1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} > 0, \\ -1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} < 0, \\ 0, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} = 0. \end{cases}$$

• Standardize A' to mean = 0 and STD = 1 by gene.

Discretization

• Define

$$A_{ij} = \begin{cases} D, & \text{if } A''_{ij} \le -t, \\ U, & \text{if } A''_{ij} \ge t, \\ N, & \text{otherwise.} \end{cases}$$

• D stands for Down-regulation, U for Up-regulation and N for No-change.

• The threshold t = 1 is the standard deviation of a gene.
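The two discretization steps can be sketched in plain Python. This is an illustration under assumptions: each gene is standardized first, the population standard deviation is used, and the thresholds are read as "at most -t" for D and "at least t" for U; the function names are hypothetical.

```python
from statistics import mean, pstdev

def standardize(row):
    """Scale one gene's profile to mean 0 and standard deviation 1."""
    m, s = mean(row), pstdev(row)
    return [(x - m) / s if s else 0.0 for x in row]

def variation(row):
    """Relative change between consecutive time points (one value fewer than the input)."""
    out = []
    for cur, nxt in zip(row, row[1:]):
        if cur != 0:
            out.append((nxt - cur) / abs(cur))
        else:
            out.append(1.0 if nxt > 0 else -1.0 if nxt < 0 else 0.0)
    return out

def discretize(row, t=1.0):
    """Map a gene's time series to a string over the symbols D / U / N."""
    return "".join("D" if v <= -t else "U" if v >= t else "N"
                   for v in variation(standardize(row)))

print(discretize([1, 2, 3, 4]))
```

A series with m time points produces m - 1 symbols, since each symbol describes the transition between two consecutive time points.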

CCC-Bicluster

• Definition: A CCC-Bicluster $A_{IJ}$ is a subset of rows $I = \{i_1, \ldots, i_k\}$ and a contiguous subset of columns $J = \{r, r+1, \ldots, s-1, s\}$ such that $A_{ij} = A_{kj}$ for all rows $i, k \in I$ and columns $j \in J$.

• Note that each CCC-Bicluster defines a string S which is common to every row in I.
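The definition translates into a short check. The sketch below assumes the discretized matrix is stored as a list of strings over D/U/N with 0-based indices; the function name and the example matrix are made up for illustration.

```python
def is_ccc_bicluster(A, rows, first_col, last_col):
    """True if every row in `rows` shows the same symbols in columns first_col..last_col."""
    if not rows or first_col > last_col:
        return False
    patterns = {tuple(A[i][first_col:last_col + 1]) for i in rows}
    return len(patterns) == 1

A = ["UDUN",
     "NDUN",
     "UDUD"]
print(is_ccc_bicluster(A, [0, 1, 2], 1, 2))   # all rows read "DU" in columns 1-2
print(is_ccc_bicluster(A, [0, 1], 0, 2))      # rows 0 and 1 differ in column 0
```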

Suffix Trees

1. Each internal node, other than the root, has at least two children.

2. Each edge is labeled with a nonempty substring of S (here S = "BANANA").

3. No two edges out of a node have edge labels starting with the same symbol.

4. The concatenated edge labels on the path from the root to a leaf spell a suffix of S.
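These properties can be explored with a naive, uncompressed suffix trie. This is a sketch, not a linear-time suffix tree construction: it uses nested dictionaries, takes quadratic space, and appends a terminator "$" (an assumption, not shown on the slide) so that every suffix ends at a leaf. The function names are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a trie of nested dicts."""
    s += "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def leaf_paths(node, prefix=""):
    """Strings spelled on the paths from the root to each leaf."""
    if not node:
        return [prefix]
    out = []
    for ch, child in sorted(node.items()):
        out.extend(leaf_paths(child, prefix + ch))
    return out

def branching_nodes(node, prefix=""):
    """Path labels of nodes with two or more children (internal nodes of the suffix tree)."""
    found = [prefix] if len(node) >= 2 else []
    for ch, child in sorted(node.items()):
        found.extend(branching_nodes(child, prefix + ch))
    return found

trie = build_suffix_trie("BANANA")
print(leaf_paths(trie))
print(branching_nodes(trie))
```

Collapsing every chain of single-child nodes into one edge turns this trie into the suffix tree; the branching nodes are exactly the internal nodes that survive the compression.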

Example

Internal node = row-maximal, right-maximal CCC-Bicluster

Main Result

• Every (inclusion) maximal CCC-Bicluster with at least two rows corresponds to an internal node in the suffix tree such that:
– It does not have incoming suffix links, or,
– It has incoming suffix links only from nodes having fewer leaves in their subtrees.

• Each such internal node defines a maximal CCC-Bicluster with at least two rows.

• This implies an O(nm) time algorithm for finding all maximal CCC-Biclusters.
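For small matrices, the maximal CCC-Biclusters can also be enumerated by brute force, which is useful for checking a suffix-tree implementation. The sketch below is not the O(nm) algorithm from the talk: it tries every contiguous column range, groups rows by their pattern, and keeps a group only if it cannot be extended by one column on either side. The function name and the example matrix are hypothetical.

```python
from collections import defaultdict

def maximal_ccc_biclusters(A, min_rows=2):
    """Brute-force enumeration; A is a list of equal-length symbol strings."""
    n_cols = len(A[0]) if A else 0
    result = []
    for first in range(n_cols):
        for last in range(first, n_cols):
            groups = defaultdict(list)          # pattern -> all rows showing it
            for i, row in enumerate(A):
                groups[row[first:last + 1]].append(i)
            for pattern, rows in groups.items():
                if len(rows) < min_rows:
                    continue
                # Not maximal if all rows also agree on the next column to the left/right.
                left = first > 0 and len({A[i][first - 1] for i in rows}) == 1
                right = last < n_cols - 1 and len({A[i][last + 1] for i in rows}) == 1
                if not (left or right):
                    result.append((rows, first, last, pattern))
    return result

A = ["UDUN",
     "NDUN",
     "UDUD"]
for rows, first, last, pattern in maximal_ccc_biclusters(A):
    print(rows, first, last, pattern)
```

Grouping all rows that share a pattern makes each candidate row-maximal by construction, so only the column extensions need to be tested.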

Experiments – Simulated Datasets

• Generate a random 1000 x 50 dataset.

• Apply the algorithm to it.

• Plant 10 CCC-Biclusters in the same dataset.

• Apply the algorithm again.

• Define a similarity measure, the Jaccard index (over genes and conditions), and a statistical test.

• Filter out similar biclusters and those that did not pass the statistical test.

The Statistical Test

• Null hypothesis: the expression values of a subset of genes evolve independently.

• Expression patterns are modeled by a first-order Markov chain, e.g. for the pattern $p_{B_2}$ (U at time point 2, D at time point 3, U at time point 4):

$$\Pr(p_{B_2}) = \Pr(U_2 D_3 U_4) = \Pr(U_2)\,\Pr(D_3 \mid U_2)\,\Pr(U_4 \mid D_3)$$

where

$$\Pr(U_2) = \frac{\#U_2}{n}, \qquad \Pr(D_3 \mid U_2) = \frac{\Pr(U_2 D_3)}{\Pr(U_2)} = \frac{\#U_2 D_3}{\#U_2}, \qquad \Pr(U_4 \mid D_3) = \frac{\Pr(D_3 U_4)}{\Pr(D_3)} = \frac{\#D_3 U_4}{\#D_3}$$

and # counts the genes that show the given symbols at the given time points.

The Statistical Test

• n – the number of genes in the dataset.

• I – the subset of genes in a CCC-Bicluster.

• The significance of a CCC-Bicluster B with an expression pattern $p_B$ is:

$$\mathrm{pval}(B) = \sum_{j=|I|-1}^{n-1} \binom{n-1}{j} \Pr(p_B)^{j} \left(1 - \Pr(p_B)\right)^{n-1-j}$$
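Both parts of the test, the Markov-chain pattern probability and the binomial tail, fit in a few lines. This is a sketch under assumptions: the discretized matrix is a list of strings with 0-based columns, probabilities are estimated from symbol counts over all genes, and the tail runs over the other n - 1 genes; the function names are hypothetical.

```python
from math import comb

def pattern_probability(A, pattern, first_col):
    """First-order Markov estimate of Pr(pattern) starting at column first_col."""
    n = len(A)
    prob = sum(row[first_col] == pattern[0] for row in A) / n
    for offset in range(1, len(pattern)):
        col = first_col + offset
        prev_count = sum(row[col - 1] == pattern[offset - 1] for row in A)
        pair_count = sum(row[col - 1] == pattern[offset - 1] and row[col] == pattern[offset]
                         for row in A)
        prob *= pair_count / prev_count if prev_count else 0.0
    return prob

def ccc_pvalue(prob, n_genes, n_rows):
    """Binomial tail: at least n_rows - 1 of the other n_genes - 1 genes show the pattern."""
    m = n_genes - 1
    return sum(comb(m, j) * prob**j * (1 - prob)**(m - j)
               for j in range(n_rows - 1, m + 1))
```

A small p-value means that so many genes sharing the pattern would be unlikely if genes evolved independently; when many biclusters are tested, each p-value still has to be corrected for multiple testing.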

Simulated Datasets - results

• 165 CCC-Biclusters passed the test at the 1 percent level, after Bonferroni correction.

Experiments – Real Datasets

• Use the yeast heat shock response dataset from Gasch et al.

• 25 CCC-Biclusters were found to be highly significant at the 1% level after Bonferroni correction.

• 9 of them were removed after the similarity check.

• Test the results for GO enrichment (hypergeometric test).

Real Datasets - results

Up-regulated CCC-Biclusters

Down-regulated CCC-Biclusters

Improvements

• Allow errors: replacement of D/U with N and vice versa.

• Discover biclusters with opposite patterns (anti-correlated).

• Allow scaled and time-lagged (shifted) patterns.

• TriClustering: genes x time points x exemplars (different patients/stress conditions).

Other talks

• "biclust" R package – Ludwig Maximilian University of Munich (Institute of Statistics) and Hasselt University.

• ISA and related tools (R packages) – Gabor Csardi, University of Lausanne, Switzerland.

• Clustering of dose-response microarray data – Hasselt University, Johnson & Johnson PR&D.

• Model- and graph-based clustering of genomic data – Freiburg Institute for Advanced Studies, Germany.

Questions?