Biclustering of Expression Data
by Yizong Cheng and George M. Church

Presented by Bojun Yan, March 25, 2004

Outline

1. MicroArray and its related research

1.1 MicroArray Gene Expression Data

1.2 Main research about MicroArray

2. Why Bicluster?

2.1 Preceding research and its faults

2.2 The concept of Bicluster

2.3 Similarity measure

3. The hardness of biclustering

4. Methods proposed by this paper

4.1 Related work and the paper’s goal

4.2 Definition of mean squared residue score

4.3 Scores of some special matrices

4.4 Some theorems deduced by the authors

4.5 Algorithms proposed by this paper

5. Experiment

5.1 Data preparation

5.2 Determining Algorithm Parameters

5.3 Final Algorithm

5.4 Results and Display

1. MicroArray and its related research

1.1 MicroArray gene expression data:

Generated by DNA chips and other microarray techniques. Rows: genes. Columns: conditions or samples.

1.2 Main research about MicroArray

(1) Gene clustering: find genes that have similar functions

(2) Condition clustering: helpful for case analysis

(3) Classification: tumor classification, cancer prediction

(4) Gene selection: find the genes related to a disease

(5) Gene network: explore the regulatory interactions between genes

1.3 Paper target: biclustering

2. Why Bicluster?

2.1 Preceding research and its faults

• Goal: Discover the regulatory patterns or condition similarities

• Methods: Based on Euclidean distance or the dot product between the vectors (equally weighted)

(1) Group genes (row)

(2) Group conditions (column)

• Result: Partition the genes or conditions into mutually exclusive groups or hierarchies

• Faults: discovering some similarity groups comes at the expense of obscuring others

2.2 The concept of Bicluster

Clustering the genes (rows) and conditions (columns) simultaneously, a form of subspace clustering

2.3 Similarity Measure

(1) Distance metrics, such as Minkowski distances

(2) Cosine measure

(3) Pearson correlation

(4) Extended Jaccard similarity

(5) Mean squared residue (proposed by this paper)

+ A measure of the coherence of the genes and conditions in the bicluster

+ A symmetric function of the genes and conditions

+ Groups genes and conditions simultaneously

3. Hardness of biclustering

• The problem of finding a maximum bicluster with a score lower than a threshold includes, as a special case, the problem of finding a maximum biclique in a bipartite graph

• Finding the largest constant square submatrix is proven to be NP-hard (Johnson, 1987)

• The problem of finding a minimum set of biclusters, either mutually exclusive or overlapping, to cover all the elements in a data matrix has been shown to be NP-hard (Orlin, 1977)

4. Methods proposed by this paper

4.1 Related work and the paper’s goal

(1) Related work

• Divisive algorithms that partition data into sets with approximately constant values were proposed by Morgan and Sonquist (1963) and Hartigan (1972)

• Hartigan mentioned that the criterion for partitioning may be a two-way analysis of variance model, which is similar to the mean squared residue scoring proposed in this article

• Mirkin (1996) presents a node addition algorithm

• The term “biclustering” was used by Mirkin (1996) to mean simultaneous clustering of both the row and column sets in a data matrix

• The terms “direct clustering” (Hartigan, 1972) and “box clustering” (Mirkin, 1996) have the same meaning

(2) The paper’s goal and criterion

• Goal: find a set of genes showing strikingly similar up-regulation and down-regulation under a set of conditions

• Criterion: a low mean squared residue score plus a large variation from the constant identifies these genes and conditions

• Overlapping: biclusters should be allowed to overlap in expression data analysis

4.2 Definition of mean squared residue score

The row variance:

It is an accompanying score used to reject trivial or constant biclusters.
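
As a minimal sketch of the score named in this section, the Python below computes the mean squared residue H(I,J) of a submatrix as the mean of the squared residues aij - aiJ - aIj + aIJ, where aiJ, aIj and aIJ are the row means, column means and overall mean of the submatrix. This is the usual statement of the Cheng and Church score; the function names and the list-of-rows representation are assumptions made for illustration. The row variance is taken here to be the mean squared deviation of each element from its row mean, which is one plausible reading of the accompanying score and is also an assumption.

```python
def mean_squared_residue(a, rows, cols):
    """H(I,J) for the submatrix of a selected by the index lists rows and cols."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(a[i][j] for j in cols) / m for i in rows}   # a_iJ
    col_mean = {j: sum(a[i][j] for i in rows) / n for j in cols}   # a_Ij
    all_mean = sum(row_mean.values()) / n                          # a_IJ
    total = 0.0
    for i in rows:
        for j in cols:
            residue = a[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue * residue
    return total / (n * m)


def row_variance(a, rows, cols):
    """Assumed form: mean squared deviation of each element from its row mean."""
    m = len(cols)
    total = 0.0
    for i in rows:
        mean_i = sum(a[i][j] for j in cols) / m
        total += sum((a[i][j] - mean_i) ** 2 for j in cols)
    return total / (len(rows) * m)


# Hypothetical example: rows rise in unison, so the residue score is zero,
# while the row variance is positive and marks the bicluster as non-constant.
example = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
example_score = mean_squared_residue(example, [0, 1, 2], [0, 1, 2])
example_row_variance = row_variance(example, [0, 1, 2], [0, 1, 2])
```

A constant submatrix would give zero for both values, which is why the two scores are read together.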

4.3 Scores of some special matrices

• A special case of a perfect score (a zero mean squared residue score) is a constant bicluster whose elements all have a single value

• For the matrix aij=ij, i,j>0, no submatrix of a size larger than a single cell has a score lower than 0.5

• A K×K matrix of all 0s except one 1 has the score (K-1)^2 / K^4

• A matrix with elements randomly and uniformly generated in the range [a,b] has an expected score of (b-a)^2 / 12. For example, for the range [0,800] the expected score is about 53,333.
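
The last two bullets can be checked numerically. The sketch below is illustrative only: the helper names, the random seed and the 200×200 matrix size are assumptions. It scores a K×K matrix with a single 1 against the closed form (K-1)^2 / K^4, and scores one uniformly random matrix on [0,800] against (b-a)^2 / 12.

```python
import random


def mean_squared_residue(a):
    """H(I,J) of the whole matrix a, given as a list of equal-length rows."""
    n, m = len(a), len(a[0])
    row_mean = [sum(row) / m for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum((a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)


def single_one_matrix(k):
    """K x K matrix of zeros with a single 1 in the top-left corner."""
    a = [[0.0] * k for _ in range(k)]
    a[0][0] = 1.0
    return a


# Pairs of (computed score, closed form (K-1)^2 / K^4) for K = 2..6.
single_one_scores = {
    k: (mean_squared_residue(single_one_matrix(k)), (k - 1) ** 2 / k ** 4)
    for k in range(2, 7)
}

# One 200 x 200 uniform sample on [0, 800] next to (b-a)^2 / 12.
rng = random.Random(0)
uniform = [[rng.uniform(0, 800) for _ in range(200)] for _ in range(200)]
uniform_score = mean_squared_residue(uniform)
uniform_expected = 800 ** 2 / 12
```

For a finite n×m sample the score tends to sit slightly below (b-a)^2 / 12, by a factor of about (n-1)(m-1)/(nm), because the row and column means are estimated from the same data.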

1 1 1
1 1 1      H(I,J) = 0
1 1 1

1 1 1
4 4 4      H(I,J) = 0
7 7 7

1 2 3
4 5 6      H(I,J) = 0
7 8 9
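
The three example matrices can be scored directly. In the sketch below (the helper name is an assumption), each matrix can be written as a row offset plus a column offset, so every residue vanishes and the score is zero.

```python
def mean_squared_residue(a):
    """H(I,J) of the whole matrix a, given as a list of equal-length rows."""
    n, m = len(a), len(a[0])
    row_mean = [sum(row) / m for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum((a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)


constant = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # one value everywhere
constant_rows = [[1, 1, 1], [4, 4, 4], [7, 7, 7]]   # each row constant
additive = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]        # a_ij = 3(i-1) + j

zero_scores = [mean_squared_residue(mat)
               for mat in (constant, constant_rows, additive)]
```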

• Some characteristics of the mean squared residue score

(1) Adding a constant to every element of the matrix does not affect the H(I,J) score

(2) Multiplying every element by a constant scales the score by the square of the constant

(3) Neither operation affects the ranking of the biclusters in a matrix
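
A small numerical illustration of properties (1) and (2), as a sketch: the helper name, the example matrix, the shift of 10 and the factor of 3 are arbitrary choices made for this example.

```python
def mean_squared_residue(a):
    """H(I,J) of the whole matrix a, given as a list of equal-length rows."""
    n, m = len(a), len(a[0])
    row_mean = [sum(row) / m for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum((a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)


matrix = [[1, 0, 3], [2, 5, 1], [4, 4, 0]]
shifted = [[x + 10 for x in row] for row in matrix]   # add a constant
scaled = [[3 * x for x in row] for row in matrix]     # multiply by a constant

base_score = mean_squared_residue(matrix)
shifted_score = mean_squared_residue(shifted)   # expected to equal base_score
scaled_score = mean_squared_residue(scaled)     # expected to equal 9 * base_score
```

Because a shift leaves every score unchanged and a scaling multiplies every score by the same positive factor, the ordering of biclusters by score is preserved, which is property (3).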

4.4 Theorems deduced by authors

Comments on Algorithm 0:

• Algorithm 0, although a polynomial-time one, will not be efficient enough for a quick analysis of most expression data matrices.

• The complexity of Algorithm 0 is O((n+m)nm)

Comments on Algorithm 1:

• In each iteration, a complete recalculation of step 1 and step 2 is needed

• The time complexity of Algorithm 1 is O(nm)

• More efficient than Algorithm 0, but still not the best
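
The following is a minimal sketch of a single node deletion loop in the spirit of Algorithm 1, not a transcription of the slide. It assumes the greedy rule of recomputing all means and the score on every pass, then removing whichever row or column has the largest mean squared residue, until the score drops to δ or below. The function names, the stopping guard for a single remaining row or column, and the example matrix are assumptions.

```python
def residue_stats(a, rows, cols):
    """Return H(I,J), per-row scores d(i) and per-column scores d(j)."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(a[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / n for j in cols}
    all_mean = sum(row_mean.values()) / n
    sq = {(i, j): (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (n * m)
    d_row = {i: sum(sq[i, j] for j in cols) / m for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / n for j in cols}
    return h, d_row, d_col


def single_node_deletion(a, delta):
    """Greedily delete one row or column per pass until H(I,J) <= delta."""
    rows = list(range(len(a)))
    cols = list(range(len(a[0])))
    while True:
        # Full recalculation on every pass, as the comments above note.
        h, d_row, d_col = residue_stats(a, rows, cols)
        if h <= delta or len(rows) <= 1 or len(cols) <= 1:
            return rows, cols
        worst_row = max(rows, key=d_row.get)
        worst_col = max(cols, key=d_col.get)
        if d_row[worst_row] >= d_col[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Hypothetical data: three coherent rows and one row that does not fit.
data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [9, 0, 7, 1]]
found_rows, found_cols = single_node_deletion(data, delta=0.01)
```

On this example the incoherent last row has the largest score and is removed first, after which the remaining submatrix has a score of zero.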

Comments on Algorithm 2:

• The parameter α>1 needs to be selected properly

• The score is not updated after the removal of each node

• The time complexity of Algorithm 2 is O(log n + log m)

• Some large δ-biclusters may be missed
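
A sketch of a multiple node deletion pass consistent with the comments above, under stated assumptions: all rows whose mean squared residue exceeds α·H(I,J) are removed in one batch, the score is recomputed, and then the same batch test is applied to columns. The score is deliberately not updated between individual removals. When a pass removes nothing, this sketch simply returns, leaving any fallback to the caller; the function names and example data are hypothetical.

```python
def residue_stats(a, rows, cols):
    """Return H(I,J), per-row scores d(i) and per-column scores d(j)."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(a[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / n for j in cols}
    all_mean = sum(row_mean.values()) / n
    sq = {(i, j): (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (n * m)
    d_row = {i: sum(sq[i, j] for j in cols) / m for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / n for j in cols}
    return h, d_row, d_col


def multiple_node_deletion(a, delta, alpha=1.2):
    """Batch-delete rows, then columns, whose score exceeds alpha * H(I,J)."""
    rows = list(range(len(a)))
    cols = list(range(len(a[0])))
    while True:
        h, d_row, _ = residue_stats(a, rows, cols)
        if h <= delta:
            return rows, cols
        kept_rows = [i for i in rows if d_row[i] <= alpha * h]
        removed = len(kept_rows) < len(rows)
        rows = kept_rows
        # Recompute once after the whole batch of rows, not after each row.
        h, _, d_col = residue_stats(a, rows, cols)
        kept_cols = [j for j in cols if d_col[j] <= alpha * h]
        removed = removed or len(kept_cols) < len(cols)
        cols = kept_cols
        if not removed:
            return rows, cols  # no progress; a caller could fall back here


data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [9, 0, 7, 1]]
batch_rows, batch_cols = multiple_node_deletion(data, delta=0.01, alpha=1.2)
```

Since the average of the row scores equals H(I,J), at least one row always passes the α·H test when α is at least 1, so a batch never empties the bicluster.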

Comments on Algorithm 3:

• The time complexity is O(mn)

• The resulting δ-bicluster may still not be maximal, for two reasons:

(1) Lemma 3 only gives a sufficient condition for adding rows and columns

(2) By adding rows and columns, the score may decrease to the point where it is much smaller than δ
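
A sketch of a node addition step in the spirit of Algorithm 3, under stated assumptions: a column outside the bicluster is added when its mean squared residue against the current row means and overall mean is at most H(I,J), the score is recomputed, and the same test is then applied to rows. Any handling of inverted rows is left out, and the function names and example data are hypothetical.

```python
def submatrix_means(a, rows, cols):
    """Return row means, column means, overall mean and H(I,J)."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    h = sum((a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i in rows for j in cols) / (len(rows) * len(cols))
    return row_mean, col_mean, all_mean, h


def node_addition(a, rows, cols):
    """Grow a bicluster by adding columns, then rows, that score at most H."""
    rows, cols = sorted(rows), sorted(cols)
    while True:
        row_mean, _, all_mean, h = submatrix_means(a, rows, cols)
        new_cols = []
        for j in range(len(a[0])):
            if j in cols:
                continue
            cm = sum(a[i][j] for i in rows) / len(rows)
            d = sum((a[i][j] - row_mean[i] - cm + all_mean) ** 2
                    for i in rows) / len(rows)
            if d <= h:
                new_cols.append(j)
        cols = sorted(cols + new_cols)
        # Recompute the means before testing rows against the grown column set.
        _, col_mean, all_mean, h = submatrix_means(a, rows, cols)
        new_rows = []
        for i in range(len(a)):
            if i in rows:
                continue
            rm = sum(a[i][j] for j in cols) / len(cols)
            d = sum((a[i][j] - rm - col_mean[j] + all_mean) ** 2
                    for j in cols) / len(cols)
            if d <= h:
                new_rows.append(i)
        rows = sorted(rows + new_rows)
        if not new_cols and not new_rows:
            return rows, cols


data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [9, 0, 7, 1]]
grown_rows, grown_cols = node_addition(data, rows=[0, 1], cols=[0, 1])
```

Starting from the 2×2 corner, the sketch adds the two remaining columns and the third row, and leaves out the last row because its residue exceeds the current score.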

5. Experiment

5.1 Data preparation

Datasets and parameters:

(1) Yeast data: δ = 300, n = 100

(2) Human data: δ = 1200, n = 100

Missing data replacement: missing values are replaced with random numbers drawn from a uniform distribution

The biclusters are compared to the clustering results from

(1) Tavazoie et al. (1999)

(2) Alizadeh et al. (2000)

5.2 Determining Algorithm Parameters

The parameters were chosen with reference to the clusters reported by Tavazoie et al. (1999) and Alizadeh et al. (2000)

For yeast data, δ = 300, α = 1.2

For human gene data, δ = 1200, α = 1.2

The number of biclusters is n = 100

Masking discovered biclusters: each time a bicluster is discovered, its elements are replaced by random numbers, because the algorithms are deterministic and would otherwise find the same bicluster again
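
The masking step can be sketched as below. The function name, the `low` and `high` bounds of the uniform range, the seed and the example data are all assumptions made for illustration; the slides do not state the range used.

```python
import random


def mask_bicluster(a, rows, cols, low, high, rng):
    """Overwrite a[i][j] in place, for i in rows and j in cols, with uniform noise."""
    for i in rows:
        for j in cols:
            a[i][j] = rng.uniform(low, high)


# Hypothetical data: rows 0 and 1 form a bicluster that has just been reported.
data = [[1.0, 2.0, 3.0],
        [2.0, 3.0, 4.0],
        [9.0, 0.0, 7.0]]
untouched_row = list(data[2])

rng = random.Random(0)   # fixed seed only so that the sketch is repeatable
mask_bicluster(data, rows=[0, 1], cols=[0, 1, 2], low=0.0, high=10.0, rng=rng)
```

After masking, a second run of a deterministic search on `data` starts from a matrix in which the reported bicluster no longer stands out, so it is pushed toward a different one.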

5.3 Final Algorithm

5.4 Results and Display

[Figures: biclusters for yeast data]

[Figures: biclusters for human data]