Projected clustering algorithms and their application on genomic data analysis

Probation Talk
M.Phil. candidate: Kevin Yip
Supervisor: Dr. D. Cheung

20th Dec 2002



Presentation Outline

• The research problem: clustering high-dimensional datasets.
• Recent work in the community: projected (subspace) clustering.
• My previous work: HARP and HARP.1.
• My current and future work: studying genomic data, comparing different algorithms, designing similarity measures, etc.


Curse of Dimensionality

• In a dataset of high dimensionality, some attributes may not help identify the characteristics of data points.
• If the number of such attributes is high, the distance from a point to its nearest neighbor can be almost as large as the distance to its farthest neighbor.
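The effect can be illustrated with a small simulation. This is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube and Euclidean distance; the function name `nearest_to_farthest_ratio`, the seed and the sample sizes are illustrative choices, not part of the talk.

```python
import math
import random


def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Return d_nearest / d_farthest for one random query point.

    All points are drawn uniformly from the unit hypercube (an assumption
    made only for this illustration).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [math.dist(query, p) for p in points]
    return min(distances) / max(distances)


if __name__ == "__main__":
    for dims in (2, 10, 100, 1000):
        ratio = nearest_to_farthest_ratio(n_points=200, n_dims=dims)
        print(f"{dims:>4} dimensions: nearest/farthest = {ratio:.2f}")
```

A ratio close to 1 means the nearest and the farthest neighbor are almost equally far away, so distance alone no longer separates similar points from dissimilar ones.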


Curse of Dimensionality

Resolution: feature selection

             A1    A2    A3    A4    A5    A6    A7    A8    A9    A10
R1            3     8     5     7     4     8     6    10     0     1
R2            4     7     4     7     1     1     9     1     6     7
R3            9     1     1     1     5     3     4     6     2     9
D(R1, R2)          1.4   1.7   1.7   3.5   7.8   8.4  12.3  13.7  14.9
D(R2, R3)          7.8   8.4  10.3  11.0  11.2  12.3  13.3  13.9  14.0
Difference         6.4   6.6   8.6   7.6   3.4   3.9   1.0   0.2  -0.9
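The distance rows of the table are consistent with Euclidean distances computed over attributes A1 up to the attribute of each column. The sketch below recomputes them under that reading; the helper name `prefix_distances` is an illustrative choice.

```python
import math

# Attribute values A1..A10 for the three records in the table.
R1 = [3, 8, 5, 7, 4, 8, 6, 10, 0, 1]
R2 = [4, 7, 4, 7, 1, 1, 9, 1, 6, 7]
R3 = [9, 1, 1, 1, 5, 3, 4, 6, 2, 9]


def prefix_distances(a, b):
    """Euclidean distance between a and b over the first k attributes, k = 1..len(a)."""
    total = 0.0
    result = []
    for x, y in zip(a, b):
        total += (x - y) ** 2
        result.append(math.sqrt(total))
    return result


d12 = prefix_distances(R1, R2)
d23 = prefix_distances(R2, R3)

# Start at index 1 (attributes A1..A2); the table gives no value for A1 alone.
for k in range(1, len(R1)):
    print(f"A1..A{k + 1}: D(R1,R2)={d12[k]:.1f}  D(R2,R3)={d23[k]:.1f}  "
          f"difference={d23[k] - d12[k]:.1f}")
```

With only A1 to A4, R1 and R2 are clearly closer to each other than R2 and R3; once all ten attributes are included the gap disappears, which is the motivation for selecting attributes.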


Projected Clustering

What feature selection cannot do: find different relevant attributes for different clusters.

[Figure: projection on x1 and x2, with axes x1 and x2 and clusters C1, C2 and C3]


Projected Clustering

• Clustering: identify similar objects and form clusters.
• Projected clustering: identify similar objects and the relevant attributes, and form clusters.
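The idea of per-cluster relevant attributes can be sketched on synthetic data. This is not any of the algorithms discussed in this talk: the cluster memberships are given in advance, and the variance-ratio test, its 0.1 threshold and all names below are assumptions made only for illustration.

```python
import random
import statistics

rng = random.Random(0)
N_ATTRS = 4


def make_cluster(centers, size=50):
    """Points tight around centers[j] on relevant attributes, and spread
    uniformly over [0, 100] on attributes marked None."""
    return [
        [rng.gauss(c, 0.5) if c is not None else rng.uniform(0, 100) for c in centers]
        for _ in range(size)
    ]


# Two hypothetical clusters whose relevant attributes differ.
cluster_a = make_cluster([20, 70, None, None])  # relevant: attributes 0 and 1
cluster_b = make_cluster([None, None, 40, 90])  # relevant: attributes 2 and 3
dataset = cluster_a + cluster_b


def relevant_attributes(cluster, dataset, threshold=0.1):
    """Attributes whose within-cluster variance is far below the global variance.

    The threshold is an arbitrary illustrative value.
    """
    relevant = []
    for j in range(N_ATTRS):
        local = statistics.pvariance([p[j] for p in cluster])
        overall = statistics.pvariance([p[j] for p in dataset])
        if local / overall < threshold:
            relevant.append(j)
    return relevant


print(relevant_attributes(cluster_a, dataset))
print(relevant_attributes(cluster_b, dataset))
```

No single subset of attributes suits both clusters here, which is why one global feature selection step is not enough.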


Projected Clustering

Existing approaches:
• Grid-based dimension selection
• Association rule hypergraph partitioning
• Context-specific Bayesian clustering
• Monte Carlo algorithm
• Projective clustering

All produce projected clusters successfully in different ways.


Projected Clustering

Some problems with the approaches:
• Accuracy depends on hard-to-determine input parameters.
• Unable to determine the dimensionality of each cluster automatically.
• Can only use density as the similarity measure.

In addition, the algorithms have been tested on few real datasets.


HARP and HARP.1

HARP (A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering):

• A hierarchical clustering framework with no pre-assumed similarity measure.
• Requires no hard-to-determine user inputs.
• Determines the relevant attributes of each cluster automatically.


HARP and HARP.1

HARP.1 - an implementation of HARP using attribute value density to define the similarity measure:

• Makes use of global statistics in attribute selection.
• Handles mutual disagreement and information loss of the potentially merging clusters.
• Accepts mixed types of attributes.


HARP and HARP.1

Some experimental results (synthetic data):

Best score: error% / outlier%
Average: error% / outlier%

Dataset   d    l    HARP.1     PROCLUS              Traditional           ORCLUS
SynCat1   20   12   0.0/5.0    3.6/1.4, 6.7/3.7     2.6/26.4, 5.8/5.3     N/A
SynMix1   20   12   0.4/6.8    2.2/17.0, 6.8/10.1   11.6/11.2, 7.9/4.6    N/A
SynNum1   20   12   0.8/5.0    1.8/21.4, 7.2/8.3    4.4/32.0, 5.9/9.2     0.4/23.8, 2.31/8.15

d: dimensionality of the dataset
l: average number of relevant attributes


HARP and HARP.1

Some experimental results (real data):

Best score: error% / outlier%
Average: error% / outlier%

Dataset    d    l*   HARP.1     PROCLUS              Traditional           ROCK
Soybean    35   26   0.0/0.0    0.0/0.0, 17.3/0.0    2.1/0.0, 9.2/0.0      No published result
Voting     16   11   6.4/13.6   2.1/55.6, 13.8/7.9   13.1/11.3, 13.1/1.9   6.2/14.5
Mushroom   22   15   1.4/0.0    3.2/0.0, 9.0/0.0     6.0/0.0, 5.2/0.0      0.4/0.0

d: dimensionality of the dataset
l*: average number of relevant attributes (determined by HARP.1)


Current and Future Work

A real application of projected clustering - analyzing genomic data:

• High dimensionality
• Noisy
• “Correct” partition not always available
• A lot of hidden information
• New data available with a high growth rate


Current and Future Work

[Figures: generalized animal cell; DNA with features]


Current and Future Work

Codon usage data:
Study the relationship between the frequencies of different codons and the functions of the genes.

gene \ codon   AAA        AAC        AAG        AAT        ACA        …   TTT
thrL           0.05       0.05       0          0          0.05           0
thrA           0.026862   0.019536   0.014652   0.026862   0.002442       0.013431
thrB           0.009709   0.019417   0.022654   0.019417   0.006472       0.019417
thrC           0.039813   0.025761   0.01171    0.018735   0.004684       0.030445
b0005          0.051546   0          0.020619   0.020619   0              0
yaaA           0.042802   0.015564   0.027237   0.035019   0              0.031128
yaaJ           0.010526   0.021053   0.002105   0.012632   0.004211       0.052632
talB           0.056962   0.028481   0.012658   0.006329   0.003165       0.006329
mog            0.015464   0.010309   0.015464   0.005155   0.005155       0.020619
yaaH           0.02139    0.037433   0.005348   0.010695   0              0.037433
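A row of such a table can be computed from a coding sequence roughly as follows. This is a sketch under the assumption that each entry is the codon count divided by the total number of codons in the gene; the talk does not state the exact normalisation. The function name and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 codons in the alphabetical order used by the table header (AAA ... TTT).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequence):
    """Relative frequency of each codon, reading in frame from position 0."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    counts = Counter(sequence[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no complete codon")
    return {codon: counts[codon] / total for codon in CODONS}


# Hypothetical toy sequence (not a real gene): AAA AAC AAA TTT.
freqs = codon_frequencies("AAAAACAAATTT")
print(freqs["AAA"], freqs["AAC"], freqs["TTT"])
```

Each gene then becomes a 64-dimensional vector, which is the kind of high-dimensional input that projected clustering targets.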


Current and Future Work

Transcriptome data:
• The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time.
• Study the expression of the genes
  • In different samples
  • Under different situations
  • At different times


Current and Future Work

Transcriptome data:

[Figures: GeneChip arrays for gene expression analysis; an illuminated microarray]


Current and Future Work

Transcriptome data:

[Figure: panels with axis ticks 1 to 7 and 51 to 57; scale 2000 to 0]

gene \ cond.   A        B         C        D        E
aas A          615.83   944.54    414.56   371.06   367.64
aas B          640.78   1000.51   306.28   369.15   454.63
aat A          98.54    121.28    120.35   170.51   87.47
aat B          142.77   170.77    120.43   183.51   88.91
abc A          109.13   313.34    134.02   140.78   209.37
abc B          137.72   319.54    180.75   162.52   187.17
accA A         199.84   306.95    154.25   155.28   200.05
accA B         154.78   293.73    144.74   113.68   132.27
accB A         429.75   635.05    209.85   141.94   208.93
accB B         443.64   702.86    218.05   140.42   235.06
accC A         212.91   328.21    125.01   136.75   263.32
accC B         186.16   341.43    139.27   169.31   223.76
accD A         958.71   1788.82   849.15   696.13   82.06
accD B         917.65   1649.83   774.74   656.01   552.33


Current and Future Work

A sample clustering result:

Operon 12_1: trs5_7 0, yefJ 4, wbbK 4, wbbJ 4, wbbI 4, wbbH 4, glf 4, rfbX 4, rfbC 4, rfbA 4, rfbD 4, rfbB 4

Operon 12_2: hyfA 0, hyfB 0, hyfC 0, hyfD 0, hyfE 6, hyfF 0, hyfG 0, hyfH 0, hyfI 0, b2490 9, hyfR 0, focB 0

Operon 12_3: rpmJ 6, prlA 0, rplO 2, rpmD 2, rpsE 2, rplR 2, rplF 2, rpsH 2, rpsN 2, rplE 2, rplX 2, rplN 2
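One way to read this result is to ask how many genes of each operon share the same label. The sketch below assumes that the number after each gene is the ID of the cluster it was assigned to, which the slide does not state; the helper `majority_share` is a hypothetical name.

```python
from collections import Counter

# Labels copied from the sample result; treating the number as a cluster ID
# is an assumption made for this illustration.
operons = {
    "12_1": {"trs5_7": 0, "yefJ": 4, "wbbK": 4, "wbbJ": 4, "wbbI": 4, "wbbH": 4,
             "glf": 4, "rfbX": 4, "rfbC": 4, "rfbA": 4, "rfbD": 4, "rfbB": 4},
    "12_2": {"hyfA": 0, "hyfB": 0, "hyfC": 0, "hyfD": 0, "hyfE": 6, "hyfF": 0,
             "hyfG": 0, "hyfH": 0, "hyfI": 0, "b2490": 9, "hyfR": 0, "focB": 0},
    "12_3": {"rpmJ": 6, "prlA": 0, "rplO": 2, "rpmD": 2, "rpsE": 2, "rplR": 2,
             "rplF": 2, "rpsH": 2, "rpsN": 2, "rplE": 2, "rplX": 2, "rplN": 2},
}


def majority_share(labels):
    """Return (most common label, fraction of genes carrying it)."""
    label, count = Counter(labels.values()).most_common(1)[0]
    return label, count / len(labels)


summary = {name: majority_share(genes) for name, genes in operons.items()}
for name, (label, share) in summary.items():
    print(f"Operon {name}: majority label {label}, share {share:.2f}")
```

Under that assumption, most genes of each operon fall into a single cluster.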


Current and Future Work

Goal: knowledge clustering method
• Subspace vs. non-subspace
• Similarity: density, correlation, pattern, etc.
• Usability: sensitivity to input parameters
• Algorithm type: hierarchical, partitional, graph-based, model-based, etc.
• Preprocessing: logarithm, mean-centering, PCA, FCA, ICA, etc.
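Two of the listed options, logarithm plus mean-centering as preprocessing and correlation as a similarity measure, can be sketched on two rows of the transcriptome table shown earlier (aas A and accB A, conditions A to E). The choice of log base 2 and the function names are assumptions for illustration only.

```python
import math

# Expression values for conditions A to E, taken from the transcriptome table.
aas_a = [615.83, 944.54, 414.56, 371.06, 367.64]
accb_a = [429.75, 635.05, 209.85, 141.94, 208.93]


def log_mean_center(row):
    """Take log2 of each value, then subtract the row mean."""
    logged = [math.log2(v) for v in row]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = log_mean_center(aas_a)
y = log_mean_center(accb_a)
print(f"Euclidean distance (raw values): {math.dist(aas_a, accb_a):.1f}")
print(f"Pearson correlation (log, centered): {pearson(x, y):.2f}")
```

The two genes differ considerably in absolute expression level yet follow a similar pattern across conditions, so a distance-based and a correlation-based measure can disagree about how similar they are.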


Current and Future Work

Other subtasks:
• Internal validation for projected clustering
• Designing classifiers based on current results
• Building an integrated system for genomic data analysis
• …


Conclusion

• Research space: both theoretical and practical.
• Algorithms:
  • Projected clustering algorithms still have much room for improvement.
  • Many open problems in high-dimensional data analysis.
• Genomic data:
  • A lot to discover.
  • CS people can really help.


References

• References for projected clustering can be found in the slides of the presentations "HARP: A Hierarchical Approach with Automatic Relevant Attribute Selection for Projected Clustering" and "The Subspace Clustering Problem".
• Reference web sites for the bioinformatics materials covered in this presentation:
  • Human Genome Project Information
  • HKU-Pasteur Research Centre
  • Companion Web Site, Concepts of Genetics (6th Ed.)


Source of Figures

• Projected clusters (p. 5): C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering. In ACM SIGMOD International Conference on Management of Data, 2002.
• Generalized animal cell, genetic information flow (p. 14, 15): William S. Klug and Michael R. Cummings. Concepts of Genetics, Sixth Edition (p. 18, 350).
• DNA with features (p. 14): Human Genome Project Information, http://www.ornl.gov/hgmis/.


Source of Figures

• GeneChip® Arrays for Gene Expression Analysis (p. 17): Affymetrix, http://www.affymetrix.com/.
• An illuminated microarray (p. 17): EMBL-EBI, http://www.ebi.ac.uk/.


Thank You!