Projected clustering algorithms and their application on genomic data analysis

Probation Talk
M.Phil. candidate: Kevin Yip
Supervisor: Dr. D. Cheung

20th Dec 2002

2

Presentation Outline

The research problem: clustering high-dimensional datasets.
Recent work in the community: projected (subspace) clustering.
My previous work: HARP and HARP.1.
My current and future work: studying genomic data, comparing different algorithms, designing similarity measures, etc.

3

Curse of Dimensionality

In a high-dimensional dataset, some attributes may not help identify the characteristics of the data points.
If there are many such attributes, the distance from a point to its nearest neighbor can be almost as large as the distance to its farthest neighbor.
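The effect described above can be observed with a small simulation. The sketch below is only an illustration under stated assumptions: the points are uniform random noise (so every attribute is irrelevant), and the function name, point count and dimensionalities are hypothetical choices, not taken from the talk.

```python
import math
import random

def nearest_farthest_ratio(n_points, n_dims, rng):
    """Ratio of nearest to farthest neighbor distance for one query point."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return min(dists) / max(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
ratios = {d: nearest_farthest_ratio(200, d, rng) for d in (2, 20, 200, 2000)}
for d, r in ratios.items():
    print(f"dims={d:5d}  nearest/farthest = {r:.2f}")
```

A ratio close to 1 means the nearest and farthest neighbors are almost equally far away, so distance alone no longer separates similar points from dissimilar ones.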

4

Curse of Dimensionality

Resolution: feature selection

                A1    A2    A3    A4    A5    A6    A7    A8    A9   A10
R1               3     8     5     7     4     8     6    10     0     1
R2               4     7     4     7     1     1     9     1     6     7
R3               9     1     1     1     5     3     4     6     2     9
D(R1, R2)              1.4   1.7   1.7   3.5   7.8   8.4  12.3  13.7  14.9
D(R2, R3)              7.8   8.4  10.3  11.0  11.2  12.3  13.3  13.9  14.0
Difference             6.4   6.6   8.6   7.6   3.4   3.9   1.0   0.2  -0.9

The distances in column Ak are Euclidean distances computed over attributes A1 to Ak.
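The distance rows of the table can be recomputed with a few lines of code. This is a minimal sketch, assuming plain Euclidean distance over the first k attributes; the function name `prefix_distance` is a hypothetical helper, not something from the talk.

```python
import math

# Attribute values A1..A10 for the three records, copied from the table.
R1 = [3, 8, 5, 7, 4, 8, 6, 10, 0, 1]
R2 = [4, 7, 4, 7, 1, 1, 9, 1, 6, 7]
R3 = [9, 1, 1, 1, 5, 3, 4, 6, 2, 9]

def prefix_distance(a, b, k):
    """Euclidean distance using only the first k attributes."""
    return math.dist(a[:k], b[:k])

for k in range(2, 11):
    d12 = prefix_distance(R1, R2, k)
    d23 = prefix_distance(R2, R3, k)
    print(f"A1..A{k}: D(R1,R2)={d12:.1f}  D(R2,R3)={d23:.1f}  diff={d23 - d12:.1f}")
```

With only A1 and A2, R1 is clearly closer to R2 than R3 is. Once all ten attributes are included, the gap closes and even changes sign, which is the motivation for selecting attributes before clustering.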

5

Projected Clustering

What feature selection cannot do: find different relevant attributes for different clusters.

[Figure: clusters C1, C2 and C3, shown in the projection on x1 and x2]

6

Projected Clustering

Clustering: identify similar objects and form clusters.
Projected clustering: identify similar objects and the relevant attributes, and form clusters.
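To make "relevant attributes" concrete, the sketch below picks, for a given cluster, the attributes on which its members are much more tightly packed than the dataset as a whole. The variance-ratio rule, the threshold of 0.25 and the toy data are illustrative assumptions only; they are not the criterion used by HARP or by any of the algorithms named in this talk.

```python
from statistics import pvariance

def relevant_attributes(cluster, dataset, threshold=0.25):
    """Indices of attributes whose spread inside the cluster is small
    compared with their spread over the whole dataset.

    The variance-ratio rule and the threshold are illustrative assumptions.
    """
    n_attrs = len(dataset[0])
    selected = []
    for j in range(n_attrs):
        global_var = pvariance([row[j] for row in dataset])
        local_var = pvariance([row[j] for row in cluster])
        if global_var > 0 and local_var / global_var < threshold:
            selected.append(j)
    return selected

# Toy data: cluster_a is tight on attribute 0, cluster_b is tight on attribute 1.
cluster_a = [(1.0, 2.0), (1.1, 9.0), (0.9, 5.0), (1.0, 7.5)]
cluster_b = [(4.0, 3.0), (9.0, 3.1), (6.5, 2.9), (2.0, 3.0)]
dataset = cluster_a + cluster_b

print(relevant_attributes(cluster_a, dataset))
print(relevant_attributes(cluster_b, dataset))
```

Each toy cluster ends up with a different relevant attribute, which is exactly the situation that a single global feature selection step cannot express.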

7

Projected Clustering

Existing approaches:
• Grid-based dimension selection
• Association rule hypergraph partitioning
• Context-specific Bayesian clustering
• Monte Carlo algorithm
• Projective clustering

All produce projected clusters successfully in different ways.

8

Projected Clustering

Some problems with the approaches:
• Accuracy depends on hard-to-determine input parameters.
• Unable to determine the dimensionality of each cluster automatically.
• Can only use density as the similarity measure.

In addition, the algorithms have been tested on few real datasets.

9

HARP and HARP.1

HARP (A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering):

• A hierarchical clustering framework with no pre-assumed similarity measure.
• Requires no hard-to-determine user inputs.
• Determines the relevant attributes of each cluster automatically.

10

HARP and HARP.1

HARP.1 - an implementation of HARP using attribute value density to define the similarity measure:

• Makes use of global statistics in attribute selection.
• Handles mutual disagreement and information loss of the potentially merging clusters.
• Accepts mixed types of attributes.

11

HARP and HARP.1

Some experimental results (synthetic data):

Best score: error% / outlier%
Average: error% / outlier%

Dataset   d    l    HARP.1    PROCLUS              Traditional          ORCLUS
SynCat1   20   12   0.0/5.0   3.6/1.4, 6.7/3.7     2.6/26.4, 5.8/5.3    N/A
SynMix1   20   12   0.4/6.8   2.2/17.0, 6.8/10.1   11.6/11.2, 7.9/4.6   N/A
SynNum1   20   12   0.8/5.0   1.8/21.4, 7.2/8.3    4.4/32.0, 5.9/9.2    0.4/23.8, 2.31/8.15

d: dimensionality of the dataset
l: average number of relevant attributes

12

HARP and HARP.1

Some experimental results (real data):

Best score: error% / outlier%
Average: error% / outlier%

Dataset    d    l*   HARP.1     PROCLUS              Traditional           ROCK
Soybean    35   26   0.0/0.0    0.0/0.0, 17.3/0.0    2.1/0.0, 9.2/0.0      No published result
Voting     16   11   6.4/13.6   2.1/55.6, 13.8/7.9   13.1/11.3, 13.1/1.9   6.2/14.5
Mushroom   22   15   1.4/0.0    3.2/0.0, 9.0/0.0     6.0/0.0, 5.2/0.0      0.4/0.0

d: dimensionality of the dataset
l*: average number of relevant attributes (determined by HARP.1)

13

Current and Future Work

A real application of projected clustering - analyzing genomic data:

• High dimensionality
• Noisy
• “Correct” partition not always available
• A lot of hidden information
• New data available with a high growth rate

14

Current and Future Work

15

Current and Future Work

Codon usage data:
Study the relationship between the frequencies of different codons and the functions of the genes.

gene\codon  AAA       AAC       AAG       AAT       ACA       …   TTT
thrL        0.05      0.05      0         0         0.05      …   0
thrA        0.026862  0.019536  0.014652  0.026862  0.002442  …   0.013431
thrB        0.009709  0.019417  0.022654  0.019417  0.006472  …   0.019417
thrC        0.039813  0.025761  0.01171   0.018735  0.004684  …   0.030445
b0005       0.051546  0         0.020619  0.020619  0         …   0
yaaA        0.042802  0.015564  0.027237  0.035019  0         …   0.031128
yaaJ        0.010526  0.021053  0.002105  0.012632  0.004211  …   0.052632
talB        0.056962  0.028481  0.012658  0.006329  0.003165  …   0.006329
mog         0.015464  0.010309  0.015464  0.005155  0.005155  …   0.020619
yaaH        0.02139   0.037433  0.005348  0.010695  0         …   0.037433
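A row of such a table could be produced from a gene's coding sequence roughly as follows. This is a sketch under assumptions: the talk does not say how the frequencies were computed (for example, whether start and stop codons are counted), the function name `codon_usage` is hypothetical, and the input sequence is made up rather than a real gene.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # AAA, AAC, ..., TTT

def codon_usage(coding_sequence):
    """Relative frequency of each of the 64 codons in a coding sequence."""
    seq = coding_sequence.upper()
    # Split into consecutive triplets, ignoring any incomplete trailing codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    total = len(codons) or 1  # avoid division by zero for an empty sequence
    return {codon: counts[codon] / total for codon in CODONS}

# Hypothetical 8-codon sequence, not a real gene.
usage = codon_usage("ATGAAAAAACGTTTTAAAGGCTAA")
print(usage["AAA"], usage["TTT"])
```

Each gene then becomes a 64-dimensional vector, which is the kind of high-dimensional input that projected clustering is meant for.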

16

Current and Future Work

Transcriptome data:
The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time.
Study the expression of the genes:
• In different samples
• Under different situations
• At different times

17

Current and Future Work

Transcriptome data:

18

Current and Future Work

Transcriptome data:

[Figure: five panels with axis ticks 1 to 7 and 51 to 57; scale labelled 2000 and 0]

gene\cond.  A        B         C        D        E
aas A       615.83   944.54    414.56   371.06   367.64
aas B       640.78   1000.51   306.28   369.15   454.63
aat A       98.54    121.28    120.35   170.51   87.47
aat B       142.77   170.77    120.43   183.51   88.91
abc A       109.13   313.34    134.02   140.78   209.37
abc B       137.72   319.54    180.75   162.52   187.17
accA A      199.84   306.95    154.25   155.28   200.05
accA B      154.78   293.73    144.74   113.68   132.27
accB A      429.75   635.05    209.85   141.94   208.93
accB B      443.64   702.86    218.05   140.42   235.06
accC A      212.91   328.21    125.01   136.75   263.32
accC B      186.16   341.43    139.27   169.31   223.76
accD A      958.71   1788.82   849.15   696.13   82.06
accD B      917.65   1649.83   774.74   656.01   552.33

19

Current and Future Work

A sample clustering result:

Operon 12_1: trs5_7 0, yefJ 4, wbbK 4, wbbJ 4, wbbI 4, wbbH 4, glf 4, rfbX 4, rfbC 4, rfbA 4, rfbD 4, rfbB 4

Operon 12_2: hyfA 0, hyfB 0, hyfC 0, hyfD 0, hyfE 6, hyfF 0, hyfG 0, hyfH 0, hyfI 0, b2490 9, hyfR 0, focB 0

Operon 12_3: rpmJ 6, prlA 0, rplO 2, rpmD 2, rpsE 2, rplR 2, rplF 2, rpsH 2, rpsN 2, rplE 2, rplX 2, rplN 2
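One simple way to read this result is to ask how many genes of each operon fall into the same cluster. The sketch below assumes that the number after each gene name is the cluster the gene was assigned to, which the slide does not state explicitly; the function name `majority_agreement` is a hypothetical helper.

```python
from collections import Counter

# Number listed after each gene in the sample result, grouped by operon
# (assumed here to be the gene's cluster label).
operons = {
    "12_1": [0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "12_2": [0, 0, 0, 0, 6, 0, 0, 0, 0, 9, 0, 0],
    "12_3": [6, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
}

def majority_agreement(labels):
    """Most common cluster label and the fraction of genes carrying it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

for name, labels in operons.items():
    label, fraction = majority_agreement(labels)
    print(f"Operon {name}: cluster {label} holds {fraction:.0%} of the genes")
```

Under that reading, 11 of the 12 genes of operon 12_1 share a cluster, and 10 of 12 do so in each of the other two operons.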

20

Current and Future Work

Goal: knowledge clustering method
• Subspace vs. non-subspace
• Similarity: density, correlation, pattern, etc.
• Usability: sensitivity to input parameters
• Algorithm type: hierarchical, partitional, graph-based, model-based, etc.
• Preprocessing: logarithm, mean-centering, PCA, FCA, ICA, etc.
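Two of the choices listed above, preprocessing and similarity, can be illustrated on the expression values shown earlier. This is a sketch only: it assumes a base-2 logarithm followed by per-gene mean-centering, uses Pearson correlation as the similarity, and the function names are hypothetical. The talk does not say which combination was actually used.

```python
import math

def log_mean_center(profile):
    """Log2-transform an expression profile, then subtract its mean."""
    logged = [math.log2(v) for v in profile]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Rows "aas A" and "aas B" under conditions A..E, from the transcriptome table.
aas_a = [615.83, 944.54, 414.56, 371.06, 367.64]
aas_b = [640.78, 1000.51, 306.28, 369.15, 454.63]

centered_a = log_mean_center(aas_a)
centered_b = log_mean_center(aas_b)
print(f"correlation = {pearson(centered_a, centered_b):.2f}")
```

Correlation compares the shape of two profiles across conditions rather than their absolute levels, so it can group genes that a density-based measure on raw values would keep apart.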

21

Current and Future Work

Other subtasks:
• Internal validation for projected clustering
• Designing classifiers based on current results
• Building an integrated system for genomic data analysis
• …

22

Conclusion

Research space: both theoretical and practical.
Algorithms:
• Projected clustering algorithms have much room for improvement.
• Many open problems remain in high-dimensional data analysis.
Genomic data:
• A lot to discover.
• CS people can really help.

23

References

References for projected clustering can be found in the slides of the presentations HARP: A Hierarchical Approach with Automatic Relevant Attribute Selection for Projected Clustering and The Subspace Clustering Problem.
Reference web sites for the bioinformatics materials covered in this presentation:
• Human Genome Project Information
• HKU-Pasteur Research Centre
• Companion Web Site, Concepts of Genetics (6th Ed.)

24

Source of Figures

Projected clusters (p. 5): C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering. In ACM SIGMOD International Conference on Management of Data, 2002.
Generalized animal cell, genetic information flow (p. 14, 15): William S. Klug and Michael R. Cummings. Concepts of Genetics, Sixth Edition (p. 18, 350).
DNA with features (p. 14): Human Genome Project Information, http://www.ornl.gov/hgmis/.

25

Source of Figures

GeneChip® Arrays for Gene Expression Analysis (p. 17): Affymetrix, http://www.affymetrix.com/.
An illuminated microarray (p. 17): EMBL-EBI, http://www.ebi.ac.uk/.

Thank You!