
• Set similarity: given two gene products, G1 and G2, we can consider each of them as being represented by a collection of terms.

Based on the two term sets, the goal is to define a natural similarity between G1 and G2, denoted s(G1, G2).

• Two types of set similarity:

• element based (Dice, Jaccard, Cosine, fuzzy measure)

• pair of elements based (Maximum, Average, OWA, Choquet)

• Expression dimension: real-valued measures (Euclidean measure, etc.).

• Sequence dimension: sequence similarity measures (Smith-Waterman, Needleman-Wunsch, etc.).

• GO and Abstract dimension: set similarity.

Set Similarity Measures for Gene Matching

Mihail Popescu#, James Keller+, Joyce Mitchell#

# Department of Health Management and Informatics; + Department of Electrical and Computer Engineering; University of Missouri-Columbia, Columbia, MO 65211

Why Similarity Measures?

• For a unified clustering approach in a 4D gene space

• Gene space dimensions (4D): sequence, microarray expression, literature abstracts (articles), gene ontology (GO)

• Two dimensions are numeric (sequence, expression) and two symbolic

• The existing symbolic measures are not adequate:

• Dice, Jaccard: do not consider the weight of the elements

• Maximum and average usually overestimate and underestimate the similarity, respectively

• Example: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11). The geneticist assessed these two genes as quasi-similar (similarity ~0.5) because:

• They both have protein serine/threonine kinase enzyme activity (they share a kinase domain)

• They both cause cancers when mutated, including breast cancer.
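To make the point about Dice and Jaccard concrete, here is a minimal Python sketch that applies both measures to the GO annotation IDs listed for ATM and STK11 in the GO example of this poster. The function names `jaccard` and `dice` are illustrative, not part of the authors' software.

```python
# GO term IDs taken from the LocusLink annotations listed in this poster.
ATM = {4674, 3677, 4428, 7131, 6281, 7165, 5634, 16740, 45786}
STK11 = {5524, 4674, 6468, 16740}


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; every element counts equally."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2 |A intersect B| / (|A| + |B|); every element counts equally."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


print(f"Jaccard = {jaccard(ATM, STK11):.2f}")  # 2/11, prints 0.18
print(f"Dice    = {dice(ATM, STK11):.2f}")     # 4/13, prints 0.31
```

Both values fall well below the expert assessment of ~0.5. The shared, highly specific kinase term (4674) counts exactly as much as the generic "transferase activity" term (16740), which is the weakness noted above.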

Possible similarity measures

Example of Similarity Calculation for the Gene Ontology (GO) Dimension

• s(ATM, STK11) = ? (GO dimension)

• Algorithm:

• 1. Retrieve LocusLink GO annotations:

• ATM = {4674: “protein serine/threonine kinase activity”, 3677: “DNA binding”, 4428: “inositol/phosphatidylinositol kinase activity”, 7131: “meiotic recombination”, 6281: “DNA repair”, 7165: “signal transduction”, 5634: “nucleus”, 16740: “transferase activity”, 45786: “negative regulation of cell cycle”}

• STK11 = {5524: “ATP binding”, 4674: “protein serine/threonine kinase activity”, 6468: “protein amino acid phosphorylation”, 16740: “transferase activity”}

• 2. Compute GO term densities using the Resnik formula [4], its normalized version (shown in square brackets, [.]) or the depth in the hierarchy (shown in parentheses, (.))
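Step 2 can be sketched as follows, assuming the Resnik-style information content IC(t) = -ln p(t), where p(t) is the fraction of annotated gene products that carry term t or one of its descendants. The annotation counts below are hypothetical placeholders, not values from the poster, and taking the maximum over the terms at hand is an assumption about how the normalization is done.

```python
import math

# HYPOTHETICAL annotation counts (term plus all of its descendants).
# These numbers are placeholders for illustration, not data from the poster.
counts = {
    "molecular_function": 10000,  # ontology root: every annotated product
    "transferase activity": 250,
    "protein serine/threonine kinase activity": 72,
}
total = counts["molecular_function"]


def information_content(term: str) -> float:
    """Resnik-style information content: -ln p(term)."""
    return -math.log(counts[term] / total)


ic = {term: information_content(term) for term in counts}

# Normalize by the maximum information content so densities fall in [0, 1].
# ASSUMPTION: the maximum is taken over the terms under consideration.
ic_max = max(ic.values())
normalized = {term: value / ic_max for term, value in ic.items()}

for term in counts:
    print(f"{term}: IC = {ic[term]:.2f}, normalized = {normalized[term]:.2f}")
```

Rare, specific terms receive a high density and the root receives zero, so a shared specific term weighs more than a shared generic one.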

Example of Similarity Calculation for the Retrieved Abstracts Dimension

Acknowledgements

This research was supported by National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11.

References

[1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 2001.

[2] R. Yager, “Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral”, Int. Jour. of Fuzzy Systems, Vol. 1, No. 2, December 1999.

[3] J.J. Jiang, D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Ontology”, Proc. of Int. Conf. Research on Comp. Linguistics X, Taiwan, 1997.

[4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, “Semantic similarity measure as a tool for exploring the gene ontology”, Pacific Symposium on Biocomputing, pp. 601-612, 2003.

[5] M. Sugeno, “Fuzzy measures and fuzzy integrals: a survey”, in M.M. Gupta, G.N. Saridis, and B.R. Gaines (eds.), Fuzzy Automata and Decision Processes, pp. 89-102, North-Holland, New York, 1977.

[6] S. Raychaudhuri, R.B. Altman, “A literature-based method for assessing the functional coherence of a gene group”, Bioinformatics, 19(3), pp. 396-401, Feb. 2003.

[7] M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, 2000.

[8] T.R. Hvidsten, J. Komorowski, A.K. Sandvik, A. Laegreid, “Predicting gene function from gene expressions and ontologies”, Pacific Symposium on Biocomputing, pp. 299-310, 2001.

[9] T. Joshi, “Cellular function prediction for hypothetical proteins using high-throughput data”, MS thesis, University of Tennessee, Knoxville, 2003.

[10] J. Keller, M. Popescu, J. Mitchell, “Soft Computing Tools for Gene Similarity Measures in Bioinformatics”, FLINT-CIBI 2003, Berkeley, Dec 15-18, 2003.

[Figure: 4D gene space, with axes labeled Sequence, Abstracts and GO annotations. Two gene records are shown, labeled ATM and STK11, with an expert-assessed similarity of ~0.5. The first record has sequence ACAC..., expression 195, abstracts abstract_11 ... abstract_1n and GO annotations term_11 ... term_1m. The second record has sequence CCAT..., expression 300, abstracts abstract_21 ... abstract_2n and GO annotations term_21 ... term_2m.]

$G_1 = \{T_{11}, \ldots, T_{1i}, \ldots, T_{1n}\}$, $G_2 = \{T_{21}, \ldots, T_{2j}, \ldots, T_{2m}\}$, with similarity $s(G_1, G_2)$

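[Figure: abstract-based comparison of Gene 1 and Gene 2. Gene 1 has Abstract 1 with term set {T11i} and confidence g(A11), and Abstract 2 with term set {T12i} and confidence g(A12). Gene 2 has Abstract 1 with term set {T21i} and confidence g(A21), and Abstract 2 with term set {T22i} and confidence g(A22). Each pair of abstracts, for example (A11, A21), has a similarity s(A11, A21) and a confidence c(A11, A21).]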

Set similarity measures

• s(ATM, STK11) = ? (Abstract dimension)

• Algorithm:

• Retrieve PubMed abstracts for ATM and STK11

• Calculate all the pair-wise distances based on the MeSH indexing

• Keep the 4 best-matching pairs

• Find the impact factor for each journal: g(Ai), i = 1…8

ATM:

• 12917635 - Oncogene (6.737)

• 12970738 - Oncogene (6.737)

• 14500819 - Nucleic Acids Res. (6.373)

• 14499692 - Science (23.329)

STK11:

• 12183403 - Cancer Res (8.30)

• 12234250 - Biochem J (4.326)

• 12805220 - EMBO J. (12.459)

• 11853558 - Biochem J (4.326)

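• Calculate the confidence of each pair, g(A1, A2) = g(A1)*g(A2), and normalize using the maximum value (rows: ATM abstracts, columns: STK11 abstracts, both in the order listed above):

$$
g_i = \begin{pmatrix}
0.19 & 0.10 & 0.29 & 0.10 \\
0.19 & 0.10 & 0.29 & 0.10 \\
0.18 & 0.09 & 0.27 & 0.09 \\
0.67 & 0.35 & 1.00 & 0.35
\end{pmatrix}
$$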
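The confidence step can be checked with a few lines of Python. This sketch uses the journal impact factors listed above and assumes that rows follow the ATM abstracts and columns the STK11 abstracts in the listed order; the variable names are illustrative.

```python
# Journal impact factors as listed in the poster.
atm_if = [6.737, 6.737, 6.373, 23.329]     # ATM abstracts (rows)
stk11_if = [8.30, 4.326, 12.459, 4.326]    # STK11 abstracts (columns)

# Raw pair confidence: product of the two impact factors.
raw = [[a * b for b in stk11_if] for a in atm_if]

# Normalize by the largest product so the best pair has confidence 1.
peak = max(max(row) for row in raw)
confidence = [[value / peak for value in row] for row in raw]

for row in confidence:
    print("  ".join(f"{value:.2f}" for value in row))
# prints:
# 0.19  0.10  0.29  0.10
# 0.19  0.10  0.29  0.10
# 0.18  0.09  0.27  0.09
# 0.67  0.35  1.00  0.35
```

The Science/EMBO J. pair gets confidence 1.00 because it combines the two highest impact factors.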

• The pair-wise similarity values calculated using FMS are:

$$
s_{FMS}(A_k) = \begin{pmatrix}
0.00 & 0.20 & 0.16 & 0.24 \\
0.00 & 0.13 & 0.26 & 0.32 \\
0.07 & 0.29 & 0.10 & 0.11 \\
0.44 & 0.00 & 0.00 & 0.00
\end{pmatrix}
$$

• Similarity calculation:

• Using weighted average: s(ATM, STK11) = 0.37

• Using Choquet integral: s(ATM, STK11) = 0.53
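The weighted-average figure can be reproduced with a short sketch. It assumes that "the 4 best-matching pairs" means the four largest FMS values overall, and that the two 4×4 arrays in this section are oriented with ATM abstracts as rows and STK11 abstracts as columns. Both are readings of the poster, not statements by the authors.

```python
# Pair confidences g (normalized impact-factor products) and FMS similarities s.
# Rows = ATM abstracts, columns = STK11 abstracts (orientation is an assumption).
g = [[0.19, 0.10, 0.29, 0.10],
     [0.19, 0.10, 0.29, 0.10],
     [0.18, 0.09, 0.27, 0.09],
     [0.67, 0.35, 1.00, 0.35]]
s = [[0.00, 0.20, 0.16, 0.24],
     [0.00, 0.13, 0.26, 0.32],
     [0.07, 0.29, 0.10, 0.11],
     [0.44, 0.00, 0.00, 0.00]]

# ASSUMPTION: "the 4 best-matching pairs" = the four largest FMS values overall.
pairs = [(s[i][j], g[i][j]) for i in range(4) for j in range(4)]
best = sorted(pairs, reverse=True)[:4]

# Confidence-weighted average of the retained similarities.
weighted_avg = sum(sim * conf for sim, conf in best) / sum(conf for _, conf in best)
print(f"s(ATM, STK11) = {weighted_avg:.2f}")  # prints 0.37
```

Under these assumptions the result matches the reported 0.37. The Choquet value of 0.53 depends on a fuzzy measure that the poster does not spell out, so it is not reproduced here.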

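Pairwise GO term similarity, given as Resnik value (depth-based) [normalized]. Rows: ATM terms; columns: STK11 terms.

GO term | 5524 | 4674 | 6468 | 16740
4674 | 1.12(0.67)[0.1] | 4.93(1)[0.44] | 0 | 3.69(0.975)[0.33]
3677 | 2.21(0.89)[0.2] | 1.12(0.67)[0.1] | 0 | 1.12(0.67)[0.1]
4428 | 1.12(0.67)[0.1] | 4.3(0.986)[0.38] | 0 | 3.69(0.975)[0.33]
7131 | 0 | 0 | 2.12(0.88)[0.19] | 0
6281 | 0 | 0 | 2.12(0.88)[0.19] | 0
7165 | 0 | 0 | 0.86(0.58)[0.08] | 0
5634 | 0 | 0 | 0 | 0
16740 | 1.12(0.67)[0.1] | 3.69(0.975)[0.33] | 0 | 3.69(1)[0.33]
45786 | 0 | 0 | 1.33(0.74)[0.12] | 0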

• 3. Compute the similarity:

Jaccard | Dice | FMS (normalized) | FMS (depth) | Average (normalized) | Maximum (normalized) | OWA (normalized)
0.18 | 0.31 | 0.64 | 0.36 | 0.09 | 0.44 | 0.37
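The pair-based aggregations can be illustrated with the bracketed (normalized) values from the GO table. The sketch below uses a generic ordered weighted average (OWA) and shows that Maximum and Average are two of its special cases. The OWA weights behind the reported 0.37 are not given in the poster, so only those two cases are computed; the helper name `owa` is illustrative.

```python
# Normalized pairwise term similarities (the bracketed values in the GO table).
# Rows = ATM terms, columns = STK11 terms (5524, 4674, 6468, 16740).
table = [
    [0.10, 0.44, 0.00, 0.33],  # 4674
    [0.20, 0.10, 0.00, 0.10],  # 3677
    [0.10, 0.38, 0.00, 0.33],  # 4428
    [0.00, 0.00, 0.19, 0.00],  # 7131
    [0.00, 0.00, 0.19, 0.00],  # 6281
    [0.00, 0.00, 0.08, 0.00],  # 7165
    [0.00, 0.00, 0.00, 0.00],  # 5634
    [0.10, 0.33, 0.00, 0.33],  # 16740
    [0.00, 0.00, 0.12, 0.00],  # 45786
]
values = [v for row in table for v in row]


def owa(xs, weights):
    """Ordered weighted average: weights apply to values sorted high to low."""
    ordered = sorted(xs, reverse=True)
    return sum(w * x for w, x in zip(weights, ordered))


n = len(values)
maximum = owa(values, [1.0] + [0.0] * (n - 1))  # all weight on the largest value
average = owa(values, [1.0 / n] * n)            # uniform weights

print(f"maximum = {maximum:.2f}")
print(f"average = {average:.3f}")
```

The maximum is 0.44 and the average is about 0.095, in line with the reported 0.44 and 0.09. They sit on either side of the expert value of ~0.5 in the way the motivation describes: the average is dragged down by the many zero entries, while the maximum reflects a single best pair.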

Conclusions

• For the GO dimension, the best method of assigning densities was normalizing the information content [4] by the maximum value

• The proposed fuzzy similarity measure (FMS) agrees better with our intuition of similarity: if the common elements have a high confidence, then the similarity is stronger. In addition, the non-common terms also contribute to the similarity, since the measure is computed a priori for each term set.

• The Choquet similarity measure is much more general, depending only on the fuzzy measure. In addition, the optimal fuzzy measure can be learned from examples.
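As a rough illustration of how a Choquet integral depends only on the fuzzy measure, the sketch below builds a Sugeno λ-measure from per-element densities and integrates a list of similarity values against it. The choice of a λ-measure, the function names and the example numbers are all assumptions made for illustration. The poster does not state which fuzzy measure was used, so this is not expected to reproduce the reported 0.53.

```python
def sugeno_lambda(densities, iterations=200):
    """Nonzero root lam > -1 of prod(1 + lam*g_i) = 1 + lam (0 if additive)."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to 1: the measure is additive

    def f(lam):
        product = 1.0
        for g in densities:
            product *= 1.0 + lam * g
        return product - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, 0.0  # f > 0 left of the root, f < 0 right of it
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    else:
        lo, hi = 0.0, 1.0   # f < 0 left of the root, f > 0 right of it
        while f(hi) < 0.0 and hi < 1e12:
            hi *= 2.0
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
    return 0.5 * (lo + hi)


def choquet(values, densities):
    """Discrete Choquet integral of values w.r.t. a Sugeno lambda-measure."""
    lam = sugeno_lambda(densities)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result, measure = 0.0, 0.0
    for rank, i in enumerate(order):
        # Measure of the set holding the (rank + 1) largest values.
        measure = measure + densities[i] + lam * measure * densities[i]
        if rank == len(order) - 1:
            measure = 1.0  # the full set has measure 1 by definition
        nxt = values[order[rank + 1]] if rank + 1 < len(order) else 0.0
        result += (values[i] - nxt) * measure
    return result


# HYPOTHETICAL similarities and densities, for illustration only.
print(choquet([0.9, 0.6, 0.3], [0.4, 0.3, 0.2]))
```

When the densities sum to 1 the measure is additive and the integral reduces to an ordinary weighted average. Other measures move the result anywhere between the minimum and the maximum of the values, which is what makes the approach general and lets the measure be fitted to expert-labeled examples.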