Similarity Search in Large Datasets using Gene Ontology
COMPUTATIONAL INFORMATICS
Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman
Gene01: ACGGTAGGCTAGACTAGATATTAACG
Gene02: CCTGAGTACCTGGACTAGATAC
Gene03: GATGCGGTTACGTACGATCCATGGA
Gene04: CATTTATTATATATACGCGCGCGA
Gene05: TTTCGATAGGGGATATATTAACGCCG
Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
Gene07: GATAGACTCGCGCCGATATATAG
Gene08: ATATATTTCCTAGATCGAGAGATAC
Gene09: GATAGGTTAATTAATTTCCTATAT
Gene10: TGGATTGGATAGCGCGATAGATC
Gene11: AAAAGTCGATAAGGCTAGAGCTAG
Gene12: GGATATAGATATATCTAGATATC
Gene13: CGATATAGCCCTCTAGAGATACTTT
Gene14: GATACCCGCGATATATCAT
Gene15: TAGATCCCCGAGATAGAGACT
Gene16: CACCATAGAAGACTGATCGAGATAG
Gene01: GGCTAGACTAGATATTAACGACGGTA
Gene02: AGTACCTGGACTAGCCTGTAC
Gene03: GATGCGGTTACGCCATTACGAT
Gene04: GATATATATATATACGCGCGCGA
Gene05: CATTTATGGGATATATTAACGCCG
Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
Gene07: GATAGACTCGCGCCGATATATAG
Gene08: TCCTAGATCAGATCGAGAGATAC
Gene09: GATAGGTTAATTAATTTCCTATAT
Gene10: GCGATCCTATGGATAGCAGATC
Gene11: AAAAGTCGATAAGGCTAGAGCTAG
Gene12: GGATATAGATATATCTAGATATC
Gene13: CGATATAGCCAGAAGTCGAACTTT
Gene14: GATACCCGCGCTCTATATATCAT
Gene15: TAGATCCCCGAGATAGAGACT
Gene16: CACCATAGAAGACTGATCGAGATAG
N. perurans N. pemaquidensis
Compare sets of genes and gene products to discover:
1. Similarities between them.
2. The most dissimilar genes in each dataset.
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Semantic Similarity Search in Large Datasets | Heiko Müller
Gene Ontology (GO)
Example from Molecular Function ontology
GO Annotations
GOA(g1) = {GO:0055100, GO:0070122}
True Path Rule: “[...] the pathway from a child term all the way up to its top-level parent(s) must always be true”.
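The True Path Rule implies that a gene annotated with a term is implicitly annotated with every ancestor of that term. A minimal sketch of this propagation follows; the `parents` map and the term names in it are hypothetical placeholders, not the real GO graph.

```python
def ancestors_closure(terms, parents):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            # Walk upwards; a term may have several parents in a DAG.
            stack.extend(parents.get(term, ()))
    return closed


# Hypothetical toy DAG (child -> parents); names are placeholders, not GO IDs.
parents = {
    "leaf_a": ["mid_x"],
    "leaf_b": ["mid_x", "mid_y"],
    "mid_x": ["root"],
    "mid_y": ["root"],
}

print(sorted(ancestors_closure({"leaf_a"}, parents)))
```

Comparing genes on these closed term sets, rather than on the direct annotations alone, is what lets two genes with different leaf terms still share information through common ancestors.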
Semantic Similarity
GOA(g1) = {GO:0055100, GO:0070122}
GOA(g2) = {GO:0030332, GO:0070012}
• Annotations provide an objective representation to compare genes on functional aspects.
• A semantic similarity measure quantifies the relationship between (sets of) GO terms.
sim(g1, g2) = ?
Term Specificity
[Figure: example term pairs, ordered from less similar to more similar.]
Corpus-based:
$$ic(t) = -\log(P(t))$$
Structure-based:
$$ic(t) = depth(t) \cdot \left(1 - \frac{\log(desc(t) + 1)}{\log(total\_terms)}\right)$$
Quantify semantics or information content (ic) of GO terms.
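The two information-content variants can be sketched as small functions. This is an illustrative sketch only: the function and parameter names are assumptions, the natural logarithm is assumed because the slides do not state a base, and P(t) is taken to be the fraction of annotations that fall on t or its descendants.

```python
import math


def ic_corpus(term_count, total_count):
    """Corpus-based IC: -log(P(t)), with P(t) = term_count / total_count."""
    return -math.log(term_count / total_count)


def ic_structure(depth, n_descendants, total_terms):
    """Structure-based IC: depth(t) * (1 - log(desc(t) + 1) / log(total_terms))."""
    return depth * (1.0 - math.log(n_descendants + 1) / math.log(total_terms))


# A rarely used term carries more information than a frequently used one.
print(ic_corpus(1, 100) > ic_corpus(50, 100))
# A deep leaf term (no descendants) keeps its full depth as IC.
print(ic_structure(depth=3, n_descendants=0, total_terms=100))
```

In both variants, specific terms (rare in the corpus, or deep with few descendants) receive high values and general terms near the root receive values close to zero.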
Group-wise Semantic Similarity
GOA(g1) = {GO:0055100, GO:0070122}
GOA(g2) = {GO:0030332, GO:0070012}
IC(g1) = 10.6609
IC(g2) = 9.7925
IC(g1 ∩ g2) = 2.7925
sim(g1, g2) = 0.2736
$$sim(g_1, g_2) = \frac{1}{2}\left(\frac{IC(g_1 \cap g_2)}{IC(g_1)} + \frac{IC(g_1 \cap g_2)}{IC(g_2)}\right)$$
Group-wise Similarity
X. Chen et al., Gene, 509 (2012)
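As a quick check of the group-wise similarity, the slide's numbers can be plugged into the formula directly. The function name `groupwise_sim` is an illustrative assumption; the three IC values are the ones given above for g1 and g2.

```python
def groupwise_sim(ic_g1, ic_g2, ic_shared):
    """Average of the shared IC relative to each gene's own IC."""
    return 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)


# IC(g1), IC(g2) and the IC of the shared terms, as listed on the slide.
sim = groupwise_sim(10.6609, 9.7925, 2.7925)
print(f"{sim:.4f}")  # 0.2736, matching sim(g1, g2) on the slide
```

The measure is 1 when both genes have exactly the same (propagated) term set and approaches 0 when they share only very general terms.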
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Gene Identifier Sets
D1:
1 = g11: GO:0003824, GO:0005488
2 = g12: GO:0016787, GO:0042562
3 = g13: GO:0008233, GO:0031406
4 = g14: GO:0005515, GO:0016787
5 = g15: GO:0055100, GO:0070122

D2:
1 = g21: GO:0003824, GO:0005488
2 = g22: GO:0016829, GO:0042562
3 = g23: GO:0043168, GO:0008233
4 = g24: GO:0055100, GO:0070012
5 = g25: GO:0004325, GO:0043177
Exhaustive Search
Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5
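One way to read the exhaustive search is as a single pass over the term list that accumulates IC-D1, IC-D2 and the shared IC-D12 for every gene pair. The sketch below is an assumption about how such a pass could look, not the authors' implementation. The rows are transcribed from the table above; for rows with a single GIDS value, the column was inferred from the annotation lists of D1 and D2.

```python
from collections import defaultdict
from itertools import product

# (term, IC, gene ids in D1, gene ids in D2)
TABLE = [
    ("GO:0070012", 4.0000, [], [4]),
    ("GO:0070122", 3.5212, [5], []),
    ("GO:0055100", 3.0000, [5], [4]),
    ("GO:0004325", 2.7734, [], [5]),
    ("GO:0031406", 2.2228, [3], []),
    ("GO:0043177", 1.6616, [3], [5]),
    ("GO:0008233", 1.6305, [3, 5], [3, 4]),
    ("GO:0043168", 1.3777, [3], [3]),
    ("GO:0042562", 1.3472, [2, 5], [2, 4]),
    ("GO:0036094", 0.8873, [3], [5]),
    ("GO:0043167", 0.8624, [3], [3]),
    ("GO:0016829", 0.6347, [], [2, 5]),
    ("GO:0005515", 0.5123, [4, 5], [4]),
    ("GO:0016787", 0.4144, [2, 3, 4, 5], [3, 4]),
    ("GO:0005488", 0.1898, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
    ("GO:0003824", 0.0455, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
]


def exhaustive(table):
    """Accumulate per-gene IC and per-pair shared IC, then score all pairs."""
    ic1, ic2, ic12 = defaultdict(float), defaultdict(float), defaultdict(float)
    for _term, ic, genes1, genes2 in table:
        for g in genes1:
            ic1[g] += ic
        for g in genes2:
            ic2[g] += ic
        # Every (D1 gene, D2 gene) pair annotated with this term shares its IC.
        for pair in product(genes1, genes2):
            ic12[pair] += ic
    sim = {(a, b): 0.5 * (s / ic1[a] + s / ic2[b]) for (a, b), s in ic12.items()}
    return ic1, ic2, ic12, sim


ic1, ic2, ic12, sim = exhaustive(TABLE)
print({g: round(v, 2) for g, v in sorted(ic1.items())})
print({g: round(v, 2) for g, v in sorted(ic2.items())})
```

The inner `product` loop is what makes this approach expensive: general terms such as GO:0005488 are shared by every gene, so they touch every pair.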
[Figure: 5 × 5 matrix of gene pairs (genes 1-5 of D1 against genes 1-5 of D2) with the accumulators IC-D1, IC-D2 and IC-D12.]
Similarity-based Ranking
sim(g1,g2) = 1
sim(g3,g4) = 0.82
$$simrank(g_1, g_2) = IC(g_1 \cap g_2) \cdot sim(g_1, g_2)$$
simrank(g1,g2) = 0.2353
simrank(g3,g4) = 14.0304
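The ranking score weights the similarity by the amount of shared information, so a perfect match on very general terms ranks below a slightly weaker match on specific terms. A minimal sketch follows; the function name and argument order are illustrative assumptions.

```python
def simrank(ic_g1, ic_g2, ic_shared):
    """Shared IC multiplied by the group-wise similarity."""
    sim = 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)
    return ic_shared * sim


# Identical but very general annotations: sim = 1, so simrank equals the shared IC.
print(round(simrank(0.2353, 0.2353, 0.2353), 4))  # 0.2353, as for (g1, g2) above
```

Because sim is at most 1, simrank can never exceed the shared IC, which in turn can never exceed the IC of either gene.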
Top-k Search
Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5
Top-5 after each step (pair = D1 gene, D2 gene; value = simrank):

Step 1      Step 2      Step 3
5,4  4.68   5,4  4.68   5,4  4.68
5,3  0.82   3,3  3.36   3,3  3.36
5,2  0.68   3,5  1.04   2,2  1.19
5,1  0.12   5,3  0.82   2,4  1.18
5,5  0.01   5,2  0.68   3,5  1.04

Gene    1     2     3     4     5
IC-D1   0.24  2     9.29  1.16  10.7
IC-D2   0.24  2.22  4.52  11.1  6.19
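The Top-5 lists above can be reproduced with a bounded min-heap. The sketch below visits D1 genes in order of decreasing IC and stops once a gene's IC can no longer beat the current k-th best score. That stopping bound (simrank ≤ IC(g1 ∩ g2) ≤ IC(g1)) is an assumption made for this sketch; the slides do not spell out the pruning rule, and the authors' algorithm may stop earlier. Rows are transcribed from the term table, with single-value rows assigned to D1 or D2 based on the annotation lists.

```python
import heapq
from collections import defaultdict

# (term, IC, gene ids in D1, gene ids in D2)
TABLE = [
    ("GO:0070012", 4.0000, [], [4]),
    ("GO:0070122", 3.5212, [5], []),
    ("GO:0055100", 3.0000, [5], [4]),
    ("GO:0004325", 2.7734, [], [5]),
    ("GO:0031406", 2.2228, [3], []),
    ("GO:0043177", 1.6616, [3], [5]),
    ("GO:0008233", 1.6305, [3, 5], [3, 4]),
    ("GO:0043168", 1.3777, [3], [3]),
    ("GO:0042562", 1.3472, [2, 5], [2, 4]),
    ("GO:0036094", 0.8873, [3], [5]),
    ("GO:0043167", 0.8624, [3], [3]),
    ("GO:0016829", 0.6347, [], [2, 5]),
    ("GO:0005515", 0.5123, [4, 5], [4]),
    ("GO:0016787", 0.4144, [2, 3, 4, 5], [3, 4]),
    ("GO:0005488", 0.1898, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
    ("GO:0003824", 0.0455, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
]


def top_k(table, k):
    """Return the k best (simrank, (d1_gene, d2_gene)) entries, best first."""
    terms1, terms2 = defaultdict(dict), defaultdict(dict)
    for term, ic, genes1, genes2 in table:
        for g in genes1:
            terms1[g][term] = ic
        for g in genes2:
            terms2[g][term] = ic
    ic1 = {g: sum(t.values()) for g, t in terms1.items()}
    ic2 = {g: sum(t.values()) for g, t in terms2.items()}

    heap = []  # min-heap holding at most k entries; heap[0] is the k-th best
    for g1 in sorted(ic1, key=ic1.get, reverse=True):
        # Assumed bound: no pair involving g1 can score above IC(g1).
        if len(heap) == k and ic1[g1] <= heap[0][0]:
            break
        for g2 in ic2:
            shared = sum(ic for t, ic in terms1[g1].items() if t in terms2[g2])
            score = shared * 0.5 * (shared / ic1[g1] + shared / ic2[g2])
            if len(heap) < k:
                heapq.heappush(heap, (score, (g1, g2)))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, (g1, g2)))
    return sorted(heap, reverse=True)


for score, pair in top_k(TABLE, 5):
    print(pair, round(score, 2))
```

On this toy data the loop skips only the last D1 gene (gene 1, IC 0.24); on large datasets with many low-IC genes, the same early exit is where the savings over the exhaustive pass would come from.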
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Results
Runtime – MF (438,406 entries with GO annotations)
UniProt – Swiss-Prot (Rel. 2014_02)
Method       Runtime
Baseline     > 2 days
Exhaustive   ~45 min
Top 10,000   2.5 to 4.5 min
Top 1,000    1 to 3.5 min
Top 100      15 sec to 2.5 min
Results (cont.)
• Compare the Top 10,000 matches against the results of ‘BLASTing’ Swiss-Prot against itself (e = 10^-4).
How does it compare to sequence similarity search?
[Figure: bar chart of the number of similar pairs in the Top 10,000 that are not included in the BLAST results (y-axis from 0 to 8,000), for MF-ALL and MF-CUR, with one series each for CORPUS and STRUCTURE.]