amanda-conway
Nothing in (computational) biology makes sense except in the light of evolution
after Theodosius Dobzhansky (1970)
Comparative genomics, genome context and genome annotation
Genome context analysis and genome annotation
Using information other than homologous relationships between individual genes/proteins for functional prediction (guilt by association)
Types of context analysis:
• phyletic patterns
• domain fusion (“Rosetta Stone” proteins)
• gene order conservation
• co-expression
• ….
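Domain fusion, one of the context types listed here, can be illustrated with a minimal Python sketch. It assumes homology hits are already available as (query, composite, start, end) tuples; the function name, the protein names, and the overlap threshold are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_stone_pairs(hits, max_overlap=10):
    """Link proteins that match non-overlapping regions of one composite protein.

    hits: iterable of (query, composite, start, end); start/end are the
    coordinates of the matched region on the composite (fused) protein.
    max_overlap is an assumed tolerance, in residues.
    """
    by_composite = defaultdict(list)
    for query, composite, start, end in hits:
        by_composite[composite].append((query, start, end))

    linked = set()
    for composite, regions in by_composite.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Length of the shared region; negative when the regions are apart.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                linked.add((tuple(sorted((q1, q2))), composite))
    return linked

# Hypothetical example: geneA and geneB cover different halves of fusedAB.
example_hits = [
    ("geneA", "fusedAB", 1, 200),
    ("geneB", "fusedAB", 210, 450),
    ("geneC", "fusedAB", 20, 180),
]
print(rosetta_stone_pairs(example_hits))
```

Pairs linked this way are candidates for a functional association only; the sketch does not filter out promiscuous domains, which a real analysis would need to handle.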
Goals:
• Using gene sets from complete genomes, delineate families of orthologs and paralogs - Clusters of Orthologous Groups (of genes) (COGs)
• Using COGs, develop an engine for functional annotation of new genomes
• Apply COGs for analysis of phylogenetic patterns
COG:
- group of homologous proteins such that all proteins from different species are orthologs (all proteins from the same species in a COG are paralogs)
CONSTRUCTION OF COGs FOR 8 COMPLETE GENOMES
Complete set of proteins from the analyzed genomes
1. FULL SELF-COMPARISON (BLASTPGP, no cut-off)
2. Collapse obvious paralogs
3. Detect all interspecies Best Hits (BeTs) between individual proteins or groups of paralogs
4. Detect all triangles of consistent BeTs
5. Merge triangles with common edges
6. Detect groups with multidomain proteins and isolate domains
REPEAT STEPS 3-5
COGs
A TRIANGLE OF BeTs IS A MINIMAL, ELEMENTARY COG
A RELATIVELY SIMPLE COG PRODUCED BY MERGING ADJACENT TRIANGLES
A COMPLEX COG WITH MULTIPLE PARALOGS
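The triangle detection and merging steps can be sketched in Python as follows. This is a simplified illustration, not the actual COG pipeline: it assumes best hits are already computed and stored in a dictionary, it requires best hits to be symmetric (a stricter condition than "consistent BeTs"), and it skips paralog collapsing and domain splitting. All names are hypothetical.

```python
from itertools import combinations

def find_triangles(best_hits, genome_of):
    """Find triples of proteins from three genomes that are mutual best hits.

    best_hits: dict mapping (protein, target_genome) -> best-hit protein.
    genome_of: dict mapping protein -> genome.
    Brute force over all triples, so suitable for small examples only.
    """
    def symmetric(a, b):
        return (best_hits.get((a, genome_of[b])) == b
                and best_hits.get((b, genome_of[a])) == a)

    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue  # a triangle needs three different genomes
        if symmetric(a, b) and symmetric(b, c) and symmetric(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share an edge (two proteins) into larger groups."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edge_owner = {}
    for idx, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            if edge in edge_owner:
                parent[find(idx)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = idx

    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())
```

Merging on shared edges rather than shared single proteins is what keeps two unrelated families apart when they touch only through one multidomain protein.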
Current status of the COGs
Prokaryotes
11 Archaea + 1 unicellular eukaryote + 46 bacteria = 58 complete genomes
149,321 proteins; 105,861 proteins in 4075 COGs (71%)
Eukaryotes
4 animals + 1 plant + 2 fungi + 1 microsporidium = 8 complete genomes
142,498 proteins; 74,093 proteins in 4822 COGs (52%)
COGnitor in action
The Universal COGs
Search for genomic determinants of hyperthermophily
Search for unique archaeo-eukaryotic genes
A complementary pattern: search for unique bacterial genes
Essential function… but holes in the phyletic pattern
Strict complementary pattern
Relaxed complementary pattern
Relaxed complementary pattern with extra restrictions
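The phyletic pattern searches named on these slides can be illustrated with a short Python sketch. It assumes each COG is represented by the set of genomes in which it is present; the COG names, genome codes, and the `max_exceptions` parameter are hypothetical, and the exact relaxation rules used in the original searches are not given in the slides.

```python
def cogs_with_pattern(patterns, required, forbidden):
    """COGs present in every 'required' genome and absent from every 'forbidden' one."""
    required, forbidden = set(required), set(forbidden)
    return sorted(cog for cog, present in patterns.items()
                  if required <= present and not (forbidden & present))

def complementary_pairs(patterns, genomes, max_exceptions=0):
    """Pairs of COGs whose presence patterns complement each other.

    Strict (max_exceptions=0): every genome has exactly one of the two COGs.
    Relaxed: up to max_exceptions genomes may have both COGs or neither.
    """
    genomes = set(genomes)
    names = sorted(patterns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = patterns[a] & genomes, patterns[b] & genomes
            both = pa & pb
            neither = genomes - (pa | pb)
            if pa and pb and len(both) + len(neither) <= max_exceptions:
                pairs.append((a, b))
    return pairs

# Hypothetical patterns over four genomes.
genomes = {"Eco", "Bsu", "Mja", "Sce"}
patterns = {
    "COG_X": {"Eco", "Bsu"},
    "COG_Y": {"Mja", "Sce"},
    "COG_Z": {"Mja", "Sce", "Bsu"},
    "COG_U": {"Eco", "Bsu", "Mja", "Sce"},
}
print(cogs_with_pattern(patterns, required={"Mja", "Sce"}, forbidden={"Eco", "Bsu"}))
print(complementary_pairs(patterns, genomes))
print(complementary_pairs(patterns, genomes, max_exceptions=1))
```

A complementary pair is a hint that two unrelated proteins may perform the same essential function in different lineages, which is one way to explain a hole in an otherwise universal pattern.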
Conservation of gene order in bacterial species of the same genus
M. genitalium vs M. pneumoniae
[Figure: gene order comparison plot; axis ticks run from 1 to 601 and from 1 to 401]
Conservation of gene order in closely related bacterial genera
C. trachomatis vs C. pneumoniae
[Figure: gene order comparison plot; axis ticks run from 1 to 1001 and from 1 to 801]
Lack of gene order conservation - even in “closely related” bacteria of the same Proteobacterial subdivision
P. aeruginosa vs E. coli
[Figure: gene order comparison plot; axes labeled ecoli (ticks 1 to 4201) and paer (ticks 1 to 5501); legend bins: <0.3, 0.3-0.8, 0.8-1.3, >1.3]
Genome Alignments - Method
Protein sets from complete genomes
BLAST cross-comparison
Table of Hits
Pairwise Genome Alignment: local alignment algorithm Lamarck (gap opening penalty, gap extension penalty); statistics with Monte Carlo simulations
Template-Anchored Genome Alignment
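The idea of aligning two genomes as ordered lists of genes can be sketched in Python with a Smith-Waterman style local alignment. This is not the Lamarck program: it uses a single linear gap penalty instead of separate opening and extension penalties, the scoring values are arbitrary assumptions, and the shuffling test at the end is only one plausible reading of "statistics with Monte Carlo simulations". All function names are hypothetical.

```python
import random

def align_gene_orders(genes1, genes2, hits, match=2.0, mismatch=-1.0, gap=-1.0):
    """Best local alignment of two gene orders; 'hits' is a set of (gene1, gene2) pairs."""
    n, m = len(genes1), len(genes2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if (genes1[i - 1], genes2[j - 1]) in hits else mismatch
            options = [
                (0.0, None),                          # start a new local alignment
                (score[i - 1][j - 1] + pair, "diag"),  # align the two genes
                (score[i - 1][j] + gap, "up"),         # skip a gene in genome 1
                (score[i][j - 1] + gap, "left"),       # skip a gene in genome 2
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell, collecting the aligned homologous pairs.
    pairs, cell = [], best_cell
    while cell is not None and move[cell[0]][cell[1]] is not None:
        i, j = cell
        step = move[i][j]
        if step == "diag":
            if (genes1[i - 1], genes2[j - 1]) in hits:
                pairs.append((genes1[i - 1], genes2[j - 1]))
            cell = (i - 1, j - 1)
        elif step == "up":
            cell = (i - 1, j)
        else:
            cell = (i, j - 1)
    pairs.reverse()
    return best, pairs

def shuffle_p_value(genes1, genes2, hits, trials=200, seed=0):
    """Fraction of shuffled gene orders scoring at least as high as the real one."""
    rng = random.Random(seed)
    observed, _ = align_gene_orders(genes1, genes2, hits)
    shuffled = list(genes2)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if align_gene_orders(genes1, shuffled, hits)[0] >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)
```

With a table of hits as input, the returned pairs form one candidate conserved gene string; repeating the search after masking those genes would yield further strings.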
Genome Alignments - Statistics
Distribution of conserved gene string lengths
[Figure: histogram; x axis: string length, 2 to 20 and >20; y axis: 0.0 to 0.5; series: cpneu-ctra, mjan-mthe, bsub-ecoli, drad-aero]
Genome Alignments - Statistics
Pairwise alignments   No. strings   No. genes   % in Gen1   % in Gen2
all homologs
ecoli-hinf            138           566         13%         33%
ecoli-bsub            89            322         8%          8%
ecoli-mjan            10            30          1%          2%
probable orthologs
ecoli-hinf            105           482         11%         28%
ecoli-bsub            34            168         4%          4%
ecoli-mjan            12            33          1%          2%
Genome Alignments - Statistics
Breakdown of genes in the genome
[Figure: stacked bar chart, one bar per genome; y axis: 0 to 5000; legend: Not in gene strings; In non-conserved gene strings (directons); In conserved gene strings]
Genome Alignments - Statistics
Fraction of the genome in conserved gene strings - from template-anchored alignments
Minimum   Synechocystis sp.        5%
          Aquifex aeolicus         10%
          Archaeoglobus fulgidus   13%
          Escherichia coli         14%
          Treponema pallidum       17%
Maximum   Thermotoga maritima      23%
          Mycoplasma genitalium    24%
Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0536)
L21, L27, GTPase?
GTP-binding translation factor
Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0012)
TGS domain-containing GTPase?, Peptidyl-tRNA hydrolase
GTP-binding translation factor
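The guilt-by-association reasoning behind these predictions can be sketched in Python: for an uncharacterized COG, count in how many genomes each other gene family shares a conserved gene string with it. The data below are invented for illustration and do not reproduce the actual genome alignments; the function name and the input layout are assumptions.

```python
from collections import Counter

def neighbor_profile(strings_by_genome, target):
    """Count, per partner family, the genomes where it shares a gene string with 'target'.

    strings_by_genome: dict mapping genome -> list of gene strings,
    each gene string being a list of family labels (e.g. COG IDs).
    """
    counts = Counter()
    for strings in strings_by_genome.values():
        partners = set()
        for string in strings:
            if target in string:
                partners.update(label for label in string if label != target)
        counts.update(partners)  # each genome contributes at most 1 per partner
    # Most frequent partners first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical gene strings in three hypothetical genomes.
example = {
    "genome1": [["L21", "L27", "COG0536"], ["geneX", "geneY"]],
    "genome2": [["L21", "L27", "COG0536", "geneZ"]],
    "genome3": [["L27", "COG0536"]],
}
print(neighbor_profile(example, "COG0536"))
```

A target that keeps turning up next to ribosomal protein genes across distant genomes is a reasonable candidate for a translation-related role, which is the kind of inference the two slides above present.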