Large scale proteome comparisons Genome trees Fredj Tekaia Institut Pasteur [email protected]


Complete genomes

• 1387 projects

261 published (01-03-05)

• 654 prokaryotes

• 472 eukaryotes

Figure labels on the tree of life: 207, 21 and 33 (together the 261 published genomes).

Tree of life

Source: GOLD, http://www.genomesonline.org/

Cumulated number of available completely sequenced genomes

Year  Genomes
1995  2
1996  5
1997  12
1998  19
1999  24
2000  42
2001  71
2002  116
2003  165
2004  224
03-2005  261

List and references

The number of completely sequenced genomes spanning the three domains of life is growing rapidly.

http://www.genomesonline.org/

http://www.genomesonline.org/CompleteGenomesList.html

Genome sequencing projects

There are several web-based resources that document the progress of completely sequenced genomes and their reference publication, including:

GOLD: Genomes OnLine Database, http://wit.integratedgenomics.com/GOLD/

GNN: Genome News Network, http://www.genomenewsnetwork.org/index.php



Resources for genomes

There are two main resources for genomes:

EBI: European Bioinformatics Institute, http://www.ebi.ac.uk/genomes/

NCBI: National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov

Many other resources are provided by sequencing institutions:

Sanger: The Wellcome Trust Sanger Institute, http://www.sanger.ac.uk/

TIGR: The Institute for Genomic Research, http://www.tigr.org

Genolevures: http://cbi.labri.fr/Genolevures/index.php

Definitions

Genome

The genome of a cell is the complete set of DNA it contains. The genome size is the total number of its DNA bases.

Gene

A particular DNA sequence located at a specific position on a chromosome that codes for a specific function.

Protein

A sequence of amino acids whose order is specified by the DNA sequence of the gene that codes for it.

Proteome

The set of proteins in an organism.

Genomics

The exhaustive study of genomes: genetic material, genes, their functions, their organization, ...

Chronology of completely sequenced genomes

• 1977: first viral genome: Sanger et al. sequence bacteriophage φX174 (5386 base pairs, encoding 11 genes).

• 1981: human mitochondrial genome: 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs).

• 1986: chloroplast genome: 156,000 base pairs (most are 120 kb to 200 kb).

• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR: 1830 kb, 1713 genes.

• 1996: first archaeal genome, Methanococcus jannaschii DSM 2661, by TIGR: 1664 kb, 1773 genes.

• 1997: first eukaryotic genome, Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes.

• 1998: first multicellular organism, the nematode Caenorhabditis elegans: 97 Mb, ~19,000 genes.

• 1999: first human chromosome, chromosome 22 (49 Mb, 673 genes).

• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).

• 2000: first plant genome, Arabidopsis thaliana (115.428 Mb; 22670 genes).

• 2001: draft sequence of the human genome (x Mb; ~28000 genes).

• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes).

• 2002: mouse genome (x Mb; ~28000 genes).

• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb; ~28000 genes).

How big are genome sizes?

Viral genomes: 1 kb to 350 kb (Mimivirus: 1.2 Mb)

Bacterial genomes: 0.5 Mb to 13 Mb;

Eukaryotic genomes: 8 Mb to 670 Gb;

DOGS: http://www.cbs.dtu.dk/databases/DOGS/abbr_table.bysize.txt


Comparative genomics

Analysing the genetic material of different species helps us understand the similarities and differences between genomes, their evolution and the evolution of their genes.

• Intra-genomic comparisons reveal the degree of duplication (genome regions, genes) and the organization of genes.

• Inter-genomic comparisons reveal the degree of similarity between genomes and the degree of conservation between genes.

• Together they help us understand gene and genome evolution.

Evolution

[Figure: from an ancestor to the species genome]

Evolutionary processes include: phylogeny, duplication (expansion), genesis, horizontal gene transfer (HGT, exchange), loss (deletion), and selection.

Gene duplications are traditionally considered to be a major evolutionary source of new protein functions.

Understanding how duplications happened and how important this evolutionary process is remains a key goal of genome analysis.

Some examples

[Figure: S. cerevisiae genome; colours reveal duplications. Kellis et al. Nature, 2004]

[Figure: speciation, duplication and deletion; current content of the 2 copies; reconstruction of the ancestral organization]

Splitting pairs: the diverging fates of duplicated genes. Nature Reviews Genetics 3:827-837 (2002).

Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

[Figure: original version and current version]

Genome duplication.

a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows).

b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes.

Jaillon et al. Nature 431:946-957 (2004).

Inter-genomic comparisons

• Compositional comparisons between species (nucleotide and amino acid compositions);

• Gene and protein conservation between species (rate of conservation);

• Orthologs; families of orthologs;

• Gene dictionary;

• Gene conservation profiles;

• Genome tree construction;

• Genome multiple alignments;

• Specific and non-specific genes;

• Genes exclusively conserved in one species or in a subset of species (or in domains).

Methodology

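[Figure: a data matrix T with columns 1, ..., i, ..., p and rows 1, ..., j, ..., n, entries kij > 0, plus supplementary elements (sup), is analysed by Correspondence Analysis, giving factorial axes F1, ..., Fp, followed by Classification.]

Matrix T: kij > 0

Correspondence Analysis

Classification

• orthogonal system;

• use of Euclidean distance;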

Amino Acid composition

org Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr
sc 5.5 4.4 6.1 5.8 1.3 3.9 6.5 5.0 2.1 6.6 9.6 7.3 2.1 4.6 4.3 9.0 5.8 1.1 3.3
ce 6.2 5.2 4.9 5.2 2.1 4.1 6.4 5.3 2.3 6.2 8.7 6.5 2.6 5.0 4.9 8.0 5.8 1.1 3.2
dm 7.5 5.6 4.7 5.2 1.9 5.2 6.4 6.2 2.7 4.9 9.2 5.6 2.4 3.6 5.5 8.3 5.6 1.0 3.0
ca 5.0 3.7 6.7 5.9 1.1 4.5 6.4 5.1 2.1 7.1 9.2 7.3 1.8 4.4 4.5 9.0 6.2 1.0 3.5
sp 6.3 4.9 5.2 5.4 1.5 3.8 6.6 5.0 2.3 6.1 9.9 6.5 2.1 4.6 4.8 9.4 5.4 1.1 3.4
ath 6.2 5.5 4.4 5.4 1.9 3.5 6.7 6.3 2.3 5.4 9.5 6.4 2.4 4.3 4.7 9.0 5.1 1.3 2.9
hs 7.0 5.6 3.7 4.9 2.2 4.7 7.0 6.6 2.5 4.4 9.8 5.7 2.2 3.7 6.1 8.0 5.3 1.2 2.8

mj 5.5 3.9 5.3 5.5 1.3 1.5 8.7 6.3 1.4 10.4 9.5 10.4 2.2 4.2 3.4 4.5 4.0 0.7 4.4
mth 7.3 6.8 3.3 5.9 1.2 1.9 8.1 8.0 1.9 7.7 9.5 4.6 2.9 3.6 4.3 6.1 5.0 0.8 3.2
af 7.8 5.8 3.2 4.9 1.2 1.8 8.9 7.2 1.5 7.2 9.5 6.9 2.6 4.6 3.9 5.5 4.2 1.0 3.6
ph 6.4 5.5 3.5 4.3 0.6 1.6 8.3 7.0 1.5 8.8 10.3 7.7 2.4 4.6 4.5 5.9 4.5 1.2 3.8
pa 6.7 5.7 3.3 4.6 0.6 1.7 8.8 7.3 1.5 8.5 10.2 7.8 2.4 4.4 4.3 5.0 4.2 1.2 3.8
ape 9.7 7.8 2.0 4.2 0.8 1.8 7.3 8.8 1.6 5.5 11.0 3.9 2.2 2.9 5.5 6.7 4.3 1.3 3.5
ssp2 5.6 4.7 5.0 4.7 0.6 2.1 6.8 6.4 1.3 9.4 10.3 7.7 2.2 4.4 3.8 6.7 4.7 1.1 4.8
pfu 6.6 5.3 3.5 4.4 0.6 1.8 8.9 7.1 1.5 8.7 10.1 8.1 2.2 4.4 4.3 4.9 4.4 1.2 4.0
sto 5.6 4.2 4.9 4.6 0.7 2.1 7.0 6.3 1.3 9.9 10.3 8.0 2.1 4.5 3.9 6.7 4.8 1.0 4.9
pyae 9.9 6.5 2.6 4.3 0.9 2.1 7.0 7.7 1.5 6.3 10.5 5.7 1.9 3.6 5.0 4.9 4.4 1.5 4.3
ta 7.0 5.5 4.3 5.7 0.6 2.2 6.0 7.3 1.6 9.0 8.4 5.6 3.2 4.7 4.0 7.6 4.8 0.9 4.6
tv 6.4 4.7 4.8 5.5 0.6 2.1 6.4 7.0 1.5 9.2 8.8 6.9 2.7 4.7 3.8 7.5 4.8 0.8 4.8
h 13.1 6.5 2.1 9.0 0.7 2.6 6.7 8.5 2.2 3.6 8.3 1.6 1.7 3.1 4.7 5.2 6.8 1.1 2.5

Tekaia, F., Yeramian, E. and Dujon B. (2002) Gene. 297 pp. 51-60.
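As a rough illustration of how a composition table like the one above could be computed, the sketch below counts residues over a list of protein sequences and reports percentages. The function name `aa_composition` and the two toy sequences are hypothetical and chosen only for the example; they are not taken from the cited study.

```python
from collections import Counter

def aa_composition(proteins):
    """Return the percent amino acid composition over a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())  # count one-letter residue codes
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}

# Toy "proteome" (hypothetical sequences, not real data)
toy_proteome = ["MKKLE", "MEEKR"]
comp = aa_composition(toy_proteome)

# Combined frequency of the charged residues Lys (K) and Arg (R)
lys_arg = comp.get("K", 0.0) + comp.get("R", 0.0)

print(comp)
print("Lys+Arg = %.1f%%" % lys_arg)
```

In practice the sequences would be read from a proteome FASTA file, and the one-letter codes mapped to the three-letter names used in the table.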

org Glu Gln Lys+Arg
mj 8.7 1.5 14.3
mth 8.1 1.9 11.4
af 8.9 1.8 12.7
ph 8.3 1.6 13.2
pa 8.8 1.7 13.5
apem 7.3 1.8 11.7
ssp2 6.8 2.1 12.4
pfu 8.9 1.8 13.4
sto 7.0 2.1 12.2
pyae 7.0 2.1 12.2
ta 6.0 2.2 11.1
tv 6.4 2.1 11.6
ae 9.6 2.0 14.3
tm 8.9 2.0 13.1

[Figure: plot annotated with GC% and growth t°; labelled points Glu, Gln, Arg and Lys; r = 0.83, p < 1.e-4]

[Figure, 2005: plot annotated with growth t° and GC%]

PE, PPE families

Protein size statistics

Dom  n org  mean  std  n prot  min  max
E  38  443.1  403.6  364538  10  9638
A  19  279.9  199.6  42499  10  4436
B  53  311.2  233.2  155538  11  7463

Proteome comparisons:

Methodology

Species specific comparisons

Each new proteome NP is compared with every proteome Pi (proteome1, ..., proteomen) in both directions, using blastp (pam250, SEG filter).

For each proteome Pi, the comparisons produce:

• NP against Pi: bestnppi, allnppi, segmatchnppi

• Pi against NP: bestpinp, allpinp, segmatchpinp

File contents:

bestnppi
np1 size pij e-value1 HS/IS/NS

allnppi
np1 size pij e-value1 HS/IS/NS
np1 size pik e-value HS/IS/NS

The expected number of HSPs with score at least S is given by E = K·m·n·e^(-λS), where m and n are the query sequence and database lengths, and K and λ are the statistical parameters of the scoring system.
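To make the E-value formula concrete, here is a minimal sketch that evaluates E = K·m·n·e^(-λS). The default values of `K` and `lam` are placeholder assumptions for illustration only; the slides do not state which values were used, and in practice BLAST reports them for the chosen matrix and gap penalties. The lengths reuse the S. cerevisiae example from these slides (query of 642 letters, database of 2,798,770 letters).

```python
import math

def expected_hsps(score, m, n, K=0.041, lam=0.267):
    """Expected number of HSPs with score >= `score`.

    m, n : query and database lengths
    K, lam : Karlin-Altschul parameters (placeholder defaults, assumed here)
    """
    return K * m * n * math.exp(-lam * score)

# Query YAL005c (642 letters) against a 2,798,770-letter database
for s in (50, 100, 200):
    print(s, expected_hsps(s, m=642, n=2798770))
```

The E-value falls exponentially with the score and grows linearly with the size of the search space m·n, which is why the same alignment is less significant in a larger database.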

• Paralogs • Orthologs

SPECSO

100 species:

E:28, A: 19, B: 53

Dom  Code  Size  Organism  Taxonomic class
E  SC  5829  S. cerevisiae  Ascomycota
E  SP  4962  S. pombe  Ascomycota
E  NCU  10082  Neurospora crassa  Ascomycota
E  CALBI  6165  C. albicans  Ascomycota
E  MGR  11109  Magnaporthe grisea  Ascomycota
E  FG  11640  Fusarium graminearum  Ascomycota
E  AN  9541  Aspergillus nidulans  Ascomycota
E  ECUN  1996  E. cuniculi  Microsporidia
E  CE  20844  C. elegans  Eumetazoa
E  CBR  14713  Caenorhabditis briggsae  Eumetazoa
E  DM  17878  D. melanogaster  Arthropoda
E  AG  16112  Anopheles gambiae  Arthropoda
E  ATH  22671  A. thaliana  Streptophyta
E  HS  27625  Homo sapiens  Eumetazoa
E  MUS  28097  Mus musculus  Chordata
E  FR  33609  Fugu rubripes  Eumetazoa
E  PF  5334  P. falciparum  Apicomplexa
E  CI  15851  Ciona intestinalis  Eumetazoa
E  RN  21205  Rattus norvegicus  Eumetazoa

A  MJ  1773  M. jannaschii  Methanococci
A  MTH  1871  M. thermoautotrophicum  Euryarchaeota
A  AF  2409  A. fulgidus  Euryarchaeota
A  PH  2061  P. horikoshii OT3  Euryarchaeota
A  PA  1765  P. abyssi  Euryarchaeota
A  APEM  1865  A. pernix K1  Crenarchaeota
A  TA  1478  Thermoplasma acidophilum  Euryarchaeota
A  TV  1526  Thermoplasma volcanium  Euryarchaeota
A  H  2058  Halobacterium sp. NRC-1  Euryarchaeota
A  SSP2  2977  Sulfolobus solfataricus P2  Crenarchaeota
A  PFU  2208  P. furiosus  Euryarchaeota
A  STO  2826  Sulfolobus tokodaii  Crenarchaeota
A  PYAE  2605  Pyrobaculum aerophilum  Crenarchaeota
A  MA  4528  Methanosarcina acetivorans (C2A)  Euryarchaeota
A  MK  1687  Methanopyrus kandleri AV19  Euryarchaeota
A  MMA  3371  Methanosarcina mazei strain Goe1  Euryarchaeota
A  MBUR  2676  Methanococcoides burtonii  Euryarchaeota
A  MFR  2911  Methanogenium frigidum  Euryarchaeota
A  NEK  563  Nanoarchaeum equitans  Nanoarchaeota

B  MM  7275  Mesorhizobium loti  Alphaproteobacteria
B  SM  6205  Sinorhizobium meliloti  Alphaproteobacteria
B  AGRT  5299  Agrobacterium tumefaciens  Alphaproteobacteria
B  MB  3953  Mycobacterium bovis  Actinobacteria
B  SCO  7810  Streptomyces coelicolor  Actinobacteria
B  UU  614  Ureaplasma urealyticum  Mycoplasmatales
B  SHFL  4068  Shigella flexneri  Gammaproteobacteria
B  LL  2321  Lactococcus lactis subsp. lactis  Bacilli
B  RCO  1374  Rickettsia conorii Malish 7  Alphaproteobacteria
B  CCR  3737  Caulobacter crescentus CB15  Alphaproteobacteria
B  NOS  5366  Nostoc sp  Cyanobacteria
B  TSE  2475  Thermosynechococcus elongatus BP-1  Cyanobacteria
B  TTE  2588  Thermoanaerobacter tengcongensis strain MB4T  Clostridia
B  BFL  583  Candidatus Blochmannia floridanus  Gammaproteobacteria
B  PRO  1882  Prochlorococcus marinus subsp. marinus str.  Cyanobacteria
B  PMT  2265  Prochlorococcus marinus str. MIT 9313  Cyanobacteria
B  PMM  1712  Prochlorococcus marinus subsp. pastoris str.  Cyanobacteria
B  WS  2044  Wolinella succinogenes  Epsilonproteobacteria
B  PL  4683  Photorhabdus luminescens subsp. laumondii  Gammaproteobacteria

B  HI  1713  H. influenzae  Gammaproteobacteria
B  MG  479  M. genitalium  Mycoplasmatales
B  MP  677  M. pneumoniae  Mycoplasmatales
B  Ssp  3168  Synechocystis sp.  Cyanobacteria
B  EC  4290  E. coli  Gammaproteobacteria
B  HP  1577  H. pylori  Epsilonproteobacteria
B  BS  4100  B. subtilis  Bacillus
B  BH  4066  Bacillus halodurans  Bacillus
B  BB  1639  B. burgdorferi  Spirochaetes
B  AE  1522  A. aeolicus  Aquificales
B  MT  3996  M. tuberculosis H37R  Actinobacteria
B  MTC  4203  M. tuberculosis CDC 1551  Actinobacteria
B  ML  1604  Mycobacterium leprae  Actinobacteria
B  TP  1031  T. pallidum  Spirochaetes
B  CT  877  C. trachomatis  Chlamydiae
B  RP  837  R. prowazekii  Alphaproteobacteria
B  CJ  1634  C. jejuni  Epsilonproteobacteria
B  CP  1052  C. pneumoniae  Chlamydiae
B  TM  1849  T. maritima  Thermotogae
B  DR  3117  D. radiodurans  Deinococcus-Thermus
B  NM  2081  N. meningitidis  Betaproteobacteria
B  XF  2830  Xylella fastidiosa  Gammaproteobacteria
B  VC  3837  Vibrio cholerae  Gammaproteobacteria
B  PAE  5570  Pseudomonas aeruginosa  Gammaproteobacteria
B  B  575  Buchnera sp.  Gammaproteobacteria
B  LMO  2846  Listeria monocytogenes  Bacilli
B  LIN  2968  Listeria innocua  Bacilli
B  STY  4395  Salmonella Typhi  Gammaproteobacteria
B  YP  3895  Yersinia pestis  Gammaproteobacteria
B  SAMU50  2714  Staphylococcus aureus Mu50  Bacilli
B  SAN315  2594  Staphylococcus aureus N315  Bacilli
B  SPY  1696  Streptococcus pyogenes M1  Bacilli

Homolog - Paralog - Ortholog

[Figure: an ancestral gene O gives rise to genes A and B; Species-1 (S1) carries A1 and B1, Species-2 (S2) carries A2 and B2]

Homologs: A1, B1, A2, B2

Paralogs: A1 vs B1 and A2 vs B2

Orthologs: A1 vs A2 and B1 vs B2

Sequence analysis

Example

Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome

BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic (642 letters)

Database: S. cerevisiae proteome version 22/05/2002
5829 sequences; 2,798,770 total letters
................................................
Sequences producing significant alignments:  Score (bits)  E Value

YAL005c SSA1 heat shock protein of HSP70 family, cyt... 674 0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt... 663 0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt... 589 e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt... 588 e-169
YJL034w KAR2 nuclear fusion protein 480 e-136
YDL229w SSB1 heat shock protein of HSP70 family 428 e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt... 427 e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel... 336 5e-93
YEL030w heat shock protein of HSP70 family 324 2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70 296 4e-81
YBR169c SSE2 heat shock protein of the HSP70 family 173 7e-44
YPL106c SSE1 heat shock protein of HSP70 family 172 1e-43
YHR064c regulator protein involved in pleiotro... 143 6e-35
YKL073w LHS1 chaperone of the ER lumen 100 4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex... 33 0.13
...................

SC vs SC

bestscsc (SC / SC)

YAL002w 1176 - NS
YAL003w 206 - NS
YAL004w 215 - NS
YAL005c 642 YLL024c HS 0.0
YAL007c 215 YOR016c HS 1e-44

allscsc (SC / SC)

YAL002w 1176 - NS
YAL003w 206 - NS
YAL004w 215 - NS
YAL005c 642 YLL024c HS 0.0
YAL005c 642 YER103w HS 0.0
YAL005c 642 YBL075c HS 0.0
YAL005c 642 YJL034w HS e-147
YAL005c 642 YDL229w HS e-130
YAL005c 642 YNL209w HS e-130
YAL005c 642 YJR045c HS e-100
YAL005c 642 YEL030w HS 2e-96
YAL005c 642 YLR369w HS 1e-87
YAL005c 642 YBR169c HS 2e-47
YAL005c 642 YPL106c HS 4e-47
YAL005c 642 YHR064c HS 7e-38
YAL005c 642 YKL073w HS 5e-24
YAL007c 215 YOR016c HS 1e-44
YAL007c 215 YGL200c IS 5e-05
YAL007c 215 YHR110w IS 0.017
YAL007c 215 YDL018c IS 0.021

- Paralogs - multiple matches

- Partitions/clustering

Multiple matches of sc in sc

ORF  matches in sc
YAL005c 13
YAL007c 1
YDR214w 1
YDR216w 2
YDR399w 1
YDR406w 9
YDR409w 1
YCR040w 1
YKL218c 1
YKL219w 14
YKL220c 6
YKL221w 2
YKL222c 3
YKL223w 5
YKL224c 22
YKR001c 2
YKR003w 5
YBR104w 6
YBR105c 1
YKR013w 2
YKR014c 13
...
Max: YDR477w 77

bestscce (SC / CE)

YAL002w 1176 C42C1.4 HS 2e-15
YAL003w 206 F54H12.6 HS 4e-22
YAL004w 215 - NS
YAL005c 642 F26D10.3 HS e-172
YAL007c 215 F57B10.5 HS 9e-08
YAL009w 259 F16D3.7 IS 0.013
YAL019w 1131 M03C11.8 HS 7e-92
YAL020c 333 F07C3.4 IS 7e-04
YAL021c 837 ZC518.3 HS 5e-47

allscce (SC / CE)

YAL002w 1176 C42C1.4 HS 2e-15
YAL003w 206 F54H12.6 HS 4e-22
YAL003w 206 Y41E3.10 HS 2e-17
YAL004w 215 - NS
YAL005c 642 F26D10.3 HS e-172
YAL005c 642 F44E5.4 HS e-153
YAL005c 642 F44E5.5 HS e-153
YAL005c 642 C12C8.1 HS e-152
YAL005c 642 C15H9.6 HS e-148
YAL005c 642 F43E2.8 HS e-144
YAL005c 642 C37H5.8 HS e-104
YAL005c 642 F11F1.1 HS 1e-77
YAL005c 642 F54C9.2 HS 4e-51
YAL005c 642 K09C4.3 HS 4e-47
YAL005c 642 T28F3.2 HS 2e-45
YAL005c 642 C30C11.4 HS 7e-43
YAL005c 642 T24H7.2 HS 2e-34
YAL005c 642 T14G8.3 HS 8e-33

bestcesc (CE / SC)

C42C1.4 1259 YAL002w HS 8e-16
F54H12.6 213 YAL003w HS 4e-20
F26D10.3 640 YER103w HS e-174
F57B10.5 203 YAL007c HS 7e-13
F16D3.7 516 YHL003c IS 9e-04
M03C11.8 1038 YAL019w HS 2e-87
AC3.1 356 - NS
AC3.2 949 YLR189c IS 0.038
AC3.3 425 - NS
AC3.4 600 YNL326c HS 1e-12

allcesc (CE / SC)

C42C1.4 1259 YAL002w HS 8e-16
F54H12.6 213 YAL003w HS 4e-20
F26D10.3 640 YER103w HS e-174
F26D10.3 640 YBL075c HS e-174
F26D10.3 640 YLL024c HS e-172
F26D10.3 640 YAL005c HS e-171
F26D10.3 640 YJL034w HS e-141
F26D10.3 640 YDL229w HS e-129
F26D10.3 640 YNL209w HS e-129
F26D10.3 640 YJR045c HS e-100
F26D10.3 640 YEL030w HS 2e-97
F26D10.3 640 YLR369w HS 1e-83
F26D10.3 640 YPL106c HS 2e-45
F26D10.3 640 YBR169c HS 5e-45
F26D10.3 640 YHR064c HS 8e-36
F26D10.3 640 YKL073w HS 3e-22

SC/CE CE/SC Orthologs
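One common way to combine the two directions of best matches into ortholog candidates is the reciprocal best hit criterion. The slides do not state the exact rule used, so the sketch below is an assumption: a pair is kept only when each protein is the other's best match. The dictionaries reuse the best matches listed in the bestscce and bestcesc examples; the function name is hypothetical.

```python
# Best matches SC -> CE and CE -> SC, taken from the bestscce / bestcesc examples
best_sc_ce = {
    "YAL002w": "C42C1.4",
    "YAL003w": "F54H12.6",
    "YAL005c": "F26D10.3",
    "YAL007c": "F57B10.5",
    "YAL009w": "F16D3.7",
    "YAL019w": "M03C11.8",
}
best_ce_sc = {
    "C42C1.4": "YAL002w",
    "F54H12.6": "YAL003w",
    "F26D10.3": "YER103w",
    "F57B10.5": "YAL007c",
    "F16D3.7": "YHL003c",
    "M03C11.8": "YAL019w",
}

def reciprocal_best_hits(a_to_b, b_to_a):
    """Keep pairs (a, b) where b is a's best match and a is b's best match."""
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

orthologs = reciprocal_best_hits(best_sc_ce, best_ce_sc)
for sc_gene, ce_gene in orthologs:
    print(sc_gene, ce_gene)
```

Here YAL005c is not retained because its best C. elegans match, F26D10.3, has YER103w rather than YAL005c as its own best yeast match, which illustrates how recent paralogs complicate ortholog assignment.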

segmatchSCCE

Test siz Hit siz e-val %id %sim gap Ssiz dT eT dH eH
YAL002w 1176 C42C1.4 1259 5e-14 16 44 7 674 438 1111 547 1196
YAL005c 642 F26D10.3 640 1e-159 73 84 0 605 3 607 5 613
YAL005c 642 F44E5.5 645 1e-142 63 79 0 605 3 607 5 611
YAL005c 642 F44E5.4 645 1e-142 63 79 0 605 3 607 5 611
YAL005c 642 C12C8.1 643 1e-141 62 79 0 605 3 607 5 611
YAL005c 642 C15H9.6 661 1e-137 60 78 1 603 5 607 36 641
YAL005c 642 F43E2.8 657 1e-134 58 76 1 606 1 606 29 637
YAL005c 642 C37H5.8 657 1e-96 46 67 2 606 2 607 31 632
YAL005c 642 F11F1.1b 607 1e-73 36 60 0 599 4 602 2 600
YAL005c 642 F11F1.1a 614 8e-72 36 60 2 599 4 602 2 607
YAL005c 642 F54C9.2 469 3e-47 38 66 2 379 2 380 52 433
YAL005c 642 K09C4.3 310 2e-43 71 88 0 186 4 189 6 192
YAL005c 642 K09C4.3 310 1e-04 54 70 61 327 387 189 249
YAL005c 642 C30C11.4 776 1e-39 26 50 8 600 5 604 4 647
YAL005c 642 T24H7.2 925 1e-31 24 50 3 506 4 509 26 548
YAL005c 642 T14G8.3 926 3e-30 24 51 6 510 4 513 28 560

Partitions/MCL Clustering

A set of genes defines a "partition" if and only if:

a) each member of the set has at least one significant match with another member of the set;

b) no member of the set has significant matches with members not included in the set;

c) the set is minimal.

MCL: Markov Cluster algorithm

Stijn van Dongen: A cluster algorithm for graphs. http://micans.org/mcl/

[Figure labels: partitions P7.1 and P4.1; MCL clusters P7.1.C4.1, P7.1.C3.1 and P4.2.C3.1]

• Each gene is identified by its partition and its MCL cluster
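Read as a graph problem, the three conditions in the partition definition describe the connected components of the graph whose nodes are genes and whose edges are significant matches. The sketch below is one possible way to compute them under that reading; the function name `partitions` is hypothetical, and the match list is only a small subset of the significant hits shown in the clustering examples.

```python
from collections import defaultdict

def partitions(matches):
    """Group genes into partitions (connected components of the match graph)."""
    graph = defaultdict(set)
    for a, b in matches:
        graph[a].add(b)
        graph[b].add(a)  # treat a significant match as an undirected edge
    seen, parts = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts

# Subset of significant matches (gene pairs only, E-values omitted)
matches = [
    ("YKL212w", "YIL002c"), ("YIL002c", "YOR109w"), ("YIL002c", "YNL106c"),
    ("YIL002c", "YOL065c"), ("YKL212w", "YNL325c"), ("YMR207c", "YNR016c"),
]
parts = partitions(matches)
for p in parts:
    print(len(p), p)
```

Because only a subset of matches is used, the second group appears here with two members; with the full match list it would merge with the other genes it is linked to.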

Markov Cluster (MCL) algorithm

http://micans.org/mcl/

• Traditionally, most methods deal with similarity relationships in a pairwise manner, whereas graph theory allows proteins to be classified into families based on a global treatment of all relationships in similarity space simultaneously.

• Similarities between proteins are arranged in a matrix that represents a connection graph.

• Nodes of the graph represent proteins, and edges represent the sequence similarity that connects such proteins.

• A weight is assigned to each edge by taking -log10(E-value) obtained from a BLAST comparison.

• These weights are transformed into probabilities associated with a transition from one protein to another within this graph.

• This matrix is passed through iterative rounds of matrix multiplication and matrix inflation until there is little or no net change in the matrix.

• The final matrix is then interpreted as a protein family clustering.

• The inflation value parameter of the MCL algorithm is used to control the granularity of these clusters.
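The steps listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference `mcl` program: the self-loop rule, the convergence tolerance and the way clusters are read from the final matrix are assumptions made for the example, and the weights are hypothetical -log10(E-value) scores.

```python
def _normalize(M):
    """Rescale each column so that it sums to 1 (transition probabilities)."""
    n = len(M)
    col = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / col[j] for j in range(n)] for i in range(n)]

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a weighted graph given as a symmetric, non-negative matrix."""
    n = len(weights)
    M = [row[:] for row in weights]
    for j in range(n):
        # Self-loop (assumed rule): strongest edge of the node, or 1.0 if isolated
        M[j][j] = max(M[i][j] for i in range(n)) or 1.0
    M = _normalize(M)
    for _ in range(max_iter):
        # Expansion: one round of matrix multiplication (M times M)
        expanded = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then column renormalisation
        inflated = _normalize([[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - M[i][j])
                     for i in range(n) for j in range(n))
        M = inflated
        if change < tol:
            break
    clusters = []
    for i in range(n):
        # Each non-empty row gathers the nodes attracted to node i
        members = frozenset(j for j in range(n) if M[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]

# Hypothetical weights: two tight groups (0,1,2) and (3,4) joined by a weak edge
weights = [
    [0, 90, 85, 0, 0],
    [90, 0, 88, 0, 0],
    [85, 88, 0, 2, 0],
    [0, 0, 2, 0, 95],
    [0, 0, 0, 95, 0],
]
clusters = mcl(weights, inflation=2.0)
print(clusters)
```

Raising `inflation` tends to produce more and smaller clusters, while values closer to 1 give coarser groupings, which is the granularity control mentioned in the last bullet. An E-value of 0.0 has no finite -log10, so a real pipeline would need to cap such weights.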

blastp proteome specific comparisons

all protein significant hits

Adapted from

Enright et al. NAR 2002.

Example of Partition/MCL clustering

P6 19
Total number of distinct ORFs= 6
--------------------
YKL212w 623 YIL002c 4e-11
YIL002c 946 YOR109w 7e-94
YIL002c 946 YNL106c 5e-90
YIL002c 946 YOL065c 3e-10
YNL106c 1183 YIL002c 1e-89
YOR109w 1107 YIL002c 1e-90
YKL212w 623 YOR109w 2e-34
YKL212w 623 YNL106c 3e-34
YKL212w 623 YNL325c 8e-29
YNL106c 1183 YKL212w 1e-33
YNL325c 879 YKL212w 6e-25
YOR109w 1107 YKL212w 2e-30
YNL106c 1183 YOR109w 0.0
YNL106c 1183 YNL325c 2e-22
YNL325c 879 YNL106c 1e-22
YOL065c 384 YNL106c 4e-10
YOR109w 1107 YNL106c 0.0
YNL325c 879 YOR109w 4e-20
YOR109w 1107 YNL325c 2e-16

YOL065c P6.9.C6.48 6 6
YIL002c P6.9.C6.48 6 6
YNL325c P6.9.C6.48 6 6
YKL212w P6.9.C6.48 6 6
YOR109w P6.9.C6.48 6 6
YNL106c P6.9.C6.48 6 6

Example of Partition/MCL clustering

P6 22
Total number of distinct ORFs= 6
--------------------
YBR208c 1835 YBR218c 2e-53
YBR208c 1835 YGL062w 1e-52
YBR208c 1835 YMR207c 6e-34
YBR208c 1835 YNR016c 5e-33
YBR208c 1835 YMR293c 3e-10
YBR218c 1180 YBR208c 5e-51
YGL062w 1178 YBR208c 1e-47
YMR293c 464 YBR208c 1e-11
YNR016c 2233 YBR208c 6e-34
YBR218c 1180 YGL062w 0.0
YGL062w 1178 YBR218c 0.0
YBR218c 1180 YNR016c 6e-30
YBR218c 1180 YMR207c 4e-29
YMR207c 2123 YBR218c 4e-35
YNR016c 2233 YBR218c 4e-36
YGL062w 1178 YMR207c 2e-27
YGL062w 1178 YNR016c 3e-27
YMR207c 2123 YGL062w 1e-35
YNR016c 2233 YGL062w 3e-35
YMR207c 2123 YBR208c 1e-34
YMR207c 2123 YNR016c 0.0
YNR016c 2233 YMR207c 0.0

YMR293c P6.8.C4.88 6 4
YBR218c P6.8.C4.88 6 4
YGL062w P6.8.C4.88 6 4
YBR208c P6.8.C4.88 6 4
YMR207c P6.8.C2.782 6 2
YNR016c P6.8.C2.782 6 2

Large scale predicted proteome comparisons

ORF  size  match  Partition  size  gene  S. cerevisiae  C. elegans  etc.

YAL015c 399 HS P2.140 2 /YOL043c 9e-71 1 1 1 /R10E4.5 6e-26 /
...
thrA 820 HS P3.46 3 thrA /YJR139c 4e-31 2 2 0 / /
...
Rv0006 838 IS singleton 1 gyrA / 0 0 0 /K12D12.1 6e-09 /
...

Table : 541880 predicted proteins x 100 species

Gene Dictionary

Columns: species S1, ..., Sn, grouped by domain (E, A, B)

G1,1 100000000000000000000000000000000000000000000000
G2,1 111111111111111111111111111111111111111111111111
G3,1 111111111111111111111111111111111111111111111111
...
Gn1,1 000001110001000000000000000000000000000000000000
G1,2 000000000000000000010100000000000000000000000000
G2,2 000000000000000000000000000000000111000011100011
...
Gn2,2 111111110011111111111111011101110101111111111111
...
G1,n 011110100000000000000000001000000000000000000000
G2,n 011111100000000000000000000000000000000000000000
G3,n 011111100011111111100011011011110100111111101111
...
Gnp,n 100110000000000000000000000000000000000000000000

Protein conservation profiles (phylogenetic profiles)

Table : 541880 predicted proteins x 100 species
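A conservation profile of this kind can be built directly from the per-species match results. The sketch below is illustrative only: the gene names and match sets are hypothetical, the species list is a small subset, and the helper name `conservation_profile` is not from the original work.

```python
def conservation_profile(found_in, species):
    """Binary profile: '1' where the gene has a significant match, else '0'."""
    return "".join("1" if s in found_in else "0" for s in species)

species = ["SC", "SP", "CE", "MJ", "HI"]  # illustrative subset of species codes

# Hypothetical genes and the species in which each has a significant match
matches = {
    "geneA": {"SC"},                           # specific to one species
    "geneB": {"SC", "SP", "CE", "MJ", "HI"},   # conserved everywhere
    "geneC": {"SC", "SP"},
    "geneD": {"SC", "SP"},                     # same profile as geneC
}

profiles = {g: conservation_profile(s, species) for g, s in matches.items()}
distinct = sorted(set(profiles.values()))

print(profiles)
print(len(distinct), "distinct profiles")
```

Genes sharing the same profile, such as geneC and geneD here, collapse into one distinct conservation profile, which is the quantity counted when profiles are compared between species.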

Ancestral weight matrix

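[Figure: weight matrix with rows and columns indexed by species i and j; entries Wii, Wij and Wjj; nsi and nsj shown for species i and j]

Wii: weight of ancestral duplication;

Wij: weight of ancestral conservation of i in j;

nsi: nonspecific genes in species i.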

Ancestral duplication and ancestral conservation: Wij

org SC SP CE DM AG CA ATH HS MUS FR PF ECUN
SC 40.5 63.9 17.5 27.1 22.3 65.9 23.4 22.9 27.3 18.0 22.5 35.8
SP 58.4 37.4 18.8 29.3 26.3 54.3 25.0 25.0 29.6 20.0 24.6 38.4
CE 38.1 46.6 65.2 51.9 50.6 35.5 27.5 44.6 54.4 42.4 24.8 34.8
DM 40.5 50.2 39.2 65.8 69.9 37.5 29.5 50.3 62.7 47.9 26.5 36.3
AG 40.9 50.2 39.8 73.1 59.5 38.0 30.6 50.2 60.3 48.7 26.5 36.0
CA 71.8 65.5 18.4 27.7 25.7 35.8 24.3 23.2 27.8 18.5 22.3 35.7
ATH 40.3 47.8 21.7 31.5 30.3 37.0 83.6 25.6 29.7 21.9 26.2 33.4
HS 43.0 53.3 40.0 61.3 54.5 39.7 32.1 66.7 90.8 68.8 28.2 37.7
MUS 41.7 52.5 39.5 62.1 54.7 39.1 31.5 76.8 77.8 67.7 27.6 37.2
FR 42.0 52.6 40.0 60.7 59.9 39.5 32.7 68.7 81.8 63.4 27.6 37.4
PF 25.9 31.2 13.1 19.3 15.9 22.2 16.3 17.2 21.0 13.2 28.3 28.9
ECUN 19.5 23.4 8.9 13.1 10.8 16.2 11.4 12.0 15.2 9.0 13.6 26.1
MJ 11.5 13.3 4.9 6.7 6.0 10.2 6.0 4.8 5.6 3.7 8.7 15.4
MTH 13.6 16.2 4.6 7.4 7.6 11.2 8.0 5.1 6.1 4.0 8.3 15.2
AF 14.4 16.5 5.9 8.2 8.7 11.8 8.7 5.6 6.6 4.5 8.6 15.4
PH 16.3 18.7 5.0 7.1 9.2 11.1 9.7 5.2 6.0 4.1 7.9 15.3
PA 14.3 15.2 5.4 7.5 7.3 11.9 7.4 5.5 6.4 4.3 8.3 15.9
APEM 15.5 20.1 4.8 7.3 10.6 10.3 9.4 5.2 5.9 3.9 7.2 14.9
TA 15.2 17.5 5.9 8.3 8.3 12.7 8.2 5.3 6.3 4.2 8.6 14.8
TV 15.4 17.8 6.2 8.3 8.7 13.3 8.3 5.6 6.8 4.4 8.7 15.0
H 14.8 17.7 5.8 8.3 9.8 12.0 10.2 5.5 6.6 4.5 8.0 13.9
SSP2 16.7 19.4 7.1 9.1 9.4 14.2 9.5 6.2 7.4 4.9 9.5 15.9
PFU 17.0 22.8 6.5 9.3 11.1 13.3 12.3 7.0 8.0 5.6 9.1 17.1
STO 18.6 23.1 6.8 8.6 11.4 13.7 11.1 5.9 7.1 4.5 9.1 15.7
PYAE 15.6 19.5 5.3 8.2 9.9 11.8 9.5 5.8 6.9 4.5 8.1 15.0
MA 16.0 18.9 7.1 10.8 12.5 14.7 9.7 7.4 8.7 6.4 9.8 17.0
MK 13.0 14.6 4.0 6.2 6.1 10.7 6.9 4.6 5.4 3.5 7.3 14.1
MMA 14.8 17.4 6.4 9.2 9.5 13.5 8.1 6.6 7.9 5.3 9.7 15.8
HI 13.0 14.3 4.8 7.3 8.5 11.1 8.7 4.4 5.4 4.0 8.2 8.7
...
tnsp 74.4 79.2 49.7 76.4 81.0 72.6 58.8 78.7 93.7 72.8 42.3 48.1

Intra-species duplication

[Figure: bar chart of percent duplication (axis 0 to 85) per species, grouped by domain E, A, B]

Ancestral duplication

      E  A  B
mean= 52.1 30. 38.4
std= 17.8 11.7 11.2

Specific and nonspecific proteins

• Specific proteins (genes) are proteins that have no match outside their own proteome (no homolog in other species).

• Non-specific proteins (genes) are proteins that are conserved in at least one other species (they have homologs outside their own proteome).

Large scale proteome comparisons allow estimation of:

Specific and nonspecific proportions

[Figure: bar chart (axis 0 to 100%) of conservation per species, grouped by domain E, A, B; categories: species specific genes, genes conserved in the same phylum, genes conserved in a different phylum]

      E  A  B
mean% 76.2 84.3 87.6

Orthologs

100 species ==> 367143 orthologs

[Figure: orthologs a and b in species Si and Sj]

Structural orthologs according to the 3 domains of life (100 species: 367143 orthologous genes)

E 69%
B 21%
A 5%
EB 2%
AB 2%
ACR 1%
EA 0%
P.all 0%

note: ~6% include genes from at least 2 domains of life.

Total Partitions: 37826

Evolution by Module


(A. gambiae paralogs)

GST: orthologs

Genome trees

Martin & Embley, Nature 431:134-137 (2004)

• The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87:4576-4579 (1990).

• The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-9723 (1998).

• The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle, Science 284:2124-2128 (1999).

• The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera MC and Lake JA. Nature 431:152-155 (2004).

The tree was inferred with the use of a maximum likelihood method based on the concatenated sequences of seven universally conserved protein sequences: arginyl-tRNA synthetase, methionyl-tRNA synthetase, tyrosyl-tRNA synthetase, RNA polymerase II largest subunit, RNA polymerase II second largest subunit, PCNA, and 5'-3' exonuclease.

The alignment contains 3164 sites without insertions and deletions. Bootstrap percentages are shown along the branches.

The 1.2-Megabase Genome Sequence of Mimivirus. Didier Raoult, Stéphane Audic, Catherine Robert, Chantal Abergel, Patricia Renesto, Hiroyuki Ogata, Bernard La Scola, Marie Suzan, Jean-Michel Claverie. Science 306:1344-1350 (2004)

The ring of life provides evidence for a genome fusion origin of eukaryotes. Rivera, M.C. & Lake, J.A. Nature 431:152-155 (2004)

Evolutionary biology: Early evolution comes full circle. Martin W, Embley TM. Nature 431:134-137 (2004)

“Our analyses indicate that the eukaryotic genome resulted from a fusion of two diverse prokaryotic genomes, and therefore at the deepest levels linking prokaryotes and eukaryotes, the tree of life is actually a ring of life.”

Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306:1144-1145 (2004)

Prospects for Building the Tree of Life from Large Sequence Databases. Amy C. Driskell, Cécile Ané, J. Gordon Burleigh, Michelle M. McMahon, Brian C. O'Meara, Michael J. Sanderson. Science 306:1172-1174 (2004)

Species tree

• 16/18s rRNA tree (Woese 1990);

• main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets;

Alternative solutions: integrative methods

• “supertree” (consensus tree from a set of individual gene phylogenetic trees);

• “phylogenomic tree” based on concatenation of a gene sample common to the considered species;

• These methods suffer from difficulties related to phylogenetic tree construction: global sequence alignment difficulties, substitution variations between species, ...

Genome trees

• The concept of genome tree is based on overall gene content similarity;

• Genome trees consider more than single gene information;

[Figure: gene tree and species tree for species A, B and C, with duplication and speciation events along the time axis]

Gene tree - Species tree


Universal tree

(Woese 1990 ):

• 16s rRNA (most conserved sequences)

• main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets

• tree that takes into account the whole make up of the species genomes?

Genome trees: data matrices

T = {Tij ; i=1,n; j=1,n}, where n is the number of surveyed species and Tij is the overall similarity score between species j and i.

• Ancestral duplication and ancestral conservation (541880 total proteins)

Tij = wij = (number of proteins in j conserved in i) / size(j)

• Shared orthologous genes

sij = number of orthologs shared between i and j

Tij = sij / size(j)

• Distinct shared conservation profiles

sij = number of distinct conservation profiles shared between i and j

Tij = sij / sjj

Counts shown on the slide: 442460 non-specific prot.; 28365 / 184130 d.c.prof
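As a small illustration of how such a matrix can be turned into species groupings, the sketch below compares the rows of a 4 x 4 excerpt of the ancestral duplication and conservation table (SC, SP, HS, MUS) with a plain Euclidean distance and reports each species' nearest neighbour. This is a deliberate simplification: it skips the correspondence analysis step mentioned in the methodology, and the helper names are hypothetical.

```python
import math

# Excerpt of the Wij table: rows SC, SP, HS, MUS; columns SC, SP, HS, MUS
W = {
    "SC":  [40.5, 63.9, 22.9, 27.3],
    "SP":  [58.4, 37.4, 25.0, 29.6],
    "HS":  [43.0, 53.3, 66.7, 90.8],
    "MUS": [41.7, 52.5, 76.8, 77.8],
}

def euclidean(u, v):
    """Euclidean distance between two equally long numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbours(table):
    """For each species, return the species with the closest row profile."""
    result = {}
    for s, row in table.items():
        others = [(euclidean(row, table[t]), t) for t in table if t != s]
        result[s] = min(others)[1]
    return result

nearest = nearest_neighbours(W)
for s, t in nearest.items():
    print(s, "->", t)
```

On this excerpt the two yeasts pair together and the two mammals pair together; a full genome tree would apply a hierarchical classification to the distances between all surveyed species.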


Genome tree leaves, in the order shown on the tree, grouped by domain:

• B: Agrobacterium tumefaciens, Sinorhizobium meliloti, Mesorhizobium loti, Mycobacterium leprae, Deinococcus radiodurans, Synechocystis sp., Yersinia pestis, Vibrio cholerae, Salmonella Typhi, Pseudomonas aeruginosa, Escherichia coli, Haemophilus influenzae, Xylella fastidiosa, Neisseria meningitidis, Campylobacter jejuni, Aquifex aeolicus, Thermotoga maritima, Bacillus halodurans, Bacillus subtilis, Listeria monocytogenes EGD, Staphylococcus aureus N315, Staphylococcus aureus Mu50, Listeria innocua, Streptococcus pyogenes M1, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis, Chlamydia trachomatis, Helicobacter pylori, Rickettsia prowazekii, Treponema pallidum, Buchnera sp., Borrelia burgdorferi, Chlamydia pneumoniae, Mycoplasma pneumoniae, Mycoplasma genitalium

• A: Pyrococcus furiosus, Pyrococcus horikoshii, Pyrococcus abyssi, Methanopyrus kandleri AV19, Archaeoglobus fulgidus, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Methanosarcina acetivorans (C2A), Methanosarcina mazei strain Goe1, Sulfolobus solfataricus P2, Sulfolobus tokodaii, Thermoplasma acidophilum, Halobacterium sp. NRC-1, Aeropyrum pernix K1, Pyrobaculum aerophilum, Thermoplasma volcanium

• E: Homo sapiens, Fugu rubripes, Mus musculus, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Arabidopsis thaliana, Plasmodium falciparum, E. cuniculi, Candida albicans, Schizosaccharomyces pombe, Saccharomyces cerevisiae

Genome tree: ancestral duplication and conservation

• “whole genome” species clustering tree;

• species are clustered into 3 phylogenetic domains;

• bacterial species cluster with archaeal species;

• similar species cluster together;

• low resolution of deep clustering;

• evolutionary side effects are taken into account;

Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome Res. 12:17-25.

Shared orthologous genes (partial)

org      SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC        0  2532  1533  1660  1671  3371  1582  1789  1733  1731   890   600
SP     2532     0  1753  1917  1907  2588  1754  2060  2032  2024  1008   645
CE     1533  1753     0  3910  3869  1611  1902  4036  3994  4047  1015   580
DM     1660  1917  3910     0  7018  1728  2094  5057  5147  5035  1106   616
AG     1671  1907  3869  7018     0  1738  2160  5016  5013  5059  1085   617
CA     3371  2588  1611  1728  1738     0  1590  1850  1824  1827   873   595
ATH    1582  1754  1902  2094  2160  1590     0  2404  2406  2399  1067   539
HS     1789  2060  4036  5057  5016  1850  2404     0 14053 10286  1185   638
MUS    1733  2032  3994  5147  5013  1824  2406 14053     0 10304  1169   632
FR     1731  2024  4047  5035  5059  1827  2399 10286 10304     0  1146   626
PF      890  1008  1015  1106  1085   873  1067  1185  1169  1146     0   453
ECUN    600   645   580   616   617   595   539   638   632   626   453     0
MJ      238   233   214   216   242   230   279   223   216   217   169   142
MTH     254   247   237   247   278   245   306   251   248   249   171   141
AF      261   255   254   260   303   248   310   260   263   265   182   151
PH      251   245   250   259   297   237   281   273   258   271   187   155
PA      267   261   255   268   311   256   312   276   273   278   189   156
APEM    212   233   228   228   251   215   242   248   237   230   165   136
TA      264   260   252   254   279   261   298   268   264   261   182   141
TV      263   255   256   249   276   258   296   260   258   270   184   138
H       255   264   258   249   284   248   318   271   267   272   173   140
SSP2    302   317   293   292   326   300   360   310   309   311   200   155
PFU     264   284   256   275   324   286   316   292   274   280   195   150
STO     281   291   273   263   313   278   329   293   282   298   196   143
PYAE    245   258   236   249   285   238   278   258   246   256   170   143
MA      303   316   298   293   368   301   369   329   326   326   200   161
MK      210   214   195   204   216   211   244   205   202   195   160   125
MMA     289   298   276   280   338   280   349   305   299   297   194   160
HI      268   273   231   243   388   268   382   259   259   267   181    86

sij

Genome tree leaves, in the order shown on the tree, grouped by domain:

• B: Salmonella Typhi, Escherichia coli, Yersinia pestis, Vibrio cholerae, Pseudomonas aeruginosa, Synechocystis sp., Thermotoga maritima, Aquifex aeolicus, Deinococcus radiodurans, Xylella fastidiosa, Neisseria meningitidis, Haemophilus influenzae, Rickettsia prowazekii, Buchnera sp., Helicobacter pylori, Campylobacter jejuni, Sinorhizobium meliloti, Mesorhizobium loti, Agrobacterium tumefaciens, Borrelia burgdorferi, Treponema pallidum, Chlamydia pneumoniae, Chlamydia trachomatis, Mycoplasma pneumoniae, Mycoplasma genitalium, Listeria innocua, Listeria monocytogenes EGD, Streptococcus pyogenes M1, Bacillus subtilis, Bacillus halodurans, Staphylococcus aureus Mu50, Staphylococcus aureus N315, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis, Mycobacterium leprae

• A: Methanopyrus kandleri AV19, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, Halobacterium sp. NRC-1, Methanosarcina mazei strain Goe1, Methanosarcina acetivorans (C2A), Pyrococcus furiosus, Pyrococcus abyssi, Pyrococcus horikoshii, Sulfolobus solfataricus P2, Sulfolobus tokodaii, Pyrobaculum aerophilum, Aeropyrum pernix K1, Thermoplasma acidophilum, Thermoplasma volcanium

• E: Mus musculus, Homo sapiens, Fugu rubripes, Caenorhabditis elegans, Drosophila melanogaster, Anopheles gambiae, Arabidopsis thaliana, Plasmodium falciparum, E. cuniculi, Schizosaccharomyces pombe, Candida albicans, Saccharomyces cerevisiae

Genome tree: shared orthologs: Tij = 100*sij/size(j)

• 3 phylogenetic domains;

• bacterial species cluster with archaeal species;

• similar species cluster together;

• better resolution of deep species clustering;

• Evolutionary side effects (HGT, duplication, loss) are not completely eliminated;

Conservation profiles

p 011111100011111111100011011011110100111111101111

• a “conservation profile” is an n-component vector describing a protein conservation pattern across n species. Each component is 0 or 1, marking the absence or presence of a homolog in the corresponding species;

• a conservation profile is the trace of a protein's evolutionary history, jointly captured across a set of species (multidimensional feature);

• conservation profiles are signatures of evolutionary relationships;

• considering distinct conservation profiles reduces the effects of noisy evolutionary processes (less noisy phylogenetic signals);

• each conservation profile brings an equal amount of information, regardless of the number of genes that have identical conservation profiles;

• => conservation profiles give evidence of evolutionary history in a set of species
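A minimal sketch of the definition above, assuming that homolog detection has already been done and that, for each protein, the set of species in which it has a homolog is known. The `hits` dictionary, the protein names and the five-species order are hypothetical and serve only as illustration.

```python
def conservation_profile(species_with_homolog, species_order):
    """Return an n-character 0/1 string: 1 where a homolog is present, 0 otherwise."""
    return "".join(
        "1" if species in species_with_homolog else "0" for species in species_order
    )


# Fixed species order S1..Sn (illustrative subset of the species codes used here).
species_order = ["SC", "SP", "CE", "DM", "HS"]

# Hypothetical homolog detection results for two proteins.
hits = {
    "protA": {"SC", "SP", "HS"},
    "protB": {"CE", "DM", "HS"},
}

profiles = {
    protein: conservation_profile(found_in, species_order)
    for protein, found_in in hits.items()
}
print(profiles)
```

The species order must stay the same for every protein, otherwise profiles are not comparable component by component.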

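Construction of distinct conservation profiles (diagram, steps 1 to 4):

Step 1: species Si against the species S1, ..., Sj, ..., Sn.

Step 2: conservation profiles of gi,1, ..., gi,p of species Si:

        S1.....................Sn
gi,1    01000000000000000000000
gi,2    10000001010100000000000
...
gi,p    01000000000000001000000

Step 3: distinct conservation profiles of Si with their weights:

        S1.....................Sn    weight
        01000000000000000000000    wi,1
        10000001010100000000000    wi,2
        ...
        01000000000000001000000    wi,k

Step 4: distinct conservation profiles over all species with their weights:

        S1.....................Sn    weight
        01000000000000000000000    W1
        10000001010100000000000    W2
        ...
        01000000000000001000000    Wt

Wl = ∑{wi,m; i=1,n; m=1,n} is the weight of the conservation profile l.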
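The weights wi,k (per species) and Wl (over all species) from the diagram can be sketched as simple counts. This assumes that a weight is the number of proteins carrying a given profile, which is one plausible reading of the diagram; the species names and profiles below are hypothetical.

```python
from collections import Counter

# Hypothetical input: for each species, one conservation profile per protein.
profiles_by_species = {
    "S1": ["0100", "0100", "1001"],
    "S2": ["0100", "1001", "1001", "0011"],
}

# Per-species weights: number of proteins of that species sharing each distinct profile.
w = {species: Counter(plist) for species, plist in profiles_by_species.items()}

# Overall weights: for each distinct profile, the sum of its per-species weights.
W = Counter()
for counts in w.values():
    W.update(counts)

print(len(W), "distinct profiles:", dict(W))
```

The keys of `W` are the distinct conservation profiles; the number of keys is usually much smaller than the number of proteins.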

Distinct conservation profiles


Drastic reduction (100 species):

• 541880 proteins

• 442460 non-specific proteins, i.e. conservation profiles

• 184130 distinct conservation profiles

• 28365 distinct conservation profiles associated with at least 2 proteins from distinct species
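The last reduction step can be sketched as a filter on distinct profiles. The sketch assumes that "associated with at least 2 proteins from distinct species" means that the profile is carried by proteins of at least two different species; the records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical (species, protein, profile) records.
records = [
    ("S1", "a1", "0110"),
    ("S2", "b1", "0110"),
    ("S1", "a2", "1000"),
    ("S1", "a3", "1000"),
    ("S3", "c1", "0011"),
]

# For every distinct profile, collect the species whose proteins carry it.
species_per_profile = defaultdict(set)
for species, _protein, profile in records:
    species_per_profile[profile].add(species)

distinct_profiles = set(species_per_profile)

# Keep only the profiles carried by proteins from at least two distinct species.
shared_profiles = {
    profile for profile, species in species_per_profile.items() if len(species) >= 2
}

print(len(records), "proteins,", len(distinct_profiles), "distinct profiles,",
      len(shared_profiles), "shared between species")
```

Profile "1000" is carried by two proteins but both come from S1, so it is not kept.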

Distribution of distinct conservation profiles according to the three phylogenetic domains

• EAB: 39%

• AB: 33%

• EB: 11%

• B: 11%

• EA: 3%

• A: 2%

• E: 1%
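A profile can be assigned to one of these classes from the domains of the species in which homologs are present. The sketch below assumes that this is how the classes are defined; the function name `domain_class` and the six-species domain assignment are hypothetical.

```python
def domain_class(profile, domains):
    """Return the class (e.g. 'E', 'AB', 'EAB') of a 0/1 profile.

    domains[k] is the domain ('E', 'A' or 'B') of the species at position k.
    """
    present = {domain for bit, domain in zip(profile, domains) if bit == "1"}
    # Always report domains in the order E, A, B, as in the chart labels.
    return "".join(domain for domain in "EAB" if domain in present)


# Hypothetical species order: two eukaryotes, two archaea, two bacteria.
domains = ["E", "E", "A", "A", "B", "B"]

for profile in ["110000", "001010", "101001"]:
    print(profile, domain_class(profile, domains))
```

Counting the classes over all distinct profiles would give a distribution of the kind shown above.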

• Tij = sij, where sij is the number of occurrences of distinct shared conservation profiles between species i and j;

• Tij = sij/sjj.
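One way to obtain sij is to count, over the distinct conservation profiles, those in which both species i and species j have a 1. Under this reading (an assumption, as is the function name `shared_profile_counts`), sjj is simply the number of distinct profiles in which species j is present. The four profiles below are hypothetical.

```python
def shared_profile_counts(distinct_profiles):
    """Return s with s[i][j] = number of profiles having a 1 at positions i and j."""
    n = len(distinct_profiles[0])
    s = [[0] * n for _ in range(n)]
    for profile in distinct_profiles:
        present = [k for k, bit in enumerate(profile) if bit == "1"]
        for i in present:
            for j in present:
                s[i][j] += 1
    return s


# Hypothetical distinct profiles over three species.
distinct_profiles = ["110", "101", "111", "010"]

s = shared_profile_counts(distinct_profiles)
for row in s:
    print(row)
```

Each distinct profile is counted once, whatever the number of proteins that carry it, in line with the idea that every profile brings the same amount of information.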

Occurrences of shared conservation profiles

E A B
S1..............I.............I................Sn
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
000001110001000000000000000000000000000000000000
000000000000000000000000000000000111000011100011
................................................

Occurrences of shared distinct conservation profiles

spec     SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC     2328   387   239   262   274   400   338   285   299   288   146    96
SP      387  2208   267   301   317   351   377   318   334   320   152   102
CE      239   267  3153   575   506   284   364   642   656   670   188   116
DM      262   301   575  2747   653   305   416   718   729   725   203   124
AG      274   317   506   653  4052   269   477   612   657   650   165   107
CA      400   351   284   305   269  1906   315   345   362   338   171   107
ATH     338   377   364   416   477   315  5762   451   477   469   190   110
HS      285   318   642   718   612   345   451  3813  1511  1134   231   127
MUS     299   334   656   729   657   362   477  1511  4134  1140   229   133
FR      288   320   670   725   650   338   469  1134  1140  4280   215   132
PF      146   152   188   203   165   171   190   231   229   215  1251    95
ECUN     96   102   116   124   107   107   110   127   133   132    95   572
MJ       41    46    32    39    48    45    60    39    41    39    21    13
MTH      54    56    40    53    63    53    73    51    54    50    30    21
AF       56    52    57    62    78    54    74    64    66    65    31    19
PH       41    41    46    45    58    44    59    47    51    47    24    14
PA       49    47    51    48    56    53    72    51    52    50    25    16
APEM     51    51    48    51    65    51    63    57    60    54    29    17
TA       55    59    63    61    72    57    83    66    68    65    31    19
TV       58    56    65    59    68    52    82    61    66    65    29    18
H        65    68    64    65    77    61   101    71    73    71    34    23
SSP2     71    75    73    72    87    70    95    80    87    76    32    20
PFU      52    57    57    51    64    57    73    56    62    56    28    18
STO      59    59    65    67    71    56    75    65    66    64    28    17
PYAE     59    56    48    53    73    53    81    60    67    62    24    15
MA       71    75    76    83   102    84   113    85    93    85    44    33
MK       43    45    33    40    48    44    56    38    41    36    21    12
MMA      77    72    65    73    89    76   105    74    81    66    41    28
HI       71    76    70    67   101    79   116    74    74    78    46    23

sij

Profiles Conservation Orthologs

• Tekaia, F. and B. Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49:591-600.

• Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 12:17-25.

• Tekaia, F., Gordon, S.V., Garnier, T., Brosch, R., Barrell, B.G. and S.T. Cole (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tubercle and Lung Disease, 79:329-342.

• Genolevures program:
- F. Tekaia, G. Blandin, A. Malpertuy, et al. (2000). Methods and strategies used for sequence analysis and annotation. FEBS Lett. 487(1):17-30.
- A. Malpertuy, F. Tekaia, S. Casaregola, et al. (2000). «Yeast specific» genes. FEBS Lett. 487(1):113-121.
- G. Blandin, P. Durrens, F. Tekaia, et al. (2000). The genome of Saccharomyces cerevisiae revisited. FEBS Lett. 487(1):31-36.

• Tekaia, F., Yeramian, E. and B. Dujon (2002). Amino acid composition of genomes, lifestyle of organisms and evolutionary trends: a global picture with correspondence analysis. Gene, 297:51-60.

• Tekaia, F. and E. Yeramian (in prep.). Genome tree based on conservation profiles.

Systematic analysis of completely sequenced organisms:

http://www.pasteur.fr/~tekaia/sacso.html
