Large scale genome comparisons. Fredj Tekaia, Institut Pasteur. Institut Pasteur/EMBO/CNPq course, UFSC, Florianopolis, 2008

Plan: Completely sequenced genomes; Comparison programs; Large scale genome comparisons; References.

Large scale genome comparisons: -duplication; -conservation; -species-specific genes (proteins); -paralogues, orthologues; -families (clusters) of paralogues, of orthologues; -genome organisation (duplicated, conserved genes); -search for shared motifs in proteins of the same cluster; -protein conservation profiles; -selection pressure analyses (synonymous, non-synonymous substitutions, ...).

Tree of life and complete genomes (http://www.genomesonline.org/, 12-06-08): 3823 projects, 819 published; 1848 Bacteria, 90 Archaea, 936 eukaryotes, 130 metagenomes. Counts shown on the tree figure: 672, 53 and 94 (together the 819 published genomes). 3 phylogenetic domains. Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions, ...

Number of available completely sequenced genomes (GOLD: list and references). Completely sequenced genomes that span the three domains of life are growing at a rapid rate (06-2008).

Genome sequencing projects. There are several web-based resources that document the progress of completely sequenced genomes and their reference publications, including: GOLD (Genomes Online Database) http://wit.integratedgenomics.com/GOLD/

Resources for genomes. There are two main resources for genomes:
- EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/genomes/
- NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/Genomes/
There are also many other resources from sequencing institutions:
- Sanger (The Wellcome Trust Sanger Institute) http://www.sanger.ac.uk/
- Broad Institute http://www.broad.mit.edu/tools/data/seq.html
- Genolevures http://cbi.labri.fr/Genolevures/index.php

Definitions
- Genome: the genome of a cell is formed by the collection of the DNA it comprises. The genome size is the total number of its DNA bases.
- Gene: a particular DNA sequence situated at a specific position on a chromosome and that codes for a specific function.
- Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that codes for it.
- Proteome: the set of proteins in an organism.
- Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization, ...

Chronology of completely sequenced genomes
1977: first viral genome (5386 base pairs; encoding 11 genes). Sanger et al. sequence bacteriophage φX174.
1981: Human mitochondrial genome. 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA).
1986: Chloroplast genome. 156,000 base pairs (most are 120 kb to 200 kb).

1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 kb, 1713 genes.
1996: first archaeal genome: Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb, 1773 genes.
1997: first eukaryotic genome: Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes.
1998: first multicellular organism: the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.

1999: first human chromosome: Chromosome 22 (49 Mb, 673 genes).

2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).
2000: first plant genome: Arabidopsis thaliana (115,428 kb; 22670 genes).
2001: draft sequence of the human genome (x Mb; ~28000 genes).
2002: Plasmodium falciparum (22.9 Mb; 5334 genes).
2002: mouse genome (x Mb; ~28000 genes).
2004: fish: draft Tetraodon nigroviridis genome (x Mb; ~28000 genes).
2005: dog (41 Mb, 33651 genes) and chicken (18031 genes) genomes.

How big are genomes?
- Viral genomes: 1 kb to 360 kb (Canarypox virus). Note: Mimivirus: 1.2 Mb. http://www.giantvirus.org/top.html (top 100 largest viral genome sequences)
- Bacterial genomes: 0.5 Mb to 13 Mb
- Eukaryotic genomes: 8 Mb to 670 Gb
DOGS - Database Of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/

Comparative genomics. Analyses of the genetic material of different species help to understand the similarities and differences between genomes, their evolution and the evolution of their genes. Intra-genomic comparisons help to understand the degree of duplication (genome regions; genes) and gene organization. Inter-genomic comparisons help to understand the degree of similarity between genomes and the degree of conservation between genes. The overall aim is to understand gene and genome evolution.

Evolution

[Figure: species tree (A, B, C) versus gene tree (A, B, C) along a time axis, showing duplication and speciation events: speciation only, and speciation combined with duplication.] Genomes, 2nd edition, 2002. T.A. Brown.

Starting from the genome of an ancestor species, evolutionary processes include: Expansion* (duplication, genesis), Exchange* (HGT), Deletion* (loss), Phylogeny* and selection*. Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.

Gene duplications are traditionally considered a major evolutionary source of new protein functions. Understanding how duplications happened and how important this evolutionary process is remains a key goal of genome analysis. > Some examples

Kellis et al. Nature, 2004 S. cerevisiae genome Colours reveal Duplications

Kellis et al. Nature, 2004. Speciation, duplication, deletion. Present content of the 2 copies; reconstruction of the ancestral organization.

Nature Reviews Genetics 3; 827-837 (2002); SPLITTING PAIRS: THE DIVERGING FATES OF DUPLICATED GENES

Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206. [Figure labels: original version; present version.]

Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431, 946-957. 2004.

Intra-genome Comparisons simple description (genes, size distribution, base compositions nucleotides, amino acids,...); specific genes ; gene duplication ; gene families ; gene organization on the genome;.....

Inter-genome comparisons: base composition, codons, amino acids, ...; degree of conservation between genomes; orthologue determination; families (clusters) of orthologues; gene dictionary; gene conservation profiles; genome tree construction; multiple genome alignments.

Simple description

Statistics: gene number / species / domain
Dom   mean      std       number   min   Max
E     12229.2   11306.3   27       464   45000
A     2161.3    704.7     47       536   4540
B     3197.6    1620.8    533      182   8702

G+C content
Dom   Mean   std    number   min    Max
E     39.8   9.7    28       22     63
A     45.9   10.9   47       27.6   67.9
B     48.2   13.7   533      16     87

ORF products mean size

Amino acid composition

Amino acid composition. Correspondence Analysis was used to explore relationships between species and amino acids.

[Figure: correspondence analysis map with groups labelled Eukaryotes, Hyperthermophiles, Psychrophiles, Prokaryotes (mesophiles) and Thermophiles; labelled species: Encephalitozoon cuniculi, Thermosynechococcus elongatus.] Tekaia & Yeramian, 2006, BMC Genomics 7:307

[Figure: the same map annotated with GC% and growth temperature trends. Labelled species: Mycoplasma mycoides (23%), Nocardia farcinica (70%), Streptomyces coelicolor (72%), Tetrahymena thermophila (protist), Saccharomyces, Entamoeba histolytica (protist), Cryptosporidium hominis, Leishmania major (60%), Cyanidioschyzon merolae, Aspergillus fumigatus (50%), Homo sapiens, Methanococcus jannaschii (31%), Pyrococcus abyssi (44%), Methanopyrus kandleri (61%), Thermus thermophilus (69%), Colwellia psychrerythraea, Pseudoalteromonas haloplanktis, Encephalitozoon cuniculi, A. nidulans, A. oryzae, C. neoformans, Mus musculus, rat, Candida glabrata.] Tekaia & Yeramian, 2006, BMC Genomics 7:307

[Figure: A. fumigatus; area of candidate thermostable proteins.]

Search for similarity

Methods: It is important to know how the algorithms that allow sequence comparisons work. There are many comparison methods. Among the most used:
- BLAST
- FASTA
- Smith-Waterman algorithm (dynamic programming method)
- HMM (Hidden Markov Model)

Sequence comparisons. [Figure: example alignments of the sequence V I T K L G T C V G S against related sequences, illustrating identity, similarity and homology.]

Comparison of 2 sequences. The aim is to find the optimal alignment: the one that best brings out both the most similar regions and the less similar regions. In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology. This requires a score that evaluates matches, mismatches and gaps, and a method that evaluates the numerous possible alignments.
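As an illustration of "a score plus a method", here is a minimal sketch of the Smith-Waterman local alignment score computed by dynamic programming. The scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for the example; they are not taken from the slides, and real protein comparisons would use a substitution matrix instead of a flat match/mismatch score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and hypothetical scoring values.
    Only the previous row of the dynamic programming table is kept.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + s,    # match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("ACGT", "ACT"))
```

The `0` option in the maximum is what makes the alignment local: a poorly scoring prefix is simply dropped.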

Identity. Refers to the occurrence of identical nucleotides or amino acids at the same position in aligned sequences. Identity is objective and well defined. Identity can be quantified as a percentage, i.e. the number of identical matches divided by the length of the aligned region.
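A minimal sketch of the percent identity calculation described above. It assumes the two sequences are already aligned (same length, "-" for gaps) and that gap columns count in the length of the aligned region; the function name and the gap convention are assumptions made for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over an aligned region.

    Both inputs must have the same length; '-' marks a gap.
    A column counts as identical only if both residues are equal
    and neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


print(percent_identity("ACGT", "ACGA"))  # 3 identical columns out of 4
```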

Similarity. Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of difference, with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices). Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region. The score depends on the chosen parameters. Unlike for homology, expressions like "significant similarity" or "weak similarity" are meaningful and often used.

Homology. Sequence homology implies common ancestry and underlies sequence conservation. Homology can be inferred, under suitable conditions, from sequence similarity. The main objective of sequence similarity searches is to infer homology between sequences. Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like "significant homology" or "weak homology" are meaningless). Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.

Local alignment; global alignment

Compare one query sequence to a BLAST formatted database

Amino acid scoring schemes (substitution matrices). All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids. As a result:
- what a local alignment program produces depends strongly upon the scores it uses;
- implicitly, a scheme may represent a particular theory of evolution;
- the choice of a matrix can strongly influence the outcome of an analysis.
The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar character pairs: $S_{ij} = \ln\left(q_{ij} / (p_i p_j)\right) / u$, where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical parameter.
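The log-odds formula above can be evaluated directly. The sketch below is only an illustration: the frequencies used in the example are made-up numbers, and the default u = ln(2)/2 is an assumption chosen so that scores come out in half-bit units.

```python
import math


def log_odds_score(q_ij, p_i, p_j, u=math.log(2) / 2):
    """Substitution score S_ij = ln(q_ij / (p_i * p_j)) / u.

    q_ij : target frequency of the aligned pair (i, j)
    p_i, p_j : background frequencies of residues i and j
    u : scaling parameter (default gives half-bit units; an assumption)
    """
    return math.log(q_ij / (p_i * p_j)) / u


# Hypothetical frequencies: the pair is seen 8 times more often than chance.
raw = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05)
print(round(raw))  # matrices store the rounded integer
```

A pair observed exactly as often as expected by chance scores 0; pairs seen less often than chance get negative scores.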

Examples of substitution matrices
# PAM250 substitution matrix, scale = ln(2)/3 = 0.231049
# Expected score = -0.844, Entropy = 0.354 bits
# Lowest score = -8, Highest score = 17
  A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -8
R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -8
N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -8
D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -8
Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -8
E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -8
G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -8
H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -8
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -8
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -8
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -8
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -8
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -8
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -8
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -8
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -8
B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -8
Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -8
X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 -8
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1

BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62
# Lowest score = -4, Highest score = 11
  A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

PAM matrices (Dayhoff et al. (1978)). PAM stands for point accepted mutation. 1 PAM corresponds to 1 amino acid change per 100 residues (1 PAM ~ 1% divergence); matrices for longer distances are obtained by extrapolation.
Assumptions:
- replacements are independent of surrounding residues,
- sequences being compared are of average composition,
- all sites are equally mutable.
Sources of error:
- small, globular proteins were used to derive PAM matrices (departure from average composition),
- errors in PAM1 are magnified up to PAM250,
- conserved blocks or motifs are not accounted for.
Strategy:
- PAM40: short alignments, highly similar
- PAM120: average similarity
- PAM250: longer, weaker local alignments

BLOSUM matrices (Henikoff, S., and Henikoff, J. G. (1992)). BlosumX denotes a matrix obtained from clustered sequence segments with more than X% identity. Examples:
- Blosum62 is obtained from clustered sequences with identity greater than 62%.
- Blosum80 is obtained from clustered sequences with identity greater than 80%.
Which substitution matrix to choose?
Less divergent  ->  More divergent
Blosum80            Blosum62            Blosum45
PAM10               PAM120              PAM250

Position Specific Scoring Matrix (PSSM) - Conserved motifs are identified and amino acid profile matrix for each motif is calculated. -This matrix (n x 20 aa ) is representative of the relative amino acid probabilities at specific positions and is characteristic of a protein family. -Such matrices are used by the profile database searching programs (including PSI-BLAST and HMM based programs).
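To make the idea of a position-specific matrix concrete, here is a deliberately simplified sketch that derives log-odds scores per position from a set of aligned motif instances. It is not the PSI-BLAST procedure: the uniform background, the flat pseudocount and the function names are all assumptions made for this illustration.

```python
import math


def build_pssm(motif_instances, alphabet, pseudocount=1.0):
    """Build a simple PSSM (list of dicts, one per motif position).

    motif_instances : equal-length, gap-free aligned motif occurrences
    alphabet        : residues to score (20 amino acids in practice)
    Scores are log2(observed frequency / uniform background frequency).
    """
    length = len(motif_instances[0])
    background = 1.0 / len(alphabet)
    total = len(motif_instances) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in motif_instances]
        row = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / total
            row[residue] = math.log2(freq / background)
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment of motif length."""
    return sum(row[residue] for row, residue in zip(pssm, segment))


# Toy example on a 4-letter alphabet to keep the output readable.
toy = build_pssm(["AC", "AC", "AG"], alphabet="ACGT")
print(score_segment(toy, "AC"), score_segment(toy, "TT"))
```

Unlike a general substitution matrix, the same residue can score differently at different positions of the motif.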

Example of a PSSM (determined by the PSI-BLAST program):
      A R N D C Q E G H I L K M F P S T W Y V
1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
3 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
5 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
6 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
7 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
8 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
9 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
10 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
11 G 0 -2 0 -1 -2 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
12 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
13 A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
14 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
15 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
16 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
17 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
18 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
19 Q -1 1 0 0 -3 5 3 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
20 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
21 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
22 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
23 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
...
573 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
574 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
575 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1

Blast algorithm:
(1) Query sequence: build the list of high-scoring words of length w. A query sequence of length L yields a maximum of L-w+1 words (w = 3, 11, ...). List the words that score at least T using a substitution matrix (Blosum62, PAM250, ...).
(2) Compare the word list to the database (DB) sequences and identify exact matches: extract the matches of words from the word list.
(3) For each word match, extend the alignment in both directions to find alignments with scores > S: High-scoring Segment Pairs (HSPs), Maximal Segment Pairs (MSPs).
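Step (1) of the Blast algorithm can be sketched as follows: split the query into its L-w+1 overlapping words, then list for each word the "neighbourhood" of words scoring at least T. This is an illustrative sketch, not BLAST's implementation: the toy scoring function (+5 for identity, -1 otherwise) stands in for a real substitution matrix, and the brute-force enumeration is only practical for short words and small alphabets.

```python
from itertools import product


def query_words(query, w=3):
    """All overlapping words of length w in the query (L - w + 1 words)."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, score, alphabet, threshold):
    """All words of the same length that score >= threshold against `word`.

    `score(a, b)` is a pairwise residue score; here a hypothetical stand-in
    for a substitution matrix lookup.
    """
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


def toy_score(a, b):
    """Hypothetical scoring: +5 for identical residues, -1 otherwise."""
    return 5 if a == b else -1


print(query_words("MSSSSGLK", w=3))
print(neighborhood("AC", toy_score, alphabet="ACG", threshold=4))
```

Raising T shrinks the word list (faster search, fewer seeds); lowering it makes the search more sensitive.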

E-values: Statistics of HSP scores are characterized by two parameters, $K$ and $\lambda$. The expected number of HSPs with score at least $S$ is given by: $E = K m n e^{-\lambda S}$ (Karlin & Altschul, 1990). $m$ and $n$ are sequence lengths. $E$ is the E-value for the score $S$.
Bit scores: $S' = (\lambda S - \ln K) / \ln 2$. The E-value corresponding to a given bit score is: $E = m n 2^{-S'}$ (note that it depends only on $mn$).
P-values: The probability of finding exactly $a$ HSPs with score >= $S$ is given by: $P(a) = e^{-E} E^{a} / a!$ (Poisson distribution), where $E$ is the E-value of $S$ given by the above equation. The probability of finding zero HSPs with score >= $S$ is $P(0) = e^{-E}$, so the probability of finding at least one such HSP is: $P = 1 - e^{-E}$.
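The relations between raw score, bit score, E-value and P-value can be checked numerically. In the sketch below, the second Karlin-Altschul parameter is written `lam` (for lambda); the numeric values of K, lambda, m and n in the example are illustrative assumptions, not values from the slides.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def p_exactly(a, E):
    """Poisson probability of finding exactly a HSPs: e^-E * E^a / a!."""
    return math.exp(-E) * E ** a / math.factorial(a)


def p_at_least_one(E):
    """P = 1 - e^-E."""
    return 1.0 - math.exp(-E)


# Illustrative (assumed) parameters.
K, lam = 0.041, 0.267
m, n = 300, 5_000_000
S = 80
E = e_value(S, m, n, K, lam)
print(E, e_value_from_bits(bit_score(S, K, lam), m, n), p_at_least_one(E))
```

Both ways of computing E agree, which is the point of bit scores: once a score is expressed in bits, K and lambda are no longer needed to interpret it. For small E, P is approximately equal to E.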

Large scale predicted proteome comparisons

The expected number of HSPs with score at least $S$ is given by: $E = K m n e^{-\lambda S}$, where $m$ and $n$ are the sequence and database lengths.

Systematic Analysis of Completely Sequenced Organisms In silico species specific comparisons Degree of ancestral duplication and of ancestral conservation between pairs of species; Families of paralogs (Partition-MCL); Families of orthologs (Partition-MCL); Distribution of orthologous families according to the three domains of life; Determination of the protein dictionary (orthologs); Determination of protein conservation profiles;

Homologs - Paralogs - Orthologs. Homologs: A1, B1, A2, B2. Paralogs: A1 vs B1 and A2 vs B2. Orthologs: A1 vs A2 and B1 vs B2. [Figure: in the ancestor, a duplication produces genes a and b; after speciation and evolution, species 1 (S1) carries A1 and B1 and species 2 (S2) carries A2 and B2; sequence analysis compares the genes of the two species.]

Orthologs - inparalogs - outparalogs. [Figure: along the time axis, gene G duplicates into G1 and G2 before the speciation of A and B; G2 duplicates again in B, giving A-G1, B-G1, A-G2, B-G2 1 and B-G2 2, with pairs labelled orthologs, inparalogs and outparalogs.] Criteria: sequence divergence between out-paralogs should be larger than between orthologs and in-paralogs; orthology assignments are consistent among several genome pairs; orthologues are present in syntenic order. Heger & Ponting (2007) Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res. 17:1837-49.

Example Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome

SC vs SC

- Paralogs - multiple matches - Partitions/clustering

SC/CE, CE/SC: Orthologs

segmatchSCCE

Descriptions

Gene Dictionary. Table: 541880 predicted proteins x 100 species

Protein conservation profiles (phylogenetic profiles). Table: 541880 predicted proteins x 100 species. Columns are the species S1 ... Sn, grouped by domain (E, A, B); rows are the genes G i,j (gene i of species j); 1 = conserved in that species, 0 = no match.
G 1,1   100000000000000000000000000000000000000000000000
G 2,1   111111111111111111111111111111111111111111111111
G 3,1   111111111111111111111111111111111111111111111111
...
G n1,1  000001110001000000000000000000000000000000000000
G 1,2   000000000000000000010100000000000000000000000000
G 2,2   000000000000000000000000000000000111000011100011
...
G n2,2  111111110011111111111111011101110101111111111111
...
G 1,n   011110100000000000000000001000000000000000000000
G 2,n   011111100000000000000000000000000000000000000000
G 3,n   011111100011111111100011011011110100111111101111
...
G np,n  100110000000000000000000000000000000000000000000
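A conservation profile of this kind can be represented as a binary string with one position per species. The sketch below assumes that, for each protein, the set of species in which it has a significant match is already known; the function names and the toy data are hypothetical.

```python
def conservation_profile(species_with_hit, species_order):
    """Binary profile: '1' if the protein has a match in the species, else '0'."""
    return "".join("1" if sp in species_with_hit else "0" for sp in species_order)


def group_by_profile(hits_by_protein, species_order):
    """Group proteins that share exactly the same conservation profile."""
    groups = {}
    for protein, species_with_hit in hits_by_protein.items():
        profile = conservation_profile(species_with_hit, species_order)
        groups.setdefault(profile, []).append(protein)
    return groups


# Toy data: 4 species, 3 proteins (all names are made up).
species = ["S1", "S2", "S3", "S4"]
hits = {
    "G1": {"S1"},                    # species-specific
    "G2": {"S1", "S2", "S3", "S4"},  # conserved everywhere
    "G3": {"S1", "S2", "S3", "S4"},
}
print(group_by_profile(hits, species))
```

Proteins with identical or very similar profiles can then be examined together, for example to look for genes that were gained or lost jointly.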

[Figure: duplication in ZYRO, KLLA, KLTH, ERGO.]

Ancestral duplication and ancestral conservation $W_{ij}$

Shared orthologous genes $s_{ij}$

Ancestral duplication
        E      A      B
mean=   52.1   30.    38.4
std=    17.8   11.7   11.2

Specific and nonspecific proteins. Large scale proteome comparisons allow estimation of:
- Specific proteins (genes): proteins that have no match outside their own proteome (no homolog in other species).
- Non-specific proteins (genes): proteins that are conserved in at least one other species (have homologs outside their own proteome).

Specific and nonspecific proportions
         E      A      B
mean%    76.2   84.3   87.6

[Figure: species-specific genes; genes ranked by conservation (0 to 100%) in the same phylum and in different phyla.]
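Given comparison results, the split between specific and non-specific proteins reduces to checking whether a protein has any match outside its own species. The sketch below assumes the matches have already been filtered for significance and are stored as a set of species per protein; the names and the data layout are hypothetical.

```python
def split_specific(proteome, species_of_hits, own_species):
    """Split a proteome into specific and non-specific proteins.

    proteome        : list of protein identifiers of one species
    species_of_hits : dict protein -> set of species with a significant match
    own_species     : identifier of the species the proteome belongs to
    Returns (specific, non_specific, percent_specific).
    """
    specific, non_specific = [], []
    for protein in proteome:
        outside = species_of_hits.get(protein, set()) - {own_species}
        (non_specific if outside else specific).append(protein)
    percent = 100.0 * len(specific) / len(proteome) if proteome else 0.0
    return specific, non_specific, percent


# Toy example: only p2 has a match outside its own species (SC).
toy_hits = {"p1": {"SC"}, "p2": {"SC", "CE"}, "p3": set()}
print(split_specific(["p1", "p2", "p3", "p4"], toy_hits, "SC"))
```

Note that a protein whose only matches are paralogs in its own proteome still counts as specific under this definition.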

Domain specific conservation

Domain specific conservation...

Clusters (families) of paralogues and of orthologues

Paralogs: Partitions. Paralogs: reciprocal significant hit proteins. 1. Partition: each non-unique protein is assigned to a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order (examples in the figure: P7.1, P4.1, P4.2).

Paralogs: mcl clustering. Paralogs: reciprocal significant hit proteins. 2. mcl clustering: performed using -log(blastp e-values) and an inflation index I=3.0 (examples in the figure: C4.1, C3.1, C3.2, C1.1, C1.2, C1.3, C2.1).
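For readers unfamiliar with MCL, the sketch below shows the core loop of Markov clustering (expansion by matrix squaring, inflation by element-wise power, column re-normalisation) in plain Python. It is a teaching sketch under simplifying assumptions (dense matrix, fixed iteration count, no pruning), not the mcl program itself; edge weights stand for -log(e-value) and the toy graph is made up.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = matrix[i][j] / total
    return out


def mcl(edges, inflation=3.0, iterations=100, eps=1e-6):
    """Cluster an undirected weighted graph; edges maps (a, b) -> weight > 0."""
    nodes = sorted({v for pair in edges for v in pair})
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[index[a]][index[b]] = w
        m[index[b]][index[a]] = w
    for i in range(n):
        m[i][i] = max(m[i])  # self-loop, a common MCL convention
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one step of random-walk flow (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = normalize_columns([[v ** inflation for v in row] for row in expanded])
    # Rows that keep non-zero mass are attractors; their non-zero columns
    # are the members of the corresponding cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


toy_edges = {("a", "b"): 50.0, ("b", "c"): 50.0, ("a", "c"): 50.0,
             ("d", "e"): 30.0, ("e", "f"): 30.0, ("d", "f"): 30.0}
print(mcl(toy_edges, inflation=3.0))
```

A higher inflation index gives finer clusters. In practice an e-value of 0 needs to be capped before taking the logarithm, which this sketch leaves to the caller.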

Paralogs: Partitions / mcl clustering. Each protein is identified by its partition and its mcl cluster: Pn.m.Cp.q (examples in the figure: P7.1.C4.1, P7.1.C3.1, P4.2.C3.2, P4.2.C1.1, P4.1.C1.2, P4.1.C1.3, P4.1.C2.1).

Paralogs: Partition and clustering of duplicated proteins.
- Each non-unique protein is assigned to a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order;
- In parallel, the same set of non-unique proteins is clustered using the MCL algorithm (Markov Cluster algorithm by Stijn van Dongen);
- The clustering was performed using -log(blastp e-values) and an inflation index I=3.0;
- Result: each protein belongs to both a partition (Pn.m) and an MCL cluster (Cp.q), which are concatenated to form the final family assignment Pn.m.Cp.q of the locus;
- The term singleton is assigned to loci that do not have significant matches;
- Reciprocal best hit proteins are considered putative paralogs.

Paralogs: Partitions. Paralogs: reciprocal significant hit (RSH) proteins. 1. Partition: each non-unique protein is assigned to a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order (examples in the figure: P7.1, P4.1, P4.2); (blastp; pam250; SEG filter; e-value
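One plausible reading of the partition step is single-linkage grouping: two proteins fall in the same partition whenever they are linked, directly or through intermediates, by reciprocal significant hits. The sketch below implements that reading with a union-find structure and produces Pn.m labels. Whether the original analysis defined partitions exactly this way is an assumption, as are the function name and the toy protein identifiers.

```python
from collections import defaultdict


def partitions(reciprocal_pairs):
    """Group proteins linked by reciprocal significant hits.

    reciprocal_pairs : iterable of (protein_a, protein_b)
    Returns a dict mapping labels 'Pn.m' to sorted member lists, where n is
    the partition size and m an arbitrary rank among partitions of size n.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in reciprocal_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for protein in list(parent):
        groups[find(protein)].append(protein)

    labelled = {}
    rank_by_size = defaultdict(int)
    # Largest partitions first; ties broken by member names for determinism.
    for members in sorted(groups.values(), key=lambda g: (-len(g), sorted(g))):
        size = len(members)
        rank_by_size[size] += 1
        labelled[f"P{size}.{rank_by_size[size]}"] = sorted(members)
    return labelled


print(partitions([("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]))
```

Because linkage is transitive here, a partition can contain proteins that do not match each other directly, which is one reason a second, finer clustering such as MCL is applied to the same set.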
