Large scale genomes comparisons Fredj Tekaia Institut Pasteur [email protected] Institut Pasteur/EMBO/CNPq course, UFSC, Florianopolis, 2008

  • Slide 1
  • Slide 2
  • Slide 3
  • Plan: Completely sequenced genomes; Comparison programs; Large scale genome comparisons; References.
  • Slide 4
  • Large scale genome comparisons: duplication; conservation; species-specific genes (proteins); paralogues, orthologues; families (clusters) of paralogues and of orthologues; genome organisation (duplicated, conserved genes); search for shared motifs in proteins of the same cluster; protein conservation profiles; selection pressure analyses (synonymous and non-synonymous substitutions, ...).
  • Slide 5
  • Tree of life. Complete genomes (http://www.genomesonline.org/, 12-06-08): 3823 projects, 819 published (figure labels: 672, 53, 94); 1848 Bacteria, 90 Archaea, 936 eukaryotes, 130 metagenomes. 3 phylogenetic domains. Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions, ...
  • Slide 6
  • Number of available completely sequenced genomes (GOLD, 06-2008; list and references). Completely sequenced genomes that span the three domains of life are accumulating at a rapid rate.
  • Slide 7
  • Genome sequencing projects. Several web-based resources document the progress of completely sequenced genomes and their reference publications, including GOLD, the Genomes Online Database (http://wit.integratedgenomics.com/GOLD/).
  • Slide 8
  • Resources for genomes. There are two main resources for genomes: EBI, European Bioinformatics Institute (http://www.ebi.ac.uk/genomes/) and NCBI, National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/Genomes/). Many other resources are provided by sequencing institutions: Sanger, The Wellcome Trust Sanger Institute (http://www.sanger.ac.uk/); Broad Institute (http://www.broad.mit.edu/tools/data/seq.html); Genolevures (http://cbi.labri.fr/Genolevures/index.php).
  • Slide 9
  • Definitions. Genome: the genome of a cell is the collection of DNA it contains; the genome size is its total number of DNA bases. Gene: a particular DNA sequence located at a specific position on a chromosome that codes for a specific function. Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that codes for it. Proteome: the set of proteins in an organism. Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization, ...
  • Slide 10
  • Chronology of completely sequenced genomes. 1977: first viral genome (5386 base pairs, encoding 11 genes); Sanger et al. sequence bacteriophage φX174. 1981: human mitochondrial genome; 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA). 1986: chloroplast genome; 156,000 base pairs (most are 120 kb to 200 kb).
  • Slide 11
  • 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 kb, 1713 genes. 1996: first archaeal genome: Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb, 1773 genes. 1997: first eukaryotic genome: Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes. 1998: first multicellular organism: the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
  • Slide 12
  • 1999: first human chromosome: chromosome 22 (49 Mb, 673 genes).
  • Slide 13
  • 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes). 2000: first plant genome: Arabidopsis thaliana (115,428 kb; 22670 genes). 2001: draft sequence of the human genome (x Mb; ~28000 genes). 2002: Plasmodium falciparum (22.9 Mb; 5334 genes). 2002: mouse genome (x Mb; ~28000 genes). 2004: fish draft genome, Tetraodon nigroviridis (x Mb; ~28000 genes). 2005: dog (41Mb, 33651 genes) and chicken (18031 genes) genomes.
  • Slide 14
  • How big are genome sizes? Viral genomes: 1 kb to 360 kb (Canarypox virus). Note: Mimivirus: 1.2 Mb; see http://www.giantvirus.org/top.html (top 100 largest viral genome sequences). Bacterial genomes: 0.5 Mb to 13 Mb. Eukaryotic genomes: 8 Mb to 670 Gb. DOGS, Database Of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/
  • Slide 15
  • Comparative genomics. Analyses of the genetic material of different species help to understand the similarities and differences between genomes, their evolution and the evolution of their genes. Intra-genomic comparisons help to understand the degree of duplication (genome regions; genes) and gene organization, ... Inter-genomic comparisons help to understand the degree of similarity between genomes and the degree of conservation between genes. Aim: understanding gene and genome evolution.
  • Slide 16
  • Evolution
  • Slide 17
  • Figure (T.A. Brown, Genomes, 2nd edition, 2002): species tree versus gene tree for A, B and C over time, with duplication and speciation events (speciation - duplication).
  • Slide 18
  • Ancestor species genome. Evolutionary processes include: phylogeny; expansion (duplication, genesis); exchange (HGT); deletion (loss); selection. Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes: expansion, exchange and deletion.
  • Slide 19
  • Gene duplications are traditionally considered a major evolutionary source of new protein functions. Understanding how duplications happened and how important this evolutionary process is constitutes a key goal of genome analysis. Some examples follow.
  • Slide 20
  • Kellis et al. Nature, 2004 S. cerevisiae genome Colours reveal Duplications
  • Slide 21
  • Kellis et al. Nature, 2004. Speciation; duplication; deletion. Current content of the 2 copies; reconstruction of the ancestral organization.
  • Slide 22
  • Nature Reviews Genetics 3; 827-837 (2002); SPLITTING PAIRS: THE DIVERGING FATES OF DUPLICATED GENES
  • Slide 23
  • Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206. Original version Actual version
  • Slide 24
  • Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or above 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431, 946-957. 2004.
  • Slide 25
  • Slide 26
  • Intra-genome comparisons: simple description (genes, size distribution, base composition, nucleotides, amino acids, ...); specific genes; gene duplication; gene families; gene organization on the genome; ...
  • Slide 27
  • Inter-genome comparisons: base composition, codons, amino acids, ...; degree of conservation between genomes; orthologue determination; families (clusters) of orthologues; gene dictionary; gene conservation profiles; genome tree construction; multiple genome alignments.
  • Slide 28
  • Simple description
  • Slide 29
  • Statistics: gene number / species / domain
    Dom   mean      std       number   min   max
    E     12229.2   11306.3   27       464   45000
    A     2161.3    704.7     47       536   4540
    B     3197.6    1620.8    533      182   8702
  • Slide 30
  • G+C content
    Dom   mean   std    number   min    max
    E     39.8   9.7    28       22     63
    A     45.9   10.9   47       27.6   67.9
    B     48.2   13.7   533      16     87
  • Slide 31
  • ORF products mean size
  • Slide 32
  • Slide 33
  • Amino acid composition
  • Slide 34
  • Amino acid composition: Correspondence Analysis was used to explore relationships between species and amino acids.
  • Slide 35
  • Figure (Tekaia & Yeramian, 2006, BMC Genomics 7:307). Groups: Eukaryotes; Hyperthermophiles; Psychrophiles; Prokaryotes mesophiles; Thermophiles. Labelled species: Encephalitozoon cuniculi; Thermosynechococcus elongatus.
  • Slide 36
  • Figure (Tekaia & Yeramian, 2006, BMC Genomics 7:307); labels: GC%, growth t. Labelled species: Mycoplasma mycoides (23%); Nocardia farcinica (70%); Streptomyces coelicolor (72%); Tetrahymena thermophila (Protists); Saccharomyces; Entamoeba histolytica (Protists); Cryptosporidium hominis; Leishmania major (60%); Cyanidioschyzon merolae; Aspergillus fumigatus (50%); Homo sapiens; Methanococcus jannaschii (31%); Pyrococcus abyssi (44%); Methanopyrus kandleri (61%); Thermus thermophilus (69%); Colwellia psychrerythraea; Pseudoalteromonas haloplanktis; Encephalitozoon cuniculi; A. nidulans; A. oryzae; C. neoformans; Mus musculus; Rat; Candida glabrata.
  • Slide 37
  • A. fumigatus: area of candidate thermostable proteins.
  • Slide 38
  • Search for similarity
  • Slide 39
  • Methods: it is important to know how the algorithms that perform sequence comparisons work. There are many comparison methods; among the most used: BLAST; FASTA; Smith-Waterman algorithm (dynamic programming method); HMM (Hidden Markov Model).
  • Slide 40
  • Sequence comparisons (alignment figure): V I T K L G T C V G S aligned with V I S . . . T Q V G S and with V . S K . G T Q V . S. Terms illustrated: Identity, Similarity, Homology.
  • Slide 41
  • Comparison of 2 sequences. The aim is to find the optimal alignment: the one that best shows both the most similar regions and the less similar regions. In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology. This requires a score that evaluates matches, mismatches and gaps, and a method that evaluates the numerous possible alignments.
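
The need for a score (matches, mismatches, gaps) and for a method that evaluates the possible alignments can be illustrated with a minimal Smith-Waterman local alignment sketch. The function name and the scoring values (match = 2, mismatch = -1, gap = -2, linear gap penalty) are illustrative assumptions, not values from the course; real protein comparisons would use a substitution matrix instead of a flat match/mismatch score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Dynamic programming over a (len(a)+1) x (len(b)+1) table, kept one
    row at a time. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Open or extend a gap in either sequence.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("VITKLGTCVGS", "VISTQVGS"))
```

Flooring each cell at 0 is what makes the alignment local; dropping the floor and scoring the bottom-right cell would give a global (Needleman-Wunsch style) score instead.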
  • Slide 42
  • Identity. Refers to the occurrence of identical nucleotides or amino acids at the same position in aligned sequences. Identity is objective and well defined. Identity can be quantified as percent identity, i.e. the number of identical matches divided by the length of the aligned region.
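
As a small worked example of the definition above, the sketch below computes percent identity over an aligned region. The function name and the use of "-" as the gap character are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity: identical positions / length of the aligned region."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    # A position counts only if both residues are equal and not a gap.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    # 9 identical positions out of 11 aligned positions.
    print(round(percent_identity("VITKLGTCVGS", "VISKLGTQVGS"), 1))  # 81.8
```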
  • Slide 43
  • Similarity. Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of difference, with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices). Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region; the score depends on the chosen parameters. In contrast to homology, expressions like "significant" or "weak" similarity are often used.
  • Slide 44
  • Homology. Sequence homology implies common ancestry and sequence conservation. Homology can be inferred, under suitable conditions, from sequence similarity; the main objective of sequence similarity searches is to infer homology between sequences. Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like "significant" or "weak" homology are meaningless). Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.
  • Slide 45
  • Local alignment; Global alignment
  • Slide 46
  • Compare one query sequence to a BLAST formatted database
  • Slide 47
  • Amino acid scoring schemes (substitution matrices). All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids. As a result: what a local alignment program produces depends strongly upon the scores it uses; a scheme may implicitly represent a particular theory of evolution; the choice of a matrix can strongly influence the outcome of an analysis. The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar character pairs: $S_{ij} = \ln\left(q_{ij}/(p_i p_j)\right)/u$, where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical parameter.
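
The log-odds formula above can be sketched in a few lines. The frequencies used in the example are hypothetical round numbers chosen to make the arithmetic easy to follow, and the scale u = ln(2)/2 (half-bit units) is an assumption for illustration; none of these values are taken from a published matrix.

```python
import math


def log_odds_score(q_ij, p_i, p_j, u):
    """S_ij = ln(q_ij / (p_i * p_j)) / u, rounded to the nearest integer."""
    return round(math.log(q_ij / (p_i * p_j)) / u)


if __name__ == "__main__":
    half_bit = math.log(2) / 2  # assumed scale: scores in half-bit units
    # Pair observed 8 times more often than expected by chance: positive score.
    print(log_odds_score(0.02, 0.05, 0.05, half_bit))     # 6
    # Pair observed half as often as expected by chance: negative score.
    print(log_odds_score(0.00125, 0.05, 0.05, half_bit))  # -2
```

A pair aligned exactly as often as expected by chance (q_ij = p_i p_j) scores 0, which is why similar pairs come out positive and dissimilar pairs negative.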
  • Slide 48
  • Examples of substitution matrices
    # PAM250 substitution matrix, scale = ln(2)/3 = 0.231049
    # Expected score = -0.844, Entropy = 0.354 bits
    # Lowest score = -8, Highest score = 17
      A R N D C Q E G H I L K M F P S T W Y V B Z X *
    A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -8
    R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -8
    N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -8
    D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -8
    C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -8
    Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -8
    E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -8
    G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -8
    H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -8
    I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -8
    L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -8
    K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -8
    M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -8
    F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -8
    P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -8
    S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -8
    T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -8
    W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -8
    Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -8
    V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -8
    B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -8
    Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -8
    X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 -8
    * -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1
  • Slide 49
  • BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units
    # Cluster Percentage: >= 62
    # Lowest score = -4, Highest score = 11
      A R N D C Q E G H I L K M F P S T W Y V B Z X *
    A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
    R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
    N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
    D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
    C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
    Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
    E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
    G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
    H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
    I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
    K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
    M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
    F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
    S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
    T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
    Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
    V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
    B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
    Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
    X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
    * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
  • Slide 50
  • PAM matrices (Dayhoff et al., 1978). PAM stands for point accepted mutation. 1 PAM corresponds to 1 amino acid change per 100 residues (1 PAM ~ 1% divergence); matrices for longer distances are obtained by extrapolation. Assumptions: replacements are independent of surrounding residues; sequences being compared are of average composition; all sites are equally mutable. Sources of error: small, globular proteins were used to derive PAM matrices (departure from average composition); errors in PAM1 are magnified up to PAM250; the model does not account for conserved blocks or motifs. Strategy: PAM40: short alignments, highly similar; PAM120: average similarity; PAM250: longer, weaker local alignments.
  • Slide 51
  • BLOSUM matrices (Henikoff, S. and Henikoff, J.G., 1992). BlosumX denotes a matrix obtained from clustered sequence segments with more than X% identity. Examples: Blosum62 is obtained from clustered sequences with identity greater than 62%; Blosum80 is obtained from clustered sequences with identity greater than 80%. Which substitution matrix to choose? From less divergent to more divergent sequences: Blosum80, Blosum62, Blosum45; PAM10, PAM120, PAM250.
  • Slide 52
  • Slide 53
  • Position Specific Scoring Matrix (PSSM). Conserved motifs are identified and an amino acid profile matrix is calculated for each motif. This matrix (n x 20 amino acids) represents the relative amino acid probabilities at specific positions and is characteristic of a protein family. Such matrices are used by profile database searching programs (including PSI-BLAST and HMM-based programs).
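
How a position-specific matrix is used to score a sequence can be sketched as follows. The three-column toy matrix, the function names, and the choice to score residues missing from a column as 0 are all assumptions made for illustration; a real PSSM carries a score for each of the 20 amino acids at every position, and this is not the PSI-BLAST implementation.

```python
def score_window(pssm, window):
    """Sum the position-specific scores of the residues in `window`."""
    return sum(column.get(residue, 0) for column, residue in zip(pssm, window))


def best_match(pssm, sequence):
    """Return (start, score) of the best-scoring window in `sequence`.

    Ties are resolved in favour of the leftmost window.
    """
    n = len(pssm)
    if len(sequence) < n:
        raise ValueError("sequence shorter than the PSSM")
    scored = [
        (score_window(pssm, sequence[i:i + n]), i)
        for i in range(len(sequence) - n + 1)
    ]
    score, start = max(scored, key=lambda t: (t[0], -t[1]))
    return start, score


# Hypothetical 3-position motif: one dict of residue scores per position.
TOY_PSSM = [
    {"M": 5, "L": 2, "I": 1},
    {"S": 4, "T": 1, "A": 1},
    {"G": 6, "A": 0},
]

if __name__ == "__main__":
    print(best_match(TOY_PSSM, "AAMSGKK"))  # (2, 15): window "MSG"
```

Unlike a substitution matrix, the score of a residue here depends on where it falls in the motif, which is what makes such profiles characteristic of a protein family.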
  • Slide 54
  • Example of a PSSM matrix (determined by the PSI-BLAST program):
          A R N D C Q E G H I L K M F P S T W Y V
    1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
    2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
    3 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
    4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
    5 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
    6 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
    7 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
    8 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
    9 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
    10 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
    11 G 0 -2 0 -1 -2 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
    12 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
    13 A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
    14 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
    15 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
    16 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
    17 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
    18 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
    19 Q -1 1 0 0 -3 5 3 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
    20 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
    21 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
    22 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
    23 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
    ...
    573 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
    574 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
    575 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
  • Slide 55
  • BLAST algorithm: (1) Query sequence: build the list of high-scoring words of length w. A query sequence of length L yields a maximum of L-w+1 words (w = 3, 11, ...; typically 3 for proteins and 11 for nucleotides). List the words that score at least T using a substitution matrix (Blosum62 or PAM250, ...). (2) Compare the word list to the database (DB) sequences and identify exact matches, i.e. extract the matches of words from the word list. (3) For each word match, extend the alignment in both directions to find alignments with scores > S: the HSPs (High-scoring Segment Pairs), also called Maximal Segment Pairs (MSPs).
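
The word-list and exact-match steps of the algorithm can be sketched as below. This is a deliberately simplified illustration under stated assumptions: it indexes only the words that literally occur in the query (it does not generate neighbourhood words scoring at least T with a substitution matrix) and it omits the extension step; the function names are hypothetical and this is not the BLAST source code.

```python
from collections import defaultdict


def query_words(query, w=3):
    """Map each word of length w to its start positions in the query.

    A query of length L yields at most L - w + 1 words.
    """
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def seed_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    words = query_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(sorted(query_words("VITKLG")))      # ['ITK', 'KLG', 'TKL', 'VIT']
    print(seed_hits("VITKLG", "AAITKLGA"))    # [(1, 2), (2, 3), (3, 4)]
```

The three hits lie on the same diagonal (subject_pos - query_pos = 1), which is the kind of signal the extension step would then grow into a single high-scoring segment pair.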
  • Slide 56
  • E-values: Statistics of HSP scores are characterized by two parameters, $K$ and $\lambda$. The expected number of HSPs with score at least $S$ is given by $E = K m n e^{-\lambda S}$ (Karlin & Altschul, 1990), where $m$ and $n$ are sequence lengths; $E$ is the E-value for the score $S$. Bit scores: $S' = (\lambda S - \ln K)/\ln 2$. The E-value corresponding to a given bit score is $E = m n 2^{-S'}$ (note $mn$). P-values: the probability of finding exactly $a$ HSPs with score $\geq S$ is given by $P(a) = e^{-E} E^{a}/a!$ (Poisson distribution), where $E$ is the E-value of $S$ given by the above equation. The probability of finding zero HSPs with score $\geq S$ is $P(0) = e^{-E}$, so the probability of finding at least one such HSP is $P = 1 - e^{-E}$.
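
The relations between raw score, bit score, E-value and P-value can be checked numerically with a short sketch. The function names are hypothetical, `lam` stands for the second Karlin-Altschul parameter (lambda), and the numerical values of K and lambda in the example are illustrative assumptions; in practice they depend on the scoring system used.

```python
import math


def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P(at least one HSP with score >= S) = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)


if __name__ == "__main__":
    K, lam = 0.134, 0.318      # illustrative values only
    m, n = 300, 1_000_000      # query length, database length
    e = e_value(100, m, n, K, lam)
    bits = bit_score(100, K, lam)
    # Both routes give the same E-value.
    print(math.isclose(e, e_value_from_bits(bits, m, n), rel_tol=1e-9))  # True
    print(p_value(e))
```

For small E the P-value is almost equal to E, which is why the two are often used interchangeably when reporting highly significant hits.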
  • Slide 57
  • Slide 58
  • Slide 59
  • Large scale predicted proteome comparisons
  • Slide 60
  • The expected number of HSPs with score at least $S$ is given by $E = K m n e^{-\lambda S}$, where $m$ and $n$ are the sequence and database lengths.
  • Slide 61
  • Systematic Analysis of Completely Sequenced Organisms. In silico species-specific comparisons: degree of ancestral duplication and of ancestral conservation between pairs of species; families of paralogs (Partition-MCL); families of orthologs (Partition-MCL); distribution of orthologous families according to the three domains of life; determination of the protein dictionary (orthologs); determination of protein conservation profiles.
  • Slide 62
  • Homologs - Paralogs - Orthologs. Homologs: A1, B1, A2, B2. Paralogs: A1 vs B1 and A2 vs B2. Orthologs: A1 vs A2 and B1 vs B2. Figure: in the ancestor, a duplication gives genes A and B; speciation then gives A1 and B1 in Species-1 (S1) and A2 and B2 in Species-2 (S2); sequence analysis compares the genes of the two species.
  • Slide 63
  • Orthologs - inparalogs - outparalogs. Figure: over time, gene G is duplicated into G1 and G2 before the speciation of A and B, and G2 is duplicated again in B, giving A-G1, A-G2, B-G1, B-G2 1 and B-G2 2, related as orthologs, inparalogs or outparalogs. Sequence divergence between out-paralogs should be larger than that between orthologs and in-paralogs; orthology assignments are consistent among several genome pairs; orthologues are present in syntenic order. Heger & Ponting (2007) Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res. 17:1837-49.
  • Slide 64
  • Example Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome
  • Slide 65
  • SC vs SC
  • Slide 66
  • - Paralogs - multiple matches - Partitions/clustering
  • Slide 67
  • SC/CE, CE/SC: Orthologs
  • Slide 68
  • segmatchSCCE
  • Slide 69
  • Descriptions
  • Slide 70
  • Gene dictionary. Table: 541880 predicted proteins x 100 species
  • Slide 71
  • Protein conservation profiles (phylogenetic profiles). Table: 541880 predicted proteins x 100 species. Columns: species S 1 ... S n, grouped by domain (E, A, B); rows: proteins G i,j.
    G 1,1 100000000000000000000000000000000000000000000000
    G 2,1 111111111111111111111111111111111111111111111111
    G 3,1 111111111111111111111111111111111111111111111111
    ...
    G n1,1 000001110001000000000000000000000000000000000000
    G 1,2 000000000000000000010100000000000000000000000000
    G 2,2 000000000000000000000000000000000111000011100011
    ...
    G n2,2 111111110011111111111111011101110101111111111111
    ...
    G 1,n 011110100000000000000000001000000000000000000000
    G 2,n 011111100000000000000000000000000000000000000000
    G 3,n 011111100011111111100011011011110100111111101111
    ...
    G np,n 100110000000000000000000000000000000000000000000
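
A conservation profile of this kind can be built and compared with a few lines of code. The sketch assumes that the input is, for each protein, the set of species in which it has a significant match; the species codes, function names and the use of the Hamming distance to compare profiles are illustrative assumptions, not the author's pipeline.

```python
def conservation_profile(hit_species, species_order):
    """Binary profile: '1' if the protein has a match in that species, else '0'."""
    return "".join("1" if s in hit_species else "0" for s in species_order)


def profile_distance(p, q):
    """Hamming distance: number of species where two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


if __name__ == "__main__":
    species = ["SC", "CE", "MJ", "HI"]  # hypothetical species codes
    g1 = conservation_profile({"SC", "HI"}, species)
    g2 = conservation_profile({"SC", "CE", "HI"}, species)
    print(g1, g2, profile_distance(g1, g2))  # 1001 1101 1
```

Proteins with identical or nearly identical profiles are conserved in the same set of species, which is the property that the table above makes visible at a glance.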
  • Slide 72
  • ZYRO, KLLA, KLTH, ERGO: Duplication
  • Slide 73
  • Ancestral duplication and ancestral conservation $W_{ij}$
  • Slide 74
  • Shared orthologous genes $s_{ij}$
  • Slide 75
  • Ancestral duplication
    Dom    E      A      B
    mean   52.1   30.    38.4
    std    17.8   11.7   11.2
  • Slide 76
  • Specific and non-specific proteins. Large scale proteome comparisons allow estimation of: specific proteins (genes), i.e. proteins that have no match outside their own proteome (no homolog in other species); and non-specific proteins (genes), i.e. proteins that are conserved in at least one other species (have homologs outside their own proteome).
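
The two definitions above translate directly into a small classification step. The sketch assumes that significant matches have already been computed and are given as a mapping from each protein to the set of other species in which it has a match; the identifiers and function name are hypothetical.

```python
def split_specific(proteome, hits_outside):
    """Split a proteome into specific and non-specific proteins.

    proteome: list of protein identifiers of one species.
    hits_outside: dict mapping a protein to the set of OTHER species
        in which it has a significant match (missing key = no match).
    Returns (specific, non_specific, percent_non_specific).
    """
    specific = [p for p in proteome if not hits_outside.get(p)]
    non_specific = [p for p in proteome if hits_outside.get(p)]
    percent = 100.0 * len(non_specific) / len(proteome) if proteome else 0.0
    return specific, non_specific, percent


if __name__ == "__main__":
    proteome = ["g1", "g2", "g3", "g4"]
    hits = {"g1": {"CE"}, "g2": set(), "g4": {"CE", "MJ"}}
    print(split_specific(proteome, hits))
    # (['g2', 'g3'], ['g1', 'g4'], 50.0)
```

Note that the result depends on which species are included in the comparison: a protein counted as specific may become non-specific once a closer relative is sequenced.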
  • Slide 77
  • Specific and non-specific proportions; mean %: E 76.2; A 84.3; B 87.6
  • Slide 78
  • Species-specific genes (figure): conservation of genes, from 0 to 100%, within the same phylum and in different phyla.
  • Slide 79
  • Domain specific conservation
  • Slide 80
  • Domain specific conservation...
  • Slide 81
  • Clusters (families) of paralogues and of orthologues
  • Slide 82
  • Paralogs: Partitions. Paralogs: reciprocal significant hit proteins. 1. Partition: each non-unique protein is assigned a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order (figure examples: P7.1, P4.1, P4.2).
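
One way to obtain such partitions is sketched below. It rests on an assumption that the slide does not state explicitly: that a partition is a connected component of the graph whose edges are reciprocal significant hits. The function names and the tie-breaking used for the arbitrary order m are likewise illustrative choices.

```python
def reciprocal_pairs(hits):
    """Keep only pairs of proteins that hit each other.

    hits: set of directed (query, subject) significant matches.
    """
    return {frozenset((a, b)) for a, b in hits if a != b and (b, a) in hits}


def partitions(pairs):
    """Connected components of the reciprocal-hit graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in pairs:
        a, b = tuple(pair)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def label_partitions(groups):
    """Assign Pn.m labels: n = partition size, m = rank among partitions of size n."""
    labels, counters = {}, {}
    for group in sorted(groups, key=lambda g: (-len(g), sorted(g))):
        n = len(group)
        counters[n] = counters.get(n, 0) + 1
        for protein in group:
            labels[protein] = f"P{n}.{counters[n]}"
    return labels


if __name__ == "__main__":
    hits = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
            ("d", "e"), ("e", "d"), ("f", "a")}  # f -> a is not reciprocal
    print(label_partitions(partitions(reciprocal_pairs(hits))))
```

Proteins without any reciprocal significant hit (such as `f` in the example) receive no partition label and would be treated as unique.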
  • Slide 83
  • Paralogs: mcl clustering. Paralogs: reciprocal significant hit proteins. 2. mcl clustering: mcl clustering was performed using -log(blastp e-values) and an inflation index I=3.0 (figure examples: C4.1, C3.1, C3.2, C1.1, C1.2, C1.3, C2.1).
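
The principle of Markov clustering (alternating expansion and inflation of a column-stochastic matrix) can be illustrated with a minimal pure-Python sketch. This is not the mcl program: the self-loop weighting, the convergence test, the way clusters are read from the final matrix and the function name are simplifying assumptions, and overlapping clusters are not resolved. Edge weights are assumed to be already transformed e-values, for example -log10(e-value); the base of the logarithm is an assumption.

```python
def _normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total


def mcl(edges, inflation=3.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster an undirected weighted graph; edges: {(a, b): weight}."""
    nodes = sorted({x for pair in edges for x in pair})
    if not nodes:
        return []
    index = {x: i for i, x in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(w)
    # Self-loops: use the largest weight of the column (assumed convention).
    for i in range(n):
        m[i][i] = max(m[r][i] for r in range(n)) or 1.0
    _normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: matrix squaring spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: element-wise power strengthens strong flows.
        inflated = [[v ** inflation for v in row] for row in expanded]
        _normalize_columns(inflated)
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Rows with a non-zero diagonal are attractors; their non-zero
    # entries list the nodes they attract.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
                for i in range(n) if m[i][i] > eps}
    return [set(c) for c in clusters]


if __name__ == "__main__":
    edges = {("a", "b"): 50.0, ("b", "c"): 50.0, ("a", "c"): 50.0,
             ("d", "e"): 20.0, ("e", "f"): 20.0, ("d", "f"): 20.0}
    print(sorted(sorted(c) for c in mcl(edges)))
    # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

A higher inflation value tends to split the graph into more, smaller clusters, which is why a partition can be subdivided into several mcl clusters.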
  • Slide 84
  • Paralogs: Partitions/mcl clustering. Each protein is identified by its partition and its mcl cluster, Pn.m.Cp.q (figure examples: P7.1.C4.1, P7.1.C3.1, P4.2.C3.2, P4.2.C1.1, P4.1.C1.2, P4.1.C1.3, P4.1.C2.1).
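
The composite identifier can be built and checked as sketched below. The sketch assumes that p in Cp.q is the cluster size, by analogy with n in Pn.m; this reading is consistent with the labels listed above (for example 4 + 3 = 7 for partition P7.1) but is an interpretation, and the function names are hypothetical.

```python
import re
from collections import defaultdict

_LABEL = re.compile(r"P(\d+)\.(\d+)\.C(\d+)\.(\d+)")


def family_label(partition_label, cluster_label):
    """Concatenate a partition label (Pn.m) and a cluster label (Cp.q)."""
    return f"{partition_label}.{cluster_label}"


def parse_family_label(label):
    """Return (n, m, p, q) from a label of the form Pn.m.Cp.q."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a Pn.m.Cp.q label: {label!r}")
    return tuple(int(x) for x in match.groups())


def cluster_sizes_consistent(labels):
    """Check that cluster sizes add up to the size of their partition."""
    totals = defaultdict(int)
    for n, m, p, _q in {parse_family_label(x) for x in labels}:
        totals[(n, m)] += p
    return all(n == total for (n, _m), total in totals.items())


if __name__ == "__main__":
    labels = ["P7.1.C4.1", "P7.1.C3.1", "P4.2.C3.2", "P4.2.C1.1",
              "P4.1.C1.2", "P4.1.C1.3", "P4.1.C2.1"]
    print(family_label("P7.1", "C4.1"))      # P7.1.C4.1
    print(cluster_sizes_consistent(labels))  # True
```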
  • Slide 85
  • Paralogs: Partition and clustering of duplicated proteins. Each non-unique protein is assigned a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order. In parallel, the same set of non-unique proteins is clustered using the MCL algorithm (Markov Cluster algorithm, by Stijn van Dongen); the clustering was performed using -log(blastp e-values) and an inflation index I=3.0. Result: each protein belongs to both a partition (Pn.m) and an MCL cluster (Cp.q), which are concatenated to form the final family assignment Pn.m.Cp.q of the locus. The term singleton is assigned to loci that do not have significant matches. Reciprocal best hit proteins are considered putative paralogs.
  • Slide 86
  • Paralogs: Partitions. Paralogs: reciprocal significant hit (RSH) proteins. 1. Partition: each non-unique protein is assigned to a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order (figure examples: P7.1, P4.1, P4.2); (blastp; pam250; SEG filter; e-value