Large-scale genome comparisons:
- duplication;
- conservation;
- species-specific genes (proteins);
- paralogues, orthologues;
- families (clusters) of paralogues and of orthologues;
- genome organisation (duplicated, conserved genes);
- search for shared motifs in proteins of the same cluster;
- protein conservation profiles;
- selection pressure analyses (synonymous and non-synonymous substitutions, ...).
Slide 5
Tree of life: complete genomes (http://www.genomesonline.org/, 12-06-08)
3823 projects, 819 published (figure labels: 672, 53, 94);
1848 Bacteria, 90 Archaea, 936 eukaryotes, 130 metagenomes.
3 phylogenetic domains.
Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions, ...
Slide 6
Number of available completely sequenced genomes (GOLD, 06-2008)
List and references
The number of completely sequenced genomes spanning the three domains of life is growing at a rapid rate.
Slide 7
Genome sequencing projects
Several web-based resources document the progress of genome sequencing
projects and the reference publications of completely sequenced genomes, including:
GOLD (Genomes Online Database) http://wit.integratedgenomics.com/GOLD/
Slide 8
Resources for genomes
There are two main resources for genomes:
- EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/genomes/
- NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/Genomes/
Many other resources are provided by sequencing institutions:
- Sanger (The Wellcome Trust Sanger Institute) http://www.sanger.ac.uk/
- Broad Institute http://www.broad.mit.edu/tools/data/seq.html
- Genolevures http://cbi.labri.fr/Genolevures/index.php
Slide 9
Definitions
- Genome: the genome of a cell is the complete set of DNA it contains. The genome size is the total number of its DNA bases.
- Gene: a particular DNA sequence situated at a specific position on a chromosome and coding for a specific function.
- Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that encodes it.
- Proteome: the set of proteins in an organism.
- Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization, ...
Slide 10
Chronology of completely sequenced genomes
- 1977: first viral genome (5386 base pairs; encoding 11 genes). Sanger et al. sequence bacteriophage φX174.
- 1981: human mitochondrial genome. 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA).
- 1986: chloroplast genome. 156,000 base pairs (most are 120 kb to 200 kb).
Slide 11
- 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 Kb, 1713 genes.
- 1996: first archaeal genome: Methanococcus jannaschii DSM 2661, by TIGR; 1664 Kb, 1773 genes.
- 1997: first eukaryotic genome: Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 Kb, ~6000 genes.
- 1998: first multicellular organism: the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
Slide 12
- 1999: first human chromosome: Chromosome 22 (49 Mb, 673 genes).
Slide 13
- 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).
- 2000: first plant genome: Arabidopsis thaliana (115,428 Kb; 22670 genes).
- 2001: draft sequence of the human genome (x Mb; ~28000 genes).
- 2002: Plasmodium falciparum (22.9 Mb; 5334 genes).
- 2002: mouse genome (x Mb; ~28000 genes).
- 2004: fish: draft Tetraodon nigroviridis genome (x Mb; ~28000 genes).
- 2005: dog (41 Mb, 33651 genes) and chicken (18031 genes) genomes.
Slide 14
How big are genomes?
- Viral genomes: 1 kb to 360 kb (Canarypox virus). Note: Mimivirus: 1.2 Mb. http://www.giantvirus.org/top.html (top 100 largest viral genome sequences)
- Bacterial genomes: 0.5 Mb to 13 Mb.
- Eukaryotic genomes: 8 Mb to 670 Gb.
DOGS - Database Of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/
Slide 15
Comparative genomics
Analyses of the genetic material of different species help to understand the
similarities and differences between genomes, their evolution and the evolution of their genes.
- Intra-genomic comparisons help to understand the degree of duplication (genome regions; genes), gene organization, ...
- Inter-genomic comparisons help to understand the degree of similarity between genomes and the degree of conservation between genes.
Aim: understanding gene and genome evolution.
Slide 16
Evolution
Slide 17
Figure (T.A. Brown, Genomes, 2nd edition, 2002): species tree versus gene tree
for A, B and C, shown along a time axis with duplication and speciation events
(speciation followed by duplication).
Slide 18
Figure: ancestor species genome. Evolutionary processes include:
- Expansion (duplication, genesis);
- Exchange (HGT);
- Deletion (loss);
together with phylogeny and selection.
Large-scale comparative analysis of predicted proteomes reveals significant
evolutionary processes.
Slide 19
Gene duplications are traditionally considered a major
evolutionary source of new protein functions. Understanding how
duplications happened, and how important this evolutionary
process is, is a key goal of genome analysis. Some examples follow.
Slide 20
Kellis et al., Nature, 2004. Figure: S. cerevisiae genome; colours reveal
duplications.
Slide 21
Kellis et al., Nature, 2004. Figure: speciation, duplication, deletion;
current content of the 2 copies; reconstruction of the ancestral
organization.
Slide 22
Splitting pairs: the diverging fates of duplicated genes.
Nature Reviews Genetics 3, 827-837 (2002).
Slide 23
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare
Parts. PLoS Biol 2(7): e206. Figure labels: original version; current version.
Slide 24
Genome duplication. a, Distribution of Ks values of duplicated
genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated
genes broadly belong to two categories, depending on their Ks value
being below or higher than 0.35 substitutions per site since the
divergence between the two puffer fish (arrows). b, Global
distribution of ancient duplicated genes (Ks > 0.35) in the
Tetraodon genome. The 21 Tetraodon chromosomes are represented in a
circle in numerical order and each line joins duplicated genes at
their respective position on a given pair of chromosomes. Jaillon
et al. Nature 431, 946-957. 2004.
Slide 25
Slide 26
Intra-genome comparisons
- simple description (genes, size distribution, base composition: nucleotides, amino acids, ...);
- specific genes;
- gene duplication;
- gene families;
- gene organization on the genome;
- ...
Slide 27
Inter-genome comparisons
- base composition, codons, amino acids, ...;
- degree of conservation between genomes;
- determination of orthologues; families (clusters) of orthologues;
- gene dictionary;
- gene conservation profiles;
- construction of genome trees;
- multiple alignments of genomes.
Figure: GC% and growth temperature (Tekaia & Yeramian, 2006, BMC Genomics 7:307).
Labelled species, with GC% where given: Mycoplasma mycoides (23%), Nocardia
farcinica (70%), Streptomyces coelicolor (72%), Tetrahymena thermophila
(protist), Saccharomyces, Entamoeba histolytica (protist), Cryptosporidium
hominis, Leishmania major (60%), Cyanidioschyzon merolae, Aspergillus
fumigatus (50%), Homo sapiens, Methanococcus jannaschii (31%), Pyrococcus
abyssi (44%), Methanopyrus kandleri (61%), Thermus thermophilus (69%),
Colwellia psychrerythraea, Pseudoalteromonas haloplanktis, Encephalitozoon
cuniculi, A. nidulans, A. oryzae, C. neoformans, Mus musculus, rat, Candida
glabrata.
Slide 37
Figure: A. fumigatus; area of candidate thermostable proteins.
Slide 38
Search for similarity
Slide 39
Methods: it is important to know how the algorithms that perform sequence
comparisons work. There are many comparison methods. Among the most used:
- BLAST
- FASTA
- Smith-Waterman algorithm (dynamic programming method)
- HMM (Hidden Markov Model)
Slide 40
Sequence Comparisons
V I T K L G T C V G S
V I T K L G T C V G S
V I S . . . T Q V G S
V . S K . G T Q V . S
Identity, Similarity, Homology
Slide 41
Comparison of 2 sequences
The aim is to find the optimal alignment: the one that best brings out both
the most similar regions and the regions that are less similar. In describing
sequence comparisons, three different terms are commonly used: Identity,
Similarity and Homology. Two things are needed:
- a score that evaluates matches, mismatches and gaps;
- a method that evaluates the numerous possible alignments.
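The need for a score and for a method that explores the possible alignments can be illustrated with dynamic programming. The sketch below is a minimal illustration, not any particular program's implementation: the function names and the default match, mismatch and gap values are assumptions chosen for readability, and only the best score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global (end-to-end) alignment score, Needleman-Wunsch style."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, Smith-Waterman style (scores floor at 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only difference between the two functions is the floor at zero, which lets a local alignment restart wherever the running score would become negative.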
Slide 42
Identity
- Refers to the occurrence of identical nucleotides or amino acids at the same position in aligned sequences.
- Identity is objective and well defined.
- Identity can be quantified as a percentage, i.e. the number of identical matches divided by the length of the aligned region.
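As a small illustration of the definition above, percent identity can be computed directly from two aligned strings. This is a sketch under stated assumptions: the function name is hypothetical, gaps are written as "-" and never count as matches, and the second sequence in the usage note is invented for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over an aligned region (both strings include gaps)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned region is empty")
    # A column is an identical match when both residues agree and are not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)
```

For example, `percent_identity("VITKLGTCVGS", "VISKLGTQVGS")` counts 9 identical positions out of 11 aligned columns.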
Slide 43
Similarity
- Sequence similarity takes approximate matches into account. It is meaningful only when substitutions are scored according to some measure of difference, with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices).
- Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region.
- The score depends on the chosen parameters.
- Unlike homology, similarity can be qualified: expressions such as significant or weak similarity are often used.
Slide 44
Homology
- Sequence homology denotes common ancestry and sequence conservation.
- Homology can be inferred, under suitable conditions, from sequence similarity.
- The main objective of sequence similarity searches is to infer homology between sequences.
- Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions such as significant or weak homology are meaningless).
- Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.
Slide 45
Local Alignment; Global Alignment
Slide 46
Compare one query sequence to a BLAST formatted database
Slide 47
Amino acid scoring schemes (substitution matrices)
All algorithms comparing protein sequences rely on some scheme to
score the equivalence of each of the 210 possible pairs of amino
acids. As a result:
- what a local alignment program produces depends strongly on the scores it uses;
- implicitly, a scheme may represent a particular theory of evolution;
- the choice of a matrix can strongly influence the outcome of an analysis.
The scores in the matrix are integer values that assign a positive score to
identical or similar character pairs, and a negative value to
dissimilar character pairs:
$S_{ij} = \ln(q_{ij} / (p_i p_j)) / u$
where the $q_{ij}$ are target frequencies for aligned pairs of amino acids,
$p_i$ and $p_j$ are background frequencies, and $u$ is a statistical
parameter.
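The log-odds form of a substitution score, S_ij = ln(q_ij / (p_i p_j)) / u, can be tried out numerically. The sketch below is illustrative only: the function names are hypothetical and the frequencies in the usage note are invented, not taken from any published matrix.

```python
import math


def log_odds_score(q_ij, p_i, p_j, u=1.0):
    """Log-odds score: target frequency over the product of background frequencies."""
    return math.log(q_ij / (p_i * p_j)) / u


def count_unordered_pairs(alphabet_size=20):
    """Number of distinct unordered residue pairs, identical pairs included."""
    return alphabet_size * (alphabet_size + 1) // 2
```

With made-up values q_ij = 0.02 and p_i = p_j = 0.1, the pair is observed twice as often as expected by chance, so the score is positive; with u = ln(2)/2 it rounds to 2. A pair observed less often than chance gets a negative score. `count_unordered_pairs()` gives the 210 pairs mentioned above.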
PAM matrices (Dayhoff et al., 1978)
PAM stands for point accepted mutation. 1 PAM corresponds to 1 amino acid
change per 100 residues (1 PAM ~ 1% divergence). Matrices for longer
distances are extrapolated from PAM1.
Assumptions:
- replacements are independent of surrounding residues;
- sequences being compared are of average composition;
- all sites are equally mutable.
Sources of error:
- small, globular proteins were used to derive PAM matrices (departure from average composition);
- errors in PAM1 are magnified up to PAM250;
- the model does not account for conserved blocks or motifs.
Strategy:
- PAM40: short alignments, highly similar;
- PAM120: average similarity;
- PAM250: longer, weaker local alignments.
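The extrapolation from PAM1 to longer distances amounts to multiplying the PAM1 mutation probability matrix by itself n times. The sketch below shows this on a toy 3-letter alphabet; the matrix values and function names are assumptions made for illustration and are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue is unchanged with probability 0.99 (1% change).
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_power(toy_pam1, 250)
```

Because PAM250 is obtained from 250 successive multiplications, any error in the PAM1 estimates is carried through every step, which is why such errors are said to be magnified at long distances.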
Slide 51
BLOSUM matrices (Henikoff, S., and Henikoff, J. G., 1992)
BlosumX denotes a matrix derived from sequence segments in which segments
with more than X% identity are clustered together.
Examples:
- Blosum62 is obtained by clustering sequences with identity greater than 62%.
- Blosum80 is obtained by clustering sequences with identity greater than 80%.
Which substitution matrix to choose?
Less divergent ... More divergent
Blosum80   Blosum62   Blosum45
PAM10      PAM120     PAM250
Slide 52
Slide 53
Position Specific Scoring Matrix (PSSM)
- Conserved motifs are identified, and an amino acid profile matrix is calculated for each motif.
- This matrix (n x 20 aa) represents the relative amino acid probabilities at specific positions and is characteristic of a protein family.
- Such matrices are used by the profile database searching programs (including PSI-BLAST and HMM-based programs).
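An n x 20 profile matrix of this kind can be sketched from a set of aligned motif instances. The example below is a simplified illustration, not how PSI-BLAST builds its matrices: the function names, the uniform background of 1/20, the pseudocount of 1 and the toy motif instances are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(motif_instances, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per motif position."""
    length = len(motif_instances[0])
    if any(len(s) != length for s in motif_instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in motif_instances]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            prob = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(prob / background)
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment of the motif's length."""
    return sum(row[aa] for row, aa in zip(pssm, segment))
```

With three toy instances such as "VITK", "VISK" and "VLTK", a segment that matches the motif scores higher than an unrelated one, because each position rewards the residues actually observed there.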
Blast algorithm:
(1) Query sequence: build the list of high-scoring words of length w. A
query sequence of length L contains at most L-w+1 words (w = 3, 11, ...).
List the words that score at least T using a substitution matrix (Blosum62,
PAM250, ...).
(2) Compare the word list to the database sequences and identify exact
matches (extract the matches of words from the word list).
(3) For each word match, extend the alignment in both directions to find
alignments with scores > S: High-scoring Segment Pairs (HSPs), Maximal
Segment Pairs (MSPs).
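The first BLAST step, listing the words of length w in the query and the words that score at least T against them, can be sketched as follows. This is only an illustration: the function names are hypothetical, and `toy_score` (4 for identity, -1 otherwise) is a stand-in assumption for a real substitution matrix such as Blosum62.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def query_words(query, w=3):
    """All overlapping words of length w: there are L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def toy_score(x, y):
    """Stand-in for a substitution matrix lookup (assumption, not Blosum62)."""
    return 4 if x == y else -1


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits
```

For a query of length 11 and w = 3, `query_words` returns 9 words. Lowering the threshold T enlarges the neighborhood of each word, which increases sensitivity at the cost of more database matches to extend.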
Slide 56
E-values: the statistics of HSP scores are characterized by two
parameters, K and $\lambda$. The expected number of HSPs with score at least
S is given by:
$E = K m n e^{-\lambda S}$ (Karlin & Altschul, 1990)
where m and n are the sequence lengths. E is the E-value for the score S.
Bit scores: $S' = (\lambda S - \ln K) / \ln 2$. The E-value corresponding to a
given bit score is $E = m n 2^{-S'}$ (note mn).
P-values: the probability of finding exactly a HSPs with score >= S is given by
$P(a) = e^{-E} E^a / a!$ (Poisson distribution), where E is the E-value of S
given by the above equation. The probability of finding zero HSPs with
score >= S is $P(0) = e^{-E}$, so the probability of finding at least one
such HSP is $P = 1 - e^{-E}$.
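These relations, E = K m n exp(-lambda S), the bit score (lambda S - ln K) / ln 2, and the Poisson probabilities, can be checked numerically. The sketch below uses hypothetical function names, and the values of K and lambda in the usage note are illustrative assumptions; real values depend on the scoring matrix and gap penalties.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score: S' = (lam*S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E-value from a bit score: E = m*n*2^(-S')."""
    return m * n * 2.0 ** (-bits)


def p_exactly(a, e):
    """Poisson probability of finding exactly `a` HSPs when E are expected."""
    return math.exp(-e) * e ** a / math.factorial(a)


def p_at_least_one(e):
    """P-value: probability of at least one HSP, 1 - exp(-E)."""
    return 1.0 - math.exp(-e)
```

For instance, with assumed values K = 0.041 and lambda = 0.267, a raw score of 100 for m = 250 and n = 1,000,000 gives the same E-value whether it is computed from the raw score or from the bit score. For small E, the P-value is close to E itself.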
Slide 57
Slide 58
Slide 59
Large scale predicted proteome comparisons
Slide 60
The expected number of HSPs with score at least S is given by:
$E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths.
Slide 61
Systematic Analysis of Completely Sequenced Organisms
In silico species-specific comparisons:
- degree of ancestral duplication and of ancestral conservation between pairs of species;
- families of paralogs (Partition-MCL);
- families of orthologs (Partition-MCL);
- distribution of orthologous families according to the three domains of life;
- determination of the protein dictionary (orthologs);
- determination of protein conservation profiles.
Slide 62
Homologs - Paralogs - Orthologs
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
Figure: duplication in the ancestor gives genes A and B; after speciation,
Species-1 carries A1 and B1 and Species-2 carries A2 and B2 (sequence analysis
of S1 and S2).
Slide 63
Orthologs - inparalogs - outparalogs
Figure: time axis with a duplication (G into G1 and G2), a speciation
(species A and B) and a second duplication; genes A-G1, A-G2, B-G1, B-G2 1,
B-G2 2, labelled as orthologs, inparalogs and outparalogs.
- Sequence divergence between out-paralogs should be larger than that between orthologs and in-paralogs, since out-paralogs arose from a duplication that predates the speciation;
- orthology assignments are consistent among several genome pairs;
- orthologues are present in syntenic order.
Heger & Ponting (2007) Evolutionary rate analyses of orthologs and paralogs
from 12 Drosophila genomes. Genome Res. 1837-49.
Slide 64
Example Comparing S. cerevisiae (SC) genome with C. elegans
(CE) genome
Gene Dictionary
Table: 541880 predicted proteins x 100 species
Slide 71
Protein conservation profiles (phylogenetic profiles)
Table: 541880 predicted proteins x 100 species
Columns: species S1 ... Sn, grouped by domain (E, A, B). Rows: genes Gi,j
(gene i of species j).

G1,1    100000000000000000000000000000000000000000000000
G2,1    111111111111111111111111111111111111111111111111
G3,1    111111111111111111111111111111111111111111111111
...
Gn1,1   000001110001000000000000000000000000000000000000
G1,2    000000000000000000010100000000000000000000000000
G2,2    000000000000000000000000000000000111000011100011
...
Gn2,2   111111110011111111111111011101110101111111111111
...
G1,n    011110100000000000000000001000000000000000000000
G2,n    011111100000000000000000000000000000000000000000
G3,n    011111100011111111100011011011110100111111101111
...
Gnp,n   100110000000000000000000000000000000000000000000
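A conservation profile of this kind is a string of 0s and 1s, one character per species, recording whether the protein has a match in that species. The sketch below shows how such profiles could be built and compared; the function names, the species labels and the use of Hamming distance are assumptions for illustration.

```python
def conservation_profile(species_with_hit, species_order):
    """Binary profile: '1' where the protein has a match in that species."""
    return "".join("1" if s in species_with_hit else "0" for s in species_order)


def conservation_fraction(profile):
    """Fraction of species in which the protein is conserved."""
    return profile.count("1") / len(profile)


def hamming_distance(profile_a, profile_b):
    """Number of species in which two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return sum(1 for x, y in zip(profile_a, profile_b) if x != y)
```

Proteins with identical or nearly identical profiles are present and absent in the same species, which is the usual motivation for comparing rows of such a table.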
Slide 72
Figure: duplication in ZYRO, KLLA, KLTH, ERGO.
Slide 73
Ancestral duplication and ancestral conservation $W_{ij}$
Slide 74
Shared orthologous genes $s_{ij}$
Slide 75
Ancestral duplication
        E      A      B
mean    52.1   30.    38.4
std     17.8   11.7   11.2
Slide 76
Specific and nonspecific proteins
- Specific proteins (genes) are proteins that have no match outside their own proteome (no homolog in other species).
- Non-specific proteins (genes) are proteins that are conserved in at least one other species (they have homologs outside their own proteome).
Large-scale proteome comparisons allow estimation of:
Slide 77
Specific and nonspecific proportions
         E      A      B
mean%    76.2   84.3   87.6
Slide 78
Figure: species-specific genes; gene conservation (0 to 100%) in the same
phylum and in different phyla.
Slide 79
Domain specific conservation
Slide 80
Domain specific conservation...
Slide 81
Clusters (families) of paralogues and of orthologues
Slide 82
Paralogs: Partitions
Paralogs: reciprocal significant hit proteins.
1. Partition: each non-unique protein is assigned to a partition denoted
Pn.m, where n is the number of proteins in the partition and m is an
arbitrary order (figure labels: P7.1, P4.1, P4.2).
Slide 83
Paralogs: mcl Clustering
Paralogs: reciprocal significant hit proteins.
2. mcl clustering: performed using -log(blastp e-values) and an inflation
index I=3.0 (figure labels: C4.1, C3.1, C3.2, C1.1, C1.2, C1.3, C2.1).
Slide 84
Paralogs: Partitions/mcl Clustering
Each protein is identified by its partition and its mcl cluster: Pn.m.Cp.q
(figure labels: P7.1.C4.1, P7.1.C3.1, P4.2.C3.2, P4.2.C1.1, P4.1.C1.2,
P4.1.C1.3, P4.1.C2.1).
Slide 85
Paralogs: Partition and clustering of duplicated proteins
- Each non-unique protein is assigned to a partition denoted Pn.m, where n is the number of proteins in the partition and m is an arbitrary order.
- In parallel, the same set of non-unique proteins is clustered using the MCL algorithm (Markov Cluster algorithm, by Stijn van Dongen).
- The clustering was performed using -log(blastp e-values) and an inflation index I=3.0.
- Result: each protein belongs to both a partition (Pn.m) and an MCL cluster (Cp.q), which are concatenated to form the final family assignment Pn.m.Cp.q of the locus.
- The term singleton is assigned to loci that do not have significant matches.
- Reciprocal best hit proteins are considered putative paralogs.
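The partition step and the Pn.m naming can be sketched as follows, under one explicit assumption: a partition is taken here to be a connected component of the graph whose edges are reciprocal significant hits. The slides do not state this definition, so the code is an illustration rather than the method actually used; the function name is hypothetical and the MCL step is not implemented.

```python
from collections import defaultdict


def assign_partitions(proteins, reciprocal_hits):
    """Label each non-unique protein 'Pn.m'; return (labels, singletons).

    Every protein named in `reciprocal_hits` must also appear in `proteins`.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in reciprocal_hits:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in proteins:
        groups[find(p)].append(p)

    labels, singletons = {}, []
    order = defaultdict(int)  # m: running index among partitions of size n
    for members in sorted(groups.values(), key=lambda g: (-len(g), sorted(g))):
        n = len(members)
        if n == 1:
            singletons.append(members[0])  # no significant match
            continue
        order[n] += 1
        for p in members:
            labels[p] = f"P{n}.{order[n]}"
    return labels, singletons
```

The final family name would then be obtained by appending the MCL cluster label of each protein to its partition label, giving Pn.m.Cp.q.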
Slide 86
Paralogs: Partitions
Paralogs: reciprocal significant hit (RSH) proteins.
1. Partition: each non-unique protein is assigned to a partition denoted
Pn.m, where n is the number of proteins in the partition and m is an
arbitrary order (figure labels: P7.1, P4.1, P4.2); (blastp;
pam250; SEG filter; e-value