Comparative bacterial genomics
João Carlos Setubal
VBI/Virginia Tech
for EMBO course
Florianopolis, July 2008
Contents
• Tree of Life
• Basic notions of genomics
• Motivation for comparative genomics
• Whole replicon alignment: pairwise and multiple
• Gene-centric comparisons
• Orthology and Synteny
• Exercises
April 21, 2023 JC Setubal 2
Ciccarelli et al, Science, 2006
Williams, Sobral, and Dickerman, JBAC, 2007
proteobacteria
Genomes
• The entire DNA complement of a single cell
• Abstraction
– a string s in the alphabet Σ = {A, C, G, T}
– Example
CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGCGGCCGCCGGCGCCGCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCCATCTTCGCTTC
Genome sizes
• Genomes are measured in
– kb (kilo base pairs), Mb (mega), or Gb (giga)
• Viruses: |s| = [5 – 200] kb
• Bacteria: |s| = [1 – 10] Mb
• Eukaryotes: |s| = [10 Mb – 100 Gb]
• Humans: 3 Gb
• marbled lungfish: 130 Gb
T. Gregory, www.genomesize.com
Famous bacteria
• Haemophilus influenzae (1.8 Mb)
– Human pathogen, first bacterial genome to be sequenced (1995)
• Escherichia coli (4.6 Mb)
– Human pathogen and model organism (1997)
• Agrobacterium tumefaciens (6 Mb)
– Plant pathogen and biotechnology tool (2001)
What is a gene
• A small substring of s that contains information
• Bacteria generally have about 1 gene per kb
– 5 Mb genome ≈ 5,000 genes
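The rule of thumb above is simple arithmetic. A minimal sketch, assuming the slide's density of 1 gene per kb; the function name and its default are illustrative, not part of the original deck:

```python
def estimate_gene_count(genome_length_bp, genes_per_kb=1.0):
    """Rough gene count from genome length, using an assumed gene density."""
    return round(genome_length_bp / 1000 * genes_per_kb)


# A 5 Mb genome at 1 gene per kb, as on the slide.
print(estimate_gene_count(5_000_000))  # 5000
```

This is only an order-of-magnitude estimate; real gene counts come from annotation.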
>A small section of a genome
AGCTCGCGCTCCGCATCCATCCAGTAGGGTTCGGTGTCGACGAGCGTGCC
GTCCATATCCCAGAAGACGGCGGCCGGCATCGCGTGCGGAGTCAGTTCGG
TCACGGCTGACAAGTCTATCCCGGCGGCCCCGGGCCTATTCTTGAGGGAC
GGCGTCCTGACCGGTCGCCGGATGAAAGGACCAGAACGCCCCGTGACTGA
CGCGAACAGCATCCTCGGAGGGCGCATCCTCGTGGTGGCCTTCGAAGGGT
GGAACGACGCTGGCGAGGCCGCCAGCGGGGCCGTCAAGACGCTCAAGGAC
CAGCTGGATGTCGTCCCGGTCGCCGAGGTCGATCCCGAGCTGTACTTCGA
CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGACGGCCGCCGGCGCC
TCATCTGGCCGTCCGCGGAGATCCTGGGCCCAGCTCGCCCCGGCGACACC
GGCGATGCGCGCCTGGACGCCACCGGCGCCAACGCGGGCAATATCTTCCT
TCTCCTCGGCACCGAGCCGTCGCGCAGCTGGCGCAGCTTCACCGCGGAGA
TCATGGATGCGGCCCTGGCCTCCGACATCGGCGCCATCGTCTTCCTCGGT
GCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCCATCTTCGCTTC
GAGCGAGAACGCGGCCGTCCGTGCGGAGCTCGGCATCGAACGCTCTTCGT
ACGAGGGGCCGGTCGGTATCCTGAGCGCGCTCGCCGAAGGGGCGGAGGAC
GTGGGCATTCCGACCATCTCCATCTGGGCGTCGGTTCCGCACTATGTCCA
CAATGCGCCCAGCCCGAAGGCGGTGCTCGCACTGATCGACAAGCTCGAAG
AGCTGGTGAATGTCACCATCCCGCGTGGCTCGCTGGTGGAGGAGGCCACG
GCCTGGGAAGCCGGGATCGACGCGCTGGCTCTGGACGACGACGAGATGGC
TACGTACATCCAGCAGCTGGAGCAGGCACGCGACACCGTGGACTCCCCTG
AGGCCAGCGGCGAGGCGATCGCCCAGGAGTTCGAGCGCTACCTCCGCCGC
CGCGACGGCCGCGCCGGCGATGACCCCCGCCGTGGCTGACGTCACCCCCT
CTCTGCGTCCGCCGTCCTCTGTTCCCCCCGCTCGGCCTCCCCTGAGGCCG
AGGAGTCGCGCCCACATGCCGGAAACTCCTCCTTTCCTGACTTTCTGGAG
A bacterial gene
“Central Dogma” of molecular biology
• gene (DNA) → messenger (RNA) → protein (amino acids)
– DNA → RNA: transcription
– RNA → protein: translation
Proteins are 3D objects made out of a linear sequence of amino acids
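The two steps can be sketched on strings. This is a minimal illustration, assuming the input is the coding strand, the reading frame starts at position 0, and the standard genetic code applies; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA
```

The sketch ignores strand choice, alternative start codons, and ambiguous bases.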
A protein
www.berkeley.edu/.../ images/ras-rid-protein.gif
Molecular Plant-Microbe Interactions
Sugar cane pathogen
Ratoon stunting disease
Monteiro-Vitorello et al 2004
Comparative genomics
• There are currently more than 300 completely sequenced microbial genomes publicly available
• Many are of closely related species
• In a few years there will be thousands
• Why compare?
• How to do it?
Why comparative genomics?
• To understand the genomic basis of the present
– Differences in lifestyle
  • pathogen vs. nonpathogen
  • obligate vs. free-living
– Host specificity
  • animals vs. plants, plant X vs. plant Y, etc.
– In the case of pathogens: this understanding should help us in fighting disease
• To understand the past
– How organisms evolved to be what they are
Citrus canker: Xanthomonas axonopodis pathovar citri
Black rot: Xanthomonas campestris pathovar campestris
What is comparative genomics
• Assuming input is the sequence and its annotation
• There are many ways that genomes can be compared
– Different resolutions
• Whole genome
– Genome alignments
– Synteny (gene order conservation)
– Anomalous regions
• Gene-centric
– Gene families and unique genes
– Gene clustering by function
• Gene sequence variations
– Codon usage, SNPs, inDels, pseudogenes
Resolution
• Low resolution
– Scope: entire genomes
– Example event: rearrangement
• High resolution
– Scope: nucleotide sequences
– Example event: single mutation
Genome-wide evolutionary events
• Replicon rearrangements
• Gene/region duplication
• Gene/region loss
• Chromosome ↔ plasmid DNA exchange
• Lateral transfer
Copyright ©2004 by the National Academy of Sciences
Boussau, Bastien et al. (2004) Proc. Natl. Acad. Sci. USA 101, 9722-9727
Fig. 4. Net gene loss or gain throughout the evolution of the α-proteobacterial species
Example of a “multipartite genome”
Agrobacterium tumefaciens C58
Replicon structure in all completely sequenced Rhizobiaceae plus M. loti
Rows: replicon; columns: genome. Numbers are replicon size in Mbp; "-" means no such replicon.
replicon  c58   s4    k84   Retli  Rleg  Sm    Ml
1         2.84  3.73  4.00  4.38   5.06  3.65  7.04
2         2.07  1.28  2.65  0.64   0.87  1.68  0.35
3         0.54  0.63  0.39  0.51   0.68  1.35  0.21
4         0.21  0.26  0.18  0.37   0.49  -     -
5         -     0.21  0.04  0.25   0.35  -     -
6         -     0.13  -     0.19   0.15  -     -
7         -     0.08  -     0.18   0.15  -     -
Whole replicon alignments: the pairwise case
If the sequences were identical we would see
[Figure: dot plot of sequence A against sequence B]
an inversion
[Figure: dot plot with segments A, B, C, D marked on both axes]
[Figure: dot plot with segments A, B, C, D marked on both axes]
Such inversions seem to happen around the origin or terminus of replication
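A dot plot of this kind can be sketched with shared k-mers. This is a minimal illustration, not how MUMmer computes its plots; the function names, the word length k, and the toy sequences are assumptions. Forward matches fall on diagonals, and matches to the reverse complement fall on anti-diagonals, which is how an inversion shows up.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dot_plot(a, b, k):
    """Return (forward, reverse) lists of (i, j) positions where the k-mer
    starting at a[i] matches b[j] directly or as a reverse complement."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    forward, reverse = [], []
    for i in range(len(a) - k + 1):
        word = a[i:i + k]
        forward += [(i, j) for j in index.get(word, [])]
        reverse += [(i, j) for j in index.get(reverse_complement(word), [])]
    return forward, reverse


# Toy example: the middle segment of b is inverted relative to a.
left, middle, right = "ACGGTCATTG", "CCATAGGTCA", "TTGACGCTAA"
a = left + middle + right
b = left + reverse_complement(middle) + right
forward, reverse = dot_plot(a, b, k=6)
print(len(forward), "forward dots,", len(reverse), "reverse dots")
```

Plotting the forward pairs in one color and the reverse pairs in another gives the direct/reverse picture used in the alignment figures.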
Eisen JA, Heidelberg JF, White O, Salzberg SL. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 2000;1(6):RESEARCH0011
Replicon sequence comparisons
• Basic tool: MUMmer
– Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.
– Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12
• http://mummer.sourceforge.net
Xanthomonas axonopodis pv citri
E. coli K12 Promer alignment
Both are proteobacteria!
Red: direct; green: reverse
Basics of MUMmer
• It finds Maximal Unique Matches (MUMs)
• These are exact matches, at least as long as a user-specified minimum length, that occur only once in each sequence
• Exact matches found are clustered and extended (using dynamic programming)
– Result is approximate matches
• Data structure for exact match finding: suffix tree
– Difficult to build but very fast
• Nucmer and promer
– Both very fast
– O(n + #MUMs), n = total genome length
April 21, 2023 31JC Setubal
sample nucmer output (coords file)
/home/setubal/agro/comp/mummer/../../rhizogenes/v1/ctgs.fasta /home/setubal/agro/comp/mummer/../../vitis/v3/all.fasta
NUCMER
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
73024 73193 | 242351 242181 | 170 171 | 93.60 | Contig789 Contig608
220 6244 | 38759 32766 | 6025 5994 | 86.64 | Contig791 Contig604
2798 6297 | 174039 177532 | 3500 3494 | 83.31 | Contig791 Contig606
3828 6297 | 124183 126645 | 2470 2463 | 81.80 | Contig791 Contig606
4767 5392 | 551684 551059 | 626 626 | 82.11 | Contig791 Contig607
8214 8453 | 30747 30508 | 240 240 | 84.65 | Contig791 Contig604
15408 15987 | 181050 181624 | 580 575 | 86.23 | Contig791 Contig606
63864 74254 | 191954 181567 | 10391 10388 | 89.08 | Contig791 Contig604
77203 79534 | 178882 176555 | 2332 2328 | 84.35 | Contig791 Contig604
157451 158456 | 139804 140812 | 1006 1009 | 82.09 | Contig791 Contig606
157483 157800 | 58429 58110 | 318 320 | 89.13 | Contig791 Contig604
163575 166223 | 62781 60133 | 2649 2649 | 78.80 | Contig791 Contig605
166754 168442 | 49403 47716 | 1689 1688 | 85.79 | Contig791 Contig604
171247 173701 | 45005 42556 | 2455 2450 | 88.17 | Contig791 Contig604
171261 172115 | 157617 158476 | 855 860 | 86.30 | Contig791 Contig606
181828 184458 | 41748 39140 | 2631 2609 | 93.13 | Contig791 Contig604
184829 185852 | 38838 37821 | 1024 1018 | 91.61 | Contig791 Contig604
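Rows in this layout are easy to parse. The sketch below assumes exactly the pipe-separated columns shown in the sample; the `Hit` type, its field names, and the reading of S2 > E2 as a reverse-strand match are illustrative choices, not part of MUMmer.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    s1: int
    e1: int
    s2: int
    e2: int
    len1: int
    len2: int
    identity: float
    seq1_id: str
    seq2_id: str

    @property
    def is_reverse(self) -> bool:
        # In the sample rows, a reverse match has S2 greater than E2.
        return self.s2 > self.e2


def parse_coords_line(line: str) -> Optional[Hit]:
    """Parse one data row; return None for header or separator lines."""
    groups = [part.split() for part in line.split("|")]
    if [len(g) for g in groups] != [2, 2, 2, 1, 2]:
        return None
    try:
        (s1, e1), (s2, e2), (len1, len2) = ([int(x) for x in g] for g in groups[:3])
        identity = float(groups[3][0])
    except ValueError:
        return None
    return Hit(s1, e1, s2, e2, len1, len2, identity, groups[4][0], groups[4][1])


hit = parse_coords_line(
    "220 6244 | 38759 32766 | 6025 5994 | 86.64 | Contig791 Contig604"
)
print(hit.seq1_id, hit.seq2_id, hit.identity, hit.is_reverse)
```

Filtering the parsed hits by length and identity is a common first step before plotting.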
A suffix tree for BANANAS
www.somethinkodd.com/.../2006/01/suffixtree.png
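The idea behind the figure can be sketched with an uncompressed suffix trie. This is a simplification: a true suffix tree merges single-child paths and can be built in linear time, while this version takes quadratic time and space. Function names are illustrative.

```python
END = None  # sentinel key; it cannot collide with a character


def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[END] = start  # remember where this suffix begins
    return root


def find_occurrences(trie, pattern):
    """Start positions of pattern in the indexed text, in sorted order."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:  # every suffix below this node starts with pattern
        current = stack.pop()
        for key, child in current.items():
            if key is END:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("BANANAS")
print(find_occurrences(trie, "ANA"))  # [1, 3]
```

Looking up a pattern takes time proportional to its length plus the number of suffixes below it, which is what makes this family of structures attractive for exact matching.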
Proteome alignment done with LCS (top: Xcc; bottom: Xac )
Blue: BBHs that are in the LCS; dark blue: BBHs not in the LCS; red: Xac specifics; yellow: Xcc specifics
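A proteome alignment of this kind can be sketched as a longest common subsequence (LCS) over gene orders, with genes paired by bidirectional best hits (BBHs). The gene names and the BBH table below are hypothetical, and this is a textbook dynamic-programming LCS, not necessarily the procedure used for the figure.

```python
def lcs(xs, ys):
    """Longest common subsequence of two lists (classic O(len(xs)*len(ys)) DP)."""
    m, n = len(xs), len(ys)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if xs[i] == ys[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    result, i, j = [], 0, 0
    while i < m and j < n:  # trace one optimal path through the table
        if xs[i] == ys[j]:
            result.append(xs[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical gene orders and BBH pairs (Xac gene -> Xcc gene).
xcc_order = ["x1", "x2", "x3", "x4", "x5"]
xac_order = ["a1", "a3", "a2", "a4", "a6"]
bbh = {"a1": "x1", "a2": "x2", "a3": "x3", "a4": "x4"}

# Relabel Xac genes by their Xcc partner; genes without a BBH are specifics.
xac_as_xcc = [bbh[g] for g in xac_order if g in bbh]
print(lcs(xcc_order, xac_as_xcc))  # ['x1', 'x3', 'x4']
```

In this toy case x2 is a BBH that falls outside the LCS, while x5 and a6 are genome specifics.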
Whole replicon multiple alignment
• The program MAUVE
• Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004 Jul;14(7):1394-403.
RSA 493
RSA 331
Dugway
Chromosome alignment MAUVE
Genome Alignments MAUVE
How MAUVE works
• Seed-and-extend hashing
• Seeds/anchors: Maximal Multiple Unique Matches (multi-MUMs) of minimum length k
• Result: Locally collinear blocks (LCBs)
• O(G^2 n + Gn log Gn), G = # genomes, n = average genome length
Alignment algorithm
1. Find multi-MUMs
2. Use the multi-MUMs to calculate a phylogenetic guide tree
3. Find LCBs (subset of multi-MUMs; filter out spurious matches; requires minimum weight)
4. Recursive anchoring to identify additional anchors (extension of LCBs)
5. Progressive alignment (CLUSTALW) using guide tree
Gene-centric comparisons
• Homologs: genes that have the same ancestor; in general retain the same function
• Orthologs: homologs from different species (arise from speciation)
• Paralogs: homologs from the same species (arise from duplication)
– Duplication before speciation (ancient duplication)
  • Out-paralogs; may not have the same function
– Duplication after speciation (recent duplication)
  • In-paralogs; likely to have the same function
Orthologs
speciation
Out-paralogs
In-paralogs
Published April 16, 2008
10 genomes
Orthology+
Phylogeny
AG: ancestral (bellii [2], canadensis)
TG: typhus (prowazekii, typhi)
TRG: transitional (akari, felis)
SFG: spotted fever (rickettsii, conorii, sibirica)
How to find orthologs
• Desired features of ortholog clustering
– Ability to distinguish between in- and out-paralogs
  • In-paralogs should be clustered with their orthologs
– Ability to cluster genes that have the same domain architecture, rather than simply sharing just one domain
• Methods
– Phylogenetic trees
– BLAST
– MCL
– orthoMCL
OrthoMCL
• Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003 Sep;13(9):2178-89
• Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002 Apr 1;30(7):1575-84
OrthoMCL
1. BLAST all-against-all
2. weighting scheme
3. MCL algorithm
• Nota bene: orthoMCL is not perfect!
– Two or more families may be wrongly joined
– One family may be wrongly split
Li Li et al. Genome Res. 2003; 13: 2178-2189
orthoMCL pipeline
Li Li et al. Genome Res. 2003; 13: 2178-2189
OrthoMCL weighting scheme for similarity graph
(Tribe)MCL
• Enright, Van Dongen, Ouzounis [2002]
• Adaptation of MCL clustering algorithm of Van Dongen
• Markov cluster
• Simulates random walks in the graph
• Expands and inflates certain matrices until equilibrium is reached
• Expansion: matrix squaring
• Inflation: raise each entry of the expanded matrix to a power, then rescale columns so the matrix is stochastic again
• Has been reasonably validated
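The expansion/inflation loop can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: a dense, symmetric, unweighted adjacency matrix, self-loops added to every node, a fixed number of iterations instead of a convergence test, and an inflation power of 2. The function names and the way clusters are read off the final matrix are illustrative, not the TribeMCL implementation.

```python
def normalize_columns(matrix):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def expand(matrix):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(matrix, power):
    """Inflation: entrywise power, then renormalize the columns."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Return clusters as sorted lists of node indices."""
    n = len(adjacency)
    matrix = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    clusters = set()
    for row in matrix:  # each nonzero row lists the nodes it attracts
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two groups with no edges between them: a triangle and a pair.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(adjacency))  # [[0, 1, 2], [3, 4]]
```

In orthoMCL the edge weights would come from the BLAST-based weighting scheme rather than from 0/1 values.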
Gene Set Computations
• Given a set of genomes, represented by their ‘proteomes’ or sets of protein sequences
• Given homologous relationships (as given for example by orthoMCL)
– Which genes are shared by genomes X and Y?
– Which genes are unique to genome Z?
– Venn or extended Venn diagrams
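These questions map directly onto set operations. A minimal sketch, assuming each genome is represented by the set of gene-family identifiers found in it; the family names and the `venn_regions` helper are hypothetical.

```python
def venn_regions(families_by_genome):
    """Map each combination of genomes to the families found in exactly
    those genomes (one entry per non-empty Venn region)."""
    regions = {}
    all_families = set().union(*families_by_genome.values())
    for family in all_families:
        present = frozenset(
            genome for genome, families in families_by_genome.items()
            if family in families
        )
        regions.setdefault(present, set()).add(family)
    return regions


# Hypothetical family assignments for three genomes.
families = {
    "X": {"fam1", "fam2", "fam3"},
    "Y": {"fam1", "fam2", "fam4"},
    "Z": {"fam1", "fam5"},
}

shared_xy = families["X"] & families["Y"]             # shared by X and Y
unique_z = families["Z"] - families["X"] - families["Y"]  # unique to Z
regions = venn_regions(families)

print(sorted(shared_xy))  # ['fam1', 'fam2']
print(sorted(unique_z))   # ['fam5']
```

The sizes of the entries in `regions` are the numbers one would write into a Venn diagram.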
3-way genome comparison
[Figure: Venn diagram of genomes A, B, and C]
Brucella gene set computations
Joining synteny and homology
OAK: ortholog alignment for prokaryotes
[Figure: pipeline diagram with the elements Genome 1, Genome 2, ..., Genome n; Ortholog set builder (orthoMCL); Script 1; Script 2; HTML tables; graph; report; annotators]
Replicon color key for HTML tables
R/G | S4 | C58 | K84 | R. etli | R. leguminosarum | S. meliloti | M. loti MAFF | M. loti BNC
1 | 2nd chromosome | linear chromosome | 2nd chromosome | - | - | - | - | -
2 | plasmid 630kb | AT plasmid | plasmid 390kb | plasmid F 640kb | plasmid pRL12 870kb | plasmid pSymA | plasmid 1 | plasmid 1
3 | plasmid 259kb | Ti plasmid | plasmid 179kb | plasmid E 510kb | plasmid pRL11 680kb | plasmid pSymB | plasmid 2 | plasmid 2
4 | plasmid 210kb | - | plasmid 44kb | plasmid D 370kb | plasmid pRL10 490kb | - | - | plasmid 3
5 | plasmid 130kb | - | - | plasmid C 250kb | plasmid pRL9 350kb | - | - | -
6 | plasmid 79kb | - | - | plasmid A 190kb | plasmid pRL7 150kb | - | - | -
7 | - | - | - | plasmid B 180kb | plasmid pRL8 150kb | - | - | -
What do the tables show
• conserved blocks (aka “microsyntenic regions”), and how these blocks appear in different replicons across the genomes compared
• some of these blocks are not operons (would need to show strand)
• possible block losses
Polymorphism detection
• inDels, SNPs
• pseudogenes
[Figure 4: panels I and II]
Pseudogenes
• Nonfunctional protein coding genes
• Mutations introduce “sequence problems” (frameshifts, stop in frame, absence of stop)
• Natural mutation or sequencing error?
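The three "sequence problems" can be screened for with a simple check on an annotated coding sequence. This is a rough sketch under stated assumptions: the input is a coding sequence in frame 0, the three standard stop codons apply, and a length that is not a multiple of 3 is taken as a hint of a frameshift. Confirming a frameshift properly would require comparison with an intact homolog. The function name is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_problems(cds):
    """List the sequence problems found in a coding sequence (may be empty)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("stop in frame")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("absence of stop")
    return problems


print(sequence_problems("ATGGCCTAA"))     # []
print(sequence_problems("ATGTAAGCCTAA"))  # ['stop in frame']
print(sequence_problems("ATGGCCGCC"))     # ['absence of stop']
```

A check like this cannot tell a natural mutation from a sequencing error; that question needs the underlying read data.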
Pseudogene cases
Why study pseudogenes?
• “Normal” bacterial genomes have 1-5% pseudogenes [Liu et al]
• Pseudogenes can give interesting clues to evolutionary pathways
Why study pseudogenes? Cont’d
• High fractions of pseudogenes suggest a “genome degradation” process
• May be cause or effect of niche restriction
• Examples
– Mycobacterium leprae: 36% (~1,100 genes)
– Leifsonia xyli subsp. xyli: 13% (~300 genes)
• Pseudogenes do not show up in BLAST searches against annotated protein sets
– Ortholog computations will in general not include them!
Pseudogene Identification by Sequence Similarity: Study of 8 Brucella Genomes
BLASTN: Annotated Pseudogenes vs. Genome Sequences
Legend: Previously Known Pseudogene; Known Gene (Homologous to Pseudogene); Newly Identified Pseudogene
Brucella Pseudogene Analysis: Identification of New Pseudogenes by Homology
[Figure: bar chart, y axis from 0 to 600; genomes Bab9941, BabS19, Bcan23365, Bmel16M, Bab2308, Bovi25840, Bsui1330, Bsui23445; series PG Count: Initial, Tot. Alignments, Known Genes, PG Count: Final]
Total Alignments 4120 0.98
Gene hits 2627 0.62
pseudogenes 1493 0.35
Genomics is just the beginning
Genomics/proteomics
Interactions between molecules
Cell processes
complexity
Whole organisms
populations
21st century Biology: integration
Acks
• Nalvo Almeida
• Chris Lasher
• Brett Tyler
• Rebecca Wattam