EVE 161: Microbial Phylogenomics
Lecture 10:
UC Davis, Winter 2016 Instructors: Jonathan Eisen & Holly Ganz
Answer 2 of these. Please make your answers short.
• 1) List 4-5 steps in a “Whole Genome Shotgun Sequencing” project.
• 2) What is meant by the “Add on Costs of Sequencing”?
• 3) Explain one form of evidence used to infer lateral gene transfer and why that evidence can sometimes be misleading.
• 4) Give examples of 3 different ways to fragment genomic DNA.
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
1st Genome Sequence
Fleischmann et al. 1995
Complete Genome/Chromosome Progress
Fraser et al. 2000
NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com 799
Microbes were the first organisms on Earth and preceded animals and plants by more than 3 billion years. They are the foundation of the biosphere, from both an evolutionary and an environmental perspective [1]. It has been estimated that microbial species comprise about 60% of the Earth’s biomass. The genetic, metabolic and physiological diversity of microbial species is far greater than that found in plants and animals. But the diversity of the microbial world is largely unknown, with less than one-half of 1% of the estimated 2–3 billion microbial species identified. Of those species that have been described, their biological diversity is extraordinary, having adapted to grow under extremes of temperature, pH, salt concentration and oxygen levels.
Perhaps no other area of research has been so energized by the application of genomic technology than the microbial field. It was only five years ago that The Institute for Genomic Research (TIGR) published the first complete genome sequence for a free-living organism, Haemophilus influenzae [2]; since that first report another 27 microbial genome sequences have been published, with at least 10–20 other projects at or near completion (for details see http://www.tigr.org/tdb/mdb/mdb.html). This progress represents, on average, one completed genome sequence every two months and all indications are that this pace will continue to accelerate. Included in the first completed microbial projects are many important human pathogens, the simplest known free-living organism, ‘model’ organisms, Escherichia coli and Bacillus subtilis, thermophilic bacterial species that might represent some of the deepest-branching members of the bacterial lineage, five representatives of the archaeal domain, and the first eukaryote, Saccharomyces cerevisiae. All of the organisms that have been studied by whole-genome analysis are species that can be grown either in the laboratory or in animal cells. It is important to remember that the vast majority of microbial species cannot be cultivated at all, and these organisms, which live in microbial communities, are essential to the overall ecology of the planet. Nevertheless, the study of ‘laboratory-adapted’ microbes has had a profound impact on our understanding of the biology and the evolutionary relationships between microbial species.
Methods for whole-genome analysis
The method that was successfully used to determine the complete genome sequence of H. influenzae is a shotgun sequencing strategy (Fig. 1). Before 1995, the largest genome sequenced with a random strategy was that of bacteriophage lambda with a genome size of 48,502 base pairs (bp), completed by Sanger et al. in 1982 (ref. 3). Despite advances in DNA-sequencing technology, the sequencing of whole genomes had not progressed beyond lambda-sized clones (about 40 kbp) because of the lack of sufficient computational approaches that would enable the efficient assembly of a large number of independent random sequences into a single contig.
For the H. influenzae and subsequent projects, we have used a computational method that was developed to create assemblies from hundreds of thousands of complementary DNA sequences 300–500-bp long [4]. This approach has proved to be a cost-effective and efficient approach to sequencing megabase-sized segments of genomic DNA. This strategy does not require an ordered set of cosmids or other subclones, thus significantly reducing the overall cost per base pair of producing a finished sequence, while providing high redundancy for accuracy and minimizing the effort required to obtain the whole genome sequence. The availability of improved technologies for longer sequence lengths (more than 700 bp) reduces problems associated with repetitive elements in the final assembly.
Microbial gene finding and annotation
The identification of genes in prokaryotic genomes has advanced to the stage at which nearly all protein-coding regions can be identified with confidence. Computational gene finders using Markov modelling techniques now routinely find more than 99% of protein-coding regions [5] and RNA genes [6]. Once the protein-coding genes have been located, the most challenging problem is to determine their function. Typically, about 40–60% of the genes in a newly sequenced bacterial genome display a detectable sequence similarity to protein sequences whose function is at least tentatively known. This sequence similarity is the primary basis for assigning function to new proteins, but the transfer of functional assignments is fraught with difficulties.
To illustrate this problem, Table 1 contains an example showing the best matches in the database for a 1,344-bp gene from Mycobacterium tuberculosis at the time that the genome was being sequenced. All six of the best matches are kinases, but the specific names differ. A conservative naming strategy might use a family name that includes all six matches. Another strategy might use curated protein families (if they exist) to assign names; for example, the FGGY family named in the fourth line of Table 1 comes from the Pfam database [7], a set of 1,815 hidden Markov models based on multiple alignments. By a closer examination of the literature, one could determine which of these protein names were based on laboratory experiments and which on sequence similarity. In any case, the assignment of a function to this protein requires the expertise of a skilled biologist. The rapidly changing nature of genome databases
Microbial genome sequencing
Claire M. Fraser, Jonathan A. Eisen & Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
Complete genome sequences of 30 microbial species have been determined during the past five years, and work in progress indicates that the complete sequences of more than 100 further microbial species will be available in the next two to four years. These results have revealed a tremendous amount of information on the physiology and evolution of microbial species, and should provide novel approaches to the diagnosis and treatment of infectious disease.
© 2000 Macmillan Magazines Ltd
Fraser et al. 2000 Shotgun Sequencing
analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria [13,14]. This led to the suggestion of possible extensive gene exchanges between these species and archaeal lineages. But before one jumps to this conclusion it is important to consider the difficulties in inferring the occurrence of gene transfer. For example, the high percentage of genes with best matches to archaea in A. aeolicus and T. maritima could also be due to a high rate of evolution in the mesophilic bacteria (which would cause thermophilic and archaeal genes to have high levels of similarity despite their not having a common ancestry) or the loss of these genes from mesophilic bacteria [15]. For T. maritima, many lines of additional evidence support the assertion of gene transfer, including the observation that many of the archaeal-like genes occur in clusters in the genome, are in regions of unusual nucleotide composition, and branch in phylogenetic trees most closely to archaeal genes [14]. Most of the lines of evidence leading to assertions of horizontal gene transfer can have other causes. For example, unusual nucleotide composition can also arise from selection [16], and differences in phylogenetic trees can be caused by convergence, inaccurate alignments [17], long-branch attraction [18] or sampling of different species [19]. It is therefore important to assess the evidence carefully and to find multiple types of evidence. This has yet to be done systematically, so we believe that it is too early to assign quantitative values to the extent of gene exchange between species.
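One of the signals named in the paragraph above, regions of unusual nucleotide composition, can be illustrated with a sliding-window GC scan. This is only a minimal sketch: the function names, window size, step and z-score cutoff are illustrative assumptions and do not come from Fraser et al. As the text cautions, a flagged window is not proof of transfer, since atypical composition can also arise from selection.

```python
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc, z) for windows whose GC content deviates strongly
    from the mean over all windows. window, step and z_cutoff are
    illustrative assumptions, not values from the paper."""
    starts = range(0, max(len(genome) - window + 1, 1), step)
    scored = [(s, gc_fraction(genome[s:s + window])) for s in starts]
    values = [gc for _, gc in scored]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    return [(s, gc, (gc - mu) / sigma)
            for s, gc in scored
            if abs(gc - mu) / sigma >= z_cutoff]

# Toy "genome": an AT-rich background with one GC-rich island in the middle.
toy = "AT" * 5000 + "GC" * 500 + "AT" * 5000
flagged = atypical_windows(toy)
for start, gc, z in flagged:
    print(f"window at {start}: GC={gc:.2f}, z={z:.1f}")
```

In practice one would combine such a scan with the other lines of evidence listed above (gene clustering and phylogenetic placement) before suggesting a transfer event.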
Despite the apparent occurrence of extensive gene transfers in the history of microbes, it does seem that there might be a ‘core’ to each evolutionary lineage that retains some phylogenetic signal. The best evidence for this comes from the construction of ‘whole genome trees’ based on the presence and absence of particular homologues or orthologues in different complete genomes [20]. It is important to note that gene-content trees are averages of patterns produced by phylogeny, gene duplication and loss, and horizontal transfer; they are therefore not real phylogenetic trees. Nevertheless, the fact that these trees are very similar to phylogenetic trees of genes such as ribosomal RNA and RecA suggests that although horizontal gene transfer might be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes [21,22] and another finding that some types of gene might be more prone to gene transfer than others [23]. It therefore seems likely that horizontal gene transfer has not completely obliterated the phylogenetic signal in microbial genomes. Careful studies in which the phylogenetic trees of some of these core genes are compared across all genomes need to be done to see whether or not the core has a consistent phylogeny. Initial studies suggest that it does, at least for the major microbial groups [14].
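The idea of comparing genomes by the presence and absence of gene families can be sketched as a pairwise distance calculation. The Jaccard distance used here is an assumption chosen for simplicity; the source does not say which measure the cited ‘whole genome tree’ studies used, and the genome names and family identifiers below are hypothetical.

```python
def gene_content_distance(genes_a, genes_b):
    """Jaccard distance between two collections of gene-family identifiers:
    1 - (shared families / families present in either genome)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(genomes):
    """All-against-all distances for a dict of {genome name: gene families}."""
    names = sorted(genomes)
    return {(x, y): gene_content_distance(genomes[x], genomes[y])
            for x in names for y in names}

# Hypothetical presence/absence data (family names are made up).
toy_genomes = {
    "bacterium_1": {"fam1", "fam2", "fam3", "fam4"},
    "bacterium_2": {"fam1", "fam2", "fam3", "fam5"},
    "archaeon_1": {"fam1", "fam2", "fam6", "fam7"},
}
d = distance_matrix(toy_genomes)
print(d[("bacterium_1", "bacterium_2")])  # 3 shared of 5 families in total
print(d[("bacterium_1", "archaeon_1")])   # 2 shared of 6 families in total
```

A clustering method would then be applied to such a matrix to draw a tree. As the text notes, the result averages over phylogeny, duplication, loss and transfer, so it is not a true phylogenetic tree.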
Although our ability to resolve patterns of the relationships among microbes is still limited, analysis of the genomes of closely related species is revealing much about genome evolution [24,25]. For example, a comparison of the genomes of four chlamydial species has revealed the occurrence of frequent tandem gene duplication and gene loss, as well as large chromosomal inversions [25]. Comparisons of closely related species should also reveal much about mutation processes, codon usage and other features that evolve rapidly [16].
Design of new antimicrobial agents and vaccines
One of the expected benefits of genome analysis of pathogenic bacteria is in the area of human health, particularly in the design of more rapid diagnostic reagents and the development of new vaccines and antimicrobial agents. These goals have become more urgent with the continuing spread of antibiotic resistance in important human pathogens. Moreover, results from the whole-genome analysis of human pathogens has suggested that there are mechanisms for generating antigenic variation in proteins expressed on the cell surface that are encoded within the genomes of these organisms. These mechanisms include the following: (1) slipped-strand mispairing within DNA sequence repeats found in 5′-intergenic regions and coding sequences as described for H. influenzae [2], Helicobacter pylori [26] and M. tuberculosis [27], (2) recombination between homologous genes encoding outer-surface proteins as described for Mycoplasma genitalium [28], Mycoplasma pneumoniae [29] and Treponema pallidum [30], and (3) clonal variability in surface-expressed proteins as described for Plasmodium falciparum [31] and possibly Borrelia burgdorferi [32].
[Figure 1 panel labels: 1. Library construction: (i) isolate DNA, (ii) fragment DNA, (iii) clone DNA. 2. Random sequencing phase: (i) sequence DNA (15,000 sequences per Mb). 3. Closure phase: (i) assemble sequences, (ii) close gaps, (iii) edit, (iv) annotation. 4. Complete genome sequence.]
Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.
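The figure gives a sequencing density of 15,000 sequences per Mb. A rough sketch of what that implies for redundancy follows, using the standard coverage formula (reads × read length / genome size) and the Lander-Waterman style approximation that a base is left unsequenced with probability e^(-coverage). The read length of 500 bp is an assumption for illustration; the figure does not state it.

```python
import math

def coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size

def expected_fraction_uncovered(c):
    """Idealized approximation: with reads placed uniformly at random,
    a given base is missed with probability e^-c."""
    return math.exp(-c)

reads_per_mb = 15_000   # from Figure 1
read_length = 500       # assumed average read length in bp
c = coverage(reads_per_mb, read_length, 1_000_000)
missed = expected_fraction_uncovered(c)

print(f"coverage: {c:.1f}x")
print(f"expected uncovered fraction: {missed:.5f}")
```

Under these assumptions random sequencing alone leaves a small fraction of the genome unsampled, which is one way to see why the figure includes a separate closure phase.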
From http://genomesonline.org
Loman et al. 2012
In bacteriology, the genomic era began in 1995, when the first bacterial genome was sequenced using conventional Sanger sequencing [1]. Back then, sequencing projects required six-figure budgets and years of effort. A decade later, in 2005, the advent of the first high-throughput (or ‘next-generation’) sequencing technologies signalled a significant advance in the ease and cost of sequencing [2], delivering bacterial genome sequences in hours or days rather than months or years. High-throughput sequencing now delivers sequence data thousands of times more cheaply than is possible with Sanger sequencing. The availability of a growing abundance of platforms and instruments presents the user with an embarrassment of choice. Better still, vigorous competition between manufacturers has resulted in sustained technical improvements on almost all platforms. This means that in recent years our sequencing capability has been doubling every 6–9 months, much faster than Moore’s law.
Here, we describe the sequencing technologies themselves, examine the practicalities of producing a sequence-ready template from bacterial cultures and clinical samples, and weigh up the costs of labour and kits. We look at the types of data that are delivered by each instrument, and describe the approaches, programs and pipelines that can be used to analyse these data and thus move from draft to complete genomes.
Several high-throughput sequencing platforms are now chasing the US$1,000 human genome [3]. Given that the average bacterial genome is less than one-thousandth the size of the human genome, a back-of-the-envelope calculation suggests that a $1 bacterial genome sequence is an imminent possibility. In closing, we assess how close to reality the $1 bacterial genome actually is and explore the ways in which high-throughput sequencing might change the way that all microbiologists work.
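The back-of-the-envelope reasoning above, together with the earlier claim that capability doubles every 6–9 months, can be written out as simple arithmetic. This is a sketch only: it scales cost purely by genome size, and the 24-month doubling time used for Moore’s law is an assumed comparison figure, not a number from the article.

```python
def scaled_cost(reference_cost, size_ratio_denominator):
    """Cost of a genome that is 1/size_ratio_denominator the size of the
    reference genome, assuming cost scales linearly with size."""
    return reference_cost / size_ratio_denominator

def fold_increase(years, doubling_months):
    """Growth factor after `years` when capacity doubles every
    `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)

# US$1,000 human genome; bacterial genome about one-thousandth the size.
bacterial_cost = scaled_cost(1_000, 1_000)
print(f"bacterial genome: about ${bacterial_cost:.2f}")

# Three years of growth at the doubling times quoted in the text,
# compared with an assumed 24-month doubling for Moore's law.
for months in (6, 9, 24):
    print(f"doubling every {months} months: {fold_increase(3, months):.1f}-fold")
```

The linear scaling ignores per-sample costs such as library preparation and labour, which the article itself treats as a separate consideration.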
A variety of approaches
High-throughput sequencing platforms can be divided into two broad groups depending on the kind of template used for the sequencing reactions. The earliest, and currently most widely used, platforms depend on the production of libraries of clonally amplified templates. These are produced through amplification of immobilized libraries made from a single DNA molecule in the initial sample. More recently, we have seen the arrival of single-molecule sequencing platforms, which determine the sequence of single molecules without amplification. Within these broad categories, there is considerable variation in performance (including in throughput, read length and error rate) as well as in factors affecting usability, such as cost and run time.
Template amplification technologies. In general terms, all of the platforms that are currently on the market rely on a three-stage workflow of library preparation, template amplification and sequencing (FIG. 1). Library preparation begins with the extraction and purification of genomic DNA. Depending on the protocol, the amount of DNA required can vary from a few nanograms to tens of micrograms, meaning that success in this step depends on the ability to grow sufficient biomass. For some microorganisms, obtaining suitable DNA (in terms of quantity and quality) can prove difficult. Therefore, before using expensive reagents for library preparation and sequencing, it is advisable to confirm, by fluorometry, that DNA of sufficient quantity and quality has been obtained. However, purchasing a suitable instrument to do this adds to the costs of establishing a sequencing capability (BOX 1).
For shotgun sequencing, an initial fragmentation step is required to generate random, overlapping DNA fragments. Depending on the platform and application, these fragments can range from 150 bp to 800 bp in length; size selection either involves harvesting from agarose gels or exploits paramagnetic-bead-based technology. The selected fragments must also be sufficiently abundant to provide comprehensive and even coverage of the target genome. Two types of fragmentation are widely used: mechanical and enzymatic. Early protocols relied on mechanical methods such as nebulization or ultrasonication. Nebulization is an inexpensive method that can be easily adopted by any laboratory, but it results in large losses of input material and a broad range of fragment sizes, runs the risk of cross-contamination and cannot handle parallel processing. By contrast, ultrasonication instruments such as systems from Covaris or the Bioruptor systems from Diagenode allow parallel sample processing and minimize hands-on time and sample loss but come at a price that could be prohibitive for small laboratories. Mechanically generated fragments require repair and end-polishing before platform-specific adaptors can be ligated to
High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity
Nicholas J. Loman, Chrystala Constantinidou, Jacqueline Z. M. Chan, Mihail Halachev, Martin Sergeant, Charles W. Penn, Esther R. Robinson and Mark J. Pallen
Abstract | Here, we take a snapshot of the high-throughput sequencing platforms, together with the relevant analytical tools, that are available to microbiologists in 2012, and evaluate the strengths and weaknesses of these platforms in obtaining bacterial genome sequences. We also scan the horizon of future possibilities, speculating on how the availability of sequencing that is ‘too cheap to metre’ might change the face of microbiology forever.
NATURE REVIEWS | MICROBIOLOGY VOLUME 10 | SEPTEMBER 2012 | 599
© 2012 Macmillan Publishers Limited. All rights reserved
Loman et al. 2012 Shotgun Sequencing
from the reference genome, or when a closely related reference genome is unavailable.
De novo assembly is more informative when dealing with a new pathogen or a new strain of a well-known pathogen. Sequencing errors can have a significant impact on assembly. When platforms produce random errors, the effect of these errors on assembly can be overcome by increasing the depth of coverage. However, when errors are systematic and occur in predictable contexts (for example, in homopolymers), increasing the depth of coverage is unlikely to help, and it may be necessary to sequence the troublesome regions using an alternative technology. Very high-quality, near complete references may be obtained by a hybrid approach, such as in recent studies combining Pacific Biosciences and Illumina data [21,22].
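The contrast between random and systematic errors can be made concrete with a small binomial calculation. This sketch assumes a simple majority-vote consensus at a single site and, pessimistically, that all erroneous reads agree with each other; the error rates of 0.1 and 0.6 are illustrative assumptions, not measured platform values.

```python
from math import comb

def majority_error_prob(depth, per_read_error):
    """Probability that more than half of `depth` independent reads are
    wrong at one site, so that a majority-vote consensus is wrong.
    Simplification: every wrong read is assumed to show the same error."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )

# Random errors: each read is wrong 10% of the time (assumed rate).
# Systematic errors: in a troublesome context (for example a homopolymer)
# most reads are wrong, modelled here as a 60% per-read error rate.
for depth in (1, 5, 21):
    random_case = majority_error_prob(depth, 0.1)
    systematic_case = majority_error_prob(depth, 0.6)
    print(f"depth {depth:2d}: random {random_case:.2e}, "
          f"systematic {systematic_case:.3f}")
```

With the random error rate, the consensus error shrinks quickly as depth grows. With the context-dependent rate above one half it grows instead, which matches the text's point that extra coverage does not help in that case.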
A variety of commonly used assemblers is now available (see Supplementary information S1 (table)), ranging from the platform specific (for example, Newbler from Roche) to the more generally applicable (for example, MIRA [23], Velvet [24], and the CLC Genomics Workbench from CLC Bio).
Table 1 | Comparison of next-generation sequencing platforms

Fields for each machine: chemistry; modal read length* (bases); run time; Gb per run; current, approximate cost (US$)‡; advantages; disadvantages. Run-time numbers that were lost in extraction are marked [number missing].

High-end instruments

454 GS FLX+ (Roche)
  Chemistry: Pyrosequencing
  Modal read length* (bases): 700–800
  Run time: [number missing] hours
  Gb per run: 0.7
  Current, approximate cost (US$)‡: 500,000
  Advantages: long read lengths
  Disadvantages: appreciable hands-on time; high reagent costs; high error rate in homopolymers

HiSeq 2000/2500 (Illumina)
  Chemistry: Reversible terminator
  Modal read length* (bases): 2 × 100
  Run time: 11 days (regular mode) or [number missing] days (rapid run mode)§
  Gb per run: 600 (regular mode) or 120 (rapid run mode)§
  Current, approximate cost (US$)‡: 750,000
  Advantages: cost-effectiveness; steadily improving read lengths; massive throughput; minimal hands-on time
  Disadvantages: long run time; short read lengths; HiSeq 2500 instrument upgrade not available at time of writing (available end 2012)

5500xl SOLiD (Life Technologies)
  Chemistry: Ligation
  Modal read length* (bases): 75 + 35
  Run time: [number missing] days
  Gb per run: 150
  Current, approximate cost (US$)‡: 350,000
  Advantages: low error rate; massive throughput
  Disadvantages: very short read lengths; long run times

PacBio RS (Pacific Biosciences)
  Chemistry: Real-time sequencing
  Modal read length* (bases): 3,000 (maximum 15,000)
  Run time: [number missing] minutes
  Gb per run: 3 per day
  Current, approximate cost (US$)‡: 750,000
  Advantages: simple sample preparation; low reagent costs; very long read lengths
  Disadvantages: high error rate; expensive system; difficult installation

Bench-top instruments

454 GS Junior (Roche)
  Chemistry: Pyrosequencing
  Modal read length* (bases): 500
  Run time: [number missing] hours
  Gb per run: 0.035
  Current, approximate cost (US$)‡: 100,000
  Advantages: long read lengths
  Disadvantages: appreciable hands-on time; high reagent costs; high error rate in homopolymers

Ion Personal Genome Machine (Life Technologies)
  Chemistry: Proton detection
  Modal read length* (bases): 100 or 200
  Run time: [number missing] hours
  Gb per run: 0.01–0.1 (314 chip), 0.1–0.5 (316 chip) or up to 1 (318 chip)
  Current, approximate cost (US$)‡: 80,000 (including OneTouch and server)
  Advantages: short run times; appropriate throughput for microbial applications
  Disadvantages: appreciable hands-on time; high error rate in homopolymers

Ion Proton (Life Technologies)
  Chemistry: Proton detection
  Modal read length* (bases): up to 200
  Run time: 2 hours
  Gb per run: up to 10 (Proton I chip) or up to 100 (Proton II chip)
  Current, approximate cost (US$)‡: 145,000 + 75,000 for compulsory server
  Advantages: short run times; flexible chip reagents
  Disadvantages: instrument not available at time of writing

MiSeq (Illumina)
  Chemistry: Reversible terminator
  Modal read length* (bases): 2 × 150
  Run time: [number missing] hours
  Gb per run: 1.5
  Current, approximate cost (US$)‡: 125,000
  Advantages: cost-effectiveness; short run times; appropriate throughput for microbial applications; minimal hands-on time
  Disadvantages: read lengths too short for efficient assembly

*Average read length for a fragment-based run. ‡Approximate cost per machine plus additional instrumentation and service contract. See REF. 58. §Available only on the HiSeq 2500.
De novo assemblies can be compared using Mauve [25] or Mugsy [26], and the assemblies can be manually examined using the Tablet viewer [27]. For annotation of assemblies, Glimmer [28] works well for coding-sequence prediction, while tRNAScan-SE [29] and RNAmmer [30] work well for stable-RNA prediction. There are numerous pipelines for automatic annotation of de novo assemblies, including RAST [31], IMG/ER [32] and the IGS Annotation Engine (developed by the Institute for Genome Sciences, University of Maryland School of Medicine, USA), although care must be taken when interpreting results from such services, as the public databases used contain annotation errors that are then propagated to newly sequenced genomes [33].
For microbial applications, all of the above programs run quickly (in minutes or hours) and are not particularly processor intensive. Some workflows combine a series of programs and provide an accessible interface for microbiologists who are not bioinformatics specialists. For example, xBASE-NG provides a ‘one-stop shop’ for assembly, annotation and comparison of bacterial genome sequences [34]. Sophisticated phylogenetic analyses are more demanding and may be beyond the capability of the average research group. One particular issue when constructing bacterial whole-genome phylogenies is the clouding of phylogenetic signal by recombination events and homoplasy [35]. Algorithms such as ClonalFrame [36] and ClonalOrigin [37] take multiple whole-genome alignments as input and attempt to identify blocks of recombination. These approaches are computationally very expensive, and there is no ‘off the shelf’ solution to comparing hundreds or thousands of bacterial genomes. There is a growing
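Table 2 | The applicability of the major high-throughput sequencing platforms

Columns: example application in bacteriology; desirable characteristics; rating* for each machine (454 GS Junior‡, 454 GS FLX+‡, Ion Personal Genome Machine (318 chip)§, MiSeq||, HiSeq 2000||, 5500xl SOLiD§, PacBio RS¶).

Application: De novo sequencing of novel strains to generate a single-scaffold reference genome
  Desirable characteristics: long reads; paired-end protocol and/or long mate-pair protocol; even coverage of genome
  Ratings: 454 GS Junior !; 454 GS FLX+ !!; Ion PGM (318 chip) !; MiSeq !; HiSeq 2000 !; 5500xl SOLiD X; PacBio RS !!

Application: Rapid characterization of a novel pathogen (draft de novo assembly of a genome for a single strain)
  Desirable characteristics: total run time (library preparation plus sequencing) of under [number missing] hours; sufficient coverage of a bacterial genome in a single run
  Ratings: 454 GS Junior !; 454 GS FLX+ !!; Ion PGM (318 chip) !!; MiSeq !!; HiSeq 2000 X; 5500xl SOLiD X; PacBio RS !!

Application: Rough-draft de novo sequencing of small numbers of strains (<20) for comparative analysis of gene content
  Desirable characteristics: long or paired-end reads; high throughput; ease of library and sequencing workflow; cost-effective
  Ratings: 454 GS Junior X; 454 GS FLX+ !; Ion PGM (318 chip) !; MiSeq !!; HiSeq 2000 !!; 5500xl SOLiD !; PacBio RS !

Application: Re-sequencing of many similar strains (>50) for the discovery of single nucleotide polymorphisms and for phylogenetics
  Desirable characteristics: very high throughput; low-cost, high-throughput sequence library construction; high accuracy
  Ratings: 454 GS Junior X; 454 GS FLX+ X; Ion PGM (318 chip) !; MiSeq !; HiSeq 2000 !!; 5500xl SOLiD !; PacBio RS !

Application: Small-scale transcriptomics-by-sequencing experiments (for example, two strains under four growth conditions with two biological replicates, so 16 strains)
  Desirable characteristics: high per-isolate coverage
  Ratings: 454 GS Junior X; 454 GS FLX+ !; Ion PGM (318 chip) !; MiSeq !; HiSeq 2000 !!; 5500xl SOLiD !!; PacBio RS !!

Application: Phylogenetic profiling to genus-level using partial 16S rRNA gene amplicon sequencing
  Desirable characteristics: high coverage; long amplicon input (≥500 bp); long reads; high single-read accuracy (error rate <1%)
  Ratings: 454 GS Junior !; 454 GS FLX+ !!; Ion PGM (318 chip) !; MiSeq !!; HiSeq 2000 !; 5500xl SOLiD !; PacBio RS X

Application: Whole-genome metagenomics for the reconstruction of multiple genomes in a single sample
  Desirable characteristics: long reads or paired-end reads; very high throughput; low error rate
  Ratings: 454 GS Junior X; 454 GS FLX+ !; Ion PGM (318 chip) !; MiSeq !; HiSeq 2000 !!; 5500xl SOLiD !; PacBio RS !

*!!, particularly well suited; !, suitable; X, not suitable. ‡From Roche. §From Life Technologies. ||From Illumina. ¶From Pacific Biosciences.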
interest in alignment-free approaches for constructing bacterial phylogenies, as it is thought that these approaches may help address the computational challenges of these analyses [38].

A recurring problem with data from high-throughput sequencing is meeting the requirement, as stipulated by journals and funders, that data be lodged in the public domain. Unannotated assembled sequences can be uploaded to conventional sequence databases, such as GenBank, fairly easily. However, submission of annotated sequences can be onerous, slowing down the process of publication even further. Submission of sequence reads to short-read archives may be hampered by slow data transfer rates, and it remains uncertain how sustainable such archives will prove to be in the future. There may come a time when the easiest way to
Step 1: Get DNA
Step 2: Shotgun Sequence
Shotgun DNA Sequencing (1995-2005)
DNA target sample
SHEAR
SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.)
LIGATE & CLONE (Vector)
SEQUENCE (Primer; End Reads (Mates); 550 bp)
Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
Short read genome sequencing (2005-current)
Genomic DNA
Random fragmentation
270 bp fragments
4-8 kb fragments
molecular biology
Sequencing (Illumina)
Paired-end short insert reads (10s of millions)
Paired-end long insert reads (10s of millions)
How do we assemble this data back into a genome?
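The short-read workflow on these slides (random fragmentation, then sequencing both ends of each fragment) can be mimicked with a small simulation. This is a sketch under stated simplifications: every fragment has exactly the 270 bp insert size named on the slide, reads are error-free, and the 100-base read length and all function names are assumptions for illustration.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def simulate_paired_reads(genome, n_pairs, insert_size=270,
                          read_length=100, seed=0):
    """Pick random fragments of fixed size and read both ends.
    The second mate is reported as the reverse complement of the
    fragment's far end, as it would be read from the opposite strand."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(0, len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        mate1 = fragment[:read_length]
        mate2 = reverse_complement(fragment[-read_length:])
        pairs.append((mate1, mate2))
    return pairs

# Toy genome of random bases (hypothetical data).
base_rng = random.Random(1)
toy_genome = "".join(base_rng.choice("ACGT") for _ in range(5000))
pairs = simulate_paired_reads(toy_genome, n_pairs=1000)
print(len(pairs), "read pairs of length", len(pairs[0][0]))
```

The known distance between mates is what later allows contigs to be ordered into scaffolds; a longer `insert_size` (for example in the 4-8 kb range on the slide) would model the long-insert library.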
Step 3: Assemble
Assembly outline
Reads
Contigs
Scaffolds
Assembly algorithms, e.g. Allpaths, Velvet, Meraculous
De Bruijn Graph Assembly
De Bruijn example
“It was the best of times, it was the worst of
times, it was the age of wisdom, it was the
age of foolishness, it was the epoch of belief,
it was the epoch of incredulity, ...”
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman & Hall
Example courtesy of J. Leipzig 2010
De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho
hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw
fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe
…etc. to 10’s of millions of reads
How do we assemble?
Traditional all-vs-all assemblers fail due to immense computational resources (scales with the square of the number of reads). A million ($10^{6}$) reads requires a trillion ($10^{12}$) pairwise alignments.
De Bruijn solution: Represent the data as a graph (scales with genome size)
De Bruijn example
Step 1: Convert reads into “Kmers”
Kmer: a substring of defined length
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…etc. for all reads in the dataset
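Step 1 can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea on the slide; the function name `kmers` is a hypothetical helper, not part of any assembler.

```python
def kmers(read, k):
    """Return every substring of length k, in order of position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The five example reads from the slide
reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
for read in reads:
    print(read, "->", " ".join(kmers(read, 3)))
```

A read of length 10 yields 10 - k + 1 kmers, so 8 kmers for k=3.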
De Bruijn example
Step 2: Build a De-Bruijn graph from the kmers
the → hea → eag → age → geo → eof → ofw → fwi
sth → the → heb → ebe → bes → est → sto → tof
ast → sth → the → hea → eag → age → geo → eof
wor → ors → rst → sto → tof → oft → fti → tim
ime → mes → esi → sit → itw → twa → was → ast
…etc. for all ‘kmers’ in the dataset
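A minimal sketch of Step 2, assuming the convention drawn on the slide: each distinct kmer is one node, and an edge links two kmers that sit next to each other in some read. The names `kmers` and `build_graph` are hypothetical; real assemblers use more compact data structures.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k, in order of position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Map each kmer to the set of kmers that directly follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            graph.setdefault(node, set())   # make sure terminal kmers are nodes too
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)          # shared kmers merge into a single node
    return graph


graph = build_graph(["theageofwi", "sthebestof", "astheageof"], 3)
print(sorted(graph["the"]))  # kmers seen after "the" across all reads
```

Because identical kmers collapse into one node, the graph grows with the number of distinct kmers rather than with the number of reads.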
De Bruijn example
Step 3: Simplify the graph as much as possible:
A De Bruijn Graph
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,.... “
De Bruijn assemblies ‘broken’ by repeats longer than kmer
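The simplification step and the effect of repeats can be sketched as follows. This is an illustrative toy, not the method of any particular assembler: `build_graph` and `contigs` are hypothetical names, kmers are used as nodes, and isolated cycles are ignored for brevity.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are kmers; an edge joins kmers that are adjacent in a read."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            node = read[i:i + k]
            succ.setdefault(node, set())
            if i > 0:
                succ[read[i - 1:i - 1 + k]].add(node)
    return succ


def contigs(succ):
    """Collapse unbranched chains of kmers into contig strings."""
    pred = defaultdict(set)
    for node, nexts in succ.items():
        for nxt in nexts:
            pred[nxt].add(node)

    def simple(node):
        # exactly one way in and one way out: safe to merge through
        return len(pred[node]) == 1 and len(succ[node]) == 1

    out = []
    for node in list(succ):
        if simple(node):
            continue                      # interior of a chain, not a start point
        if not succ[node] and not pred[node]:
            out.append(node)              # kmer with no neighbours at all
        for nxt in succ[node]:
            path = [node, nxt]
            while simple(nxt):
                (nxt,) = succ[nxt]
                path.append(nxt)
            # spell the contig: first kmer plus the last letter of each later kmer
            out.append(path[0] + "".join(p[-1] for p in path[1:]))
    return sorted(out)


print(contigs(build_graph(["abcxabcy"], 3)))  # the repeat "abc" splits the assembly
print(contigs(build_graph(["abcxabcy"], 4)))  # a longer k spans the repeat
```

In this toy the repeated "abc" becomes a single branching node at k=3, so the graph cannot say which path came first; at k=4 every kmer is unique and one contig comes out.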
Drawback of De Bruijn approach
No single solution!
Break graph to produce final assembly
Step 4: Dump graph into consensus (fasta)
Kmer size is an important parameter in De Bruijn assembly
The final assembly (k=3)
wor times itwasthe foolishness
incredulity age epoch be
st wisdom
of belief
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Repeat with a longer “kmer” length
Why not always use longest ‘k’ possible?
Sequencing errors:
Read with an error: sthebentof (true sequence: sthebestof)
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmer)
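The trade-off can be checked numerically with a short sketch. The helper names `kmers` and `wrong_fraction` are hypothetical, and the comparison assumes the true and observed reads have the same length (a substitution error, as on the slide).

```python
def kmers(read, k):
    """Return every substring of length k, in order of position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_fraction(true_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the true read."""
    true_kmers = kmers(true_read, k)
    observed = kmers(observed_read, k)
    wrong = sum(t != o for t, o in zip(true_kmers, observed))
    return wrong / len(observed)


print(wrong_fraction("sthebestof", "sthebentof", 3))   # 3 of 8 kmers contain the error
print(wrong_fraction("sthebestof", "sthebentof", 10))  # the only kmer is wrong
```

A single substitution spoils at most k kmers, so with a short k most kmers from the read stay usable, while a k equal to the read length loses the whole read.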
Scaffolding
Reads → ‘De Bruijn’ assembly → Contigs
Align reads to DeBruijn contigs
Join contigs using evidence from paired end data → Scaffolds (An assembly)
“Captured” gaps caused by repeats. Represented by “NNN” in assembly
Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
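The two expectations above are easy to evaluate directly. The sketch below uses the slide's symbols as parameter names; the function name `lander_waterman` and the example numbers are illustrative assumptions, not values from the lecture.

```python
import math


def lander_waterman(L, T, G, N):
    """Return (coverage, expected number of islands, expected island size)."""
    c = N * L / G                  # coverage
    sigma = 1 - T / L              # fraction of a read not needed for overlap detection
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size


# Hypothetical project: 50,000 reads of 100 bp, 20 bp minimum overlap, 1 Mbp genome
c, islands, size = lander_waterman(L=100, T=20, G=1_000_000, N=50_000)
print(f"coverage {c:.1f}x, about {islands:.0f} islands of about {size:.0f} bp")
```

Raising coverage shrinks the expected number of islands exponentially, which is why modest increases in sequencing depth close many gaps at first and few later on.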
Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:213-224
Mis-assembled repeats
[Figure: three kinds of repeat mis-assembly, each shown as the true layout of unique segments (a, b, c, ...) and repeat copies (I, II, III, ...) next to the mis-assembled layout: collapsed tandem, excision, rearrangement]
Real life assembly is messy!
Assembly in theory
Uniform coverage, no errors, no contamination
Assembly in reality
Biased coverage (-> gaps)
Sequencing errors (-> fragmented assembly)
Chimeric reads (-> mis-joins)
Contaminant reads (-> incorrect + inflated assembly)
Worse than predicted assemblies!
Real life assembly is messy!
[Figure: theoretical vs. observed coverage. Axes: fraction of normalized coverage vs. GC% of 100 base windows; coverage (x) vs. reference position (bp)]
Genome properties can also make assembly difficult
Biased sequence composition
Example: ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC
RESULT: incomplete / fragmented assembly
High repeat content
RESULT: misassemblies / collapsed assemblies
Polyploidy
RESULT: fragmented assembly
Biased sequence abundance
RESULT: Incomplete / fragmented assembly
N50
The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities of size E or larger.
For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because contigs of 4 kb or larger (7 + 4 = 11 kb) cover at least 10 kb. (http://www.cbcb.umd.edu/research/castats.shtml)
N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater. (Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
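The definition translates into a short calculation: sort the sizes from largest to smallest and stop at the first one where the running total reaches half of the overall total. The function name `n50` is a hypothetical helper for illustration.

```python
def n50(lengths):
    """Length x such that contigs of length >= x hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# The worked example from the slide (sizes in kb)
print(n50([7, 4, 3, 2, 2, 1, 1]))
```

For the slide's example the running totals are 7 and then 11, so the loop stops at the 4 kb contig.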
Why Completeness is Important
• Improves characterization of genome features
• Gene order, replication origins
• Better comparative genomics
• Genome duplications, inversions
• Presence and absence of particular genes can be very important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies
Step 4: Closure
• Physical map information
• PCR and gap spanning
• Other sequencing data
General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
Step 5: Annotate
General Steps in Analysis of Complete Genomes
• Structural Annotation
  – Identification/prediction of genes
  – Characterization of gene features
  – Characterization of genome features
• Functional Annotation
  – Prediction of gene function
  – Prediction of pathways
  – Integration with known biological data
• Evolutionary Annotation
  – Comparative genomics
Structural Annotation I: Genes in Genomes
• Protein coding genes.
  – In long open reading frames
  – ORFs interrupted by introns in eukaryotes
  – Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome
• RNA-only genes
  – Transfer RNA
  – ribosomal RNA
  – snoRNAs (guide ribosomal and transfer RNA maturation)
  – intron splicing
  – guiding mRNAs to the membrane for translation
  – gene regulation (this is a growing list)
Structural Annotation II: Other Features to Find
• Gene control sequences
  – Promoters
  – Regulatory elements
• Transposable elements, both active and defective
  – DNA transposons and retrotransposons
  – Many types and sizes
• Other repeated sequences.
  – Centromeres and telomeres
  – Many with unknown (or no) function
• Unique sequences that have no obvious function
Bacterial / Archaeal Protein Coding Genes
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.
  – Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
  – Stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.
  – But a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon
  – however, some aren’t recognizable
  – genes in operons don’t always have a separate ribosome binding site for each gene
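The start and stop codon rules above are enough for a naive ORF scan. The sketch below is an assumption-laden toy, not a gene finder: it scans the forward strand only, takes the first start codon after each stop (the longest candidate, which the slide warns may not be the real start), and the names `find_orfs` and `min_codons` are hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}   # main bacterial start codons named on the slide
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i                          # first start after the last stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                       # stop codons are treated as absolute
    return orfs


print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))
```

A practical pipeline would also scan the reverse complement and then rank candidate starts using ribosome binding sites, homology, and composition, as the surrounding slides describe.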
Composition Methods
• The frequency of various codons is different in coding regions as compared to non-coding regions.
  – This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used
  – Well documented experimentally.
• The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present
  – horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.
  – A second group with unusual compositions are highly expressed genes.
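The composition measures named above can be tabulated with a few lines of Python. This is a minimal sketch with hypothetical helper names (`gc_content`, `word_frequencies`); it only counts, and leaves the statistical comparison of coding against non-coding regions to a real gene finder.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def word_frequencies(seq, width=3):
    """In-frame word frequencies: width=3 gives codons, width=6 gives dicodons."""
    seq = seq.upper()
    words = [seq[i:i + width] for i in range(0, len(seq) - width + 1, 3)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


orf = "ATGAAAAAATAA"
print(gc_content(orf))
print(word_frequencies(orf))            # codon frequencies
print(word_frequencies(orf, width=6))   # dicodon frequencies
```

Windows always advance by one codon (3 bases), so dicodons overlap and stay in the reading frame.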
Eukaryotic Genes Harder to Find
• Some fundamental differences between prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
  – First step: find repeated sequences and RNA genes
  – Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.
• Most eukaryotic genes are split into exons and introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation starts at the first ATG in the mRNA
  – thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.
• Many fewer eukaryotic genomes have been sequenced
Exons
• Exon sequences can often be identified by sequence conservation, at least roughly.
• Dicodon statistics, as used for prokaryotes, are also useful
  – eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.
• The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision trees, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.
  – In general, sites are more likely to be correct if predicted by multiple methods
  – Experimental data from ESTs can be very helpful here.
How to Find ncRNAs
• The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.
• One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily
• Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.
• Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.
• The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted
• Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.
• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
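The covariation idea can be sketched as a test on two columns of an alignment of RNA sequences. Everything here is an illustrative assumption: the function name `covarying`, the choice to allow G-U wobble pairs, and the requirement that the alignment has no gaps at the two positions.

```python
# Watson-Crick pairs plus the G-U wobble pair (allowing wobble is an assumption)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covarying(alignment, i, j):
    """True if columns i and j can pair in every sequence but the bases differ between sequences."""
    pairs = [(seq[i], seq[j]) for seq in alignment]
    return all(p in PAIRS for p in pairs) and len(set(pairs)) > 1


# One species has A...U at the ends of the stem, the other has G...C
print(covarying(["AGGGU", "GGGGC"], 0, 4))
```

Columns that keep pairing while the bases themselves change are evidence for a conserved stem; columns that never change pair equally well but carry no covariation signal.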
RNA Structure
• RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: computationally intense and not subject to the normal situation that RNA base pairing follows a nested pattern
  – But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
  – There is an element of dynamic programming here as well.
  – And, “stochastic context-free grammars”, something I really don’t want to approach right now!
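The "start short, build to larger" dynamic programming idea can be illustrated with the Nussinov algorithm. Note the swap: it maximizes the number of nested base pairs rather than minimizing free energy, so it is a simplified stand-in for what real folding programs compute, and like them it cannot represent pseudoknots. The names `can_pair`, `max_pairs`, and `min_loop` are hypothetical.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U pair mentioned on the slide."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop bases inside any hairpin."""
    n = len(rna)
    best = [[0] * n for _ in range(n)]       # best[i][j]: answer for rna[i..j]
    for span in range(min_loop + 1, n):      # grow from short subsequences to long ones
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]           # option 1: base j stays unpaired
            for k in range(i, j - min_loop): # option 2: base j pairs with base k
                if can_pair(rna[k], rna[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_pairs("GGGAAAUCC"))  # a small hairpin: three stacked pairs around an AAA loop
```

Energy-based folding uses the same table-filling pattern but scores stacks, loops, and bulges with measured free energies instead of counting pairs.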
Finding tRNAs
• tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
• Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.
• In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
Step 6: Analyze
means that database searches must be repeated regularly to keep annotation accurate and up to date.
One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.
What have we learned from genome analysis?
Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.
Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes8. Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.
Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized10–12. For example, an
Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database
Gene ID | Similarity (%) | Length (bp) | Gene name (organism) | E-value*
GP:2905647 | 44.8 | 1,191 | D-Arabinitol kinase (Klebsiella pneumoniae) | 6.2e-15
EGAD:22614 | 46.2 | 1,191 | Gluconokinase (Bacillus subtilis) | 1.4e-13
EGAD:20418 | 43.0 | 1,302 | Xylulose kinase (Lactobacillus pentosus) | 4.8e-13
EGAD:105114 | 43.4 | 1,320 | Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus) | 4.7e-12
GP:2895855 | 42.7 | 1,263 | Xylulokinase (Lactobacillus brevis) | 1.0e-07
EGAD:10899 | 45.4 | 1,296 | Xylulose kinase (Escherichia coli) | 2.1e-06
*E-value is a statistical measure of the significance of a BLAST search result.
Table 2 Genome features from 24 microbial genome sequencing projects
Organism | Genome size (Mbp) | No. of ORFs (% coding) | Unknown function | Unique ORFs
Aeropyrum pernix K1 | 1.67 | 1,885 (89%) | |
A. aeolicus VF5 | 1.50 | 1,749 (93%) | 663 (44%) | 407 (27%)
A. fulgidus | 2.18 | 2,437 (92%) | 1,315 (54%) | 641 (26%)
B. subtilis | 4.20 | 4,779 (87%) | 1,722 (42%) | 1,053 (26%)
B. burgdorferi | 1.44 | 1,738 (88%) | 1,132 (65%) | 682 (39%)
Chlamydia pneumoniae AR39 | 1.23 | 1,134 (90%) | 543 (48%) | 262 (23%)
Chlamydia trachomatis MoPn | 1.07 | 936 (91%) | 353 (38%) | 77 (8%)
C. trachomatis serovar D | 1.04 | 928 (92%) | 290 (32%) | 255 (29%)
Deinococcus radiodurans | 3.28 | 3,187 (91%) | 1,715 (54%) | 1,001 (31%)
E. coli K-12-MG1655 | 4.60 | 5,295 (88%) | 1,632 (38%) | 1,114 (26%)
H. influenzae | 1.83 | 1,738 (88%) | 592 (35%) | 237 (14%)
H. pylori 26695 | 1.66 | 1,589 (91%) | 744 (45%) | 539 (33%)
Methanobacterium thermoautotrophicum | 1.75 | 2,008 (90%) | 1,010 (54%) | 496 (27%)
Methanococcus jannaschii | 1.66 | 1,783 (87%) | 1,076 (62%) | 525 (30%)
M. tuberculosis CSU#93 | 4.41 | 4,275 (92%) | 1,521 (39%) | 606 (15%)
M. genitalium | 0.58 | 483 (91%) | 173 (37%) | 7 (2%)
M. pneumoniae | 0.81 | 680 (89%) | 248 (37%) | 67 (10%)
N. meningitidis MC58 | 2.24 | 2,155 (83%) | 856 (40%) | 517 (24%)
Pyrococcus horikoshii OT3 | 1.74 | 1,994 (91%) | 859 (42%) | 453 (22%)
Rickettsia prowazekii Madrid E | 1.11 | 878 (75%) | 311 (37%) | 209 (25%)
Synechocystis sp. | 3.57 | 4,003 (87%) | 2,384 (75%) | 1,426 (45%)
T. maritima MSB8 | 1.86 | 1,879 (95%) | 863 (46%) | 373 (26%)
T. pallidum | 1.14 | 1,039 (93%) | 461 (44%) | 280 (27%)
Vibrio cholerae El Tor N1696 | 4.03 | 3,890 (88%) | 1,806 (46%) | 934 (24%)
Total | 50.60 | 52,462 (89%) | 22,358 (43%) | 12,161 (23%)
Experimental evidence from studies of clinical isolates of somespecies has demonstrated phenotypic variation in the relevant cell-surface proteins33, suggesting that, at least for human pathogens,the evolution of antigenic proteins probably occurs in real time, ascell populations divide. The ability of human pathogens to alter theirantigenic potential and thereby evade the immune system has thepotential to hinder vaccine development by conventional methods.
Progress during the past year has supported the idea that complete genome sequence information can be exploited in thedesign of new vaccines and antimicrobial compounds. As an exam-ple, the identification of new vaccine candidates against serogroup BNeisseria meningitidis (MenB) was reported by Pizza et al. using agenomics-based approach34 (Fig. 2). With the use of the entiregenome sequence of a virulent serogroup B strain35 , 570 putative cell-surface-expressed or secreted proteins were identified; the corresponding DNA sequences were cloned and expressed in E. coli.Of the putative targets, 61% were expressed successfully and used toimmunize mice. Immune sera were screened for bactericidal activityand for the ability to bind to the surface of MenB cells. Seven repre-sentative proteins were selected for further study and were evaluatedfor their degree of sequence variability among multiple isolates andserogroups of N. meningitidis. Two highly conserved vaccine candidates emerged from this large-scale screening effort, whichoccurred in parallel with the completion of the genome sequence ofN. meningitidis. These results provide the first definitive demonstra-tion of the potential of genomic information to expand and acceler-ate the development of vaccines against pathogenic organisms.
Another example illustrates the potential of genomics to accelerate the development of novel antimicrobial agents. Jomaa et al. (ref. 36) identified two genes in P. falciparum from sequence data from the malaria genome consortium that encode key enzymes in the 1-deoxy-D-xylulose-5-phosphate (DOXP) pathway that are required for the synthesis of isoprenoids such as cholesterol (ref. 37). The DOXP pathway functions in some bacteria, algae and higher plants to produce isopentenyl diphosphate, a precursor of isoprenoids. In P. falciparum, the enzymes of the DOXP pathway are probably associated with a specialized organelle derived from algae called the apicoplast; they are expressed when the parasite is growing within red blood cells. Inhibitors of one of the key enzymes in the DOXP pathway, DOXP reductoisomerase, had previously been identified and had been shown to inhibit the bacterial enzyme and the growth of some bacterial species. Jomaa et al. (ref. 36) demonstrated that two inhibitors of DOXP reductoisomerase, fosmidomycin and FR900098, were able to inhibit the growth of P. falciparum in vitro and cure mice infected with a related species of Plasmodium. Both of these compounds exhibit low toxicity and high stability and are relatively inexpensive to produce, suggesting that they might be the basis of a potentially important new class of anti-malarial drugs.
Conclusions
So far, studies in genomics have only scratched the surface of microbial diversity and have revealed how little is known about microbial species. In the next few years, more than 100 projects for sequencing microbial genomes should be completed, providing the scientific community with information on more than 300,000 predicted genes. A significant number of these genes will be novel and of unknown function. These novel genes represent exciting new opportunities for future research and potential sources of biological resources to be explored and exploited. The benefits of comparative genomics in understanding biochemical diversity, virulence and pathogenesis, and the evolution of species have been unequivocally demonstrated and the usefulness of comparative techniques will improve as more genomes become available. One of the major challenges is to develop techniques for assessing the function of novel genes on a large scale and integrating information on how genes and proteins interact at the cellular level to create and maintain a living organism. It is not unreasonable to expect that, by expanding our understanding of microbial biology and biodiversity, great strides can be made in the
Figure 2 Diagram depicting how complete microbial genome sequence data can accelerate vaccine development.
[Figure panels: selection of vaccine targets from the N. meningitidis genome (all potential antigens; genome axis ticks from 1 to 800,000; a total of 570 putative secreted proteins or surface proteins); protein expression (a total of ~350 recombinant proteins expressed in E. coli and used to immunize mice); immune sera screening (bactericidal activity; binding to surface of MenB cells); seven proteins selected for follow-up based on high titres; final candidate selection (two proteins were found to exhibit no sequence variability, leading to clinical trials). Timescale labels in the figure: hours, few months, 3–12 months.]
LGT
Functional Annotation
Functional Classification I: GO
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (Ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
  – For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”.
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
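The difference between a tree and a directed acyclic graph can be sketched in a few lines of Python. This is only an illustration: the hierarchy below is a simplified toy (intermediate levels are skipped, and the most specific term is invented for the example), not an excerpt of the real GO graph.

```python
# Toy GO-style directed acyclic graph, stored as child -> list of parents.
# Simplified for illustration; the last term is a made-up example term.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "ATP binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "ATP-dependent kinase (toy term)": ["ATP binding", "kinase activity"],
}


def ancestors(term, parents=PARENTS):
    """Return every less specific term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    # The toy term has two parents, which a strict tree would not allow.
    print(sorted(ancestors("ATP-dependent kinase (toy term)")))
```

Because a term can have several parents, annotating a gene product with a specific term implicitly annotates it with every ancestor along all of those paths.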
Functional Classification II: Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
  – Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.
  – Often, two or more gene products in a genome will have the same E.C. number.
• Enzyme functions are given unique numbers by the Enzyme Commission.
  – E.C. numbers are four integers separated by dots. The left-most number is the least specific.
  – For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:
    • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
    • EC 3.4 are hydrolases that act on peptide bonds
    • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
    • EC 3.4.11.4 are those that cleave off the amino-terminal amino acid from a tripeptide
• Top level E.C. numbers:
  – E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
  – E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.
  – E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
  – E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
  – E.C. 5: isomerases: rearrangements of atoms within a molecule
  – E.C. 6: ligases: joining two molecules using energy from ATP
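The nesting of E.C. numbers can be made concrete with a short Python sketch. The function names are hypothetical helpers written for this example; only the six top-level classes listed above are encoded.

```python
import re

# Top-level classes as listed on the slide.
TOP_LEVEL = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}

# Accepts "EC 3.4.11.4", "E.C. 3.4.11.4" or a bare "3.4.11.4".
_EC_PATTERN = re.compile(r"(?:E\.?C\.?)?\s*(\d+(?:\.\d+){0,3})")


def ec_levels(ec):
    """Return the nested prefixes of an E.C. number, least specific first."""
    match = _EC_PATTERN.fullmatch(ec.strip())
    if match is None:
        raise ValueError(f"not an E.C. number: {ec!r}")
    parts = match.group(1).split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ec_class(ec):
    """Return the top-level class name for an E.C. number."""
    return TOP_LEVEL[int(ec_levels(ec)[0])]


if __name__ == "__main__":
    print(ec_levels("EC 3.4.11.4"))
    print(ec_class("EC 3.4.11.4"))
```

Two enzymes that share a longer prefix catalyse more similar reactions, which is why the left-most number is the least specific.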
Functional Prediction
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).
• Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.
  – Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
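The pathway argument in the last bullet can be sketched as a simple gap check. This is a minimal illustration, assuming a hand-written list of glycolytic enzymes and an arbitrary 80% threshold; neither comes from the slides, and a real analysis would use a curated pathway database.

```python
# Ten enzymes of glycolysis, written out by hand for this example.
GLYCOLYSIS = [
    "hexokinase",
    "glucose-6-phosphate isomerase",
    "phosphofructokinase",
    "fructose-bisphosphate aldolase",
    "triosephosphate isomerase",
    "glyceraldehyde 3-phosphate dehydrogenase",
    "phosphoglycerate kinase",
    "phosphoglycerate mutase",
    "enolase",
    "pyruvate kinase",
]


def pathway_report(annotated, pathway, threshold=0.8):
    """Return (fraction present, missing steps worth a closer look).

    `threshold` is an assumed cut-off: below it the pathway is treated
    as absent and no missing steps are suggested.
    """
    missing = [enzyme for enzyme in pathway if enzyme not in annotated]
    fraction = (len(pathway) - len(missing)) / len(pathway)
    return fraction, (missing if fraction >= threshold else [])


if __name__ == "__main__":
    annotated = set(GLYCOLYSIS) - {"enolase"}
    print(pathway_report(annotated, GLYCOLYSIS))
```

The missing steps returned here are candidates for a more sensitive search, not proof that the genes exist.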
Functional Prediction II: Membrane Spanning
• Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
  – There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions.
  – There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.
• There are 2 main protein structures that cross membranes.
  – Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.
  – Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let small hydrophilic molecules (including water) diffuse across the bacterial outer membrane.
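One common way to look for membrane-spanning alpha helices is a sliding-window hydropathy plot. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a 1.6 cut-off; the slides do not name a method, so the scale, window and threshold are assumptions for this example, and the test sequence is synthetic.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


if __name__ == "__main__":
    # Synthetic sequence: charged flanks around a 19-residue hydrophobic core.
    seq = "DDKKEE" * 3 + "LIVALLIVGALLIVAFLLI" + "KKDDEE" * 3
    print(candidate_tm_windows(seq))
```

This approach does not detect beta barrels, whose membrane-facing residues alternate with polar ones and so do not form long hydrophobic stretches.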
Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• All are improved by “phylogenomic” type analyses that integrate evolutionary reconstructions with an understanding of how new functions evolve
Functional Prediction
• Identification of motifs
  – Short regions of sequence similarity that are indicative of general activity
  – e.g., ATP binding
• Homology/similarity based methods
  – Gene sequence is searched against a database of other sequences
  – If significantly similar genes are found, their functional information is used
• Problem
  – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
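Motif identification can be as simple as a pattern search. As an illustration of the ATP-binding example, the sketch below scans for the Walker A (P-loop) consensus [AG]-x(4)-G-K-[ST]; the choice of this motif and the synthetic test sequence are assumptions made for the example.

```python
import re

# Walker A / P-loop consensus: [AG], any four residues, G, K, then S or T.
WALKER_A = re.compile(r"[AG].{4}GK[ST]")


def find_motif(seq, pattern=WALKER_A):
    """Return (start, matched text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    # Synthetic sequence, not a real protein.
    print(find_motif("MKTLLAGPPGSGKSTLARMV"))
```

A hit like this suggests nucleotide binding in general; it does not say which of the many P-loop protein families the gene belongs to, which is the problem the last bullet describes.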
Helicobacter pylori
H. pylori genome - 1997
“The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”
MutL ??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins. Tip labels are species abbreviations (Aquae, Trepa, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Borbu, Strpy, Bacsu, Synsp, Ecoli, Neigo, Thema, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy, mSaco); several species appear more than once.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
[Figure: the MutS family tree with clades labelled by subfamily: MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5 and MSH6.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Overlaying Functions onto Tree
[Figure: the MutS family tree with subfamily labels (MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5, MSH6), onto which known functions are overlaid.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus
• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over
Functional Prediction Using Tree
[Figure: the MutS family tree with each subfamily clade labelled by function:
MutS1 - Bacterial Mismatch and Loop Repair
MSH1 - Mitochondrial Repair
MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
MSH3 - Nuclear Repair Of Loops
MSH6 - Nuclear Repair Of Mismatches
MSH4 - Meiotic Crossing Over
MSH5 - Meiotic Crossing Over
MutS2 - Unknown Functions
The H. pylori sequence (Helpy) is among the tip labels.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Table 3. Presence of MutS Homologs in Complete Genome Sequences

Species                                   # of MutS homologs  Which subfamilies?  MutL homologs
Bacteria
Escherichia coli K12                      1                   MutS1               1
Haemophilus influenzae Rd KW20            1                   MutS1               1
Neisseria gonorrhoeae                     1                   MutS1               1
Helicobacter pylori 26695                 1                   MutS2               -
Mycoplasma genitalium G-37                -                   -                   -
Mycoplasma pneumoniae M129                -                   -                   -
Bacillus subtilis 168                     2                   MutS1,MutS2         1
Streptococcus pyogenes                    2                   MutS1,MutS2         1
Mycobacterium tuberculosis                -                   -                   -
Synechocystis sp. PCC6803                 2                   MutS1,MutS2         1
Treponema pallidum Nichols                1                   MutS1               1
Borrelia burgdorferi B31                  2                   MutS1,MutS2         1
Aquifex aeolicus                          2                   MutS1,MutS2         1
Deinococcus radiodurans R1                2                   MutS1,MutS2         1
Archaea
Archaeoglobus fulgidus VC-16, DSM4304     -                   -                   -
Methanococcus jannaschii DSM 2661         -                   -                   -
Methanobacterium thermoautotrophicum ΔH   1                   MutS2               -
Eukaryotes
Saccharomyces cerevisiae                  6                   MSH1-6              3+
Homo sapiens                              5                   MSH2-6              3+
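Read together with the subfamily functions listed earlier, Table 3 suggests a simple rule of thumb: a bacterial or archaeal genome is a plausible candidate for mismatch repair only if it encodes a MutS1-type protein and a MutL homolog. The sketch below encodes a few rows of the table; the rule itself is a simplification made for this example.

```python
# (MutS subfamilies present, MutL homolog present), copied from Table 3.
GENOMES = {
    "Escherichia coli K12": ({"MutS1"}, True),
    "Helicobacter pylori 26695": ({"MutS2"}, False),
    "Bacillus subtilis": ({"MutS1", "MutS2"}, True),
    "Mycoplasma genitalium G-37": (set(), False),
    "Methanobacterium thermoautotrophicum": ({"MutS2"}, False),
}


def likely_has_mmr(subfamilies, has_mutl):
    """Simplified rule: MutS1 plus MutL suggests mismatch repair."""
    return "MutS1" in subfamilies and has_mutl


if __name__ == "__main__":
    for species, (subfamilies, has_mutl) in GENOMES.items():
        verdict = "likely" if likely_has_mmr(subfamilies, has_mutl) else "unlikely"
        print(f"{species}: mismatch repair {verdict}")
```

Under this rule H. pylori, which has a single MutS2-type gene and no MutL, comes out as unlikely to perform mismatch repair even though it has a "MutS" homolog.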
Blast Search of H. pylori “MutS”
Sequences producing significant alignments:       Score (bits)  E value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN  117           3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN  69            1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN  64            3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN  62            2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN  57            4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN  57            4e-07

• Blast search pulls up Syn. sp MutS#2 with a much higher score (a much lower E value) than other MutS homologs
• Based on this, TIGR predicted that this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
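The pitfall can be reproduced from the hit list itself. The sketch below parses the six hits, transfers the description of the best hit (the naive approach), and then repeats the annotation using subfamily membership. The subfamily assignments cover only the hits whose subfamily is stated in these slides; the helper names are hypothetical.

```python
BLAST_LINES = """\
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN 117 3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN 69 1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN 64 3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN 62 2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN 57 4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN 57 4e-07
""".splitlines()

# Subfamily of each hit where the slides state it (best hit is "MutS#2";
# E. coli and H. influenzae carry only MutS1 according to Table 3).
KNOWN_SUBFAMILY = {
    "MUTS_SYNY3": "MutS2",
    "MUTS_HAEIN": "MutS1",
    "MUTS_ECOLI": "MutS1",
}
SUBFAMILY_FUNCTION = {
    "MutS1": "bacterial mismatch repair",
    "MutS2": "function unknown",
}


def parse_hit(line):
    """Split one hit line into identifier, description, score and E value."""
    fields = line.split()
    return {
        "id": fields[0].split("|")[2],
        "description": " ".join(fields[1:-2]),
        "bits": int(fields[-2]),
        "evalue": float(fields[-1]),
    }


def best_hit(hits):
    """The hit with the lowest E value."""
    return min(hits, key=lambda hit: hit["evalue"])


def naive_annotation(hits):
    """Copy the description of the best hit."""
    return best_hit(hits)["description"]


def subfamily_annotation(hits):
    """Use the function of the best hit's subfamily, if it is known."""
    subfamily = KNOWN_SUBFAMILY.get(best_hit(hits)["id"])
    return SUBFAMILY_FUNCTION.get(subfamily, "subfamily not assigned")


if __name__ == "__main__":
    hits = [parse_hit(line) for line in BLAST_LINES]
    print("naive:", naive_annotation(hits))
    print("subfamily-aware:", subfamily_annotation(hits))
```

Every hit carries the same database description, so the naive transfer looks well supported even though the best hit belongs to the subfamily with no known role in mismatch repair.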
High Mutation Rate in H. pylori
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart of the method, illustrated with two examples (EXAMPLE A and EXAMPLE B) drawn from three species. Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. The actual evolution of the genes is assumed to be unknown. Other labels in the figure include "Duplication?", "Duplication" and "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
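The last two steps of the method (overlay known functions, then infer) can be sketched on a toy gene tree. The tree below is a hand-made nested tuple with hypothetical tip names and an assumed topology; it is not the published MutS tree, and the inference rule (use the smallest clade around the query that contains any characterised gene) is a simplification of the approach.

```python
# Toy gene tree as nested tuples; tips are strings. Topology is assumed.
TREE = (
    (("Ecoli_MutS1", "Bacsu_MutS1"), "Synsp_MutS1"),
    (("Synsp_MutS2", "Helpy_MutS"), "Bacsu_MutS2"),
)

# Known functions overlaid onto some of the tips.
KNOWN = {
    "Ecoli_MutS1": "mismatch repair",
    "Bacsu_MutS1": "mismatch repair",
    "Synsp_MutS2": "MutS2 (function unknown)",
    "Bacsu_MutS2": "MutS2 (function unknown)",
}


def leaves(node):
    """All tip names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def predict_function(tree, query, known):
    """Predict from the smallest clade around `query` that has annotations.

    Returns the shared function, "ambiguous" if that clade mixes several
    functions, or None if nothing in the tree is annotated.
    """
    node, best = tree, set()
    while not isinstance(node, str):
        labels = {known[tip] for tip in leaves(node) if tip in known and tip != query}
        if labels:
            best = labels
        # Descend into the child clade that contains the query.
        node = next(child for child in node if query in leaves(child))
    if not best:
        return None
    return best.pop() if len(best) == 1 else "ambiguous"


if __name__ == "__main__":
    print(predict_function(TREE, "Helpy_MutS", KNOWN))
```

In this toy tree the H. pylori gene groups with MutS2 sequences, so the prediction follows the MutS2 subfamily rather than the better-known mismatch repair function of MutS1.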
Phylogenomics
Chemosynthetic Symbionts
Eisen et al. 1992. J. Bact. 174: 3416
Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)
Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Non-Homology Predictions: Phylogenetic Profiling
• Search all genes in the organisms of interest against all other genomes

• Ask, yes or no: is each gene found in each other species?

• Cluster genes by distribution patterns (profiles)
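The three steps reduce to building a presence/absence vector per gene and grouping genes whose vectors match. The sketch below uses made-up gene names and genomes, and groups only identical profiles; real analyses usually cluster by a distance between profiles instead.

```python
from collections import defaultdict

# Hypothetical presence (1) / absence (0) of each gene in four genomes.
# Gene names and genome columns are invented for this example.
PROFILES = {
    "spoX": (1, 1, 0, 1),
    "unknown1": (1, 1, 0, 1),
    "rpoB": (1, 1, 1, 1),
    "unknown2": (0, 0, 1, 0),
}


def cluster_by_profile(profiles):
    """Group genes that share exactly the same distribution pattern."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[tuple(profile)].append(gene)
    return dict(clusters)


if __name__ == "__main__":
    for profile, genes in cluster_by_profile(PROFILES).items():
        print(profile, genes)
```

A gene of unknown function that shares its profile with known sporulation genes becomes a candidate sporulation gene, with no sequence similarity to those genes required.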
Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
B. subtilis new sporulation genes
Functional Prediction III: Colocalization
• Operon structure is often maintained over fairly large taxonomic ranges.
  – Sometimes gene order is altered, and sometimes one or more enzymes are missing.
  – But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.
• This is an operon that contains part of the glycolytic pathway.
  – 1: phosphoglycerate mutase
  – 2: triosephosphate isomerase
  – 3: enolase
  – 4: phosphoglycerate kinase
  – 5: glyceraldehyde 3-phosphate dehydrogenase
  – 6: central glycolytic gene regulator
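Conserved neighbourhoods can be found by counting how often two gene families sit next to each other across genomes. The sketch below is illustrative only: the gene orders are invented, the short gene names stand in for the six functions listed above, and strand and chromosome circularity are ignored.

```python
from collections import Counter

# Hypothetical gene orders (one list per genome), using short names for
# the functions on the slide; "abc1", "abc2" and "xyz1" are filler genes.
GENE_ORDERS = {
    "genome_A": ["cggR", "gap", "pgk", "tpi", "pgm", "eno"],
    "genome_B": ["gap", "pgk", "tpi", "pgm", "eno", "xyz1"],
    "genome_C": ["abc1", "pgk", "tpi", "abc2", "eno", "pgm"],
}


def adjacent_pairs(order):
    """Unordered pairs of genes that are immediate neighbours."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def conserved_neighbors(genomes, min_genomes=2):
    """Pairs that are adjacent in at least `min_genomes` genomes."""
    counts = Counter()
    for order in genomes.values():
        counts.update(adjacent_pairs(order))
    return {pair for pair, n in counts.items() if n >= min_genomes}


if __name__ == "__main__":
    for pair in conserved_neighbors(GENE_ORDERS, min_genomes=3):
        print(sorted(pair))
```

Using unordered pairs means a pair still counts when gene order is flipped, as with the last two genes of genome_C.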
Metabolic Predictions
Comparative Genomics
Using the Core
means that database searches must be repeated regularly to keep annotation accurate and up to date.
One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.
What have we learned from genome analysis?
Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.
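The headline figures in this paragraph can be checked against the totals row of Table 2 (50.60 Mbp, 52,462 ORFs, 22,358 of unknown function, 12,161 unique). The short calculation below is only a sanity check of that arithmetic; the variable names are arbitrary.

```python
# Totals from the last row of Table 2.
total_mbp = 50.60
total_orfs = 52_462
unknown_function = 22_358
unique_orfs = 12_161

# 1 Mbp = 1,000 kb, so this is ORFs per kilobase of sequence.
genes_per_kb = total_orfs / (total_mbp * 1000)
percent_unknown = 100 * unknown_function / total_orfs
percent_unique = 100 * unique_orfs / total_orfs

if __name__ == "__main__":
    print(f"{genes_per_kb:.2f} genes per kb")
    print(f"{percent_unknown:.0f}% unknown function")
    print(f"{percent_unique:.0f}% unique")
```

The totals give roughly one gene per kilobase, a little under half of ORFs with unknown function, and about a quarter unique, in line with the text.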
Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes (ref. 8). Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.
Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized (refs 10–12). For example, an
Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database

Gene ID      Similarity (%)  Length (bp)  Gene name (organism)                                       E-value*
GP:2905647   44.8            1,191        D-Arabinitol kinase (Klebsiella pneumoniae)                6.2e-15
EGAD:22614   46.2            1,191        Gluconokinase (Bacillus subtilis)                          1.4e-13
EGAD:20418   43.0            1,302        Xylulose kinase (Lactobacillus pentosus)                   4.8e-13
EGAD:105114  43.4            1,320        Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)  4.7e-12
GP:2895855   42.7            1,263        Xylulokinase (Lactobacillus brevis)                        1.0e-07
EGAD:10899   45.4            1,296        Xylulose kinase (Escherichia coli)                         2.1e-06

*E-value is a statistical measure of the significance of a BLAST search result.
Table 2 Genome features from 24 microbial genome sequencing projects

Organism                              Genome size (Mbp)  No. of ORFs (% coding)  Unknown function  Unique ORFs
Aeropyrum pernix K1                   1.67               1,885 (89%)
A. aeolicus VF5                       1.50               1,749 (93%)             663 (44%)         407 (27%)
A. fulgidus                           2.18               2,437 (92%)             1,315 (54%)       641 (26%)
B. subtilis                           4.20               4,779 (87%)             1,722 (42%)       1,053 (26%)
B. burgdorferi                        1.44               1,738 (88%)             1,132 (65%)       682 (39%)
Chlamydia pneumoniae AR39             1.23               1,134 (90%)             543 (48%)         262 (23%)
Chlamydia trachomatis MoPn            1.07               936 (91%)               353 (38%)         77 (8%)
C. trachomatis serovar D              1.04               928 (92%)               290 (32%)         255 (29%)
Deinococcus radiodurans               3.28               3,187 (91%)             1,715 (54%)       1,001 (31%)
E. coli K-12-MG1655                   4.60               5,295 (88%)             1,632 (38%)       1,114 (26%)
H. influenzae                         1.83               1,738 (88%)             592 (35%)         237 (14%)
H. pylori 26695                       1.66               1,589 (91%)             744 (45%)         539 (33%)
Methanobacterium thermoautotrophicum  1.75               2,008 (90%)             1,010 (54%)       496 (27%)
Methanococcus jannaschii              1.66               1,783 (87%)             1,076 (62%)       525 (30%)
M. tuberculosis CSU#93                4.41               4,275 (92%)             1,521 (39%)       606 (15%)
M. genitalium                         0.58               483 (91%)               173 (37%)         7 (2%)
M. pneumoniae                         0.81               680 (89%)               248 (37%)         67 (10%)
N. meningitidis MC58                  2.24               2,155 (83%)             856 (40%)         517 (24%)
Pyrococcus horikoshii OT3             1.74               1,994 (91%)             859 (42%)         453 (22%)
Rickettsia prowazekii Madrid E        1.11               878 (75%)               311 (37%)         209 (25%)
Synechocystis sp.                     3.57               4,003 (87%)             2,384 (75%)       1,426 (45%)
T. maritima MSB8                      1.86               1,879 (95%)             863 (46%)         373 (26%)
T. pallidum                           1.14               1,039 (93%)             461 (44%)         280 (27%)
Vibrio cholerae El Tor N16961         4.03               3,890 (88%)             1,806 (46%)       934 (24%)
Total                                 50.60              52,462 (89%)            22,358 (43%)      12,161 (23%)
After the Genomes
• Better analysis and annotation
• Comparative genomics
• Functional genomics (Experimental analysis of gene function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome-wide genetic experiments