Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing

Lecture 10:

EVE 161: Microbial Phylogenomics


UC Davis, Winter 2016 Instructors: Jonathan Eisen & Holly Ganz

Answer 2 of these. Please make your answers short.

• 1) List 4-5 Steps in a “Whole Genome Shotgun Sequencing” Project

• 2) What is meant by the “Add on Costs of Sequencing”?

• 3) Explain one form of evidence used to infer lateral gene transfer and why that evidence sometimes can be misleading

• 4) Give examples of 3 different ways to fragment genomic DNA

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

1st Genome Sequence

Fleischmann et al. 1995



Complete Genome/Chromosome Progress

Fraser et al. 2000

insight progress

NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com 799

Microbes were the first organisms on Earth and preceded animals and plants by more than 3 billion years. They are the foundation of the biosphere, from both an evolutionary and an environmental perspective (ref. 1). It has been estimated that microbial species comprise about 60% of the Earth’s biomass. The genetic, metabolic and physiological diversity of microbial species is far greater than that found in plants and animals. But the diversity of the microbial world is largely unknown, with less than one-half of 1% of the estimated 2–3 billion microbial species identified. Of those species that have been described, their biological diversity is extraordinary, having adapted to grow under extremes of temperature, pH, salt concentration and oxygen levels.

Perhaps no other area of research has been so energized by the application of genomic technology than the microbial field. It was only five years ago that The Institute for Genomic Research (TIGR) published the first complete genome sequence for a free-living organism, Haemophilus influenzae (ref. 2); since that first report another 27 microbial genome sequences have been published, with at least 10–20 other projects at or near completion (for details see http://www.tigr.org/tdb/mdb/mdb.html). This progress represents, on average, one completed genome sequence every two months and all indications are that this pace will continue to accelerate. Included in the first completed microbial projects are many important human pathogens, the simplest known free-living organism, ‘model’ organisms, Escherichia coli and Bacillus subtilis, thermophilic bacterial species that might represent some of the deepest-branching members of the bacterial lineage, five representatives of the archaeal domain, and the first eukaryote, Saccharomyces cerevisiae. All of the organisms that have been studied by whole-genome analysis are species that can be grown either in the laboratory or in animal cells. It is important to remember that the vast majority of microbial species cannot be cultivated at all, and these organisms, which live in microbial communities, are essential to the overall ecology of the planet. Nevertheless, the study of ‘laboratory-adapted’ microbes has had a profound impact on our understanding of the biology and the evolutionary relationships between microbial species.

Methods for whole-genome analysis

The method that was successfully used to determine the complete genome sequence of H. influenzae is a shotgun sequencing strategy (Fig. 1). Before 1995, the largest genome sequenced with a random strategy was that of bacteriophage lambda with a genome size of 48,502 base pairs (bp), completed by Sanger et al. in 1982 (ref. 3). Despite advances in DNA-sequencing technology, the sequencing of whole genomes had not progressed beyond lambda-sized clones (about 40 kbp) because of the lack of sufficient computational approaches that would enable the efficient assembly of a large number of independent random sequences into a single contig.

For the H. influenzae and subsequent projects, we have used a computational method that was developed to create assemblies from hundreds of thousands of complementary DNA sequences 300–500-bp long (ref. 4). This approach has proved to be a cost-effective and efficient approach to sequencing megabase-sized segments of genomic DNA. This strategy does not require an ordered set of cosmids or other subclones, thus significantly reducing the overall cost per base pair of producing a finished sequence, while providing high redundancy for accuracy and minimizing the effort required to obtain the whole genome sequence. The availability of improved technologies for longer sequence lengths (more than 700 bp) reduces problems associated with repetitive elements in the final assembly.

Microbial gene finding and annotation

The identification of genes in prokaryotic genomes has advanced to the stage at which nearly all protein-coding regions can be identified with confidence. Computational gene finders using Markov modelling techniques now routinely find more than 99% of protein-coding regions (ref. 5) and RNA genes (ref. 6). Once the protein-coding genes have been located, the most challenging problem is to determine their function. Typically, about 40–60% of the genes in a newly sequenced bacterial genome display a detectable sequence similarity to protein sequences whose function is at least tentatively known. This sequence similarity is the primary basis for assigning function to new proteins, but the transfer of functional assignments is fraught with difficulties.

To illustrate this problem, Table 1 contains an example showing the best matches in the database for a 1,344-bp gene from Mycobacterium tuberculosis at the time that the genome was being sequenced. All six of the best matches are kinases, but the specific names differ. A conservative naming strategy might use a family name that includes all six matches. Another strategy might use curated protein families (if they exist) to assign names; for example, the FGGY family named in the fourth line of Table 1 comes from the Pfam database (ref. 7), a set of 1,815 hidden Markov models based on multiple alignments. By a closer examination of the literature, one could determine which of these protein names were based on laboratory experiments and which on sequence similarity. In any case, the assignment of a function to this protein requires the expertise of a skilled biologist. The rapidly changing nature of genome databases

Microbial genome sequencing

Claire M. Fraser, Jonathan A. Eisen & Steven L. Salzberg

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA

Complete genome sequences of 30 microbial species have been determined during the past five years, and work in progress indicates that the complete sequences of more than 100 further microbial species will be available in the next two to four years. These results have revealed a tremendous amount of information on the physiology and evolution of microbial species, and should provide novel approaches to the diagnosis and treatment of infectious disease.

© 2000 Macmillan Magazines Ltd


Fraser et al. Shotgun Sequencing 2000


analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria (refs 13, 14). This led to the suggestion of possible extensive gene exchanges between these species and archaeal lineages. But before one jumps to this conclusion it is important to consider the difficulties in inferring the occurrence of gene transfer. For example, the high percentage of genes with best matches to archaea in A. aeolicus and T. maritima could also be due to a high rate of evolution in the mesophilic bacteria (which would cause thermophilic and archaeal genes to have high levels of similarity despite their not having a common ancestry) or the loss of these genes from mesophilic bacteria (ref. 15). For T. maritima, many lines of additional evidence support the assertion of gene transfer, including the observation that many of the archaeal-like genes occur in clusters in the genome, are in regions of unusual nucleotide composition, and branch in phylogenetic trees most closely to archaeal genes (ref. 14). Most of the lines of evidence leading to assertions of horizontal gene transfer can have other causes. For example, unusual nucleotide composition can also arise from selection (ref. 16), and differences in phylogenetic trees can be caused by convergence, inaccurate alignments (ref. 17), long-branch attraction (ref. 18) or sampling of different species (ref. 19). It is therefore important to assess the evidence carefully and to find multiple types of evidence. This has yet to be done systematically, so we believe that it is too early to assign quantitative values to the extent of gene exchange between species.

Despite the apparent occurrence of extensive gene transfers in the history of microbes, it does seem that there might be a ‘core’ to each evolutionary lineage that retains some phylogenetic signal. The best evidence for this comes from the construction of ‘whole genome trees’ based on the presence and absence of particular homologues or orthologues in different complete genomes (ref. 20). It is important to note that gene-content trees are averages of patterns produced by phylogeny, gene duplication and loss, and horizontal transfer; they are therefore not real phylogenetic trees. Nevertheless, the fact that these trees are very similar to phylogenetic trees of genes such as ribosomal RNA and RecA suggests that although horizontal gene transfer might be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes (refs 21, 22) and another finding that some types of gene might be more prone to gene transfer than others (ref. 23). It therefore seems likely that horizontal gene transfer has not completely obliterated the phylogenetic signal in microbial genomes. Careful studies in which the phylogenetic trees of some of these core genes are compared across all genomes need to be done to see whether or not the core has a consistent phylogeny. Initial studies suggest that it does, at least for the major microbial groups (ref. 14).

Although our ability to resolve patterns of the relationships among microbes is still limited, analysis of the genomes of closely related species is revealing much about genome evolution (refs 24, 25). For example, a comparison of the genomes of four chlamydial species has revealed the occurrence of frequent tandem gene duplication and gene loss, as well as large chromosomal inversions (ref. 25). Comparisons of closely related species should also reveal much about mutation processes, codon usage and other features that evolve rapidly (ref. 16).

Design of new antimicrobial agents and vaccines

One of the expected benefits of genome analysis of pathogenic bacteria is in the area of human health, particularly in the design of more rapid diagnostic reagents and the development of new vaccines and antimicrobial agents. These goals have become more urgent with the continuing spread of antibiotic resistance in important human pathogens. Moreover, results from the whole-genome analysis of human pathogens has suggested that there are mechanisms for generating antigenic variation in proteins expressed on the cell surface that are encoded within the genomes of these organisms. These mechanisms include the following: (1) slipped-strand mispairing within DNA sequence repeats found in 5′-intergenic regions and coding sequences as described for H. influenzae (ref. 2), Helicobacter pylori (ref. 26) and M. tuberculosis (ref. 27), (2) recombination between homologous genes encoding outer-surface proteins as described for Mycoplasma genitalium (ref. 28), Mycoplasma pneumoniae (ref. 29) and Treponema pallidum (ref. 30), and (3) clonal variability in surface-expressed proteins as described for Plasmodium falciparum (ref. 31) and possibly Borrelia burgdorferi (ref. 32).

1. Library construction
(i) Isolate DNA
(ii) Fragment DNA
(iii) Clone DNA

2. Random sequencing phase
(i) Sequence DNA (15,000 sequences per Mb)

3. Closure phase
(i) Assemble sequences
(ii) Close gaps
(iii) Edit
(iv) Annotation

4. Complete genome sequence

Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.



From http://genomesonline.org

Loman et al. 2012

In bacteriology, the genomic era began in 1995, when the first bacterial genome was sequenced using conventional Sanger sequencing (ref. 1). Back then, sequencing projects required six-figure budgets and years of effort. A decade later, in 2005, the advent of the first high-throughput (or ‘next-generation’) sequencing technologies signalled a significant advance in the ease and cost of sequencing (ref. 2), delivering bacterial genome sequences in hours or days rather than months or years. High-throughput sequencing now delivers sequence data thousands of times more cheaply than is possible with Sanger sequencing. The availability of a growing abundance of platforms and instruments presents the user with an embarrassment of choice. Better still, vigorous competition between manufacturers has resulted in sustained technical improvements on almost all platforms. This means that in recent years our sequencing capability has been doubling every 6–9 months, much faster than Moore’s law.

Here, we describe the sequencing technologies themselves, examine the practicalities of producing a sequence-ready template from bacterial cultures and clinical samples, and weigh up the costs of labour and kits. We look at the types of data that are delivered by each instrument, and describe the approaches, programs and pipelines that can be used to analyse these data and thus move from draft to complete genomes.

Several high-throughput sequencing platforms are now chasing the US$1,000 human genome (ref. 3). Given that the average bacterial genome is less than one-thousandth the size of the human genome, a back-of-the-envelope calculation suggests that a $1 bacterial genome sequence is an imminent possibility. In closing, we assess how close to reality the $1 bacterial genome actually is and explore the ways in which high-throughput sequencing might change the way that all microbiologists work.

A variety of approaches

High-throughput sequencing platforms can be divided into two broad groups depending on the kind of template used for the sequencing reactions. The earliest, and currently most widely used, platforms depend on the production of libraries of clonally amplified templates. These are produced through amplification of immobilized libraries made from a single DNA molecule in the initial sample. More recently, we have seen the arrival of single-molecule sequencing platforms, which determine the sequence of single molecules without amplification. Within these broad categories, there is considerable variation in performance (including in throughput, read length and error rate) as well as in factors affecting usability, such as cost and run time.

Template amplification technologies. In general terms, all of the platforms that are currently on the market rely on a three-stage workflow of library preparation, template amplification and sequencing (FIG. 1). Library preparation begins with the extraction and purification of genomic DNA. Depending on the protocol, the amount of DNA required can vary from a few nanograms to tens of micrograms, meaning that success in this step depends on the ability to grow sufficient biomass. For some microorganisms, obtaining suitable DNA, in terms of quantity and quality, can prove difficult. Therefore, before using expensive reagents for library preparation and sequencing, it is advisable to confirm, by fluorometry, that DNA of sufficient quantity and quality has been obtained. However, purchasing a suitable instrument to do this adds to the costs of establishing a sequencing capability (BOX 1).

For shotgun sequencing, an initial fragmentation step is required to generate random, overlapping DNA fragments. Depending on the platform and application, these fragments can range from 150 bp to 800 bp in length; size selection either involves harvesting from agarose gels or exploits paramagnetic-bead-based technology. The selected fragments must also be sufficiently abundant to provide comprehensive and even coverage of the target genome. Two types of fragmentation are widely used: mechanical and enzymatic. Early protocols relied on mechanical methods such as nebulization or ultrasonication. Nebulization is an inexpensive method that can be easily adopted by any laboratory, but it results in large losses of input material and a broad range of fragment sizes, runs the risk of cross-contamination and cannot handle parallel processing. By contrast, ultrasonication instruments such as systems from Covaris or the Bioruptor systems from Diagenode allow parallel sample processing and minimize hands-on time and sample loss but come at a price that could be prohibitive for small laboratories. Mechanically generated fragments require repair and end-polishing before platform-specific adaptors can be ligated to

High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity

Nicholas J. Loman, Chrystala Constantinidou, Jacqueline Z. M. Chan, Mihail Halachev, Martin Sergeant, Charles W. Penn, Esther R. Robinson and Mark J. Pallen

Abstract | Here, we take a snapshot of the high-throughput sequencing platforms, together with the relevant analytical tools, that are available to microbiologists in 2012, and evaluate the strengths and weaknesses of these platforms in obtaining bacterial genome sequences. We also scan the horizon of future possibilities, speculating on how the availability of sequencing that is ‘too cheap to metre’ might change the face of microbiology forever.

PROGRESS

NATURE REVIEWS | MICROBIOLOGY VOLUME 10 | SEPTEMBER 2012 | 599

FOCUS ON NEXT-GENERATION SEQUENCING

© 2012 Macmillan Publishers Limited. All rights reserved

Loman et al. Shotgun Sequencing 2012

from the reference genome, or when a closely related reference genome is unavailable.

De novo assembly is more informative when dealing with a new pathogen or a new strain of a well-known pathogen. Sequencing errors can have a significant impact on assembly. When platforms produce random errors, the effect of these errors on assembly can be overcome by increasing the depth of coverage. However, when errors are systematic and occur in predictable contexts (for example, in homopolymers), increasing the depth of coverage is unlikely to help, and it may be necessary to sequence the troublesome regions using an alternative technology. Very high-quality, near complete references may be obtained by a hybrid approach, such as in recent studies combining Pacific Biosciences and Illumina data (refs 21, 22).
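The contrast between random and systematic errors can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a single genomic position whose true base is "A", a majority-vote consensus, and hypothetical per-read error probabilities (0.1 for random errors, 0.6 for a systematic error that always yields the same wrong base). None of these numbers come from Loman et al.

```python
import random
from collections import Counter

def consensus_error_rate(depth, error_rate, systematic, trials=2000, seed=1):
    """Fraction of trials in which a majority-vote consensus miscalls one site.

    The true base is 'A'. Random errors pick any wrong base; systematic
    errors always produce the same wrong base ('C'). All values are toy
    assumptions for illustration.
    """
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        calls = []
        for _ in range(depth):
            if rng.random() < error_rate:
                calls.append("C" if systematic else rng.choice("CGT"))
            else:
                calls.append("A")
        # Consensus = most frequent base among the reads covering the site
        if Counter(calls).most_common(1)[0][0] != "A":
            wrong += 1
    return wrong / trials

for depth in (3, 31):
    print(depth,
          consensus_error_rate(depth, 0.1, systematic=False),
          consensus_error_rate(depth, 0.6, systematic=True))
```

Under these assumptions, extra depth should drive the random-error column towards zero, while the systematic-error column stays high however deep the coverage.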

A variety of commonly used assemblers is now available (see Supplementary information S1 (table)), ranging from the platform specific (for example, Newbler from Roche) to the more generally applicable (for example, MIRA (ref. 23), Velvet (ref. 24), and the CLC Genomics Workbench from CLC Bio).

Table 1 | Comparison of next-generation sequencing platforms

High-end instruments

454 GS FLX+ (Roche)
• Chemistry: Pyrosequencing
• Modal read length* (bases): 700–800
• Run time: […] hours
• Gb per run: 0.7
• Current, approximate cost (US$)‡: 500,000
• Advantages: Long read lengths
• Disadvantages: Appreciable hands-on time; High reagent costs; High error rate in homopolymers

HiSeq 2000/2500 (Illumina)
• Chemistry: Reversible terminator
• Modal read length* (bases): 2 × 100
• Run time: 11 days (regular mode) or […] days (rapid run mode)§
• Gb per run: 600 (regular mode) or 120 (rapid run mode)§
• Current, approximate cost (US$)‡: 750,000
• Advantages: Cost-effectiveness; Steadily improving read lengths; Massive throughput; Minimal hands-on time
• Disadvantages: Long run time; Short read lengths; HiSeq 2500 instrument upgrade not available at time of writing (available end 2012)

5500xl SOLiD (Life Technologies)
• Chemistry: Ligation
• Modal read length* (bases): 75 + 35
• Run time: […] days
• Gb per run: 150
• Current, approximate cost (US$)‡: 350,000
• Advantages: Low error rate; Massive throughput
• Disadvantages: Very short read lengths; Long run times

PacBio RS (Pacific Biosciences)
• Chemistry: Real-time sequencing
• Modal read length* (bases): 3,000 (maximum 15,000)
• Run time: […] minutes
• Gb per run: 3 per day
• Current, approximate cost (US$)‡: 750,000
• Advantages: Simple sample preparation; Low reagent costs; Very long read lengths
• Disadvantages: High error rate; Expensive system; Difficult installation

Bench-top instruments

454 GS Junior (Roche)
• Chemistry: Pyrosequencing
• Modal read length* (bases): 500
• Run time: […] hours
• Gb per run: 0.035
• Current, approximate cost (US$)‡: 100,000
• Advantages: Long read lengths
• Disadvantages: Appreciable hands-on time; High reagent costs; High error rate in homopolymers

Ion Personal Genome Machine (Life Technologies)
• Chemistry: Proton detection
• Modal read length* (bases): 100 or 200
• Run time: […] hours
• Gb per run: 0.01–0.1 (314 chip), 0.1–0.5 (316 chip) or up to 1 (318 chip)
• Current, approximate cost (US$)‡: 80,000 (including OneTouch and server)
• Advantages: Short run times; Appropriate throughput for microbial applications
• Disadvantages: Appreciable hands-on time; High error rate in homopolymers

Ion Proton (Life Technologies)
• Chemistry: Proton detection
• Modal read length* (bases): Up to 200
• Run time: 2 hours
• Gb per run: Up to 10 (Proton I chip) or up to 100 (Proton II chip)
• Current, approximate cost (US$)‡: 145,000 + 75,000 for compulsory server
• Advantages: Short run times; Flexible chip reagents
• Disadvantages: Instrument not available at time of writing

MiSeq (Illumina)
• Chemistry: Reversible terminator
• Modal read length* (bases): 2 × 150
• Run time: […] hours
• Gb per run: 1.5
• Current, approximate cost (US$)‡: 125,000
• Advantages: Cost-effectiveness; Short run times; Appropriate throughput for microbial applications; Minimal hands-on time
• Disadvantages: Read lengths too short for efficient assembly

*Average read length for a fragment-based run. ‡Approximate cost per machine plus additional instrumentation and service contract. See REF. 58. §Available only on the HiSeq 2500.



De novo assemblies can be compared using Mauve (ref. 25) or Mugsy (ref. 26), and the assemblies can be manually examined using the Tablet viewer (ref. 27). For annotation of assemblies, Glimmer (ref. 28) works well for coding-sequence prediction, while tRNAScan-SE (ref. 29) and RNAmmer (ref. 30) work well for stable-RNA prediction. There are numerous pipelines for automatic annotation of de novo assemblies, including RAST (ref. 31), IMG/ER (ref. 32) and the IGS Annotation Engine (developed by the Institute for Genome Sciences, University of Maryland School of Medicine, USA), although care must be taken when interpreting results from such services, as the public databases used contain annotation errors that are then propagated to newly sequenced genomes (ref. 33).

For microbial applications, all of the above programs run quickly (in minutes or hours) and are not particularly processor intensive. Some workflows combine a series of programs and provide an accessible interface for microbiologists who are not bioinformatics specialists. For example, xBASE-NG provides a ‘one-stop shop’ for assembly, annotation and comparison of bacterial genome sequences (ref. 34). Sophisticated phylogenetic analyses are more demanding and may be beyond the capability of the average research group. One particular issue when constructing bacterial whole-genome phylogenies is the clouding of phylogenetic signal by recombination events and homoplasy (ref. 35). Algorithms such as ClonalFrame (ref. 36) and ClonalOrigin (ref. 37) take multiple whole-genome alignments as input and attempt to identify blocks of recombination. These approaches are computationally very expensive, and there is no ‘off the shelf’ solution to comparing hundreds or thousands of bacterial genomes. There is a growing

Table 2 | The applicability of the major high-throughput sequencing platforms

Machines*, in the order used for the ratings below: 454 GS Junior‡ / 454 GS FLX+‡ / Ion Personal Genome Machine (318 chip)§ / MiSeq|| / HiSeq 2000|| / 5500xl SOLiD§ / PacBio RS¶

De novo sequencing of novel strains to generate a single-scaffold reference genome
• Desirable characteristics: Long reads; Paired-end protocol and/or long mate-pair protocol; Even coverage of genome
• Ratings: ! / !! / ! / ! / ! / X / !!

Rapid characterization of a novel pathogen (draft de novo assembly of a genome for a single strain)
• Desirable characteristics: Total run time (library preparation plus sequencing) of under […] hours; Sufficient coverage of a bacterial genome in a single run
• Ratings: ! / !! / !! / !! / X / X / !!

Rough-draft de novo sequencing of small numbers of strains (<20) for comparative analysis of gene content
• Desirable characteristics: Long or paired-end reads; High throughput; Ease of library and sequencing workflow; Cost-effective
• Ratings: X / ! / ! / !! / !! / ! / !

Re-sequencing of many similar strains (>50) for the discovery of single nucleotide polymorphisms and for phylogenetics
• Desirable characteristics: Very high throughput; Low-cost, high-throughput sequence library construction; High accuracy
• Ratings: X / X / ! / ! / !! / ! / !

Small-scale transcriptomics-by-sequencing experiments (for example, two strains under four growth conditions with two biological replicates, so 16 strains)
• Desirable characteristics: High per-isolate coverage
• Ratings: X / ! / ! / ! / !! / !! / !!

Phylogenetic profiling to genus-level using partial 16S rRNA gene amplicon sequencing
• Desirable characteristics: High coverage; Long amplicon input (≥500 bp); Long reads; High single-read accuracy (error rate <1%)
• Ratings: ! / !! / ! / !! / ! / ! / X

Whole-genome metagenomics for the reconstruction of multiple genomes in a single sample
• Desirable characteristics: Long reads or paired-end reads; Very high throughput; Low error rate
• Ratings: X / ! / ! / ! / !! / ! / !

*!!, particularly well suited; !, suitable; X, not suitable. ‡From Roche. §From Life Technologies. ||From Illumina. ¶From Pacific Biosciences.

interest in alignment-free approaches for constructing bacterial phylogenies, as it is thought that these approaches may help address the computational challenges of these analyses (ref. 38).

A recurring problem with data from high-throughput sequencing is meeting the requirement, as stipulated by journals and funders, that data be lodged in the public domain. Unannotated assembled sequences can be uploaded to conventional sequence databases, such as GenBank, fairly easily. However, submission of annotated sequences can be onerous, slowing down the process of publication even further. Submission of sequence reads to short-read archives may be hampered by slow data transfer rates, and it remains uncertain how sustainable such archives will prove to be in the future. There may come a time when the easiest way to



Step 1: Get DNA

Step 2: Shotgun Sequence

DNA target sample

Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt


Shotgun DNA Sequencing (1995-2005)

1. SHEAR the DNA target sample
2. SIZE SELECT (e.g., 10Kbp ± 8% std.dev.)
3. LIGATE & CLONE into Vector
4. SEQUENCE End Reads (Mates) from Primer (550bp)



Short read genome sequencing (2005-current)

Genomic DNA

Random fragmentation (molecular biology)
• 270 bp fragments: Paired-end short insert reads (10’s millions)
• 4-8 kb fragments: Paired-end long insert reads (10’s millions)

Sequencing (Illumina)

How do we assemble this data back into a genome?






Step 3: Assemble



Assembly outline

Reads → Contigs → Scaffolds

Assembly algorithms e.g. Allpaths, Velvet, Meraculous



De Bruijn Graph Assembly



De Bruijn example

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,....”

Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall

Example courtesy of J. Leipzig 2010

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…





Generate random ‘reads’

fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho

hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw

fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe

…etc. to 10’s of millions of reads
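The read-generation step on this slide can be imitated with a few lines of Python. This is a minimal sketch: the function name `generate_reads`, the fixed seed and the uniform sampling of start positions are illustrative assumptions, not something stated on the slides. Only the 10-letter read length matches the reads listed above.

```python
import random

def generate_reads(text, read_length=10, n_reads=20, seed=0):
    """Sample fixed-length substrings ('reads') at random start positions."""
    rng = random.Random(seed)
    last_start = len(text) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [text[s:s + read_length] for s in starts]

# The 'genome' is the start of the Dickens passage with spaces removed
genome = ("itwasthebestoftimesitwastheworstoftimes"
          "itwastheageofwisdomitwastheageoffoolishness")
reads = generate_reads(genome)
print(reads[:5])
```

Because reads are sampled independently, some positions are covered many times and others not at all, which is the situation the assembly steps that follow have to cope with.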



How do we assemble?

Traditional all-vs-all assemblers fail due to immense computational resources (scales with number of reads^2). A million (10^6) reads requires a trillion (10^12) pairwise alignments.

De Bruijn solution: Represent the data as a graph (scales with genome size)
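The scaling argument can be checked with simple arithmetic. The sketch below uses the slide's order-of-magnitude count for all-vs-all comparison (number of reads squared) and, for the graph approach, an upper bound on the number of distinct kmers in an error-free linear genome. The genome size of 5 Mb and k = 31 are illustrative assumptions, and real data contain extra kmers created by sequencing errors.

```python
def pairwise_comparisons(n_reads):
    """Read pairs considered by an all-vs-all overlap step (order n^2)."""
    return n_reads * n_reads

def max_distinct_kmers(genome_size, k):
    """Upper bound on distinct kmers in an error-free linear genome."""
    return max(genome_size - k + 1, 0)

n_reads = 10**6
print(pairwise_comparisons(n_reads))       # 1000000000000
print(max_distinct_kmers(5_000_000, 31))   # 4999970
```

The first number grows with the square of the sequencing depth, whereas the second is capped by the genome size however many reads are added.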



Step 1: Convert reads into “Kmers”

Kmer: a substring of defined length

Reads and their kmers (k=3):

theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast

…etc. for all reads in the dataset
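Step 1 amounts to sliding a window of width k along each read. A minimal sketch follows; the helper name `kmers` is an assumption chosen for illustration.

```python
def kmers(read, k=3):
    """Return every substring of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("theageofwi"))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
```

A read of length L yields L - k + 1 kmers, so the 10-letter reads in this example each give 8 kmers at k=3.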




Step 2: Build a De-Bruijn graph from the kmers

[Figure: the kmers of each read drawn as a chain of nodes, with chains joined where reads share kmers]

the → hea → eag → age → geo → eof → ofw → fwi
ast → sth → the → hea → eag → age → geo → eof
sth → the → heb → ebe → bes → est → sto → tof
wor → ors → rst → sto → tof → oft → fti → tim
ime → mes → esi → sit → itw → twa → was → ast

…etc. for all ‘kmers’ in the dataset
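Step 2 can be sketched as follows, following the slides' picture in which each kmer is a node and an edge links two kmers that are adjacent in some read. The function names are illustrative assumptions. Many assemblers instead use (k-1)-mers as nodes and kmers as edges, so treat this as one possible formulation.

```python
from collections import defaultdict

def kmers(read, k=3):
    """Return every substring of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_graph(reads, k=3):
    """Map each kmer to the set of kmers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph

reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
graph = build_graph(reads)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this small example the node `the` has two successors (`hea` and `heb`), a branch caused by the same kmer occurring in different contexts.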


De Bruijn example

Step 3: Simplify the graph as much as possible:

A De Bruijn Graph



De Bruijn example


A De Bruijn Graph



De Bruijn example


A De Bruijn Graph

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,

it was the epoch of belief, it was the epoch of incredulity,.... “

Drawback of De Bruijn approach

De Bruijn assemblies ‘broken’ by repeats longer than kmer

No single solution!

Break graph to produce final assembly

Step 4: Dump graph into consensus (fasta)



Kmer size is an important parameter in De Bruijn assembly

The final assembly (k=3)

wor, times, itwasthe, foolishness, incredulity, age, epoch, be, st, wisdom, of, belief

Repeat with a longer “kmer” length

A better assembly (k=20)

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Why not always use longest ‘k’ possible?

Sequencing errors:

Read with one error: sthebentof (correct read: sthebestof)

k=3: sth, the, heb, ebe, ben, ent, nto, tof

Mostly unaffected kmers

k=10: sthebentof

100% wrong kmer
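The trade-off can be checked directly. This Python sketch compares the kmers of the erroneous read against those of the correct read from the earlier slides; the function names are illustrative.

```python
def kmers(read, k):
    """Return every substring of length k, in order of position along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def wrong_fraction(true_read, observed_read, k):
    """Fraction of the observed read's kmers that do not occur in the true read."""
    true_set = set(kmers(true_read, k))
    observed = kmers(observed_read, k)
    return sum(km not in true_set for km in observed) / len(observed)

for k in (3, 10):
    print(k, wrong_fraction("sthebestof", "sthebentof", k))
```

With k=3 only the three kmers that overlap the error (ben, ent, nto) are wrong; with k=10 the single kmer is the whole read, so it is entirely wrong.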



Scaffolding

Reads

‘De Bruijn’ assembly

Contigs

Align reads to DeBruijn contigs

Join contigs using evidence from paired end data

Scaffolds (An assembly)

“Captured” gaps caused by repeats. Represented by “NNN” in assembly



Lander-Waterman statistics

L = read length

T = minimum detectable overlap

G = genome size

N = number of reads

c = coverage (NL / G)

σ = 1 – T/L

$E(\#\text{islands}) = N e^{-c\sigma}$

$E(\text{island size}) = L\left(\dfrac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$

contig = island with 2 or more reads
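As a worked example, the two expectations can be evaluated in Python. The project numbers below (10,000 reads of 500 bp, a 1 Mbp genome, a 50 bp minimum overlap) are made-up inputs for illustration, not values from the slides.

```python
import math

def lander_waterman(N, L, G, T):
    """Return (coverage, expected number of islands, expected island size in bp)."""
    c = N * L / G                       # coverage
    sigma = 1 - T / L                   # fraction of a read usable for detecting overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

c, islands, size = lander_waterman(N=10_000, L=500, G=1_000_000, T=50)
print(f"coverage {c:.1f}x, ~{islands:.0f} islands, ~{size:.0f} bp per island")
```

Because the island count falls off as e^(-cσ), raising coverage closes gaps quickly at first and then with diminishing returns.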



Mis-assembly of repetitive sequence

Schatz M C et al. Brief Bioinform 2013;14:213-224



Mis-assembled repeats

[Figure: three classes of repeat mis-assembly, each shown as the true layout and the mis-assembled layout: collapsed tandem, excision, rearrangement]



Real life assembly is messy!

Assembly in theory

Uniform coverage, no errors, no contamination

Assembly in reality

• Biased coverage (-> gaps)

• Sequencing errors (-> fragmented assembly)

• Chimeric reads (-> mis-joins)

• Contaminant reads (-> incorrect + inflated assembly)

Worse than predicted assemblies!

Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt



[Figure: theoretical versus observed coverage. Panel 1: fraction of normalized coverage against GC% of 100 base windows. Panel 2: coverage (x) against reference position (bp)]



Genome properties can also make assembly difficult

Biased sequence composition (e.g., ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC)

RESULT: incomplete / fragmented assembly

High repeat content

RESULT: misassemblies / collapsed assemblies

Polyploidy (alleles a and a’)

RESULT: fragmented assembly

Biased sequence abundance

RESULT: Incomplete / fragmented assembly



N50

The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities at least as large as E.

For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because we can cover 10 kb with contigs of 4 kb or larger (7 + 4 = 11 kb). (http://www.cbcb.umd.edu/research/castats.shtml)

N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater. (Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
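Both definitions lead to the same calculation, sketched here in Python on the contig sizes from the example above. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Length x such that contigs of length >= x hold at least half of the total sequence."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):   # walk from the largest contig down
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 is undefined for an empty set of lengths")

print(n50([7, 4, 3, 2, 2, 1, 1]))   # contig sizes in kb
```

Sorting first is what makes N50 insensitive to many tiny contigs: they only matter once the large contigs have failed to reach half of the total.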



Why Completeness is Important

• Improves characterization of genome features

• Gene order, replication origins

• Better comparative genomics

• Genome duplications, inversions

• Presence and absence of particular genes can be very important

• Missing sequence might be important (e.g., centromere)

• Allows researchers to focus on biology not sequencing

• Facilitates large scale correlation studies

Step 4: Closure

• Physical map information

• PCR and gap spanning

• Other sequencing data


General Steps in Analysis of Complete Genomes

• Identification/prediction of genes

• Characterization of gene features

• Characterization of genome features

• Prediction of gene function

• Prediction of pathways

• Integration with known biological data

• Comparative genomics

Step 5: Annotate



General Steps in Analysis of Complete Genomes

• Structural Annotation

– Identification/prediction of genes

– Characterization of gene features

– Characterization of genome features

• Functional Annotation

– Prediction of gene function

– Prediction of pathways

– Integration with known biological data

• Evolutionary Annotation

– Comparative genomics


Structural Annotation I: Genes in Genomes

• Protein coding genes.

– In long open reading frames

– ORFs interrupted by introns in eukaryotes

– Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome

• RNA-only genes

– Transfer RNA

– ribosomal RNA

– snoRNAs (guide ribosomal and transfer RNA maturation)

– intron splicing

– guiding mRNAs to the membrane for translation

– gene regulation: this is a growing list


Structural Annotation II: Other Features to Find

• Gene control sequences

– Promoters

– Regulatory elements

• Transposable elements, both active and defective

– DNA transposons and retrotransposons

– Many types and sizes

• Other Repeated sequences.

– Centromeres and telomeres

– Many with unknown (or no) function

• Unique sequences that have no obvious function


Bacteria / Archaeal Protein Coding Genes

• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.

– Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.

• The stop codons are the same as in eukaryotes: TGA, TAA, TAG

– stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.

• Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough so that you can’t just eliminate overlaps as impossible.

• Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.

– But, a significant minority of genes (say 20%) are unique to a given species.

• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon

– however, some aren’t recognizable

– genes in operons don’t always have a separate ribosome binding site for each gene
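The start and stop codon rules above are enough for a naive ORF scan, sketched here in Python. This is only an illustration under simplifying assumptions: it scans the forward strand only, and it always takes the first start codon in a frame, which (as noted above) may not be the true start. The function name, the `min_codons` parameter and the test sequence are all made up for the example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAA", "TAG"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive and includes the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                       # first candidate start in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF after this stop
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))
```

Real gene finders add the other evidence listed above (reverse strand, ribosome binding sites, homology, composition) before calling a gene.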


Composition Methods

• The frequency of various codons is different in coding regions as compared to non-coding regions.

– This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used

– Well documented experimentally.

• The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present

– horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.

– A second group with unusual compositions are highly expressed genes.
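Dicodon statistics start from simple counts. The Python sketch below tallies in-frame dicodons (6-base windows stepped one codon at a time) for a single sequence; the sequence is a made-up toy, and a real gene finder would compare such counts between known coding and non-coding regions.

```python
from collections import Counter

def dicodon_counts(orf):
    """Count in-frame dicodons: 6-base windows starting at every codon boundary."""
    orf = orf.upper()
    return Counter(orf[i:i + 6] for i in range(0, len(orf) - 5, 3))

counts = dicodon_counts("ATGAAAATGAAA")
print(counts.most_common())
```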


Eukaryotic Genes Harder to Find

• Some fundamental differences between prokaryotes and eukaryotes:

• There is lots of non-coding DNA in eukaryotes.

– First step: find repeated sequences and RNA genes

– Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.

• Most eukaryotic genes are split into exons and introns.

• Only 1 gene per transcript in eukaryotes.

• No ribosome binding sites: translation starts at the first ATG in the mRNA

– thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.

• Many fewer eukaryotic genomes have been sequenced


Exons

• Exon sequences can often be identified by sequence conservation, at least roughly.

• Dicodon statistics, as were used for prokaryotes, are also useful

– eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.

• The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.

• Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair (introns typically begin with GT and end with AG), then running this information through some combination of decision trees, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.

– In general, sites are more likely to be correct if predicted by multiple methods

– Experimental data from ESTs can be very helpful here.


How to Find ncRNAs

• The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.

• One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily

• Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.

• Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.

• The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted

• Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.

• This is an example of compensatory evolution: a deleterious mutation at one site is cancelled by a compensating mutation at another site.


RNA Structure

• RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.

• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intense and not subject to the normal situation that RNA base pairing follows a nested pattern

– But pseudoknots seem to be fairly rare.

• Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.

– There is an element of dynamic programming here as well.

– And, “stochastic context-free grammars”, something I really don’t want to approach right now!


Finding tRNAs

• tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.

• Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.

• In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.

Step 6: Analyze

Fraser et al. 2000, Nature 406: 800

means that database searches must be repeated regularly to keep annotation accurate and up to date.

One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.

What have we learned from genome analysis?

Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.

Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes (ref. 8). Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.

Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized (refs 10–12). For example, an

Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database

| Gene ID | Similarity (%) | Length (bp) | Gene name | E-value* |
|---|---|---|---|---|
| GP:2905647 | 44.8 | 1,191 | D-Arabinitol kinase (Klebsiella pneumoniae) | 6.2e-15 |
| EGAD:22614 | 46.2 | 1,191 | Gluconokinase (Bacillus subtilis) | 1.4e-13 |
| EGAD:20418 | 43.0 | 1,302 | Xylulose kinase (Lactobacillus pentosus) | 4.8e-13 |
| EGAD:105114 | 43.4 | 1,320 | Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus) | 4.7e-12 |
| GP:2895855 | 42.7 | 1,263 | Xylulokinase (Lactobacillus brevis) | 1.0e-07 |
| EGAD:10899 | 45.4 | 1,296 | Xylulose kinase (Escherichia coli) | 2.1e-06 |

*E-value is a statistical measure of the significance of a BLAST search result.

Table 2 Genome features from 24 microbial genome sequencing projects

| Organism | Genome size (Mbp) | No. of ORFs (% coding) | Unknown function | Unique ORFs |
|---|---|---|---|---|
| Aeropyrum pernix K1 | 1.67 | 1,885 (89%) | | |
| A. aeolicus VF5 | 1.50 | 1,749 (93%) | 663 (44%) | 407 (27%) |
| A. fulgidus | 2.18 | 2,437 (92%) | 1,315 (54%) | 641 (26%) |
| B. subtilis | 4.20 | 4,779 (87%) | 1,722 (42%) | 1,053 (26%) |
| B. burgdorferi | 1.44 | 1,738 (88%) | 1,132 (65%) | 682 (39%) |
| Chlamydia pneumoniae AR39 | 1.23 | 1,134 (90%) | 543 (48%) | 262 (23%) |
| Chlamydia trachomatis MoPn | 1.07 | 936 (91%) | 353 (38%) | 77 (8%) |
| C. trachomatis serovar D | 1.04 | 928 (92%) | 290 (32%) | 255 (29%) |
| Deinococcus radiodurans | 3.28 | 3,187 (91%) | 1,715 (54%) | 1,001 (31%) |
| E. coli K-12-MG1655 | 4.60 | 5,295 (88%) | 1,632 (38%) | 1,114 (26%) |
| H. influenzae | 1.83 | 1,738 (88%) | 592 (35%) | 237 (14%) |
| H. pylori 26695 | 1.66 | 1,589 (91%) | 744 (45%) | 539 (33%) |
| Methanobacterium thermoautotrophicum | 1.75 | 2,008 (90%) | 1,010 (54%) | 496 (27%) |
| Methanococcus jannaschii | 1.66 | 1,783 (87%) | 1,076 (62%) | 525 (30%) |
| M. tuberculosis CSU#93 | 4.41 | 4,275 (92%) | 1,521 (39%) | 606 (15%) |
| M. genitalium | 0.58 | 483 (91%) | 173 (37%) | 7 (2%) |
| M. pneumoniae | 0.81 | 680 (89%) | 248 (37%) | 67 (10%) |
| N. meningitidis MC58 | 2.24 | 2,155 (83%) | 856 (40%) | 517 (24%) |
| Pyrococcus horikoshii OT3 | 1.74 | 1,994 (91%) | 859 (42%) | 453 (22%) |
| Rickettsia prowazekii Madrid E | 1.11 | 878 (75%) | 311 (37%) | 209 (25%) |
| Synechocystis sp. | 3.57 | 4,003 (87%) | 2,384 (75%) | 1,426 (45%) |
| T. maritima MSB8 | 1.86 | 1,879 (95%) | 863 (46%) | 373 (26%) |
| T. pallidum | 1.14 | 1,039 (93%) | 461 (44%) | 280 (27%) |
| Vibrio cholerae El Tor N1696 | 4.03 | 3,890 (88%) | 1,806 (46%) | 934 (24%) |
| Total | 50.60 | 52,462 (89%) | 22,358 (43%) | 12,161 (23%) |
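The "about one gene per kilobase" figure quoted in the text can be checked against Table 2. This Python sketch uses three rows and the totals row exactly as printed in the table; the dictionary and function names are illustrative.

```python
# (genome size in Mbp, number of ORFs), copied from Table 2
table2 = {
    "H. influenzae": (1.83, 1738),
    "M. genitalium": (0.58, 483),
    "Synechocystis sp.": (3.57, 4003),
    "Total (24 genomes)": (50.60, 52462),
}

def genes_per_kb(size_mbp, n_orfs):
    """Gene density: predicted ORFs per kilobase of genome."""
    return n_orfs / (size_mbp * 1000)

for name, (size_mbp, n_orfs) in table2.items():
    print(f"{name:<20} {genes_per_kb(size_mbp, n_orfs):.2f} genes/kb")
```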




Experimental evidence from studies of clinical isolates of some species has demonstrated phenotypic variation in the relevant cell-surface proteins (ref. 33), suggesting that, at least for human pathogens, the evolution of antigenic proteins probably occurs in real time, as cell populations divide. The ability of human pathogens to alter their antigenic potential and thereby evade the immune system has the potential to hinder vaccine development by conventional methods.

Progress during the past year has supported the idea that complete genome sequence information can be exploited in the design of new vaccines and antimicrobial compounds. As an example, the identification of new vaccine candidates against serogroup B Neisseria meningitidis (MenB) was reported by Pizza et al. using a genomics-based approach (ref. 34) (Fig. 2). With the use of the entire genome sequence of a virulent serogroup B strain (ref. 35), 570 putative cell-surface-expressed or secreted proteins were identified; the corresponding DNA sequences were cloned and expressed in E. coli. Of the putative targets, 61% were expressed successfully and used to immunize mice. Immune sera were screened for bactericidal activity and for the ability to bind to the surface of MenB cells. Seven representative proteins were selected for further study and were evaluated for their degree of sequence variability among multiple isolates and serogroups of N. meningitidis. Two highly conserved vaccine candidates emerged from this large-scale screening effort, which occurred in parallel with the completion of the genome sequence of N. meningitidis. These results provide the first definitive demonstration of the potential of genomic information to expand and accelerate the development of vaccines against pathogenic organisms.

Another example illustrates the potential of genomics to accelerate the development of novel antimicrobial agents. Jomaa et al. (ref. 36) identified two genes in P. falciparum from sequence data from the malaria genome consortium that encode key enzymes in the 1-deoxy-D-xylulose-5-phosphate (DOXP) pathway that are required for the synthesis of isoprenoids such as cholesterol (ref. 37). The DOXP pathway functions in some bacteria, algae and higher plants to produce isopentenyl diphosphate, a precursor of isoprenoids. In P. falciparum, the enzymes of the DOXP pathway are probably associated with a specialized organelle derived from algae called the apicoplast; they are expressed when the parasite is growing within red blood cells. Inhibitors of one of the key enzymes in the DOXP pathway, DOXP reductoisomerase, had previously been identified and had been shown to inhibit the bacterial enzyme and the growth of some bacterial species. Jomaa et al. (ref. 36) demonstrated that two inhibitors of DOXP reductoisomerase, fosmidomycin and FR900098, were able to inhibit the growth of P. falciparum in vitro and cure mice infected with a related species of Plasmodium. Both of these compounds exhibit low toxicity and high stability and are relatively inexpensive to produce, suggesting that they might be the basis of a potentially important new class of anti-malarial drugs.

Conclusions

So far, studies in genomics have only scratched the surface of microbial diversity and have revealed how little is known about microbial species. In the next few years, more than 100 projects for sequencing microbial genomes should be completed, providing the scientific community with information on more than 300,000 predicted genes. A significant number of these genes will be novel and of unknown function. These novel genes represent exciting new opportunities for future research and potential sources of biological resources to be explored and exploited. The benefits of comparative genomics in understanding biochemical diversity, virulence and pathogenesis, and the evolution of species have been unequivocally demonstrated and the usefulness of comparative techniques will improve as more genomes become available. One of the major challenges is to develop techniques for assessing the function of novel genes on a large scale and integrating information on how genes and proteins interact at the cellular level to create and maintain a living organism. It is not unreasonable to expect that, by expanding our understanding of microbial biology and biodiversity, great strides can be made in the

Figure 2 Diagram depicting how complete microbial genome sequence data can accelerate vaccine development.

[Figure panels, in order: N. meningitidis genome (positions 1 to 800,000) → all potential antigens → selection of vaccine targets: a total of 570 putative secreted proteins or surface proteins → protein expression: a total of ~350 recombinant proteins expressed in E. coli and used to immunize mice → immune sera screening (bactericidal activity; binding to surface of MenB cells) → seven proteins selected for follow-up based on high titres → final candidate selection: two proteins were found to exhibit no sequence variability ➞ clinical trials. Timescale labels: hours, few months, 3–12 months]


LGT


Functional Annotation


Functional Classification I: GO

• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.

• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.

• There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (Ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)

– For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”

• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
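The difference between a tree and a directed acyclic graph shows up as soon as a term has two parents. The Python sketch below uses a made-up four-term hierarchy (the term names are placeholders, not real GO terms) and collects every ancestor of a term by following child-to-parent links.

```python
# Hypothetical toy hierarchy: each term maps to its parent terms (less specific).
parents = {
    "child_term": ["parent_a", "parent_b"],   # two parents: allowed in a DAG, not in a tree
    "parent_a": ["root"],
    "parent_b": ["root"],
    "root": [],
}

def ancestors(term, parents):
    """Return all less-specific terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("child_term", parents)))
```

Annotating a gene product with a specific term therefore implies all of that term's ancestors, along every parent path.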


Functional Classification II: Enzyme Nomenclature

• Enzyme functions: which reactants are converted to which products

– Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.

– Often, two or more gene products in a genome will have the same E.C. number.

• Enzyme functions are given unique numbers by the Enzyme Commission.

– E.C. numbers are four integers separated by dots. The left-most number is the least specific

– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:

  • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)

  • EC 3.4 are hydrolases that act on peptide bonds

  • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide

  • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

• Top level E.C. numbers:

– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer

– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.

– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.

– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule

– E.C. 5: isomerases: rearrangements of atoms within a molecule

– E.C. 6: ligases: joining two molecules using energy from ATP
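Because each E.C. number is a path from least to most specific, it can be split into its nested groups mechanically. This Python sketch uses the six top-level classes listed above; the function names and the accepted input spellings ("EC", "E.C.", or a bare number) are assumptions made for the example.

```python
import re

EC_CLASSES = {
    1: "oxidoreductases", 2: "transferases", 3: "hydrolases",
    4: "lyases", 5: "isomerases", 6: "ligases",
}

def ec_parts(ec):
    """Split e.g. 'EC 3.4.11.4' into its four integer fields (as strings)."""
    match = re.fullmatch(r"(?:E\.?C\.?\s*)?(\d+)\.(\d+)\.(\d+)\.(\d+)", ec.strip())
    if match is None:
        raise ValueError(f"not a complete E.C. number: {ec!r}")
    return match.groups()

def ec_levels(ec):
    """Return the nested groups, from least to most specific."""
    parts = ec_parts(ec)
    return [".".join(parts[:i]) for i in range(1, 5)]

def ec_class(ec):
    """Name of the top-level class, or None if it is not one of the six listed."""
    return EC_CLASSES.get(int(ec_parts(ec)[0]))

print(ec_levels("EC 3.4.11.4"), ec_class("EC 3.4.11.4"))
```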


Functional Prediction

• BLAST searches

• HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).

• Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.

• Cellular location predictions, especially for transmembrane proteins.

• Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).

• Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.

– Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.


Functional Prediction II: Membrane Spanning

• Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.

– There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions

– There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.

• There are 2 main protein structures that cross membranes.

– Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.

– Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let small hydrophilic molecules cross the membrane.
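One common way to look for membrane-spanning alpha helices is a sliding-window hydropathy average. The Python sketch below is an illustration only: it uses the Kyte-Doolittle hydropathy scale, and the 19-residue window and 1.6 cutoff are conventional choices assumed for the example, not values from the slides. The test sequence is made up.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_windows(protein, window=19):
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_starts(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to be candidate membrane helices."""
    return [i for i, mean in enumerate(hydropathy_windows(protein, window))
            if mean >= threshold]

toy = "DDDDD" + "L" * 20 + "KKKKK"   # charged ends flanking a hydrophobic stretch
print(candidate_tm_starts(toy))
```

This kind of scan finds helical spans only; beta barrels alternate hydrophobic and hydrophilic residues and need different methods.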


Functional Prediction by Phylogeny

• Key step in genome projects

• More accurate predictions help guide experimental and computational analyses

• Many diverse approaches

• All are improved by “phylogenomic” type analyses that integrate evolutionary reconstructions and understanding of how new functions evolve


Functional Prediction

• Identification of motifs

– Short regions of sequence similarity that are indicative of general activity

– e.g., ATP binding

• Homology/similarity based methods

– Gene sequence is searched against a database of other sequences

– If significant similar genes are found, their functional information is used

• Problem

– Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function


Helicobacter pylori


H. pylori genome - 1997

“The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”


MutL ??

From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html



Phylogenetic Tree of MutS Family

[Figure: phylogenetic tree of MutS family proteins. Taxon labels: Aquae, Trepa, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Borbu, Strpy, Bacsu, Synsp, Ecoli, Neigo, Thema, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy, mSaco]

Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=9722651&query_hl=2


MutS Subfamilies

[Figure: MutS family tree with the subfamilies labeled: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2]




Overlaying Functions onto Tree

[Figure: MutS family tree with known functions overlaid on the subfamilies MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6 and MSH2]




MutS Subfamilies

• MutS1 Bacterial MMR

• MSH1 Euk - mitochondrial MMR

• MSH2 Euk - all MMR in nucleus

• MSH3 Euk - loop MMR in nucleus

• MSH6 Euk - base:base MMR in nucleus

• MutS2 Bacterial - function unknown

• MSH4 Euk - meiotic crossing-over

• MSH5 Euk - meiotic crossing-over


Functional Prediction Using Tree

[Figure: MutS family tree with a predicted function for each subfamily]

• MSH1 - Mitochondrial Repair

• MSH3 - Nuclear Repair Of Loops

• MSH6 - Nuclear Repair Of Mismatches

• MutS1 - Bacterial Mismatch and Loop Repair

• MSH4 - Meiotic Crossing Over

• MSH5 - Meiotic Crossing Over

• MutS2 - Unknown Functions

• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair




Table 3. Presence of MutS Homologs in Complete Genome Sequences

| Species | # of MutS Homologs | Which Subfamilies? | MutL Homologs |
|---|---|---|---|
| Bacteria | | | |
| Escherichia coli K12 | 1 | MutS1 | 1 |
| Haemophilus influenzae Rd KW20 | 1 | MutS1 | 1 |
| Neisseria gonorrhoeae | 1 | MutS1 | 1 |
| Helicobacter pylori 26695 | 1 | MutS2 | - |
| Mycoplasma genitalium G-37 | - | - | - |
| Mycoplasma pneumoniae M129 | - | - | - |
| Bacillus subtilis 169 | 2 | MutS1, MutS2 | 1 |
| Streptococcus pyogenes | 2 | MutS1, MutS2 | 1 |
| Mycobacterium tuberculosis | - | - | - |
| Synechocystis sp. PCC6803 | 2 | MutS1, MutS2 | 1 |
| Treponema pallidum Nichols | 1 | MutS1 | 1 |
| Borrelia burgdorferi B31 | 2 | MutS1, MutS2 | 1 |
| Aquifex aeolicus | 2 | MutS1, MutS2 | 1 |
| Deinococcus radiodurans R1 | 2 | MutS1, MutS2 | 1 |
| Archaea | | | |
| Archaeoglobus fulgidus VC-16, DSM4304 | - | - | - |
| Methanococcus jannaschii DSM 2661 | - | - | - |
| Methanobacterium thermoautotrophicum ΔH | 1 | MutS2 | - |
| Eukaryotes | | | |
| Saccharomyces cerevisiae | 6 | MSH1-6 | 3+ |
| Homo sapiens | 5 | MSH2-6 | 3+ |


Blast Search of H. pylori “MutS”

Sequences producing significant alignments: Score (bits), E Value

sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN 117 3e-25

sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN 69 1e-10

sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN 64 3e-09

sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN 62 2e-08

sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN 57 4e-07

sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN 57 4e-07

• Blast search pulls up Syn. sp MutS#2 with a much higher score (and much lower E value) than other MutS homologs

• Based on this TIGR predicted this species had mismatch repair

Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.



High Mutation Rate in H. pylori

Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.



PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flowchart of the method, worked through for EXAMPLE A and EXAMPLE B]

METHOD

CHOOSE GENE(S) OF INTEREST

IDENTIFY HOMOLOGS

ALIGN SEQUENCES

CALCULATE GENE TREE

OVERLAY KNOWN FUNCTIONS ONTO TREE

INFER LIKELY FUNCTION OF GENE(S) OF INTEREST

ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)

[Figure panels show genes 1-6 (Example A) and genes 1A-3A and 1B-3B from Species 1, Species 2 and Species 3 (Example B), with inferred events labelled "Duplication?" and one prediction labelled "Ambiguous"]

Based on Eisen, 1998 Genome Res 8: 163-167.
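The last two steps of the method (overlaying known functions onto the gene tree, then inferring the function of the gene of interest) can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it assumes the gene tree is already computed and given as nested tuples, and it uses a "smallest annotated clade" rule that is an assumption made for this example. The gene names reuse the labels from the figure; the function labels are hypothetical.

```python
def leaves(tree):
    """List the leaf (gene) names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades_containing(tree, query):
    """Return every clade that contains `query`, smallest first."""
    if isinstance(tree, str):
        return [tree] if tree == query else []
    for child in tree:
        path = clades_containing(child, query)
        if path:
            return path + [tree]
    return []

def infer_function(tree, known, query):
    """Predict a function from the smallest clade that holds annotated genes."""
    for clade in clades_containing(tree, query):
        functions = {known[leaf] for leaf in leaves(clade) if leaf in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return "unknown"

# Gene tree with two paralogous groups (A and B) plus one ungrouped gene "4".
tree = (("1A", ("2A", "3A")), ("1B", ("2B", "3B")), "4")

# Hypothetical experimentally known functions overlaid onto the tree.
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "3B": "function B"}

for gene in ["2A", "2B", "4"]:
    print(gene, "->", infer_function(tree, known, gene))
```

Genes that fall inside a clade with one consistent annotation inherit it, while a gene that branches outside both paralog groups is reported as ambiguous, mirroring the "Ambiguous" case in the figure.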

Phylogenomics





Chemosynthetic Symbionts

Eisen et al. 1992. J. Bact. 174: 3416

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC206016/


Carboxydothermus hydrogenoformans

• Isolated from a Russian hot spring

• Thermophile (grows at 80°C)

• Anaerobic

• Grows very efficiently on CO (carbon monoxide)

• Produces hydrogen gas

• Low GC Gram positive (Firmicute)

• Genome determined (Wu et al. 2005 PLoS Genetics 1: e65)


Homologs of Sporulation Genes

Wu et al. 2005 PLoS Genetics 1: e65.

http://www.ncbi.nlm.nih.gov/entrez/utils/lofref.fcgi?PrId=4656&uid=16311624&db=pubmed&url=http://genetics.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pgen.0010065


Carboxydothermus sporulates




Non-Homology Predictions: Phylogenetic Profiling

• Step 1: Search all genes in organisms of interest against all other genomes

• Ask: Yes or No, is each gene found in each other species

• Cluster genes by distribution patterns (profiles)
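The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions: the homology search of step 1 is taken as already done and summarized as a set of genomes per gene, and all gene and genome names are hypothetical placeholders rather than real data.

```python
from collections import defaultdict

def build_profiles(hits, genomes):
    """Turn {gene: genomes with a homolog} into yes/no tuples, one slot per genome."""
    return {gene: tuple(g in found for g in genomes) for gene, found in hits.items()}

def cluster_by_profile(profiles):
    """Group genes whose presence/absence profiles are identical."""
    clusters = defaultdict(list)
    for gene, profile in sorted(profiles.items()):
        clusters[profile].append(gene)
    return dict(clusters)

# Hypothetical genomes and search results (step 1 is assumed to be done).
genomes = ["sporeformer_1", "sporeformer_2", "nonsporeformer_1", "nonsporeformer_2"]
hits = {
    "geneA": {"sporeformer_1", "sporeformer_2"},
    "geneB": {"sporeformer_1", "sporeformer_2"},
    "geneC": set(genomes),
    "geneD": {"nonsporeformer_1"},
}

profiles = build_profiles(hits, genomes)
clusters = cluster_by_profile(profiles)

for profile, genes in clusters.items():
    print("".join("1" if present else "0" for present in profile), genes)
```

Genes that share a profile (here geneA and geneB, found only in the spore-forming genomes) become candidates for a shared role, even when they have no sequence similarity to any gene of known function. Real analyses usually cluster similar rather than strictly identical profiles; exact matching is used here only to keep the sketch short.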


Sporulation Gene Profile




B. subtilis new sporulation genes


Functional Prediction III: Colocalization

• Operon structure is often maintained over fairly large taxonomic regions.

– Sometimes gene order is altered, and sometimes one or more enzymes are missing.

– But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.

• This is an operon that contains part of the glycolytic pathway.

– 1: phosphoglycerate mutase

– 2: triosephosphate isomerase

– 3: enolase

– 4: phosphoglycerate kinase

– 5: glyceraldehyde 3-phosphate dehydrogenase

– 6: central glycolytic gene regulator
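One simple way to look for this kind of conserved colocalization is to count how many genomes keep the same two genes next to each other. The sketch below assumes each genome is reduced to an ordered list of gene-family labels. The gene orders are illustrative examples rather than real annotations, and the short gene names (pgm, tpi, eno, pgk, gap, cggR) are used here as labels for the six genes listed above.

```python
from collections import Counter

def conserved_adjacent_pairs(gene_orders, min_genomes=2):
    """Return gene pairs that are neighbors in at least `min_genomes` genomes."""
    counts = Counter()
    for order in gene_orders.values():
        # Use a set so that a pair is counted once per genome, in either orientation.
        pairs = {frozenset(pair) for pair in zip(order, order[1:]) if pair[0] != pair[1]}
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_genomes}

# Illustrative gene orders for three hypothetical genomes.
gene_orders = {
    "genome_1": ["cggR", "gap", "pgk", "tpi", "pgm", "eno"],
    "genome_2": ["gap", "pgk", "tpi", "pgm", "eno"],          # regulator missing
    "genome_3": ["pgk", "gap", "eno", "tpi"],                 # order rearranged
}

conserved = conserved_adjacent_pairs(gene_orders)
for pair, n in sorted(conserved.items(), key=lambda item: -item[1]):
    print(sorted(pair), "adjacent in", n, "genomes")
```

A pair that stays adjacent across distantly related genomes, such as gap and pgk in this toy data, is the kind of signal that supports assigning the same function to widely diverged homologs.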


Metabolic Predictions


Comparative Genomics



Using the Core





means that database searches must be repeated regularly to keep annotation accurate and up to date.

One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.

What have we learned from genome analysis?

Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.

Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes (ref. 8). Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.

Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized (refs 10–12). For example, an

Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database

Gene ID       Similarity (%)   Length (bp)   Gene name                                                   E-value*

GP:2905647    44.8             1,191         D-Arabinitol kinase (Klebsiella pneumoniae)                 6.2e-15
EGAD:22614    46.2             1,191         Gluconokinase (Bacillus subtilis)                           1.4e-13
EGAD:20418    43.0             1,302         Xylulose kinase (Lactobacillus pentosus)                    4.8e-13
EGAD:105114   43.4             1,320         Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)   4.7e-12
GP:2895855    42.7             1,263         Xylulokinase (Lactobacillus brevis)                         1.0e-07
EGAD:10899    45.4             1,296         Xylulose kinase (Escherichia coli)                          2.1e-06

*E-value is a statistical measure of the significance of a BLAST search result.

Table 2 Genome features from 24 microbial genome sequencing projects

Organism                                 Genome size   No. of ORFs     Unknown         Unique
                                         (Mbp)         (% coding)      function        ORFs

Aeropyrum pernix K1                      1.67          1,885 (89%)
A. aeolicus VF5                          1.50          1,749 (93%)     663 (44%)       407 (27%)
A. fulgidus                              2.18          2,437 (92%)     1,315 (54%)     641 (26%)
B. subtilis                              4.20          4,779 (87%)     1,722 (42%)     1,053 (26%)
B. burgdorferi                           1.44          1,738 (88%)     1,132 (65%)     682 (39%)
Chlamydia pneumoniae AR39                1.23          1,134 (90%)     543 (48%)       262 (23%)
Chlamydia trachomatis MoPn               1.07          936 (91%)       353 (38%)       77 (8%)
C. trachomatis serovar D                 1.04          928 (92%)       290 (32%)       255 (29%)
Deinococcus radiodurans                  3.28          3,187 (91%)     1,715 (54%)     1,001 (31%)
E. coli K-12-MG1655                      4.60          5,295 (88%)     1,632 (38%)     1,114 (26%)
H. influenzae                            1.83          1,738 (88%)     592 (35%)       237 (14%)
H. pylori 26695                          1.66          1,589 (91%)     744 (45%)       539 (33%)
Methanobacterium thermoautotrophicum     1.75          2,008 (90%)     1,010 (54%)     496 (27%)
Methanococcus jannaschii                 1.66          1,783 (87%)     1,076 (62%)     525 (30%)
M. tuberculosis CSU#93                   4.41          4,275 (92%)     1,521 (39%)     606 (15%)
M. genitalium                            0.58          483 (91%)       173 (37%)       7 (2%)
M. pneumoniae                            0.81          680 (89%)       248 (37%)       67 (10%)
N. meningitidis MC58                     2.24          2,155 (83%)     856 (40%)       517 (24%)
Pyrococcus horikoshii OT3                1.74          1,994 (91%)     859 (42%)       453 (22%)
Rickettsia prowazekii Madrid E           1.11          878 (75%)       311 (37%)       209 (25%)
Synechocystis sp.                        3.57          4,003 (87%)     2,384 (75%)     1,426 (45%)
T. maritima MSB8                         1.86          1,879 (95%)     863 (46%)       373 (26%)
T. pallidum                              1.14          1,039 (93%)     461 (44%)       280 (27%)
Vibrio cholerae El Tor N1696             4.03          3,890 (88%)     1,806 (46%)     934 (24%)

Total                                    50.60         52,462 (89%)    22,358 (43%)    12,161 (23%)
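The "about one gene per kilobase" figure quoted in the text can be checked directly against Table 2. The short sketch below is an illustration only: it uses a handful of rows copied from the table, including the final totals row, and the function name is a placeholder chosen for this example.

```python
# (genome size in Mbp, number of ORFs), copied from Table 2.
table2 = {
    "M. genitalium": (0.58, 483),
    "H. influenzae": (1.83, 1738),
    "E. coli K-12-MG1655": (4.60, 5295),
    "All 24 genomes (totals row)": (50.60, 52462),
}

def genes_per_kb(size_mbp, n_orfs):
    """Gene density: predicted ORFs per kilobase of genome sequence."""
    return n_orfs / (size_mbp * 1000.0)

for name, (size_mbp, n_orfs) in table2.items():
    print(f"{name}: {genes_per_kb(size_mbp, n_orfs):.2f} ORFs per kb")
```

The density stays close to one ORF per kilobase from the smallest genome in the table to the largest, which is the consistency the text describes.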



After the Genomes

• Better analysis and annotation

• Comparative genomics

• Functional genomics (Experimental analysis of gene function on a genome scale)

• Genome-wide gene expression studies

• Proteomics

• Genome-wide genetic experiments