Multi-Centre Genome Sequencing and Analysis of the Dutch Elm Disease Fungus using the Roche/454 Titanium System
Ken Dewar and Vince Forgetta
Department of Human Genetics, McGill University
McGill University and Genome Quebec Innovation Centre
Acknowledgements
Study Design and Leadership
ABRF DSRG:
Chair (outgoing): Jan Kieleczawa, Pfizer Research
Chair (incoming): Michael Zianni, Ohio State University
Robert Steen, Harvard Medical School
Deborah Grove, Penn State University
Anoja Perera, Stowers Institute for Medical Research
Robert Lyons Jr., University of Michigan
Sushmita Singh, University of Minnesota
Doug Bintzler, University of Cincinnati
Scottie Adams, Trudeau Institute
Ken Dewar, McGill University

Sequencing Labs
Deborah Grove, Gregory Grove, Penn State University
Robert Lyons Jr., Suzanne Genik, University of Michigan
Chris Wright, Alvaro Hernandez, Sharon Bachman, Lorie Hetrick, U. Illinois at Urbana-Champaign
Sushmita Singh, Nichole Peterson, University of Minnesota
Vince Forgetta, Gary Leveque, Joana Dias, McGill University

Roche 454
Clotilde Teiling, Tim Harkins
Rationale of the Study
To evaluate the reproducibility of Roche/454 instrumentation, protocols, and reagents.

Experiment
1) Single DNA sample → ssDNA (MID) fragment library
2) Aliquots distributed to each of 5 centers
3) Common protocol followed for sequencing
4) Data returned for centralized analysis

Evaluation
1) Sequence yield (Mb, reads, read lengths)
2) Sequence accuracy
3) Effects on assembly
Choice of sample
Experimental considerations
5 sites performing minimum ½ run
Target output = 5 * 250 Mb = 1.25 Gb
>300X coverage of a typical 4 Mb bacterium
<0.5X coverage of a typical 3 Gb mammal

Sequence a known genome (easier analysis) versus sequence a new genome (contribution to new knowledge)

Fungal genome, the fungus responsible for Dutch elm disease
Pros: 30-50 Mb genome (25-40X coverage)
Cons: AT or GC skew? Repetitive elements?
Bad genome or bad assembly?
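The coverage figures on this slide follow from dividing the target output by the genome size. A minimal sketch of that arithmetic, using only the round numbers quoted on the slide (the helper name `fold_coverage` and the dictionary labels are illustrative, not from the study):

```python
# Fold coverage = total sequenced bases / genome size.
MB = 1_000_000
GB = 1_000 * MB


def fold_coverage(total_bases, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size


# 5 sites, each contributing a half run of about 250 Mb (slide figures).
target_output = 5 * 250 * MB

# Genome sizes are the round numbers quoted on the slide.
genomes = {
    "typical bacterium (4 Mb)": 4 * MB,
    "fungus, low estimate (30 Mb)": 30 * MB,
    "fungus, high estimate (50 Mb)": 50 * MB,
    "typical mammal (3 Gb)": 3 * GB,
}

coverage = {name: fold_coverage(target_output, size)
            for name, size in genomes.items()}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}X")
```

A 30-50 Mb fungal genome sits between the two extremes: enough depth for assembly without most of the output being redundant.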
Location of Participating Sequencing Cores
Site   Size of Team   Team Experience   Instrument Runs
A      2              22                22
B      5              172               163
C      3              128               128
D      2              5                 5
E      7              345               197
The participating sites encompassed a variety of levels of experience.
The Ideal Titanium Run
A run is advertised as ~1.2 million reads @ ~420 nt = >500 Mb
(½ run = 600K reads and 250 Mb)

Random sequence
200 cycles
600K reads (½ run)
310 Mb
516 nt read length
99% reads between 472-560 nt

The more skew from random (i.e. higher/lower GC), the more the length should increase (more homopolymers).
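The yield figures above are reads multiplied by read length. A quick check of the advertised run and the ideal random-sequence half run, as a sketch using only the slide's numbers (the function name `run_yield_mb` is illustrative):

```python
def run_yield_mb(reads, mean_read_length_nt):
    """Sequence yield in Mb for a given read count and mean read length."""
    return reads * mean_read_length_nt / 1_000_000


# Advertised full run: ~1.2 million reads at ~420 nt.
advertised_full_run = run_yield_mb(1_200_000, 420)

# Ideal half run on random sequence: 600K reads at 516 nt.
ideal_half_run = run_yield_mb(600_000, 516)

print(f"advertised full run: {advertised_full_run:.0f} Mb")
print(f"ideal half run:      {ideal_half_run:.1f} Mb")
```

The ideal half run comes out near 310 Mb, above the 250 Mb implied by halving the advertised figure, because the ideal read length (516 nt) is longer than the advertised ~420 nt.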
The Actual Titanium Runs
All sites used an aliquot from a single ssDNA library.
Not all sites used the same lots of reagents.

Site   Reads (K)   Mb
A      602         239
B      626         243
C      714         228
D      743         308
E      547         215

C: long shoulder of short reads

No site met the “standard” 600K reads and 250 Mb (although several were very close, considering we trimmed MIDs).
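Dividing each site's yield by its read count gives a rough mean read length, which makes the short-read shoulder at site C visible in the numbers. A minimal sketch, assuming the Reads column is in thousands (consistent with the "600K reads" standard) and using the first control-run figures from the table:

```python
# (reads in thousands, yield in Mb) per site, from the control half runs.
site_runs = {
    "A": (602, 239),
    "B": (626, 243),
    "C": (714, 228),
    "D": (743, 308),
    "E": (547, 215),
}

# Mean read length in nt = total bases / number of reads.
mean_read_length = {
    site: (mb * 1_000_000) / (reads_k * 1_000)
    for site, (reads_k, mb) in site_runs.items()
}

for site, length in mean_read_length.items():
    print(f"Site {site}: ~{length:.0f} nt per read")

shortest_site = min(mean_read_length, key=mean_read_length.get)
```

These are post-trimming averages, so they sit below the raw read lengths coming off the instrument.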
Fluidics problems cause short reads
This is what the results looked like.
This is what the lab looked like.
The Actual Titanium Runs

Site   Reads (K)   Mb
A      602         239
B      626         243
C      714         228    (shown first)
C      791         297    (shown in the updated table)
D      743         308
E      547         215
Ready for Genome Assembly
All sites attempted ½ run under control conditions, and a second ½ run under conditions of their choice.
In total, >6.4 million reads, >2.53 Gb generated.
78.2% of the data in reads >400 nt
Dutch elm disease is a vascular wilt.
DED has killed >1 billion elms in <100 years.
Dutch elm disease
Host: elm trees (varying susceptibility)
Pathogen: fungus (varying pathogenicity)
Vector: beetle
[Figure labels: 1920s, 1960s, 1990]
Multiple invasions
3 sub-species: Ophiostoma ulmi, novo-ulmi, and himal-ulmi
Ophiostoma novo-ulmi genetics and genomics
ascomycete fungus (Neurospora, Saccharomyces)
mycelial or budding cell growth
vegetatively haploid
can be cultured and crossed
pathogenicity under polygenic control
genetic linkage maps
chromosomes separated by PFGE
Estimated genome: 25-35 Mb
>6 linear chromosomes
Genome Sequencing Pilot
Single run performed for QC/QA before samples sent to other sites
~7X coverage and assembly
31 Mb assembly
contigs >50 Kb
low repeat content
[Figure labels: mtDNA, 18S rDNA, nuclear genome]
Genome Sequencing Pilot
In addition to the ssDNA library sent to all sites, generated an 8 Kb paired-end library
½ run sequenced at a single site

Total reads                      181,162
Reads >50nt-adaptor->50nt        78,026
Unique >50nt-adaptor->50nt       61,209
Est. paired-end distance         7.5 Kb
Est. genome template coverage    15X
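The 15X template coverage can be reproduced from the other figures on this slide. Template (physical) coverage counts the span between the two ends of each pair rather than the bases actually read. A minimal sketch, assuming a genome size of about 31 Mb taken from the pilot assembly (the function name is illustrative):

```python
def template_coverage(unique_pairs, insert_size_bp, genome_size_bp):
    """Physical coverage: total span of paired-end inserts / genome size."""
    return unique_pairs * insert_size_bp / genome_size_bp


unique_pairs = 61_209        # unique >50nt-adaptor->50nt reads
insert_size_bp = 7_500       # estimated paired-end distance
genome_size_bp = 31_000_000  # assumption: size of the pilot assembly

pe_coverage = template_coverage(unique_pairs, insert_size_bp, genome_size_bp)
print(f"template coverage: ~{pe_coverage:.1f}X")
```

At roughly 15X, most positions in the genome are spanned by several inserts, which is what allows contigs to be ordered into scaffolds.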
Genome Assembly Statistics
re-analyse PE to place contigs in scaffolds
re-assemble high coverage contigs

               WGS           WGS/PE    WGS/PE/analysis
Runs
  Reads        6.4 million   6.6       6.6
  Bases        2.53 Gb       2.6       2.6
Assembly
  Size         31.7 Mb       31.8      31.8
  Coverage     80X           81X       81X
  Contigs      252           160       160
  N50 contig   262 Kb        368       368
  Scaffolds    0 (+252)      8 (+9)    8 (+1)

In this assembly we estimate 151 gaps
22 >1 kb
85 between 400 bp and 1 kb
44 <400 bp
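For readers less familiar with the N50 statistic reported above: it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the standard definition, run on made-up contig lengths (the study's contig lengths are not given on the slide), followed by a check of the WGS coverage and gap totals from the slide's own numbers:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Slide figures: 2.53 Gb of reads over a 31.7 Mb WGS assembly.
wgs_coverage = 2.53e9 / 31.7e6

# Gap size classes reported on the slide.
gap_total = 22 + 85 + 44

print(toy_n50, round(wgs_coverage), gap_total)
```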
DiGuistini et al., 2009, Genome Biology
Sequencing the genome of Grosmannia clavigera
Combination of
GS20/FLX ~8.3X
Illumina 42 bp PE >50X
Fosmid ends >18,000 reads
Assembly
3295 contigs N50=32 Kb
163 scaffolds N50=558 Kb
Genome Assembly Quality Assessment
What proportion of reads accounted for in the assembly?

Total Reads            6,425,438
Genome (incl. rDNA)    5,670,075   88.2%
mtDNA                  633,519     9.9%
unassembled            121,844     1.9%
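The read accounting can be checked directly: the three categories should sum to the total, and each percentage is the category count over the total. A minimal sketch using the slide's counts (variable names are illustrative):

```python
total_reads = 6_425_438

read_counts = {
    "genome (incl. rDNA)": 5_670_075,
    "mtDNA": 633_519,
    "unassembled": 121_844,
}

# Every read should fall into exactly one category.
accounted = sum(read_counts.values())

percentages = {
    category: 100 * count / total_reads
    for category, count in read_counts.items()
}

for category, pct in percentages.items():
    print(f"{category}: {pct:.2f}%")
```

To one decimal place these match the slide's percentages; nearly a tenth of all reads come from the mitochondrial genome.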
Genome Assembly Quality Assessment
How do scaffolds compare to chromosomes?

Genome Assembly Quality Assessment
Does the assembly contain known genes?
3309 EST sequences available
3277 (99%) align to the assembly

Genome Assembly Quality Assessment
Does the mtDNA assembly make sense?
mtDNA: 66 kb, circular (confirmed by reads and PE)

Genome Assembly Quality Assessment
What’s up with the single contig not in a scaffold?
Smallest scaffold is 2.53 Mb
One non-scaffolded contig of 4.1 kb.
dsDNA plasmid
Site Performance versus Assembly
Deviations in genome coverage?
99.9% of all assembled bases are present in the read set of each site.
Site Performance versus Assembly
Deviations in homopolymer counting?

Length   Number    Bases
4        294,261   1,177,044
5        79,130    395,650
6        20,974    125,844
7        8,090     56,630
8        3,381     27,048
9        1,654     14,886
10       825       8,250

Site   Reads (K)   Mb
A      602         239
B      626         243
C      791         297
D      743         308
E      547         215

Site   Size of Team   Team Experience   Instrument Runs
A      2              22                22
B      5              172               163
C      3              128               128
D      2              5                 5
E      7              345               197
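A table like the homopolymer one above can be built by scanning a sequence for runs of identical bases. The sketch below is one way to do it, not the pipeline used in the study; it assumes "Number" is the count of runs of a given length and "Bases" is Length × Number, which the slide's figures satisfy. The toy sequence is made up.

```python
from collections import Counter
from itertools import groupby


def homopolymer_counts(sequence, min_length=4):
    """Count runs of a single repeated base, keyed by run length."""
    counts = Counter()
    for _base, run in groupby(sequence.upper()):
        length = sum(1 for _ in run)
        if length >= min_length:
            counts[length] += 1
    return counts


# Toy sequence for illustration: two runs of 4 and one run of 5.
toy_counts = homopolymer_counts("AAAATTTTTGGCAAAA")

# (length, number, bases) rows as shown on the slide.
slide_table = [
    (4, 294_261, 1_177_044),
    (5, 79_130, 395_650),
    (6, 20_974, 125_844),
    (7, 8_090, 56_630),
    (8, 3_381, 27_048),
    (9, 1_654, 14_886),
    (10, 825, 8_250),
]

# Check that Bases = Length x Number in every row.
table_is_consistent = all(
    length * number == bases for length, number, bases in slide_table
)
print(dict(toy_counts), table_is_consistent)
```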
Site Performance versus Assembly
Deviations in substitution error rates?
[Figure: substitution error rate (y-axis, 0 to 0.0025) by site A-E (x-axis), broken down by substitution type: A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G]
Across the entirety of the data, a read has a non-homopolymer associated change at a rate of 1/430 to 1/450.
Substitution errors are slightly biased, C>T four times more common than anything else.
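Per-type substitution rates of this kind can be tallied from read-to-reference alignments. The sketch below is illustrative only: it assumes pre-aligned, equal-length strings with "-" marking gaps, and reads "X>Y" as reference base X seen as Y in the read, a direction the slide does not state. The example sequences are made up.

```python
from collections import Counter


def substitution_counts(reference, read):
    """Tally ref>read mismatches over aligned, non-gap positions."""
    if len(reference) != len(read):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    compared = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == "-" or read_base == "-":
            continue  # indels are not substitutions
        compared += 1
        if ref_base != read_base:
            counts[f"{ref_base}>{read_base}"] += 1
    return counts, compared


# Toy alignment: a single C>T change in 10 compared positions.
toy_counts, toy_compared = substitution_counts("ACGTACGTAC", "ACGTATGTAC")
toy_rate = sum(toy_counts.values()) / toy_compared

# The slide's overall range, expressed as per-base rates.
rate_low, rate_high = 1 / 450, 1 / 430
print(dict(toy_counts), toy_rate, f"{rate_low:.5f} to {rate_high:.5f}")
```

Expressed as decimals, 1/450 to 1/430 is about 0.0022 to 0.0023, which fits within the 0 to 0.0025 axis of the figure.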
Summary and Conclusions
1. Roche/454 instrumentation, reagents and protocols are robust and reproducible across sites.
The fluidic systems need to be in top shape for best results.
2. A combination of Titanium reads and 8 kb paired ends has led to a very high quality assembly of the O. novo-ulmi genome.
It’s time for the O. ulmi biologists to get back to work…we’ve done our part.

Thanks ABRF and Roche/454