FIND MEANING IN COMPLEXITY
Jonas Korlach
Looking Ahead: Improving Workflows for SMRT Sequencing
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Covaris is a trademark of Covaris, Inc.; g-TUBE is a trademark of Bio Plas, Inc.; Caliper and Sciclone are trademarks of Caliper Life Sciences, Inc.; Agilent is a trademark of Agilent Technologies, Inc.; 454 is a trademark of Roche Diagnostics; and Illumina and Moleculo are trademarks of Illumina, Inc. © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
Requirements for Achieving High-Quality, Finished Genomes
1. High Consensus Accuracy
– >99.999% (QV50)
– Lack of systematic bias
2. Lack of sequence context bias
– GC content
– Low complexity sequence
3. Long sequence reads
– Resolve repeats, plasmids
– Full-length cDNA sequencing
– Long-range haplotype phasing
4. Base modification detection
– Epigenome characterization
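The ">99.999% (QV50)" requirement above follows from the standard Phred scale, where QV = -10 * log10(error rate). A minimal sketch of that conversion is shown here; the function names are illustrative and not taken from any PacBio tool.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a consensus accuracy (fraction in [0, 1)) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("accuracy must be in [0, 1) to give a finite QV")
    return -10.0 * math.log10(error_rate)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled QV back to an accuracy fraction."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.999% accuracy is one expected error per 100,000 bases.
    print(round(accuracy_to_qv(0.99999), 3))
    print(qv_to_accuracy(50))
```

On this scale, each additional 10 QV points corresponds to a tenfold lower error rate, so QV50 is one expected error per 100,000 consensus bases.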
Finished Genomes to Fight Foodborne Outbreaks
• ~76 million illnesses each year
• ~325,000 hospitalizations
• $78 billion economic loss (US)
• High serotype diversity
• Emerging hypervirulence
National Collection of Type Cultures (NCTC)
• Collaboration with Public Health England & the Wellcome Trust Sanger Institute
• Plan to finish 3000 bacterial and 500 viral genomes
Joint Genome Institute Production Pipeline
http://www.jgi.doe.gov/News/news_13_05_06.html
SMRT Sequencing
Joint Genome Institute Production Workflow
1. Bacterial Sample
2. Extract DNA
3. Shear to 10 kb w/ Covaris g-TUBE devices
4. Automated Library Prep on Caliper Sciclone Workstation
5. Automated Data Analysis
6. Publication Quality Finished Genomes
Automated Library Preparation
• Bravo platform (Agilent): http://www.chem.agilent.com/en-US/products-services/Instruments-Systems/Automation-Solutions/Bravo-Automated-Liquid-Handling-Platform/Pages/default.aspx
• Sciclone platform (Caliper): http://www.perkinelmer.com/Catalog/Family/ID/Sciclone%20NGS%20Workstation
Expanding the Scale
• Number of samples
• Sample complexity
[Figure axes: Number of samples; Sample complexity]
Sequencing Full-Length 16S rRNA
Collaboration with Chunlab, DNA Link, and Molecular Diagnostics Korea
Presented at ASM Annual Meeting, Denver, May 2013
[Figure panels: 16S rRNA length covered; Accuracy]
Measures of Diversity
318 100 684 137
Water Soil
Expanding the Scale
• Number of samples
• Sample complexity
• Genome size
[Figure axes: Number of samples; Sample complexity]
Yeast De Novo HGAP Assembly
data at http://pacbiodevnet.com/
[Figure: chromosomes I-XVI and M; scale bar = 100 kb]
Reference (S288C): 17 contigs
• Genome size = 12.3 Mb
De novo assembly: 30 contigs
• Assembly size = 12.3 Mb
• N50 = 770 kb
• Max contig = 1.5 Mb (chr. IV)
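The N50 quoted above is the contig length at which contigs of that size or larger contain at least half of all assembled bases. A minimal sketch of the calculation follows; the contig lengths in the usage example are made-up toy values, not the yeast assembly data.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the assembly size; the length of the
    contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    toy_lengths = [10, 8, 6, 4, 2]  # hypothetical contig lengths
    print(n50(toy_lengths))  # 8
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misassembled contig can inflate it, which is why it is reported alongside contig count and assembly size.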
Malaria De Novo Assembly
Plasmodium falciparum:
– 350-500 million infections per year
– 1 million deaths per year
– 20% average GC content, 23.3 Mb genome
                                454 pyrosequencing*   Sanger sequencing*    Illumina sequencing*  SMRT sequencing
                                Progeny               Parents               Reference genome      30 SMRT Cells
Sample                          7C126      SC05       Dd2        HB3        NP-3D7-S   NP-3D7-L   3D7
Number of contigs               9,452      9,597      4,511      2,971      26,920     22,839     98
N50 contig size (kb)            3.3        3.3        11.6       20.6       1.5        1.6        1,242
Largest contig (kb)             36.7       34.4       79.2       111.9      29.1       24.0       2,534
Number of assembled bases (Mb)  20.8       21.1       19.5       23.4       19.0       21.1       23.5
Average coverage                33×        36×        7.8×       7.1×       43×        64×        155×
Sample provided by the Broad Institute & Sarah Volkmann (Harvard School of Public Health)
*Samarakoon et al. (2011) BMC Genomics 12: 116.
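The ~20% average GC content noted for P. falciparum is the kind of extreme base composition that the "lack of sequence context bias" requirement refers to. As a minimal sketch, GC content can be computed as the fraction of called bases that are G or C. The function name and the toy sequence are illustrative assumptions, and ambiguous bases such as N are simply ignored.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / called


if __name__ == "__main__":
    toy_sequence = "AATTAATTGC"  # hypothetical AT-rich fragment
    print(gc_content(toy_sequence))  # 0.2
```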
Arabidopsis De Novo Assembly
• Original Col-0 assembly (Sanger) ~$70M, several years
• Sequenced & assembled Ler-0 strain:
data at http://pacbiodevnet.com/
*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/
Metric                  PacBio assembly   Short-read assembly (2011)*   Improvement
Assembly size (bp)      124,572,784       110,357,164                   12%
# contigs               540               4,662                         8.6x
Contig N50 (bp)         6,190,353         66,600                        90x
Max contig length (bp)  12,982,390        462,490                       30x
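The Improvement column above can be read as a simple ratio between the two assemblies. The sketch below shows one way such ratios might be derived from the table values. The helper name is an assumption, and the exact ratios can differ somewhat from the rounded figures shown on the slide.

```python
def fold_change(new: float, old: float, higher_is_better: bool = True) -> float:
    """Return how many times better `new` is than `old`.

    For metrics where a smaller value is better (e.g. contig count),
    pass higher_is_better=False so the ratio is inverted.
    """
    if new <= 0 or old <= 0:
        raise ValueError("both values must be positive")
    return new / old if higher_is_better else old / new


if __name__ == "__main__":
    # (PacBio value, short-read value, higher_is_better), copied from the table.
    metrics = {
        "Assembly size (bp)": (124_572_784, 110_357_164, True),
        "# contigs": (540, 4_662, False),
        "Contig N50 (bp)": (6_190_353, 66_600, True),
        "Max contig length (bp)": (12_982_390, 462_490, True),
    }
    for name, (pacbio, short_read, higher) in metrics.items():
        print(f"{name}: {fold_change(pacbio, short_read, higher):.2f}x")
```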
Hybrid Preassembly on HC-2ex
OverlapInCore Dominates PacBioToCA Run Time for Large Genomes
Parrot
• Data from original paper by Koren et al.
OverlapInCore speed up 14.7x
• Standard server 19.8 days
• HC-2ex 1.2 days
Ongoing optimization of demanding steps in hybrid and non-hybrid workflows with Pacific Biosciences
HC-2ex: 2@8 core Intel X5670 2.93GHz, 48GB DDR3, stripe 4 @ 600GB SATA disk (host); 16GB SG (coprocessor)
16C x86: same, host only
[Bar chart: Run Time (Days). 16C x86: 19.83; HC-2ex: 1.17]
The Next Challenge: Assembling Diploid Genomes
Build bioinformatics and visualization tools to support new algorithms that can resolve diploid genomes
Early assembly result for the Ler-0 + Col-0 “synthetic” diploid.
“With the RS, the contigs from our de novo assembly of the 400 Mbp rice genome are several fold better than the state-of-the-art ALLPATHS-LG assembly using short reads”
Michael C. Schatz, Ph.D.
Assistant Professor of Quantitative Biology
Cold Spring Harbor Laboratory
Rice Genome Assembly (Oryza sativa pv Nipponbare: 400 Mbp)
Library                 Data                                                      Contig N50
HiSeq® Fragments        50x 2x100bp @ 180                                         3,925
MiSeq® Fragments        23x 459bp; 8x 2x251bp @ 450                               6,332
Illumina® Mates         50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800   18,248
PBeCR + Illumina reads  7x 3500bp ** MiSeq reads for correction                   50,995
PBeCR + Illumina reads  19x ** MiSeq reads for correction                         155 kb
M. Schatz AGBT talk 2013
http://schatzlab.cshl.edu/presentations/2013-02-20.AGBT.Assembling%20Crop%20Genomes.pdf