FIND MEANING IN COMPLEXITY

Jonas Korlach

Looking Ahead: Improving Workflows for SMRT Sequencing

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Covaris is a trademark of Covaris, Inc.; g-TUBE is a trademark of Bio Plas, Inc.; Caliper and Sciclone are trademarks of Caliper Life Sciences, Inc.; Agilent is a trademark of Agilent Technologies, Inc.; 454 is a trademark of Roche Diagnostics; and Illumina and Moleculo are trademarks of Illumina, Inc. © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

A Year Ago

Today

Requirements for Achieving High-Quality, Finished Genomes

1. High Consensus Accuracy

– >99.999% (QV50)

– Lack of systematic bias

2. Lack of sequence context bias

– GC content

– Low complexity sequence

3. Long sequence reads

– Resolve repeats, plasmids

– Full-length cDNA sequencing

– Long-range haplotype phasing

4. Base modification detection

– Epigenome characterization
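The ">99.999% (QV50)" requirement above uses the Phred scale, where QV = -10 · log10(error rate). A minimal sketch of that conversion follows; the function names and the 5 Mb genome size in the usage lines are illustrative assumptions, not figures from the slides.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to accuracy (fraction of correct bases)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert accuracy (fraction of correct bases, below 1.0) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 1.0 to have a finite QV")
    return -10.0 * math.log10(error_rate)


def expected_errors(genome_size_bp: int, qv: float) -> float:
    """Expected number of consensus errors across a genome at a given QV."""
    return genome_size_bp * 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    # Hypothetical 5 Mb bacterial genome (assumed size, for illustration only).
    print(f"Expected errors in 5 Mb at QV50: {expected_errors(5_000_000, 50):.0f}")
```

At QV50 a genome of a few megabases is expected to carry only tens of consensus errors, which is why the slide pairs this threshold with finished-quality assemblies.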

Finished Genomes to Fight Foodborne Outbreaks

• ~76 million illnesses each year

• ~325,000 hospitalizations

• $78 billion economic loss (US)

• High serotype diversity

• Emerging hypervirulence

National Collection of Type Cultures (NCTC)

• Collaboration with Public Health England & the Wellcome Trust Sanger Institute

• Plan to finish 3000 bacterial and 500 viral genomes

Joint Genome Institute Production Pipeline

http://www.jgi.doe.gov/News/news_13_05_06.html

SMRT Sequencing

Joint Genome Institute Production Workflow

• Bacterial Sample

• Extract DNA

• Shear to 10 kb w/ Covaris g-TUBE devices

• Automated Library Prep on Caliper Sciclone Workstation

• Publication Quality Finished Genomes

• Automated Data Analysis

Automated Library Preparation

• Bravo platform (Agilent): http://www.chem.agilent.com/en-US/products-services/Instruments-Systems/Automation-Solutions/Bravo-Automated-Liquid-Handling-Platform/Pages/default.aspx

• Sciclone platform (Caliper): http://www.perkinelmer.com/Catalog/Family/ID/Sciclone%20NGS%20Workstation

Automated Library Preparation

Expanding the Scale

• Number of samples

• Sample complexity

[Figure: number of samples (x-axis) vs. sample complexity (y-axis)]

Sequencing Full-Length 16S rRNA

Collaboration with Chunlab, DNA Link, and Molecular Diagnostics Korea

Presented at ASM Annual Meeting, Denver, May 2013

[Figure panels: 16S rRNA length covered; accuracy]

Measures of Diversity

[Figure: water and soil samples; values shown: 318, 100, 684, 137]

Expanding the Scale

• Number of samples

• Sample complexity

• Genome size

[Figure: number of samples (x-axis) vs. sample complexity (y-axis)]

Yeast De Novo HGAP Assembly

[Figure: chromosomes I to XVI and M; scale bar = 100 kb]

data at http://pacbiodevnet.com/

Reference (S288C): 17 contigs

• Genome size = 12.3 Mb

De novo assembly: 30 contigs

• Assembly size = 12.3 Mb

• N50 = 770 kb

• Max contig = 1.5 Mb (chr. IV)
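The N50 figure quoted above is the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of that calculation, assuming the common definition; the contig lengths in the usage lines are made-up toy values, not the yeast assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical toy assembly (lengths in kb), for illustration only.
    toy_contigs_kb = [10, 8, 6, 4, 2]
    print("N50 (kb):", n50(toy_contigs_kb))
```

Because N50 weights contigs by length, a handful of very long contigs raises it far more than many short ones, which is why it is reported alongside contig count and maximum contig length.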

Malaria De Novo Assembly

Plasmodium falciparum:

– 350-500 million infections per year

– 1 million deaths per year

– 20% average GC content, 23.3 Mb genome

Technology                      454 pyrosequencing*   Sanger sequencing*    Illumina sequencing*  SMRT sequencing
Sample group                    Progeny               Parents               Reference genome      30 SMRT Cells
Strain or library               7C126      SC05       Dd2        HB3        NP-3D7-S   NP-3D7-L   3D7
Number of contigs               9,452      9,597      4,511      2,971      26,920     22,839     98
N50 contig size (kb)            3.3        3.3        11.6       20.6       1.5        1.6        1,242
Largest contig (kb)             36.7       34.4       79.2       111.9      29.1       24.0       2,534
Number of assembled bases (Mb)  20.8       21.1       19.5       23.4       19.0       21.1       23.5
Average coverage                33×        36×        7.8×       7.1×       43×        64×        155×

Sample provided by the Broad Institute & Sarah Volkmann (Harvard School of Public Health)

*Samarakoon et al. (2011) BMC Genomics 12: 116.
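Average coverage relates sequenced bases to genome size: coverage = total bases / genome size. The sketch below back-calculates the approximate yield implied by the SMRT column. It assumes the 155× figure was computed against the 23.3 Mb genome size and that all 30 SMRT Cells contributed equally; neither assumption is stated on the slide, and the function names are illustrative.

```python
def total_bases_mb(coverage: float, genome_size_mb: float) -> float:
    """Total sequenced bases (Mb) implied by a coverage depth and genome size."""
    return coverage * genome_size_mb


def yield_per_cell_mb(coverage: float, genome_size_mb: float, n_cells: int) -> float:
    """Average yield per SMRT Cell (Mb), assuming equal contribution per cell."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return total_bases_mb(coverage, genome_size_mb) / n_cells


if __name__ == "__main__":
    # Figures taken from the slide: 155x coverage, 23.3 Mb genome, 30 SMRT Cells.
    total = total_bases_mb(155, 23.3)
    per_cell = yield_per_cell_mb(155, 23.3, 30)
    print(f"Implied total bases: {total:.1f} Mb")
    print(f"Implied yield per SMRT Cell: {per_cell:.1f} Mb")
```

Under these assumptions the run corresponds to roughly 3.6 Gb of sequence, or about 120 Mb per SMRT Cell.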

Arabidopsis De Novo Assembly

• Original Col-0 assembly (Sanger) ~$70M, several years

• Sequenced & assembled Ler-0 strain:

data at http://pacbiodevnet.com/

*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/

                        PacBio assembly   Short-read assembly (2011)*   Improvement
Assembly size (bp)      124,572,784       110,357,164                   12%
# contigs               540               4,662                         8.6x
Contig N50 (bp)         6,190,353         66,600                        90x
Max contig length (bp)  12,982,390        462,490                       30x

New Algorithm Developments

Hybrid Preassembly on HC-2ex

OverlapInCore Dominates PacBioToCA Run Time for Large Genomes

Parrot

• Data from original paper by Koren et al.

OverlapInCore speed up 14.7x

• Standard server 19.8 days

• HC-2ex 1.2 days

Ongoing optimization of demanding steps in hybrid and non-hybrid workflows with Pacific Biosciences

HC-2ex: 2@8 core Intel X5670 2.93GHz, 48GB DDR3, stripe 4 @ 600GB SATA disk (host); 16GB SG (coprocessor)

16C x86: same, host only

[Bar chart: Run Time (Days). 16C x86: 19.83; HC-2ex: 1.17]

The Next Challenge: Assembling Diploid Genomes

Build bioinformatics and visualization tools for building new algorithms that can resolve diploid genomes

Early assembly result for the Ler-0 + Col-0 “synthetic” diploid.

“With the RS, the contigs from our de novo assembly of the 400 Mbp rice genome are several fold better than the state-of-the-art ALLPATHS-LG assembly using short reads”

Michael C. Schatz, Ph.D.

Assistant Professor of Quantitative Biology

Cold Spring Harbor Laboratory

Rice Genome Assembly (Oryza sativa pv Nipponbare: 400 Mbp)

Input data                Libraries                                  Contig N50
HiSeq® Fragments          50x 2x100bp @ 180                          3,925
MiSeq® Fragments          23x 459bp                                  6,332
                          8x 2x251bp @ 450
Illumina® Mates           50x 2x100bp @ 180                          18,248
                          36x 2x50bp @ 2100
                          51x 2x50bp @ 4800
PBeCR + Illumina reads    7x 3500bp ** MiSeq reads for correction    50,995
PBeCR + Illumina reads    19x ** MiSeq reads for correction          155 kb

M. Schatz AGBT talk 2013

http://schatzlab.cshl.edu/presentations/2013-02-20.AGBT.Assembling%20Crop%20Genomes.pdf

Applications Across All Genome Sizes

http://commons.wikimedia.org/wiki/File:Genome_Sizes.png

FIND MEANING IN COMPLEXITY

© Copyright 2012 by Pacific Biosciences of California, Inc. All rights reserved.