For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. NIST Genome in a Bottle (GIAB) Consortium Workshop at Stanford University Luke Hickey Senior Director, Human BioMedical Sciences, PacBio January 29, 2016

Jan2016 pac bio giab


Page 1: Jan2016 pac bio giab

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.

NIST Genome in a Bottle (GIAB) Consortium

Workshop at Stanford University

Luke Hickey – Senior Director, Human BioMedical Sciences, PacBio

January 29, 2016

Page 2: Jan2016 pac bio giab

Topics

- PacBio SMRT Sequencing Technology Development

- Human Genome Sequencing with PacBio Systems

- The Role of NIST GIAB Reference Material in PacBio Sequencing Technology Development, Optimization and Demonstration

Page 3: Jan2016 pac bio giab

PacBio SMRT Sequencing Technology

Page 4: Jan2016 pac bio giab

PACIFIC BIOSCIENCES® CONFIDENTIAL

SINGLE MOLECULE, REAL-TIME (SMRT) DNA SEQUENCING

Page 5: Jan2016 pac bio giab


SMRT SEQUENCING DATA CHARACTERISTICS

Long Reads

- Average >10,000 bases

High Consensus Accuracy

- Achieves >99.999% (30x)

Uniform, Unbiased Coverage

- Lack of GC% or sequence complexity bias

DNA Modification Detection

- Epigenome characterization

Page 6: Jan2016 pac bio giab


AREAS OF PACBIO TECHNOLOGY DEVELOPMENT

Library Preparation

- DNA Shearing

- Size Selection

- SMRTbell™ Library Preparation

Consumables

- SMRT Cells

- Zero-Mode Waveguides

- Phospholinked Nucleotides

Instruments

- PacBio® RS II

- SEQUEL™ SYSTEM

Sequencing Data Analysis

- Primary Analysis: base calling

- Secondary & Tertiary Analysis: mapping (daligner/BLASR), consensus accuracy (Quiver / HGAP), de novo assembly (Falcon / MHAP), SV calling, phasing, epigenetic analysis

Page 7: Jan2016 pac bio giab


PRODUCT RELEASES OVER THE LAST FOUR YEARS

- Feb 2012: C2 Launch

- May 2012: v1.3.1 SW Release – Base Mods

- Aug 2012: v1.3.2 MagBead Release

- Nov 2012: v1.3.3 (Microbial Base Modification, XL Chemistry, Stage Start)

- Jan 2013: SMRT® Cells v3, HGAP/Quiver

- Apr 2013: RS II Product Release (75K to 150K ZMW, 2x Throughput)

- Oct 2013: v2.1 (P5-C3 release, HGAP 2.0)

- Mar 2014: v2.2 (IsoSeq™, HLA-Typing)

- Oct 2014: v2.3 (P6-C4 release)

- Apr 2015: Barcode Support

- Oct 2015: Sequel System

Increased throughput by over 100x

Page 8: Jan2016 pac bio giab

HISTORY OF READ LENGTH PERFORMANCE

[Figure: average read length (bp), axis 0 to 14,000, by year from 2008 to 2015. Early PacBio chemistries carry the data labels 453, 1012, 1734. Chemistries shown: LPR, FCR, ECR2, C2–C2, P4–C2, P5–C3, P6–C4.]

- Average Read Length: 10,000 - 15,000 bp

- Throughput / SMRT® Cell: 750 Mb – 1.25 Gb

- Consensus Accuracy: QV50 @30-fold
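The "QV50 @30-fold" figure here and the ">99.999% (30x)" figure on the data characteristics slide appear to be the same claim in two notations. The sketch below shows the conversion, assuming the QV on the slide is a standard Phred-scaled quality value (QV = -10 · log10 of the per-base error probability). The function names are illustrative and not part of any PacBio tool.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base accuracy fraction."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy fraction (must be < 1.0) to a Phred-scaled QV."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)


def expected_errors(qv: float, n_bases: float) -> float:
    """Expected number of erroneous bases among n_bases at a given QV."""
    return n_bases * 10 ** (-qv / 10.0)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    print(f"QV for 99.999%: {accuracy_to_qv(0.99999):.1f}")
```

Under this assumption, QV50 means one expected error in 100,000 consensus bases, which matches the 99.999% accuracy quoted earlier.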

Page 9: Jan2016 pac bio giab

NIST GIAB REFERENCE MATERIAL 8398

- Serves as a well-characterized control material to facilitate development of novel library preparation and sequencing methods for human genomes at PacBio.

Page 10: Jan2016 pac bio giab


LIBRARY PREPARATION

Building of the SMRTbell Template (Sample Preparation)

- DNA Sample

- Fragment DNA

- Repair Ends

- Ligate Adapters

- Purify DNA

- Binding

Page 11: Jan2016 pac bio giab

ASSESSING THE IMPACT OF DNA QUALITY ON READ LENGTH

- Initial QC of human gDNA samples (NIST/Stanford)

Human gDNA samples from NIST GIAB: NA12878: CEPH/Utah Pedigree 1463, Lot K6

Thanks Dave Hsu!

- E. coli K12 gDNA is mostly >40 kb (same gel)

- Both NA12878 samples show significant degradation

- Look similar to Coriell samples

PFGE conditions:

- Bio-Rad CHEF Mapper XA System

- 1% PFG-certified agarose gel in 0.5x TBE

- ~200 ng DNA per lane

- Auto-algorithm program: Low = 5 kb, High = 150 kb

- Markers: 1 kb Extension Ladder (Invitrogen), 5 kb DNA Ladder (Bio-Rad)

- EtBr stained post-electrophoresis

Typhoon imaging:

- Fluorescence mode, EtBr channel

- 100 microns resolution

- +3 mm focal plane

Page 12: Jan2016 pac bio giab

Performance of NIST/NA12878 Libraries and E. coli K12

Metrics from SMRT Portal RS.PreAssembler.2

>15 kb libraries loaded at 25 pM on-chip (OCPW)

>30 and >40 kb libraries loaded at 75 pM on-chip (OCPW)

| Sample | nReads | #Bases | Mean RL | RL N50 |
|---|---|---|---|---|
| NA12878_15kb | 84,969 | 1,150 Mb | 13,533 | 18,622 |
| K12_15kb | 24,941 | 378 Mb | 15,161 | 21,140 |
| K12_30kb_DDR | 60,460 | 1,031 Mb | 17,055 | 24,745 |
| K12_40kb_DDR | 51,679 | 922 Mb | 17,835 | 26,282 |
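For readers unfamiliar with the "Mean RL" and "RL N50" columns, the following is a minimal sketch of how such metrics are commonly computed. It assumes the usual N50 definition (the length L at which reads of length L or longer contain at least half of all bases), which may differ in detail from what SMRT Portal reports. The read lengths in `toy_reads` are hypothetical and are not taken from the slides.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def mean_length(total_bases, n_reads):
    """Mean read length from a total base count and a read count."""
    return total_bases / n_reads


# Hypothetical read lengths, for illustration only.
toy_reads = [2_000, 5_000, 8_000, 12_000, 20_000, 25_000]

if __name__ == "__main__":
    print("toy N50:", n50(toy_reads))
    print("toy mean:", mean_length(sum(toy_reads), len(toy_reads)))
    # Cross-check against the NA12878_15kb row: 1,150 Mb over 84,969 reads.
    print("NA12878_15kb mean RL:", round(mean_length(1_150e6, 84_969)))
```

The cross-check on the NA12878_15kb row lands within a few bases of the reported mean of 13,533, with the gap explained by the base count being rounded to the nearest Mb. N50 sits above the mean because it weights each read by the bases it contributes.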

Page 13: Jan2016 pac bio giab

TYPICAL P6-C4 CHEMISTRY READ LENGTH PERFORMANCE ON A HUMAN GENOME

Data per SMRT Cell: 0.5 – 1 Gb

20 kb size-selected human library

4 hour movie

P6-C4 chemistry
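The per-cell yield above translates into a SMRT Cell count for a whole genome with simple arithmetic. The sketch below is illustrative only: the 3.1 Gb genome size is an assumption (not stated on the slide), and the 30-fold target is borrowed from the consensus accuracy figures quoted earlier in the deck.

```python
import math


def smrt_cells_needed(genome_size_bp, target_coverage, yield_per_cell_bp):
    """Number of SMRT Cells needed to reach a target fold coverage, rounded up."""
    return math.ceil(genome_size_bp * target_coverage / yield_per_cell_bp)


GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate haploid human genome size
TARGET_COVERAGE = 30            # assumption: 30-fold, as in the accuracy figures

if __name__ == "__main__":
    # Yield range from the slide: 0.5 to 1 Gb per SMRT Cell.
    worst_case = smrt_cells_needed(GENOME_SIZE_BP, TARGET_COVERAGE, 500_000_000)
    best_case = smrt_cells_needed(GENOME_SIZE_BP, TARGET_COVERAGE, 1_000_000_000)
    print(f"SMRT Cells for {TARGET_COVERAGE}x: {best_case} to {worst_case}")
```

Under these assumptions, a 30-fold human genome takes roughly 93 to 186 SMRT Cells at 0.5 to 1 Gb per cell.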

Page 14: Jan2016 pac bio giab

NEW LARGE INSERT LIBRARY PREPARATION PROTOCOLS

http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Megaruptor-Shearing.pdf

http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Needle_Shearing.pdf

Page 15: Jan2016 pac bio giab

Sequencing Human Genomes

So, you sequenced a human genome … how well did you do?

Page 16: Jan2016 pac bio giab

THE HUMAN GENOME – FEBRUARY 2001

Source: Science. 2001 Feb 16;291(5507):1304-51., Nature. 2001 Feb 15;409(6822):860-921.

Page 17: Jan2016 pac bio giab

THE HUMAN GENOME

- Over 6 billion base pairs

- Organized into 23 chromosomes

- With 2 copies of each

- One maternal, one paternal

- Carrying 20,000 genes

- Each encoding an average of 3 proteins

Source: NHGRI fact sheet

Accessing variation in the human genome enables genetic research.

“Much of the missing heritability (the 'dark matter' of the genome) will probably turn up as the technology advances.”

- Francis Collins

Nature 464, 674-675 (1 April 2010)

Page 18: Jan2016 pac bio giab


TYPES OF INFORMATION COLLECTED FROM PACBIO SEQUENCING OF A HUMAN GENOME

DNA

- Single-Nucleotide Variation (SNPs) ← Illumina “$1000 Genome”

- Structural Variation (SVs) ← Illumina “$1000 Genome”

- Haplotype Phasing ← Cloning/Sanger sequencing

- Epigenetics ← Illumina + bisulfite sequencing

- De Novo Genome Assembly ← Illumina + Hi-C/Dovetail

RNA

- Expression Quantitation ← Illumina

- Isoform Characterization ← PacBio

PacBio Genome

Page 19: Jan2016 pac bio giab

PACBIO SEQUENCING AND ASSEMBLY OF NA12878

“We sequenced NA12878 genomic DNA across 851 Pre P5-C3 and 162 P5-C3 [SMRT Cells] to generate 24× and 22× coverage with aligned mean read lengths of 2,425 and 4,891 base pairs, respectively.”

Page 20: Jan2016 pac bio giab

TABLE 1. NA12878 – PACBIO ASSEMBLY RESULTS

Page 21: Jan2016 pac bio giab

FIGURE 2. TANDEM-REPEAT DETECTION FROM SINGLE MOLECULES PREDICTS A LARGE DIVERGENCE FROM REFERENCE.

Page 22: Jan2016 pac bio giab

REPEAT EXPANSION DISEASES

Sergei M. Mirkin (2007). Expandable DNA repeats and human disease, Nature 447, 932-940

Page 23: Jan2016 pac bio giab

“It is time to stop thinking that merely more DNA sequencing will give us the variants that determine human traits”

“We encourage the use of a range of sequencing technologies to explore highly variable and complex genomic regions in a large number of human samples.”

http://www.nature.com/ng/journal/v47/n9/pdf/ng.3397.pdf

SEPTEMBER 2015 -

Page 24: Jan2016 pac bio giab

“Full resolution of variation is only guaranteed by complete de novo assembly of a genome.”

“We … emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.”

VOLUME 16 | NOVEMBER 2015 | 627

Source: www.nature.com/nrg/journal/v16/n11/full/nrg3933.html

Page 25: Jan2016 pac bio giab

COST-PER-GENOME DILEMMA (QUANTITY VS. QUALITY)

Contig N50:

- NCBI-34: 29 Mb

- HuRef: 107 kb

- BGI YH: 7.4 kb

- KB1: 5.5 kb

- NA12878: 24 kb

- CHM1: 144 kb

- RP11: 127 kb

According to NHGRI website, the definition of “sequencing a genome” changed in the year 2008 to refer to “re-sequencing” in lieu of “de novo assembly.”

- Obtaining a de novo human genome that has the same scientific quality standard as the initial HGP work has NOT followed Moore’s law.

Source: NHGRI – Genome Sequencing Costs - http://www.genome.gov/sequencingcosts/

Page 30: Jan2016 pac bio giab

HUMAN GENOME DE NOVO ASSEMBLIES

| Year | Technology | Assembler | Sample | Contig N50 (Mb) |
|---|---|---|---|---|
| 2007 | ABI 3730 | Celera | HuRef | 0.11 |
| 2009 | Illumina GA | SOAP de novo | BGI YH | 0.007 |
| 2010 | 454 GS Flx Titanium | Newbler | KB1 | 0.006 |
| 2010 | Illumina GA | ALLPATHS-LG | NA12878 | 0.024 |
| 2013 | 454 GS, HiSeq, MiSeq | Newbler | RP11_0.7 | 0.13 |
| 2014 | HiSeq, BAC clones | Reference-guided | CHM1 | 0.14 |
| 2014 | PacBio RS II | FALCON | CHM1 | 4.38 |
| 2015 | PacBio RS II | FALCON | CHM13 | 12.98 |
| 2015 | PacBio RS II | FALCON | AK1 | 7.28 |
| 2015 | PacBio RS II | FALCON | HuRef | 10.38 |
| 2015 | PacBio RS II | FALCON | PC-9* | 3.58 |
| 2015 | PacBio RS II | FALCON | SK-BR-3* | 2.56 |

*cancer cell lines

Chart annotation: 26.9 Mb - NCBI: GCA_001297185.1

Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)

Page 31: Jan2016 pac bio giab

THE HUMAN GENOME - 2015

http://www.ncbi.nlm.nih.gov/assembly/GCA_001297185.1/

Contig N50 26.9 Mb

Page 32: Jan2016 pac bio giab

TOWARDS PLATINUM GENOMES: PACBIO RELEASES A NEW, HIGHER QUALITY CHM1 ASSEMBLY TO NCBI

Figure 1. The PacBio CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mbp and 109 Mbp long, respectively.

Posted: Friday, October 2, 2015

Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog

Page 34: Jan2016 pac bio giab

NIST GENOME IN A BOTTLE (GIAB) PROJECT


Ashkenazim Trio de novo Genome Sequencing Project

Collaborative project with Icahn School of Medicine at Mt. Sinai, New York City

Sequencing:

• Generated PacBio de novo human sequencing from the GIAB Ashkenazim son-father-mother trio from the Personal Genome Project (HG002, HG003, HG004).

• The AJ genomes are candidate NIST Reference Materials planned for release in 2016.

• PacBio coverage is 69X, 32X, and 30X for HG002, HG003, and HG004, respectively.

• A paper describing these data and other data from GIAB is now on bioRxiv.

Sequencing data publicly posted on NCBI:

• NIST Human HG002 NA24385 (Ashkenazim Trio Son) on the NCBI FTP site.

• NIST Human HG003 NA24149 (Ashkenazim Trio Father) on the NCBI FTP site.

• NIST Human HG004 NA24143 (Ashkenazim Trio Mother) on the NCBI FTP site.

https://github.com/PacificBiosciences/DevNet/wiki/Genome-in-a-Bottle-Ashkenazim-Trio

Page 35: Jan2016 pac bio giab

GIAB PacBio Assembly Summary with SV calls derived from de novo assemblies

Mount Sinai: Ali Bashir, Matthew Pendleton, Ryan Neff

Pacific Biosciences: Jason Chin

Reed College: Anna Ritz

Page 36: Jan2016 pac bio giab

Overview

• Steps for SV calling

– De novo Falcon assembly

– Reference-based comparison

• Mapping with BLASR and Nucmer

– Secondary refined using HMM

– Re-examination of potential deviations in the reference with raw-reads

• Currently extending MultiBreak-SV

Page 37: Jan2016 pac bio giab

PacBio Falcon Assembly Stats Trio

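| Sample | Contigs | Average | N50 | Max | Total Size |
|---|---|---|---|---|---|
| HG002 | 13231 | 230 kb | 4.1 Mb | 31.6 Mb | 3.04 Gb |
| HG003 | 17873 | 172 kb | 4.6 Mb | 21.5 Mb | 3.08 Gb |
| HG004 | 16487 | 185 kb | 5.3 Mb | 22.6 Mb | 3.05 Gb |

[Figure axes: log y-scale, log x-scale]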

Page 38: Jan2016 pac bio giab

Both high/low coverage AJ assemblies highly consistent with GRCh38

HG002

Page 39: Jan2016 pac bio giab

Both high/low coverage AJ assemblies highly consistent with GRCh38

HG003

Page 40: Jan2016 pac bio giab

Both high/low coverage AJ assemblies highly consistent with GRCh38

HG004

Page 41: Jan2016 pac bio giab

PacBio Assembly Based SV Calls

| Sample | Deletion | Insertion | Other | Total |
|---|---|---|---|---|
| HG002 | 9237 | 12489 | 2534 | 24260 |
| HG003 | 9356 | 12299 | 2580 | 24235 |
| HG004 | 9189 | 12290 | 2589 | 24068 |

Page 42: Jan2016 pac bio giab

PacBio Assembly Based SV Calls

| Sample | Deletion | Insertion | Other | Total |
|---|---|---|---|---|
| HG002 | 9237 | 12489 | 2534 | 24260 |
| HG003 | 9356 | 12299 | 2580 | 24235 |
| HG004 | 9189 | 12290 | 2589 | 24068 |

Note: Log x-scale to show full event sizes

Page 43: Jan2016 pac bio giab

SV calls consistent between assembly approaches (Falcon vs. Celera)

[Figure panels: Insertion, Deletion, Other]

Page 44: Jan2016 pac bio giab

Ongoing

• Refining raw read-based analysis:

– Build new calls

– Mark false-positives

– Identifying discrepancies between two assemblies

– Force calling trios

• Improving heterozygous calls missed via local assembly

• Refining “other” categories

– e.g. splitting out simple and complex inversions

• Merging BioNano/10X calls with PacBio data

Page 45: Jan2016 pac bio giab

ROLE OF NIST GIAB AJ TRIO PROJECT AND REFERENCE MATERIAL IN PACBIO TECHNOLOGY DEVELOPMENT

- PacBio characterization data serves as a public resource for data analysis methods development by the community:

  - Structural variation

  - SNV calling

  - De novo assembly

  - Phasing & haplotype reconstruction

  - Methylation / Epigenetic analysis

- Analytical data from multiple platforms serves as validation for algorithm development

- Characterization data and reference material provide a benchmark for development of novel methods

  - New chemistry development to increase read-length and accuracy (e.g., library prep methods, polymerase, etc.)

  - Scaffolding using novel library preparation methods

  - Rare variant calling with dilution analysis

- Well-characterized RM will serve as a resource for future use in internal quality testing

  - Consumables

  - Instruments

  - Analysis methods

Page 46: Jan2016 pac bio giab


1000+ PUBLICATIONS TO DATE FEATURING PACBIO SEQUENCING

[Figure: publication counts by year, 2011 to 2015 (axis 0 to 800), by category: Human Biomedical, Plant & Animal, Microbiology]

Page 47: Jan2016 pac bio giab

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.

www.pacb.com

Page 48: Jan2016 pac bio giab

PACBIO RS II

150+ PLACEMENTS

Some pins represent multiple placements