
[Figure: purity values by platform. Axes: MiSeq and PGM purity, each ranging from 0.6 to 1.0.]

[Figure: library preparation and sequencing workflows. Template (genomic) DNA is either fragmented enzymatically (e.g., Fragmentase or Ion Xpress), end repaired to blunt-end fragments, ligated to adaptors, and size selected, or tagmented (transposase + adaptors). Amplification is by emulsion PCR and enrichment (Ion PGM) or solid-phase bridge amplification (MiSeq). Sequencing platforms: MiSeq (© Illumina), Ion PGM, PacBio RS (© Pacific Biosciences).]

Modified from Loman et al. 2012, Nature Reviews Microbiology 10(9).

Property                                   Analysis
Genome Assembly                            De-novo assembly (Koren et al. 2013)
Chromosome structure                       Validation:
                                           - OpGen MapSolver
                                           - Assembly finishing (Walker et al. 2014)
Base Level Purity                          Purity metric
Strain diversity: within lot               Ratio of reference vs. alternative base calls
Homogeneity: vial-to-vial                  Somatic variant caller (Koboldt et al. 2009)
Genomic Contaminants (non-RM strain DNA)   Taxonomic read classifier (Hong et al. 2014)
DNA Stability*                             Gel image analysis

*Work in progress, data not shown
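The purity metric is described here only as a ratio of reference vs. alternative base calls. As a minimal sketch, assuming purity at a position is the fraction of base calls that match the reference, ref / (ref + alt), it could be computed as follows. The function name and the example counts are hypothetical and are not taken from PEPR.

```python
def position_purity(ref_calls: int, alt_calls: int) -> float:
    """Fraction of base calls at one position that match the reference.

    Assumption: purity = ref / (ref + alt). The poster only states that the
    metric is a ratio of reference vs. alternative base calls.
    """
    total = ref_calls + alt_calls
    if total == 0:
        raise ValueError("no base calls at this position")
    return ref_calls / total


# Hypothetical pileup counts per position: (reference calls, alternative calls).
example_counts = {"pos_1": (200, 0), "pos_2": (194, 6), "pos_3": (150, 50)}
purity = {pos: position_purity(ref, alt) for pos, (ref, alt) in example_counts.items()}
```

Under this assumption a position with no alternative calls has purity 1.0, and values fall toward 0.5 as reference and alternative calls approach equal counts.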

[Figure: reads assigned to contaminant organisms, by platform (MiSeq, PGM). x-axis: Reads (0 to 300); y-axis: Organism. Organisms shown: Achromobacter xylosoxidans, Delftia acidovorans, Delftia sp., Homo sapiens, Methylobacterium populi, Pseudomonas fluorescens, Enterobacter aerogenes, Stenotrophomonas maltophilia, Enterobacteria phage, Cronobacter sakazakii, Enterobacter sp., Erwinia sp., uncultured bacterium, Citrobacter koseri, Enterobacteriaceae bacterium, Enterobacter cloacae, synthetic construct, Citrobacter rodentium, Klebsiella pneumoniae, Escherichia coli.]

                                                 Targeted Coverage
Seq Platform      Vials  Libraries  Read Length  Library  Total
PacBio RSII       2      1          8 kb                  200
Illumina MiSeq    8      2          2 x 300 bp   175      2800
Ion Torrent PGM   8      1          400 bp       37.5     600

Total Coverage: 3600
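The total coverage figure can be checked with simple arithmetic. The sketch below sums the per-platform totals from the sequencing table. The relationship "total = vials x libraries x per-library coverage" is an assumption, not something the poster states, and the dictionary layout is purely illustrative.

```python
# Targeted coverage values copied from the sequencing table.
platforms = {
    "PacBio RSII": {"vials": 2, "libraries": 1, "per_library": None, "total": 200},
    "Illumina MiSeq": {"vials": 8, "libraries": 2, "per_library": 175, "total": 2800},
    "Ion Torrent PGM": {"vials": 8, "libraries": 1, "per_library": 37.5, "total": 600},
}

# Sum of the per-platform totals.
total_coverage = sum(p["total"] for p in platforms.values())

# Assumed relationship: total = vials x libraries per vial x coverage per library.
miseq = platforms["Illumina MiSeq"]
miseq_expected = miseq["vials"] * miseq["libraries"] * miseq["per_library"]
```

The summed totals give 3600, and the assumed product reproduces the MiSeq total of 2800. The same product does not reproduce the PGM total (8 x 1 x 37.5 = 300 vs. 600 listed), so the per-row relationship should be treated as unverified.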

Pipeline for Evaluating Prokaryotic References (PEPR)
• A reproducible, transparent, and reusable bioinformatic pipeline for characterizing prokaryotic genomic reference materials.

Characterization of Bacterial Genomic Reference Materials

N. D. Olson1 ([email protected]), S. A. Jackson1, N. J. Lin1, J. M. Zook1, M. L. Salit1,2

1 Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD
2 Department of Bioengineering, Stanford University, Stanford, CA

Salmonella enterica LT2 Results

Background

In the public health sector, high-stakes decisions are being made using microbial genomic sequencing data (Tang and Gardy 2014). For example, whole genome sequencing was recently used as part of an investigation of the 2011 European Escherichia coli O104:H4 sprout-associated outbreak (Grad et al. 2012). As the stakes increase, so does the required level of confidence in the measurement.

To address this problem, NIST is developing:
1. A bioinformatic pipeline to characterize genomic materials,
2. Microbial genomic DNA candidate reference materials.

Strain Selection
• Public health relevance
• Range of %GC

Sequencing

Genome Assembly

Base Level Analysis

Genomic Contaminants

Acknowledgements

References

The authors would like to thank Jenny McDaniel, Lindsay Vang, and David Catoe for performing the measurements; Heike Sichtig, Marc Allard, Shashi Sharma, and Nangarajan Thirunavukkasu for guidance and assistance with strain selection and acquisition; and Tim Muruvanda for performing the PacBio sequencing.

This work was supported by the Department of Homeland Security (DHS) Science and Technology Directorate under the Interagency Agreement HSHQPM-14-X-00078 with NIST and by two inter-agency agreements with the FDA.

Four Microbial Genomic Candidate RMs
• Strains selected for public health relevance
• Range of %GC to best challenge sequencing technology

Key Concepts to RM Characterization
• Orthogonal measurement methods
• Conservative approach to analysis
• Reproducible and transparent workflow

Characterization Results for Candidate RMs
• Genome Assembly
  • Closed assembly using long read sequencing data
  • S. enterica: 1.6 Mb putative inversion
  • S. aureus: agreement between optical map and assembly
• Base Level Analysis
  • Low strain diversity
  • No indication of genomic heterogeneity
• Genomic Contaminants
  • Likely reagents or bioinformatic errors

Expected Use of Candidate RM and Data
• Stable and homogeneous material suitable for benchmarking sequencing platforms and chemistries.
• Data generated as part of material characterization can be used to evaluate bioinformatic pipelines and algorithms.

Expected Use of PEPR
• Characterize application-specific in-house materials as part of a routine quality control program.

NIST Candidate RM Development

Material Production
• Produced by local vendor
• For each strain
  • Pure culture
  • Homogenized DNA batch
  • ~1000 vials with 3 μg DNA

Strain                    Biosample      Size        GC
Salmonella enterica LT2   SAMN02854572   C* 4.8 Mb   52
                                         P* 94 kb    53
Staphylococcus aureus     SAMN02854573   C* 2.8 Mb   33
                                         P* 25 kb    29
Pseudomonas aeruginosa    SAMN02854574   C* 6.3 Mb   67
Clostridium sporogenes    SAMN02854575   C* 4.1 Mb   28

*C - chromosome, P - plasmid
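The %GC range that the strain selection is meant to cover can be read off the chromosome rows of the strain table. The short sketch below does that arithmetic. The dictionary layout and variable names are illustrative only.

```python
# Chromosome sizes (Mb) and %GC copied from the strain table.
chromosomes = {
    "Salmonella enterica LT2": {"size_mb": 4.8, "gc_percent": 52},
    "Staphylococcus aureus": {"size_mb": 2.8, "gc_percent": 33},
    "Pseudomonas aeruginosa": {"size_mb": 6.3, "gc_percent": 67},
    "Clostridium sporogenes": {"size_mb": 4.1, "gc_percent": 28},
}

# Strains at the two ends of the chromosome %GC range.
lowest = min(chromosomes, key=lambda s: chromosomes[s]["gc_percent"])
highest = max(chromosomes, key=lambda s: chromosomes[s]["gc_percent"])
gc_span = chromosomes[highest]["gc_percent"] - chromosomes[lowest]["gc_percent"]
```

On these values the chromosomes span 28 %GC (Clostridium sporogenes) to 67 %GC (Pseudomonas aeruginosa).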

[Figure panels: De-novo Assembly; Optical Map]

Comparison of long-read de-novo assembly to optical mapping data. Red lines and black arrows highlight the discrepancy between the optical mapping data and the de-novo assembly, suggesting a 1.6 Mb inversion.

• Long-read de-novo assembly: closed genome
• Optical mapping: putative ~1.6 Mb inversion

Comparison of purity values by platform. Dots represent positions with purity values less than 0.98 for both platforms. Positions with purity values less than 0.97 for both platforms are indicated by orange dots.

• 11 of 4.8 M positions with purity < 0.97 for MiSeq and PGM
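The selection rule in the caption (purity below a threshold on both platforms) can be sketched as a simple filter. The per-position purity values below are hypothetical placeholders, not the poster's data, and `low_on_both` is an illustrative helper rather than a PEPR function.

```python
def low_on_both(miseq: dict, pgm: dict, threshold: float) -> list:
    """Return sorted positions whose purity is below `threshold` on both platforms."""
    shared = miseq.keys() & pgm.keys()  # only positions measured on both platforms
    return sorted(p for p in shared if miseq[p] < threshold and pgm[p] < threshold)


# Hypothetical purity values keyed by genome position.
miseq_purity = {101: 0.99, 2050: 0.96, 3300: 0.975, 4100: 0.90}
pgm_purity = {101: 0.95, 2050: 0.965, 3300: 0.979, 4100: 0.99}

plotted = low_on_both(miseq_purity, pgm_purity, 0.98)  # positions drawn as dots
flagged = low_on_both(miseq_purity, pgm_purity, 0.97)  # positions drawn in orange
```

Requiring agreement between two platforms is in keeping with the orthogonal-measurement approach: a position that looks impure on only one platform is more likely a platform-specific error than real strain diversity.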

Number of reads assigned to contaminants; each point represents read counts for a single dataset.

• 99.995% minimum genomic purity
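The poster does not state how the minimum genomic purity is derived. One plausible reading, shown here strictly as an assumption, is that each dataset's purity is the fraction of reads not assigned to a contaminant, and the reported value is the minimum across datasets. The dataset names and read counts are hypothetical.

```python
def genomic_purity(contaminant_reads: int, total_reads: int) -> float:
    """Fraction of reads in a dataset that were not assigned to a contaminant.

    Assumption: purity = 1 - contaminant_reads / total_reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1.0 - contaminant_reads / total_reads


# Hypothetical datasets: (reads assigned to contaminants, total reads).
datasets = {"miseq_vial_1": (40, 1_000_000), "pgm_vial_1": (10, 500_000)}
purities = {name: genomic_purity(c, t) for name, (c, t) in datasets.items()}

# Conservative summary: report the worst (lowest) purity across datasets.
minimum_purity = min(purities.values())
```

Reporting the minimum rather than the mean fits the conservative approach to analysis listed among the key concepts.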

Bioinformatic characterization pipeline. Tan objects represent input measurement data, blue objects the four primary analyses, and green objects the pipeline output. Source code for the bioinformatic pipeline: github.com/nate-d-olson/pepr. R package for generating the database and report of analysis: github.com/nate-d-olson/peprr. The report of analysis is a summary of the characterization process results.

RM Characterization

• Grad, Y. et al. 2012. “Genomic Epidemiology of the Escherichia Coli O104:H4 Outbreaks in Europe, 2011.” PNAS 109(8): 3065–70. doi:10.1073/pnas.1121491109.

• Hong, C. et al. 2014. “PathoScope 2.0: a Complete Computational Framework for Strain Identification in Environmental or Clinical Sequencing Samples.” Microbiome 2 (1). doi:10.1186/2049-2618-2-33.

• Koboldt, D. et al. 2009. “VarScan: Variant Detection in Massively Parallel Sequencing of Individual and Pooled Samples.” Bioinformatics 25 (17): 2283–5. doi:10.1093/bioinformatics/btp373.

• Koren, S. et al. 2013. “Reducing Assembly Complexity of Microbial Genomes with Single-Molecule Sequencing.” Genome Biology 14 (9): R101. doi:10.1186/gb-2013-14-9-r101.

• Tang, P. et al. 2014. “Stopping Outbreaks with Real-Time Genomic Epidemiology.” Genome Medicine 6 (11): 104. doi:10.1186/s13073-014-0104-4.

• Walker, B. et al. 2014. “Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement.” PLoS ONE 9 (11). doi:10.1371/journal.pone.0112963.

Disclaimer

Opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of the DHS, NIST, or affiliated venues. Certain commercial equipment, instruments, or materials are identified in this paper only to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the NIST, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.

Official contribution of NIST; not subject to copyrights in USA.

[Vial label: Salmonella Typhimurium LT2 DNA for Whole Genome Variant Assessment. Sample: MG-001. Store at -20°C.]

S. aureus Results

Genome Assembly

Comparison of long-read de-novo assembly to optical mapping data.

• Long-read de-novo assembly resulted in a closed genome
• Optical mapping supports the de-novo assembly

Base Level Analysis

Comparison of purity values by platform. Dots represent positions with purity values less than 0.98 for both platforms. Positions with purity values less than 0.97 for both platforms are indicated by orange dots.

• 4 of 2.8 M positions with purity < 0.97 for MiSeq and PGM

Genomic Contaminants

Number of reads assigned to contaminants; each point represents read counts for a single dataset.

• 99.997% minimum genomic purity

• Orthogonal methods used to characterize RM

[Figure: reads assigned to contaminant organisms, by platform (MiSeq, PGM). x-axis: Reads (0 to 200); y-axis: Organism. Labels shown: Achromobacter xylosoxidans, Methylobacterium populi, Gallibacterium anatis, Streptococcus oralis, Human papillomavirus, Streptococcus mitis, NanoLuc reporter, unidentified cloning, Bacillus cereus, Shuttle vector, Roseburia hominis, Streptococcus pneumoniae, Stenotrophomonas maltophilia, Pseudomonas mendocina, synthetic construct, Campylobacter coli, Gemella morbillorum, Enterococcus sp., Nanoluc luciferase, Enterococcus faecium, Escherichia coli, uncultured bacterium.]

[Figure panels: De-novo Assembly; Optical Map]

Conclusions