[Figure: library preparation and sequencing workflows for the PacBio RS, Illumina MiSeq, and Ion PGM platforms. Steps shown: template (genomic) DNA; enzymatic fragmentation (e.g., Fragmentase or Ion Xpress) with end repair, or tagmentation (transposase + adaptors); fragmentation and adaptor ligation; size selection; blunt-end fragments; adaptors; tagging; amplification (emulsion PCR and enrichment for Ion PGM, solid-phase bridge amplification for MiSeq); sequencing. Panel credits: © Pacific Biosciences, © Illumina.]
Modified From Loman et al. 2012 Nature Reviews Microbiology 10(9)
Property | Analysis
Genome Assembly | de-novo assembly (Koren et al. 2013)
Chromosome structure | Validation: OpGen MapSolver; assembly finishing (Walker et al. 2014)
Base Level Purity | Purity metric
Strain diversity: within lot | ratio of reference vs. alternative base calls
Homogeneity: vial-to-vial | Somatic variant caller (Koboldt et al. 2009)
Genomic Contaminants (non-RM strain DNA) | Taxonomic read classifier (Hong et al. 2014)
DNA Stability* | gel image analysis
*Work in progress, data not shown
[Figure: reads assigned to contaminant organisms, by platform (MiSeq, PGM). x-axis: Reads (0 to 300); y-axis: Organism. Organisms shown: Achromobacter xylosoxidans, Delftia acidovorans, Delftia sp., Homo sapiens, Methylobacterium populi, Pseudomonas fluorescens, Enterobacter aerogenes, Stenotrophomonas maltophilia, Enterobacteria phage, Cronobacter sakazakii, Enterobacter sp., Erwinia sp., uncultured bacterium, Citrobacter koseri, Enterobacteriaceae bacterium, Enterobacter cloacae, synthetic construct, Citrobacter rodentium, Klebsiella pneumoniae, Escherichia coli.]
Seq Platform | Vials | Libraries | Read Length | Targeted Coverage (Library) | Targeted Coverage (Total)
PacBio RSII | 2 | 1 | 8 kb | | 200
Illumina MiSeq | 8 | 2 | 2 x 300 bp | 175 | 2800
Ion Torrent PGM | 8 | 1 | 400 bp | 37.5 | 600
Total Coverage: 3600
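As a quick consistency check on the sequencing table, the sketch below sums the per-platform targeted totals. The dictionary name is illustrative. The MiSeq breakdown assumes that "Libraries" means libraries per vial, which the poster does not state.

```python
# Per-platform targeted total coverage, copied from the sequencing table.
platform_total_coverage = {
    "PacBio RSII": 200,
    "Illumina MiSeq": 2800,
    "Ion Torrent PGM": 600,
}

# Sum across platforms; this should match the stated "Total Coverage".
total_coverage = sum(platform_total_coverage.values())

# Assumption: MiSeq total = vials x libraries per vial x per-library coverage.
miseq_total = 8 * 2 * 175

print(total_coverage)
print(miseq_total)
```

The summed value agrees with the stated total coverage of 3600.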
Pipeline for Evaluating Prokaryotic References (PEPR)
• A reproducible, transparent, and reusable bioinformatic pipeline for characterizing prokaryotic genomic reference materials.
Characterization of Bacterial Genomic Reference Materials
N. D. Olson1 ([email protected]), S. A. Jackson1, N. J. Lin1, J. M. Zook1, M. L. Salit1,2
1 Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD
2 Department of Bioengineering, Stanford University, Stanford, CA
Salmonella enterica LT2 Results
In the public health sector, high-stakes decisions are being made using microbial genomic sequencing data (Tang and Gardy 2014). For example, whole genome sequencing was recently used as part of an investigation of the 2011 European Escherichia coli O104:H4 sprout-associated outbreak (Grad et al. 2012). As the stakes increase, so does the required level of confidence in the measurement.
To address this problem, NIST is developing:
1. A bioinformatic pipeline to characterize genomic materials,
2. Microbial genomic DNA Candidate Reference Materials.
Background
Strain Selection
• Public health relevance
• Range of %GC
Sequencing
Genome Assembly
Base Level Analysis
Genomic Contaminants
Acknowledgements
References
The authors would like to thank Jenny McDaniel, Lindsay Vang, and David Catoe for performing the measurements; Heike Sichtig, Marc Allard, Shashi Sharma, and Nangarajan Thirunavukkasu for guidance and assistance with strain selection and acquisition; and Tim Muruvanda for performing the PacBio sequencing.
This work was supported by the Department of Homeland Security (DHS) Science and Technology Directorate under the Interagency Agreement HSHQPM-14-X-00078 with NIST and by two inter-agency agreements with the FDA.
Four Microbial Genomic Candidate RMs
• Strains selected for public health relevance
• Range of %GC to best challenge sequencing technology
Key Concepts to RM Characterization
• Orthogonal measurement methods
• Conservative approach to analysis
• Reproducible and transparent workflow
Characterization Results for Candidate RMs
• Genome Assembly
  • Closed assembly using long read sequencing data
  • S. enterica: 1.6 Mb putative inversion
  • S. aureus: agreement between optical map and assembly
• Base Level Analysis
  • Low strain diversity
  • No indication of genomic heterogeneity
• Genomic Contaminants
  • Likely reagents or bioinformatic errors
Expected Use of Candidate RM and Data
• Stable and homogeneous material suitable for benchmarking sequencing platforms and chemistries.
• Data generated as part of material characterization can be used to evaluate bioinformatic pipelines and algorithms.
Expected Use of PEPR
• Characterize application-specific in-house materials as part of a routine quality control program.
NIST Candidate RM Development
Material Production
• Produced by local vendor
• For each strain
  • Pure culture
  • Homogenized DNA batch
  • ~1000 vials with 3 μg DNA
Strain | Biosample | Size | GC
Salmonella enterica LT2 | SAMN02854572 | C* 4.8 Mb | 52
 | | P* 94 kb | 53
Staphylococcus aureus | SAMN02854573 | C* 2.8 Mb | 33
 | | P* 25 kb | 29
Pseudomonas aeruginosa | SAMN02854574 | C* 6.3 Mb | 67
Clostridium sporogenes | SAMN02854575 | C* 4.1 Mb | 28
*C - chromosome, P - plasmid
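The GC column reports each replicon's percent G+C. A minimal sketch of how such a value could be computed from a sequence string is shown below. The function name `gc_percent` is illustrative and not part of PEPR. Ignoring ambiguous bases such as N is an assumption of this sketch.

```python
def gc_percent(sequence: str) -> float:
    """Return percent G+C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


# Toy sequence for illustration only (not from any candidate RM genome).
example = gc_percent("ATGCNN")
print(example)
```

In practice the input would be the closed assembly for each chromosome or plasmid, read from a FASTA file.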
De-Novo Assembly
Optical Map
Comparison of long read de-novo assembly to optical mapping data. Red lines and black arrows highlight the discrepancy between the optical mapping data and de-novo assembly, suggesting a 1.6 Mb inversion.
• Long-read de-novo assembly - closed genome
• Optical mapping - putative ~1.6 Mb inversion
Comparison of purity values by platform. Dots represent positions with purity values less than 0.98 for both platforms. Positions with purity values less than 0.97 for both platforms are indicated by orange dots.
• 11 of 4.8 M positions with purity < 0.97 for MiSeq and PGM
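The cross-platform filter behind this count can be sketched as follows. The poster describes purity only as a ratio of reference vs. alternative base calls, so the exact form used here, ref / (ref + alt), is an assumption. The function names and the toy counts are hypothetical and are not taken from PEPR or from the measured data.

```python
from typing import Dict, List, Tuple

# position -> (reference base calls, alternative base calls)
Counts = Dict[int, Tuple[int, int]]


def purity(ref_count: int, alt_count: int) -> float:
    """Assumed purity metric: fraction of base calls matching the reference."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no base calls at this position")
    return ref_count / total


def low_purity_positions(miseq: Counts, pgm: Counts,
                         threshold: float = 0.97) -> List[int]:
    """Positions whose purity is below the threshold on BOTH platforms."""
    shared = miseq.keys() & pgm.keys()
    return sorted(
        pos for pos in shared
        if purity(*miseq[pos]) < threshold and purity(*pgm[pos]) < threshold
    )


# Hypothetical counts for three positions.
miseq_counts = {10: (98, 2), 20: (90, 10), 30: (95, 5)}
pgm_counts = {10: (99, 1), 20: (92, 8), 30: (99, 1)}
flagged = low_purity_positions(miseq_counts, pgm_counts)
print(flagged)
```

Requiring low purity on both platforms reflects the orthogonal-measurement approach: a position flagged by only one platform is more likely to be a platform-specific systematic error than true strain diversity.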
Number of reads assigned to contaminants. Each point represents read counts for a single dataset.
• 99.995% minimum genomic purity
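The poster does not define how the minimum genomic purity is calculated. One plausible reading, sketched below as an assumption, is the fraction of reads not assigned to a contaminant, taking the minimum across datasets. The function name, dataset names, and read counts are all hypothetical.

```python
def genomic_purity_percent(contaminant_reads: int, total_reads: int) -> float:
    """Assumed definition: percent of reads NOT assigned to a contaminant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= contaminant_reads <= total_reads:
        raise ValueError("contaminant_reads must be between 0 and total_reads")
    return 100.0 * (1 - contaminant_reads / total_reads)


# Hypothetical datasets: name -> (contaminant reads, total reads).
datasets = {
    "miseq_vial1": (100, 1_000_000),
    "pgm_vial1": (10, 200_000),
}

# The reported figure is the minimum across datasets, i.e. the worst case.
min_purity = min(genomic_purity_percent(c, t) for c, t in datasets.values())
print(round(min_purity, 3))
```

Reporting the minimum rather than the mean is consistent with the conservative approach to analysis described among the key concepts.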
Bioinformatic characterization pipeline. Tan objects represent input measurement data, blue objects the four primary analyses, and green objects the pipeline output. Source code for the bioinformatic pipeline is at github.com/nate-d-olson/pepr; the R package for generating the database and report of analysis is at github.com/nate-d-olson/peprr. The report of analysis summarizes the characterization process results.
RM Characterization
• Grad, Y. et al. 2012. “Genomic Epidemiology of the Escherichia coli O104:H4 Outbreaks in Europe, 2011.” PNAS 109 (8): 3065–70. doi:10.1073/pnas.1121491109.
• Hong, C. et al. 2014. “PathoScope 2.0: A Complete Computational Framework for Strain Identification in Environmental or Clinical Sequencing Samples.” Microbiome 2 (1). doi:10.1186/2049-2618-2-33.
• Koboldt, D. et al. 2009. “VarScan: Variant Detection in Massively Parallel Sequencing of Individual and Pooled Samples.” Bioinformatics 25 (17): 2283–5. doi:10.1093/bioinformatics/btp373.
• Koren, S. et al. 2013. “Reducing Assembly Complexity of Microbial Genomes with Single-Molecule Sequencing.” Genome Biology 14 (9): R101. doi:10.1186/gb-2013-14-9-r101.
• Tang, P. et al. 2014. “Stopping Outbreaks with Real-Time Genomic Epidemiology.” Genome Medicine 6 (11): 104. doi:10.1186/s13073-014-0104-4.
• Walker, B. et al. 2014. “Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement.” PLoS ONE 9 (11). doi:10.1371/journal.pone.0112963.
Opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of the DHS, NIST, or affiliated venues. Certain commercial equipment, instruments, or materials are identified in this paper only to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the NIST, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.
Official contribution of NIST; not subject to copyrights in USA.
Disclaimer
Salmonella Typhimurium LT2 DNA for Whole Genome Variant Assessment Sample: MG-001 Store at -20°C
Genome Assembly
Base Level Analysis
Genomic Contaminants
Comparison of long read de-novo assembly to optical mapping data.
• Long-read de-novo assembly resulted in a closed genome
• Optical mapping - supports de-novo assembly
Comparison of purity values by platform. Dots represent positions with purity values less than 0.98 for both platforms. Positions with purity values less than 0.97 for both platforms are indicated by orange dots.
• 4 of 2.8 M positions with purity < 0.97 for MiSeq and PGM
Number of reads assigned to contaminants. Each point represents read counts for a single dataset.
• 99.997% minimum genomic purity
S. aureus Results
• Orthogonal methods used to characterize RM
[Figure: reads assigned to contaminant organisms, by platform (MiSeq, PGM). x-axis: Reads (0 to 200); y-axis: Organism. Labels shown: Achromobacter xylosoxidans, Methylobacterium populi, Gallibacterium anatis, Streptococcus oralis, Human papillomavirus, Streptococcus mitis, NanoLuc reporter, unidentified cloning, Bacillus cereus, Shuttle vector, Roseburia hominis, Streptococcus pneumoniae, Stenotrophomonas maltophilia, Pseudomonas mendocina, synthetic construct, Campylobacter coli, Gemella morbillorum, Enterococcus sp., Nanoluc luciferase, Enterococcus faecium, Escherichia coli, uncultured bacterium.]
De-novo Assembly
Optical Map
Conclusions