Building bioinformatics resources for the global community
James Pettengill [email protected]
Biostatistics and Bioinformatics Staff Office of Analytics and Outreach
FDA Center for Food Safety and Applied Nutrition
GMI9 May 24, 2016 Rome, Italy
CFSAN’s open-access, peer-reviewed methods for analyzing and differentiating among samples based on WGS data.
Submitted 16 April 2014
Accepted 23 September 2014
Published 14 October 2014
Corresponding author: Errol Strain, [email protected]
Academic editor: Keith Crandall
Additional Information and Declarations can be found on page 21
DOI 10.7717/peerj.620
Copyright 2014 Pettengill et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS
An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella
James B. Pettengill, Yan Luo, Steven Davis, Yi Chen, Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand, Marc W. Allard and Errol Strain
Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
ABSTRACT
Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
Subjects: Bioinformatics, Evolutionary Studies, Food Science and Technology, Microbiology, Public Health
Keywords: Salmonella, Outbreak, Congruence, Phylogenetics, Next generation sequencing, Single nucleotide polymorphism
How to cite this article: Pettengill et al. (2014), An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2:e620; DOI 10.7717/peerj.620
Real-time pathogen detection in the era of whole-genome sequencing and big data: K-mer and site-based methods for inferring the distances among tens of thousands of Salmonella samples
Premise/Background of the project
• The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real time.
• These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years.
• For these databases to be of optimal use, one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples.
• Being able to do so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues.
• Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and whole-genome multi-locus sequence typing (wgMLST))
• Empirical data: whole-genome sequence data from 18,997 Salmonella isolates
Nut butter outbreak: http://www.cdc.gov/salmonella/braenderup-08-14/index.html
[Figure: NCBI GenomeTrakr tree]
Experimental design: based on a classification scheme, determine how well each distance measure performs
[Figure: two panels, "Efficient method" and "Inefficient method"; x-axis: genetic distances, y-axis: # (count); the "Efficient method" panel labels intra-category and inter-category comparisons]
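The design above can be sketched in code. The following is a minimal illustration, not the analysis used in the talk: it assumes pairwise distances are held in a dictionary keyed by sample pairs and that each sample has one category label. The function names, sample names, and distance values are hypothetical. The score is the fraction of (intra, inter) pairs in which the inter-category distance is the larger one, which equals the area under a ROC curve for this two-class problem.

```python
def split_by_category(distances, category):
    """Split pairwise distances into intra- and inter-category lists.

    distances: dict mapping (sample_a, sample_b) -> distance
    category:  dict mapping sample -> category label (e.g. serovar)
    """
    intra, inter = [], []
    for (a, b), d in distances.items():
        if category[a] == category[b]:
            intra.append(d)
        else:
            inter.append(d)
    return intra, inter


def separation_auc(intra, inter):
    """Probability that an inter-category distance exceeds an
    intra-category distance (ties count one half)."""
    wins = 0.0
    for d_in in intra:
        for d_out in inter:
            if d_out > d_in:
                wins += 1.0
            elif d_out == d_in:
                wins += 0.5
    return wins / (len(intra) * len(inter))


# Hypothetical toy data: two categories with two samples each.
category = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
distances = {
    ("s1", "s2"): 0.01, ("s3", "s4"): 0.02,   # intra-category
    ("s1", "s3"): 0.30, ("s1", "s4"): 0.28,   # inter-category
    ("s2", "s3"): 0.31, ("s2", "s4"): 0.29,
}
intra, inter = split_by_category(distances, category)
print(separation_auc(intra, inter))
```

A score near 1 corresponds to the "efficient" case, where the two distributions do not overlap; a score near 0.5 means the distance measure cannot tell the categories apart.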
Experimental design:
Simulated data:
Experimental design:
Empirical data:
• Analyze different distance methods on de novo assemblies of all Salmonella samples in GenomeTrakr
• Use serovar as the classification scheme
[Figure: "Efficient method" panel; x-axis: genetic distances, y-axis: # (count); labels: intra-Enteritidis comparisons, inter-Enteritidis comparisons]
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
1. Obtain the latest metadata file from the NCBI pathogen database
2. Parse the metadata and download the raw data
3. Quality filter using the FASTX-Toolkit
4. Taxonomic/contamination filtering using Kraken with a custom database
5. Assembly using SPAdes
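The first two workflow steps, and the ordering of the remaining ones, might be sketched as follows. This is only an illustration under stated assumptions: the column names `run` and `organism`, the run identifiers, and the helper names are hypothetical and do not reflect the actual NCBI metadata layout. No external tool is invoked; the tools named on the slide appear only in comments.

```python
import csv
import io

# Stage names mirror the slide's workflow order.
STAGES = [
    "download_raw_data",      # after the metadata has been parsed
    "quality_filter",         # FASTX-Toolkit on the slide
    "contamination_filter",   # Kraken with a custom database on the slide
    "assemble",               # SPAdes on the slide
]


def select_runs(metadata_tsv, organism_prefix):
    """Return run identifiers whose organism starts with organism_prefix.

    Assumes a tab-separated table with hypothetical columns 'run' and
    'organism'; rows without a run identifier are skipped.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    runs = []
    for row in reader:
        if row["organism"].startswith(organism_prefix) and row["run"]:
            runs.append(row["run"])
    return runs


def plan_jobs(runs):
    """Create one independent job per run, each with the full stage list."""
    return [{"run": run, "pending": list(STAGES)} for run in runs]


# Hypothetical metadata with made-up identifiers.
example = (
    "run\torganism\n"
    "RUN_A\tSalmonella enterica\n"
    "RUN_B\tListeria monocytogenes\n"
    "\tSalmonella enterica\n"
)
jobs = plan_jobs(select_runs(example, "Salmonella"))
print(jobs)
```

Because each sample becomes an independent job, the work parallelizes naturally across however many nodes a cluster provides.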
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
1. Obtain an assembly for each sample within GenomeTrakr
• Use pilot of cloud computing to accomplish assemblies – “cloudbursting”
CycleCloud Test Summary – Use Cases 2 and 3
Joseph Baugher1, Jer-Ming Chia2, James Pettengill1
1 CFSAN, 2 Cycle Computing

Summary – We have successfully completed running Use Cases 2 and 3 on AWS servers via the CycleCloud platform. Even without time for extensive optimization of the clusters, we were able to complete the Use Cases rapidly and inexpensively.

Use Case 2 – Listeria Isolates
A workflow was designed to analyze sequencing data from all of the publicly available Listeria isolates (3645) collected by the GenomeTrakr network. This workflow involves downloading data from the NCBI servers, trimming the sequencing reads based on quality scores, filtering the reads based on quality and taxonomy, and assembling the reads into contiguous genome segments. The results of this workflow will allow us to improve our methods of identifying outbreak isolates.

1. Cluster Specs –
   Max cores: 4000
   Max parallel jobs: 1000
   Master node: i2.4xlarge
   Compute nodes: r3.2xlarge, r3.4xlarge
2. Results –
   Jobs: 3645
   Run time: 8 hours
   Job completion rate: 99.8%
   Approximate cost: $1800.00
3. Additional Notes –
   Local runtime: 3.5 days
   Feasible to run locally: YES
   Anticipated frequency: once/quarter
   Estimated yearly cost*: $9000.00
* Assuming the current growth rate of this dataset, we estimate 900 additional samples per quarter for the next year.
3,645 Listeria assemblies
Use Case 3 – Salmonella Isolates
Our revised Use Case 3 applies the workflow described above to all of the publicly available Salmonella isolates (25765) collected by the GenomeTrakr network. The analysis of this dataset is much more difficult due to a larger genome size and a much larger number of isolates and is not feasible on our current HPC resources.

1. Cluster Specs –
   Max cores: 12000
   Max parallel jobs: 3000
   Master node: i2.4xlarge
   Compute nodes: r3.2xlarge, r3.4xlarge, r3.8xlarge
2. Results –
   Jobs: 25765
   Run time: 20 hours
   Job completion rate: 99.1%
   Approximate cost: $8000.00
3. Additional Notes –
   Estimated local runtime: 23 days
   Feasible to run locally: NO
   Anticipated frequency: once/quarter
   Estimated yearly cost*: $56000.00
* Assuming the current growth rate of this dataset, we estimate 12,500 additional samples per quarter for the next year.

Conclusions –
Given additional time for cluster optimization, the runtime of the cloud-based analyses can be further improved (by increasing the number of cores and the speed of machine acquisition, etc.) and the cost further reduced (by including spot-pricing).

The job completion rate was tracked in order to identify conditions resulting in failure of individual samples. An analysis is ongoing to identify the cause of sample failures, but following optimization by Jer-Ming, the majority of failures at this point are attributable to the quality of the sample data.

The yearly cost estimates apply only for these specific analyses and do not include Use Case 1 (Outbreak analysis overflow) and research-oriented usage.

In comparison to our current HPC resources, this test of cloud-based analyses highlights the tremendous advantages of the application of scalable cluster resources to complex analyses of large numbers of samples. The ability to rapidly perform large analyses benefits us greatly in our investigation of pathogen outbreaks.
25,765 Salmonella assemblies
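The figures reported for the two use cases can be put side by side with a little arithmetic. The sketch below uses only numbers stated in the report. The yearly projection is an assumption on our part: it supposes that every quarterly run reprocesses all samples collected so far and that cost scales linearly with sample count. The report does not say how its own yearly estimates were derived, so the projection should be read as a plausibility check only. Class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CloudRun:
    name: str
    jobs: int                      # samples assembled in the test run
    cloud_hours: float             # reported cloud run time
    local_days: float              # reported or estimated local run time
    cost_usd: float                # reported approximate cost
    new_samples_per_quarter: int   # growth assumed in the report's footnote

    def speedup(self):
        """Local wall-clock time divided by cloud wall-clock time."""
        return self.local_days * 24 / self.cloud_hours

    def cost_per_job(self):
        return self.cost_usd / self.jobs

    def projected_yearly_cost(self, quarters=4):
        """ASSUMPTION: each quarter reprocesses all samples to date and
        cost is proportional to the number of samples."""
        total = 0.0
        for q in range(quarters):
            samples = self.jobs + q * self.new_samples_per_quarter
            total += samples * self.cost_per_job()
        return total


listeria = CloudRun("Listeria", 3645, 8, 3.5, 1800.0, 900)
salmonella = CloudRun("Salmonella", 25765, 20, 23, 8000.0, 12500)

for run in (listeria, salmonella):
    print(run.name,
          round(run.speedup(), 1),
          round(run.cost_per_job(), 2),
          round(run.projected_yearly_cost()))
```

Under these assumptions the projections land in the same range as the yearly estimates in the report, and the per-sample cost comes out at well under one dollar for both organisms.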
Site-based:
Sample1: ACCTAGTACC
Sample2: ACGTACTACC
Similarity = 0.8
Requires statements about homology/sequence alignment

Kmer-based (L = 9):
Sample1: ACCTAGTACC
  kmer1: ACCTAGTAC
  kmer2: CCTAGTACC
Sample2: ACGTACTACC
  kmer1: ACGTACTAC
  kmer2: CGTACTACC
Similarity = 0
Fast but loss/oversimplification of information
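The two similarity values on this slide can be reproduced with a few lines of Python. This is a minimal sketch: it assumes the two sequences are already aligned position by position (which is exactly the homology statement a site-based method requires) and it measures k-mer similarity as the Jaccard index of the two k-mer sets. The function names are ours.

```python
def site_similarity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_set(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k):
    """Jaccard index of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


sample1 = "ACCTAGTACC"
sample2 = "ACGTACTACC"

print(site_similarity(sample1, sample2))       # 8 of 10 sites match
print(kmer_similarity(sample1, sample2, k=9))  # no 9-mer is shared
```

Two substitutions are enough to destroy every shared 9-mer in a 10-base sequence, which is the loss of information the slide refers to.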
Experimental design:
Distance measures
Summary of methods used to infer the relationships among samples.

Class | Method | Description | Exec. time (s)
Site-based | NUCmer§ | Pairwise genome alignment using suffix arrays | 11.9
Site-based | wgMLST¶ | Gene-based approach | 46.95
K-mer based | Jaccard Index§ | The intersection divided by the union of all K-mers found between two samples | 9.4
K-mer based | Manhattan Distance§ | Sum of the absolute differences between the abundance of each K-mer present between two samples | 45.1
K-mer based | Euclidean Distance§ | The square root of the sum of squares of all pairwise differences in K-mer abundance | 44.2
K-mer based | Mash Distance | Uses the MinHash technique (Broder 1998) to reduce genomes to sketches and estimates a novel evolutionary distance metric among them | 1.2
K-mer based | Mash Jaccard Distance | The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) | 1.2

§ Performed using de novo assemblies and requires k-mer indexing, which with jellyfish takes 7.4 s (0.8) per sample (2.1 days for 25,000 samples)
¶ Requires a reference genome
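The k-mer based measures in the table can be written down compactly. The sketch below counts k-mers in plain Python rather than with jellyfish, so it illustrates the definitions and not the pipeline that was timed. The Mash distance function applies the published Mash formula, D = -(1/k) ln(2j / (1 + j)), to a Jaccard value j. In Mash itself, j is estimated from MinHash sketches, a step omitted here. The function names and the toy sequences are ours.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Abundance of every k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_index(ca, cb):
    """Shared k-mers divided by all k-mers seen in either sample."""
    a, b = set(ca), set(cb)
    return len(a & b) / len(a | b)


def manhattan_distance(ca, cb):
    """Sum of absolute differences in k-mer abundance."""
    return sum(abs(ca[m] - cb[m]) for m in set(ca) | set(cb))


def euclidean_distance(ca, cb):
    """Square root of the sum of squared differences in k-mer abundance."""
    return math.sqrt(sum((ca[m] - cb[m]) ** 2 for m in set(ca) | set(cb)))


def mash_distance(j, k):
    """Mash distance for a Jaccard value j and k-mer size k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


# Toy sequences, chosen only to keep the arithmetic easy to follow.
a = kmer_counts("ACGTACGT", 3)
b = kmer_counts("ACGTTCGT", 3)

print(jaccard_index(a, b))
print(manhattan_distance(a, b))
print(euclidean_distance(a, b))
print(mash_distance(jaccard_index(a, b), 3))
```

A k-mer missing from one sample contributes its full abundance to the Manhattan and Euclidean distances, so absent data is treated as informative by these two measures.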
Classification of simulated data: ROC curves were identical across the different distance methods
* The simulated data are not complex/noisy enough to distinguish the methods
Summary/Implications:
• There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples.
  • Treating absent data as informative may be problematic
• Site-based methods, like NUCmer and wgMLST, tended to be superior in performance
• Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases.
• If working with k-mer distances, err on the side of false positives
  • And have high quality assemblies
Acknowledgements
FDA
• Center for Food Safety and Applied Nutrition
  • Biostats/Bioinformatics staff – J. Baugher, H. Rand, J. Miller, Y. Luo, S. Davis, E. Strain
• Center for Veterinary Medicine
• Office of Regulatory Affairs
National Institutes of Health
• National Center for Biotechnology Information
State Health and University Labs
• Alaska • Arizona • California • Florida • Hawaii • Maryland • Minnesota • New Mexico • New York • South Dakota • Texas • Virginia • Washington
USDA/FSIS
• Eastern Laboratory
CDC
• Enteric Diseases Laboratory
• INEI-ANLIS “Carlos Malbrán Institute,” Argentina
• Centre for Food Safety, University College Dublin, Ireland
• Food Environmental Research Agency, UK
• Public Health England, UK
• WHO
• Illumina
• Pac Bio
• CLC Bio
• Other independent collaborators
Validation exercise key findings:
• False negatives are primarily due to failure to meet consensus frequency threshold
[Figure: number of false negatives (axis from 0 to 5000) for the 20× and 100× coverage datasets, split by cause: ConsensusFrequency<0.9 and Coverage<8]
• False negatives are not random across the genome
Validation exercise of CFSAN SNP Pipeline key findings:
• 100× dataset
  • Recovered 98.9% of the introduced SNPs
  • False positive rate of 1.04 × 10^-6
• 20× dataset
  • Recovered 98.8% of SNPs
  • False positive rate of 8.34 × 10^-7
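The quantities reported in the validation slides can be expressed as simple ratios. The sketch below is illustrative only and is not the CFSAN SNP Pipeline code. The two thresholds are taken from the figure labels (Coverage<8 and ConsensusFrequency<0.9). Treating them as independent pass/fail filters, and defining the false positive rate as false positives per non-variant site, are assumptions. All counts in the example are hypothetical.

```python
MIN_COVERAGE = 8             # from the figure label "Coverage<8"
MIN_CONSENSUS_FREQ = 0.9     # from the figure label "ConsensusFrequency<0.9"


def failed_filters(coverage, consensus_freq):
    """List the filters that a candidate SNP site fails (assumed logic)."""
    reasons = []
    if coverage < MIN_COVERAGE:
        reasons.append("Coverage<8")
    if consensus_freq < MIN_CONSENSUS_FREQ:
        reasons.append("ConsensusFrequency<0.9")
    return reasons


def recovery_rate(recovered, introduced):
    """Fraction of introduced SNPs that were called."""
    return recovered / introduced


def false_positive_rate(false_positives, non_variant_sites):
    """False positive calls per non-variant site (assumed definition)."""
    return false_positives / non_variant_sites


# Hypothetical counts, chosen only to show the scale of the ratios.
print(recovery_rate(989, 1000))
print(false_positive_rate(5, 5_000_000))
print(failed_filters(coverage=30, consensus_freq=0.85))
```

A site with deep coverage can still be missed if the reads disagree, which is consistent with the slide's observation that most false negatives come from the consensus frequency threshold.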