Building bioinformatics resources for the global community


James Pettengill

Biostatistics and Bioinformatics Staff, Office of Analytics and Outreach

FDA Center for Food Safety and Applied Nutrition

GMI9, May 24, 2016, Rome, Italy

CFSAN’s open-access, peer-reviewed methods for analyzing and differentiating among samples based on WGS data.

Submitted 16 April 2014
Accepted 23 September 2014
Published 14 October 2014

Corresponding author: Errol Strain

Academic editor: Keith Crandall

Additional Information and Declarations can be found on page 21

DOI 10.7717/peerj.620

Copyright 2014 Pettengill et al.

Distributed under Creative Commons CC-BY 4.0

OPEN ACCESS

An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella

James B. Pettengill, Yan Luo, Steven Davis, Yi Chen, Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand, Marc W. Allard and Errol Strain

Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA

ABSTRACT

Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.

Subjects: Bioinformatics, Evolutionary Studies, Food Science and Technology, Microbiology, Public Health

Keywords: Salmonella, Outbreak, Congruence, Phylogenetics, Next generation sequencing, Single nucleotide polymorphism

How to cite this article: Pettengill et al. (2014), An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2:e620; DOI 10.7717/peerj.620

Real-time pathogen detection in the era of whole-genome sequencing and big data: K-mer and site-based methods for inferring the distances among tens of thousands of Salmonella samples


Premise/Background of the project

•  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real time.

•  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years.

•  For these databases to be of optimal use, one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples.

•  Being able to do so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues.






•  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and whole-genome multi-locus sequence typing (wgMLST))







•  Empirical data: whole-genome sequence data from 18,997 Salmonella isolates


Nut butter outbreak: http://www.cdc.gov/salmonella/braenderup-08-14/index.html

NCBI GenomeTrakr Tree

Experimental design: based on a classification scheme, determine how well each distance measure performs

[Figure: distributions of genetic distances (x-axis: genetic distance; y-axis: #) for intra-category and inter-category comparisons, shown for an efficient method and an inefficient method.]

Experimental design:

Simulated data:

Empirical data:
•  Analyze different distance methods on de novo assemblies of all Salmonella samples in GenomeTrakr
•  Use serovar as the classification scheme

[Figure: efficient method; distributions of genetic distances (x-axis: genetic distance; y-axis: #) for intra-enteritidis and inter-enteritidis comparisons.]


Empirical data: using cloud computing to perform assemblies on GenomeTrakr data

Assembly workflow:

1.  Obtain latest metadata file from NCBI pathogen database
2.  Parse metadata and download raw data
3.  Quality filter using fastx toolkit
4.  Taxonomic/contamination filtering using Kraken with custom db
5.  Assembly using SPAdes
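The "parse metadata" step of the workflow could look roughly like the sketch below. It is only an illustration: the column names (`Run`, `scientific_name`, `serovar`), the function name `select_runs`, and the example rows are assumptions made for this sketch, not the actual layout of the NCBI metadata file or the code used for the talk.

```python
import csv
import io


def select_runs(metadata_tsv: str, genus: str = "Salmonella"):
    """Return (run accession, serovar) pairs for rows of the given genus.

    The column names 'Run', 'scientific_name' and 'serovar' are assumed
    for illustration; adjust them to the real metadata header.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    selected = []
    for row in reader:
        run = (row.get("Run") or "").strip()
        name = (row.get("scientific_name") or "").strip()
        # Skip rows without raw data to download or from another genus.
        if run and name.startswith(genus):
            selected.append((run, (row.get("serovar") or "").strip()))
    return selected


# Made-up rows; the accessions are placeholders, not real records.
EXAMPLE = (
    "Run\tscientific_name\tserovar\n"
    "RUN0001\tSalmonella enterica\tEnteritidis\n"
    "\tSalmonella enterica\tMontevideo\n"
    "RUN0002\tListeria monocytogenes\t\n"
    "RUN0003\tSalmonella enterica\tBraenderup\n"
)

if __name__ == "__main__":
    for run, serovar in select_runs(EXAMPLE):
        print(run, serovar)
```

Keeping the serovar next to each run accession at this stage is convenient because serovar is later used as the classification scheme.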



1.  Obtain an assembly for each sample within GenomeTrakr

•  Use pilot of cloud computing to accomplish assemblies – “cloudbursting”

CycleCloud Test Summary – Use Cases 2 and 3

Joseph Baugher1, Jer-Ming Chia2, James Pettengill1
1 CFSAN, 2 Cycle Computing

Summary – We have successfully completed running Use Cases 2 and 3 on AWS servers via the CycleCloud platform. Even without time for extensive optimization of the clusters, we were able to complete the Use Cases rapidly and inexpensively.

Use Case 2 – Listeria Isolates

A workflow was designed to analyze sequencing data from all of the publicly available Listeria isolates (3645) collected by the GenomeTrakr network. This workflow involves downloading data from the NCBI servers, trimming the sequencing reads based on quality scores, filtering the reads based on quality and taxonomy, and assembling the reads into contiguous genome segments. The results of this workflow will allow us to improve our methods of identifying outbreak isolates.

1. Cluster Specs –
   Max cores: 4000
   Max parallel jobs: 1000
   Master node: i2.4xlarge
   Compute nodes: r3.2xlarge, r3.4xlarge

2. Results –
   Jobs: 3645
   Run time: 8 hours
   Job completion rate: 99.8%
   Approximate cost: $1800.00

3. Additional Notes –
   Local runtime: 3.5 days
   Feasible to run locally: YES
   Anticipated frequency: once/quarter
   Estimated yearly cost*: $9000.00

* Assuming the current growth rate of this dataset, we estimate 900 additional samples per quarter for the next year.

3,645 Listeria assemblies

Use Case 3 – Salmonella Isolates

Our revised Use Case 3 applies the workflow described above to all of the publicly available Salmonella isolates (25765) collected by the GenomeTrakr network. The analysis of this dataset is much more difficult due to a larger genome size and a much larger number of isolates and is not feasible on our current HPC resources.

1. Cluster Specs –
   Max cores: 12000
   Max parallel jobs: 3000
   Master node: i2.4xlarge
   Compute nodes: r3.2xlarge, r3.4xlarge, r3.8xlarge

2. Results –
   Jobs: 25765
   Run time: 20 hours
   Job completion rate: 99.1%
   Approximate cost: $8000.00

3. Additional Notes –
   Estimated local runtime: 23 days
   Feasible to run locally: NO
   Anticipated frequency: once/quarter
   Estimated yearly cost*: $56000.00

* Assuming the current growth rate of this dataset, we estimate 12,500 additional samples per quarter for the next year.

Conclusions –

Given additional time for cluster optimization, the runtime of the cloud-based analyses can be further improved (by increasing the number of cores and the speed of machine acquisition, etc.) and the cost further reduced (by including spot-pricing).

The job completion rate was tracked in order to identify conditions resulting in failure of individual samples. An analysis is ongoing to identify the cause of sample failures, but following optimization by Jer-Ming, the majority of failures at this point are attributable to the quality of the sample data.

The yearly cost estimates apply only for these specific analyses and do not include Use Case 1 (Outbreak analysis overflow) and research-oriented usage.

In comparison to our current HPC resources, this test of cloud-based analyses highlights the tremendous advantages of the application of scalable cluster resources to complex analyses of large numbers of samples. The ability to rapidly perform large analyses benefits us greatly in our investigation of pathogen outbreaks.

25,765 Salmonella assemblies
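As a rough aid for reading the two use cases side by side, the sketch below derives the cloud-versus-local speedup and the cost per job from the figures reported in the test summary. The class and attribute names are made up for this example, the Salmonella local runtime is the estimated value from the summary, and the yearly cost estimates are not re-derived here because the summary does not state how they were calculated.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    jobs: int            # number of isolates assembled
    cloud_hours: float   # reported cloud run time
    local_days: float    # reported (or estimated) local run time
    cost_usd: float      # reported approximate cost of the cloud run

    @property
    def speedup(self) -> float:
        """Local wall-clock time divided by cloud wall-clock time."""
        return (self.local_days * 24.0) / self.cloud_hours

    @property
    def cost_per_job(self) -> float:
        """Approximate cloud cost per isolate, in US dollars."""
        return self.cost_usd / self.jobs


USE_CASES = [
    UseCase("Listeria", jobs=3645, cloud_hours=8.0, local_days=3.5, cost_usd=1800.0),
    UseCase("Salmonella", jobs=25765, cloud_hours=20.0, local_days=23.0, cost_usd=8000.0),
]

if __name__ == "__main__":
    for uc in USE_CASES:
        print(f"{uc.name}: {uc.speedup:.1f}x faster, ${uc.cost_per_job:.2f} per job")
```

These are wall-clock comparisons only; they do not account for the difference in core counts between the cloud clusters and the local HPC resources.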

Site-based:
Sample1: ACCTAGTACC
Sample2: ACGTACTACC
Similarity = 0.8
Requires statements about homology/sequence alignment

Kmer-based (L = 9):
Sample1: ACCTAGTACC
kmer1: ACCTAGTAC kmer2: CCTAGTACC
Sample2: ACGTACTACC
kmer1: ACGTACTAC kmer2: CGTACTACC
Similarity = 0
Fast but loss/oversimplification of information
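The two similarity values for the example sequences can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the site-based comparison assumes the two sequences are already aligned position by position, and the k-mer comparison uses the Jaccard index on the sets of k-mers.

```python
def site_similarity(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


SAMPLE1 = "ACCTAGTACC"
SAMPLE2 = "ACGTACTACC"

if __name__ == "__main__":
    print(site_similarity(SAMPLE1, SAMPLE2))        # 0.8
    print(jaccard_similarity(SAMPLE1, SAMPLE2, 9))  # 0.0
```

Two substitutions in ten sites leave 80% of sites identical, yet every 9-mer overlaps at least one substitution, so no k-mer is shared. This is the loss of information the slide refers to.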


Distance measures

Summary of methods used to infer the relationships among samples.

Class | Method | Description | Exec. time (s)
Site-based | NUCmer§ | Pairwise genome alignment using suffix arrays | 11.9
Site-based | wgMLST¶ | Gene-based approach | 46.95
K-mer based | Jaccard Index§ | The intersection divided by the union of all K-mers found between two samples | 9.4
K-mer based | Manhattan Distance§ | Sum of the absolute differences between the abundance of each K-mer present between two samples | 45.1
K-mer based | Euclidean Distance§ | The square root of the sum of squares of all pairwise differences in K-mer abundance | 44.2
K-mer based | Mash Distance | Uses the MinHash technique (Broder 1998) to reduce genomes to sketches and estimates a novel evolutionary distance metric among them | 1.2
K-mer based | Mash Jaccard Distance | The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) | 1.2

§ Performed using de novo assemblies and requires k-mer indexing, which with jellyfish takes 7.4s (0.8) per sample (2.1 days for 25,000 samples)
¶ Requires a reference genome
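To make the k-mer based descriptions in the table concrete, the sketch below computes the Manhattan and Euclidean distances from k-mer abundance profiles and converts a Jaccard index into a Mash-style distance. It is a toy illustration, not the code timed in the table: the function names are made up, the profiles are built in memory rather than with jellyfish, and the conversion uses the commonly cited Mash formula D = -(1/k) ln(2j / (1 + j)), which is taken here as an assumption about how Mash derives its distance.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Abundance of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(a: Counter, b: Counter) -> int:
    """Sum of absolute differences in k-mer abundance."""
    # Counter returns 0 for k-mers that are absent from a profile.
    return sum(abs(a[key] - b[key]) for key in set(a) | set(b))


def euclidean(a: Counter, b: Counter) -> float:
    """Square root of the sum of squared differences in k-mer abundance."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in set(a) | set(b)))


def mash_distance(jaccard: float, k: int) -> float:
    """Mash-style distance from a Jaccard index j and k-mer size k (assumed formula)."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    return min(1.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


if __name__ == "__main__":
    p1 = kmer_counts("ACCTAGTACC", 9)
    p2 = kmer_counts("ACGTACTACC", 9)
    print(manhattan(p1, p2), euclidean(p1, p2))
    print(mash_distance(0.5, 21))
```

In the abundance-based distances a k-mer missing from one sample contributes its full count, which is one way that absent data end up being treated as informative.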

Classification of simulated data: ROC curves were identical across the different distance methods
•  Simulated data are not complex/noisy enough
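One way to score how well a distance measure separates intra-category from inter-category comparisons is the area under the ROC curve (AUC). The sketch below computes it as the probability that a randomly chosen intra-category distance is smaller than a randomly chosen inter-category distance (ties count one half), which is equivalent to the AUC. The function names, sample identifiers, and distances are invented for illustration and are not results from the study.

```python
from itertools import combinations


def label_pairs(distances, categories):
    """Split pairwise distances into intra- and inter-category lists.

    distances: dict mapping frozenset({sample_a, sample_b}) -> distance
    categories: dict mapping sample -> category (e.g., serovar)
    """
    intra, inter = [], []
    for a, b in combinations(sorted(categories), 2):
        d = distances[frozenset((a, b))]
        (intra if categories[a] == categories[b] else inter).append(d)
    return intra, inter


def auc(intra, inter):
    """P(intra distance < inter distance), counting ties as one half."""
    if not intra or not inter:
        raise ValueError("need at least one intra and one inter comparison")
    wins = 0.0
    for x in intra:
        for y in inter:
            if x < y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(intra) * len(inter))


# Invented toy data: two samples per serovar.
CATEGORIES = {"s1": "Enteritidis", "s2": "Enteritidis",
              "s3": "Montevideo", "s4": "Montevideo"}
DISTANCES = {
    frozenset(("s1", "s2")): 0.01, frozenset(("s3", "s4")): 0.02,
    frozenset(("s1", "s3")): 0.30, frozenset(("s1", "s4")): 0.25,
    frozenset(("s2", "s3")): 0.35, frozenset(("s2", "s4")): 0.015,
}

if __name__ == "__main__":
    intra, inter = label_pairs(DISTANCES, CATEGORIES)
    print(auc(intra, inter))
```

An AUC of 1.0 means every intra-category distance is smaller than every inter-category distance; identical ROC curves across methods, as seen with the simulated data, would give identical AUC values.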

Summary/Implications:

•  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples.
•  Treating absent data as informative may be problematic.
•  Site-based methods, like NUCmer and MLST, tended to be superior in performance.
•  Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases.
•  If working with k-mer distances, err on the side of false positives and use high-quality assemblies.

Acknowledgements

FDA
•  Center for Food Safety and Applied Nutrition
•  Biostats/Bioinformatics staff – J. Baugher, H. Rand, J. Miller, Y. Luo, S. Davis, E. Strain
•  Center for Veterinary Medicine
•  Office of Regulatory Affairs

National Institutes of Health
•  National Center for Biotechnology Information

State Health and University Labs
•  Alaska •  Arizona •  California •  Florida •  Hawaii •  Maryland •  Minnesota •  New Mexico •  New York •  South Dakota •  Texas •  Virginia •  Washington

USDA/FSIS
•  Eastern Laboratory

CDC
•  Enteric Diseases Laboratory

•  INEI-ANLIS “Carlos Malbrán Institute,” Argentina

•  Centre for Food Safety, University College Dublin, Ireland

•  Food Environmental Research Agency, UK

•  Public Health England, UK

•  WHO

•  Illumina

•  Pac Bio

•  CLC Bio

•  Other independent collaborators

Validation exercise key findings:

•  False negatives are primarily due to failure to meet consensus frequency threshold

[Figure: number of false negatives (x-axis, 0 to 5000) by filter (ConsensusFrequency<0.9, Coverage<8) for the 20× and 100× coverage datasets.]

•  False negatives are not random across the genome

Validation exercise of CFSAN SNP Pipeline key findings:

•  100× dataset
   •  Recovered 98.9% of the introduced SNPs
   •  False positive rate of 1.04 × 10^-6

•  20× dataset
   •  Recovered 98.8% of SNPs
   •  False positive rate of 8.34 × 10^-7
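The recovery percentage and false positive rate reported above can be obtained from a comparison of introduced and called SNP positions, roughly as in the sketch below. The function name and the toy numbers are invented, and the false positive rate is assumed here to be the number of false positive calls divided by the number of genome positions without an introduced SNP; the slides do not state the exact definition that was used.

```python
def recovery_and_fpr(introduced, called, genome_length):
    """Compare called SNP positions against the positions that were introduced.

    introduced, called: iterables of genome positions
    genome_length: total number of positions in the reference
    """
    introduced, called = set(introduced), set(called)
    if not introduced:
        raise ValueError("at least one introduced SNP is required")
    true_pos = len(introduced & called)
    false_neg = len(introduced - called)
    false_pos = len(called - introduced)
    # Assumed denominator: sites where no SNP was introduced.
    non_snp_sites = genome_length - len(introduced)
    return {
        "recovered": true_pos / len(introduced),
        "false_negatives": false_neg,
        "false_positive_rate": false_pos / non_snp_sites,
    }


if __name__ == "__main__":
    # Invented toy example: 4 introduced SNPs in a 1,004 bp genome.
    stats = recovery_and_fpr({10, 20, 30, 40}, {10, 20, 30, 55}, 1004)
    print(stats)
```

Because false negatives are not random across the genome, it can also be useful to keep the set `introduced - called` itself, not only its size, and inspect where those positions fall.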

Education
