

Phylogenomic Pipeline Validation for Foodborne Pathogen Disease Surveillance

Ruth E. Timme,a Errol Strain,a Joseph D. Baugher,a Steven Davis,a Narjol Gonzalez-Escalona,a Maria Sanchez Leon,a Marc W. Allard,a Eric W. Brown,a Sandra Tallent,a Hugh Rand a

a Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, Maryland, USA

ABSTRACT Foodborne pathogen surveillance in the United States is transitioning from strain identification using restriction digest technology (pulsed-field gel electrophoresis [PFGE]) to shotgun sequencing of the entire genome (whole-genome sequencing [WGS]). WGS requires a new suite of analysis tools, some of which have long histories in academia but are new to the field of public health and regulatory decision making. Although the general workflow is fairly standard for collecting and analyzing WGS data for disease surveillance, there are a number of differences in how the data are collected and analyzed across public health agencies, both nationally and internationally. This impedes collaborative public health efforts, so national and international efforts are underway to enable direct comparison of these different analysis methods. Ultimately, the harmonization efforts will allow the (mutually trusted and understood) production and analysis of WGS data by labs and agencies worldwide, thus improving outbreak response capabilities globally. This review provides a historical perspective on the use of WGS for pathogen tracking and summarizes the efforts underway to ensure the major steps in phylogenomic pipelines used for pathogen disease surveillance can be readily validated. The tools for doing this will ensure that the results produced are sound, reproducible, and comparable across different analytic approaches.

KEYWORDS Listeria, Salmonella, validation, WGS, bioinformatic pipeline, foodborne pathogen, outbreak, phylogeny

PHYLOGENIES AND THE HISTORY OF MOLECULAR EPIDEMIOLOGY

Phylogenetics is the study of the evolutionary relationships among individuals or groups of organisms, as inferred through characters drawn from heritable traits. Molecular phylogenetics often uses the four DNA bases as characters to infer a phylogenetic tree, or phylogeny. Collecting the same, or homologous, DNA characters across a set of individuals became more accessible in the 1980s when DNA sequencing technology became commonly used in academic labs. Applications of molecular phylogenies usually focus on species-level relationships and deeper “Tree of Life” questions. However, the clinical potential of phylogenetics was realized in 1992 when a molecular phylogeny was used to trace the source of a localized HIV outbreak back to a dentist’s office (1). This was the first time a phylogeny was used to identify the source of an outbreak. In 1995 a more formal manuscript described the application of phylogenetics to disease tracking (2): random mutations accumulate in the genomes of pathogens as they replicate within and between hosts, leaving molecular signatures that track the history of transmission events. At any point in time, a snapshot of pathogen DNA gathered from infected individuals can be analyzed to reconstruct the history of those transmission events. This evolutionary history, or phylogeny, can provide information about the origin of disease outbreaks, including whether new strains are entering the population, and can help construct a contact network between infected individuals, such as the HIV-infected dentist in the previous example.

Citation Timme RE, Strain E, Baugher JD, Davis S, Gonzalez-Escalona N, Sanchez Leon M, Allard MW, Brown EW, Tallent S, Rand H. 2019. Phylogenomic pipeline validation for foodborne pathogen disease surveillance. J Clin Microbiol 57:e01816-18. https://doi.org/10.1128/JCM.01816-18.

Editor Colleen Suzanne Kraft, Emory University

This is a work of the U.S. Government and is not subject to copyright protection in the United States. Foreign copyrights may apply.

Address correspondence to Ruth E. Timme, [email protected].

Accepted manuscript posted online 6 February 2019
Published 26 April 2019

MINIREVIEW

Journal of Clinical Microbiology, May 2019, Volume 57, Issue 5, e01816-18, jcm.asm.org


Coordinated disease surveillance for pathogens in the United States started in the late 1980s (3) to foster communication between hospitals, local public health labs, and federal labs, each of which had been tracking diseases only within its own jurisdiction. Several surveillance networks were established for various diseases: PulseNet for foodborne pathogens (4), FluNet for influenza (5), and the HIV surveillance system for HIV/AIDS (6). The initial application of molecular phylogenetics to pathogen transmission focused on viruses, which have small and rapidly evolving genomes. Sequencing short segments of those genomes was feasible using Sanger technology (7, 8) and captured enough genetic variation for phylogenetic discrimination. These projects typically addressed a specific question or hypothesis, and the results were disseminated through scientific publications. These early academic studies revealed the potential of using phylogenetics for real-time surveillance, whereby collected pathogens could be analyzed immediately, allowing public health organizations to respond more rapidly to the threat of disease outbreaks.

By the mid-2000s, the infrastructure needed to turn this potential into actionable reality was beginning to take shape. Several projects were initiated using genomic information for epidemiological applications, including surveillance. Academic researchers and public health officials began collaborating on molecular epidemiology projects: Ghedin and colleagues at the National Institutes of Health (NIH) refined influenza tracking (9), Gifford and colleagues at the UK Health Protection Agency began tracking HIV (10), and a team from the University of Cambridge collaborated with nearby hospitals to track bacterial infections (11). These efforts established important collaborations between academic and applied science, laying the groundwork for expansion that dovetailed with improvements in sequencing technology.

Huge improvements in sequencing technology, termed “next-generation sequencing” (NGS), transformed the emerging field of molecular epidemiology in the late 2000s. NGS allowed researchers to quickly and cheaply sequence the entire genomes of pathogens (bacteria, parasites, and some fungi) that had larger genomes (ca. 2 to 20 megabases). NGS also provided increased resolution over previous technologies, like single-locus Sanger approaches or pulsed-field gel electrophoresis (PFGE), by uncovering intricate clonal relationships of strains previously assigned to the same strain group, or subtype. This technology improvement made it possible for public health scientists to conceive of implementing whole-genome sequencing (WGS) for real-time disease surveillance. During the transition from previous technologies like Sanger and PFGE to newer sequencing technologies, several different sequence-based typing approaches were considered alongside WGS, such as narrower targeted sequencing efforts like 7-gene multilocus sequence typing (MLST) and others designed to capture specific virulence factors and/or antibiotic resistance genes (12). The increase in sequence data also prompted a reexamination of the current phylogenetic methods being used for analyzing epidemiological data sets, including effects of outgroup choices, identification of disease origins, and dating of common ancestors (13). By 2016, genomic surveillance databases had been established for HIV (14, 15), influenza (16–18), and bacterial foodborne pathogens (19–21). All three projects have publicly available genomic databases; however, only the genomes of foodborne pathogens collected through the GenomeTrakr and PulseNet laboratory networks are made available in real time at the National Center for Biotechnology Information (NCBI), with analyses publicly available through NCBI’s Pathogen Detection portal (21). Although we are at the early stages, these genomic database efforts reveal the power of utilizing comparative genomics for disease surveillance in public health.

DNA sequencing and phylogenetic analyses are mature technical and scientific methods, each with extensive scientific literature supporting their utility. However, their application to public health carries an extra burden of rigorous validation in the laboratories where the data are generated, analyzed, and interpreted. This requires that the data and analyses are accurate and reproducible under a strict set of parameters. The remainder of this review will discuss validation efforts for the use of NGS in the public health arena, with a focus on tools developed for the surveillance of bacterial foodborne pathogens.

APPLICATIONS OF NGS TO PUBLIC HEALTH

Current NGS technologies for public health include WGS, whole-exome sequencing, transcriptome sequencing, organelle and/or plasmid sequencing, targeted gene or locus sequencing, resistome profiling, and metagenomics. Within public health, the two groups which have been the most prominent adopters of NGS are researchers exploring the human genome and researchers performing disease surveillance. Each group is working to validate their data collection efforts and accompanying analyses.

Application to human and animal medicine. Hospital labs, federal agencies such as the NIH, and associated principal investigators led much of the early NGS validation effort. The application in these labs is mostly targeted sequencing of the human genome, or of associated tumor genomes. Enormous amounts of data are generated; ideally, those results are used for diagnosis, prognosis, and providing guidance for treating or managing disease. However, although this wealth of data provides the opportunity for detailed analyses of patterns and correlations, all of which could make treatment more precise and effective, incorrect conclusions can have calamitous consequences for patients. Therefore, early NGS validation in these labs was critical and resulted in strict guidelines and regulations for NGS clinical use in each setting. The American College of Medical Genetics (ACMG) published NGS laboratory standards for data collection (22) and for variant analysis (23). Leading research hospitals, such as Mt. Sinai, validated NGS for clinical applications (24). The National Cancer Institute published a validation approach for an NGS assay used in a precision medicine clinical trial (25). The Centers for Disease Control and Prevention (CDC) followed up with similar guidelines for public health laboratories (26). Significant validation efforts have also been contributed outside the United States, namely, in Australia (27), the United Kingdom (28), and the Netherlands (29). The analysis approaches of clinical WGS data can be extremely diverse, depending on the focus of the labs (cancer, rare genetic diseases, infant genetic screening, etc.), making the validation methods equally complex. Roy et al. (30) detail the analysis nuances in these approaches and propose 17 general recommendations for validating bioinformatic pipelines in the clinical environment, most of which are also applicable in nonclinical labs. The major conclusion of these efforts can be summarized as follows: when strict wet-lab standards are in place, the collection of NGS data (NGS itself) is accurate and reproducible. However, the downstream analyses are varied enough that each must be carefully validated for its respective application.

Application to molecular pathogen disease surveillance. NGS in the regulatory environment. NGS data for foodborne pathogen surveillance in the United States are collected through two tightly coordinated networks. PulseNet (20) performs WGS on clinical isolates, and GenomeTrakr (19) performs WGS on food and environmental isolates; the WGS data are shared publicly in real time at the NCBI’s Pathogen Detection website (21). These networks, along with non-U.S. submitters such as Public Health England, had amassed over 300,000 foodborne pathogen genomes as of February 2019, a very rich resource for pathogen surveillance. Genomics for Food Safety (Gen-FS) is a working group in the United States (CDC, 2015), with representatives from the CDC, FDA, USDA, and NCBI. The Gen-FS working groups have worked to standardize quality assurance (QA) measures and accompanying quality control (QC) checks across GenomeTrakr and PulseNet to ensure all WGS data in the Pathogen Detection database meet the Gen-FS minimum quality standards. At the international level, coordination is handled through the Global Microbial Identifier (GMI), which meets annually to address similar issues on the global scale.

In the regulatory environment there is extra consideration when scientific data are used to make regulatory decisions (e.g., recalls for contaminated foods, injunctions, etc.) since the data and accompanying analyses need to stand up to scrutiny in court. Recent papers highlight this importance; a report produced by the U.S. President’s Council of Advisors on Science and Technology (PCAST) outlines the steps necessary to ensure the validity of forensic evidence, including DNA evidence, used in the U.S. legal system (PCAST, 2016). Scientists from the food industry (31) and from local public health labs (32) are both taking this issue very seriously, with recent publications outlining end-to-end WGS validation pipelines developed within their respective labs.

Applications of NGS for pathogen surveillance have been able to build on human clinical validation efforts, focusing instead on areas specific to pathogen sequencing and extending these efforts to include phylogenetic reconstruction of isolates from disease outbreaks and foodborne contamination events. Molecular pathogen disease surveillance starts with understanding the genome biology of the pathogen: its ploidy, rate of sequence mutation, prevalence of horizontal gene transfer (plasmids, phages, recombination, etc.), and mode of disease transmission (Fig. 1A). The next step is collecting the WGS data and, just as with clinical applications, strict protocols ensure data quality and data integrity (Fig. 1B). The analysis of WGS data for pathogen disease surveillance, often called phylogenomic analysis, is most easily described in two major steps (Fig. 1C): (i) collection of the relevant variable sites into a character matrix and (ii) phylogenetic reconstruction. Finally, for disease surveillance to be effective, it is critical to have in place a mechanism to archive and distribute the data and analysis results, especially when public health organizations around the world can greatly benefit from open access and potentially contribute as partners to these growing databases.

FIG 1 NGS phylogenomic workflow for molecular disease surveillance, with critical validation points listed within each module.

Phylogenomic pipeline approaches, expanded. As mentioned previously, the primary data collection is similar across many different applications and is therefore fairly straightforward to validate. In contrast, the phylogenomic analyses are built specifically for the application of molecular disease surveillance and therefore require an in-depth look at the current approaches being utilized. After the raw data are collected, there are many different methods and approaches to identify the relevant variable sites, the details of which will be summarized following this section for the two major variant collection approaches (Fig. 2): single nucleotide polymorphisms (SNPs) and core genome multilocus sequence typing (cgMLST). There are precedents for using phylogenetic methods to pinpoint the origin of disease outbreaks, and notable examples of real-world use are discussed earlier in this review. However, these “variant-only” methods diverge from the traditional phylogenetic approach. Traditionally, phylogenomic analyses used multilocus sequence alignments (not to be confused with cgMLST), in which orthologous genes were determined across the taxa, or isolates, of interest, and then the aligned orthologs, or genes with the same evolutionary descent, were concatenated for downstream analyses. This traditional approach can introduce some uncertainty. First, ortholog determination is a hypothesis of shared ancestry, causing systematic errors if the original determination is incorrect. Second, unless a standard core genome is developed, ortholog determination can require many iterations to find the correct number and diversity of loci to include, which can be very time-consuming and delay the prompt traceback of contaminated foods or confirmation of an outbreak. Finally, the full concatenated multigene alignments can be very long (more than 4 Mb in the case of Salmonella genomes), requiring vast computational resources for analysis and result interpretation. Because of these hurdles, assembling a matrix of variant-only sites identified directly from the raw sequences became an attractive goal. The requirement of orthology still holds here for SNP-based approaches and cgMLST, so these new methods must distinguish between true variants and variants introduced by sequence artifacts, assembly error, gene duplication, or horizontal gene transfer (HGT).

SNP analysis. SNP-based approaches have been developed in an attempt to provide a faster, more objective, automated approach to characterize the variation among a set of isolates. In this reference-based approach, NGS reads are mapped against a specified reference genome, and variant sites are extracted, filtered, and then concatenated into a sequence matrix (generally called an SNP matrix) containing only sites where SNPs were detected. There are close to a dozen published SNP pipelines being used for various purposes. Evaluating all of them is outside the scope of this review, but the CFSAN SNP Pipeline (33), LyveSet (34), and NCBI’s Pathogen Detection pipeline (21) are examples of SNP pipelines currently being applied for U.S. public health applications. There are also reference-free approaches to SNP calling, in which the variant sites are extracted without comparison to a reference (36, 37). SNP pipelines either end at the variant calling step with the final output file containing the SNP matrix, leaving the choice of phylogenetic inference to the user, or include the last step of phylogenetic inference, resulting in a tree as the final output. When comparing different pipelines, it is useful to consider these steps as separate even if they are bundled together.

FIG 2 Technical view of the two main types of analysis pipelines implemented for foodborne pathogen surveillance. First, DNA is isolated from the bacteria. Then, it is sequenced using a short-read NGS technology. The short reads can be analyzed in two different ways, each with the same goal of uncovering variants across the genome for use in the final clustering step. For the SNP-based approach, short reads from each isolate are mapped to a reference genome (draft or complete assembly), SNPs are called and filtered, filtered SNPs are written to a FASTA-formatted SNP matrix, and then a phylogenetic clustering analysis is performed using that matrix as its input file. For the wgMLST- or cgMLST-based approach, short reads from each isolate are mapped against a species-specific allele database, an allele assignment is made for each gene and added to a FASTA-formatted allele matrix, and then a phylogenetic clustering analysis is performed using that matrix as its input file.
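The matrix-building step of the reference-based SNP approach can be illustrated with a minimal sketch. It assumes that read mapping and variant calling have already happened and that each isolate is represented by a consensus sequence in reference coordinates. The function names, the toy isolates, and the rule of dropping any column that contains an ambiguous call are illustrative assumptions only; they do not reproduce the filters used by the CFSAN SNP Pipeline, LyveSet, or NCBI's pipeline.

```python
from typing import Dict, List, Tuple

VALID_BASES = set("ACGT")


def build_snp_matrix(consensus: Dict[str, str]) -> Tuple[List[int], Dict[str, str]]:
    """Return (variant positions, variant-only matrix) for aligned consensus sequences.

    A column is kept only if every isolate has an unambiguous base there
    (hypothetical filter) and at least two different bases are observed.
    """
    lengths = {len(seq) for seq in consensus.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must share the same reference coordinates")
    n_sites = lengths.pop()
    names = list(consensus)

    positions: List[int] = []
    for i in range(n_sites):
        column = {consensus[name][i].upper() for name in names}
        if column <= VALID_BASES and len(column) > 1:
            positions.append(i)

    matrix = {
        name: "".join(consensus[name][i].upper() for i in positions)
        for name in names
    }
    return positions, matrix


def to_fasta(matrix: Dict[str, str]) -> str:
    """Serialize the SNP matrix as FASTA text, one record per isolate."""
    return "".join(f">{name}\n{seq}\n" for name, seq in matrix.items())


if __name__ == "__main__":
    # Toy data: three hypothetical isolates, 10 reference positions.
    toy = {
        "isolate_A": "ACGTACGTAC",
        "isolate_B": "ACGTACGTAT",
        "isolate_C": "ACCTACGNAT",
    }
    positions, matrix = build_snp_matrix(toy)
    print(positions)
    print(to_fasta(matrix), end="")
```

In the toy data, position 7 is excluded because one isolate carries an `N`, so only two columns survive. Real pipelines apply many more filters (depth, mapping quality, SNP density), which is one reason their outputs can differ on the same reads.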

Phylogenetic inference for SNP matrices. An SNP matrix is consistent with traditional phylogenetic theory in that it is an alignment of orthologous characters (nucleotide variants); however, it diverges from traditional multigene alignments used in molecular phylogenetics in that the resulting character matrix contains only SNPs that passed the pipeline’s filter criteria, resulting in a matrix containing only sites with variants of a particular type. This makes the phylogenetic analysis much faster because the character matrix contains only hundreds or thousands of sites (not millions), but theoretically it can also introduce the risk of acquisition bias contributed by parameter choices at the multiple steps within an analytical pipeline (i.e., mapping, read filtering, choice of reference genome, and model of evolution) (38). In practice, variant-only data have not been shown to impact tree topologies of shallow evolutionary analyses (38, 39), such as the clonal data sets usually seen in pathogen disease tracking, but users should be aware that branch lengths might be affected. The NCBI is using a method called maximum compatibility (40) to ensure that only real, or orthologous, SNPs are included in the phylogenetic analysis. For analyzing data sets with deeper divergences, one can use reference-free SNP calling methods (36, 37), include monomorphic sites in the data matrix (39), and/or adjust the model of evolution used in the maximum-likelihood analysis to mitigate acquisition bias (41).
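The effect of dropping invariant sites on branch lengths can be seen in a toy calculation. The sketch below uses the uncorrected proportion of differing sites (p-distance) as a stand-in for a branch length; this is a deliberate simplification and is not the maximum-likelihood or maximum compatibility method referred to above. The sequences and the function name are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


if __name__ == "__main__":
    # Full alignment: 10 sites, 1 difference between the two isolates.
    full_a, full_b = "ACGTACGTAC", "ACGTACGTAT"
    # Variant-only matrix built from a larger isolate set: 2 sites retained.
    snp_a, snp_b = "GC", "GT"

    print(p_distance(full_a, full_b))  # 1 difference over 10 sites
    print(p_distance(snp_a, snp_b))    # same 1 difference over 2 sites
```

The number of differences between the two isolates is identical in both inputs, so their relative ordering is preserved, but the per-site distance is inflated when only variant columns are counted. That is the intuition behind including monomorphic sites or adjusting the evolutionary model when branch lengths matter.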

Allele-based approaches. The other widely used rapid analysis approach is whole-genome (wg) or core-genome (cg) MLST, where the sequence for selected loci is coded as allele types. In this approach, orthologs are identified using an automated approach against a curated database of possible alleles, and a sequence type is assigned to the isolate that can be used for downstream phylogenetic analyses. The abstraction of one or more indels and/or SNPs within each gene into a single allele call results in a character matrix of allele types (1, 2, 3, . . ., n) instead of the underlying nucleotides (A, C, G, or T), as with an SNP matrix. In contrast to SNP approaches, wgMLST or cgMLST approaches must use highly curated databases unique for each taxon of interest. There have been several wgMLST or cgMLST schemas developed recently for Listeria monocytogenes (42–44), Campylobacter jejuni (45), and Salmonella enterica (46, 47). Although this approach is much faster and potentially more automated than SNP-based approaches, heavy curation of the MLST databases is required to keep this method up to date.
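The abstraction from nucleotides to allele numbers can be sketched as a simple lookup. The example assumes that the sequence of each locus has already been extracted from the isolate and that the curated database maps exact sequences to allele numbers. The locus names, sequences, and the choice to report both missing loci and unrecognized sequences as `None` are hypothetical; real schemes handle novel alleles through curation.

```python
from typing import Dict, Optional

# locus name -> {allele sequence -> allele number}; a stand-in for a curated scheme
AlleleDB = Dict[str, Dict[str, int]]


def call_alleles(isolate_loci: Dict[str, str], allele_db: AlleleDB) -> Dict[str, Optional[int]]:
    """Assign an allele number per locus by exact match against the database.

    Loci absent from the isolate, or sequences not present in the database,
    are reported as None (an illustrative convention, not a standard).
    """
    profile: Dict[str, Optional[int]] = {}
    for locus, known_alleles in allele_db.items():
        seq = isolate_loci.get(locus)
        profile[locus] = known_alleles.get(seq.upper()) if seq is not None else None
    return profile


if __name__ == "__main__":
    toy_db: AlleleDB = {
        "locusA": {"ATGAAA": 1, "ATGAAC": 2},             # alleles differ by one SNP
        "locusB": {"ATGCCC": 1, "ATGCCT": 2, "ATGC": 3},  # allele 3 differs by an indel
    }
    print(call_alleles({"locusA": "ATGAAC", "locusB": "atgc"}, toy_db))
    print(call_alleles({"locusA": "ATGGGG"}, toy_db))
```

Note that an SNP and a multi-base indel each produce exactly one change in allele number, which is the loss of nucleotide-level detail that distinguishes an allele matrix from an SNP matrix.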

Phylogenetic inference for allele-based approaches. The resulting character matrixes from MLST analyses are similar to SNP matrixes in that they both contain rows of taxa and columns of variant characters. However, these matrixes are different in that the variant characters (columns) each represent an individual gene with character states that are numbered, or coded, to represent some type of change within that gene (indel, SNP, or both). Because these matrixes comprise allele calls instead of nucleotides, standard models of sequence evolution used in maximum-likelihood and Bayesian analyses cannot be used here. Instead, nonprobabilistic phylogenetic inference methods, such as distance and maximum parsimony, are usually employed.
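A distance-based analysis of an allele matrix starts from pairwise allele differences. The sketch below counts the loci at which two allele profiles differ, skipping loci that are uncalled in either isolate. Treating uncalled loci as uninformative is an assumption made for illustration, and the profile values are invented.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

Profile = Dict[str, Optional[int]]


def allele_distance(p1: Profile, p2: Profile) -> int:
    """Number of loci with different allele calls, ignoring uncalled (None) loci."""
    shared = [
        locus for locus in p1
        if locus in p2 and p1[locus] is not None and p2[locus] is not None
    ]
    return sum(1 for locus in shared if p1[locus] != p2[locus])


def distance_matrix(profiles: Dict[str, Profile]) -> Dict[Tuple[str, str], int]:
    """Pairwise allele distances keyed by alphabetically ordered isolate pairs."""
    return {
        (a, b): allele_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    toy_profiles = {
        "iso1": {"locusA": 1, "locusB": 2, "locusC": 5},
        "iso2": {"locusA": 1, "locusB": 3, "locusC": None},
        "iso3": {"locusA": 2, "locusB": 3, "locusC": 5},
    }
    for pair, dist in distance_matrix(toy_profiles).items():
        print(pair, dist)
```

A matrix of this kind can then be passed to a distance-based clustering method such as UPGMA or neighbor joining. Because each locus contributes at most one difference regardless of how many nucleotides changed, the distances are counts of changed genes, not of substitutions.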

VALIDATION EFFORTS TO DATE

Pathogen biology. Foodborne pathogens such as Salmonella, Listeria, E. coli, and Campylobacter cause human illness through the consumption of contaminated food. Transmission networks for these pathogens originate with a contaminated product, from which the pathogen spreads clonally within the human population. This pattern results in a distinctive tree topology having one large polytomy, or outbreak clade, that contains both source (food/environmental) and clinical isolates, with no significant genomic differences between the two. Scientists at FDA-CFSAN (48) investigated the genome biology of Salmonella by testing the variability of laboratory replicates alongside empirical replicates within an outbreak cluster. Numerous replicate isolates picked from a single S. enterica subsp. enterica serovar Montevideo (S. enterica serovar Montevideo) strain were sequenced to identify genomic differences (i.e., nucleotide substitutions) that might be attributable to variations in sample preparation, errors in sequencing procedures, or additional passaging of cultures. The results showed one to two SNP differences were possible between a parent strain and the newly subcultured daughter strain. However, based on this study and multiple other retrospective outbreak analyses in which isolates were collected from both food and patients (49–51), these nucleotide substitutions are random and clocklike at this level of resolution and do not obscure or alter the phylogenetic history or conclusions drawn from the phylogenetic topology. This observation has held up at all levels within and between serovars of Salmonella and Listeria, showing that outbreak investigators using these methods can do so with the confidence that their conclusions and the linkages that they establish are scientifically sound.

NGS data collection. Several labs have published on their validation efforts for using WGS in pathogen tracking and surveillance (31, 32). In particular, Kozyreva et al. (32) include an assessment of the accuracy of the Illumina MiSeq platform, finding >99.999% agreement between newly collected MiSeq data and a known reference, accounting for possible genetic mutations and PCR error, and a base calling accuracy of >99.9999% within and between runs. The use of NGS for national and global pathogen monitoring means that interlab comparability is critical. This is being addressed by annual multilab proficiency tests (PTs) across the GenomeTrakr lab network (52) (and now jointly across the PulseNet/GenomeTrakr network). In this exercise, participating labs each sequence the same set of bacterial pathogen isolates, and the resulting data are used to assess the proficiency of the lab. In addition, summarizing the sequence data collected across the exercise enables an estimate of normal genetic variation within clonal isolates and expected error in several key areas, such as sequence quality, read mapping, assembly, insert sizes, and variant detection pipelines. Timme et al. (52) observed a low number of SNPs (0 to 4) across all isolates submitted in the PT exercise, with the majority (73%) having 0 SNPs. These SNPs reflect either real genetic mutations or error in the data collection, such as amplification error. Either way, these data show a minimum of 99.99992% platform accuracy and thus are in agreement with the numbers reported by Kozyreva et al., which provides confidence in the use of this technology in decision making for public health applications such as outbreak detection, outbreak management, food facility inspection positives analysis, and monitoring for antibiotic resistance elements. The raw data, complete reference genomes, and a table of metrics summarized in the PT were made publicly available for the PT study (52), so this exercise can be fully replicated in a new lab looking to compare their sequencing quality to the data collected in this exercise.
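The minimum accuracy figure quoted above can be reproduced with simple arithmetic if one assumes a genome of roughly 5 Mb. That genome length is an assumption made here for illustration, since the text does not state the denominator that was used; the function name is likewise hypothetical.

```python
def per_base_agreement(snp_count: int, genome_length: int) -> float:
    """Fraction of genome positions that agree, given a count of observed SNPs."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0 <= snp_count <= genome_length:
        raise ValueError("snp_count must lie between 0 and genome_length")
    return 1.0 - snp_count / genome_length


# Assumption: a genome of about 5 million bases (not stated in the text).
ASSUMED_GENOME_LENGTH = 5_000_000

# Worst case reported in the proficiency test: 4 SNPs in one isolate.
worst_case = per_base_agreement(4, ASSUMED_GENOME_LENGTH)
print(f"{worst_case * 100:.5f}%")  # prints 99.99992%
```

Under that assumption, 4 SNPs in 5,000,000 positions corresponds to 99.99992% agreement, and the 73% of isolates with 0 SNPs correspond to 100% agreement.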

WGS analysis. Given the diversity of pipelines within and between these different approaches, it is important to allow for innovation while validating that the results are accurate and reproducible. Rather than trying to standardize analysis methods across the world, the public health community will be best served by validation approaches that allow the different implemented pipelines to make incremental improvements as needed while remaining free to evolve and innovate in response to the ever-changing landscape of sequencing technology. This is analogous to most wet-lab validation approaches; e.g., several different DNA isolation kits could be validated for the same DNA extraction step, the results of which are all the same (pure DNA).

One approach to analysis validation focuses on well-vetted data sets that can be run through any relevant phylogenomic pipeline, comparing the “test” results against the known “truth.” These benchmark data sets can be assembled through empirical approaches, or they can be entirely simulated. On the empirical side, a set of vetted, well-studied, retrospective data sets can be used to validate results; in this case the “truth” is not actually knowable, but rather the result is supported by multiple lines of evidence with no disagreement among the community working with that respective data set. In molecular epidemiology these benchmark data sets represent a well-studied outbreak or other clonal event with strong WGS plus epidemiological concordance. These data sets comprise the following observed data: a set of raw fastq files collected from the event, a phylogeny, a file with variant calls (VCF file), and epidemiological metadata describing the isolates (related or unrelated to outbreak event). The Gen-FS (CDC, 2015) group published a set of benchmark data sets for foodborne pathogens exactly for this purpose (53). These data sets can be used to validate the SNP calling step by comparing test versus “truth” VCF files, as well as comparing the resulting tree topologies of the test pipeline versus the tree accompanying the benchmark data set. A GMI workgroup has taken over the curation of these benchmark data sets and is expanding them to include new species and scenarios to better serve the community (54). However, growing benchmark data sets through empirical, or observed, data is a slow, tedious, manual process. Alternatively, benchmark data sets can be entirely simulated (38), enabling a researcher to explore the full parameter space across multiple metrics (sequence coverage, sequence quality, tree topology, rate of sequence evolution, rate of indels, etc.). Once the simulated data sets are built, they can be used exactly the same way as the empirical benchmark data sets, comparing the test result to the truth, except in this case the “truth” is known for certain.
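Comparing test versus “truth” VCF files can be illustrated as a set comparison of SNP sites. The sketch below is an assumption-laden simplification rather than the procedure used for the published benchmark data sets: it reads only the fixed CHROM, POS, REF, and ALT columns, ignores per-sample genotypes and indels, and uses hypothetical names and toy records throughout.

```python
def read_snps(vcf_text: str) -> set[tuple[str, int, str, str]]:
    """Collect single-nucleotide (chrom, pos, ref, alt) records from VCF text."""
    snps = set()
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            # Keep only simple substitutions; indels and missing ALTs are skipped.
            if len(ref) == 1 and len(allele) == 1 and allele != ".":
                snps.add((chrom, int(pos), ref.upper(), allele.upper()))
    return snps


def compare_calls(truth: set, test: set) -> dict[str, float]:
    """Summarize agreement between a truth call set and a test call set."""
    tp = len(truth & test)
    fp = len(test - truth)
    fn = len(truth - test)
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"

# Hypothetical toy records standing in for benchmark and pipeline output.
TRUTH_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    HEADER,
    "chr1\t101\t.\tA\tG\t.\tPASS\t.",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.",
    "chr1\t900\t.\tG\tA\t.\tPASS\t.",
])
TEST_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    HEADER,
    "chr1\t101\t.\tA\tG\t.\tPASS\t.",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.",
    "chr1\t1200\t.\tT\tC\t.\tPASS\t.",
])

report = compare_calls(read_snps(TRUTH_VCF), read_snps(TEST_VCF))
print(report)
```

Here the test pipeline recovers two of three truth sites and reports one site that is absent from the truth set, so sensitivity and precision are both two thirds.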

Published validation results using the empirical benchmark data sets were included in Katz et al. for comparing SNP pipelines versus cgMLST approaches (34). In addition, simulated data sets showed that the CFSAN SNP Pipeline was able to consistently recover the correct phylogeny under different simulation scenarios (38).
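One common way to ask whether a pipeline has recovered the correct phylogeny is the Robinson-Foulds distance, which counts the bipartitions found in one tree but not the other. The cited studies do not necessarily use this exact metric or code; the sketch below is a self-contained illustration with a deliberately small Newick reader (topology only, well-formed parenthesized input assumed) and hypothetical function names and trees.

```python
import re


def leaf_sets(newick: str) -> tuple[frozenset[str], list[frozenset[str]]]:
    """Return all leaf names and the leaf set beneath each internal node."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack: list[set[str]] = []
    clades: list[frozenset[str]] = []
    leaves: set[str] = set()
    previous = ""
    for raw in tokens:
        token = raw.strip()
        if not token:
            continue
        if token == "(":
            stack.append(set())
        elif token == ")":
            finished = stack.pop()
            clades.append(frozenset(finished))
            if stack:
                stack[-1] |= finished
        elif token not in ",;" and previous != ")":
            # A leaf label, optionally followed by ":branch_length".
            name = token.split(":")[0].strip()
            leaves.add(name)
            stack[-1].add(name)
        # Labels or branch lengths directly after ")" belong to internal nodes.
        previous = token
    return frozenset(leaves), clades


def bipartitions(newick: str) -> tuple[frozenset[str], set]:
    """Return leaf names and the set of nontrivial unrooted splits of the tree."""
    leaves, clades = leaf_sets(newick)
    splits = set()
    for clade in clades:
        other = leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset({clade, other}))
    return leaves, splits


def robinson_foulds(newick_a: str, newick_b: str) -> int:
    """Count splits present in exactly one of the two trees."""
    leaves_a, splits_a = bipartitions(newick_a)
    leaves_b, splits_b = bipartitions(newick_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same set of isolates")
    return len(splits_a ^ splits_b)


# Hypothetical trees: a benchmark topology and two candidate pipeline results.
benchmark_tree = "((A,B),(C,D),E);"
same_topology = "(E,(D,C),(B,A));"
different_topology = "((A,C),(B,D),E);"

print(robinson_foulds(benchmark_tree, same_topology))
print(robinson_foulds(benchmark_tree, different_topology))
```

A distance of zero means the test tree has the same topology as the benchmark tree regardless of how the branches are ordered or rooted. Outbreak clades are often polytomies, so a test tree that resolves or collapses a near-zero-length branch will register a nonzero distance even when the epidemiological conclusion is unchanged.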

FUTURE DIRECTIONS

Although the basic tools have been published for analysis validation (both observed and simulated benchmark data sets), utilizing these tools for a broader validation of analytical methods for foodborne pathogen surveillance would be of great benefit. For example, a few published validation studies to date show that phylogenomic pipelines currently being used in public health are highly consistent given a set of parameters (data quality, pathogen species, sequence divergence, and tree shape [34, 55]). Initial success with using these data sets prompts the desire for many more benchmark data sets that cover the range of scenarios observed in real life. Consortiums such as the Global Microbial Identifier (GMI) are working to increase this diversity by adding data sets from different species of pathogens, with different tree topology shapes, rates of evolution, and rates of horizontal gene transfer through plasmids and other mobile elements. This is needed not only to cover the diversity of foodborne pathogens but also to extend to other pathogen-caused diseases. For example, how do our current foodborne disease surveillance analysis methods perform on pathogens that have different genome architectures and transmission networks? Phylogenetic trees of foodborne outbreaks have the shape of “super spreaders,” where a single point source (food) causes numerous illnesses (56). This classic tree topology comprises a set of outgroups and usually one large polytomy or “outbreak” clade containing both source (food/environmental) and clinical isolates. Other transmission networks resulting from different diseases (e.g., HIV, flu, etc.) can have quite different shapes when there is person-to-person transmission. Considering all these possibilities, but focusing on foodborne surveillance, we find that utilizing both types of benchmark data sets, empirical and simulated, will enable us to explore the expected parameter space so that we can ensure that our pipelines are accurate and consistent when run within that validated space. In addition, the community would benefit from sets of gold standard SNPs accompanying several of the benchmark data sets described by Timme et al. (53). Because the SNPs or variants collected by these pipelines must be orthologous, the methods must distinguish between true variants and variants introduced by sequence artifacts, assembly error, or horizontal gene transfer (HGT). Having a set of “known” SNPs is key for validating the variant calling step in both SNP and wgMLST pipelines.
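Exploring a parameter space with simulated benchmark data sets usually starts by enumerating the scenarios to simulate. The sketch below shows one simple way to build such a grid; the parameter names and levels are hypothetical placeholders chosen for illustration and are not taken from any published simulation configuration.

```python
from itertools import product

# Hypothetical parameter levels; real studies would choose values that
# bracket the conditions seen in their own surveillance data.
parameter_levels = {
    "coverage": [20, 50, 100],             # mean sequencing depth
    "substitution_rate": [1e-7, 1e-6],     # per-site rate of sequence evolution
    "tree_shape": ["polytomy", "ladder"],  # outbreak-like versus chain-like topology
}


def scenario_grid(levels: dict[str, list]) -> list[dict]:
    """Return one dictionary per combination of parameter levels."""
    names = list(levels)
    return [
        dict(zip(names, combo))
        for combo in product(*(levels[name] for name in names))
    ]


scenarios = scenario_grid(parameter_levels)
print(len(scenarios), "scenarios")
print(scenarios[0])
```

Each scenario would then drive one simulated data set, and a pipeline is considered validated only within the range of scenarios on which it recovers the known truth.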


CONCLUSIONS

Phylogenetics has a long academic history, starting with the conceptual picture from Darwin and maturing in the computer age, when phylogenetic inference algorithms and NGS allowed it to bear out its promise in real-time pathogen surveillance. The method is now being used for real-time disease surveillance for the top five foodborne pathogens in the United States. Both SNP-based and wgMLST-based phylogenomic analysis approaches have been widely adopted for foodborne pathogen surveillance; it is crucial that these approaches are validated in context. To this end, the foodborne pathogen community has made numerous strides in ensuring that these methods are validated through formal collaboration within two consortiums: on the U.S. national level within Gen-FS and on the global level within the GMI. This community is actively engaged in addressing the key remaining areas, like the establishment of gold standard SNPs to accompany the benchmark data sets, thus providing a better understanding of the parameter space in which the phylogenomic pipelines are validated, including parameters such as sequence quality levels, sequence diversity, choice of reference genome, tree topology, ecology and biology of the pathogen, and perhaps the effect of increased horizontal gene transfer. Rigorous validation practices will allow researchers to properly understand the limits of our existing analytic pipelines, while also building confidence in the way these tools can inform public health decisions.

ACKNOWLEDGMENTS

This work was supported by the Center for Food Safety and Applied Nutrition at the U.S. Food and Drug Administration and, more specifically, the GenomeTrakr team.

We thank Lili Velez for her careful editorial improvements and our two reviewers, who greatly improved the final manuscript. We also thank the many collaborators in the GenomeTrakr network who contributed WGS data for real-time surveillance.

REFERENCES

1. Ou C-Y, Ciesielski CA, Myers G, Bandea CI, Luo C-C, Korber BTM, Mullins JI, Schochetman G, Berkelman RL, Economou AN, Witte JJ, Furman LJ, Satten GA, MacInnes KA, Curran JW, Jaffe HW. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science 256:1165–1171. https://doi.org/10.1126/science.256.5060.1165.

2. Holmes EC, Nee S, Rambaut A, Garnett GP, Harvey PH. 1995. Revealing the history of infectious disease epidemics through phylogenetic trees. Philos Trans R Soc Lond B Biol Sci 349:33–40. https://doi.org/10.1098/rstb.1995.0088.

3. Thacker SB, Berkelman RL. 1988. Public health surveillance in the United States. Epidemiol Rev 10:164–190. https://doi.org/10.1093/oxfordjournals.epirev.a036021.

4. Swaminathan B, Barrett TJ, Hunter SB, Tauxe RV, CDC PulseNet Task Force. 2001. PulseNet: the molecular subtyping network for foodborne bacterial disease surveillance, United States. Emerg Infect Dis 7:382–389. https://doi.org/10.3201/eid0703.010303.

5. Flahault A, Dias-Ferrao V, Chaberty P, Esteves K, Valleron AJ, Lavanchy D. 1998. FluNet as a tool for global monitoring of influenza on the Web. JAMA 280:1330–1332. https://doi.org/10.1001/jama.280.15.1330.

6. Glynn MK, Lee LM, McKenna MT. 2007. The status of national HIV case surveillance, United States 2006. Public Health Rep 122(Suppl 1):63–71. https://doi.org/10.1177/00333549071220S110.

7. Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–5467. https://doi.org/10.1073/pnas.74.12.5463.

8. Sanger F, Coulson AR. 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94:441–448. https://doi.org/10.1016/0022-2836(75)90213-2.

9. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, Subbu V, Spiro DJ, Sitz J, Koo H, Bolotov P, Dernovoy D, Tatusova T, Bao Y, St George K, Taylor J, Lipman DJ, Fraser CM, Taubenberger JK, Salzberg SL. 2005. Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature 437:1162–1166. https://doi.org/10.1038/nature04239.

10. Gifford RJ, de Oliveira T, Rambaut A, Pybus OG, Dunn D, Vandamme A-M, Kellam P, Pillay D, UK Collaborative Group on HIV Drug Resistance. 2007. Phylogenetic surveillance of viral genetic diversity and the evolving molecular epidemiology of human immunodeficiency virus type 1. J Virol 81:13050–13056. https://doi.org/10.1128/JVI.00889-07.

11. Köser CU, Ellington MJ, Cartwright EJP, Gillespie SH, Brown NM, Farrington M, Holden MTG, Dougan G, Bentley SD, Parkhill J, Peacock SJ. 2012. Routine use of microbial whole-genome sequencing in diagnostic and public health microbiology. PLoS Pathog 8:e1002824. https://doi.org/10.1371/journal.ppat.1002824.

12. Sintchenko V, Iredell JR, Gilbert GL. 2007. Pathogen profiling for disease management and surveillance. Nat Rev Microbiol 5:464–470. https://doi.org/10.1038/nrmicro1656.

13. Kühnert D, Wu C-H, Drummond AJ. 2011. Phylogenetic and epidemic modeling of rapidly evolving infectious diseases. Infect Genet Evol 11:1825–1841. https://doi.org/10.1016/j.meegid.2011.08.005.

14. Foley BT, Korber BTM, Leitner TK, Apetrei C, Hahn B, Mizrachi I, Mullins J, Rambaut A, Wolinsky S. 2018. HIV sequence compendium 2018. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM.

15. CDC. 2018. HIV cluster and outbreak detection and response. Centers for Disease Control and Prevention, Atlanta, GA. https://www.cdc.gov/hiv/programresources/guidance/molecular-cluster-identification/index.html.

16. McGinnis J, Laplante J, Shudt M, George KS. 2016. Next generation sequencing for whole genome analysis and surveillance of influenza A viruses. J Clin Virol 79:44–50. https://doi.org/10.1016/j.jcv.2016.03.005.

17. Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, Schäffer AA, Brister JR. 2017. Virus variation resource. Nucleic Acids Res 45:D482–D490. https://doi.org/10.1093/nar/gkw1065.

18. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. 2008. The Influenza Virus Resource at the National Center for Biotechnology Information. J Virol 82:596–601. https://doi.org/10.1128/JVI.02005-07.


19. Allard MW, Strain E, Melka D, Bunning K, Musser SM, Brown EW, Timme R. 2016. Practical value of food pathogen traceability through building a whole-genome sequencing network and database. J Clin Microbiol 54:1975–1983. https://doi.org/10.1128/JCM.00081-16.

20. Jackson BR, Tarr C, Strain E, Jackson KA, Conrad A, Carleton H, Katz LS, Stroika S, Gould LH, Mody RK, Silk BJ, Beal J, Chen Y, Timme R, Doyle M, Fields A, Wise M, Tillman G, Defibaugh-Chavez S, Kucerova Z, Sabol A, Roache K, Trees E, Simmons M, Wasilenko J, Kubota K, Pouseele H, Klimke W, Besser J, Brown E, Allard M, Gerner-Smidt P. 2016. Implementation of nationwide real-time whole-genome sequencing to enhance listeriosis outbreak detection and investigation. Clin Infect Dis 63:380–386. https://doi.org/10.1093/cid/ciw242.

21. NCBI. 2018. Pathogen detection. U.S. National Library of Medicine/National Center for Biotechnology Information, Bethesda, MD. https://www.ncbi.nlm.nih.gov/pathogens/.

22. Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, Friez MJ, Funke BH, Hegde MR, Lyon E, Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Committee. 2013. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 15:733–747. https://doi.org/10.1038/gim.2013.92.

23. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, Grody WW, Hegde M, Lyon E, Spector E, Voelkerding K, Rehm HL, ACMG Laboratory Quality Assurance Committee. 2015. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17:405–423. https://doi.org/10.1038/gim.2015.30.

24. Linderman MD, Brandt T, Edelmann L, Jabado O, Kasai Y, Kornreich R, Mahajan M, Shah H, Kasarskis A, Schadt EE. 2014. Analytical validation of whole exome and whole-genome sequencing for clinical applications. BMC Med Genomics 7:20. https://doi.org/10.1186/1755-8794-7-20.

25. Lih C-J, Harrington RD, Sims DJ, Harper KN, Bouk CH, Datta V, Yau J, Singh RR, Routbort MJ, Luthra R, Patel KP, Mantha GS, Krishnamurthy S, Ronski K, Walther Z, Finberg KE, Canosa S, Robinson H, Raymond A, Le LP, McShane LM, Polley EC, Conley BA, Doroshow JH, Iafrate AJ, Sklar JL, Hamilton SR, Williams PM. 2017. Analytical validation of the next-generation sequencing assay for a nationwide signal-finding clinical trial: molecular analysis for therapy choice clinical trial. J Mol Diagn 19:313–327. https://doi.org/10.1016/j.jmoldx.2016.10.007.

26. Gargis AS, Kalman L, Lubin IM. 2016. Assuring the quality of next-generation sequencing in clinical microbiology and public health laboratories. J Clin Microbiol 54:2857–2865. https://doi.org/10.1128/JCM.00949-16.

27. Bennett NC, Farah CS. 2014. Next-generation sequencing in clinical oncology: next steps towards clinical validation. Cancers (Basel) 6:2296–2312. https://doi.org/10.3390/cancers6042296.

28. Deans Z, Watson CM, Charlton R, Ellard S, Wallis Y, Mattocks C, Abbs S. 2007. Best practice guidelines for targeted next generation sequencing. Association for Clinical Genetic Science, London, United Kingdom.

29. Weiss MM, Van der Zwaag B, Jongbloed JDH, Vogel MJ, Brüggenwirth HT, Lekanne Deprez RH, Mook O, Ruivenkamp CAL, van Slegtenhorst MA, van den Wijngaard A, Waisfisz Q, Nelen MR, van der Stoep N. 2013. Best practice guidelines for the use of next-generation sequencing applications in genome diagnostics: a national collaborative study of Dutch genome diagnostic laboratories. Hum Mutat 34:1313–1321. https://doi.org/10.1002/humu.22368.

30. Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, Leon A, Pullambhatla M, Temple-Smolkin RL, Voelkerding KV, Wang C, Carter AB. 2018. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J Mol Diagn 20:4–27. https://doi.org/10.1016/j.jmoldx.2017.11.003.

31. Portmann A-C, Fournier C, Gimonet J, Ngom-Bru C, Barretto C, Baert L. 2018. A validation approach of an end-to-end whole genome sequencing workflow for source tracking of Listeria monocytogenes and Salmonella enterica. Front Microbiol 9:446. https://doi.org/10.3389/fmicb.2018.00446.

32. Kozyreva VK, Truong C-L, Greninger AL, Crandall J, Mukhopadhyay R, Chaturvedi V. 2017. Validation and implementation of Clinical Laboratory Improvements Act (CLIA)-compliant whole genome sequencing in public health microbiology laboratory. J Clin Microbiol 55:2502–2520. https://doi.org/10.1128/JCM.00361-17.

33. Davis S, Pettengill JB, Luo Y, Payne J, Shpuntoff A, Rand H, Strain E. 2015. CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data. PeerJ Comput Sci 1:e20. https://doi.org/10.7717/peerj-cs.20.

34. Katz LS, Griswold T, Williams-Newkirk AJ, Wagner D, Petkau A, Sieffert C, Van Domselaar G, Deng X, Carleton HA. 2017. A comparative analysis of the Lyve-SET phylogenomics pipeline for genomic epidemiology of foodborne pathogens. Front Microbiol 8:1–13.

35. Reference deleted.

36. Gardner SN, Hall BG. 2013. When whole-genome alignments just won’t work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes. PLoS One 8:e81760. https://doi.org/10.1371/journal.pone.0081760.

37. Bertels F, Silander OK, Pachkov M, Rainey PB, van Nimwegen E. 2014. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol 31:1077–1088. https://doi.org/10.1093/molbev/msu088.

38. McTavish EJ, Pettengill J, Davis S, Rand H, Strain E, Allard M, Timme RE. 2017. TreeToReads: a pipeline for simulating raw reads from phylogenies. BMC Bioinformatics 18:178. https://doi.org/10.1186/s12859-017-1592-1.

39. Sahl JW, Lemmer D, Travis J, Schupp JM, Gillece JD, Aziz M, Driebe EM, Drees KP, Hicks ND, Williamson CHD, Hepp CM, Smith DE, Roe C, Engelthaler DM, Wagner DM, Keim P. 2016. NASP: an accurate, rapid method for the identification of SNPs in WGS datasets that supports flexible input and output formats. Microb Genom 2:e000074. https://doi.org/10.1099/mgen.0.000074.

40. Cherry JL. 2017. A practical exact maximum compatibility algorithm for reconstruction of recent evolutionary history. BMC Bioinformatics 18:127. https://doi.org/10.1186/s12859-017-1520-4.

41. Leaché AD, Banbury BL, Felsenstein J, de Oca AN-M, Stamatakis A. 2015. Short tree, long tree, right tree, wrong tree: new acquisition bias corrections for inferring SNP phylogenies. Syst Biol 64:1032–1047. https://doi.org/10.1093/sysbio/syv053.

42. Pightling AW, Petronella N, Pagotto F. 2015. The Listeria monocytogenes Core-Genome Sequence Typer (LmCGST): a bioinformatic pipeline for molecular characterization with next-generation sequence data. BMC Microbiol 15:224. https://doi.org/10.1186/s12866-015-0526-1.

43. Moura A, Criscuolo A, Pouseele H, Maury MM, Leclercq A, Tarr C, Björkman JT, Dallman T, Reimer A, Enouf V, Larsonneur E, Carleton H, Bracq-Dieye H, Katz LS, Jones L, Touchon M, Tourdjman M, Walker M, Stroika S, Cantinelli T, Chenal-Francisque V, Kucerova Z, Rocha EPC, Nadon C, Grant K, Nielsen EM, Pot B, Gerner-Smidt P, Lecuit M, Brisse S. 2016. Whole genome-based population biology and epidemiological surveillance of Listeria monocytogenes. Nat Microbiol 2:16185. https://doi.org/10.1038/nmicrobiol.2016.185.

44. Lüth S, Kleta S, Dahouk Al S. 2018. Whole-genome sequencing as a typing tool for foodborne pathogens like Listeria monocytogenes: the way towards global harmonization and data exchange. Trends Food Sci Technol 73:67–75. https://doi.org/10.1016/j.tifs.2018.01.008.

45. Cody AJ, McCarthy ND, Jansen van Rensburg M, Isinkaye T, Bentley SD, Parkhill J, Dingle KE, Bowler ICJW, Jolley KA, Maiden MCJ. 2013. Real-time genomic epidemiological evaluation of human Campylobacter isolates by use of whole-genome multilocus sequence typing. J Clin Microbiol 51:2526–2534. https://doi.org/10.1128/JCM.00066-13.

46. Taylor AJ, Lappi V, Wolfgang WJ, Lapierre P, Palumbo MJ, Medus C, Boxrud D. 2015. Characterization of foodborne outbreaks of Salmonella enterica serovar Enteritidis with whole-genome sequencing single nucleotide polymorphism-based analysis for surveillance and outbreak detection. J Clin Microbiol 53:3334–3340. https://doi.org/10.1128/JCM.01280-15.

47. Yachison CA, Yoshida C, Robertson J, Nash JHE, Kruczkiewicz P, Taboada EN, Walker M, Reimer A, Christianson S, Nichani A, PulseNet Canada Steering Committee, Nadon C. 2017. The validation and implications of using whole genome sequencing as a replacement for traditional serotyping for a national Salmonella reference laboratory. Front Microbiol 8:1044. https://doi.org/10.3389/fmicb.2017.01044.

48. Allard MW, Luo Y, Strain E, Li C, Keys CE, Son I, Stones R, Musser SM, Brown EW. 2012. High resolution clustering of Salmonella enterica serovar Montevideo strains using a next-generation sequencing approach. BMC Genomics 13:32. https://doi.org/10.1186/1471-2164-13-32.

49. Hoffmann M, Luo Y, Monday SR, Gonzalez-Escalona N, Ottesen AR, Muruvanda T, Wang C, Kastanis G, Keys C, Janies D, Senturk IF, Catalyurek UV, Wang H, Hammack TS, Wolfgang WJ, Schoonmaker-Bopp D, Chu A, Myers R, Haendiges J, Evans PS, Meng J, Strain EA, Allard MW, Brown EW. 2016. Tracing origins of the Salmonella Bareilly strain causing a foodborne outbreak in the United States. J Infect Dis 213:502–508. https://doi.org/10.1093/infdis/jiv297.

50. Chen Y, Burall LS, Luo Y, Timme R, Melka D, Muruvanda T, Payne J, Wang C, Kastanis G, Maounounen-Laasri A, De Jesus AJ, Curry PE, Stones R, K’Aluoch O, Liu E, Salter M, Hammack TS, Evans PS, Parish M, Allard MW, Datta A, Strain EA, Brown EW. 2016. Isolation, enumeration and whole-genome sequencing of Listeria monocytogenes in stone fruits linked to a multistate outbreak. Appl Environ Microbiol 82:7030–7040. https://doi.org/10.1128/AEM.01486-16.

51. Lienau EK, Strain E, Wang C, Zheng J, Ottesen AR, Keys CE, Hammack TS, Musser SM, Brown EW, Allard MW, Cao G, Meng J, Stones R. 2011. Identification of a salmonellosis outbreak by means of molecular sequencing. N Engl J Med 364:981–982. https://doi.org/10.1056/NEJMc1100443.

52. Timme RE, Rand H, Sanchez Leon M, Hoffmann M, Strain E, Allard M, Roberson D, Baugher JD. 2018. GenomeTrakr proficiency testing for foodborne pathogen surveillance: an exercise from 2015. Microb Genom 57:289.

53. Timme RE, Rand H, Shumway M, Trees EK, Simmons M, Agarwala R, Davis S, Tillman GE, Defibaugh-Chavez S, Carleton HA, Klimke WA, Katz LS. 2017. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ 5:e3893. https://doi.org/10.7717/peerj.3893.

54. Global Microbial Identifier, Workgroup 3. 2018. Benchmark datasets for phylogenomic validation. GitHub. https://github.com/globalmicrobialidentifier-WG3/datasets.

55. Page AJ, Alikhan N-F, Carleton HA, Seemann T, Keane JA, Katz LS. 2017. Comparison of classical multi-locus sequence typing software for next-generation sequencing data. Microb Genom 3:e000124.

56. Colijn C, Gardy J. 2014. Phylogenetic tree shapes resolve disease transmission patterns. Evol Med Public Health 2014:96–108. https://doi.org/10.1093/emph/eou018.
