Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
A peer-reviewed version of this preprint was published in PeerJ on 18February 2020.
View the peer-reviewed version (peerj.com/articles/8544), which is thepreferred citable publication unless you specifically need to cite this preprint.
Dreier M, Berthoud H, Shani N, Wechsler D, Junier P. 2020. SpeciesPrimer: abioinformatics pipeline dedicated to the design of qPCR primers for thequantification of bacterial species. PeerJ 8:e8544https://doi.org/10.7717/peerj.8544
SpeciesPrimer: A bioinformatics pipeline dedicated to thedesign of qPCR primers for the quantification of bacterialspeciesMatthias Dreier Corresp., 1, 2 , Hélène Berthoud 1 , Noam Shani 1 , Daniel Wechsler 1 , Pilar Junier 2
1 Agroscope, Bern, Switzerland2 Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Switzerland
Corresponding Author: Matthias DreierEmail address: [email protected]
Background. Quantitative real-time PCR (qPCR) is a well-established method for detecting andquantifying bacteria, and it is progressively replacing culture-based diagnostic methods in foodmicrobiology. High-throughput qPCR using microfluidics brings further advantages by providing fasterresults, decreasing the costs per sample and reducing errors due to automatic distribution of samplesand reactants. In order to develop a high-throughput qPCR approach for the rapid and cost-efficientquantification of microbial species in a given system (for instance, cheese), the preliminary setup of qPCRassays working efficiently under identical PCR conditions is required. Identification of target-specificnucleotide sequences and design of specific primers are the most challenging steps in this process. Todate, most available tools for primer design require either laborious manual manipulation or high-performance computing systems.
Results. We developed the SpeciesPrimer pipeline for automated high-throughput screening of species-specific target regions and the design of dedicated primers. Using SpeciesPrimer specific primers weredesigned for four bacterial species of importance in cheese quality control, namely Enterococcusfaecium, Enterococcus faecalis, Pediococcus acidilactici and Pediococcus pentosaceus. Selected primerswere first evaluated in silico and subsequently in vitro using DNA from pure cultures of a variety ofstrains found in dairy products. Specific qPCR assays were developed and validated, satisfying thecriteria of inclusivity, exclusivity and amplification efficiencies.
Conclusion. In this work, we present the SpeciesPrimer pipeline, a tool to design species-specificprimers for the detection and quantification of bacterial species. We use SpeciesPrimer to design qPCRassays for four bacterial species and describe a workflow to evaluate the designed primers.SpeciesPrimer facilitates efficient primer design for species-specific quantification, paving the way for afast and accurate quantitative investigation of microbial communities.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27870v1 | CC BY 4.0 Open Access | rec: 23 Jul 2019, publ: 23 Jul 2019
1
2 SpeciesPrimer: A bioinformatics pipeline dedicated to the
3 design of qPCR primers for the quantification of bacterial
4 species
5
6
7 Matthias Dreier1, 2, Hélène Berthoud1, Noam Shani1, Daniel Wechsler1, Pilar Junier2
8
9 1 Agroscope, Schwarzenburgstrasse 161, CH-3003 Bern, Switzerland
10 2 Laboratory of Microbiology, University of Neuchâtel, Emile-Argand 11, CH-2000 Neuchâtel,
11 Switzerland
12
13 Corresponding Author:
14 Matthias Dreier1, 2
15 Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
16 Email address: [email protected]
17
18 Abstract
19 Background. Quantitative real-time PCR (qPCR) is a well-established method for detecting and
20 quantifying bacteria, and it is progressively replacing culture-based diagnostic methods in food
21 microbiology. High-throughput qPCR using microfluidics brings further advantages by
22 providing faster results, decreasing the costs per sample and reducing errors due to automatic
23 distribution of samples and reactants. In order to develop a high-throughput qPCR approach for
24 the rapid and cost-efficient quantification of microbial species in a given system (for instance,
25 cheese), the preliminary setup of qPCR assays working efficiently under identical PCR
26 conditions is required. Identification of target-specific nucleotide sequences and design of
27 specific primers are the most challenging steps in this process. To date, most available tools for
28 primer design require either laborious manual manipulation or high-performance computing
29 systems.
30 Results. We developed the SpeciesPrimer pipeline for automated high-throughput screening of
31 species-specific target regions and the design of dedicated primers. Using SpeciesPrimer specific
32 primers were designed for four bacterial species of importance in cheese quality control, namely
33 Enterococcus faecium, Enterococcus faecalis, Pediococcus acidilactici and
34 Pediococcus pentosaceus. Selected primers were first evaluated in silico and subsequently in
35 vitro using DNA from pure cultures of a variety of strains found in dairy products. Specific
36 qPCR assays were developed and validated, satisfying the criteria of inclusivity, exclusivity and
37 amplification efficiencies.
38 Conclusion. In this work, we present the SpeciesPrimer pipeline, a tool to design species-
39 specific primers for the detection and quantification of bacterial species. We use SpeciesPrimer
40 to design qPCR assays for four bacterial species and describe a workflow to evaluate the
41 designed primers. SpeciesPrimer facilitates efficient primer design for species-specific
42 quantification, paving the way for a fast and accurate quantitative investigation of microbial
43 communities.
44 Introduction
45 Quantitative real-time PCR (qPCR) is a well-established method for the detection and
46 quantification of bacteria in microbiology, for instance in the context of pathogen detection in
47 clinical and veterinary diagnostics and food safety (Cremonesi et al. 2014; Curran et al. 2007;
48 Garrido-Maestu et al. 2018; Ramirez et al. 2009). Culture-based diagnostic methods are
49 progressively being replaced by qPCR due to advantages such as faster results, more specific
50 detection, and the ability to detect sub-dominant populations (Postollec et al. 2011). High-
51 throughput microfluidic qPCR brings further advantages including the fast generation of results,
52 a lower cost per sample and fewer errors due to automatic distribution of samples and reactants.
53 However, in order to work efficiently high-throughput qPCR systems use identical PCR
54 chemistry and PCR conditions for all reactions taking place on a single chip. Therefore, existing
55 qPCR assays are often not suitable and new primers have to be designed (Hermann-Bank et al.
56 2013; Ishii et al. 2013; Kleyer et al. 2017).
57 The main challenges for the successful development of any qPCR assay are the identification of
58 a specific target nucleotide sequence and the design of primers that bind exclusively to that target
59 sequence. Before microbial draft genomes became widely available, the 16S rRNA gene
60 sequence was frequently used as a target sequence. However, the regions that are targeted in the
61 16S rRNA gene do not provide sufficient resolution to differentiate between closely related
62 bacterial species (Moyaert et al. 2008; Torriani et al. 2001; Wang et al. 2007). Further,
63 housekeeping genes such as, for instance, tuf, recA and pheS, were successfully used as target
64 sequences for a variety of bacterial species in fermented foods (Falentin et al. 2010; Masco et al.
65 2007; Scheirlinck et al. 2009). Today, the steadily increasing number of prokaryotic draft
66 genomes facilitates the identification of new and unique target regions.
67 Various commercial and open source programs facilitate the design of specific primers for a
68 target sequence, such as the standard tools Primer3 and Primer-BLAST (Untergasser et al. 2012;
69 Ye et al. 2012). Primer3 predicts suitable PCR primers for an input target sequence, while
70 Primer-BLAST combines Primer3 with a BLAST search in a selected nucleotide sequence
71 database to assess the specificity of the primers for the target sequence. Additional tools and
72 pipelines that encompass both the identification of target sequences from bacterial draft genomes
73 and the design of primer candidates include, for instance, RUCS and TOPSI (Thomsen et al.
74 2017; Vijaya Satya et al. 2010). RUCS is able to identify unique core sequences in a positive set
75 of genomes (target) compared to a negative set of genomes (non-target). It designs primers for
76 the core sequences and validates them with an in silico PCR validation method against the
77 positive and negative reference set. TOPSI is an automated high-throughput pipeline for the
78 design of primers, primarily developed for pathogen-diagnostic assays. It identifies sequences
79 present in all input genomes and designs specific primers accordingly.
80 We aimed to design a series of primers that function with the same qPCR cycling conditions and
81 primer concentrations for later usage in a high-throughput microfluidic qPCR platform. Although
82 TOPSI and RUCS were initially considered for the automated design of primers, TOPSI could
83 not be used because no Linux-based cluster was available. RUCS was easily installed, but we
84 were not able to create primer pairs in initial tests with a small set of positive (target) and
85 negative (non-target) genomes. The example in the original publication of RUCS (using
86 Escherichia coli genomes as positive and negative sets) indicates that RUCS works best for very
87 similar genome assemblies in the positive and the negative sets. From this example and the initial
88 test, we inferred that RUCS requires a carefully selected training set of positive and negative
89 genomes to identify target sequences, which is a demanding task in the case of complex
90 microbial systems such as those involved in the production of fermented foods and was therefore
91 not suitable for our high-throughput approach.
92 This study presents a pipeline for automated high-throughput screening for species-specific
93 target regions combined with the design of primer candidates for these sequences. The process of
94 primer design is fully automated from the download of bacterial genomes to the quality control
95 of primer candidates. The pipeline runs on a standard computer with a multi-core processor and a
96 minimum of 16 GB RAM. We have applied the SpeciesPrimer pipeline to a set of four bacterial
97 species occurring in cheese and other dairy products and validated the primers in silico and in
98 vitro by performing qPCR experiments with a variety of target and non-target strains.
99 Description
100 Overview
101 The SpeciesPrimer pipeline consists of three main parts (Table 1). First, genome assemblies are
102 downloaded, annotated and then subjected to quality control. Second, a pan-genome analysis is
103 performed to identify single copy core genes. Conserved sequences of these core genes are then
104 extracted and the specificity for the target species is assessed. Finally, primers are designed for
105 these species-specific conserved core gene sequences and subsequently evaluated in a primer
106 quality control step.
107 Part 1: Input genome assemblies
108 The minimal command line input for the pipeline is the species name. Further, a list of non-target
109 species names can be specified (e.g., species found in the investigated ecosystem but that should
110 not be detected in the specific qPCR assay). For downloading genome assemblies from the
111 National Center for Biotechnology Information (NCBI) automatically, a valid e-mail address is
112 required for accessing the NCBI E-utilities services (Sayers 2009). The pipeline works with a
113 pre-formatted NCBI BLAST database (nt), containing partially non-redundant nucleotide
114 sequences. A local copy of the nt database is required. It can be downloaded from NCBI using
115 the update_blastdb.pl script from the BLAST+ package (Altschul et al. 1990), via FTP from the
116 NCBI FTP server or with the pipeline script (getblastdb.py).
117 The user-provided species name is used to search for genome assemblies in the NCBI database.
118 The Biopython Entrez module (Cock et al. 2009) searches the NCBI taxonomic identity (taxid)
119 for the target species in the taxonomy database and downloads the genome assembly summary
120 report. Afterwards, SpeciesPrimer downloads the genome assemblies in FASTA format from the
121 NCBI RefSeq FTP server using the links specified in the summary report. Finally, the
122 downloaded genome assemblies are annotated with Prokka (Seemann 2014).
123 The quality of the genome assemblies is a crucial factor for the pan-genome analysis. Genome
124 assemblies deposited with the wrong taxonomic label or low-quality assemblies drastically
125 reduce the number of identified core genes and of conserved sequences for primer design. The
126 initial quality control step is intended to remove such assemblies from the subsequent analysis.
127 For the verification of the taxonomic classification, the user can choose one or several genes
128 from five conserved housekeeping genes (16S rRNA, tuf, recA, dnaK and pheS). Genome
129 assemblies without an annotation for the specified conserved housekeeping genes or genome
130 assemblies consisting of more than 500 contigs are removed from the downstream pan-genome
131 analysis. The sequences of the specified conserved housekeeping genes are blasted against the
132 local nt database. Genome assemblies pass the quality control if the best BLAST hit for all
133 sequences is a sequence arising from the target species.
134 Part 2: Identification of target sequences for primer design
135 A pan-genome analysis is performed using Roary (Page et al. 2015) to identify the core genes of
136 the target species. Based on the results of the pan-genome analysis, single copy core genes are
137 identified. The gene_presence_absence.csv produced by Roary reports the presence (or absence)
138 of every annotated gene for every input genome assembly. Single copy core genes are the genes
139 for which the number of assemblies harboring the sequence and the number of total identified
140 sequences equals the number of total input assemblies. An sqlite3 database containing all
141 annotated sequences of all assemblies is compiled
142 (https://github.com/EnzoAndree/tutorials/blob/patch-1/DBGenerator.py). This database is
143 queried for single copy core genes and the nucleotide sequences are saved in multi-FASTA
144 format. Each multi-FASTA file contains the sequences of one single copy core gene from each
145 input genome assembly. These sequences are aligned using the probabilistic multiple alignment
146 program Prank (Löytynoja 2014). A consensus sequence with ambiguous bases is then created
147 using consambig from the EMBOSS package (Rice et al. 2000). The alignments and extraction
148 of the consensus sequence are performed in parallel for several core genes using GNU parallel
149 (Tange 2011). Continuous consensus sequences longer than the minimal PCR product length,
150 harboring less than two ambiguous bases in the range of 20 bases are used for the subsequent
151 steps of the pipeline.
152 These conserved consenus sequences are used for a BLAST search against the local nt database
153 using the discontiguous BLAST algorithm and an e-value cutoff of 500. For all hits in the
154 BLAST results, the species name is extracted from the sequence description and compared with
155 the names in the species list (non-target species). If any species name in the species list matches
156 a hit in the BLAST results the corresponding query sequence is discarded, otherwise the
157 sequence is classified as specific for the target and considered for primer design.
158 Part 3: Primer design
159 Primer3 is used to design primers for the unique single copy core gene sequences. As pipeline
160 default the optimum primer melting temperature is set to 60 °C and the maximal primer length is
161 set to 26 bases, all other settings are the default settings of the primer3web version
162 (http://primer3.ut.ee, accessed November 29, 2018). The minimal and maximal amplicon size of
163 the PCR product can be specified individually for every target species through the command line
164 options. The other parameters for primer3 cannot be changed individually, but the general
165 primer3 settings can be changed by modifying the primer3 settings file.
166 The primer quality control consists of three parts, an in silico PCR to evaluate the specificity of
167 the primer for the template, an estimation of secondary structures of the amplicon sequence and
168 the estimation of the potential to form primer dimers. The specificity check for each primer pair
169 is performed with MFEprimer 2.0 (Qu et al. 2012). For the evaluation of the specificity, three
170 indexed databases are generated: the target template database, the non-target sequence database
171 and the target genome database. The target template database consists of the unique conserved
172 core gene sequences used as template for primer design. The non-target sequence database is
173 compiled from sequences of non-target species, which show similarities to the primer sequences.
174 To identify these sequences, a BLAST search with all primers against the local nt database is
175 performed. BLAST hits with a species name in the description matching a name in the user-
176 specified non-target species list are selected. These selected sequences and 4000 base pairs up-
177 and downstream are extracted from the nt database using the blastdbcmd tool. The target genome
178 database is composed of maximal 10 of the input genome assemblies. If the assembly summary
179 report from the automatic download of genome assemblies from NCBI is available the genome
180 assemblies as complete as possible are preferred (assembly status: complete > chromosome >
181 scaffold > contig). The target sequence database is used to evaluate the maximum primer pair
182 coverage (PPC), a value used by MFEprimer 2.0 to score the ability of the primer pair to bind to
183 a DNA template. The maximum value of the PPC is 100, all primer pairs with a PPC value lower
184 than the specified threshold (mfethreshold, default = 90) for their template are excluded. Next,
185 MFEprimer 2.0 is used to score the binding of the primer pairs to the sequences of the non-target
186 sequence and the target genome database. The difference of the PPC for the DNA template and
187 the specified threshold (Δthreshold = PPC – mfethreshold) is used as a threshold for the
188 maximum PPC value a primer pair is allowed to have for a non-target sequence. Strong
189 secondary structures at the 5'- or the 3'- end of the PCR product could impair efficient primer
190 binding. Therefore, the PCR products of the primer pairs are submitted to mfold (Zuker et al.
191 1999) to exclude PCR products with strong secondary structures at the annealing temperature of
192 60 °C. Moreover, as primer dimers can yield unspecific signals during the qPCR run, the 3'- ends
193 of the primer pairs are checked for their potential to form homo- or hetero-dimers using a Perl
194 script (MPprimer_dimer_check.pl) from MPprimer (Shen et al. 2010).
195 The pipeline output is a list containing the primer name, primer pair coverage (MFEprimer) and
196 penalty values, primer and template sequences and melting temperatures (Primer3). Further, a
197 report of the genome assembly quality control, a file containing the pipeline run statistics, the
198 core gene alignment and the phylogeny in newick format can be found in the output directory.
199 Materials & Methods
200 Primer design
201 SpeciesPrimer pipeline runs were performed on a virtual machine (Oracle VM VirtualBox 5.2.8)
202 with Ubuntu 16.04 (64-bit) and docker installed, using 22 of 24 logical processors from two Intel
203 Xeon E5-2643 CPUs and 32 GB of RAM. The used docker image is available from
204 https://hub.docker.com/r/biologger/speciesprimer.
205 The species list consisted of 259 species and subspecies names detected in dairy products,
206 namely from species names collected from data of 16S rRNA meta-genome sequencing studies
207 in milk and cheese varieties (Marco Meola Agroscope, pers. comm.) and dairy-related bacteria
208 from the list of bacterial species and subspecies with technological beneficial use in food
209 products (Almeida et al. 2014).
210 The SpeciesPrimer pipeline was run with the input genome assemblies, parameters and the
211 species list specified in the supplemental information (Data S1). Genome assemblies from the
212 strain collection of Agroscope were included for the Pediococci.
213 In silico validation
214 For the in silico validation, PCR products for the designed primer pairs were used for an online
215 BLAST search against the RefSeq Genomes Database (refseq_genomes) limited to bacterial
216 genomes. The search was performed by qblast (biopython), using blastn, the maximum hitlist
217 size was set to 5000 and the expect threshold (e-value) was set to 500.
218 Primer pairs were tested for specificity using online Primer-BLAST (Ye et al. 2012). The
219 primers were blasted against the nucleotide collection BLAST database (nr) limited to sequences
220 from bacteria. Default settings were used, except for the primer specificity stringency that was
221 set to ignore targets that have nine or more mismatches to the primer.
222 In vitro validation
223 The inclusivity of the primer pairs was assayed by performing qPCR with 2 ng DNA of 20 to 25
224 strains of the target species in technical duplicates. The linear amplification of genomic DNA
225 and PCR efficiency was examined by ten-fold dilution series of the type strain DNA in a range
226 from 106 to 101 genome copies per reaction. DNA concentration for the corresponding number of
227 genome copies was estimated by taking the genome size of the type strain
228 (https://www.ncbi.nlm.nih.gov/genome) and an average weight of 1.096 ∙ 10-21 g per base pair.
229 The exclusivity of the primer pairs was assayed by performing qPCR with 2 ng DNA from
230 various bacterial species in technical duplicates found in dairy products in four qPCR runs
231 including three strains of the target species as positive control.
232 Bacterial strains
233 Strains stored within the Agroscope Culture Collection at -80 °C in sterile reconstituted skim
234 milk powder (10 %, w/v), were reactivated and cultivated according to the conditions specified
235 in Data S2.
236 DNA extraction
237 Unless otherwise noted, all reagents were purchased from Merck, Darmstadt, Germany.
238 Bacterial pellets harvested from 1 ml culture by centrifugation (10000 x g, 5 min, room
239 temperature) were used for DNA extraction. For a pre-lysis treatment, the bacterial cells were
240 incubated in 1 ml of 50 mM sodium hydroxide for 15 min at room temperature. Afterwards cells
241 were collected by centrifugation (10000 x g, 5 min, room temperature) and then treated with
242 lysozyme (2.5 mg/ml dissolved in 100 mM Tris(hydroxymethyl)aminomethane, 10 mM
243 ethylendiaminetetraacetic acid (Calbiochem, San Diego, USA), 25 % (w/v) sucrose, pH 8.0) for
244 1 hour at 37 °C. After the pre-lysis treatment, the bacterial cells were collected by centrifugation
245 (10000 x g, 5 min, room temperature). Cell lysis and genomic DNA extraction was performed
246 using the EZ1 DNA Tissue kit and a BioRobot® EZ1 workstation (Qiagen, Hilden, Germany)
247 according to the manufacturer's instructions and eluted in a volume of 100 µl. The DNA
248 concentration was measured using a NanoDrop® ND-1000 Spectrophotometer (NanoDrop
249 Technologies, Thermo Fisher Scientific, Waltham, MA, USA).
250 Quantitative real-time PCR
251 The qPCR assays were performed in a total reaction mix volume of 12 µl containing 6 µl 2x
252 SsoFast™ EvaGreen® Supermix with low ROX (Biorad, Cressier, Switzerland), 500 nM of
253 forward and reverse primers, respectively, and 2 µl of DNA. Each sample was measured in
254 technical duplicates. The qPCR cycling conditions were an initial denaturation at 95 °C for 1
255 minute followed by 35 cycles of 95 °C for 5 seconds and 60 °C for 1 minute. For the melting
256 curve analysis, a gradient from 60 – 95 °C with 1 °C steps per 3 seconds was performed. All
257 qPCR assays were run on a Corbett Rotor-Gene 3000 (Qiagen). The analysis was performed
258 using Rotor-Gene 6000 Software 1.7 with dynamic tube normalization and a threshold of 0.05
259 for quantification cycle (Cq) value calculation, the five first cycles were ignored for the
260 determination of the Cq values. The peak calling threshold for the melt curve analysis was set to
261 -2 dF/dT and a temperature threshold was set 2 °C lower than the positive control peak.
262 Average nucleotide identity calculations
263 Average nucleotide identity (ANI) calculations were performed with OrthoANIu (Yoon et al.
264 2017)). All Enterococcus faecium genome assemblies were compared to the E. faecium reference
265 sequence (NC_017960.1), while all Pediococcus acidilactici genome assemblies were compared
266 to the P. acidilactici reference sequence (NZ_CP015206.1). The genome assemblies were
267 grouped based on the phylogeny tree of the core gene sequences built with FastTree (Price et al.
268 2010), the ANI values were collected and the average, minimum and maximum was calculated
269 for each group.
270 Results
271 Primer design
272 Primer design for four bacterial species commonly found in cheese was performed with the
273 SpeciesPrimer pipeline. The pipeline runs were completed in two to eight hours, excluding the
274 time required for downloading and annotation of the genome assemblies. Depending on the
275 number of genome assemblies, downloading and annotation of the genome assemblies took from
276 24 minutes (27) to 12 hours 27 minutes (575). The average time for downloading and annotation
277 was two seconds and one minute six seconds, respectively. The analysis of the Enterococcus
278 faecalis, Enterococcus faecium, Pediococcus acidilactici and Pediococcus pentosaceus
279 assemblies resulted in 15, 2, 2 and 160 identified primer pair candidates, respectively (Table 2).
280 The primer pair candidates for E. faecalis and P. pentosaceus were filtered for the highest primer
281 pair coverage score (E. faecalis: 2; P. pentosaceus: 29); for P. pentosaceus only the two primer
282 pairs with the lowest primer pair penalty values were selected.
283 The phylogeny trees of the core gene alignments from E. faecium and P. acidilactici were
284 created using Roary and FastTree (Figure 1). The unrooted tree from the concatenated core genes
285 of E. faecium shows the phylogenetic distance of two distinct groups of sequences, a main
286 cluster with 531 sequences and a subcluster with 44 sequences. The tree made with the
287 concatenated core gene sequences of P. acidilactici shows the phylogenetic distance of one
288 sequence from all other sequences. From this observation, the existence of different species or
289 subspecies was suspected. Calculation of the average nucleotide identity (ANI) has been
290 proposed as a valuable tool to determine species boundaries (Richter & Rossello-Mora 2009).
291 Therefore, we performed ANI calculations for the genome assemblies and the reference
292 sequence for E. faecium (NC_017960.1) using the tool OrthoANIu. The genome assemblies of
293 the E. faecium subcluster have an average ANI of 94.67 %. The ANI between the genome
294 assembly of the P. acidilactici strain FAM 18987 and the NCBI reference sequence for P.
295 acidilactici (NZ_CP015206.1) was only 89.21 %. The OrthoANI values (Table 3) of the
296 assemblies in the subclusters of E. faecium (94.15 - 95.60 %) are at the border and the value for
297 the P. acidilactici strain FAM 18987 (89.21%) is below the proposed species threshold cutoffs
298 (95 - 96 %) (Kim et al. 2014; Richter & Rossello-Mora 2009). P. acidilactici strain FAM 18987
299 should therefore probably be assigned to a new species or subspecies. However, for certain
300 species also lower boundary cutoffs might be reasonable (Ciufo et al. 2018). According to the
301 current taxonomic classification, we proceeded with the assumption that these genome
302 assemblies reflected the actual diversity of strains and thus included the assemblies for the
303 primer design.
304 Two test cases were generated to exemplify the influence of the input genome assemblies on the
305 pipeline results. Firstly, a single genome assembly with a wrong taxonomic label was used as
306 input in addition to the correctly labelled genome assemblies. Introducing a genome assembly
307 with a wrong taxonomic label (GCF_000415325.2, E. faecalis) into the pool of E. faecium
308 genome assemblies resulted in a decrease of identified core genes (from 1131 to 43) and
309 provided no species-specific sequence. Secondly, the genome assembly of the P. acidilactici
310 strain (FAM 18987) that was distinct from the other assemblies in the phylogenetic tree with an
311 ANI to the reference sequence below 90 % was excluded from the pipeline run. This resulted in
312 an increased number of identified core genes (from 921 to 1238), of species-specific sequences
313 (from 54 to 516) and of reported primer pairs (from 2 to 53). The results of the two test cases
314 illustrate that the SpeciesPrimer pipeline performs best on closely related genome assemblies
315 with a good overall quality.
316 In silico validation
317 Two parameters were selected as criteria for the primer validation using web-based BLAST.
318 First, the BLAST hits for the predicted PCR product sequence should only match the target
319 species. If sequences of other bacterial species matched to parts of the sequence, the
320 corresponding primer pairs were discarded, unless more than three mismatches were found in
321 each primer-binding region for the forward and reverse primers. Second, the primer binding sites
322 in the target sequences were not allowed to have mismatches in the 3’-end region. The criterion
323 for the primer validation by Primer-BLAST was that no predicted PCR products for other
324 bacterial species had been reported by Primer-BLAST. Primer pairs exclusively binding to the
325 target sequence of the target species were classified as specific. The results of the in silico
326 validation are summarized in Table 4. With the exception of Ec_faeca_g3060_1_P0 and
327 Ec_faeci_cysS_3_P1, all primer pairs showed a perfect match to their target sequences. For
328 primer pair Ec_faeca_g3060_1_P0, the first three nucleotides of one sequence out of 690, are
329 missing in the forward primer-binding region. For Ec_faeci_cysS_3_P1, only one sequence out
330 of 1058 aligned sequences showed a single nucleotide transition in the reverse primer-binding
331 region (Data S3).
332 In vitro validation
333 The specificity of the qPCR assays was assessed with 21 to 25 strains of the target species
334 (inclusivity) and 121 non-target bacterial strains found in dairy products (exclusivity). The qPCR
335 assay performance was assessed by 10-fold dilution series of type strain DNA from 106 to 101
336 copies per reaction. The results of the qPCR runs were interpreted as positive if both qPCR
337 reactions (duplicates) reached the fluorescent threshold before quantification cycle 35 and the
338 peak of the melting curve analysis was above the peak calling threshold (-2 dF/dT). A summary
339 of the results is shown in Table 5. The primer sequences can be found in Table S1. The
340 inclusivity of the qPCR assays was 100 % for the assays Ec_faeca_acuI, Ec_faeca_g3060,
341 Ec_faeci_cysS, Pd_acidi_asnS, Pd_acidi_g1164, Pd_pento_nagK and Pd_pento_g4364. Only
342 one qPCR assay, Ec_faeca_purD was negative for one of the tested target strains.
343 Out of the 121 non-target strains analyzed to determine the exclusivity of the qPCR assays
344 (Figure 2), all strains were negative for Ec_faeca_acuI and Pd_acidi_asnS. Both assays targeting
345 E. faecium were positive solely for one L. fermentum strain (FAM 20347). Later it was found
346 that the stock culture of this strain was contaminated with an E. faecium strain (data not shown).
347 The assay Pd_pento_nagK targeting P. pentosaceus was positive for two out of three tested
348 Leuconostoc lactis strains, the fluorescence signal reached the threshold after Cq 26, and the
349 melting curve analysis showed a peak at 85 °C, while the positive control samples for this assay
350 displayed a peak at 83.5 °C. Nine out of the 121 non-target strains were positive for the
351 Ec_faeca_g3060 qPCR assay, for these samples the fluorescence signals reached the threshold
352 after Cq 26 and had a melting curve peak at a higher temperature than the target PCR product.
353 The assays Pd_acidi_g1164 and Pd_pento_g4364 were positive for five and eight non-target
354 strains, respectively. Notably, all three tested Lactobacillus paracasei strains were positive for
355 the Pd_acidi_g1164 assay, the fluorescence signal reached the threshold around Cq 21 and 22
356 and they showed a distinct melting curve peak at 86 °C.
357 The qPCR assays displayed linear results between 101 and 106 genome copies per reaction. The
358 calculated efficiency of the qPCR assays was between 92 and 100 %. The linear regression
359 equations ( ) had slopes between -3.329 and -3.523 and 𝐶𝑞= 𝑠𝑙𝑜𝑝𝑒 ∗ log (𝑐𝑜𝑝𝑖𝑒𝑠) + 𝑖𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡360 correlation coefficients of 0.990 or above.
361 Discussion
362 After setup of the SpeciesPrimer docker container, the download of the local BLAST database
363 and the selection of the SpeciesPrimer run settings, no further manual handling was required to
364 get primer pair candidates for all four bacterial species after a total time of 44 hours and 30
365 minutes. The number of input genomes and subsequently the number of retrieved primer pairs
366 for the specificity check have the highest impact on speed. During the specificity check, blasting
367 the primer sequences optimized for short sequences (blastn-short) and the subsequent
368 compilation and indexing of the non-target sequence database are the most time consuming
369 steps.
370 The results of the SpeciesPrimer pipeline for the four target species ranged from two to 160
371 identified primer pair candidates. Several factors can influence the number of identified primer
372 pairs, such as the quality of the input genome assemblies, assemblies with wrong taxonomic
373 labels and the genetic diversity within the species. A low-quality assembly with missing
374 sequences or contaminations can decrease the number of identified core genes. The initial quality
375 control helps to minimize the risk that such assemblies are included in the pipeline runs.
376 However, also an increased sequence diversity, either due to sequencing errors, assembly errors
377 or real diversity, limits the number and the length of identified conserved sequences.
378 Subsequently this affects the yield of reported primer pairs, since the pipeline selects highly
379 conserved sequences for primer design.
380 The specificity of the designed primers was evaluated in silico by BLAST with a more extensive
381 database (RefSeq Genome) than the one used for the specificity check during primer design. The
382 validation showed that the specificity of the tested amplicons was high and no other species than
383 the target species had an identical sequence. Most target sequences in the database showed a
384 perfect match for the primers in the primer-binding region. For all tested primer pairs, solely the
385 expected PCR products for the target species and no amplicons for other sequences were
386 predicted by Primer-BLAST. The results of Primer-BLAST indicate that the reported primer
387 pairs were very specific, even though the species list used for the specificity evaluation during
388 primer design covered only 259 non-target species.
389 In this work, 21 to 25 target strains for each target species and 121 non-target strains have been
390 tested to assess inclusivity and exclusivity of the qPCR assays, respectively. The in vitro
391 validation of primer pairs has shown that the in silico validation is not always able to predict
392 non-target PCR products. The fluorescence signals occurring at late quantification cycles (Cq >
393 30) are probably due to PCR products with suboptimal primer binding. Testing the qPCR assays
394 in mixtures and communities could be interesting to assess if these PCR products also
395 accumulate in presence of target DNA. The specificity could be sufficient in mixtures due to
396 competition for the primers and the difference in primer binding and amplification efficiency.
397 For many research applications, qPCR assays with a low signal in negative samples are
398 acceptable, assuming that low-level signals can be distinguished from low concentrations of
399 target species DNA by the melting curve analysis. Further, for many applications, the annealing
400 temperature can be optimized by empirical determination of a suitable annealing temperature and
401 the primer concentration can be adjusted to improve the specificity of the assay (www.bio-
402 rad.com/en-ch/applications-technologies/qpcr-assay-design-optimization). We did not try to
403 optimize our assays with these measures, because the aim was to design primers for high-
404 throughput qPCR, requiring the exact same PCR conditions. For the tested qPCR conditions the
405 most specific qPCR assays were Ec_faeca_acuI (E. faecalis), Ec_faeci_cysS (E. faecium),
406 Pd_acidi_asnS (P. acidilactici) and Pd_pento_nagK (P. pentosaceus). Further work will be
407 necessary in order to make these qPCR assays fully operational for the quantification of bacteria
408 in a complex system such as food. For instance, suitable qPCR standards should be designed and
409 validated, so that the limit of detection of each assay can be determined.
410 Primer-BLAST and RUCS allow designing primers for different applications, but demand
411 experience and manual manipulations. Primer-BLAST designs primers and performs specificity
412 checks, but requires a user provided target sequence. In the case of RUCS manual manipulation
413 and experience is needed to prepare the positive and negative reference sets. Compared to
414 primer-BLAST and RUCS, the task SpeciesPrimer performs is really specialized, namely to
415 design primers for species-specific sequences. In contrast, SpeciesPrimer requires no previous
416 knowledge about the input genome assemblies and no manual manipulation of sequences has to
417 be performed. The ability of SpeciesPrimer to run on standard computers with good performance
418 instead of specialized high-performance computers, will hopefully allow primer design for a
419 wide range of scientists. Docker containers simplify the installation procedure and should allow
420 non-bioinformaticians to setup and use the SpeciesPrimer pipeline.
421 Conclusions
422 In this work, we presented the SpeciesPrimer pipeline, which is a fully automated pipeline from
423 the download of bacterial genomes, the identification of conserved species-specific core genes to
424 primer design and subsequent quality control of primer candidates. Primers for four bacterial
425 species were designed and validated and have shown to perform adequately under the same
426 qPCR conditions.
427 A standard computer with good performance, good quality genome assemblies, a local copy of
428 the nt BLAST database and a list of non-target bacterial species are the only requirements for
429 primer design with SpeciesPrimer. A complete image, of a Linux OS with all dependencies and
430 the pipeline scripts, is available from Dockerhub. To simplify primer design for users not
431 familiar with command line tools, a graphic user interface is provided in the latest version of
432 SpeciesPrimer. SpeciesPrimer facilitates efficient primer design for species-specific
433 quantification, paving the way for a fast and accurate quantitative investigation of microbial
434 communities.
435 Acknowledgements
436 We would like to thank Marco Meola and Remo Schmidt for critically reading the manuscript
437 and many helpful comments and Daniel Marzohl, Nadine Sidler, Elvira Wagner and Kotchanoot
438 Srikham for their valuable technical help.
439 References
440 Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol
441 215:403-410. 10.1016/s0022-2836(05)80360-2
442 Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, and de
443 Hoon MJ. 2009. Biopython: freely available Python tools for computational molecular biology and
444 bioinformatics. Bioinformatics 25:1422-1423. 10.1093/bioinformatics/btp163
445 Cremonesi P, Pisani LF, Lecchi C, Ceciliani F, Martino P, Bonastre AS, Karus A, Balzaretti C, and Castiglioni B. 2014.
446 Development of 23 individual TaqMan(R) real-time PCR assays for identifying common foodborne
447 pathogens using a single set of amplification conditions. Food Microbiol 43:35-40.
448 10.1016/j.fm.2014.04.007
449 Curran T, Coyle PV, McManus TE, Kidney J, and Coulter WA. 2007. Evaluation of real-time PCR for the detection
450 and quantification of bacteria in chronic obstructive pulmonary disease. FEMS Immunol Med Microbiol
451 50:112-118. 10.1111/j.1574-695X.2007.00241.x
452 Falentin H, Postollec F, Parayre S, Henaff N, Le Bivic P, Richoux R, Thierry A, and Sohier D. 2010. Specific metabolic
453 activity of ripening bacteria quantified by real-time reverse transcription PCR throughout Emmental
454 cheese manufacture. Int J Food Microbiol 144:10-19. 10.1016/j.ijfoodmicro.2010.06.003
455 Garrido-Maestu A, Azinheiro S, Carvalho J, and Prado M. 2018. Rapid and sensitive detection of viable Listeria
456 monocytogenes in food products by a filtration-based protocol and qPCR. Food Microbiol 73:254-263.
457 10.1016/j.fm.2018.02.004
458 Hermann-Bank ML, Skovgaard K, Stockmarr A, Larsen N, and Molbak L. 2013. The Gut Microbiotassay: a high-
459 throughput qPCR approach combinable with next generation sequencing to study gut microbial diversity.
460 BMC Genomics 14:788. 10.1186/1471-2164-14-788
461 Ishii S, Segawa T, and Okabe S. 2013. Simultaneous quantification of multiple food- and waterborne pathogens by
462 use of microfluidic quantitative PCR. Appl Environ Microbiol 79:2891-2898. 10.1128/aem.00205-13
463 Kleyer H, Tecon R, and Or D. 2017. Resolving Species Level Changes in a Representative Soil Bacterial Community
464 Using Microfluidic Quantitative PCR. Front Microbiol 8:2017. 10.3389/fmicb.2017.02017
465 Löytynoja A. 2014. Phylogeny-aware alignment with PRANK. In: Russell DJ, ed. Multiple Sequence Alignment
466 Methods. Totowa, NJ: Humana Press, 155-170.
467 Masco L, Vanhoutte T, Temmerman R, Swings J, and Huys G. 2007. Evaluation of real-time PCR targeting the 16S
468 rRNA and recA genes for the enumeration of bifidobacteria in probiotic products. Int J Food Microbiol
469 113:351-357. http://dx.doi.org/10.1016/j.ijfoodmicro.2006.07.021
470 Moyaert H, Pasmans F, Ducatelle R, Haesebrouck F, and Baele M. 2008. Evaluation of 16S rRNA Gene-Based PCR
471 Assays for Genus-Level Identification of Helicobacter Species. J Clin Microbiol 46:1867-1869.
472 10.1128/jcm.00139-08
473 Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, Fookes M, Falush D, Keane JA, and Parkhill J. 2015.
474 Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691-3693.
475 10.1093/bioinformatics/btv421
476 Postollec F, Falentin H, Pavan S, Combrisson J, and Sohier D. 2011. Recent advances in quantitative PCR (qPCR)
477 applications in food microbiology. Food Microbiol 28:848-861.
478 http://dx.doi.org/10.1016/j.fm.2011.02.008
479 Price MN, Dehal PS, and Arkin AP. 2010. FastTree 2--approximately maximum-likelihood trees for large alignments.
480 PLoS One 5:e9490. 10.1371/journal.pone.0009490
481 Qu W, Zhou Y, Zhang Y, Lu Y, Wang X, Zhao D, Yang Y, and Zhang C. 2012. MFEprimer-2.0: a fast thermodynamics-
482 based program for checking PCR primer specificity. Nucleic Acids Res 40:W205-208. 10.1093/nar/gks552
483 Ramirez M, Castro C, Palomares JC, Torres MJ, Aller AI, Ruiz M, Aznar J, and Martin-Mazuelos E. 2009. Molecular
484 detection and identification of Aspergillus spp. from clinical samples using real-time PCR. Mycoses 52:129-
485 134. 10.1111/j.1439-0507.2008.01548.x
486 Rice P, Longden I, and Bleasby A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends in
487 Genetics 16:276-277. https://doi.org/10.1016/S0168-9525(00)02024-2
488 Sayers E. 2009. The E-utilities In-Depth: Parameters, Syntax and More. Available at
489 https://www.ncbi.nlm.nih.gov/books/NBK25499/.
490 Scheirlinck I, Van der Meulen R, De Vuyst L, Vandamme P, and Huys G. 2009. Molecular source tracking of
491 predominant lactic acid bacteria in traditional Belgian sourdoughs and their production environments. J
492 Appl Microbiol 106:1081-1092. 10.1111/j.1365-2672.2008.04094.x
493 Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068-2069.
494 10.1093/bioinformatics/btu153
495 Shen Z, Qu W, Wang W, Lu Y, Wu Y, Li Z, Hang X, Wang X, Zhao D, and Zhang C. 2010. MPprimer: a program for
496 reliable multiplex PCR primer design. BMC Bioinformatics 11:143. 10.1186/1471-2105-11-143
497 Tange O. 2011. GNU Parallel - The Command-Line Power Tool. login: The USENIX Magazine 36:42-47.
498 Thomsen MCF, Hasman H, Westh H, Kaya H, and Lund O. 2017. RUCS: rapid identification of PCR primers for
499 unique core sequences. Bioinformatics 33:3917-3921. 10.1093/bioinformatics/btx526
500 Torriani S, Felis GE, and Dellaglio F. 2001. Differentiation of Lactobacillus plantarum, L. pentosus, and L.
501 paraplantarum by recA gene sequence analysis and multiplex PCR assay with recA gene-derived primers.
502 Appl Environ Microbiol 67:3450-3454. 10.1128/aem.67.8.3450-3454.2001
503 Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, and Rozen SG. 2012. Primer3--new
504 capabilities and interfaces. Nucleic Acids Res 40:e115. 10.1093/nar/gks596
505 Vijaya Satya R, Kumar K, Zavaljevski N, and Reifman J. 2010. A high-throughput pipeline for the design of real-time
506 PCR signatures. BMC Bioinformatics 11:340. 10.1186/1471-2105-11-340
507 Wang LT, Lee FL, Tai CJ, and Kasai H. 2007. Comparison of gyrB gene sequences, 16S rRNA gene sequences and
508 DNA-DNA hybridization in the Bacillus subtilis group. Int J Syst Evol Microbiol 57:1846-1850.
509 10.1099/ijs.0.64685-0
510 Ye J, Coulouris G, Zaretskaya I, Cutcutache I, Rozen S, and Madden TL. 2012. Primer-BLAST: a tool to design target-
511 specific primers for polymerase chain reaction. BMC Bioinformatics 13:134. 10.1186/1471-2105-13-134
512 Yoon SH, Ha SM, Lim J, Kwon S, and Chun J. 2017. A large-scale evaluation of algorithms to calculate average
513 nucleotide identity. Antonie Van Leeuwenhoek 110:1281-1286. 10.1007/s10482-017-0844-4
514 Zuker M, Mathews DH, and Turner DH. 1999. Algorithms and Thermodynamics for RNA Secondary Structure
515 Prediction: A Practical Guide. In: Barciszewski J, and Clark BFC, eds. RNA Biochemistry and Biotechnology.
516 Dordrecht: Springer Netherlands, 11-43.
517
Table 1(on next page)
Overview of the SpeciesPrimer pipeline workflow and the used software.
Pipeline workflow Tools Reference
Input genome assemblies
- download NCBI Entrez (Biopython) (Cock et al. 2009; Sayers 2009)
- annotation Prokka (Seemann 2014)
- quality control BLAST+ (Altschul et al. 1990)
Core gene sequences
- identification Roary (Page et al. 2015)
- phylogeny FastTree 2 (Price et al. 2010)
- selection of conserved
sequences
Prank
consambig (EMBOSS)
GNU parallel
(Löytynoja 2014)
(Rice et al. 2000)
(Tange 2011)
- evaluation of specificity BLAST+ (Altschul et al. 1990)
Primer
- design Primer3 (Untergasser et al. 2012)
- quality control BLAST+,
MFEPrimer 2.0,
MPprimer,
Mfold
(Altschul et al. 1990)
(Qu et al. 2012)
(Shen et al. 2010)
(Zuker et al. 1999)
1
Table 2(on next page)
Pipeline input and run statistics.
Species E. faecalis E. faecium P. acidilactici P. pentosaceus
Pipeline input
NCBI genomes 390 575 9 14
ACC genomes 0 0 118 13
Total genome assemblies 390 575 127 27
Download and annotation
(h:min)9:04 12:27 1:55 0:24
Pipeline statistics
Running time (h:min) 6:11 8:05 1:55 4:25
Core genes 1375 1131 921 1341
Single copy core genes 632 563 641 889
Conserved sequences 1128 624 566 2782
Species-specific sequences 329 36 54 672
Potential primer pairs 89 4 7 632
Primer pairs after QC 15 2 2 160
1 QC: primer quality control, ACC: Agroscope culture collection
Table 3(on next page)
Summarized results of the average nucleotide identity (ANI) calculations.
E. faecium
main cluster
E. faecium
subcluster
P. acidilactici
main cluster
P. acidilactici
subcluster
Assemblies 530 44 125 1
ANI (%) 89.21
average 99.43 94.67 98.28
maximum 99.86 95.60 98.83
minimum 98.19 94.15 96.88
1
Table 4(on next page)
Summary of the in silico validation of the selected primer pairs.
Target species Primer pair PPCBLAST
(perfect/total)
primer-BLAST
(perfect/total)
Ec_faeca_acuI_1_P0 100 specific (694/694) specific (24/24)E. faecalis
Ec_faeca_g3060_1_P0 100 specific (689/690) specific (24/24)
Ec_faeci_cysS_3_P1 96.7 specific (1057/1058) specific (63/63)E. faecium
Ec_faeci_purD_2_P0 93.3 specific(1083/1083) specific (63/63)
Pd_acidi_asnS_2_P0 90.1 specific (19/19) specific (5/5)P. acidilactici
Pd_acidi_g1164_1_P0 93.3 specific (23/23) specific (5/5)
Pd_pento_nagK_1_P0 100 specific (15/15) specific (7/7)P. pentosaceus
Pd_pento_g4364_1_P0 100 specific (15/15) specific (7/7)
1
Table 5(on next page)
Summarized results of the in vitro validation of the selected qPCR assays.
Inclusivity: Number of positive DNA samples / total number of target species DNA samples.Exclusivity: Number of DNA samples showing a fluorescence signal below quantification cycle35 and a melting curve peak above the threshold / total number of non-target DNA samples.
Calculated efficiency, slope, intercept and correlation coefficient (R2) of the linear regressionequation.
Species E. faecalis E. faecium P. acidilactici P. pentosaceus
Target gene acuI g3060 cysS purD asnS g1164 nagK g4364
Inclusivity 22/22 22/22 25/25 24/25 21/21 21/21 25/25 25/25
Exclusivity 0/121 9/121 0*/121 0*/121 0/121 5/121 2/121 8/121
Efficiency 98 % 97 % 92 % 97 % 99 % 100 % 94 % 92 %
Slope -3.382 -3.387 -3.539 -3.396 -3.356 -3.329 -3.470 -3.523
Intercept 32.107 32.694 32.006 31.051 30.835 30.282 32.286 33.211
R2 0.998 0.997 0.990 0.996 0.997 0.995 0.996 0.997
1 * Contamination of stock culture of the strain FAM20347 with E. faecium.
Figure 1Phylogeny of core gene alignments.
(A) E. faecium. (B) P. acidilactici. The phylogenies are displayed with SeaView(http://doua.prabi.fr/software/seaview) in circular view.
Figure 2qPCR assay quantification cycle heatmap.
Depicted are all tested non-target strains and their average quantification cycle (technicalduplicates). Bars represent results with a melt curve peak above the threshold and a Cqvalue below Cq 35. The gray shades represent the Cq values from 10 to 35 (if no fluorescentsignal was measured the value was set to Cq 35). Abbreviations: A.: Acidipropionici; Cl.:Clostridium; Lb.: Lactobacillus; Ln.: Leuconostoc; Pb.: Propionibacterium; Pd.: Pediococcus;Sc.: Streptococcus; NTC: no template control.