UTAX accurately predicts taxonomy of marker gene sequences
Robert C. Edgar
Independent Investigator
Tiburon, California, USA.
The UTAX algorithm accurately predicts the taxonomy of 16S ribosomal RNA and
other marker gene sequences targeted by next-generation metagenomics
experiments. UTAX has sensitivity comparable to most existing methods but much
lower error rates, predicting dramatically fewer false positives for novel
taxa.
Recent studies using next-generation sequencing of marker gene segments include the
Human Microbiome Project (HMP)1 and a survey of the Arabidopsis root microbiome2. A
fundamental step in such studies is to predict the taxonomy of sequences in the reads,
which are typically clustered into Operational Taxonomic Units (OTUs). Computational
taxonomy prediction is complicated by the fact that only a small minority of microbial
species have authoritative classifications and reference databases have sparse coverage so
that in practice, an OTU often does not have an exact match in the database (Supp. Note 3).
With the goal of improving taxonomy prediction accuracy, I developed a new algorithm,
UTAX, that accounts for sparseness in the database and for varying correlations between
rank and sequence identity in different groups. UTAX calculates a novel score combining k-
mer distances to the top hit and to the nearest neighbor at each rank, i.e. the most similar
sequence with a different name at that rank. For each rank, the probability that the query
belongs to the same group as the top hit is calculated from the distribution of scores over
all pairs in the reference database.
Available reference databases for the 16S ribosomal RNA gene (16S) include SILVA3,
Greengenes4 and the RDP Classifier5 (RDP) training set. The current RDP training set (v14,
here called RDP14) contains 10,679 sequences. Greengenes and SILVA are larger, giving
better coverage than RDP14 but not as much as might be expected from the numbers of
sequences (Supp. Note 3). Most of the annotations in SILVA and Greengenes are not
authoritative classifications but predictions generated by a combination of automated and
manual methods6,7 which I estimated to have error rates of ~6% and ~18% respectively
for genus (Supp. Note 4). Also, SILVA and Greengenes are not compatible with some
programs because many sequences lack species and genus names (Supp. Note 5), and I
therefore chose to use RDP14 for comparative validation on 16S and the RDP Warcup
training set8 version 4 (War4) for the fungal internal transcribed spacer (ITS) region.
Given sequences from a biological sample (here called OTUs without necessarily implying
clustering) and a reference database, I defined coverage at each taxonomic rank to be the
fraction of OTUs that belong to a known group. Here, known means that the group is
present in the reference database, regardless of whether the group has been named, and
novel that the group is not present. I defined the lowest known rank (LKR) of an OTU as its
lowest rank having at least one reference sequence and the LKR frequency λr as the fraction
of OTUs having LKR = r. For example, if λgenus = 0.4, then 40% of the OTUs belong to a novel
species in a known genus. LKR frequencies can be interpreted as a profile summarizing
taxonomic novelty in the OTUs with respect to the database. I estimated LKR frequencies
for soil, human gut and mouse gut reads of the 16S V4 region from a recent study9 using
sequence identity thresholds determined by Yarza et al.10: ≥95% for genus, ≥86% for
family, etc. (Supp. Note 7). While identity gives only an approximate indication of rank,
averaging over OTUs for a typical sample should give frequencies that are realistic even if
not accurate for that particular sample. Using RDP14 as a reference and OTUs constructed
by UPARSE11, I estimated the fraction of OTUs with novel genera to be 83% for soil, 63%
for mouse gut and 57% for human gut, showing that coverage is sparse in practice (Fig. 1
and Supp. Notes 7 and 16).
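The LKR estimation step can be sketched as follows. This is an illustrative reconstruction, not the published implementation: only the genus (≥95%) and family (≥86%) cutoffs are stated in the text, so the lower thresholds here are placeholders in the spirit of the Yarza values, and the function names are my own.

```python
# Hypothetical sketch: assign an estimated Lowest Known Rank (LKR) to each OTU
# from its top-hit identity, then tally LKR frequencies (the lambda_r profile).
# Genus/family cutoffs follow the text; order/class/phylum values are placeholders.

YARZA = [  # (rank, minimum % identity of the top hit for that rank to be "known")
    ("genus", 95.0), ("family", 86.0), ("order", 82.0),
    ("class", 78.5), ("phylum", 75.0),
]

def estimated_lkr(top_hit_identity):
    """Lowest rank whose identity threshold the top hit meets."""
    for rank, cutoff in YARZA:
        if top_hit_identity >= cutoff:
            return rank
    return "none"  # below the phylum threshold: no rank estimated as known

def lkr_frequencies(identities):
    """Fraction of OTUs with each estimated LKR (lambda_r)."""
    counts = {}
    for pid in identities:
        r = estimated_lkr(pid)
        counts[r] = counts.get(r, 0) + 1
    n = len(identities)
    return {r: c / n for r, c in counts.items()}

# Toy profile for a handful of invented top-hit identities
freqs = lkr_frequencies([99.0, 96.0, 91.0, 88.0, 84.0, 79.0, 72.0, 97.0])
```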
At high identities and high ranks, an OTU almost certainly belongs to the same group as the
top database hit, and at low ranks and low identities, an OTU almost certainly belongs to a
different group. The most challenging cases occur when identity is close to the average for
the rank, for example attempting to predict genus when identity is ~95%. This is a twilight
zone for taxonomy prediction (Fig. 1) analogous to the twilight zone for protein homology
prediction12. In principle, it might be possible to identify genus-specific sequence features,
but not when reference data is too sparse. For example, almost half (913 / 1,948) of the
genera in RDP14 have only one reference sequence, and in these cases it is impossible to
predict whether a human expert would assign another species to the same group from its
sequence alone. Thus, in the twilight zone, predictions of known genera will often be false
positives while non-predictions will often be false negatives (see also Supp. Note 15).
Identity distributions for typical samples (Fig. 1 and Supp. Fig. SN6.2) show that twilight
zone OTUs are common in practice, underscoring the difficulty of accurate taxonomy
prediction and the importance of providing a confidence estimate. With this in mind, I
designed UTAX to predict the mean number of errors per query (EPQ) for each rank (see
Methods). For testing, I set a threshold of P = (1 – EPQ) ≥ 0.9 on the assumption that ~10%
is an acceptable error rate for a typical study.
The RDP authors measured accuracy using leave-one-out validation5, which I believe is
inappropriate in this context (Supp. Note 6). I used a different strategy that has been
applied to validation of shotgun metagenomics taxonomy prediction13 by constructing
datasets where LKRs are known from trusted annotations, as follows. For k=genus, family ...
phylum I divided RDP14 into two subsets (rank splits) Xk and Yk such that the LKR between
the subsets is k. For example, with LKR = family, I discarded families with only one genus
and randomly assigned the remaining genera to Xfamily or Yfamily with the constraint that at
least one genus from every family must be present in both (Supp. Fig. SN13.1). For each k
and for each region of interest (full-length gene, V4 etc.), I measured prediction
performance for all ranks using Xk as the query and Yk as the reference and vice versa. I
included a null split XN = YN = RDP14 to measure performance when the sequence is known.
I followed the same procedure for War4. For every split at rank k I calculated the following
accuracy metrics for each rank r (see Supp. Note 8 for discussion). Sensitivity (Srk) is the
fraction of known names at rank r that were correctly predicted. The misclassification
error rate (Mrk) is the fraction of known names at rank r that were incorrectly predicted.
The overclassification error rate (Ork) is the fraction of novel r's that were incorrectly
predicted to be known. Given the LKR frequencies λk, the total sensitivity Sensr and errors
per query EPQr at rank r for a set of OTUs can be estimated by assuming that the
sensitivities and error rates at each rank are approximately the same as those measured on
the rank splits:
Sensr = Σk λk Srk, (Eq.1)
EPQr = Σk λk (Ork + Mrk). (Eq.2)
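Eqs. 1 and 2 amount to frequency-weighted sums over the possible LKRs. A minimal sketch, with all numeric values invented for illustration:

```python
# Sketch of Eqs. 1-2: combine per-LKR sensitivities and error rates (as measured
# on rank splits) with estimated LKR frequencies to get total sensitivity and
# errors per query (EPQ) at a rank. All numbers below are illustrative.

def total_sensitivity(lkr_freq, sens):
    # Sens_r = sum_k lambda_k * S_rk
    return sum(lkr_freq[k] * sens[k] for k in lkr_freq)

def errors_per_query(lkr_freq, overclass, misclass):
    # EPQ_r = sum_k lambda_k * (O_rk + M_rk)
    return sum(lkr_freq[k] * (overclass[k] + misclass[k]) for k in lkr_freq)

# Invented values for r = genus. When the LKR is above genus the genus is novel,
# so sensitivity is zero and the overclassification rate O applies instead.
lam = {"genus": 0.4, "family": 0.3, "order": 0.3}   # lambda_k
S = {"genus": 0.9, "family": 0.0, "order": 0.0}     # S_rk
O = {"genus": 0.0, "family": 0.2, "order": 0.3}     # O_rk
M = {"genus": 0.05, "family": 0.0, "order": 0.0}    # M_rk
```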
To obtain sensitivities and error rates for typical data, I used Eqs. 1 and 2 with the
estimated LKR frequencies for the soil, human gut and mouse gut OTUs. While the
frequencies may be inaccurate for those samples, and the sensitivities and error rates for
each LKR in a given set of OTUs may differ somewhat from those measured on the rank
splits, this procedure should nevertheless give good estimates in the sense that they fall
comfortably within the range of true values for typical data in practice, giving a far more
realistic indication of algorithm accuracy than leave-one-out testing (Supp. Note 6).
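The rank-split construction described above (for LKR = family) might be sketched as follows, assuming a simple mapping of families to genera; the actual procedure (Supp. Fig. SN13.1) may differ in detail.

```python
# Illustrative sketch of a rank split at LKR = family: discard families with a
# single genus, then assign genera to X or Y with at least one genus of every
# retained family in each subset. Data structures are assumed for the example.
import random

def family_rank_split(families, seed=0):
    """families: dict family -> list of genus names. Returns (X, Y) genus sets."""
    rng = random.Random(seed)
    X, Y = set(), set()
    for fam, genera in families.items():
        if len(genera) < 2:
            continue  # a single-genus family cannot appear in both subsets
        genera = list(genera)
        rng.shuffle(genera)
        X.add(genera[0])      # guarantee one genus of this family in X...
        Y.add(genera[1])      # ...and one in Y
        for g in genera[2:]:  # assign any remaining genera at random
            (X if rng.random() < 0.5 else Y).add(g)
    return X, Y
```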
Using this method, I compared the accuracy of UTAX with GAST14, RDP and methods
supported by mothur15 and QIIME16 (see Supp. Note 11 for method name abbreviations,
software versions and command lines). Representative results are given in Table 1; the
underlying performance metrics are given in the Supplementary Files and Supp. Note 12.
Mothur-rdp gave very similar results to RDP (Supp. Note 1). The only method to
consistently achieve an estimated EPQgenus below 10% was mothur-knn, but its sensitivity
was also much lower than the other methods (Sensgenus < 40% on all samples). The
estimated EPQgenus of UTAX was ~10% on all three samples, remarkably close to the rate
predicted by the P ≥ 0.9 threshold given that P is calculated by an independent method that
does not use identity thresholds or rank splits (Methods). All other algorithms had
substantially higher EPQgenus, ranging from EPQgenus ~17% for RDP at 80% bootstrap to
QIIME-blast which consistently had the highest error rate (EPQgenus 62% to 78%). The
default QIIME method, QIIME-uc, had EPQgenus = 39% to 45% and QIIME-rdp, which sets the
bootstrap cutoff at 50% by default, had EPQgenus = 36% to 40%. Sensphylum was >90% for all
methods except QIIME-uc (78% on soil, 87% on mouse gut) and QIIME-sm (79% on soil,
87% on mouse gut).
Methods
Given a pair of sequences Q and R, I defined the lowest common rank (LCR) of Q and R to be
the lowest rank where Q and R have the same name. Given a similarity measure d(Q, R, k),
P(LCR=k | d) is the probability that the LCR is k. For example, if d is sequence identity then
P(LCR=phylum | d=93%) will be close to one but P(LCR=genus | d=93%) will be lower.
To obtain a discrete range, UTAX converts a real-valued similarity d taking values zero to
one to an integer percentage D = ⌊100 d⌋. Considering all pairs of sequences in a reference
database B, let the number of pairs with a given D be HD and the number of those pairs with
LCR=k be hD,k. UTAX calculates an a-posteriori estimate for P(LCR=k | D) from B as the
fraction of pairs having distance D which also have LCR=k, i.e.
P(LCR=k | D) ~ hD,k/HD. (Eq.3)
For motivation and visualization of Eq.3 see Supp. Note 9. UTAX calculates the matrix CD,k =
hD,k/HD from B and stores it for use in run-time prediction.
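Eq. 3 can be sketched as a tabulation over reference pairs; here the pairwise similarities and LCR labels are supplied directly, standing in for UTAX's k-mer comparisons and taxonomy lookups.

```python
# Sketch of Eq. 3: over all reference pairs, tabulate the fraction with LCR = k
# at each integer similarity D = floor(100 d). Input pairs are invented.
import math
from collections import defaultdict

def lcr_matrix(pairs):
    """pairs: iterable of (d, lcr) with d in [0, 1]. Returns C[D][k] ~ h_{D,k} / H_D."""
    H = defaultdict(int)                        # H_D: number of pairs at similarity D
    h = defaultdict(lambda: defaultdict(int))   # h_{D,k}: of those, pairs with LCR = k
    for d, k in pairs:
        D = math.floor(100 * d)
        H[D] += 1
        h[D][k] += 1
    return {D: {k: n / H[D] for k, n in h[D].items()} for D in H}

C = lcr_matrix([(0.975, "genus"), (0.978, "genus"), (0.971, "family"),
                (0.935, "family"), (0.932, "order")])
```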
Let P(CR(k) | D) be the probability that two sequences have a common rank at level k, i.e.
have the same name at that rank. Let taxon(Q, k) be the name of Q at rank k. Q and R have
the same name at rank k if their LCR is not > k, hence
P(CR(k) | D)
= P(taxon(Q, k) = taxon(R, k) | D)
= 1 – P(LCR(Q, R) > k | D) = 1 – Σr>k CD,r. (Eq.4)
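Eq. 4 is then the complement of a tail sum of the C matrix over ranks above k. A sketch with an illustrative row: the genus/family/order values echo the Fig. 1 caption at 95% identity, and the class entry is invented to make the row sum to one.

```python
# Sketch of Eq. 4: P(CR(k) | D) = 1 - sum over ranks r above k of C_{D,r}.
# Rank order follows RDP14 (no species rank); the row values are illustrative.

RANKS = ["genus", "family", "order", "class", "phylum"]  # lowest to highest

def p_common_rank(c_row, k):
    """c_row: dict rank -> P(LCR = rank | D). Returns P(CR(k) | D) per Eq. 4."""
    above = RANKS[RANKS.index(k) + 1:]                   # ranks r > k
    return 1.0 - sum(c_row.get(r, 0.0) for r in above)   # 1 - sum_{r>k} C_{D,r}

row = {"genus": 0.34, "family": 0.33, "order": 0.23, "class": 0.10}
```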
Thus, given a reference sequence R and an integer similarity D, Eq.4 gives the probability
that the taxon name of Q is the same as R at rank k. This gives a framework for constructing
a taxonomy prediction algorithm based on a similarity measure d. Natural choices for d
include identity calculated from an alignment or a word-counting distance. However, these
would not take into account that the correlation varies in different groups due to differing
evolutionary rates and lumping or splitting by taxonomists. I therefore also considered the
similarity of a reference sequence R with its nearest neighbor NNk(R) for each k, i.e. the
sequence in B with highest similarity to R and a different name at rank k. If NNk(R) is close
to R, then the confidence that taxon(Q, k) = taxon(R, k) should be reduced because of the
increased likelihood that taxon(Q, k) = taxon(NNk(R), k). I chose to use similarities
calculated from the set w8(Q) of 8-mers in Q. I defined the unique word similarity (U) of a
pair of sequences Q and R as
U(Q, R) = |w8(Q) ∩ w8(R)|/min(|w8(Q)|, |w8(R)|). (Eq.5)
I designed a similarity measure (dUTAX) that increases with higher similarity between Q and
R, decreases with higher similarity between R and NNk(R), and takes real values between
zero and one,
dUTAX(Q, R, k) = (2 U(Q, R) – U(R, NNk(R)))/2. (Eq.6)
(See Supp. Note 14 for comparison with other measures). Given a query sequence Q, UTAX
identifies the top hit T by unique word similarity, i.e. T = argmaxR { U(Q, R), R ∈ B }. The
rank names of Q are predicted to be the same as those of T with probabilities calculated by
Eq.4 using the dUTAX similarity measure.
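A minimal sketch of Eqs. 5-6 and top-hit selection: the 8-mer set similarity U, the nearest-neighbor-penalized score dUTAX, and the argmax over a reference set. Helper names are my own; the real UTAX precomputes U(R, NNk(R)) over the database and converts dUTAX to probabilities via Eq. 4.

```python
# Sketch of Eqs. 5-6 and T = argmax_R U(Q, R). Sequences are plain strings.

def w8(seq):
    """Set of 8-mers in a sequence (the word set of Eq. 5)."""
    return {seq[i:i + 8] for i in range(len(seq) - 7)}

def U(q, r):
    """Unique word similarity, Eq. 5."""
    wq, wr = w8(q), w8(r)
    return len(wq & wr) / min(len(wq), len(wr))

def d_utax(q, r, nn_sim):
    """Eq. 6. nn_sim = U(R, NN_k(R)), the similarity of R to its nearest
    neighbor with a different name at rank k, assumed precomputed."""
    return (2 * U(q, r) - nn_sim) / 2

def top_hit(q, refs):
    """T = argmax over R in the reference set of U(Q, R)."""
    return max(refs, key=lambda r: U(q, r))
```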
Figures and tables
Fig. 1. Estimated Lowest Known Ranks (LKRs) for soil OTUs. The upper graph shows
lowest common rank (LCR) probabilities as a function of sequence identity calculated for
the V4 region of RDP14 (using Eq.3, see also Supp. Note 9). The lower histogram shows
frequencies of integer-rounded sequence identities of top hits of OTUs to the RDP14
database. Histogram bars are colored to indicate estimated LKRs according to the Yarza
thresholds. While identity thresholds are not reliable indicators of rank, the fraction of
OTUs in a Yarza identity range nevertheless gives a realistic indication of how many OTUs
with the corresponding LKR might be found in a similar sample. The "twilight zone" is a
region around 95% identity where high sensitivity for genus prediction cannot be achieved
without high false positive rates because if the closest reference sequence has ~95%
identity, then it is unlikely that there are enough training examples to identify genus-
specific sequence features, and identity correlates only approximately with taxonomic
rank, noting e.g. that P(LCR=genus | 95%) = 0.34, P(LCR=family | 95%) = 0.33 and
P(LCR=order | 95%) = 0.23.
Table 1. Estimated accuracy for soil, mouse gut and human gut OTUs. The table shows
estimated sensitivity and errors per query (EPQ) for genus and phylum predictions,
expressed as percentages. Error rates >10% are highlighted yellow and >30% magenta.
Genus sensitivities <50% are highlighted magenta and phylum sensitivities <90% yellow.
Results for UTAX are shown for threshold P≥0.9. Results for RDP are shown with 80%
bootstrap cutoff (recommended by the authors) and 50% bootstrap (the default for QIIME-
rdp).
References
1. HMP Consortium. Structure, function and diversity of the healthy human microbiome.
Nature 486, 207–214 (2012).
2. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature
488, 86–90 (2012).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and
aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35,
7188–7196 (2007).
4. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
5. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid
assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ.
Microbiol. 73, 5261–7 (2007).
6. Yilmaz, P. et al. The SILVA and ‘all-species Living Tree Project (LTP)’ taxonomic
frameworks. Nucleic Acids Res. 42, (2014).
7. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological
and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012).
8. Deshpande, V. et al. Fungal identification using a Bayesian classifier and the Warcup
training set of internal transcribed spacer sequences. Mycologia (2015).
doi:10.3852/14-293
9. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D.
Development of a dual-index sequencing strategy and curation pipeline for analyzing
amplicon sequence data on the miseq illumina sequencing platform. Appl. Environ.
Microbiol. 79, 5112–5120 (2013).
10. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea
using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
11. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads.
Nat. Methods 10, 996–8 (2013).
12. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
13. Patil, K. R. et al. Taxonomic metagenome sequence assignment with structured output
models. Nat. Methods 8, 191–2 (2011).
14. Huse, S. M. et al. Exploring microbial diversity and taxonomy using SSU rRNA
hypervariable tag sequencing. PLoS Genet. 4, e1000255 (2008).
15. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-
supported software for describing and comparing microbial communities. Appl. Environ.
Microbiol. 75, 7537–41 (2009).
16. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing
data. Nat. Methods 7, 335–6 (2010).
Author contributions
R.C.E. conceived of the study, performed the analysis and wrote the manuscript.
UTAX accurately predicts taxonomy of marker gene sequences: Supplementary Notes
Note 1. Mothur-rdp is effectively equivalent to RDP.
Note 2. Genus predictions for the Soil86 set.
Note 3. Coverage of SILVA and Greengenes.
Note 4. Error rates of SILVA and Greengenes taxonomy annotations.
Note 5. Reference database compatibility.
Note 6. Leave-one-out and leave-10%-out validation.
Note 7. Estimated LKR frequencies for in vivo samples.
Note 8. Accuracy metrics for taxonomy prediction.
Note 9. Calculation of LCR probabilities and S/E.
Note 10. Compute time and memory use of the tested methods.
Note 11. Software versions and command lines.
Note 12. Performance metrics on RDP14 and War4.
Note 13. Construction of a rank split.
Note 14. Sensitivity/EPQ plots for similarity measures.
Note 15. Non-predictions and blank names.
Note 16. LKR estimates and OTU error rates.
Supplementary References
Note 1. Mothur-rdp is effectively equivalent to RDP.
I compared the predictions of mothur-rdp and RDP for all rank splits of the RDP14 V4
region. At a bootstrap cutoff of 80%, 241,585 taxon names were predicted by one or both
algorithms. Of these, 234,877 (97%) were identical. At 50% bootstrap, 280,334
/ 291,705 (96%) were identical. A rate of disagreement of 3 to 4% is consistent with
differences due to the use of random numbers in the bootstrapping procedure. I concluded
that mothur-rdp and RDP are effectively equivalent implementations of the same algorithm
and did not consider mothur-rdp separately for the rest of this work.
Note 2. Genus predictions for the Soil86 dataset.
Method Genus predictions
UTAX 0
QIIME-uc 3 (0.1%)
QIIME-sm 19 (0.5%)
mothur-knn 283 (8%)
RDP (80% bootstrap) 561 (15%)
RDP (50% bootstrap) 942 (26%)
GAST 3,048 (84%)
QIIME-blast 3,637 (89%)
Table SN2.1. Genus predictions on the Soil86 dataset. Soil86 contains 3,637 UPARSE
OTU sequences from the soil sample of Kozich et al.1 with ≤86% identity to the RDP14
reference database, suggesting a lowest known rank of order or higher. Few genus
predictions would be expected for this set considering that the Yarza threshold is 95% for
genus, but some methods predicted many genera, the most by QIIME-blast which predicted
genera for 89% of the sequences.
Note 3. Coverage of the Greengenes and SILVA databases.
The Greengenes2 and SILVA3 reference databases are larger than RDP14: Greengenes
v.13.5 has 1.3 × 10^6 sequences with taxonomy annotations and SILVA v123 has 1.8 × 10^6,
compared to 1 × 10^4 for RDP14. The default database for the QIIME methods is a subset of
Greengenes (GG-QIIME, 99,322 sequences) obtained by clustering at 97% identity, while
one of several suggested databases for use with mothur is a subset of SILVA (SILVA-
mothur, 172,418 sequences) [http://www.mothur.org/wiki/Taxonomy_outline, retrieved
12th Dec 2015]. The GG-QIIME and SILVA-mothur databases thus have an order of
magnitude more annotated sequences than RDP14 and a priori might provide a better
reference set for compatible algorithms, noting that RDP is not compatible because it
requires names for the lowest rank for all training sequences (Note 5) while most
sequences in GG-QIIME and SILVA-mothur lack species and genus names.
Fig. SN3.1 shows the identity distributions for soil OTUs against GG-QIIME and SILVA-
mothur compared with RDP14, showing that GG-QIIME and SILVA-mothur have less sparse
coverage than RDP14 though there are still many OTUs with estimated LKR>genus.
Coverage is less sparse in the sense there are more OTUs with high identities / fewer with
low identities, and this gives the appearance of more known ranks. However, while almost
all ranks are named in RDP14 (Note 11), the interpretation of lowest known rank is
different for GG-QIIME and SILVA-mothur where most sequences lack names for low ranks,
so "known" (present in the database) does not necessarily imply "named". Most
annotations in those databases were predicted using sequence analysis methods, so
"named" does not imply "authoritatively named" by conventional standards. I estimate the
genus annotation error rate to be ~6% for SILVA-mothur and ~18% for GG-QIIME (Note
4). It is therefore difficult to assess whether using one of the larger databases improves or
degrades prediction accuracy for a given algorithm compared to using RDP14, but
especially in the case of GG-QIIME it appears that the annotation error rate of the database
may be high enough to substantially degrade prediction performance, noting that
annotation errors of the database will be compounded by the inherent error rate of a
prediction algorithm, and confidence will be systematically overestimated because the
database error rate is not considered.
Fig. SN3.1. Identity distributions of soil OTUs. Histograms show frequencies of integer-
rounded sequence identities of top hits of OTUs to RDP14, GG-QIIME (the subset of
Greengenes which is the default reference in QIIME) and SILVA-mothur, one of the
reference databases provided for use by mothur. Colors indicate estimated lowest known
ranks according to the Yarza thresholds (see main text for methods).
Note 4. Error rates of Greengenes and SILVA taxonomy annotations.
Henri Poincaré famously described mathematics as the art of giving the same name to
different things4. In taxonomy, this is a bad idea.
Most taxonomy annotations in Greengenes and SILVA databases were predicted for
uncultured sequences using a combination of automated and manual methods5,6. I don't
fully understand their guiding principles or exactly how they were implemented, but
presumably they work something like the following. The starting point is a set of sequences
obtained from authoritatively classified organisms (gold-standard sequences). Other
annotations are made using a predicted phylogenetic tree. If a non-gold sequence is in the
same subtree as a gold sequence at a given rank, the name at that rank is inferred to be the
same.
To the best of my knowledge, neither Greengenes nor SILVA documents which sequences
were used as gold standards or the evidence supporting a given annotation (is it a gold
standard sequence? an automated prediction? an automated prediction which was
manually adjusted, and if so why?), making the reliability of any given annotation difficult
to evaluate or verify independently.
There are several differences in taxonomic nomenclatures and procedures for reconciling
conflicts between taxonomy and sequence evidence. Greengenes is based on the NCBI
taxonomy, RDP14 on Bergey's7 and SILVA on LSPN8. While RDP14 strictly adheres to
Bergey's to the best of my knowledge, Greengenes and SILVA modify their base taxonomies
to address inconsistencies with phylogenies determined from sequence. For example,
Greengenes deletes the genera Escherichia and Shigella, which are believed to overlap9,
leaving their sequences classified to family level only (Enterobacteriaceae). SILVA deals
with this issue in a different way by defining a combined genus (Escherichia-Shigella) and
retaining well-known species names such as Escherichia coli, while Greengenes leaves their
species names blank.
Both databases maintain large multiple alignments of 16S sequences, many of which have
incorrect and ambiguous bases and some of which are undetected chimeras10. The
Greengenes alignment is fixed at 7,682 columns using the NAST approach2 which
intentionally introduces misalignments (i.e., errors) to avoid increasing the number of
columns. Construction of RNA alignments is challenging, especially for large and diverse
datasets, and the best current alignment algorithms have substantial error rates when
challenged with highly diverged sequences11. Perfect tree inference from a sequence
alignment is generally not possible due to alignment errors and information loss12. Tree
construction error rates are difficult to estimate but can be substantial on large datasets13.
Given these issues, it is plausible that the Greengenes and SILVA trees could have
substantial error rates, raising the question whether these, perhaps together with other
imperfections in their annotation methods, have caused substantial numbers of taxonomy
annotation errors. This cannot be assessed directly because the ground truth is not known.
Instead, I identified errors by noting that annotations for identical sequences should agree,
so if two databases have different annotations for the same sequence then one or both of
them must be wrong.
Implementing this analysis is complicated by the fact that the databases use taxonomic
systems with different sets of names. Another complication is the interpretation of blank
names. Does a blank name indicate assignment to a sub-tree that has not been named, that
a name cannot be assigned due to overlapping named groups (like Escherichia-Shigella), or
low confidence in a prediction (i.e., the name might be known, or there are two candidate
known names which do not overlap but which are hard to distinguish)? (see also Note 15).
In consideration of these issues, I counted only names used by both systems (common
names), excluding names which do not correspond to clades such as unclassified,
uncultured, candidatus and incertae sedis. If one or both names were blank, the pair was not
counted.
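The counting rule might be sketched as follows, with assumed input structures (the actual databases are distributed in their own formats, and the real analysis works per rank from full lineage strings).

```python
# Hypothetical sketch of the Note 4 comparison: for identical sequences annotated
# in two databases, count agreements and disagreements at one rank, restricted to
# names used by both taxonomies and skipping blanks and non-clade labels.

SKIP = {"", "unclassified", "uncultured", "candidatus", "incertae sedis"}

def agreement(annot_a, annot_b, common_names):
    """annot_a/b: dict seq_id -> name at some rank. common_names: lowercase set.
    Returns (same, different) counts over shared sequence ids."""
    same = diff = 0
    for sid in annot_a.keys() & annot_b.keys():
        a, b = annot_a[sid].lower(), annot_b[sid].lower()
        if a in SKIP or b in SKIP:
            continue  # blank or non-clade label: pair not counted
        if a not in common_names or b not in common_names:
            continue  # name absent from one taxonomy system: not counted
        if a == b:
            same += 1
        else:
            diff += 1
    return same, diff
```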
Results are summarized in Table SN4.1, which shows that SILVA-mothur and GG-QIIME
disagree on 24% of genus annotations and 2% of phylum annotations for identical
sequences. This provides a lower bound on the sum of the annotation error rates for both
databases. The lower bound is achieved when every incorrect annotation is correct in the
other database. It should be rare for annotations to be wrong in both databases by chance
(if errors are random at a rate of ~10%, then ~1% will be wrong in both). Given that
distinctly different methods are used for alignment and tree construction, I would guess
that the errors have low correlation between the databases and the true combined rate is
close to this lower bound.
A pair-wise comparison measures the combined error rate without indicating the relative
rate, i.e. whether one database has a higher or lower error rate than the other. This can be
investigated using pair-wise comparisons with a third database, RDP14. Genus annotation
disagreement rates with RDP14 are 11% for GG-QIIME and 3% for SILVA-mothur. This
indicates that GG-QIIME has a higher error rate than SILVA-mothur because the error rate
of RDP14 should be roughly the same in both pair-wise comparisons, adding approximately
the same term to both combined rates. Also, all RDP14 sequences have genus annotations
and its much smaller size is more amenable to curation, suggesting that it has a high
frequency of gold-standard sequences and is likely to have a much lower error rate. This
hypothesis is supported by the lower pair-wise disagreements of RDP with the other two
databases. If we assume that the error rate of RDP14 is smaller than the other two
databases, then we can infer that the error rate of GG-QIIME is roughly 11% / 3% ≈ 3× to
4× larger than SILVA-mothur. Assuming a factor of three implies that the total error rates
are 24% × 3/4 = 18% for GG-QIIME and 24% × 1/4 = 6% for SILVA-mothur. While these
estimates are uncertain, the combined rate of 24% is robust and it is reasonable to
conclude that the minimum plausible genus annotation error rates are 5% for SILVA-
mothur (minimum determined by assuming a maximum of 4× more errors in GG-QIIME)
and 12% for GG-QIIME (minimum determined as half of the 24% combined rate, given that
the comparison with RDP14 indicates a higher rate for GG-QIIME).
1. GG-QIIME and SILVA-mothur
Rank	Common Names	Same Name	Different Name
Phylum	29098	28616 (98.3%)	481 (1.7%)
Class	24476	21592 (88.2%)	1201 (4.9%)
Order	21919	17121 (78.1%)	2804 (12.8%)
Family	15805	13141 (83.1%)	1428 (9.0%)
Genus	7735	5352 (69.2%)	1868 (24.1%)
2. GG-QIIME and RDP14
Rank	Common Names	Same Name	Different Name
Phylum	477	475 (99.6%)	2 (0.4%)
Class	1761	1678 (95.3%)	27 (1.5%)
Order	1786	1583 (88.6%)	79 (4.4%)
Family	1545	1423 (92.1%)	78 (5.0%)
Genus	1404	1253 (89.2%)	151 (10.8%)
3. SILVA-mothur and RDP14
Rank	Common Names	Same Name	Different Name
Phylum	1030	1028 (99.8%)	2 (0.2%)
Class	4324	4299 (99.4%)	17 (0.4%)
Order	3359	3148 (93.7%)	57 (1.7%)
Family	4291	4070 (94.8%)	141 (3.3%)
Genus	4510	4386 (97.3%)	124 (2.7%)
Table SN4.1. Pair-wise comparisons of taxonomy annotations. The table shows the rate
of agreement and disagreement between taxonomy annotations for identical sequences
found in each pair of reference databases. Common Names is the number of identical
sequences having a common name for the given rank in one or both databases, Same Name
is the number of these sequences for which the name was the same and Different Name is
the number for which the name was different. A common name is a taxon name found in
the taxonomy systems for both databases.
Note 5. Reference database compatibility.
The tested programs place different constraints on taxonomy annotations. Mothur does not
allow a species name, which ruled out testing at species rank on War4. RDP requires that
the lowest rank is named for all reference sequences, which ruled out testing on
Greengenes or SILVA where genus and species names are often omitted. The mothur re-
implementation of the RDP algorithm does allow missing genus names. RDP14 includes
reference sequences with optional ranks (suborder and subclass) and missing ranks (e.g.,
sometimes only phylum and genus are specified with no intermediate ranks). These
variations are supported by RDP but not by some other programs. UTAX requires that
names correspond to clades so that the LCR can be determined for all pairs of sequences.
This means that names such as unclassified, uncultured, candidatus and incertae sedis
should be excluded for training. I therefore constructed subsets of the reference databases
with taxonomies that were compatible with all programs to enable testing on the same
reference data. This was done by filtering out special cases such as "uncultured", deleting
optional ranks (suborder, subclass) and discarding annotations with any missing or blank
names for required ranks (genus, family, class, order and phylum for RDP14 and species,
family, class, order and phylum for War4). This required discarding 506 / 10,049
sequences (5%) from RDP14 and 9,546 / 24,500 (40%) from War4. The compatible
versions of the reference databases are included in the Supplementary Files.
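The filtering might look like the following sketch for the RDP14 case; the record layout and rank names are assumptions for illustration, not the actual file formats.

```python
# Illustrative sketch of the Note 5 filtering: drop non-clade labels, delete
# optional ranks, and discard records missing any required rank name.

REQUIRED = ("phylum", "class", "order", "family", "genus")  # RDP14 case
OPTIONAL = ("subclass", "suborder")
NON_CLADE = {"unclassified", "uncultured", "candidatus", "incertae sedis"}

def make_compatible(records):
    """records: dict seq_id -> dict rank -> name. Keeps only fully named records."""
    kept = {}
    for sid, tax in records.items():
        tax = {r: n for r, n in tax.items()
               if r not in OPTIONAL and n and n.lower() not in NON_CLADE}
        if all(r in tax for r in REQUIRED):
            kept[sid] = tax
    return kept
```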
Note 6. Leave-one-out and leave-10%-out validation.
In their 2007 paper describing the RDP Naive Bayesian Classifier14, Wang et al. state in the
Abstract that "…results from leave-one-out testing … show that the overall accuracies at all
levels of confidence for near-full-length and 400-base segments were 89% or above down
to the genus level". In my opinion, this approach is not appropriate for microbial taxonomy
prediction because an informative leave-one-out validation requires that all categories are
known and training data is dense (Fig. SN6.1). With microbial taxonomy, training data is
sparse and many microbial genera and higher ranks are novel in typical data (Fig. SN6.2).
In addition, accuracy was measured using a bootstrap cutoff of zero rather than the
authors' recommended cutoff of 80%. Roughly half of the genera in RDP14 have only a
single sequence (913/1,948, Table SN6.1) and therefore cannot be predicted if left out of
the training set, but this is not taken into account. Accuracy as measured by this test is thus
the maximum possible sensitivity in a scenario where a large majority of query sequences
have identity >97% (Fig. SN6.2), which is unrealistic, and where the maximum achievable
accuracy is not 100% as would be expected by convention. At RDP14 genus level, RDP and
UTAX have 86% accuracy by this definition, close to the maximum possible of 91% (Table
SN6.1), as would be expected for sequences with >97% identity. The observation that
accuracy is less than 100% is mostly explained by classifications that are impossible due to
singletons (9%) with a smaller contribution by misclassification errors (5%). It is therefore
clear that accuracy as measured by the RDP leave-one-out test methodology is not
predictive of sensitivity or error rates on typical biological data. Leave-one-out accuracies
for RDP and UTAX are reported in Table SN6.2.
In a recent preprint [https://doi.org/10.7287/peerj.preprints.934v2], Bokulich et al.
describe a taxonomy prediction validation framework designed to enable reproducible
results. I was unable to install the framework or download the test data. The framework
has several dependencies on third-party code including Python packages which failed to
install. One of the described tests uses leave-10%-out validation where 10% of sequences
are extracted from Greengenes for use as a query set with the remaining sequences used as
a reference. I followed the methodology described in the preprint by extracting the V4
region of Greengenes v13.5 using the 515F/806R primers and extracting 10% subsets
chosen at random. I found the identity distribution shown in Fig. SN6.2 (lower-right), which
shows that a large majority of sequences in the query sets have ≥99% identity with their
corresponding reference sets. This distribution is even more strongly skewed towards
100% identity than the RDP leave-one-out test, which is explained by stronger sampling
biases; for example, the most abundant genus in Greengenes v13.5 is Staphylococcus with
135,711 sequences, comprising more than 10% of the database. Therefore, this test is not
predictive of sensitivity and error rates on typical biological data.
Fig. SN6.1. Microbial taxonomy prediction is not a textbook problem. In a textbook
classification problem (left), all categories are known (handwritten digits, in this example)
and have many training examples. Leave-one-out and leave-10%-out validation is
informative in a textbook case because they are realistic models of classification in practice.
With microbial taxonomy, reference data is sparse (right). In this analogy, the task of an
algorithm is to predict handwritten characters when the full alphabet is not known and
training data is sparse. If leave-one-out validation is used, the algorithm is not challenged
by realistic amounts of novel data (9, A, B…). Characters with only one training example (4
through 8) cannot be predicted when they are left out. If accuracy is measured as the
fraction of characters that are correctly predicted in a leave-one-out test, the highest
possible accuracy is less than 100% due to the singletons. Taxonomy has additional
complications. There is strong sampling bias in the reference data, e.g., human pathogens
are overrepresented (like digits 0, 1 and 2 on the right). Some training examples have
multiple labels because multiple genera can have the same V4 sequence, analogous to the
problem that 0 and I can be digits or letters. Even if only one genus is known for a given V4
sequence, a novel genus in the same family might have the same sequence so a prediction
of genus for that sequence should have <100% confidence.
Fig. SN6.2. LKRs for in vivo samples, leave-one-out and leave-10%-out test data. This
figure compares the identity distribution of soil, mouse gut and human gut OTUs (left) with
the identity distribution of query-reference pairs used in the RDP leave-one-out test and
the Bokulich et al. leave-10%-out test on the 16S V4 region (right). Colors show lowest
known ranks (LKRs) estimated using Yarza identities as described in the main text. In the
distributions for the validation tests, a large majority of query sequences have >97%
identity to the reference set (right), while in practice most sequences belong to novel
genera (left).
                  War4                           RDP14
Rank       Names  Singletons  Max. acc.   Names  Singletons  Max. acc.
Phylum         7       1        100%         39       3        100%
Class         30       2        100%         88       9        99.9%
Order        100       5        100%        123      20        99.8%
Family       287      16        99.9%       341      53        99.5%
Genus      1,308     157        99.0%     1,948     913        90.9%
Species    7,390   2,094        86.0%
Table SN6.1. Leave-one-out maximum accuracy. The table shows the maximum possible
accuracy of leave-one-out tests on the War4 (ITS) and RDP14 (16S) training sets which are
the defaults currently used by RDP. Names is the number of taxon names in the training set.
Singletons is the number of names having exactly one training sequence, which therefore
cannot be predicted when left out. Max. acc. is the maximum possible accuracy by the RDP
definition, which is <100% when there are singletons in the training set. Since there are
singletons at all ranks, the maximum accuracy is always <100% but appears as 100% in
some cases because values are shown to three significant figures.
Reference            Method  Phylum  Class  Order  Family  Genus  Species
War4 (ITS1)          RDP       99.8   99.5   99.1    98.2   92.7     72.9
                     UTAX      99.9   99.7   99.3    98.6   93.4     74.8
War4 (ITS2)          RDP       99.9   99.6   99.3    98.3   92.4     71.8
                     UTAX      99.9   99.7   99.4    98.5   93.2     74.0
War4 (full-length)   RDP       99.9   99.7   99.4    98.5   93.3     73.9
                     UTAX     100.0   99.8   99.5    98.8   94.3     77.7
RDP14 (V4)           RDP       99.7   99.5   98.4    96.1   80.4
                     UTAX      99.7   99.5   98.6    96.4   80.5
RDP14 (full-length)  RDP       99.5   99.4   98.7    97.3   85.6
                     UTAX      99.9   99.6   99.0    97.5   85.9
Table SN6.2. Leave-one-out results for War4 and RDP14. The table shows accuracy as
defined by the RDP leave-one-out methodology, i.e. the fraction of query sequences for
which the rank is correctly predicted at >0% bootstrap confidence for RDP and P>0 for
UTAX. The maximum possible accuracy by this definition is <100% when there are
singleton taxa (i.e., those having only one reference sequence). At RDP14 genus level, RDP
and UTAX have 86% accuracy, close to the maximum possible of 91% (Table SN6.1).
Singletons in the reference database thus reduce accuracy below 100% more than
misclassification errors by the algorithms.
Note 7. Estimated LKR frequencies for in vivo samples.
Prediction error rates for known and novel taxa respectively were measured using data for
which LKRs are inferred from authoritative annotations. However, these rates do not
directly indicate overall error rates for typical biological samples. For example, if most
genera in a given sample are known, then most errors will be due to misclassifications and
the overclassification rate for genus will be largely irrelevant, but if novel genera are
common, then the genus overclassification rate is important. (See Note 8 for definitions).
Thus, in order to estimate realistic error rates for typical data, we also need to determine
realistic rates of novelty, i.e. realistic LKR frequencies. Once we have LKR frequencies, then
overall sensitivity and error rates can be estimated by summing over all ranks (Eqs. 1 and 2
in the main text).
I estimated LKR frequencies for soil, human gut and mouse gut samples from a recent study
by Kozich et al.1 The goal of this step was to obtain realistic frequencies, i.e. rates of novel
taxa at each rank that are representative for biological samples in practice, not to make an
accurate determination of the frequencies on those particular samples. LKR frequencies
were estimated using identity thresholds, as described in detail below. This method is not
expected to be very accurate, but this doesn't matter because the frequencies will be
realistic even if they are under- or over-estimated by quite large factors. For example, I
estimate that 37% of the genera in the soil sample are known. This number could be quite
far off -- perhaps the true number is 20% or 50%, but it is surely not 1% or 99%. As long as
the estimate is in the right ballpark, a sample with 37% known genera is not exceptional,
and this rate is reasonable for summarizing the performance of a taxonomy prediction
algorithm. To avoid any misunderstanding on this central point, it is also important to note
that my methodology does not use identity to determine LKRs of individual sequences—
when required, they are obtained using authoritative annotations. Identity thresholds were
used only to obtain realistic LKR frequencies for three representative samples.
Identity thresholds are commonly used to determine approximate taxonomic relationships.
For example, it is commonly assumed that ≥97% identity for two full-length 16S sequences
indicates that the species is probably the same and conversely, if the identity is <97%, then
the species is probably different. This gives us a method for estimating the frequency of
known species in a sample: it is the fraction of sequences with ≥97% identity with the
reference database. This approach can be generalized to other ranks, as in the work of
Yarza et al.15 who determined the number of novel taxa in large databases of full-length 16S
sequences. Their method was based on finding appropriate clustering thresholds for ranks
from species to phylum. Sequence identity correlates only approximately with taxonomic
rank, so clusters will not correspond one-to-one with names—some clusters will contain
more than one name (lumping) and some names will be found in several clusters
(splitting). Yarza et al. tuned their thresholds so that the number of clusters containing
known taxa agreed with the number of distinct taxon names. In other words, the tuning
balanced splitting and lumping so that (number of clusters) = (number of distinct names)
at the given rank. In this framework, the number of clusters which do not contain known
names is an operational definition of the number of unnamed taxa.
At genus rank, Yarza et al. found that the clustering threshold which balanced splitting and
lumping was 95%. Using this threshold, I estimated the number of known genera as the
number of sequences having ≥95% identity with the reference database. This test is not
reliable in any given case—some sequences with known genera will have <95% identity
and some novel genera will have ≥95% identity, but these will tend to balance each other
out (analogous to lumping and splitting of clusters). LKR frequencies at higher ranks were
estimated in the same way.
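This estimation procedure can be sketched as follows (a minimal illustration; `top_hit_identities` is an assumed input holding each OTU's best-hit percent identity against the reference database, and the thresholds are the Yarza et al. values listed in Table SN7.1):

```python
# Yarza et al. identity thresholds (percent) per rank, as in Table SN7.1.
YARZA_THRESHOLDS = {
    "phylum": 75.0, "class": 79.0, "order": 82.0,
    "family": 86.0, "genus": 95.0, "species": 98.0,
}

def known_fraction(top_hit_identities, rank):
    """Estimate the fraction of OTUs belonging to a known taxon at
    `rank` as the fraction whose top-hit identity meets the rank's
    threshold. Unreliable per-OTU, but under- and over-calls tend to
    balance out in the aggregate frequency."""
    t = YARZA_THRESHOLDS[rank]
    known = sum(1 for pct in top_hit_identities if pct >= t)
    return known / len(top_hit_identities)
```

The novel-taxon frequency at a rank is then simply one minus this value.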
The Yarza identity thresholds were determined for full-length 16S sequences, which raises
the question of whether they are optimal for shorter gene segments such as the V4 region
used in this work. The thresholds are probably not optimal, but they are surely good
enough to give realistic frequencies. From Fig. 1 in the main text we can see that the genus
threshold (95%) appears to be too low because P(LCR=genus | 95%) = 0.34, so at 95%
identity the LKR is more likely to be family or higher. A better V4 threshold for genus
appears to be 96% or 97% with P(LCR=genus) = 0.51 and 0.61 respectively. Using a higher
identity would increase the estimated frequency of novel genera, so using the thresholds
determined on full-length sequences gives a conservative estimate of novel genus
frequency.
Rank        Id.   Sample      Known  Novel  Novel%  LKR%
Phylum      75%   Soil         7225    339      5%     5%
                  Mouse gut     716     41      5%     3%
                  Human gut     446      6      1%   0.2%
Class       79%   Soil         6820    744     10%     7%
                  Mouse gut     691     66      9%     3%
                  Human gut     445      7      2%     0%
Order       82%   Soil         6311   1253     17%    21%
                  Mouse gut     671     86     11%    10%
                  Human gut     445      7      2%     6%
Family      86%   Soil         4692   2872     38%    45%
                  Mouse gut     597    160     21%    42%
                  Human gut     416     36      8%    49%
Genus       95%   Soil          283    474     83%    16%
                  Mouse gut     193    259     63%    18%
                  Human gut     113    339     57%     8%
Species     98%   Soil          468   7096     93%     4%
                  Mouse gut     161    596     79%    12%
                  Human gut     113    339     75%     8%
Sequence   100%   Soil          141   7423     98%     2%
                  Mouse gut      74    683     90%    10%
                  Human gut      77    375     83%    17%
Table SN7.1 Estimated LKR frequencies for in vivo samples vs. RDP14. LKR
frequencies estimated for UPARSE OTUs constructed from the Kozich et al. samples of soil
(7,564 OTUs), mouse gut (757 OTUs) and human gut (452 OTUs). Column headings are: Id.,
the Yarza et al. cutoff identity threshold for the rank, LKR% the fraction of OTUs having an
LKR at this rank according to the thresholds, Known the number of known OTUs, Novel the
number of novel OTUs, Novel% the fraction of novel OTUs. Novel frequencies >20% are
highlighted.
Note 8. Accuracy metrics for taxonomy validation.
Algorithm predictions are often characterized as true positives (TP), false positives (FP),
false negatives (FN) and true negatives (TN). Prediction accuracy is conventionally
summarized using measures calculated from totals for given types of prediction, e.g.
Bokulich et al. (reference in Note 6) use the textbook metrics precision = TP/(TP+FP) and
recall = TP/(TP+FN). However, this is not a textbook case (Note 6), and I used different
metrics which I found to correspond better with intuitive concepts of accuracy relevant for
taxonomy.
UTAX and the other algorithms considered in this work do not predict novelty (Note 15).
The concept of a true negative therefore does not apply because predictions are never
negative in the sense that they are for a binary classifier.
To characterize false positive rates, I defined a misclassification as a false positive when the
rank is known (FPmis), and an overclassification as a false positive when the rank is novel
(FPover). An overclassification error occurs when the algorithm predicts too many ranks—it
should have climbed higher up the taxonomic tree. In this spirit, a false negative could be
described as an underclassification error because too few ranks are predicted, but this is
true of all FNs so there is no need for a new category.
To characterize the rate of true positives, I defined sensitivity = TP / Nknown where Nknown =
TP+FN+FPmis is the number of queries with known names. Sensitivity by my definition has
a maximum of 100% which could be achieved by an ideal algorithm, while the RDP
accuracy measure is necessarily <100% if there are novel query sequences (Note 6). My
definition of sensitivity captures the intuitive idea of "fraction of achievable predictions
which are correct". Precision and recall cannot do this because misclassification errors
(where an ideal algorithm could make a TP prediction) and overclassification errors
(impossible because there are no training examples) are not distinguished.
As a summary statistic for errors I chose to use errors per query (EPQ) = FP / NQ where NQ is
the total number of query sequences. False negatives are not counted as errors for
calculating EPQ because they are already accounted for in sensitivity.
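The two statistics follow directly from these definitions (a minimal sketch; the function and variable names are mine, not from any released code):

```python
def sensitivity(tp, fn, fp_mis):
    """Sensitivity = TP / Nknown, where Nknown = TP + FN + FPmis is
    the number of queries whose name at this rank is known, i.e.
    present in the reference database. An ideal algorithm can reach
    100% by this definition."""
    return tp / (tp + fn + fp_mis)

def errors_per_query(fp_mis, fp_over, n_queries):
    """EPQ = FP / NQ. All false positives (misclassifications plus
    overclassifications) are counted; false negatives are not, because
    they are already accounted for in sensitivity."""
    return (fp_mis + fp_over) / n_queries
```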
When precision and recall are used, false positives are indicated by precision < 100%. As
errors increase, precision gets lower. The divisor for precision is (TP+FP) = number of
predictions, while the divisor for EPQ is the total number of queries. Some, or many,
queries may not get a prediction (which is not the same as a prediction that the rank is
novel as noted above; see also Note 15). Both precision and EPQ capture the FP rate, and
can readily be converted given the number of queries and number of predictions. In a
prediction task with dense reference data they capture a similar intuitive notion because all
FPs are misclassifications. However, with sparse reference data / novel query data there is
an important difference. As novel queries are added, all new predictions are
overclassification errors, so precision falls without bound and approaches zero at
very high novelty, even if the algorithm has a low overclassification rate. In other words,
precision reflects a property of the query set as well as a property of the algorithm. For low
ranks, novelty may be high enough that overclassifications swamp misclassifications even
if the algorithm has low rates for both types of error, making precision hard to interpret. By
contrast, when there is high novelty EPQ will converge on the overclassification rate, an
intrinsic property of the algorithm.
Note 9. Calculation of LCR probabilities and S/E.
If a married couple has a height difference of 2cm, what is the probability that the taller
spouse is male? To answer this, collect information about a large number of couples,
extract the subset where the height difference is 2cm, and calculate the fraction where the
taller spouse is a man. If 80% of those couples have a taller man, we conclude that the
probability is 0.8. Implicitly, this procedure assumes we have observed events generated
by a hidden stochastic process, and the best estimate we can make of the underlying
probability distribution (given some reasonable assumptions) is the observed frequency in
those samples. This is called an a posteriori estimate.
If a pair of sequences has 90% identity, what is the probability that their lowest common
rank is family? To answer this, collect a large number of pairs of sequences, extract the
subset with 90% identity and calculate the fraction with LCR=family.
Fig. SN9.1 shows schematically how UTAX calculates LCR probabilities from a reference
database, using sequence identity as the similarity measure for this example. (In practice,
UTAX uses dUTAX defined by Eq.6 in the main text). An all-pairs triangular matrix (a) is
constructed containing pair-wise sequence identities, indicated by colors (green=100%,
yellow=95% and orange=90%). The lowest common rank (LCR) is determined for each
pair by comparing taxonomy annotations and marked as s (species), g (genus) or f (family).
For each identity, the corresponding pairs are identified: (b) for 100%, (c) for 95% and (d)
for 90%. For a given identity, the fraction of pairs having each LCR is calculated, i.e. the LCR
frequencies. For example, in (c) there are nine pairs with 95% identity. Of these, one has
LCR=species, five have LCR=genus and three have LCR=family. The LCR probabilities are
estimated to be the observed frequencies, so P(LCR=species | 95%) ~ 1/9, P(LCR=genus |
95%) ~ 5/9 and P(LCR=family | 95%) ~ 3/9 (the symbol ~ means "is estimated to be").
Using integer-rounded percent identities ensures that the set of pairs for a given identity is
usually large enough to make a good estimate of its LCR probabilities. Missing values are
filled in by interpolation, e.g. if there are no pairs with 76% identity then P(LCR | 76%) ~
(P(LCR | 75%) + P(LCR | 77%))/2.
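The frequency-counting step can be sketched as follows (a simplified illustration using integer-rounded percent identities and omitting the interpolation of missing values; not the UTAX implementation, which uses dUTAX rather than identity):

```python
from collections import defaultdict

def lcr_probabilities(pairs):
    """Estimate P(LCR = rank | identity) as an observed frequency.
    `pairs` is an iterable of (percent_identity, lcr) tuples, one per
    reference sequence pair; identities are rounded to integers so the
    set of pairs at each identity is usually large enough for a good
    estimate."""
    counts = defaultdict(lambda: defaultdict(int))
    for pct_id, lcr in pairs:
        counts[round(pct_id)][lcr] += 1
    probs = {}
    for pct_id, by_rank in counts.items():
        total = sum(by_rank.values())
        probs[pct_id] = {rank: n / total for rank, n in by_rank.items()}
    return probs
```

Applied to the toy example of Fig. SN9.1 (nine pairs at 95% identity: one species, five genus, three family), this recovers P(LCR=genus | 95%) ~ 5/9.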
Fig. SN9.1. Calculation of LCR probabilities from a reference database.
Fig. SN9.2. Calculation of sensitivity vs. EPQ from a reference database.
This figure shows how a sensitivity vs. error plot for common rank (CR) is calculated for
genus, using the toy example from Fig. SN9.1. Pairs are considered in order of decreasing
identity. If LCR=s or LCR=g, the pair is a true positive CR prediction because the genus is
the same, or if LCR=f this is a false positive because the genus is different. At each identity,
the number of true positives and false positives (f, red outlines) are counted. There are 14
pairs with common genera (LCR=s or g) and there are 21 queries (the total number of
pairs), so the CR sensitivity at a given cutoff is TP/14 and EPQ is FP/21 (see Note 8 for
definitions and discussion of sensitivity and EPQ). Here, there are three possible thresholds
at identities 100%, 95% and 90% which incrementally include queries from pairs in groups
(b), (c) and (d) respectively.
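The sweep described in this figure can be sketched as follows (a toy illustration; `pairs` holds (identity, LCR) tuples with the LCR coded as 's', 'g' or 'f' as in Fig. SN9.1):

```python
def cr_curve(pairs):
    """Sensitivity vs. EPQ points for common-rank (CR) genus
    predictions. A pair is a true positive if the genus is shared
    (LCR is species or genus) and a false positive otherwise (LCR is
    family). Pairs are swept in order of decreasing identity."""
    n_common = sum(1 for _, lcr in pairs if lcr in ("s", "g"))
    n_queries = len(pairs)
    tp = fp = 0
    points = []
    for identity, lcr in sorted(pairs, key=lambda p: -p[0]):
        if lcr in ("s", "g"):
            tp += 1
        else:
            fp += 1
        points.append((identity, tp / n_common, fp / n_queries))
    return points
```

In the toy example there are 14 pairs with a common genus out of 21, so sensitivity is TP/14 and EPQ is FP/21 at each of the three possible thresholds.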
Note 10. Software versions and command lines.
UTAX version 1.0. Source code and Linux binary are in the Supplementary Files.

RDP: Stand-alone classifier version 2.11.
RDP training:
java -Xmx8g -cp /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar edu/msu/cme/rdp/classifier/train/ClassifierTraineeMaker treefile dbfile 1 version1 name_not_used traindir/
RDP classification:
java -Xmx1g -jar /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar -t traindir/rRNAClassifier.properties -q query.fa -o output.txt

QIIME: version 1.9.1.
QIIME-uc: assign_taxonomy.py -i query.fa -m uclust -r db.fa -t taxonomy.txt
QIIME-sm: assign_taxonomy.py -i query.fa -m sortmerna -r db.fa -t taxonomy.txt
QIIME-blast: assign_taxonomy.py -i query.fa -m blast -r db.fa -t taxonomy.txt

mothur-knn: classify.seqs(fasta=query.fa, template=db.fa, taxonomy=taxonomy.txt, method=knn, processors=6)

GAST: Source dated 25 Feb 2011 (no version number given).
gast -in query_fa -ref ref_fa -rtax taxonomy.txt -out output.txt
Note 11. Compute time and memory use.
Method        Elapsed time (secs.)   Maximum memory
UTAX                   359               110 Mb
RDP                     41               320 Mb
QIIME-uc                11               450 Mb
QIIME-sm                19               150 Mb
QIIME-blast         25,260               800 Mb
GAST                    30               140 Mb
mothur-knn               3               120 Mb
Table SN11.1. Execution time and maximum memory use of the tested methods. The
table reports elapsed time in seconds and maximum memory in megabytes for the tested
methods using the 9,364 sequences in the V4 reference database extracted from RDP14 as
both query and reference. Programs were run under Ubuntu Linux on an Intel Core i7-
3930K CPU running at 3.20GHz with 64Gb RAM.
Note 12. Performance metrics on RDP14 and War4.
Method sensitivities for RDP14 and War4 are given in Supp. Table SN12.1. UTAX has
relatively low sensitivity but performance that I considered acceptable and comparable
to the best alternatives, with one exception: genus predictions on War4 (~39%
sensitivity compared to ~60-70% for RDP_80). I interpreted this anomaly as an
underestimate of EPQ by the algorithm, which I was not able to explain but found could
be addressed by setting P≥0.7, which gave 71% sensitivity and EPQ ~5%. Error
rates are shown in Table SN12.2, which shows that UTAX consistently achieves lower error
rates than most other methods, dramatically so in many cases, with the exception of
mothur-knn, which has much lower sensitivity. Genus overclassification rates with
LKR=genus increased from V4 to full-length for all methods except UTAX for which the
overclassification rate was lower (19% V4, 13% full-length; Supplementary Files). Notably,
the overclassification rate of RDP_80 jumped from 31% on V4 to 50% on full-length
sequences and RDP_50 to 81%.
Genus 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 48.0 67.9 78.7 79.9 37.9 39.5 38.5
RDP_80 59.0 79.3 86.4 92.6 63.3 64.5 72.3
RDP_50 74.5 87.0 91.1 94.7 73.1 73.7 78.9
QIIME-uc 69.2 79.3 81.9 81.4 54.6 58.9 64.1
QIIME-sm 65.6 76.4 80.2 83.7 52.4 56.8 62.7
QIIME-blast 77.0 88.7 92.1 94.9 64.8 68.6 80.2
GAST 73.2 88.2 92.2 95.4 73.9 76.2 80.2
mothur-knn 26.6 34.3 37.2 39.6 33.6 34.0 38.0
Family 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 81.6 94.9 96.3 96.6 83.1 84.2 87.2
RDP_80 84.9 95.1 97.8 98.9 82.5 85.0 90.7
RDP_50 93.2 97.3 98.7 99.2 89.0 90.4 93.6
QIIME-uc 91.6 95.6 96.8 96.2 63.4 69.5 74.8
QIIME-sm 90.2 94.9 96.2 96.7 62.2 68.3 75.3
QIIME-blast 93.6 97.6 98.3 99.0 78.4 82.2 94.4
GAST 93.1 97.9 98.7 99.1 88.1 91.2 93.9
mothur-knn 58.5 66.5 72.7 74.7 67.9 70.5 75.1
Order 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 97.5 98.6 99.4 99.5 94.4 95.3 95.3
RDP_80 94.9 98.5 99.1 99.6 89.3 92.0 96.6
RDP_50 97.9 99.2 99.4 99.7 94.4 95.8 98.0
QIIME-uc 97.4 97.9 98.4 98.0 64.4 70.8 76.3
QIIME-sm 96.9 97.8 98.3 98.4 64.0 69.9 77.3
QIIME-blast 98.2 98.9 99.2 99.7 81.9 85.7 98.1
GAST 98.4 99.2 99.6 99.7 90.9 94.1 96.9
mothur-knn 77.8 82.1 86.1 87.2 85.2 87.3 90.7
Class 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 99.6 99.8 99.9 99.9 97.5 98.2 98.0
RDP_80 97.9 99.4 99.8 99.9 93.4 95.6 98.5
RDP_50 99.3 99.7 99.9 100.0 96.7 97.9 99.1
QIIME-uc 99.1 99.0 99.0 98.3 64.8 71.3 76.7
QIIME-sm 99.0 99.1 99.0 98.9 64.8 70.5 78.0
QIIME-blast 99.5 99.5 99.5 99.9 82.9 86.7 99.2
GAST 99.7 99.8 99.9 99.9 91.6 95.0 97.7
mothur-knn 88.7 93.3 94.8 95.2 93.4 94.3 96.7
Phylum 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 99.6 99.8 99.9 99.9 97.5 98.2 98.0
RDP_80 97.9 99.4 99.8 99.9 93.4 95.6 98.5
RDP_50 99.3 99.7 99.9 100.0 96.7 97.9 99.1
QIIME-uc 99.1 99.0 99.0 98.3 64.8 71.3 76.7
QIIME-sm 99.0 99.1 99.0 98.9 64.8 70.5 78.0
QIIME-blast 99.5 99.5 99.5 99.9 82.9 86.7 99.2
GAST 99.7 99.8 99.9 99.9 91.6 95.0 97.7
mothur-knn 88.7 93.3 94.8 95.2 93.4 94.3 96.7
Table SN12.1. Sensitivity with LKR=genus on RDP14 (16S) and War4 (ITS). The table
shows sensitivity (defined in Note 8) as a percentage with LKR=genus for predicted ranks
from genus to phylum. LKR=genus was chosen as representative of the in vivo samples
(Note 7). Sensitivities <75% are highlighted in yellow, <50% in orange. The complete
matrices for sensitivity, overclassification and misclassification for all pairs (prediction
rank, LKR) are included in the Supplementary Files. The V5 region of 16S was truncated to
120nt to simulate reads obtained by older NGS machines. The V4 region (~250nt) is
popular with current sequencing technologies. The V3V5 region (~520nt) was sequenced
on older 454 machines and models the longer reads which will be achieved by NGS
machines in the near future.
Predicted genus    LKR=genus (Mis.)  LKR=family (Over.)  LKR=order (Over.)  LKR=class (Over.)
UTAX                     1.6               19.4                 5.1                1.1
RDP_80                   3.5               31.1                 9.5                3.0
RDP_50                   7.8               66.8                30.6               21.1
QIIME-uc                10.3               66.5                51.3               28.3
QIIME-sm                 7.7               61.1                48.3               26.6
QIIME-blast             11.2               99.0                95.3               90.6
GAST                     7.8               88.6                87.6               85.4
mothur-knn               0.1                5.5                 7.5                2.9

Predicted family   LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Over.)  LKR=class (Over.)
UTAX                     1.1                6.4                31.5                7.6
RDP_80                   0.9                3.6                30.1               11.6
RDP_50                   1.5                9.6                59.8               46.9
QIIME-uc                 2.9               16.0                65.1               34.6
QIIME-sm                 2.5               15.4                65.3               36.4
QIIME-blast              2.3               24.1                95.3               90.6
GAST                     1.4               18.1                95.1               93.1
mothur-knn               0.2                2.1                32.9               16.7

Predicted order    LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Over.)
UTAX                     0.5                1.9                 9.9               33.0
RDP_80                   0.4                1.2                 6.6               24.5
RDP_50                   0.5                3.4                13.8               67.9
QIIME-uc                 1.1                4.5                11.6               36.2
QIIME-sm                 0.9                4.2                11.3               38.1
QIIME-blast              1.0               11.4                25.9               90.6
GAST                     0.5                6.8                22.1               95.4
mothur-knn               0.1                0.9                 8.7               36.4

Predicted class    LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Mis.)
UTAX                     0.1                0.6                 4.6                6.1
RDP_80                   0.0                0.2                 1.8                1.3
RDP_50                   0.1                0.5                 4.4                4.7
QIIME-uc                 0.1                0.3                 2.1                3.1
QIIME-sm                 0.1                0.4                 1.5                2.9
QIIME-blast              0.5                6.2                14.9               24.4
GAST                     0.1                2.0                 8.2               10.0
mothur-knn               0.0                0.4                 2.0                4.2

Predicted phylum   LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Mis.)
UTAX                     0.0                0.5                 0.8                0.5
RDP_80                   0.0                0.0                 0.1                0.0
RDP_50                   0.0                0.8                 0.9                1.7
QIIME-uc                 0.0                0.1                 0.1                0.0
QIIME-sm                 0.0                0.0                 0.1                0.0
QIIME-blast              0.2                3.4                 4.0               13.8
GAST                     0.0                1.5                 2.2                1.9
mothur-knn               0.0                0.3                 0.3                0.5
Table SN12.2. Error rates measured on the RDP14 V4 region. The table shows
misclassification (Mis.) and overclassification (Over.) error rates as percentages for
predicted ranks from genus to phylum as defined in Note 8. LKRs from genus to class are
shown as novel phyla are rare in practice. When the predicted rank is below the LKR, errors
are overclassifications, and when the predicted rank is at or above the LKR, errors are
misclassifications (Note 8). Error rates ≥10% are highlighted in yellow, ≥20% in orange
and ≥50% in red.
Note 13. Construction of a rank split.
Fig. SN13.1. Rank split with LKR=family. The reference database is divided into two
subsets X and Y, colored gold and blue respectively in the figure, such that LKR=family. For
each family, its genera are assigned at random to X or to Y. At least one genus from each
family must be assigned to X and at least one to Y, so that every family is represented in
both subsets; no genus is present in both. Families
containing only one genus are discarded. With LKR=family, ranks of family and above are
always known (i.e., present in both X and Y) while ranks of genus and below are always
novel (i.e., not present in both).
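The split construction can be sketched as follows (a hypothetical illustration of the procedure described in the caption; the input format and function name are assumed):

```python
import random

def rank_split(families, seed=0):
    """Split a {family: [genus, ...]} mapping into subsets X and Y such
    that every family appears in both subsets but no genus appears in
    both, giving LKR=family for queries drawn from one subset against
    the other. Families with a single genus are discarded."""
    rng = random.Random(seed)
    x, y = {}, {}
    for family, genera in families.items():
        if len(genera) < 2:
            continue  # cannot be represented in both subsets
        genera = genera[:]
        rng.shuffle(genera)
        # cut point guarantees at least one genus on each side
        cut = rng.randint(1, len(genera) - 1)
        x[family] = genera[:cut]
        y[family] = genera[cut:]
    return x, y
```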
Note 14. Sensitivity vs. EPQ plots for similarity measures.
I chose dUTAX as the similarity measure after investigating alternatives. I noted that a good
measure will sort true positives ahead of false positives when predictions are sorted by
decreasing similarity (Fig. SN9.2). I therefore sorted all pairs in RDP14 and plotted
sensitivity (number of pairs with common genus and similarity ≥D divided by the number
of pairs with a common genus) vs. EPQ (number of pairs with different genera and
similarity ≥D divided by number of pairs) for each value of D for several different similarity
measures for the V4 region (Fig. SN14.1), giving a method for evaluating measures
independent of the rank-split benchmark methodology.
I defined the nearest neighbor NNk(R) of a reference sequence R to be its nearest neighbor
at rank k, i.e. the sequence in the reference database with highest similarity to R and a
different name at rank k. For a pair of sequences Q and R, I defined the unique word
similarity U(Q, R) as Eq.5 in the main text and Id(Q, R) as the identity calculated from a
global alignment, i.e. the number of columns containing identical letters divided by the
alignment length after discarding any columns containing terminal gaps. I defined IdUTAX as
follows,
IdUTAX(Q, R, k) = (2 Id(Q, R) – Id(R, NNk(R))) / 2.
Sensitivity vs. EPQ curves for these measures are plotted in Fig. SN14.1, showing that dUTAX
is the most accurate measure because its curve is lowest, implying that it has a lower error
rate at most sensitivity values. Surprisingly, the ranking from best to worst is dUTAX > U >
Id ≈ IdUTAX. Alignment-based measures are typically expected to be more accurate than
word-counting measures, whose use is usually motivated by computational efficiency with
the expectation that there will be some reduction in accuracy. I do not have a good
explanation for this result, but speculate that it may be related to the fact that for a given
number of gaps U has a higher value when the gaps are contiguous than when they are
spaced apart, while alignment identity by the usual definition gives the same value
regardless of where the gaps appear. Contiguous gaps are probably due to a single
mutation (insertion or deletion with length > 1), while gaps that are not adjacent are
probably due to multiple events, indicating a larger evolutionary distance. This suggests
trying a modified definition of
alignment identity to count gaps differently or using a likelihood-based measure that
calculates a log-odds score with affine gap penalties. However, this approach would require
calculating thousands of alignments per query to reach the nearest neighbor at phylum
rank, which would be intractably slow for a practical high-throughput tool.
Fig. SN14.1. Genus sensitivity vs. EPQ plot for four similarity measures. The figure
shows sensitivity vs. EPQ as percentages measured on the V4 region of RDP14.
Note 15. Non-predictions and blank names.
UTAX estimates the probability (P) that a query sequence has the predicted name at each
rank. Suppose P=0.8 for genus and P=0.95 for family. A common heuristic for processing
bioinformatics predictions is to set a confidence threshold (e.g., a maximum BLAST E-
value). Predictions below the threshold are discarded completely, predictions above the
threshold are kept and may be regarded as equally "correct" as a necessary or convenient
simplification for further analysis. This approach can be applied to UTAX predictions, as I
did in the validation tests by setting a threshold of P≥0.9. The simplest way to implement a
threshold is to omit the genus name or to use a reserved string meaning "not predicted"
(e.g., blank), depending on the file format. A similar approach is natural for implementing
an RDP bootstrap cutoff.
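The cutoff heuristic can be sketched as follows (a hypothetical illustration; the function and its input format are mine, not part of the UTAX implementation):

```python
def truncate_prediction(names, probs, threshold=0.9):
    """Keep predicted names from the highest rank downward while
    P >= threshold; names at and below the first low-confidence rank
    are omitted, i.e. left as "not predicted". `names` and `probs` are
    parallel lists ordered from phylum toward species."""
    kept = []
    for name, p in zip(names, probs):
        if p < threshold:
            break
        kept.append(name)
    return kept
```

An RDP-style bootstrap cutoff can be implemented in the same way, substituting bootstrap confidence values for P.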
UTAX estimates the probability P(x) that the true name is x. This is equivalent to the
probability P(!x) = 1 – P(x) that the true name is not x. If the name is not x, it may be a
different name, so it does not follow that the taxon is not named or novel (missing from the
database). For example, if the query has ~95% identity with reference sequences for two
different genera x and y, but all other known genera are <80% id, then combining
probabilities for LCRs from all reference sequences might give posteriors P(x) = 1/3, P(y) =
1/3, P(all other genera) ≈ 0, which in turn implies P(genus is not in database) = 1/3. As
implemented, UTAX reports only the name in the top hit, i.e. P(x) = 1/3. Therefore, low P
should not be interpreted as a prediction that the taxon is not found in the database -- in
this case, the theory predicts that the taxon is known with 2/3 probability and unknown
with 1/3 probability, but (i) these probabilities are not calculated by the current
implementation, and (ii) these probabilities cannot be reported in a file format with
one name per rank. Thus, the implementation of UTAX described here does not make
negative predictions in the sense of a binary classifier, and it would be a conceptual
mistake to classify any of its predictions as true negatives.
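The arithmetic in the two-genus example can be made explicit. A minimal sketch using the posteriors from the text (an illustration of the theory, not the UTAX implementation):

```python
# Toy posteriors from the example above: genera x and y each have ~95%
# identity to the query, all other known genera are far below (<80% id),
# so their combined posterior is ~0.
posterior = {"x": 1 / 3, "y": 1 / 3}

p_known = sum(posterior.values())   # P(genus is in the database) = 2/3
p_novel = 1.0 - p_known             # P(genus is not in database) = 1/3

# As implemented, UTAX reports only the top hit's name and its probability;
# p_known and p_novel are not reported.
top_name = max(posterior, key=posterior.get)
top_p = posterior[top_name]         # 1/3: low P, yet the genus is more
                                    # likely known than novel
```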
Setting a confidence threshold is an all-or-nothing approach to prediction that discards
potentially useful information. For example, if we are interested in the frequencies of
known genera in a sample, then a better estimate can be made by keeping the probabilities.
For example, if there are 10 different OTUs with P(Streptococcus) = 0.8, then it would be
better to predict that eight of them contain Streptococcus rather than all or none of them.
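The frequency estimate described above is simply a sum of probabilities rather than a count of thresholded predictions. A minimal sketch of the Streptococcus example, with the probabilities taken from the text:

```python
# Ten OTUs, each with P(Streptococcus) = 0.8. The expected number of
# OTUs containing Streptococcus is the sum of the probabilities (8),
# while an all-or-nothing cutoff of 0.9 reports either all or none.

otu_probs = [0.8] * 10

expected = sum(otu_probs)                                 # ~8.0
cutoff = 0.9
thresholded = sum(1 for p in otu_probs if p >= cutoff)    # 0 at this cutoff
```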
A low confidence value for a predicted name does not necessarily indicate high uncertainty
due to low identity with the database; it may also indicate an unavoidable ambiguity. For
example, 182 / 1,948 (9%) of genera in RDP14 have a V4 sequence that is also found in
another genus. If the query has a sequence found in more than one genus, it is impossible in
principle to predict its genus name with certainty. A prediction could report multiple genus
names with an indication that the sequence is known but has multiple labels (Fig. SN6.1),
but current file formats and downstream analysis tools do not allow this.
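The ambiguity described above can be sketched as a scan of the reference set for sequences carrying more than one genus label. The records below are invented stand-ins for RDP14 V4 sequences; the Escherichia/Shigella pair echoes a well-known case of near-identical sequences in distinct genera:

```python
# Find reference sequences shared by more than one genus. A query whose
# sequence appears in the "ambiguous" set cannot, even in principle, be
# assigned a single genus name with certainty.

from collections import defaultdict

records = [
    ("Escherichia",   "ACGTACGT"),   # invented sequences for illustration
    ("Shigella",      "ACGTACGT"),   # same sequence, different genus
    ("Streptococcus", "GGGTACCC"),
]

genera_by_seq = defaultdict(set)
for genus, seq in records:
    genera_by_seq[seq].add(genus)

ambiguous = {s: g for s, g in genera_by_seq.items() if len(g) > 1}
```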
With RDP14, all reference sequences have genus names. Other reference databases may
omit genus or other ranks because the sequence is annotated to fall outside a named clade,
e.g. because it belongs to an unnamed genus. The genus in this case is known (in the sense
of having a sequence in the database), but does not have a name.
Missing names also occur in SILVA and Greengenes annotations. Is a missing name a
positive prediction that the sequence is not in a named clade, or could it be a borderline
case, e.g. because the branching order of the tree has low bootstrap confidence? To the best
of my knowledge, this is not documented.
With the exception of RDP and UTAX, all of the tested prediction methods omit names or
predict blank names in some cases, but do not document how these should be interpreted
for further analysis. Is a blank a positive prediction that the sequence is not in a named
clade, or could it be an ambiguous or low-confidence case? Were these methods designed
with awareness that most genus or species names in the database are blank?
Note 16. LCR estimates and OTU error rates
I estimated the number of OTUs with novel genera as the number of OTU sequences with
<95% identity to the reference database. This approach requires a low rate of OTU errors
because incorrect sequences will have underestimated identities with the database. On
mock community tests using MiSeq 2×250 paired reads of the V4 region (similar to the
Kozich et al. V4 data used here for testing), I previously showed that all OTUs generated by
UPARSE were either error-free biological sequences from the designed community or
identifiable as contaminants16. I further demonstrated the high specificity of the UPARSE
pipeline by constructing OTUs from R2 (reverse) reads, which had substantially higher
error rates than the R1 (forward) reads. On three out of four mock samples, I found that all
R2 OTUs were error-free sequences from the designed community (supp. ref. 15, Table
SN1.4). While errors in the contaminant sequences cannot be identified with certainty
(because the contaminants are not known independently of their OTU sequences), most
likely all of the UPARSE OTU sequences on the mock community tests were error-free, and
certainly a large majority were. To further improve specificity, for the results
reported in this paper I used an updated version of UPARSE incorporating expected error
quality filtering, which dramatically reduces the number of bad reads17.
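The novel-genus estimate at the start of this note can be sketched as a simple count. The identity values below are invented; in practice each would be the identity of an OTU sequence to its top hit in the reference database:

```python
# Count OTUs whose best identity to the reference database falls below
# the 95% genus threshold of Yarza et al.; these are candidates for
# novel genera. The identities are hypothetical stand-ins.

otu_top_hit_identities = [0.99, 0.97, 0.93, 0.88, 0.96]

GENUS_CUTOFF = 0.95
novel_genus_otus = sum(1 for ident in otu_top_hit_identities
                       if ident < GENUS_CUTOFF)
```

OTU sequence errors would depress these identities and inflate the count, which is why this estimate depends on a low OTU error rate.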
In the Kozich et al. data, diversity is higher and there is a longer tail of low-abundance
OTUs, so results on mock tests should be extrapolated with caution. However, while the
UPARSE pipeline may discard some valid sequences due to stringent filtering, and may
merge multiple species with similar sequences into a single OTU, I am not aware of any
mechanism that would tend to cause an increase in the number of OTU sequence errors
with higher diversity.
While there may be a small bias due to OTU sequence errors causing underestimated
identities, the 95% threshold for genus was obtained by Yarza et al. for full-length
sequences and is probably too high for the V4 region. Using the V4 region of RDP14, I found
that P(LCR > genus | 95%) = 0.66, i.e. at 95% identity the lowest common rank is more likely
than not higher than genus. Thus, the Yarza estimate of novel genus frequency is probably conservative,
even if a low rate of spurious OTUs is present.
Supplementary references
1. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D.
Development of a dual-index sequencing strategy and curation pipeline for analyzing
amplicon sequence data on the Illumina MiSeq sequencing platform. Appl. Environ.
Microbiol. 79, 5112–5120 (2013).
2. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and
aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35,
7188–7196 (2007).
4. Poincaré, H. Science et méthode. (1908).
5. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological
and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012).
6. Yilmaz, P. et al. The SILVA and ‘all-species Living Tree Project (LTP)’ taxonomic
frameworks. Nucleic Acids Res. 42, (2014).
7. Anonymous. Bergey's Manual of Systematic Bacteriology. (2001).
doi:10.1016/0769-2609(87)90099-8
8. Parte, A. C. LPSN - List of prokaryotic names with standing in nomenclature. Nucleic
Acids Res. 42, (2014).
9. Escobar-Páramo, P., Giudicelli, C., Parsot, C. & Denamur, E. The evolutionary history of
Shigella and enteroinvasive Escherichia coli revised. J. Mol. Evol. 57, 140–148 (2003).
10. Ashelford, K. E., Chuzhanova, N. A., Fry, J. C., Jones, A. J. & Weightman, A. J. New
screening software shows that most recent large 16S rRNA gene clone libraries contain
chimeras. Appl. Environ. Microbiol. 72, 5734–41 (2006).
11. Wilm, A., Mainz, I. & Steger, G. An enhanced RNA alignment benchmark for sequence
alignment programs. Algorithms Mol. Biol. 1, 19 (2006).
12. Felsenstein, J. Inferring Phylogenies. Am. J. Hum. Genet. 74, 1074 (2004).
13. Philippe, H. et al. Resolving difficult phylogenetic questions: Why more sequences are
not enough. PLoS Biol. 9, (2011).
14. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid
assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ.
Microbiol. 73, 5261–7 (2007).
15. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea
using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
16. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads.
Nat. Methods 10, 996–8 (2013).
17. Edgar, R. C. & Flyvbjerg, H. Error filtering, pair assembly and error correction for next-
generation sequencing reads. Bioinformatics 31, 3476–3482 (2014).