UTAX accurately predicts taxonomy of marker gene sequences
Robert C. Edgar
Independent Investigator
Tiburon, California, USA.
The UTAX algorithm accurately predicts the taxonomy of 16S ribosomal RNA and
other marker gene sequences targeted by next-generation metagenomics
experiments. UTAX has sensitivity comparable to most existing methods but much
lower error rates, predicting dramatically fewer false positives for novel
taxa.
Recent studies using next-generation sequencing of marker gene segments include the
Human Microbiome Project (HMP)1 and a survey of the Arabidopsis root microbiome2. A
fundamental step in such studies is to predict the taxonomy of sequences in the reads,
which are typically clustered into Operational Taxonomic Units (OTUs). Computational
taxonomy prediction is complicated by the fact that only a small minority of microbial
species have authoritative classifications and reference databases have sparse coverage so
that in practice, an OTU often does not have an exact match in the database (Supp. Note 3).
With the goal of improving taxonomy prediction accuracy, I developed a new algorithm,
UTAX, that accounts for sparseness in the database and for varying correlations between
rank and sequence identity in different groups. UTAX calculates a novel score combining k-
mer distances to the top hit and to the nearest neighbor at each rank, i.e. the most similar
sequence with a different name at that rank. For each rank, the probability that the query
belongs to the same group as the top hit is calculated from the distribution of scores over
all pairs in the reference database.
Available reference databases for the 16S ribosomal RNA gene (16S) include SILVA3,
Greengenes4 and the RDP Classifier5 (RDP) training set. The current RDP training set (v14,
here called RDP14) contains 10,679 sequences. Greengenes and SILVA are larger, giving
better coverage than RDP14 but not as much as might be expected from the numbers of
sequences (Supp. Note 3). Most of the annotations in SILVA and Greengenes are not
authoritative classifications but predictions generated by a combination of automated and
manual methods6,7 which I estimated to have error rates of ~6% and ~18% respectively
for genus (Supp. Note 4). Also, SILVA and Greengenes are not compatible with some
programs because many sequences lack species and genus names (Supp. Note 5), and I
therefore chose to use RDP14 for comparative validation on 16S and the RDP Warcup
training set8 version 4 (War4) for the fungal internal transcribed spacer (ITS) region.
Given sequences from a biological sample (here called OTUs without necessarily implying
clustering) and a reference database, I defined coverage at each taxonomic rank to be the
fraction of OTUs that belong to a known group. Here, known means that the group is
present in the reference database, regardless of whether the group has been named, and
novel that the group is not present. I defined the lowest known rank (LKR) of an OTU as its
lowest rank having at least one reference sequence and the LKR frequency λr as the fraction
of OTUs having LKR = r. For example, if λgenus = 0.4, then 40% of the OTUs belong to a novel
species in a known genus. LKR frequencies can be interpreted as a profile summarizing
taxonomic novelty in the OTUs with respect to the database. I estimated LKR frequencies
for soil, human gut and mouse gut reads of the 16S V4 region from a recent study9 using
sequence identity thresholds determined by Yarza et al.10: ≥95% for genus, ≥86% for
family, etc. (Supp. Note 7). While identity gives only an approximate indication of rank,
averaging over OTUs for a typical sample should give frequencies that are realistic even if
not accurate for that particular sample. Using RDP14 as a reference and OTUs constructed
by UPARSE11, I estimated the fraction of OTUs with novel genera to be 83% for soil, 63%
for mouse gut and 57% for human gut, showing that coverage is sparse in practice (Fig. 1
and Supp. Notes 7 and 16).
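The LKR estimation step can be sketched as follows. This is an illustrative reconstruction, not the published implementation: only the genus (≥95%) and family (≥86%) cutoffs are stated in the text, so the lower thresholds here are placeholders in the spirit of the Yarza values, and the function names are my own.

```python
# Hypothetical sketch: assign an estimated Lowest Known Rank (LKR) to each OTU
# from its top-hit identity, then tally LKR frequencies (the lambda_r profile).
# Genus/family cutoffs follow the text; order/class/phylum values are placeholders.

YARZA = [  # (rank, minimum % identity of the top hit for that rank to be "known")
    ("genus", 95.0), ("family", 86.0), ("order", 82.0),
    ("class", 78.5), ("phylum", 75.0),
]

def estimated_lkr(top_hit_identity):
    """Lowest rank whose identity threshold the top hit meets."""
    for rank, cutoff in YARZA:
        if top_hit_identity >= cutoff:
            return rank
    return "none"  # below the phylum threshold: no rank estimated as known

def lkr_frequencies(identities):
    """Fraction of OTUs with each estimated LKR (lambda_r)."""
    counts = {}
    for pid in identities:
        r = estimated_lkr(pid)
        counts[r] = counts.get(r, 0) + 1
    n = len(identities)
    return {r: c / n for r, c in counts.items()}

# Toy profile for a handful of invented top-hit identities
freqs = lkr_frequencies([99.0, 96.0, 91.0, 88.0, 84.0, 79.0, 72.0, 97.0])
```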
At high identities and high ranks, an OTU almost certainly belongs to the same group as the
top database hit, and at low ranks and low identities, an OTU almost certainly belongs to a
different group. The most challenging cases occur when identity is close to the average for
the rank, for example attempting to predict genus when identity is ~95%. This is a twilight
zone for taxonomy prediction (Fig. 1) analogous to the twilight zone for protein homology
prediction12. In principle, it might be possible to identify genus-specific sequence features,
but not when reference data is too sparse. For example, almost half (913 / 1,948) of the
genera in RDP14 have only one reference sequence, and in these cases it is impossible to
predict whether a human expert would assign another species to the same group from its
sequence alone. Thus, in the twilight zone, predictions of known genera will often be false
positives while non-predictions will often be false negatives (see also Supp. Note 15).
Identity distributions for typical samples (Fig. 1 and Supp. Fig. SN6.2) show that twilight
zone OTUs are common in practice, underscoring the difficulty of accurate taxonomy
prediction and the importance of providing a confidence estimate. With this in mind, I
designed UTAX to predict the mean number of errors per query (EPQ) for each rank (see
Methods). For testing, I set a threshold of P = (1 – EPQ) ≥ 0.9 on the assumption that ~10%
is an acceptable error rate for a typical study.
The RDP authors measured accuracy using leave-one-out validation5, which I believe is
inappropriate in this context (Supp. Note 6). I used a different strategy that has been
applied to validation of shotgun metagenomics taxonomy prediction13 by constructing
datasets where LKRs are known from trusted annotations, as follows. For k=genus, family ...
phylum I divided RDP14 into two subsets (rank splits) Xk and Yk such that the LKR between
the subsets is k. For example, with LKR = family, I discarded families with only one genus
and randomly assigned the remaining genera to Xfamily or Yfamily with the constraint that at
least one genus from every family must be present in both (Supp. Fig. SN13.1). For each k
and for each region of interest (full-length gene, V4 etc.), I measured prediction
performance for all ranks using Xk as the query and Yk as the reference and vice versa. I
included a null split XN = YN = RDP14 to measure performance when the sequence is known.
I followed the same procedure for War4. For every split at rank k I calculated the following
accuracy metrics for each rank r (see Supp. Note 8 for discussion). Sensitivity (Srk) is the
fraction of known names at rank r that were correctly predicted. The misclassification
error rate (Mrk) is the fraction of known names at rank r that were incorrectly predicted.
The overclassification error rate (Ork) is the fraction of novel r's that were incorrectly
predicted to be known. Given the LKR frequencies λk, the total sensitivity Sensr and errors
per query EPQr at rank r for a set of OTUs can be estimated by assuming that the
sensitivities and error rates at each rank are approximately the same as those measured on
the rank splits:
Sensr = Σk λk Srk, (Eq.1)
EPQr = Σk λk (Ork + Mrk). (Eq.2)
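Eqs. 1 and 2 amount to frequency-weighted sums over the possible LKRs. A minimal sketch, with all numeric values invented for illustration:

```python
# Sketch of Eqs. 1-2: combine per-LKR sensitivities and error rates (as measured
# on rank splits) with estimated LKR frequencies to get total sensitivity and
# errors per query (EPQ) at a rank. All numbers below are illustrative.

def total_sensitivity(lkr_freq, sens):
    # Sens_r = sum_k lambda_k * S_rk
    return sum(lkr_freq[k] * sens[k] for k in lkr_freq)

def errors_per_query(lkr_freq, overclass, misclass):
    # EPQ_r = sum_k lambda_k * (O_rk + M_rk)
    return sum(lkr_freq[k] * (overclass[k] + misclass[k]) for k in lkr_freq)

# Invented values for r = genus. When the LKR is above genus the genus is novel,
# so sensitivity is zero and the overclassification rate O applies instead.
lam = {"genus": 0.4, "family": 0.3, "order": 0.3}   # lambda_k
S = {"genus": 0.9, "family": 0.0, "order": 0.0}     # S_rk
O = {"genus": 0.0, "family": 0.2, "order": 0.3}     # O_rk
M = {"genus": 0.05, "family": 0.0, "order": 0.0}    # M_rk
```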
To obtain sensitivities and error rates for typical data, I used Eqs. 1 and 2 with the
estimated LKR frequencies for the soil, human gut and mouse gut OTUs. While the
frequencies may be inaccurate for those samples, and the sensitivities and error rates for
each LKR in a given set of OTUs may differ somewhat from those measured on the rank
splits, this procedure should nevertheless give good estimates in the sense that they fall
comfortably within the range of true values for typical data in practice, giving a far more
realistic indication of algorithm accuracy than leave-one-out testing (Supp. Note 6).
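The rank-split construction described above (for LKR = family) might be sketched as follows, assuming a simple mapping of families to genera; the actual procedure (Supp. Fig. SN13.1) may differ in detail.

```python
# Illustrative sketch of a rank split at LKR = family: discard families with a
# single genus, then assign genera to X or Y with at least one genus of every
# retained family in each subset. Data structures are assumed for the example.
import random

def family_rank_split(families, seed=0):
    """families: dict family -> list of genus names. Returns (X, Y) genus sets."""
    rng = random.Random(seed)
    X, Y = set(), set()
    for fam, genera in families.items():
        if len(genera) < 2:
            continue  # a single-genus family cannot appear in both subsets
        genera = list(genera)
        rng.shuffle(genera)
        X.add(genera[0])      # guarantee one genus of this family in X...
        Y.add(genera[1])      # ...and one in Y
        for g in genera[2:]:  # assign any remaining genera at random
            (X if rng.random() < 0.5 else Y).add(g)
    return X, Y
```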
Using this method, I compared the accuracy of UTAX with GAST14, RDP and methods
supported by mothur15 and QIIME16 (see Supp. Note 11 for method name abbreviations,
software versions and command lines). Representative results are given in Table 1; the
underlying performance metrics are given in the Supplementary Files and Supp. Note 12.
Mothur-rdp gave very similar results to RDP (Supp. Note 1). The only method to
consistently achieve an estimated EPQgenus below 10% was mothur-knn, but its sensitivity
was also much lower than the other methods (Sensgenus < 40% on all samples). The
estimated EPQgenus of UTAX was ~10% on all three samples, remarkably close to the rate
predicted by the P ≥ 0.9 threshold given that P is calculated by an independent method that
does not use identity thresholds or rank splits (Methods). All other algorithms had
substantially higher EPQgenus, ranging from EPQgenus ~17% for RDP at 80% bootstrap to
QIIME-blast which consistently had the highest error rate (EPQgenus 62% to 78%). The
default QIIME method, QIIME-uc, had EPQgenus = 39% to 45% and QIIME-rdp, which sets the
bootstrap cutoff at 50% by default, had EPQgenus = 36% to 40%. Sensphylum was >90% for all
methods except QIIME-uc (78% on soil, 87% on mouse gut) and QIIME-sm (79% on soil,
87% on mouse gut).
Methods
Given a pair of sequences Q and R, I defined the lowest common rank (LCR) of Q and R to be
the lowest rank where Q and R have the same name. Given a similarity measure d(Q, R, k),
P(LCR=k | d) is the probability that the LCR is k. For example, if d is sequence identity then
P(LCR=phylum | d=93%) will be close to one but P(LCR=genus | d=93%) will be lower.
To obtain a discrete range, UTAX converts a real-valued similarity d taking values zero to
one to an integer percentage D = ⌊100 d⌋. Considering all pairs of sequences in a reference
database B, let the number of pairs with a given D be HD and the number of those pairs with
LCR=k be hD,k. UTAX calculates an a-posteriori estimate for P(LCR=k | D) from B as the
fraction of pairs having distance D which also have LCR=k, i.e.
P(LCR=k | D) ~ hD,k/HD. (Eq.3)
For motivation and visualization of Eq.3 see Supp. Note 9. UTAX calculates the matrix CD,k =
hD,k/HD from B and stores it for use in run-time prediction.
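Eq. 3 can be sketched as a tabulation over reference pairs; here the pairwise similarities and LCR labels are supplied directly, standing in for UTAX's k-mer comparisons and taxonomy lookups.

```python
# Sketch of Eq. 3: over all reference pairs, tabulate the fraction with LCR = k
# at each integer similarity D = floor(100 d). Input pairs are invented.
import math
from collections import defaultdict

def lcr_matrix(pairs):
    """pairs: iterable of (d, lcr) with d in [0, 1]. Returns C[D][k] ~ h_{D,k} / H_D."""
    H = defaultdict(int)                        # H_D: number of pairs at similarity D
    h = defaultdict(lambda: defaultdict(int))   # h_{D,k}: of those, pairs with LCR = k
    for d, k in pairs:
        D = math.floor(100 * d)
        H[D] += 1
        h[D][k] += 1
    return {D: {k: n / H[D] for k, n in h[D].items()} for D in H}

C = lcr_matrix([(0.975, "genus"), (0.978, "genus"), (0.971, "family"),
                (0.935, "family"), (0.932, "order")])
```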
Let P(CR(k) | D) be the probability that two sequences have a common rank at level k, i.e.
have the same name at that rank. Let taxon(Q, k) be the name of Q at rank k. Q and R have
the same name at rank k if their LCR is not > k, hence
P(CR(k) | D)
= P(taxon(Q, k) = taxon(R, k) | D)
= 1 – P(LCR(Q, R) > k | D) = 1 – Σr>k CD,r. (Eq.4)
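Eq. 4 is then the complement of a tail sum of the C matrix over ranks above k. A sketch with an illustrative row: the genus/family/order values echo the Fig. 1 caption at 95% identity, and the class entry is invented to make the row sum to one.

```python
# Sketch of Eq. 4: P(CR(k) | D) = 1 - sum over ranks r above k of C_{D,r}.
# Rank order follows RDP14 (no species rank); the row values are illustrative.

RANKS = ["genus", "family", "order", "class", "phylum"]  # lowest to highest

def p_common_rank(c_row, k):
    """c_row: dict rank -> P(LCR = rank | D). Returns P(CR(k) | D) per Eq. 4."""
    above = RANKS[RANKS.index(k) + 1:]                   # ranks r > k
    return 1.0 - sum(c_row.get(r, 0.0) for r in above)   # 1 - sum_{r>k} C_{D,r}

row = {"genus": 0.34, "family": 0.33, "order": 0.23, "class": 0.10}
```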
Thus, given a reference sequence R and an integer similarity D, Eq.4 gives the probability
that the taxon name of Q is the same as R at rank k. This gives a framework for constructing
a taxonomy prediction algorithm based on a similarity measure d. Natural choices for d
include identity calculated from an alignment or a word-counting distance. However, these
would not take into account that the correlation varies in different groups due to differing
evolutionary rates and lumping or splitting by taxonomists. I therefore also considered the
similarity of a reference sequence R with its nearest neighbor NNk(R) for each k, i.e. the
sequence in B with highest similarity to R and a different name at rank k. If NNk(R) is close
to R, then the confidence that taxon(Q, k) = taxon(R, k) should be reduced because of the
increased likelihood that taxon(Q, k) = taxon(NNk(R), k). I chose to use similarities
calculated from the set w8(Q) of 8-mers in Q. I defined the unique word similarity (U) of a
pair of sequences Q and R as
U(Q, R) = |w8(Q) ∩ w8(R)|/min(|w8(Q)|, |w8(R)|). (Eq.5)
I designed a similarity measure (dUTAX) that increases with higher similarity between Q and
R, decreases with higher similarity between R and NNk(R), and takes real values between
zero and one,
dUTAX(Q, R, k) = (2 U(Q, R) – U(R, NNk(R)))/2. (Eq.6)
(See Supp. Note 14 for comparison with other measures). Given a query sequence Q, UTAX
identifies the top hit T by unique word similarity, i.e. T = argmaxR { U(Q, R), R ∈ B }. The
rank names of Q are predicted to be the same as those of T with probabilities calculated by
Eq.4 using the dUTAX similarity measure.
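A minimal sketch of Eqs. 5-6 and top-hit selection: the 8-mer set similarity U, the nearest-neighbor-penalized score dUTAX, and the argmax over a reference set. Helper names are my own; the real UTAX precomputes U(R, NNk(R)) over the database and converts dUTAX to probabilities via Eq. 4.

```python
# Sketch of Eqs. 5-6 and T = argmax_R U(Q, R). Sequences are plain strings.

def w8(seq):
    """Set of 8-mers in a sequence (the word set of Eq. 5)."""
    return {seq[i:i + 8] for i in range(len(seq) - 7)}

def U(q, r):
    """Unique word similarity, Eq. 5."""
    wq, wr = w8(q), w8(r)
    return len(wq & wr) / min(len(wq), len(wr))

def d_utax(q, r, nn_sim):
    """Eq. 6. nn_sim = U(R, NN_k(R)), the similarity of R to its nearest
    neighbor with a different name at rank k, assumed precomputed."""
    return (2 * U(q, r) - nn_sim) / 2

def top_hit(q, refs):
    """T = argmax over R in the reference set of U(Q, R)."""
    return max(refs, key=lambda r: U(q, r))
```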
Figures and tables
Fig. 1. Estimated Lowest Known Ranks (LKRs) for soil OTUs. The upper graph shows
lowest common rank (LCR) probabilities as a function of sequence identity calculated for
the V4 region of RDP14 (using Eq.3, see also Supp. Note 9). The lower histogram shows
frequencies of integer-rounded sequence identities of top hits of OTUs to the RDP14
database. Histogram bars are colored to indicate estimated LKRs according to the Yarza
thresholds. While identity thresholds are not reliable indicators of rank, the fraction of
OTUs in a Yarza identity range nevertheless gives a realistic indication of how many OTUs
with the corresponding LKR might be found in a similar sample. The "twilight zone" is a
region around 95% identity where high sensitivity for genus prediction cannot be achieved
without high false positive rates because if the closest reference sequence has ~95%
identity, then it is unlikely that there are enough training examples to identify genus-
specific sequence features, and identity correlates only approximately with taxonomic
rank, noting e.g. that P(LCR=genus | 95%) = 0.34, P(LCR=family | 95%) = 0.33 and
P(LCR=order | 95%) = 0.23.
Table 1. Estimated accuracy for soil, mouse gut and human gut OTUs. The table shows
estimated sensitivity and errors per query (EPQ) for genus and phylum predictions,
expressed as percentages. Error rates >10% are highlighted yellow and >30% magenta.
Genus sensitivities <50% are highlighted magenta and phylum sensitivities <90% yellow.
Results for UTAX are shown for threshold P≥0.9. Results for RDP are shown with 80%
bootstrap cutoff (recommended by the authors) and 50% bootstrap (the default for QIIME-
rdp).
References
1. HMP Consortium. Structure, function and diversity of the healthy human microbiome.
Nature 486, 207–214 (2012).
2. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature
488, 86–90 (2012).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and
aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35,
7188–7196 (2007).
4. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
5. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid
assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ.
Microbiol. 73, 5261–7 (2007).
6. Yilmaz, P. et al. The SILVA and ‘all-species Living Tree Project (LTP)’ taxonomic
frameworks. Nucleic Acids Res. 42, (2014).
7. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological
and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012).
8. Deshpande, V. et al. Fungal identification using a Bayesian classifier and the Warcup
training set of internal transcribed spacer sequences. Mycologia (2015).
doi:10.3852/14-293
9. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D.
Development of a dual-index sequencing strategy and curation pipeline for analyzing
amplicon sequence data on the miseq illumina sequencing platform. Appl. Environ.
Microbiol. 79, 5112–5120 (2013).
10. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea
using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
11. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads.
Nat. Methods 10, 996–8 (2013).
12. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
13. Patil, K. R. et al. Taxonomic metagenome sequence assignment with structured output
models. Nat. Methods 8, 191–2 (2011).
14. Huse, S. M. et al. Exploring microbial diversity and taxonomy using SSU rRNA
hypervariable tag sequencing. PLoS Genet. 4, e1000255 (2008).
15. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-
supported software for describing and comparing microbial communities. Appl. Environ.
Microbiol. 75, 7537–41 (2009).
16. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing
data. Nat. Methods 7, 335–6 (2010).
Author contributions
R.C.E. conceived of the study, performed the analysis and wrote the manuscript.
UTAX accurately predicts taxonomy of marker gene sequences: Supplementary Notes
Note 1. Mothur-rdp is effectively equivalent to RDP.
Note 2. Genus predictions for the Soil86 set.
Note 3. Coverage of SILVA and Greengenes.
Note 4. Error rates of SILVA and Greengenes taxonomy annotations.
Note 5. Reference database compatibility.
Note 6. Leave-one-out and leave-10%-out validation.
Note 7. Estimated LKR frequencies for in vivo samples.
Note 8. Accuracy metrics for taxonomy prediction.
Note 9. Calculation of LCR probabilities and S/E.
Note 10. Compute time and memory use of the tested methods.
Note 11. Software versions and command lines.
Note 12. Performance metrics on RDP14 and War4.
Note 13. Construction of a rank split.
Note 14. Sensitivity/EPQ plots for similarity measures.
Note 15. Non-predictions and blank names.
Note 16. LKR estimates and OTU error rates.
Supplementary References
Note 1. Mothur-rdp is effectively equivalent to RDP.
I compared the predictions of mothur-rdp and RDP for all rank splits of the RDP14 V4
region. At a bootstrap cutoff of 80%, 241,585 taxon names were predicted by one or both
algorithms. Of these, 234,877 (97%) were identical. At 50% bootstrap, 280,334
/ 291,705 (96%) were identical. A rate of disagreement of 3 to 4% is consistent with
differences due to the use of random numbers in the bootstrapping procedure. I concluded
that mothur-rdp and RDP are effectively equivalent implementations of the same algorithm
and did not consider mothur-rdp separately for the rest of this work.
Note 2. Genus predictions for the Soil86 dataset.
Method Genus predictions
UTAX 0
QIIME-uc 3 (0.1%)
QIIME-sm 19 (0.5%)
mothur-knn 283 (8%)
RDP (80% bootstrap) 561 (15%)
RDP (50% bootstrap) 942 (26%)
GAST 3,048 (84%)
QIIME-blast 3,637 (89%)
Table SN2.1. Genus predictions on the Soil86 dataset. Soil86 contains 3,637 UPARSE
OTU sequences from the soil sample of Kozich et al.1 with ≤86% identity to the RDP14
reference database, suggesting a lowest known rank of order or higher. Few genus
predictions would be expected for this set considering that the Yarza threshold is 95% for
genus, but some methods predicted many genera, the most by QIIME-blast which predicted
genera for 89% of the sequences.
Note 3. Coverage of the Greengenes and SILVA databases.
The Greengenes2 and SILVA3 reference databases are larger than RDP14: Greengenes
v.13.5 has 1.3 × 10^6 sequences with taxonomy annotations and SILVA v123 has 1.8 × 10^6,
compared to 1 × 10^4 for RDP14. The default database for the QIIME methods is a subset of
Greengenes (GG-QIIME, 99,322 sequences) obtained by clustering at 97% identity, while
one of several suggested databases for use with mothur is a subset of SILVA (SILVA-
mothur, 172,418 sequences) [http://www.mothur.org/wiki/Taxonomy_outline, retrieved
12th Dec 2015]. The GG-QIIME and SILVA-mothur databases thus have an order of
magnitude more annotated sequences than RDP14 and a priori might provide a better
reference set for compatible algorithms, noting that RDP is not compatible because it
requires names for the lowest rank for all training sequences (Note 5) while most
sequences in GG-QIIME and SILVA-mothur lack species and genus names.
Fig. SN3.1 shows the identity distributions for soil OTUs against GG-QIIME and SILVA-
mothur compared with RDP14, showing that GG-QIIME and SILVA-mothur have less sparse
coverage than RDP14 though there are still many OTUs with estimated LKR>genus.
Coverage is less sparse in the sense there are more OTUs with high identities / fewer with
low identities, and this gives the appearance of more known ranks. However, while almost
all ranks are named in RDP14 (Note 11), the interpretation of lowest known rank is
different for GG-QIIME and SILVA-mothur where most sequences lack names for low ranks,
so "known" (present in the database) does not necessarily imply "named". Most
annotations in those databases were predicted using sequence analysis methods, so
"named" does not imply "authoritatively named" by conventional standards. I estimate the
genus annotation error rate to be ~6% for SILVA-mothur and ~18% for GG-QIIME (Note
4). It is therefore difficult to assess whether using one of the larger databases improves or
degrades prediction accuracy for a given algorithm compared to using RDP14, but
especially in the case of GG-QIIME it appears that the annotation error rate of the database
may be high enough to substantially degrade prediction performance, noting that
annotation errors of the database will be compounded by the inherent error rate of a
prediction algorithm, and confidence will be systematically overestimated because the
database error rate is not considered.
Fig. SN3.1. Identity distributions of soil OTUs. Histograms show frequencies of integer-
rounded sequence identities of top hits of OTUs to RDP14, GG-QIIME (the subset of
Greengenes which is the default reference in QIIME) and SILVA-mothur, one of the
reference databases provided for use by mothur. Colors indicate estimated lowest known
ranks according to the Yarza thresholds (see main text for methods).
Note 4. Error rates of Greengenes and SILVA taxonomy annotations.
Henri Poincaré famously described mathematics as the art of giving the same name to
different things4. In taxonomy, this is a bad idea.
Most taxonomy annotations in Greengenes and SILVA databases were predicted for
uncultured sequences using a combination of automated and manual methods5,6. I don't
fully understand their guiding principles or exactly how they were implemented, but
presumably they work something like the following. The starting point is a set of sequences
obtained from authoritatively classified organisms (gold-standard sequences). Other
annotations are made using a predicted phylogenetic tree. If a non-gold sequence is in the
same subtree as a gold sequence at a given rank, the name at that rank is inferred to be the
same.
To the best of my knowledge, neither Greengenes nor SILVA documents which sequences
were used as gold standards or the evidence supporting a given annotation (is it a gold
standard sequence? an automated prediction? an automated prediction which was
manually adjusted, and if so why?), making the reliability of any given annotation difficult
to evaluate or verify independently.
There are several differences in taxonomic nomenclatures and procedures for reconciling
conflicts between taxonomy and sequence evidence. Greengenes is based on the NCBI
taxonomy, RDP14 on Bergey's7 and SILVA on LSPN8. While RDP14 strictly adheres to
Bergey's to the best of my knowledge, Greengenes and SILVA modify their base taxonomies
to address inconsistencies with phylogenies determined from sequence. For example,
Greengenes deletes the genera Escherichia and Shigella, which are believed to overlap9,
leaving their sequences classified to family level only (Enterobacteriaceae). SILVA deals
with this issue in a different way by defining a combined genus (Escherichia-Shigella) and
retaining well-known species names such as Escherichia coli, while Greengenes leaves their
species names blank.
Both databases maintain large multiple alignments of 16S sequences, many of which have
incorrect and ambiguous bases and some of which are undetected chimeras10. The
Greengenes alignment is fixed at 7,682 columns using the NAST approach2 which
intentionally introduces misalignments (i.e., errors) to avoid increasing the number of
columns. Construction of RNA alignments is challenging, especially for large and diverse
datasets, and the best current alignment algorithms have substantial error rates when
challenged with highly diverged sequences11. Perfect tree inference from a sequence
alignment is generally not possible due to alignment errors and information loss12. Tree
construction error rates are difficult to estimate but can be substantial on large datasets13.
Given these issues, it is plausible that the Greengenes and SILVA trees could have
substantial error rates, raising the question whether these, perhaps together with other
imperfections in their annotation methods, have caused substantial numbers of taxonomy
annotation errors. This cannot be assessed directly because the ground truth is not known.
Instead, I identified errors by noting that annotations for identical sequences should agree,
so if two databases have different annotations for the same sequence then one or both of
them must be wrong.
Implementing this analysis is complicated by the fact that the databases use taxonomic
systems with different sets of names. Another complication is the interpretation of blank
names. Does a blank name indicate assignment to a sub-tree that has not been named, that
a name cannot be assigned due to overlapping named groups (like Escherichia-Shigella), or
low confidence in a prediction (i.e., the name might be known, or there are two candidate
known names which do not overlap but which are hard to distinguish)? (see also Note 15).
In consideration of these issues, I counted only names used by both systems (common
names), excluding names which do not correspond to clades such as unclassified,
uncultured, candidatus and incertae sedis. If one or both names were blank, the pair was not
counted.
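The counting rule might be sketched as follows, with assumed input structures (the actual databases are distributed in their own formats, and the real analysis works per rank from full lineage strings).

```python
# Hypothetical sketch of the Note 4 comparison: for identical sequences annotated
# in two databases, count agreements and disagreements at one rank, restricted to
# names used by both taxonomies and skipping blanks and non-clade labels.

SKIP = {"", "unclassified", "uncultured", "candidatus", "incertae sedis"}

def agreement(annot_a, annot_b, common_names):
    """annot_a/b: dict seq_id -> name at some rank. common_names: lowercase set.
    Returns (same, different) counts over shared sequence ids."""
    same = diff = 0
    for sid in annot_a.keys() & annot_b.keys():
        a, b = annot_a[sid].lower(), annot_b[sid].lower()
        if a in SKIP or b in SKIP:
            continue  # blank or non-clade label: pair not counted
        if a not in common_names or b not in common_names:
            continue  # name absent from one taxonomy system: not counted
        if a == b:
            same += 1
        else:
            diff += 1
    return same, diff
```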
Results are summarized in Table SN4.1, which shows that SILVA-mothur and GG-QIIME
disagree on 24% of genus annotations and 2% of phylum annotations for identical
sequences. This provides a lower bound on the sum of the annotation error rates for both
databases. The lower bound is achieved when every incorrect annotation is correct in the
other database. It should be rare for annotations to be wrong in both databases by chance
(if errors are random at a rate of ~10%, then ~1% will be wrong in both). Given that
distinctly different methods are used for alignment and tree construction, I would guess
that the errors have low correlation between the databases and the true combined rate is
close to this lower bound.
A pair-wise comparison measures the combined error rate without indicating the relative
rate, i.e. whether one database has a higher or lower error rate than the other. This can be
investigated using pair-wise comparisons with a third database, RDP14. Genus annotation
disagreement rates with RDP14 are 11% for GG-QIIME and 3% for SILVA-mothur. This
indicates that GG-QIIME has a higher error rate than SILVA-mothur because the error rate
of RDP14 should be roughly the same in both pair-wise comparisons, adding approximately
the same term to both combined rates. Also, all RDP14 sequences have genus annotations
and its much smaller size is more amenable to curation, suggesting that it has a high
frequency of gold-standard sequences and is likely to have a much lower error rate. This
hypothesis is supported by the lower pair-wise disagreements of RDP with the other two
databases. If we assume that the error rate of RDP14 is smaller than the other two
databases, then we can infer that the error rate of GG-QIIME is roughly 11% / 3% ≈ 3× to
4× larger than SILVA-mothur. Assuming a factor of three implies that the total error rates
are 24% × 3/4 = 18% for GG-QIIME and 24% × 1/4 = 6% for SILVA-mothur. While these
estimates are uncertain, the combined rate of 24% is robust and it is reasonable to
conclude that the minimum plausible genus annotation error rates are 5% for SILVA-
mothur (minimum determined by assuming a maximum of 4× more errors in GG-QIIME)
and 12% for GG-QIIME (minimum determined as half of the 24% combined rate, given that
the comparison with RDP14 indicates a higher rate for GG-QIIME).
1. GG-QIIME and SILVA-mothur
Rank	Common Names	Same Name	Different Name
Phylum	29098	28616 (98.3%)	481 (1.7%)
Class	24476	21592 (88.2%)	1201 (4.9%)
Order	21919	17121 (78.1%)	2804 (12.8%)
Family	15805	13141 (83.1%)	1428 (9.0%)
Genus	7735	5352 (69.2%)	1868 (24.1%)
2. GG-QIIME and RDP14
Rank	Common Names	Same Name	Different Name
Phylum	477	475 (99.6%)	2 (0.4%)
Class	1761	1678 (95.3%)	27 (1.5%)
Order	1786	1583 (88.6%)	79 (4.4%)
Family	1545	1423 (92.1%)	78 (5.0%)
Genus	1404	1253 (89.2%)	151 (10.8%)
3. SILVA-mothur and RDP14
Rank	Common Names	Same Name	Different Name
Phylum	1030	1028 (99.8%)	2 (0.2%)
Class	4324	4299 (99.4%)	17 (0.4%)
Order	3359	3148 (93.7%)	57 (1.7%)
Family	4291	4070 (94.8%)	141 (3.3%)
Genus	4510	4386 (97.3%)	124 (2.7%)
Table SN4.1. Pair-wise comparisons of taxonomy annotations. The table shows the rate
of agreement and disagreement between taxonomy annotations for identical sequences
found in each pair of reference databases. Common Names is the number of identical
sequences having a common name for the given rank in one or both databases, Same Name
is the number of these sequences for which the name was the same and Different Name is
the number for which the name was different. A common name is a taxon name found in
the taxonomy systems for both databases.
Note 5. Reference database compatibility.
The tested programs place different constraints on taxonomy annotations. Mothur does not
allow a species name, which ruled out testing at species rank on War4. RDP requires that
the lowest rank is named for all reference sequences, which ruled out testing on
Greengenes or SILVA where genus and species names are often omitted. The mothur re-
implementation of the RDP algorithm does allow missing genus names. RDP14 includes
reference sequences with optional ranks (suborder and subclass) and missing ranks (e.g.,
sometimes only phylum and genus are specified with no intermediate ranks). These
variations are supported by RDP but not by some other programs. UTAX requires that
names correspond to clades so that the LCR can be determined for all pairs of sequences.
This means that names such as unclassified, uncultured, candidatus and incertae sedis
should be excluded for training. I therefore constructed subsets of the reference databases
with taxonomies that were compatible with all programs to enable testing on the same
reference data. This was done by filtering out special cases such as "uncultured", deleting
optional ranks (suborder, subclass) and discarding annotations with any missing or blank
names for required ranks (genus, family, class, order and phylum for RDP14 and species,
family, class, order and phylum for War4). This required discarding 506 / 10,049
sequences (5%) from RDP14 and 9,546 / 24,500 (40%) from War4. The compatible
versions of the reference databases are included in the Supplementary Files.
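The filtering might look like the following sketch for the RDP14 case; the record layout and rank names are assumptions for illustration, not the actual file formats.

```python
# Illustrative sketch of the Note 5 filtering: drop non-clade labels, delete
# optional ranks, and discard records missing any required rank name.

REQUIRED = ("phylum", "class", "order", "family", "genus")  # RDP14 case
OPTIONAL = ("subclass", "suborder")
NON_CLADE = {"unclassified", "uncultured", "candidatus", "incertae sedis"}

def make_compatible(records):
    """records: dict seq_id -> dict rank -> name. Keeps only fully named records."""
    kept = {}
    for sid, tax in records.items():
        tax = {r: n for r, n in tax.items()
               if r not in OPTIONAL and n and n.lower() not in NON_CLADE}
        if all(r in tax for r in REQUIRED):
            kept[sid] = tax
    return kept
```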
Note 6. Leave-one-out and leave-10%-out validation.
In their 2007 paper describing the RDP Naive Bayesian Classifier14, Wang et al. state in the
Abstract that "…results from leave-one-out testing … show that the overall accuracies at all
levels of confidence for near-full-length and 400-base segments were 89% or above down
to the genus level". In my opinion, this approach is not appropriate for microbial taxonomy
prediction because an informative leave-one-out validation requires that all categories are
known and training data is dense (Fig. SN6.1). With microbial taxonomy, training data is
sparse and many microbial genera and higher ranks are novel in typical data (Fig. SN6.2).
In addition, accuracy was measured using a bootstrap cutoff of zero rather than the
authors' recommended cutoff of 80%. Roughly half of the genera in RDP14 have only a
single sequence (913/1,948, Table SN6.1) and therefore cannot be predicted if left out of
the training set, but this is not taken into account. Accuracy as measured by this test is thus
the maximum possible sensitivity in a scenario where a large majority of query sequences
have identity >97% (Fig. SN6.2), which is unrealistic, and where the maximum achievable
accuracy is not 100% as would be expected by convention. At RDP14 genus level, RDP and
UTAX have 86% accuracy by this definition, close to the maximum possible of 91% (Table
SN6.1), as would be expected for sequences with >97% identity. The observation that
accuracy is less than 100% is mostly explained by classifications that are impossible due to
singletons (9%) with a smaller contribution by misclassification errors (5%). It is therefore
clear that accuracy as measured by the RDP leave-one-out test methodology is not
predictive of sensitivity or error rates on typical biological data. Leave-one-out accuracies
for RDP and UTAX are reported in Table SN6.2.
In a recent preprint [https://doi.org/10.7287/peerj.preprints.934v2], Bokulich et al.
describe a taxonomy prediction validation framework designed to enable reproducible
results. I was unable to install the framework or download the test data. The framework
has several dependencies on third-party code including Python packages which failed to
install. One of the described tests uses leave-10%-out validation where 10% of sequences
are extracted from Greengenes for use as a query set with the remaining sequences used as
a reference. I followed the methodology described in the preprint by extracting the V4
region of Greengenes v13.5 using the 515F/806R primers and extracting 10% subsets
chosen at random. I found the identity distribution shown in Fig. SN6.2 (lower-right), which
shows that a large majority of sequences in the query sets have ≥99% identity with their
corresponding reference sets. This distribution is even more strongly skewed towards
100% identity than the RDP leave-one-out test, which is explained by stronger sampling
biases; for example, the most abundant genus in Greengenes v13.5 is Staphylococcus with
135,711 sequences, comprising more than 10% of the database. Therefore, this test is not
predictive of sensitivity and error rates on typical biological data.
Fig. SN6.1. Microbial taxonomy prediction is not a textbook problem. In a textbook
classification problem (left), all categories are known (handwritten digits, in this example)
and have many training examples. Leave-one-out and leave-10%-out validation is
informative in a textbook case because they are realistic models of classification in practice.
With microbial taxonomy, reference data is sparse (right). In this analogy, the task of an
algorithm is to predict handwritten characters when the full alphabet is not known and
training data is sparse. If leave-one-out validation is used, the algorithm is not challenged
by realistic amounts of novel data (9, A, B…). Characters with only one training example (4
through 8) cannot be predicted when they are left out. If accuracy is measured as the
fraction of characters that are correctly predicted in a leave-one-out test, the highest
possible accuracy is less than 100% due to the singletons. Taxonomy has additional
complications. There is strong sampling bias in the reference data, e.g., human pathogens
are overrepresented (like digits 0, 1 and 2 on the right). Some training examples have
multiple labels because multiple genera can have the same V4 sequence, analogous to the
problem that 0 and I can be digits or letters. Even if only one genus is known for a given V4
sequence, a novel genus in the same family might have the same sequence so a prediction
of genus for that sequence should have <100% confidence.
Fig. SN6.2. LKRs for in vivo samples, leave-one-out and leave-10%-out test data. This
figure compares the identity distribution of soil, mouse gut and human gut OTUs (left) with
the identity distribution of query-reference pairs used in the RDP leave-one-out test and
the Bokulich et al. leave-10%-out test on the 16S V4 region (right). Colors show lowest
known ranks (LKRs) estimated using Yarza identities as described in the main text. In the
distributions for the validation tests, a large majority of query sequences have >97%
identity to the reference set (right), while in practice most sequences belong to novel
genera (left).
                  War4                           RDP14
Rank       Names  Singletons  Max. acc.   Names  Singletons  Max. acc.
Phylum         7       1        100%         39       3        100%
Class         30       2        100%         88       9        99.9%
Order        100       5        100%        123      20        99.8%
Family       287      16        99.9%       341      53        99.5%
Genus      1,308     157        99.0%     1,948     913        90.9%
Species    7,390   2,094        86.0%
Table SN6.1. Leave-one-out maximum accuracy. The table shows the maximum possible
accuracy of leave-one-out tests on the War4 (ITS) and RDP14 (16S) training sets which are
the defaults currently used by RDP. Names is the number of taxon names in the training set.
Singletons is the number of names having exactly one training sequence, which therefore
cannot be predicted when left out. Max. acc. is the maximum possible accuracy by the RDP
definition, which is <100% when there are singletons in the training set. Since there are
singletons at all ranks, the maximum accuracy is always <100% but appears as 100% in
some cases because values are shown to three significant figures.
Reference            Method  Phylum  Class  Order  Family  Genus  Species
War4 (ITS1)          RDP       99.8   99.5   99.1    98.2   92.7     72.9
                     UTAX      99.9   99.7   99.3    98.6   93.4     74.8
War4 (ITS2)          RDP       99.9   99.6   99.3    98.3   92.4     71.8
                     UTAX      99.9   99.7   99.4    98.5   93.2     74.0
War4 (full-length)   RDP       99.9   99.7   99.4    98.5   93.3     73.9
                     UTAX     100.0   99.8   99.5    98.8   94.3     77.7
RDP14 (V4)           RDP       99.7   99.5   98.4    96.1   80.4
                     UTAX      99.7   99.5   98.6    96.4   80.5
RDP14 (full-length)  RDP       99.5   99.4   98.7    97.3   85.6
                     UTAX      99.9   99.6   99.0    97.5   85.9
Table SN6.2. Leave-one-out results for War4 and RDP14. The table shows accuracy as
defined by the RDP leave-one-out methodology, i.e. the fraction of query sequences for
which the rank is correctly predicted at >0% bootstrap confidence for RDP and P>0 for
UTAX. The maximum possible accuracy by this definition is <100% when there are
singleton taxa (i.e., those having only one reference sequence). At RDP14 genus level, RDP
and UTAX have 86% accuracy, close to the maximum possible of 91% (Table SN6.1).
Singletons in the reference database thus reduce accuracy below 100% more than
misclassification errors by the algorithms.
Note 7. Estimated LKR frequencies for in vivo samples.
Prediction error rates for known and novel taxa respectively were measured using data for
which LKRs are inferred from authoritative annotations. However, these rates do not
directly indicate overall error rates for typical biological samples. For example, if most
genera in a given sample are known, then most errors will be due to misclassifications and
the overclassification rate for genus will be largely irrelevant, but if novel genera are
common, then the genus overclassification rate is important. (See Note 8 for definitions).
Thus, in order to estimate realistic error rates for typical data, we also need to determine
realistic rates of novelty, i.e. realistic LKR frequencies. Once we have LKR frequencies, then
overall sensitivity and error rates can be estimated by summing over all ranks (Eqs. 1 and 2
in the main text).
I estimated LKR frequencies for soil, human gut and mouse gut samples from a recent study
by Kozich et al.1 The goal of this step was to obtain realistic frequencies, i.e. rates of novel
taxa at each rank that are representative for biological samples in practice, not to make an
accurate determination of the frequencies on those particular samples. LKR frequencies
were estimated using identity thresholds, as described in detail below. This method is not
expected to be very accurate, but this doesn't matter because the frequencies will be
realistic even if they are under- or over-estimated by quite large factors. For example, I
estimate that 37% of the genera in the soil sample are known. This number could be quite
far off -- perhaps the true number is 20% or 50%, but it is surely not 1% or 99%. As long as
the estimate is in the right ballpark, a sample with 37% known genera is not exceptional,
and this rate is reasonable for summarizing the performance of a taxonomy prediction
algorithm. To avoid any misunderstanding on this central point, it is also important to note
that my methodology does not use identity to determine LKRs of individual sequences—
when required, they are obtained using authoritative annotations. Identity thresholds were
used only to obtain realistic LKR frequencies for three representative samples.
Identity thresholds are commonly used to determine approximate taxonomic relationships.
For example, it is commonly assumed that ≥97% identity for two full-length 16S sequences
indicates that the species is probably the same and conversely, if the identity is <97%, then
the species is probably different. This gives us a method for estimating the frequency of
known species in a sample: it is the fraction of sequences with ≥97% identity with the
reference database. This approach can be generalized to other ranks, as in the work of
Yarza et al.15 who determined the number of novel taxa in large databases of full-length 16S
sequences. Their method was based on finding appropriate clustering thresholds for ranks
from species to phylum. Sequence identity correlates only approximately with taxonomic
rank, so clusters will not correspond one-to-one with names—some clusters will contain
more than one name (lumping) and some names will be found in several clusters
(splitting). Yarza et al. tuned their thresholds so that the number of clusters containing
known taxa agreed with the number of distinct taxon names. In other words, the tuning
balanced splitting and lumping so that (number of clusters) = (number of distinct names)
at the given rank. In this framework, the number of clusters which do not contain known
names is an operational definition of the number of unnamed taxa.
At genus rank, Yarza et al. found that the clustering threshold which balanced splitting and
lumping was 95%. Using this threshold, I estimated the number of known genera as the
number of sequences having ≥95% identity with the reference database. This test is not
reliable in any given case—some sequences with known genera will have <95% identity
and some novel genera will have ≥95% identity, but these will tend to balance each other
out (analogous to lumping and splitting of clusters). LKR frequencies at higher ranks were
estimated in the same way.
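This estimation procedure can be sketched as follows (a minimal illustration; `top_hit_identities` is an assumed input holding each OTU's best-hit percent identity against the reference database, and the thresholds are the Yarza et al. values listed in Table SN7.1):

```python
# Yarza et al. identity thresholds (percent) per rank, as in Table SN7.1.
YARZA_THRESHOLDS = {
    "phylum": 75.0, "class": 79.0, "order": 82.0,
    "family": 86.0, "genus": 95.0, "species": 98.0,
}

def known_fraction(top_hit_identities, rank):
    """Estimate the fraction of OTUs belonging to a known taxon at
    `rank` as the fraction whose top-hit identity meets the rank's
    threshold. Unreliable per-OTU, but under- and over-calls tend to
    balance out in the aggregate frequency."""
    t = YARZA_THRESHOLDS[rank]
    known = sum(1 for pct in top_hit_identities if pct >= t)
    return known / len(top_hit_identities)
```

The novel-taxon frequency at a rank is then simply one minus this value.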
The Yarza identity thresholds were determined for full-length 16S sequences, which raises
the question of whether they are optimal for shorter gene segments such as the V4 region
used in this work. The thresholds are probably not optimal, but they are surely good
enough to give realistic frequencies. From Fig. 1 in the main text we can see that the genus
threshold (95%) appears to be too low because P(LCR=genus | 95%) = 0.34, so at 95%
identity the LKR is more likely to be family or higher. A better V4 threshold for genus
appears to be 96% or 97% with P(LCR=genus) = 0.51 and 0.61 respectively. Using a higher
identity would increase the estimated frequency of novel genera, so using the thresholds
determined on full-length sequences gives a conservative estimate of novel genus
frequency.
Rank        Id.   Sample      Known  Novel  Novel%  LKR%
Phylum      75%   Soil         7225    339      5%     5%
                  Mouse gut     716     41      5%     3%
                  Human gut     446      6      1%   0.2%
Class       79%   Soil         6820    744     10%     7%
                  Mouse gut     691     66      9%     3%
                  Human gut     445      7      2%     0%
Order       82%   Soil         6311   1253     17%    21%
                  Mouse gut     671     86     11%    10%
                  Human gut     445      7      2%     6%
Family      86%   Soil         4692   2872     38%    45%
                  Mouse gut     597    160     21%    42%
                  Human gut     416     36      8%    49%
Genus       95%   Soil          283    474     83%    16%
                  Mouse gut     193    259     63%    18%
                  Human gut     113    339     57%     8%
Species     98%   Soil          468   7096     93%     4%
                  Mouse gut     161    596     79%    12%
                  Human gut     113    339     75%     8%
Sequence   100%   Soil          141   7423     98%     2%
                  Mouse gut      74    683     90%    10%
                  Human gut      77    375     83%    17%
Table SN7.1 Estimated LKR frequencies for in vivo samples vs. RDP14. LKR
frequencies estimated for UPARSE OTUs constructed from the Kozich et al. samples of soil
(7,564 OTUs), mouse gut (757 OTUs) and human gut (452 OTUs). Column headings are: Id.,
the Yarza et al. cutoff identity threshold for the rank, LKR% the fraction of OTUs having an
LKR at this rank according to the thresholds, Known the number of known OTUs, Novel the
number of novel OTUs, Novel% the fraction of novel OTUs. Novel frequencies >20% are
highlighted.
Note 8. Accuracy metrics for taxonomy validation.
Algorithm predictions are often characterized as true positives (TP), false positives (FP),
false negatives (FN) and true negatives (TN). Prediction accuracy is conventionally
summarized using measures calculated from totals for given types of prediction, e.g.
Bokulich et al. (reference in Note 6) use the textbook metrics precision = TP/(TP+FP) and
recall = TP/(TP+FN). However, this is not a textbook case (Note 6), and I used different
metrics which I found to correspond better with intuitive concepts of accuracy relevant for
taxonomy.
UTAX and the other algorithms considered in this work do not predict novelty (Note 15).
The concept of a true negative therefore does not apply because predictions are never
negative in the sense that they are for a binary classifier.
To characterize false positive rates, I defined a misclassification as a false positive when the
rank is known (FPmis), and an overclassification as a false positive when the rank is novel
(FPover). An overclassification error occurs when the algorithm predicts too many ranks—it
should have climbed higher up the taxonomic tree. In this spirit, a false negative could be
described as an underclassification error because too few ranks are predicted, but this is
true of all FNs so there is no need for a new category.
To characterize the rate of true positives, I defined sensitivity = TP / Nknown where Nknown =
TP+FN+FPmis is the number of queries with known names. Sensitivity by my definition has
a maximum of 100% which could be achieved by an ideal algorithm, while the RDP
accuracy measure is necessarily <100% if there are novel query sequences (Note 6). My
definition of sensitivity captures the intuitive idea of "fraction of achievable predictions
which are correct". Precision and recall cannot do this because misclassification errors
(where an ideal algorithm could make a TP prediction) and overclassification errors
(impossible because there are no training examples) are not distinguished.
As a summary statistic for errors I chose to use errors per query (EPQ) = FP / NQ where NQ is
the total number of query sequences. False negatives are not counted as errors for
calculating EPQ because they are already accounted for in sensitivity.
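The two statistics follow directly from these definitions (a minimal sketch; the function and variable names are mine, not from any released code):

```python
def sensitivity(tp, fn, fp_mis):
    """Sensitivity = TP / Nknown, where Nknown = TP + FN + FPmis is
    the number of queries whose name at this rank is known, i.e.
    present in the reference database. An ideal algorithm can reach
    100% by this definition."""
    return tp / (tp + fn + fp_mis)

def errors_per_query(fp_mis, fp_over, n_queries):
    """EPQ = FP / NQ. All false positives (misclassifications plus
    overclassifications) are counted; false negatives are not, because
    they are already accounted for in sensitivity."""
    return (fp_mis + fp_over) / n_queries
```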
When precision and recall are used, false positives are indicated by precision < 100%. As
errors increase, precision gets lower. The divisor for precision is (TP+FP) = number of
predictions, while the divisor for EPQ is the total number of queries. Some, or many,
queries may not get a prediction (which is not the same as a prediction that the rank is
novel as noted above; see also Note 15). Both precision and EPQ capture the FP rate, and
can readily be converted given the number of queries and number of predictions. In a
prediction task with dense reference data they capture a similar intuitive notion because all
FPs are misclassifications. However, with sparse reference data / novel query data there is
an important difference. As novel queries are added, all new predictions are
overclassification errors, so precision falls without bound and approaches zero at
very high novelty, even if the algorithm has a low overclassification rate. In other words,
precision reflects a property of the query set as well as a property of the algorithm. For low
ranks, novelty may be high enough that overclassifications swamp misclassifications even
if the algorithm has low rates for both types of error, making precision hard to interpret. By
contrast, when there is high novelty EPQ will converge on the overclassification rate, an
intrinsic property of the algorithm.
Note 9. Calculation of LCR probabilities and S/E.
If a married couple has a height difference of 2cm, what is the probability that the taller
spouse is male? To answer this, collect information about a large number of couples,
extract the subset where the height difference is 2cm, and calculate the fraction where the
taller spouse is a man. If 80% of those couples have a taller man, we conclude that the
probability is 0.8. Implicitly, this procedure assumes we have observed events generated
by a hidden stochastic process, and the best estimate we can make of the underlying
probability distribution (given some reasonable assumptions) is the observed frequency in
those samples. This is called an a posteriori estimate.
If a pair of sequences has 90% identity, what is the probability that their lowest common
rank is family? To answer this, collect a large number of pairs of sequences, extract the
subset with 90% identity and calculate the fraction with LCR=family.
Fig. SN9.1 shows schematically how UTAX calculates LCR probabilities from a reference
database, using sequence identity as the similarity measure for this example. (In practice,
UTAX uses dUTAX defined by Eq.6 in the main text). An all-pairs triangular matrix (a) is
constructed containing pair-wise sequence identities, indicated by colors (green=100%,
yellow=95% and orange=90%). The lowest common rank (LCR) is determined for each
pair by comparing taxonomy annotations and marked as s (species), g (genus) or f (family).
For each identity, the corresponding pairs are identified: (b) for 100%, (c) for 95% and (d)
for 90%. For a given identity, the fraction of pairs having each LCR is calculated, i.e. the LCR
frequencies. For example, in (c) there are nine pairs with 95% identity. Of these, one has
LCR=species, five have LCR=genus and three have LCR=family. The LCR probabilities are
estimated to be the observed frequencies, so P(LCR=species | 95%) ~ 1/9, P(LCR=genus |
95%) ~ 5/9 and P(LCR=family | 95%) ~ 3/9 (the symbol ~ means "is estimated to be").
Using integer-rounded percent identities ensures that the set of pairs for a given identity is
usually large enough to make a good estimate of its LCR probabilities. Missing values are
filled in by interpolation, e.g. if there are no pairs with 76% identity then P(LCR | 76%) ~
(P(LCR | 75%) + P(LCR | 77%))/2.
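The frequency-counting step can be sketched as follows (a simplified illustration using integer-rounded percent identities and omitting the interpolation of missing values; not the UTAX implementation, which uses dUTAX rather than identity):

```python
from collections import defaultdict

def lcr_probabilities(pairs):
    """Estimate P(LCR = rank | identity) as an observed frequency.
    `pairs` is an iterable of (percent_identity, lcr) tuples, one per
    reference sequence pair; identities are rounded to integers so the
    set of pairs at each identity is usually large enough for a good
    estimate."""
    counts = defaultdict(lambda: defaultdict(int))
    for pct_id, lcr in pairs:
        counts[round(pct_id)][lcr] += 1
    probs = {}
    for pct_id, by_rank in counts.items():
        total = sum(by_rank.values())
        probs[pct_id] = {rank: n / total for rank, n in by_rank.items()}
    return probs
```

Applied to the toy example of Fig. SN9.1 (nine pairs at 95% identity: one species, five genus, three family), this recovers P(LCR=genus | 95%) ~ 5/9.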
Fig. SN9.1. Calculation of LCR probabilities from a reference database.
Fig. SN9.2. Calculation of sensitivity vs. EPQ from a reference database.
This figure shows how a sensitivity vs. error plot for common rank (CR) is calculated for
genus, using the toy example from Fig. SN9.1. Pairs are considered in order of decreasing
identity. If LCR=s or LCR=g, the pair is a true positive CR prediction because the genus is
the same, or if LCR=f this is a false positive because the genus is different. At each identity,
the number of true positives and false positives (f, red outlines) are counted. There are 14
pairs with common genera (LCR=s or g) and there are 21 queries (the total number of
pairs), so the CR sensitivity at a given cutoff is TP/14 and EPQ is FP/21 (see Note 8 for
definitions and discussion of sensitivity and EPQ). Here, there are three possible thresholds
at identities 100%, 95% and 90% which incrementally include queries from pairs in groups
(b), (c) and (d) respectively.
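The sweep described in this figure can be sketched as follows (a toy illustration; `pairs` holds (identity, LCR) tuples with the LCR coded as 's', 'g' or 'f' as in Fig. SN9.1):

```python
def cr_curve(pairs):
    """Sensitivity vs. EPQ points for common-rank (CR) genus
    predictions. A pair is a true positive if the genus is shared
    (LCR is species or genus) and a false positive otherwise (LCR is
    family). Pairs are swept in order of decreasing identity."""
    n_common = sum(1 for _, lcr in pairs if lcr in ("s", "g"))
    n_queries = len(pairs)
    tp = fp = 0
    points = []
    for identity, lcr in sorted(pairs, key=lambda p: -p[0]):
        if lcr in ("s", "g"):
            tp += 1
        else:
            fp += 1
        points.append((identity, tp / n_common, fp / n_queries))
    return points
```

In the toy example there are 14 pairs with a common genus out of 21, so sensitivity is TP/14 and EPQ is FP/21 at each of the three possible thresholds.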
Note 10. Software versions and command lines.
UTAX version 1.0. Source code and Linux binary are in the Supplementary Files.

RDP: Stand-alone classifier version 2.11.
RDP training:
java -Xmx8g -cp /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar edu/msu/cme/rdp/classifier/train/ClassifierTraineeMaker treefile dbfile 1 version1 name_not_used traindir/
RDP classification:
java -Xmx1g -jar /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar -t traindir/rRNAClassifier.properties -q query.fa -o output.txt

QIIME: version 1.9.1.
QIIME-uc: assign_taxonomy.py -i query.fa -m uclust -r db.fa -t taxonomy.txt
QIIME-sm: assign_taxonomy.py -i query.fa -m sortmerna -r db.fa -t taxonomy.txt
QIIME-blast: assign_taxonomy.py -i query.fa -m blast -r db.fa -t taxonomy.txt

mothur-knn: classify.seqs(fasta=query.fa, template=db.fa, taxonomy=taxonomy.txt, method=knn, processors=6)

GAST: Source dated 25 Feb 2011 (no version number given).
gast -in query_fa -ref ref_fa -rtax taxonomy.txt -out output.txt
Note 11. Compute time and memory use.
Method        Elapsed time (secs.)   Maximum memory
UTAX                   359               110 Mb
RDP                     41               320 Mb
QIIME-uc                11               450 Mb
QIIME-sm                19               150 Mb
QIIME-blast         25,260               800 Mb
GAST                    30               140 Mb
mothur-knn               3               120 Mb
Table SN11.1. Execution time and maximum memory use of the tested methods. The
table reports elapsed time in seconds and maximum memory in megabytes for the tested
methods using the 9,364 sequences in the V4 reference database extracted from RDP14 as
both query and reference. Programs were run under Ubuntu Linux on an Intel Core i7-
3930K CPU running at 3.20GHz with 64Gb RAM.
Note 12. Performance metrics on RDP14 and War4.
Method sensitivities for RDP14 and War4 are given in Supp. Table SN12.1. UTAX has
relatively low sensitivity but performance that I considered acceptable and comparable
to the best alternatives, with one exception: genus predictions on War4 (~39%
sensitivity compared to ~60-70% for RDP_80). I interpreted this anomaly as an
underestimate of EPQ by the algorithm, which I was not able to explain but found could
be addressed by setting P≥0.7, which gave 71% sensitivity and EPQ ~5%. Error
rates are shown in Table SN12.2, which shows that UTAX consistently achieves lower error
rates than most other methods, dramatically so in many cases, with the exception of
mothur-knn, which has much lower sensitivity. Genus overclassification rates with
LKR=genus increased from V4 to full-length for all methods except UTAX for which the
overclassification rate was lower (19% V4, 13% full-length; Supplementary Files). Notably,
the overclassification rate of RDP_80 jumped from 31% on V4 to 50% on full-length
sequences and RDP_50 to 81%.
Genus 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 48.0 67.9 78.7 79.9 37.9 39.5 38.5
RDP_80 59.0 79.3 86.4 92.6 63.3 64.5 72.3
RDP_50 74.5 87.0 91.1 94.7 73.1 73.7 78.9
QIIME-uc 69.2 79.3 81.9 81.4 54.6 58.9 64.1
QIIME-sm 65.6 76.4 80.2 83.7 52.4 56.8 62.7
QIIME-blast 77.0 88.7 92.1 94.9 64.8 68.6 80.2
GAST 73.2 88.2 92.2 95.4 73.9 76.2 80.2
mothur-knn 26.6 34.3 37.2 39.6 33.6 34.0 38.0
Family 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 81.6 94.9 96.3 96.6 83.1 84.2 87.2
RDP_80 84.9 95.1 97.8 98.9 82.5 85.0 90.7
RDP_50 93.2 97.3 98.7 99.2 89.0 90.4 93.6
QIIME-uc 91.6 95.6 96.8 96.2 63.4 69.5 74.8
QIIME-sm 90.2 94.9 96.2 96.7 62.2 68.3 75.3
QIIME-blast 93.6 97.6 98.3 99.0 78.4 82.2 94.4
GAST 93.1 97.9 98.7 99.1 88.1 91.2 93.9
mothur-knn 58.5 66.5 72.7 74.7 67.9 70.5 75.1
Order 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 97.5 98.6 99.4 99.5 94.4 95.3 95.3
RDP_80 94.9 98.5 99.1 99.6 89.3 92.0 96.6
RDP_50 97.9 99.2 99.4 99.7 94.4 95.8 98.0
QIIME-uc 97.4 97.9 98.4 98.0 64.4 70.8 76.3
QIIME-sm 96.9 97.8 98.3 98.4 64.0 69.9 77.3
QIIME-blast 98.2 98.9 99.2 99.7 81.9 85.7 98.1
GAST 98.4 99.2 99.6 99.7 90.9 94.1 96.9
mothur-knn 77.8 82.1 86.1 87.2 85.2 87.3 90.7
Class 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 99.6 99.8 99.9 99.9 97.5 98.2 98.0
RDP_80 97.9 99.4 99.8 99.9 93.4 95.6 98.5
RDP_50 99.3 99.7 99.9 100.0 96.7 97.9 99.1
QIIME-uc 99.1 99.0 99.0 98.3 64.8 71.3 76.7
QIIME-sm 99.0 99.1 99.0 98.9 64.8 70.5 78.0
QIIME-blast 99.5 99.5 99.5 99.9 82.9 86.7 99.2
GAST 99.7 99.8 99.9 99.9 91.6 95.0 97.7
mothur-knn 88.7 93.3 94.8 95.2 93.4 94.3 96.7
Phylum 16S (V5) 16S (V4) 16S (V3V5) 16S (FL) ITS1 ITS2 ITS (FL)
UTAX 99.6 99.8 99.9 99.9 97.5 98.2 98.0
RDP_80 97.9 99.4 99.8 99.9 93.4 95.6 98.5
RDP_50 99.3 99.7 99.9 100.0 96.7 97.9 99.1
QIIME-uc 99.1 99.0 99.0 98.3 64.8 71.3 76.7
QIIME-sm 99.0 99.1 99.0 98.9 64.8 70.5 78.0
QIIME-blast 99.5 99.5 99.5 99.9 82.9 86.7 99.2
GAST 99.7 99.8 99.9 99.9 91.6 95.0 97.7
mothur-knn 88.7 93.3 94.8 95.2 93.4 94.3 96.7
Table SN12.1. Sensitivity with LKR=genus on RDP14 (16S) and War4 (ITS). The table
shows sensitivity (defined in Note 8) as a percentage with LKR=genus for predicted ranks
from genus to phylum. LKR=genus was chosen as representative of the in vivo samples
(Note 7). Sensitivities <75% are highlighted in yellow, <50% in orange. The complete
matrices for sensitivity, overclassification and misclassification for all pairs (prediction
rank, LKR) are included in the Supplementary Files. The V5 region of 16S was truncated to
120nt to simulate reads obtained by older NGS machines. The V4 region (~250nt) is
popular with current sequencing technologies. The V3V5 region (~520nt) was sequenced
on older 454 machines and models the longer reads which will be achieved by NGS
machines in the near future.
Predicted genus    LKR=genus (Mis.)  LKR=family (Over.)  LKR=order (Over.)  LKR=class (Over.)
UTAX                     1.6               19.4                 5.1                1.1
RDP_80                   3.5               31.1                 9.5                3.0
RDP_50                   7.8               66.8                30.6               21.1
QIIME-uc                10.3               66.5                51.3               28.3
QIIME-sm                 7.7               61.1                48.3               26.6
QIIME-blast             11.2               99.0                95.3               90.6
GAST                     7.8               88.6                87.6               85.4
mothur-knn               0.1                5.5                 7.5                2.9

Predicted family   LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Over.)  LKR=class (Over.)
UTAX                     1.1                6.4                31.5                7.6
RDP_80                   0.9                3.6                30.1               11.6
RDP_50                   1.5                9.6                59.8               46.9
QIIME-uc                 2.9               16.0                65.1               34.6
QIIME-sm                 2.5               15.4                65.3               36.4
QIIME-blast              2.3               24.1                95.3               90.6
GAST                     1.4               18.1                95.1               93.1
mothur-knn               0.2                2.1                32.9               16.7

Predicted order    LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Over.)
UTAX                     0.5                1.9                 9.9               33.0
RDP_80                   0.4                1.2                 6.6               24.5
RDP_50                   0.5                3.4                13.8               67.9
QIIME-uc                 1.1                4.5                11.6               36.2
QIIME-sm                 0.9                4.2                11.3               38.1
QIIME-blast              1.0               11.4                25.9               90.6
GAST                     0.5                6.8                22.1               95.4
mothur-knn               0.1                0.9                 8.7               36.4

Predicted class    LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Mis.)
UTAX                     0.1                0.6                 4.6                6.1
RDP_80                   0.0                0.2                 1.8                1.3
RDP_50                   0.1                0.5                 4.4                4.7
QIIME-uc                 0.1                0.3                 2.1                3.1
QIIME-sm                 0.1                0.4                 1.5                2.9
QIIME-blast              0.5                6.2                14.9               24.4
GAST                     0.1                2.0                 8.2               10.0
mothur-knn               0.0                0.4                 2.0                4.2

Predicted phylum   LKR=genus (Mis.)  LKR=family (Mis.)   LKR=order (Mis.)   LKR=class (Mis.)
UTAX                     0.0                0.5                 0.8                0.5
RDP_80                   0.0                0.0                 0.1                0.0
RDP_50                   0.0                0.8                 0.9                1.7
QIIME-uc                 0.0                0.1                 0.1                0.0
QIIME-sm                 0.0                0.0                 0.1                0.0
QIIME-blast              0.2                3.4                 4.0               13.8
GAST                     0.0                1.5                 2.2                1.9
mothur-knn               0.0                0.3                 0.3                0.5
Table SN12.2. Error rates measured on the RDP14 V4 region. The table shows
misclassification (Mis.) and overclassification (Over.) error rates as percentages for
predicted ranks from genus to phylum as defined in Note 8. LKRs from genus to class are
shown as novel phyla are rare in practice. When the predicted rank is below the LKR, errors
are overclassifications, and when the predicted rank is at or above the LKR, errors are
misclassifications (Note 8). Error rates ≥10% are highlighted in yellow, ≥20% in orange
and ≥50% in red.
Note 13. Construction of a rank split.
Fig. SN13.1. Rank split with LKR=family. The reference database is divided into two
subsets X and Y, colored gold and blue respectively in the figure, such that LKR=family. For
each family, its genera are assigned at random to X or to Y. At least one genus from each
family must be assigned to X and at least one to Y, so that every family is represented in
both subsets; no genus is present in both. Families
containing only one genus are discarded. With LKR=family, ranks of family and above are
always known (i.e., present in both X and Y) while ranks of genus and below are always
novel (i.e., not present in both).
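The split construction can be sketched as follows (a hypothetical illustration of the procedure described in the caption; the input format and function name are assumed):

```python
import random

def rank_split(families, seed=0):
    """Split a {family: [genus, ...]} mapping into subsets X and Y such
    that every family appears in both subsets but no genus appears in
    both, giving LKR=family for queries drawn from one subset against
    the other. Families with a single genus are discarded."""
    rng = random.Random(seed)
    x, y = {}, {}
    for family, genera in families.items():
        if len(genera) < 2:
            continue  # cannot be represented in both subsets
        genera = genera[:]
        rng.shuffle(genera)
        # cut point guarantees at least one genus on each side
        cut = rng.randint(1, len(genera) - 1)
        x[family] = genera[:cut]
        y[family] = genera[cut:]
    return x, y
```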
Note 14. Sensitivity vs. EPQ plots for similarity measures.
I chose dUTAX as the similarity measure after investigating alternatives. I noted that a good
measure will sort true positives ahead of false positives when predictions are sorted by
decreasing similarity (Fig. SN9.2). I therefore sorted all pairs in RDP14 and plotted
sensitivity (number of pairs with common genus and similarity ≥D divided by the number
of pairs with a common genus) vs. EPQ (number of pairs with different genera and
similarity ≥D divided by number of pairs) for each value of D for several different similarity
measures for the V4 region (Fig. SN14.1), giving a method for evaluating measures
independent of the rank-split benchmark methodology.
I defined the nearest neighbor NNk(R) of a reference sequence R to be its nearest neighbor
at rank k, i.e. the sequence in the reference database with highest similarity to R and a
different name at rank k. For a pair of sequences Q and R, I defined the unique word
similarity U(Q, R) as Eq.5 in the main text and Id(Q, R) as the identity calculated from a
global alignment, i.e. the number of columns containing identical letters divided by the
alignment length after discarding any columns containing terminal gaps. I defined IdUTAX as
follows,
IdUTAX(Q, R, k) = (2 Id(Q, R) – Id(R, NNk(R))) / 2.
Sensitivity vs. EPQ curves for these measures are plotted in Fig. SN14.1, showing that dUTAX
is the most accurate measure because its curve is lowest, implying that it has a lower error
rate at most sensitivity values. Surprisingly, the ranking from best to worst is dUTAX > U >
Id ≈ IdUTAX. Alignment-based measures are typically expected to be more accurate than
word-counting measures, whose use is usually motivated by computational efficiency with
the expectation that there will be some reduction in accuracy. I do not have a good
explanation for this result, but speculate that it may be related to the fact that for a given
number of gaps U has a higher value when the gaps are contiguous than when they are
spaced apart, while alignment identity by the usual definition gives the same value
regardless of where the gaps appear. Contiguous gaps are probably due to a single
mutation (insertion or deletion with length > 1), while gaps that are not adjacent are
probably due to multiple events, indicating a larger evolutionary distance. This suggests
trying a modified definition of
alignment identity to count gaps differently or using a likelihood-based measure that
calculates a log-odds score with affine gap penalties. However, this approach would require
calculating thousands of alignments per query to reach the nearest neighbor at phylum
rank, which would be intractably slow for a practical high-throughput tool.
Fig. SN14.1. Genus sensitivity vs. EPQ plot for four similarity measures. The figure
shows sensitivity vs. EPQ as percentages measured on the V4 region of RDP14.
Note 15. Non-predictions and blank names.
UTAX estimates the probability (P) that a query sequence has the predicted name at each
rank. Suppose P=0.8 for genus and P=0.95 for family. A common heuristic for processing
bioinformatics predictions is to set a confidence threshold (e.g., a maximum BLAST E-
value). Predictions below the threshold are discarded completely, predictions above the
threshold are kept and may be regarded as equally "correct" as a necessary or convenient
simplification for further analysis. This approach can be applied to UTAX predictions, as I
did in the validation tests by setting a threshold of P≥0.9. The simplest way to implement a
threshold is to omit the genus name or to use a reserved string meaning "not predicted"
(e.g., blank), depending on the file format. A similar approach is natural for implementing
an RDP bootstrap cutoff.
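The cutoff heuristic can be sketched as follows (a hypothetical illustration; the function and its input format are mine, not part of the UTAX implementation):

```python
def truncate_prediction(names, probs, threshold=0.9):
    """Keep predicted names from the highest rank downward while
    P >= threshold; names at and below the first low-confidence rank
    are omitted, i.e. left as "not predicted". `names` and `probs` are
    parallel lists ordered from phylum toward species."""
    kept = []
    for name, p in zip(names, probs):
        if p < threshold:
            break
        kept.append(name)
    return kept
```

An RDP-style bootstrap cutoff can be implemented in the same way, substituting bootstrap confidence values for P.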
UTAX estimates the probability P(x) that the true name is x. This is equivalent to the
probability P(!x) = 1 – P(x) that the true name is not x. If the name is not x, it may be a
different name, so it does not follow that the taxon is not named or novel (missing from the
database). For example, if the query has ~95% identity with reference sequences for two
different genera x and y, but all other known genera are <80% id, then combining
probabilities for LCRs from all reference sequences might give posteriors P(x) = 1/3, P(y) =
1/3, P(all other genera) ≈ 0, which in turn implies P(genus is not in database) = 1/3. As
implemented, UTAX reports only the name in the top hit, i.e. P(x) = 1/3. Therefore, low P
should not be interpreted as a prediction that the taxon is not found in the database -- in
this case, the theory predicts that the taxon is known with 2/3 probability and unknown
with 1/3 probability, but (i) these probabilities are not calculated by the current
implementation, and (ii) these probabilities cannot be reported in a file format with
one name per rank. Thus, the implementation of UTAX described here does not make
negative predictions in the sense of a binary classifier, and it would be a conceptual
mistake to classify any of its predictions as true negatives.
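The arithmetic in the two-genus example can be made explicit. A minimal sketch using the posteriors from the text (an illustration of the theory, not the UTAX implementation):

```python
# Toy posteriors from the example above: genera x and y each have ~95%
# identity to the query, all other known genera are far below (<80% id),
# so their combined posterior is ~0.
posterior = {"x": 1 / 3, "y": 1 / 3}

p_known = sum(posterior.values())   # P(genus is in the database) = 2/3
p_novel = 1.0 - p_known             # P(genus is not in database) = 1/3

# As implemented, UTAX reports only the top hit's name and its probability;
# p_known and p_novel are not reported.
top_name = max(posterior, key=posterior.get)
top_p = posterior[top_name]         # 1/3: low P, yet the genus is more
                                    # likely known than novel
```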
Setting a confidence threshold is an all-or-nothing approach to prediction that discards
potentially useful information. For example, if we are interested in the frequencies of
known genera in a sample, then a better estimate can be made by keeping the probabilities.
For example, if there are 10 different OTUs with P(Streptococcus) = 0.8, then it would be
better to predict that eight of them contain Streptococcus rather than all or none of them.
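The frequency estimate described above is simply a sum of probabilities rather than a count of thresholded predictions. A minimal sketch of the Streptococcus example, with the probabilities taken from the text:

```python
# Ten OTUs, each with P(Streptococcus) = 0.8. The expected number of
# OTUs containing Streptococcus is the sum of the probabilities (8),
# while an all-or-nothing cutoff of 0.9 reports either all or none.

otu_probs = [0.8] * 10

expected = sum(otu_probs)                                 # ~8.0
cutoff = 0.9
thresholded = sum(1 for p in otu_probs if p >= cutoff)    # 0 at this cutoff
```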
A low confidence value for a predicted name does not necessarily indicate high uncertainty
due to low identity with the database; it may also indicate an unavoidable ambiguity. For
example, 182 / 1,948 (9%) of genera in RDP14 have a V4 sequence that is also found in
another genus. If the query has a sequence found in more than one genus, it is impossible in
principle to predict its genus name with certainty. A prediction could report multiple genus
names with an indication that the sequence is known but has multiple labels (Fig. SN6.1),
but current file formats and downstream analysis tools do not allow this.
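The ambiguity described above can be sketched as a scan of the reference set for sequences carrying more than one genus label. The records below are invented stand-ins for RDP14 V4 sequences; the Escherichia/Shigella pair echoes a well-known case of near-identical sequences in distinct genera:

```python
# Find reference sequences shared by more than one genus. A query whose
# sequence appears in the "ambiguous" set cannot, even in principle, be
# assigned a single genus name with certainty.

from collections import defaultdict

records = [
    ("Escherichia",   "ACGTACGT"),   # invented sequences for illustration
    ("Shigella",      "ACGTACGT"),   # same sequence, different genus
    ("Streptococcus", "GGGTACCC"),
]

genera_by_seq = defaultdict(set)
for genus, seq in records:
    genera_by_seq[seq].add(genus)

ambiguous = {s: g for s, g in genera_by_seq.items() if len(g) > 1}
```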
With RDP14, all reference sequences have genus names. Other reference databases may
omit genus or other ranks because the sequence is annotated to fall outside a named clade,
e.g. because it belongs to an unnamed genus. The genus in this case is known (in the sense
of having a sequence in the database), but does not have a name.
Missing names also occur in SILVA and Greengenes annotations. Is a missing name a
positive prediction that the sequence is not in a named clade, or could it be a borderline
case, e.g. because the branching order of the tree has low bootstrap confidence? To the best
of my knowledge, this is not documented.
With the exception of RDP and UTAX, all of the tested prediction methods omit names or
predict blank names in some cases, but do not document how these should be interpreted
for further analysis. Is a blank a positive prediction that the sequence is not in a named
clade, or could it be an ambiguous or low-confidence case? Were these methods designed
with awareness that most genus or species names in the database are blank?
Note 16. LCR estimates and OTU error rates
I estimated the number of OTUs with novel genera as the number of OTU sequences with
<95% identity to the reference database. This approach requires a low rate of OTU errors
because incorrect sequences will have underestimated identities with the database. On
mock community tests using MiSeq 2×250 paired reads of the V4 region (similar to the
Kozich et al. V4 data used here for testing), I previously showed that all OTUs generated by
UPARSE were either error-free biological sequences from the designed community or
identifiable as contaminants16. I further demonstrated the high specificity of the UPARSE
pipeline by constructing OTUs from R2 (reverse) reads, which had substantially higher
error rates than the R1 (forward) reads. On three out of four mock samples, I found that all
R2 OTUs were error-free sequences from the designed community (supp. ref. 15, Table
SN1.4). While errors in the contaminant sequences cannot be identified with certainty
(because the contaminants are not known independently of their OTU sequences), most
likely all of the UPARSE OTU sequences on the mock community tests were error-free, and
certainly a large majority were. To further improve specificity, for the results
reported in this paper I used an updated version of UPARSE incorporating expected error
quality filtering, which dramatically reduces the number of bad reads17.
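The novel-genus estimate at the start of this note can be sketched as a simple count. The identity values below are invented; in practice each would be the identity of an OTU sequence to its top hit in the reference database:

```python
# Count OTUs whose best identity to the reference database falls below
# the 95% genus threshold of Yarza et al.; these are candidates for
# novel genera. The identities are hypothetical stand-ins.

otu_top_hit_identities = [0.99, 0.97, 0.93, 0.88, 0.96]

GENUS_CUTOFF = 0.95
novel_genus_otus = sum(1 for ident in otu_top_hit_identities
                       if ident < GENUS_CUTOFF)
```

OTU sequence errors would depress these identities and inflate the count, which is why this estimate depends on a low OTU error rate.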
In the Kozich et al. data, diversity is higher and there is a longer tail of low-abundance
OTUs, so results on mock tests should be extrapolated with caution. However, while the
UPARSE pipeline may discard some valid sequences due to stringent filtering, and may
merge multiple species with similar sequences into a single OTU, I am not aware of any
mechanism that would tend to cause an increase in the number of OTU sequence errors
with higher diversity.
While there may be a small bias due to OTU sequence errors causing underestimated
identities, the 95% threshold for genus was obtained by Yarza et al. for full-length
sequences and is probably too high for the V4 region. Using the V4 region of RDP14, I found
that P(LCR > genus | 95%) = 0.66, i.e. at 95% identity the lowest common rank is more likely
than not higher than genus. Thus, the Yarza estimate of novel genus frequency is probably conservative,
even if a low rate of spurious OTUs is present.
Supplementary references
1. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D.
Development of a dual-index sequencing strategy and curation pipeline for analyzing
amplicon sequence data on the Illumina MiSeq sequencing platform. Appl. Environ.
Microbiol. 79, 5112–5120 (2013).
2. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and
aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35,
7188–7196 (2007).
4. Poincaré, H. Science et méthode. (1908).
5. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological
and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012).
6. Yilmaz, P. et al. The SILVA and ‘all-species Living Tree Project (LTP)’ taxonomic
frameworks. Nucleic Acids Res. 42, (2014).
7. Anonymous. Bergey's Manual of Systematic Bacteriology. (2001).
doi:10.1016/0769-2609(87)90099-8
8. Parte, A. C. LPSN - List of prokaryotic names with standing in nomenclature. Nucleic
Acids Res. 42, (2014).
9. Escobar-Páramo, P., Giudicelli, C., Parsot, C. & Denamur, E. The evolutionary history of
Shigella and enteroinvasive Escherichia coli revised. J. Mol. Evol. 57, 140–148 (2003).
10. Ashelford, K. E., Chuzhanova, N. A., Fry, J. C., Jones, A. J. & Weightman, A. J. New
screening software shows that most recent large 16S rRNA gene clone libraries contain
chimeras. Appl. Environ. Microbiol. 72, 5734–41 (2006).
11. Wilm, A., Mainz, I. & Steger, G. An enhanced RNA alignment benchmark for sequence
alignment programs. Algorithms Mol. Biol. 1, 19 (2006).
12. Felsenstein, J. Inferring Phylogenies. Am. J. Hum. Genet. 74, 1074 (2004).
13. Philippe, H. et al. Resolving difficult phylogenetic questions: Why more sequences are
not enough. PLoS Biol. 9, (2011).
14. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid
assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ.
Microbiol. 73, 5261–7 (2007).
15. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea
using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
16. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads.
Nat. Methods 10, 996–8 (2013).
17. Edgar, R. C. & Flyvbjerg, H. Error filtering, pair assembly and error correction for next-
generation sequencing reads. Bioinformatics 31, 3476–3482 (2014).