MUSiCC: a marker genes based framework for metagenomic ... · each sample, and some normalization method is required to allow comparison across samples. This is most com-monly achieved

Manor and Borenstein Genome Biology (2015) 16:53 DOI 10.1186/s13059-015-0610-8

RESEARCH Open Access

MUSiCC: a marker genes based framework formetagenomic normalization and accurateprofiling of gene abundances in the microbiomeOhad Manor1 and Elhanan Borenstein1,2,3*

Abstract

Functional metagenomic analyses commonly involve a normalization step, where measured levels of genes orpathways are converted into relative abundances. Here, we demonstrate that this normalization scheme introducesmarked biases both across and within human microbiome samples, and identify sample- and gene-specific propertiesthat contribute to these biases. We introduce an alternative normalization paradigm, MUSiCC, which combines universalsingle-copy genes with machine learning methods to correct these biases and to obtain an accurate and biologicallymeaningful measure of gene abundances. Finally, we demonstrate that MUSiCC significantly improves downstreamdiscovery of functional shifts in the microbiome.MUSiCC is available at http://elbo.gs.washington.edu/software.html.

BackgroundThe study of naturally occurring microbial communitiesthrough shotgun metagenomic assays has become a rou-tine procedure in recent years [1-6]. Such assays areused, for example, to catalog the collection of genes inthe metagenome, to estimate their abundances, and ul-timately, to characterize the functional capacity of thecommunity [1,3,7,8]. This process involves two steps.First, genomic DNA is extracted from the sample andsequenced using next-generation technologies. Next, se-quenced reads are aligned to a database of referencegenes or genomes, and the number of reads that map toeach gene is used as a proxy for its abundance in thesample [7,9,10]. Clearly, however, the resulting readcounts are highly dependent on the sequencing depth ineach sample, and some normalization method is requiredto allow comparison across samples. This is most com-monly achieved by a simple compositional normalizationprocess, whereby the obtained abundance value associatedwith each gene is divided by the sum of abundance valuesfor all genes identified in the sample (for example, [2,11]).

* Correspondence: [email protected] of Genome Sciences, University of Washington, Seattle, WA98195, USA2Department of Computer Science and Engineering, University ofWashington, Seattle, WA 98195, USAFull list of author information is available at the end of the article

© 2015 Manor and Borenstein; licensee BioMeCreative Commons Attribution License (http:/distribution, and reproduction in any mediumDomain Dedication waiver (http://creativecomarticle, unless otherwise stated.

The resulting normalized value therefore represents ameasure of relative abundance and is used in subsequentcomparative analyses of the samples.This normalization scheme, however, while extremely

prevalent, has several fundamental weaknesses that mayinfluence downstream analysis and ultimately impact theidentification of functional shifts across samples. First,the resulting relative abundance values are unitless anddo not necessarily represent a meaningful biologicalquantity. Second, in this normalization scheme, thescaled abundance of each gene crucially depends on themeasured abundances of all other genes. As many differ-ent sample-specific factors could affect these quantities,abundance values could be disproportionately scaled indifferent samples, dramatically biasing any downstreamcomparative analysis. Compositional normalization isalso associated with several statistical drawbacks andmay give rise to misleading patterns [4,12]. For example,as a marked increase in the abundance of one elementdecreases the apparent relative abundance of other in-variant elements, this normalization scheme tends to in-duce spurious correlations between various elements inthe sample. As a result, comparative analyses of thesevalues across samples may be hard to interpret. Thesedrawbacks call for an alternative normalization proced-ure, one that can produce accurate and easy to interpret

d Central. This is an Open Access article distributed under the terms of the/creativecommons.org/licenses/by/4.0), which permits unrestricted use,, provided the original work is properly credited. The Creative Commons Publicmons.org/publicdomain/zero/1.0/) applies to the data made available in this

http://elbo.gs.washington.edu/software.htmlmailto:[email protected]://creativecommons.org/licenses/by/4.0http://creativecommons.org/publicdomain/zero/1.0/

Manor and Borenstein Genome Biology (2015) 16:53 Page 2 of 20

abundance measures that can be reliably comparedacross samples.Notably, a few previous metagenomics-based studies

have already highlighted the challenges involved in com-positional normalization. Specifically, studies of speciescomposition have previously demonstrated that compos-itional normalization of taxonomic data could both masktrue correlations between pairs of taxa and introducefalse correlations [13-16]. Other studies of oceanic com-munities have further emphasized the biases introducedby compositional normalization of environmental meta-genomic samples, specifically highlighting the potentialcontribution of the average genome size in each sampleto these biases [17-19]. To date, however, the impact ofcompositional normalization on functional metagenomicstudies of the human microbiome has never been shownor characterized, nor have the various sample-specificproperties that may contribute to inaccuracies in abun-dance measures. Furthermore, previous studies ofenvironmental metagenomes that aimed specifically toaddress genome-size induced bias still failed to providebiologically meaningful and interpretable measures ofgene abundance.Finally, even within each sample, various gene-specific

properties may bias measured abundances. Compositionalnormalization, or for that matter, any normalization schemethat applies an identical processing protocol to all genes,will inevitably fail to account for such errors. Indeed, todate, no attempts to characterize or correct within-samplebiases in genes’ abundances have been introduced,potentially neglecting important factors that may fur-ther contribute to inaccuracies in gene abundancemeasures.In this study, we analyze samples from the Human

Microbiome Project (HMP) [2,7,11,20], as well as sam-ples from two additional independent studies of the hu-man gut microbiome [4,11,21], and demonstrate thatcompositional normalization has a clear and measurableeffect on the obtained metagenomic functional profiles.We specifically show that this normalization protocol in-duces spurious inter-sample variation in the calculatedabundances of various genes across samples from thesame body site, across samples from different body sites,and across samples from different studies. We identifythree sample-specific properties that play a key role ingenerating this spurious variation, including the averagegenome size, species richness, and mappability of ge-nomes in the sample, and suggest a simple and morebiologically meaningful normalization method that aimsto quantify the typical genomic copy number of eachgene in the sample. We additionally show that gene-specific properties, such as sequence conservation andnucleotide content, further induce spurious intra-samplevariation in the measured gene abundances, and provide

an additional machine learning-based method to correctthese inaccuracies. Finally, we demonstrate that our cor-rection paradigm indeed provides improved abundanceestimates and has clear and significant benefits fordownstream comparative analyses. Specifically, we showthat our method aids in the discovery of differentiallyabundant genes (and corrects false discovery of invariantgenes), markedly increases the power to detect disease-associated pathways, and supports cross-study compara-tive analyses. Combined, these benefits allow us to movetowards a more rigorous and unbiased estimate of theaverage genomic copy number of each gene in humanmicrobiome samples, facilitating an accurate identifica-tion of functional shifts that may be associated withdisease.

Results and discussionSpurious inter-sample variation in HMP samples and itsdeterminantsTo examine the impact of compositional normalizationon the analysis of human microbiome samples, we firstidentified a set of 76 universal single-copy genes (USiCGs;see Methods and Additional file 1: Table S1). These genesare found in the genomes of almost all microorganisms,and generally at a single copy and therefore represent aproxy for invariant genomic elements. Ideally, therefore,since the content of each metagenome represents a simplelinear combination of the genomic content of the constitu-ent species, one would expect such USiCGs to also be in-variant across metagenomes. Put differently, comparativemetagenomic analyses, which aim to identify shifts in thegenic composition of the microbiome, could be consideredaccurate only if genes that are universally present in everymember species in every sample exhibit no variation acrosssamples. However, examining the relative abundancesof USiCGs in metagenomic samples from the HMP(Methods), we found a marked variation both betweenand within body sites (Figure 1; Additional file 2:Figure S1). For example, the median relative abun-dance of USiCGs in tongue dorsum samples is onaverage 1.3-fold higher than the median relative abun-dance of these genes in retroauricular crease samples.Similarly, the median relative abundance of USiCGs insome stool samples is 1.9-fold higher than in otherstool samples. Importantly, this spurious variation inthe observed abundance of USiCGs indicates that thecalculated abundances of other genes may also bebiased, potentially affecting any downstream compara-tive analysis and the correct identification of differen-tially abundant pathways.Notably, the variation in the abundance of the various

USiCGs appears to be consistent and sample-specific; incertain samples the abundances of all USiCGs tend to behigh whereas in others the abundances of all USiCGs

A

B

Figure 1 Spurious inter-sample variation across HMP metagenomicsamples. (A) Variation in the relative abundance of USiCGs acrossdifferent body sites (P

Table 1 Correlations and controlled correlations betweenthe abundance of USiCGs and various sample-specificproperties in the stool samples

Sample-specificproperty

Correlation withUSiCGs abundance

Controlled correlationwith USiCGs abundancea

Genome size -0.82 (P

Table 2 Correlations between sample-specific properties

Property 1 Property 2 Correlation

Genome size Mappability 0.8 (P


gut microbiome. Focusing on a single body site also allowsus to assess the performance of MUSiCC on a morehomogenous set of samples and therefore to evaluateMUSiCC’s ability to correct even the relatively fine vari-ation observed in such samples. Notably, however, as theactual average copy number of each gene in these samplesis unknown, evaluating the impact of inter-MUSiCC andassessing whether it produces reasonable estimates of aver-age copy numbers in these metagenomes is a challengingtask.To address this challenge, we considered the set of genes

that occur in only one OTU per sample (Methods). Forsuch OTU-Specific Genes (OSGs), the relationship betweenthe abundance of each gene and the abundance of the cor-responding OTU is not complicated by the presence ofother OTUs in the sample, making it easier to evaluate theimpact of Inter-MUSiCC on the corrected abundancevalues. Specifically, as each OSG occurs in only one OTU,clearly, the abundance of the OSG across the various sam-ples should positively correlate with the abundance of therespective OTU. If Inter-MUSiCC indeed provides a moreaccurate estimation of gene abundance compared to com-positional normalization, this correlation between OSGs’and OTUs’ abundances should improve with the applica-tion of Inter-MUSiCC. Since many OTUs observed in eachsample are not yet associated with a fully sequenced gen-ome, we used a recently introduced tool to predict the gen-omic content of each OTU (Methods) [24]. In total, weidentified 3,821 OSGs in 993 OTUs across 65 HMP stoolsamples. For each such OSG, we calculated the correlationbetween its abundance across the various samples and theabundance of its respective OTU with and without the ap-plication of Inter-MUSiCC. We found that Inter-MUSiCCindeed significantly increased the average correlation be-tween the abundances of OSGs and their respective OTUs(P


represent true variation in the composition of the sam-ple. Moreover, the consistency in this variation suggeststhat it can be attributed to gene-specific properties thatsystemically bias the measured abundance.To test this hypothesis, we collected a set of 35 gene-

specific properties, including various conservation fea-tures (across the gene orthology group) and nucleotidecontent measures (Additional file 5: Table S2; Methods).Focusing again on HMP stool samples, we identified asubset of these properties that were significantly corre-lated with intra-sample variation in USiCGs (measuredas the fold-change between the abundance of eachUSiCG and the average abundance of all USiCGs inthe sample; Additional file 5: Table S2). For example,USiCGs with high GC content tended to exhibit lowerabundance compared to USiCGs with low GC content.Similarly, USiCGs with highly variable length (measuredas the standard deviation in the length of the gene acrossall members of the gene orthology group) were morelikely to have lower abundance compared to USiCGswith a more consistent length. To examine the predict-ive power of such gene-specific properties on spuriousintra-sample variation and to account for the potentiallystrong dependencies between the various properties, wefurther used a machine learning approach with L1regularization to obtain for each sample a linear modelthat links these gene-specific properties to observed vari-ation (Methods). Remarkably, we found that this linearmodel correctly predicts and can correct on average>50% of the observed intra-sample variability in USiCGs’abundance on held-out test data across HMP stoolsamples (Figure 5A). Moreover, examining which gene-specific properties were selected by the linear model in eachsample, we found a clear agreement between models acrossthe various samples (Figure 5B), highlighting the robustness

Figure 4 Spurious intra-sample variation in the abundance ofUSiCGs is consistent across samples. The heatmap illustrates theabundance of each of the 76 USiCGs (x-axis) across 134 HMP stoolsamples (y-axis). The color gradient denotes the abundance aftercorrecting for spurious inter-sample variation (using Inter-MUSiCC),and hence represents the average copy number of the USiCG inthe sample.

of this model. Among the properties that were repeatedlyselected by the model were the median GC nucleotidecontent of the gene orthology group and the number ofspecies in which this gene could be detected using KEGG’shidden-Markov model.

Intra-MUSiCC: correcting spurious intra-sample variation andevaluating its impact on functional metagenomic profilesOur findings above suggest that spurious intra-samplevariation in USiCGs could be corrected by learning apredictive linear model that accounts for the impact ofvarious gene-specific properties on measured gene abun-dances. Assuming that such spurious intra-sample vari-ation also occurs in other, non-USiCG genes, we can usethe USiCGs-based model as a proxy for the effect ofgene-specific properties and apply this model to correctthe measured abundances of all other genes in the sam-ple. Since this method again relies on the set of USiCGsidentified above, we term it ‘Intra-sample MetagenomicUniversal Single-Copy Correction’ (Intra-MUSiCC). Not-ably, this predictive model can be learned once (for ex-ample, using a large set of metagenomic samples) andapplied to every new sample without modification. In whatfollows, however, we used a somewhat more sophisticatedapproach wherein a predictive model is learned in eachsample separately (based on the abundances of USiCGs inthat sample), cross-validated on unseen USiCGs abun-dances, and used to correct variation within that sample.This approach assumes that the impact of gene-specificproperties on measured abundances may slightly vary fromsample to sample and aims to capture the potentiallyunique effect of gene-specific properties in the sample. Weconfirmed, however, that the more generic approach, inwhich the predictive model is learned once and applied toall samples, produced qualitatively the same results as re-ported below.In the previous section, we already demonstrated that

Intra-MUSiCC corrects much of the spurious intra-sample variation in USiCGs’ abundances (see Figure 5).However, to confirm that Intra-MUSiCC is a useful andapplicable correction scheme, we next set out to validatethat it indeed yields improved abundance estimates alsofor non-USiCGs genes, focusing, as before, on the set ofHMP stool samples (see above). As was the case for ouranalysis of Inter-MUSiCC, this task is challenging sincethe true abundance values of non-USiCGs across sam-ples is not available. We therefore used several differentapproaches, each focusing on a specific set of genes, toevaluate the performance of Intra-MUSiCC and to con-firm that the corrected gene abundance values are in-deed more accurate and more informative than thewidely-used relative abundance values.First, we identified a set of 72 genes that did not meet

our criteria for USiCGs, but that could still be considered

A

B

Figure 5 L1 regularized linear modeling of spurious intra-sample variation in the abundance of USiCGs. (A) The proportion of explainedvariation (R2) in each HMP stool sample on held-out test data. (B) The weights (color gradient) assigned to the various gene-specific properties byan L1-regularized linear regression model across the different samples.


universal single-copy genes under a more relaxed set of re-quirements (Additional file 6: Table S3; Methods). Specific-ally, such ‘semi’-USiCGs tend to have a single copy in thevast majority of genomes, but are not as prevalent as theUSiCGs described above. While this set is expected to bemore variable than USiCGs (due to true variation in theiroccurrence across genomes), we hypothesized that at leastsome of the observed intra-sample variation among thesegenes might be spurious and accounted for by the variousgene-specific properties detected above. Indeed, we foundthat when using the USiCGs-based model (Intra-MUSiCC)to correct the abundance values of these genes across allHMP stool samples, 17% of the variation in these semi-USiCGs was corrected. Notably, this reduction in variationis observed even though these semi-USiCGs were neverused in the construction of the model.Next, we focused on pairs of genes whose occurrence

is highly correlated across genomes. Specifically, examin-ing the gene content of all reference genomes in KEGG[1,3], we identified 1,074 pairs of genes whose presence/absence profiles agree across >95% of the genomes inwhich they appear (Methods). Importantly, we excludedall USiCGs (as well as the semi-USiCGs describedabove) from this analysis. For each such pair, we com-puted the average absolute difference in abundanceacross HMP stool samples, before and after applyingIntra-MUSiCC. Clearly, high co-occurrence across ge-nomes does not necessarily imply perfectly correlatedabundance across metagenomes. A metagenome may,for example, include a relatively large proportion ofexactly those genomes in which the two genes do not

co-occur. Yet, it is reasonable to assume that such geno-mically co-occurring genes will tend to exhibit similarabundances across metagenomes. Accordingly, we hy-pothesized that the observed differences between theabundances of such gene pairs can be partly accountedfor by spurious intra-sample variation that arises fromgene-specific bias as opposed to true biological vari-ation. In line with this hypothesis, we indeed foundthat both the mean and variance of the pairwise differ-ences in the abundance of each of the 1,074 gene-pairswere significantly lower with the application of Intra-MUSiCC (P


variance explained within each cluster. We found thaton average Intra-MUSiCC corrected 8.2% of the variancewithin each gene cluster (Additional file 7: Table S4). Asan example, one such gene cluster contained all eightgenes in the F-type ATPase structural module. Althoughthese eight genes are highly consistent in their presence/absence patterns across reference genomes (median pair-wise Jaccard similarity of >98%), their abundances acrossHMP stool samples vary by up to approximately 2.1-foldon average (and >13.4-fold in some samples). Intra-MUSiCC (which notably employs a model learned onUSiCGs alone and for which these eight genes can beconceived as unseen data) corrected on average 32% ofthis variance.

MUSiCC markedly enhances the discovery of functionalshifts in the microbiomeAbove, we demonstrated that MUSiCC successfully cor-rects both inter- and intra-sample spurious variation ingene abundances. Clearly, however, the key goal of anymetagenomic normalization scheme is to facilitate com-parative analysis and to enable the discovery of func-tional shifts in the metagenome that may be associatedwith a given host phenotype. Next, we therefore set outto examine whether MUSiCC (that is, the combinedapplication of Inter-MUSiCC and Intra-MUSiCC) has aconcrete and significant impact on such comparativeanalyses and whether it affects the identified set ofdifferentially abundant genes or pathways in varioussettings.We first examined the impact of MUSiCC on the de-

tection of differentially abundant genes. Specifically, weused standard comparative analysis (Methods) to identifygenes that exhibit differential abundance between HMPstool samples and tongue dorsum samples, with andwithout the application of MUSiCC to correct the abun-dances of genes. Comparing the obtained sets of differ-entially abundant genes, we found an overall agreement,with 90% of the detected genes identified both with andwithout MUSiCC. Notably, however, there were alsosubstantial differences between the two sets, with 382genes found to be differentially abundant only whenusing MUSiCC and 343 genes found to be differentiallyabundant only when using standard compositionalnormalization (Figure 6A). Interestingly, among the 382genes identified only with MUSiCC, genes involved inLipopolysaccharide biosynthesis - a known gut metabolicpathway [9,10] - were over-represented (P

A B C

Figure 6 The impact of MUSiCC on the discovery of differentially abundant genes. The number of genes identified as differentiallyabundant between (A) HMP stool samples and tongue dorsum samples, (B) type 2 diabetes (T2D) cases and controls, and (C) inflammatorybowel disease (IBD) cases and controls, are illustrated. Each Venn diagram describes the overlap between the set of genes identified whenusing standard compositional normalization (cyan, left) and the set of genes identified when using MUSiCC (maroon, right). Pathways that areover-represented in the set of genes identified by only one of the two methods are listed.


previously reported as linked to T2D (for example, methanemetabolism [21] and glycolysis [13,15]). A similar patternwas observed in our analysis of the IBD dataset (Figure 7B).Specifically, all seven pathways identified as IBD-associatedwith compositional normalization were also identifiedwith MUSiCC, with six out of these seven pathways(86%) becoming more significant with the applicationof MUSiCC. Similarly, 28 additional pathways wereidentified only with the application of MUSiCC, in-cluding previously reported IBD-associated pathways(such as Riboflavin metabolism [17]).Considering the overall increase in the significance

level of disease-associated pathways observed above withMUSiCC (Figure 7A and B) and since many disease-related studies can obtain or process only a limitednumber of samples, we further examined whether MUSiCCcan indeed enhance the power to detect disease-associatedpathways when data are limited. We therefore repeated thecomparative analysis above, focusing on two disease-associated pathways of interest, and measured how signifi-cant the obtained association signal was when using only asubset of the available samples. Indeed, we found that withMUSiCC, significance is reached with far fewer samplesthan with compositional normalization (Figure 7C and D),suggesting that MUSiCC can successfully uncover noveldisease-associated pathways that may be masked by noiseor by sparse sampling when using standard compositionalnormalization. Using this disease-associated pathway dis-covery as the ultimate benchmark for the applicability ofthe various normalization methods, we additionally testedan alternative approach that was previously used to processa set of oceanic samples ([19]; Methods). We found that

this method outperformed the standard compositionalnormalization approach but was nonetheless far lesssuccessful than MUSiCC and still failed to identify manyof the relevant pathways as significant (Additional file 9:Figure S5).Finally, having shown that MUSiCC promotes the dis-

covery of disease-associated pathways (and with fewersamples), we set out to examine whether it can alsoallow researchers to combine data from multiple inde-pendent studies. Pooling sample sets across studiescould dramatically increase the amount of data availablefor future comparative analyses and has the potentialto significantly enhance efforts to discover disease-associated shifts in the microbiome. Specifically, thisapproach could be used to harness a large collectionof healthy samples (such as those obtained by HMP)as controls for many smaller disease-focused studies.Notably, however, we already identified above markedspurious variation between different studies of the hu-man gut microbiome (see Figure 3), suggesting that anaïve pooling of data from multiple studies without carefulcorrection of sample-specific biases may be challenging.Indeed, comparing the pathway-level functional profiles ofthe healthy samples from the three studies and performinga principal coordinate analysis (PCoA) to examine thevariation in this pooled set of samples, it was clear that thevast majority of the variation is study specific (Figure 8A).In such settings, using healthy samples from one study asthe set of controls for another study would potentiallymask genuine disease-associated shifts and would mostlyhighlight study-specific differences. To examine whetherMUSiCC can alleviate this problem, we again performed a

A B

C D

Figure 7 The impact of MUSiCC on the discovery of disease-associated pathways. Pathways identified to be associated with (A) T2D, orwith (B) IBD, when using standard compositional normalization (cyan) or MUSiCC (maroon). Bars denote the significance level of the association.The dots to the right of each bar indicate whether this association reached significance with FDR

A B

C

Figure 8 The impact of MUSiCC on pooling samples from multiple studies. (A) A principal Coordinate Analysis (PCoA) plot, illustrating thevariation within the pathway-level functional profiles of healthy individuals from three independent studies of the human gut microbiome, usingcompositional normalization. Each dot represents a single sample and the proportion of variance explained by each of the first two principal co-ordinates is indicated on the axes’ labels. (B) A similar PCoA plot after the application of MUSiCC. (C) The number of T2D-associated pathwaysthat are recovered (see Methods) when using as control the original T2D study control samples (left), HMP samples (middle), or IBD control sam-ples (right). For each case, the number of pathways recovered with compositional normalization and with MUSiCC is illustrated.


controls (Figure 8C). Evidently, while inter-study variationis still a challenging problem, MUSiCC clearly and effect-ively improves our ability to pool samples from multiplestudies and to increase the power to detect biological path-ways associated with different diseases.

MUSiCC significantly reduces spurious variation insimulated bacterial samplesAbove, we demonstrated that MUSiCC reduced spuriousvariation and improved our ability to detect disease-associated pathways in real metagenomic datasets. Suchdatasets provide a means to assess the full range of fac-tors that potentially impact functional profile measure-ments. However, since the exact underlying taxonomicand functional compositions in these real datasets areunknown they cannot serve as a gold standard for evalu-ating our method or for comparison. We thereforewanted to further examine the ability of MUSiCC to re-duce spurious variations on a synthetic dataset, where

the true abundances of genes and pathways are available.To this end, we generated 20 simulated metagenomicsamples, each of which consisted of 500,000 readsgenerated at random from a mock community of ref-erence genomes that were randomly assigned differentrelative abundances (see Methods and Additional file10: Table S5). We then mapped the reads in each sam-ple to the KEGG database and calculated the readcount of each KEGG Orthology group (KO) in eachsample. We finally normalized and corrected the ob-tained KO abundances using either MUSiCC or stand-ard compositional normalization.First, we compared the calculated abundance of each KO

(using either MUSiCC or compositional normalization)with its real underlying average copy number in each sam-ple, to examine the variation in estimated abundancesacross samples. We found that with compositionalnormalization, KOs with identical average copy numbersin different samples exhibit a wide range of normalized


abundances across the 20 different simulated communities(Figure 9A). In sharp contrast, correcting KO abundanceswith MUSiCC resulted in a markedly narrower range ofestimated abundance values for KOs with the sameaverage copy number (Figure 9B). Moreover, as dem-onstrated in Figure 9B, MUSiCC not only reducedspurious inter-sample variation, but also provided anaccurate estimation of the average copy number valuesof the various KOs, offering clear and biologically in-terpretable abundance values.Next, since many comparative analyses of metagen-

omes are performed at the pathway level (for example,by summing the abundances of all the KOs associatedwith a given pathway), we wanted to specifically examinethe impact of MUSiCC on spurious pathway abundancevariation. To this end, in generating the simulated sam-ples discussed above, we intentionally limited our selec-tion of bacterial genomes to those that contain theentire set of flagellar assembly genes with the same exactcopy numbers. Accordingly, any observed variation in

A

C

Figure 9 Evaluation of different normalization methods across 20 simthe average copy number of each KO in each sample to the corrected abunormalization (A) and by MUSiCC (B). Each sample is represented by a diffmarkedly reduced variability between the slopes of the regression lines infor compositional normalization) highlights the beneficial impact of MUSiCin inferring underlying copy numbers. (C) Comparison of the observed coeassembly pathway across simulated samples using different normalization mexhibit no variation. Each dot represents the CoV calculated based on a ranof CoV values across 100 such subsets. Each box plot represents the 25th ato approximately ±2.7 standard deviations. The red dot represents the CoV

the flagellar assembly pathway across the various sam-ples is, by construction, spurious. Indeed (see Figure 9C),the coefficient of variation (CoV; standard deviation overmean) of the corrected abundance of this pathway acrosssamples was markedly lower with MUSiCC (CoV = 0.03)than with compositional normalization (CoV = 0.175).To allow a more rigorous statistical analysis of the per-formances of each normalization method and to quantifythe robustness of the various methods to subsampling,we further used a bootstrapping approach, repeatedlyselecting a subset of the simulated samples and calculat-ing the CoV observed in the abundance of this pathwayacross each subset. Comparing the distribution of CoVobtained for 100 subsets, we confirmed that inter-samplevariation is indeed significantly lower with MUSiCC thanwith compositional normalization (Figure 9C; P


obtained with compositional normalization (Figure 9C;P


only provide a more accurate gene abundance measure,but also has a significant impact on the discovery of dif-ferentially abundant genes and of enriched pathways.Specifically, we showed that our MUSiCC pipeline mark-edly enhances the identification of disease-associated path-ways, offering substantially increased statistical power bothin terms of the number of pathways identified and thenumber of samples required for identification. Perhapsmost remarkably, we found that MUSiCC facilitates effortsto pool data from several human microbiome studies, cor-recting much of the study-specific variation and laying thefoundation for future cross-study comparative analyses.Notably, without MUSiCC, such study-specific variationwas so dramatic that it masked practically any genuine, dis-ease specific variation in the data.In addition, we also evaluated MUSiCC using a set of

simulated samples in which the underlying copy numberof each gene in each sample is known. We demonstratedthat MUSiCC significantly reduced spurious inter-samplevariation both at the KO level and at the pathway levelcompared to compositional normalization. We furtherdemonstrated that using the average genome size in eachsample (either real or predicted) for normalization is notsufficient for removing inter-sample variation and thatMUSiCC significantly outperformed such a normalizationapproach even in the ideal case of simulated communities.This finding highlights the benefit of a marker gene basednormalization scheme and specifically the advantage ofusing USiGCs as a yardstick for copy number estimation,since it can capture the full range of factors that may intro-duce inter-sample variation. Moreover, it should be notedthat methods for estimating average genome size often em-ploy scenario-specific optimized parameters or requirecomplete alignment, whereas MUSiCC is a parameter-freemethod that relies solely on KO measured abundances.Clearly, the development of robust computational and

statistical methods for an accurate characterization ofgene abundances is an ongoing effort. Specifically, whileinter-MUSiCC potentially corrects most of the inter-sample variation introduced by sample-specific proper-ties, correcting gene-specific intra-sample biases is amuch more challenging task. For one, the set of gene-specific properties that could affect the measured abun-dance of a gene may be markedly larger than the set weanalyzed. Moreover, the directionality and magnitude ofsuch effects may not be consistent across differentgroups of genes (note, for example, that a USiCGs-basedmodel corrected >50% of the variation in unseen USiCGs,but


length. Sample and gene (KO) abundance data for the in-flammatory bowel disease study [4] were obtained from asubsequent analysis of these samples performed by [29].Sample and gene (KO) abundance data for the type 2 dia-betes study were obtained from [11].

PICRUSt dataPiCRUSt [24] pre-calculated matrices for operationaltaxonomic units (OTUs) and their predicted KOs weredownloaded from the developer’s github site. OTUabundances mapped to GreenGenes IDs in the HMPbody sites were obtained through personal communica-tion with the lead author of PICRUSt.

Detecting Universal Single-Copy Genes (USiCGs)The list of USiCGs was compiled to include KOs thatare both universal and appear in a single-copy in eachgenome. To determine the level of universality required,we used the list of 31 marker genes from the PhlyoSiftpipeline [30], and examined the number of KEGG ge-nomes in which each of these genes appear. We foundthat at minimum (after removing one outlier) thesegenes appear in 91.5% (2,313) of the bacterial and ar-chaeal genomes in KEGG, and therefore considered asuniversal any gene that appears in at least 91.5% ofKEGG genomes. Of these genes, we further selected thosethat had an average number of copies per genome


when Inter-MUSiCC was applied. The Statistical signifi-cance of the reduction in the mean and variance of thisdistribution were computed using the ‘ttest’ and ‘vartest2’functions in MATLAB.

Examining pairwise gene correlation structure withcompositional normalization and with Inter-MUSiCCWe first downloaded all fully sequenced and annotatedgenomes from KEGG (downloaded on 15 July 2013). Foreach gene (KO), we computed the Jaccard similarity be-tween its presence/absence pattern across these genomesand the presence/absence pattern of all other genes. Foreach gene, we then also computed the Pearson correl-ation between its abundance across HMP stool samplesand the abundances of all other genes. Given these twometrics, finally, for each gene, we computed the Pearsoncorrelation between its correlation with all other genesacross genomes and its correlation with all other genesin metagenomes. Distances and statistical significancewere computed using the ‘pdist’ and ‘ttest’ functions inMATLAB.

Gene-specific propertiesGene length and species statistics properties were down-loaded from KEGG. Gene length was calculated as theaverage length of all genes labeled with the associatedKO. Conservation and alignment properties were calcu-lated by first downloading the gene sequences associatedwith each KO from KEGG. Next, MAFFT [31] was runon the set of genes for each KO, and statistics were cal-culated from the MAFFT multiple alignment output.Nucleotide content properties (for example, mean GC%)were calculated from the set of gene sequences assignedto each KO in KEGG. KO recall and precision propertieswere obtained from a large-scale study of short read an-notation [32] and describe the average recall and preci-sion (across multiple genomes) in annotating simulatedshort shotgun reads originating from each KO.

Regularized linear model linking gene-specific propertiesto intra-sample variationFor a given sample, we first defined the observed re-sponse as the fold-change between the abundance ofeach USiCG and the mean abundance of all USiCGs inthe sample. The model covariates were defined for allsamples as the standardized values of the variousUSiCGs’ gene-specific properties (Additional file 5: TableS2). When learning the model for a specific sample, weused a strict five-fold cross-validation (CV) scheme tolearn an Elastic-Net regularized linear model [33-35]that predicts the fold-change of each USiCG in the sam-ple. Importantly, in each CV partition to training andtest sets, we first learned the penalty parameter andmodel weights using solely the training data (by using an

internal CV scheme with the MATLAB version ofglmnet [33]), and evaluated the performance of ourlearned model by quantifying the fraction of variationthe model explains when predicting the fold-change ofUSiCGs held-out as test data. Running time for thelearning step on a typical sample (approximately 13,000KOs) took less than 1s on a single core processor.

Identifying Semi-Universal Single-Copy genes (semi-USiCGs)Semi-USiCGs were selected as genes that were present inat least 2,148 (85%) of bacterial and archaeal genomes inKEGG with an average number of copies per genome 0.95 (that is, theyagree on at least 95% of the genomes they appear in)were selected, resulting in 1,074 such pairs. Clusters weredefined as sets of five or more genes, all with pairwiseJaccard similarly >0.95.

Identifying differentially abundant genesFor a given gene and two sets of metagenomic samples(for example, stool vs. tongue or IBD cases vs. controls),we compared the distribution of abundances betweenthe two sets using the Wilcoxon rank-sum test. Genesthat passed the Bonferroni correction (for comparingHMP body sites) or the FDR correction (for T2D andIBD cases vs. controls) with corrected P values


each sample, using the set of eight universal genes sug-gested in [19]. For each gene we estimated the averagegenome size, G, as G = (g + L-2 m)/f, where f denotes therelative abundance of the gene in the sample, L denotesthe sequence read length in the data (set to 75 bp forthe IBD dataset), and m denotes a minimum overlapparameter (set to 90 as in [19]). Next, we averaged theeight estimations and corrected gene abundances bymultiplying the relative abundance of each gene by thisestimated average genome size of the sample. Finally, weused these corrected abundances and applied the samepathway-level comparative analysis to identify IBD-associated pathways. Since this normalization approachrequires raw read counts, we could not apply it the otherdatasets analyzed in our study.

Measuring T2D-associated pathway recovery when usingT2D cases with HMP or IBD controlsFirst, we identified the original T2D-associated path-ways as described above. Next, we repeated the sameprocess but as controls used either HMP stool samplesor healthy samples from the IBD study. To computethe number of recovered pathways, we counted thenumber of pathways that were discovered both in theoriginal setting and in the cross-study setting. In orderto prevent a situation where high recovery stems fromthe identification of many pathways, we limited thenumber of discovered pathways in the cross-study set-ting to be the number of pathways discovered in theoriginal setting.

Simulating microbial samplesTo evaluate the performance of MUSiCC on a dataset inwhich the underlying KO and pathway abundances areknown, we generated a dataset of synthetic metagenomicsamples, following the procedure described in [32]. Spe-cifically, we generated 20 simulated metagenomic sam-ples, each of which consisted of 500,000 101 bp readsgenerated at random from a collection of reference ge-nomes that were randomly assigned different relativeabundances (up to 100-fold) in each sample. To facilitateanalysis of pathway level variation, we limited the set ofgenomes used to 21 bacterial genomes from KEGG thatcontained the entire set of KOs associated with the flagellarassembly pathway (Additional file 10: Table S5), and eachsample harbored 10 genomes randomly selected from thisset. We then mapped the simulated reads to known KOsand calculated the read count of each KO in each sample,as previously described [35]. The underlying true averagecopy number for each KO was calculated based on thegenomes included in each sample and weighted by theirrelative abundances.

Using average genome size estimation methods forsample normalizationWe utilized two previously introduced methods for esti-mating the average genome size in each simulated sam-ple. First, to apply the method introduced in Raes et al.[23], we calculated the marker gene density for eachsample. To allow the processing of these samples in apractical and reasonable time, we used our mBLASTxmapping to the KEGG database rather than re-blastingto the STRING database. As marker density, we usedthe total number of bases from all reads that matchedany marker gene, divided by the length of the gene foreach read mapped. Next, we plugged the resultingmarker gene density, x, into the following equation forcalculating the effective genome size (EGS):

EGS ¼ aþ b� L−c

x

where L is the read length (101 bp), and a, b, and c areparameters optimized in the original study (a = 21.2, b =4230, c = 0.733) as described in Raes et al. [23]. The ef-fective genome sizes obtained were highly correlatedwith the real average genome sizes in each sample (r =0.99; Pearson correlation). Using other sets of parame-ters described in the original study did not improve thiscorrelation. Finally, we used these calculated effectivegenome sizes to normalize the measured KO abun-dances in each simulated sample.Second, to apply the method introduced in [22], we in-

stalled the GAAS software (version 0.17). Since GAASrequires the complete BLAST alignment files, we ranBLAST for each simulated sample against the 21 ge-nomes in our simulation (this represents both a bestcase scenario and is feasible computationally), using thesame parameters as described in the original study:blastn -query < sample fasta file > -db < simulation ge-nomes nucleotide database > -out -outfmt 6 -evalue 0.001 -gapopen 5 -gapextend 2-word_size 11 -penalty −3 -reward 1. Next, we ranGAAS on the alignment results with the command: gaas-f < sample fasta file > -d -m -gt 0 -gp 0 -gs 0 -j 1, andobtained GAAS estimated average genome size in eachsample. Overall, the median error in average gnome sizeestimation was


KEGG. This threshold was selected based on the list of markers in PhyloSift(http://phylosift.wordpress.com/tutorials/scripts-markers/).

Additional file 2: Figure S1. Spurious inter-sample variation acrossHMP metagenomic samples in various body sites. See Figure 1B fordefinition of box and whisker plot.

Additional file 3: Figure S2. Spurious inter-sample variation in HMPtongue dorsum samples is correlated with sample-specific properties.(A) The relative abundance of USiCGs across HMP tongue dorsumsamples. See Figure 1B for definition of box and whisker plot. The medianabundance of USiCGs across tongue dorsum samples is correlated with(B) the average genome size; R = -0.6, P


19. Frank JA, Sørensen SJ. Quantitative metagenomic analyses based on averagegenome size normalization. Appl Environ Microbiol. 2011;77:2513–21.

20. Le Chatelier E, Nielsen T, Qin J, Prifti E, Hildebrand F, Falony G, et al.Richness of human gut microbiome correlates with metabolic markers.Nature. 2013;500:541–6.

21. Mathur R, Goyal D, Kim G, Barlow GM, Chua KS, Pimentel M. Methane-producinghuman subjects have higher serum glucose levels during oral glucose challengethan non-methane producers: a pilot study of the effects of enteric methanogenson glycemic regulation. Res J Endocrinol Metab. 2014;2:2.

22. Angly FE, Willner D, Prieto-Davó A, Edwards RA, Schmieder R, Vega-Thurber R, et al. The GAAS metagenomic tool and its estimations ofviral and microbial average genome size in four major biomes. PLoSComput Biol. 2009;5:e1000593.

23. Raes J, Korbel JO, Lercher MJ, Von Mering C, Bork P. Prediction of effectivegenome size in metagenomic samples. Genome Biol. 2007;8:R10.

24. Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA,et al. Predictive functional profiling of microbial communities using 16SrRNA marker gene sequences. Nat Biotechnol. 2013;31:814–21.

25. Benjamini Y, Speed TP. Summarizing and correcting the GC content bias inhigh-throughput sequencing. Nucleic Acids Res. 2012;40:e72.

26. Oh J, Byrd AL, Deming C, Conlan S, Program NCS, Kong HH, et al.Biogeography and individuality shape function in the human skinmetagenome. Nature. 2014;514:59–64.

27. Wylie KM, Mihindukulasuriya KA, Zhou Y, Sodergren E, Storch GA, WeinstockGM. Metagenomic analysis of double-stranded DNA viruses in healthyadults. BMC Biol. 2014;12:71.

28. Minot S, Bryson A, Chehoud C, Wu GD, Lewis JD, Bushman FD. Rapidevolution of the human gut virome. Proc Natl Acad Sci U S A.2013;110:12450–5.

29. Greenblum S, Turnbaugh PJ, Borenstein E. Metagenomic systems biology ofthe human gut microbiome reveals topological shifts associated withobesity and inflammatory bowel disease. Proc Natl Acad Sci U S A.2012;109:594–9.

30. Darling AE, Jospin G, Lowe E, Matsen IV FA, Bik HM, Eisen JA. PhyloSift:phylogenetic analysis of genomes and metagenomes. Peerl. 2014;2:e243.

31. Katoh K, Misawa K, Kuma K-I, Miyata T. MAFFT: a novel method for rapidmultiple sequence alignment based on fast Fourier transform. Nucleic AcidsRes. 2002;30:3059–66.

32. Carr R, Borenstein E. Comparative analysis of functional metagenomicannotation and the mappability of short reads. PLoS One. 2014;9:e105776.

33. Friedman J, Hastie T, Tibshirani R. glmnet: lasso and elastic-net regularizedgeneralized linear models. J Stat Softw. 2010;33:1–22.

34. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat SocSer B Methodol. 1996;58:267–88.

35. Zou H, Hastie T. Regularization and variable selection via the elastic net. J RStat Soc B Stat Meth. 2005;67:301–20.

Submit your next manuscript to BioMed Centraland take full advantage of:

• Convenient online submission

• Thorough peer review

• No space constraints or color figure charges

• Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar

• Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit

AbstractBackgroundResults and discussionSpurious inter-sample variation in HMP samples and its determinantsSpurious inter-sample variation across different studiesInter-MUSiCC: correcting spurious inter-sample variationEvaluating the impact of Inter-MUSiCC on functional metagenomic profilesSpurious intra-sample variation and its determinantsIntra-MUSiCC: correcting spurious intra-sample variation and evaluating its impact on functional metagenomic profilesMUSiCC markedly enhances the discovery of functional shifts in the microbiomeMUSiCC significantly reduces spurious variation in simulated bacterial samples

ConclusionsMethodsSoftware implementation and distributionMetagenomic dataPICRUSt dataDetecting Universal Single-Copy Genes (USiCGs)Estimating average genome size in HMP samplesEstimating the species richness and genome mappability in HMP samplesControlled correlations between USiCGs’ abundances and sample-specific propertiesIdentifying and analyzing OTU-Specific Genes (OSGs)Examining pairwise gene correlation structure with compositional normalization and with Inter-MUSiCCGene-specific propertiesRegularized linear model linking gene-specific properties to intra-sample variationIdentifying Semi-Universal Single-Copy genes (semi-USiCGs)Identifying highly-correlated genes across genomesIdentifying differentially abundant genesIdentifying disease-associated pathwaysEvaluating an alternative normalization approachMeasuring T2D-associated pathway recovery when using T2D cases with HMP or IBD controlsSimulating microbial samplesUsing average genome size estimation methods for sample normalization

Additional filesCompeting interestsAuthors’ contributionsAcknowledgementsAuthor detailsReferences

Documents

MUSiCC: a marker genes based framework for metagenomic ... · each sample, and some normalization method is required to allow comparison across samples. This is most com-monly achieved