Supplementary Materials for...the current gold standards for high-throughput classification of bacteria l6S rRNA gene sequences; however, at this time, it does not provide species

www.advances.sciencemag.org/cgi/content/full/1/5/e1400216/DC1

Supplementary Materials for

Staphylococcus aureus and the ecology of the nasal microbiome

Cindy M. Liu, Lance B. Price, Bruce A. Hungate, Alison G. Abraham, Lisbeth A. Larsen,

Kaare Christensen, Marc Stegger, Robert Skov, Paal Skytt Andersen

Published 5 June 2015, Sci. Adv. 1, e1400216 (2015)

DOI: 10.1126/sciadv.1400216

This PDF file includes:

Fig. S1. Correlation of nasal microbiota composition among monozygotic and

among same-sex and opposite-sex dizygotic twin pairs in non-metric

multidimensional scaling ordination plots.

Fig. S2. Rates of S. aureus nasal colonization by sequencing and by culture and S.

aureus absolute abundance for the seven nasal CSTs.

Fig. S3. Results from decision tree model derivation and validation showing

threshold-dependent relationships between the absolute abundances of nasal

commensals and S. aureus.

Table S1. The indicator genera for each nasal CST identified on the basis of

proportional abundance, where CI < 0.80 denotes bacteria taxa assigned with

<80% bootstrap confidence level.

Table S2. Nasal bacterial density (median and IQR) of each nasal CST and the

prevalence of each CST by sex.

Table S3. Comparison of within-twin nasal microbiota composition correlation

between monozygotic and dizygotic twins based on pairwise ecological distance

(Jaccard, Bray-Curtis, and Euclidean distances).

Table S4. Results from standard biometrical heritability analysis using a

polygenic model to determine the contribution of additive genetic effects (A),

genetics effects due to dominance (D), shared environmental effects (C), and

nonshared environmental effects (E) to nasal bacterial density.

Table S5. The associations between nasal bacterial density and host factors,

including sex, assessed by Wilcoxon rank sum and Kolmogorov-Smirnov tests.

Table S6. Comparison of nasal bacterial density by sex, adjusted for nasal CST in

a quasi-Poisson model.

References (25–33)

SUPPLEMENTARY MATERIALS

LABORATORY METHODS

A. DNA isolation and purification. All samples from each subject were processed in the same

batch to control for inter-run variation in lysis and purification. The combined chemical and

mechanical lysis was performed as follows: Swab samples were thawed at 4°C and 100μl of swab

eluent from each sample was transferred to pre-labeled PCT MicroTube (Pressure Bioscience,

Inc., South Easton, MA, USA) containing 50μl of RLT lysis buffer (Qiagen, Valencia, CA, USA)

and capped with a 150μl PCT MicroCap (Pressure Bioscience, Inc.). Each capped MicroTube

was loaded onto the MicroTube holder and undergo mechanical lysis on the Barocycler NEP

3229 instrument (Pressure Bioscience, Inc.) using the following pressure cycling conditions at

25°C: increase pressure to 35,000 pounds per square inch (psi) for 15 seconds, then decrease to

14·696 psi for 15 seconds and repeat for 19 more cycles. The lysate was added to 550μl of RLT

lysis buffer and purified using the AllPrep DNA/RNA Mini Kit following manufacturer’s

instructions. DNA elution was performed with 100μl of Buffer EB. The purified DNA was used

in subsequent qPCR and pyrosequencing analysis.

B. 16S rRNA gene-based broad-coverage qPCR. Each 16S qPCR reaction was performed in 10

µl reaction volumes in PRISMTM 384-well Clear Optical Reaction Plates (Applied Biosystems by

Life Technologies, Grand Island, NY, USA) using methods as described previously (17). An in-

run standard curve spanning 102-108 in serial 10-fold dilutions was included in all runs and all

samples were analyzed in triplicate reactions. Raw experimental data, including the cycle

threshold (Ct) and gene copy number values for each reaction were exported from the Sequence

Detection Systems v2·3 software (Applied Biosystems) using a manual Ct threshold of 0·05 and

automatic baseline. The Ct standard deviation for each sample was further processed, where

samples with Ct standard deviation ≥ 0.25 were examined for outliers, defined as a single

replicate with Ct-value that is ≥ 0.25 away from the remaining two replicates, which were then

removed. The processed data was then used to calculate the finalized Ct-value, as well as the 16S

rRNA gene copy number by plotting the Ct-value against linear regression of the in-run standard

curve.

C. Generation of 16S rRNA gene V3V6 amplicons. Amplification of the V3V6 region of the 16S

rRNA genes in each DNA sample was performed in a 96-well format using 50 μl reactions and

thermocycling conditions as previously described (18). In each optimized 50 μl reaction, 10 μl of

DNA was added to 40 μl of PCR reaction mix with a final concentration of 400 nM of each broad

range fusion forward primer (5’- CCATCTCATCCCTGCGTGTCTCCGA-

CTCAGnnnnnnnnCCTACGGGDGGCWGCA-3’) and fusion reverse primer (5’-

CCTATCCCCTGTGTGCCTTGGCAGTCTCAGCTGACGACRRCCRTGCA-3’), with the

underlined portion denoting FLX Lib-L adapter sequence, italicized portion denoting the sample-

specific 8-nt barcode sequence, and bolded portion denoting 16S rRNA gene primer sequence,

10X PCR buffer without MgCl2 (Invitrogen), 2·5 mM MgCl2, 0.5 mM dNTP mix, 0·067 U/μl

Platinum® Taq DNA Polymerase (Invitrogen), and molecular grade H2O using the following

thermocycling condition: 90 seconds at 95°C for initial denaturation and UNG inactivation, 30

seconds at 95°C for denaturation, 30 seconds at 62°C for annealing, 30 seconds at 72°C for

extension, with the annealing temperature decreasing by 0.3°C for each subsequent cycle for 19

cycles, followed by 10 cycles of amplification consisting of 30 seconds at 95°C for denaturation,

30 seconds at 45°C for annealing, 30 seconds at 72°C for extension, and a final extension for 7

minutes at 72°C and cool down to 15°C. PCR products were frozen immediately at -20°C until

further processing. In each fusion PCR experiment, negative and positive extraction controls were

included, as well as PCR controls including a no-template control, a positive bacterial control (E.

coli genomic DNA at 1 pg/μl), and a human DNA control (human genomic DNA at 10 ng/μl).

The resultant fusion PCR product were analyzed using 1% E-Gel® 96 Agarose (Invitrogen) to

confirm PCR amplification and product band size. The barcoded 16S rRNA gene amplicons from

each sample underwent 4 ten-fold dilutions and were quantified using the 16S rRNA gene-based

broad-coverage qPCR described earlier. The resultant barcoded amplicons were pooled in an

equimolar fashion. The pooled barcoded 16S rRNA gene amplicon library underwent emulsion

PCR, bead enrichment and recovery, and pyrosequencing analysis on the Genome Sequencer

FLX instrument (454 Life Sciences, Branford, CT, USA)

BIOINFORMATICS METHODOLOGICAL DETAILS

A. Chimeric sequence removal. We first converted the standard flowgram format (SFF) files into

fasta sequence and quality files using a combination of in-house Perl-based wrappers and the 454

Sequencing System Software 1. Next, we identified chimeric sequences de novo using U-Search’s

cluster utility (U-Search version 5·0·144) and U-Chime at the 99% threshold (25, 26). Only non-

chimeric sequences were included in subsequent analysis.

B. Sequence barcode removal, binning, and quality filtering. We next assigned each

pyrosequence to its original sample and scanned for primer sequence using a QIIME utility (27).

Pyrosequences without valid barcode or primer were excluded. We filtered the demultiplexed

pyrosequences based on: a) length (150bp - 920bp), b) number of degenerate bases (a maximum

of six), c) mean quality score (a lower threshold of 25), and d) homopolymer length (a maximum

consecutive run of six). Lastly, we trimmed each sequence based on quality using a sliding

window of 50bp and a quality score threshold of 25.

C. Taxonomic Classification. The resultant demultiplexed and quality-checked 16S rRNA gene

sequences were classified at each taxonomic level (i.e., phylum, class, order, family, genus) at

≥80% bootstrap confidence level using a web service for the Naïve Bayesian Classifier (RDP

Release 10, Update 28) (28). Sequences classified at <80% bootstrap confidence level are

reported with the assigned taxon and a “CI<0.80” notation. The taxonomic classifications

assigned to the sequences through the RDP Classifier fall into the modern high-order bacterial

proposed by Garrity et. al (29). A total of 327,716 bacterial 16S rRNA gene sequences were

obtained and classified. Classification results for each sample are enumerated to generate an

abundance-based matrix for data analysis. Bacterial taxa that comprised 0.2% of total sequences

were included in subsequent analysis.

D. Species classifier development and validation. The Naïve-Bayesian RDP Classifier is one of

the current gold standards for high-throughput classification of bacteria l6S rRNA gene

sequences; however, at this time, it does not provide species level classification, which limits our

ability to examine Staphylococcus and of other nasal taxa at the species-level, if sufficient

resolution exists. In order to achieve this, we re-built the RDP Classifier with an external

taxonomy and curated sequences,

with the particular goal of

improving Staphylococcus species

resolution because it is of major

ecological importance in the nasal

cavity.

Thus, we developed a pipeline that

will read an external taxonomy,

create a database to maintain the

taxonomic information, build

training files from the database solution, re-train the RDP Classifier, and generate classifications

for a set of query sequences. The general workflow is depicted below:

D1. Training set curation. Fundamental to the entire process mentioned above is the acquisition of

a training set, which can be used as a model for generating taxonomic classifications.

Staphylococcus sequences that are missing or underrepresented in our training set may be

assigned incorrectly or with low confidence level. Greengenes and RDP Staphylococcus

sequences were used as the core of our training set (28, 30) and our curation is ongoing.

Overview of species-level classifier pipeline

D1a. Greengenes. The majority of the training set is comprised of sequences from the Greengenes

taxonomy. While trying to build raw training files (described in the section titled “Creating Raw

Training Files”), it became apparent that polyphyletic groups exist in the current Greengenes

taxonomy. Re-training the RDP Classifier was not possible until these groups were either

resolved, or removed from the taxonomy. Our approach to overcoming this obstacle was to insert

the polyphyletic taxonomy into a database, and then build training files from the database

solution such that a given sequence’s membership in a polyphyletic group is clearly indicated.

A taxon is considered polyphyletic if it has more than one parent. Consider the two lineage

strings below:

182310

Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Alteromonadales;f__Alteromonadaceae;g__Altero

monas;s__Alteromonasmarina

250345

Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Oceanospirillales;f__Alteromonadaceae;g__Marino

bacter;s__

The family Alteromonadaceae is polyphyletic because it has two parents. Our solution generates training files that

indicate this relationship in the manner demonstrated below:

182310

Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__KNOWN_POLYPHYLETIC_GROUP[Alteromona

dales, Oceanospirillales];f__Alteromonadaceae;

D1b. Ribosomal Database Project. A set of training sequences was also acquired from the RDP

project. This set is comprised of solely of type strains assigned to the genus Staphylococcus to

increase the confidence of species-level assignment. The RDP sequence set underwent chimera

check using UCHIME (26).

D2. Database Design. To facilitate efficient maintenance of and access to the taxonomy, a

relational database solution was implemented. In order to optimize efficient querying of the

database, reduce space consumption, and to eliminate redundant entries within the database, a

table for each rank was implemented; the consequence of this is that it is only necessary to insert

a specific taxon once. Synonym tables were added to facilitate querying of the database in a

manner that allowed membership in a polyphyletic group to be reflected in the resulting training

files. The database design is provided in the diagram attached with this document:

To insert sequences into the database, the insert_taxonomy.py is utilized:

python insert_taxonomy.py

-t taxonomy_file.txt

-o parsed_taxonomy_file.txt

-s RDP

-b yes

-a localhost

-m 16S_TAXONOMY_RDP_STAPH

-u root

-p password

-c create_all_tables.sql

-f seqs.fasta

Argument Explanation:

* -t The taxonomy file containing all the lineage strings

* -o The name of the parsed taxonomy file that will be generated

* -s Source of the taxonomy

* -b Flag to indicate whether a new database build is to be used (values of y or

yes will indicate to do so)

* -a Database host

* -m Database name

* -u Username

* -p Password

* -c Optional argument indicating the script for creating the database tables

* -t Optional argument indicating the taxonomy dictionary that will be used as a

schema to build the database from

* -f The fasta file containing the sequences in the taxonomy

D3. Creating Training Files. The build_taxonomy.py script is used to generate training files from

the database. Usage is as follows:

python /PATH/build_taxonomy.py

-t yes

-o /PATH/training_rdp_download_1258seqs.txt

-a localhost

-d 16S_TAXONOMY_RDP_STAPH

-u root

-p password


* -t Optional argument that indicates whether or not a training file is to be

generated. Any value will indicate yes.

* -o Output file

* -s Optional argument that indicates the source of the taxonomy. This is only

used if an original taxonomy file is to be generated.

* -a Hostname

* -d Database name

* -u Username

* -p Password

The RDP Classifier training requires two raw training files as inputs: a taxonomy tree file

containing the hierarchical taxonomy information, and a sequence file with lineage strings

included in the headers. Both of these files are created with the create_raw_training_files.py

script, which performs the following steps: 1. Modify/parse the taxonomy, 2. Modify/parse the

sequence file, 3. Create the raw taxonomy tree file, and 4. Generate the updated sequence file. In

order to run this script, the following command must be executed:

python create_raw_training_files.py

<taxonomy_file>

<sequence_file>

<output_raw_taxonomy_file>

<output_raw_seq_file>


* <taxonomy_file> - The file generated by build_taxonomy.py containing

sequence ids and associated lineage strings

* <sequence_file> - The fasta file generated by build_taxonomy.py containing all

the training sequences

* <output_raw_taxonomy_file> - Output file containing hierarchical taxonomy tree

* <output_raw_seq_file> - Output sequence file with lineage included in the

headers

D4. Modifying the Taxonomy. A hierarchical taxonomy tree file will be generated as the output;

however, for the tree to be valid, certain modifications to the taxonomy must be made. It is a

strict requirement that all sequences in the taxonomy must not only have names for all ranks, but

they must also all be classified down to the same level. Consider the sequence 152262, which has

a lineage of:

k__Bacteria;p__Chlamydiae;c__Chlamydiae;o__Chlamydiales;f__;g__

Our script will parse this lineage string such that the following, valid string is generated:

k__Bacteria;p__Chlamydiae;c__Chlamydiae;o__Chlamydiales;f__Bacteria.Chlamydiae.Chlamydiae.Chlamy

diales.unclassified_family;

g__Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.unclassi

fied_family.unclassified_genus;

s__Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.unclassif

ied_family.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.

unclassified_family.unclassified_genus.unclassified_species

D5. Modifying the Sequence File. All sequences in the representative sequence file are modified

such that they are in the format:

domain;phylum;class;order;family;genus;species

An example sequence and associated lineage is provided below:

>X73443 Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium

nnnnnnngagatttgatcctggctcaggatgaacgctggccggccgtgcttacacatgcagtcgaacgaagcgcttaaactggatttcttcggattgaagttttt

gctgactgagtggcggacgggtgagtaacgcgtgggtaacctgcctcatacagggggataacagttagaaatgactgctaataccnnataagcgcacagtg

ctgcatggcacagtgtaaaaactccggtggtatgagatggacccgcgtctgattagctagttggtggggt

D6. The Taxonomy Tree File

create_raw_training_files.py modifies the taxonomy, and then generates the hierarchical

taxonomy tree file from the revised taxonomy. The format for the taxonomy tree is depicted

below:

taxid*taxon name*parent taxid*depth*rank

taxid, the parent taxid, and depth should be in integer format. depth indicates the depth from the root taxon. An

example tree is given below:

1*Bacteria*0*0*domain

765*Firmicutes*1*1*phylum

766*Clostridia*765*2*class

767*Clostridiales*766*3*order

768*Clostridiaceae*767*4*family

769*Clostridium*768*5*genus

160*Proteobacteria*1*1*phylum

433*Gammaproteobacteria*160*2*class

586*Vibrionales*433*3*order

587*Vibrionaceae*586*4*family

588*Vibrio*587*5*genus

592*Photobacterium*587*5*genus

552*Pseudomonadales*433*3*order

553*Pseudomonadaceae*552*4*family

554*Pseudomonas*553*5*genus

604*Enterobacteriales*433*3*order

605*Enterobacteriaceae*604*4*family

617*Enterobacter*605*5*genus

161*Alphaproteobacteria*160*2*class

260*Rhizobiales*161*3*order

261*Rhizobiaceae*260*4*family

262*Rhizobium*261*5*genus

D7. Re-training the RDP Classifier. To re-train the classifier, it is necessary to create parsed

training files from the raw training data. Assuming the two raw files are created in mydir/mydata:

mytaxon.txt and mytrainseq.fasta, the user will need to run the command to create parsed training

files:

mkdir /PATH/mydata/mydata_trained

java -Xmx1g

-cp /PATH/rdp_classifier-version.jar

edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker

/PATH/mydata/mytaxon.txt

/PATH/mydata/mytrainseq.fasta

1

version1

test

/PATH/mydata/mydata_trained


* <mydata/mytaxon.txt> Contains the hierarchical taxonomy information

* <mydata/mytrainseq.fasta> Contains the raw training sequences

* <1> The trainset_no to mark the training files generated

* <version1> Holds the modification information of the taxonomy

* <mydata_trained> Specifies the output directory

Four parsed training files will be created and saved into directory mydata_trained:

* bergeyTrainingTree.xml

* genus_wordConditionalProbList.txt

* logWordPrior.txt

* wordConditionalProbIndexArr.txt

After this is accomplished, the rRNAClassifier.properties file (found with the RDP source code)

needs to be copied into the directory containing the files mentioned above. Effectively, these files

will serve as the model from which classifications will be generated.

D8. Generating Classifications. To classify sequences, the user can either choose to execute the

RDP Classifier source code by itself, or to use the species_classification_generator.py script. To

execute the RDP Source Code, the user will need to execute the following command:

java -Xmx1g -jar /PATH/rdp_classifier-version.jar

-t /PATH/mydata/rRNAClassifier.properties

-q /PATH/sampledata/testQuerySeq.fasta

-o /PATH/testquery.out


* -jar The RDP jar file to use

* -t The rRNAClassifier.properties file

* -q Query sequence file

* -o Output file

NOTE: If the -t option is not used, the classifier will use the standard training set, and species-level classifications will

not be generated. The species_classification_generator.py script does more than just produce the classifications. It

parses headers of the .fna files produced by our current version of the pyro sequencing pipeline, runs the RDP

Classifier on the sequences contained in the .fna files, parses the output, upload the classification results into a

database, and generates the .xls files. To run this script the user will need to execute the following command:

python species_classification_generator.py

-d seqs/

-r 090712_omoss_test

-s localhost

-m PYRO_SEQ_CLASSIFICATIONS

-u root

-p password

-j /PATH/rdp_classifier-2.4.jar

-c /PATH/ClassificationReporter.jar

-t /PATH/rRNAClassifier.properties


* -d Data directory containing the sequences to be classified

* -r The name of the run that will be used to identify the database tables

containing the results of the classification

* -s MySQL hostname

* -m MySQL database to hold classifications

* -u MySQL username

* -p User password

* -j Path to the RDP Classifier jar file

* -c Path to the ClassificationReporter.jar file

* -t Path to the rRNAClassifier.properties file

D10. Testing the Re-trained Classifier. To assess the accuracy of the re-trained RDP classifier,

multiple controlled tests were performed using 16S rRNA gene sequences from the training set

and from Genbank.

D10a. Initial Testing. All sequences from Staphylococcus epidermidis and Staphylococcus aureus

from the training set were compiled into two groups. Each set of sequences was classified using

the re-trained classifier. Initial statistical analyses indicated that 23% of the S. aureus, and 50% of

the S. epidermidis sequences were assigned incorrectly at the species level using full set.

Species Correctly assigned Sequences Input Sequences

S. aureus 281 (76.8%) 366

S. epidermidis 116 (49.8%) 233

D10b. Optimizing the Staphylococcus training set. To assess whether the misclassifications could

be attributed to erroneous designations in the training taxonomy, the RDP training set was

checked by first clustering the sequences at 97% threshold using UCLUST and the highest quality

sequence from each cluster was checked against the Genbank 16S rRNA sequence database

(Bacteria and Archaea) by BLAST. The top hit was identified and compared to the original RDP

assignment. Sequences with non-matching taxonomic assignments were removed from the

training set and the classifier was re-trained. Further testing revealed that this significantly

improved the classification; now, 96.1% of S aureus and 89.5% of S. epidermidis sequences are

accurately classified to the species-level.

Species Correctly assigned Sequences Input Sequences

S. aureus 274 (96.1%) 285

S. epidermidis 86 (89.5%) 96

E. Additional analysis of Staphylococcus sequences. For sequences that were assigned to

Staphylococcus but had species assignment at <0.80 confidence level, we dereplicated the

sequences at 97% similarity threshold using UCHIME, then manually extracted representative

sequences of each cluster from the dereplication, verified if they were S. aureus using BLAST.

This showed that sequences assigned to S. aureus < 0.80 and S. auricularis < 0.80 were S. aureus,

which we included as S. aureus sequences in subsequent analysis.

NASAL MICROBIOME ANALYSES

DEFINITIONS AND METRICS

Nasal community state type (i.e., nasal CST): The major nasal microbiota profiles, as identified

by hierarchal clustering.

Nasal bacterial density: The amount of nasal bacteria present in an individual’s nasal cavity,

which in this study was estimated based on the total number of 16S rRNA gene copies detected

per swab.

Prevalence: The proportion of study population found to have a variable of interest, such as a

particular nasal CST or nasal bacterial taxon.

Proportional abundance: Proportion of an individual’s nasal microbiota comprised a specific

nasal bacterial taxon. Using the taxonomically-classified sequence data, we calculated the

proportional abundance for each nasal bacterial taxon as: (Number of sequences assigned to the

taxon from the sample)/(Total number of sequences from the sample).

Absolute abundance: The counts of a specific nasal bacterial taxon comprising an individual’s

nasal microbiota. We combined proportional abundance with nasal bacterial density to calculate

taxon absolute abundance as: (Proportional abundance of the taxon from the sample) x (nasal

bacterial density of the sample).

Nasal microbiota composition: An individual’s nasal microbiota characterized by the nasal

bacterial taxa present, reported in either proportional abundance or absolute abundance.

Presence/absence of S. aureus by sequencing: Detection of >= 2 sequences assigned to S. aureus

is categorized as presence of S. aureus by sequencing, whereas singletons or no S. aureus

sequences are categorized as absence. Detection of S. aureus by sequencing is affected by high

total nasal bacterial density. S. aureus sequences may also be assigned incorrectly by our custom

RDP species-level classifier if the S. aureus sequence type is missing or underrepresented in our

training set.

Staphylococcus aureus absolute abundance: Absolute abundance of S. aureus is calculated as the

product of nasal bacterial density and proportional abundance of S. aureus. The assessment of S.

aureus by sequencing is affected by high total nasal bacterial density. S. aureus sequences may

also be assigned incorrectly by our custom RDP species-level classifier if the S. aureus sequence

type is missing or underrepresented in our training set.

ECOLOGICAL ANALYSES

1. Characterization of nasal bacterial density. We reported the range, median, and inter-quartile

range of participants’ nasal bacterial density, calculated using R (version 3.0.1) (19). Boxplots of

nasal bacterial density for each nasal CST was also generated using R.

2. Assignment of nasal community state types. To identify community state types (CSTs), we

used proportional abundance data (Euclidean distance) in hierarchal clustering by Ward linkage

using cutree through an iterative process as previously described (31). Comparisons of the 6-, 7-,

and 8-CST solutions revealed that seven-CST solution to be the most parsimonious and effective.

Heatmap visualization was then generated using nasal microbiota composition (in proportional

abundance) from each participant, grouped by nasal CST assignment (Figure 1B).

3. Identification of indicator taxa for nasal community state types. We identified the nasal

bacterial taxa uniquely associated with each nasal CST using indicator analysis from the labdsv

package (R package version 1.6-1) (22). The indicator species analysis is an objective assessment

of a particular taxon’s representation of an environment or a study group. A taxon’s indicator

value (IV) for a study group is determined based on its proportional abundance and prevalence in

the given study group. The IV ranges from 0 to 1, with 0 as no indication to 1 as perfect

indication. To test the null hypothesis of no difference between our observation and what can be

observed by chance, IV null distributions were built by Monte Carlo procedure using 1,000

resampled datasets with randomized study group assignments. The P-value for each observed IV

was determined based on its location within the null distribution and adjusted for false-discovery.

A significance level of P = 0.10 was used and results are shown in Table S1.

4. Association between host genetics and nasal microbiota composition. We assessed the

correlation between nasal microbiota composition and host genetics based on nasal CST

concordance in twin pairs and difference in pairwise ecological distance between twin types. We

calculated the nasal CST concordance for monozygotic and dizygotic twin pairs, where a twin

pair having identical nasal CST assignments marks concordance. We computed the pairwise

ecological distance among all study participants based on the nasal microbiota composition

(proportional abundance) in three distance metrics: Jaccard’s, Bray-Curtis, and Euclidean. Using

a bootstrap-based approach, we calculated the difference in pairwise distance in three

experiments of 1,000 iterations: a) PairwiseDistMZ (Male or Female)– PairwiseDistDZ (Any Sex), b)

PairwiseDistMZ (Male or Female)– PairwiseDistDZ (Same Sex), and c) PairwiseDistMZ (Male or Female)–

PairwiseDistRandom pair (Same Sex). The correlation in twin pairs would be considered statistically

significant if the bootstrapped 95% confidence interval of the difference in pairwise distance does

not cross zero. Results are shown in Table S3.

5. Visualization of nasal microbiota composition by nasal CST and for each twin type. We also

visualized the overall nasal microbiota composition by nasal CST (Figure 1C) and for each twin

type (Figure S1A-C) using proportional abundance data in Euclidean distance by non-metric

multidimensional scaling (nMDS), which is a non-parametric ordination technique to reduce a

highly multidimensional community composition data into a two-dimensional ordination plot.

The nMDS ordination and visualization were generated using the vegan package (R package

version 2.1-10) (21).

6. Association between host genetics and of nasal bacterial density. We assessed the correlation

between nasal bacterial density and host genetics using intra-class correlation coefficient (ICC) in

R. We determined correlation of nasal bacterial density (log10) for monozygotic twin pairs and

dizygotic twin pairs based on sex- and age-adjusted ICC. The resultant ICC represents the

fraction of total variance that is due to variation between groups, calculated using the pooled

mean and standard deviation; consequently, the larger the ICC, the smaller the within-twin

variation, and vice versa. The correlation in twin pairs was statistically significant if the 95%

confidence interval of the ICC does not cross zero.

7. Heritability of nasal bacterial density. A standard biometrical heritability analysis was

performed for nasal bacterial density in log10 to estimate the relative contribution of genetic and

environmental factors. The twin study leverages the fact that monozygotic (MZ) twins share all

their genes, whereas dizygotic (DZ) twins share approximately 50% of their genes as other types

of siblings. As such, biometrical heritability analysis separates total phenotype variance (V) into

four variance compartments: V = A + D + C + E, where A refers to additive genetic effects, D

refers to genetics effects due to dominance, C refers to shared environmental effects, and E refers

to non-shared environmental effects (32).

We could not simultaneously estimate the effects of D and C because they are confounded.

Therefore, we fitted separate ACE and ADE models. We also fitted sub-models AE, DE, CE, and

E, as the simpler models may sufficiently explain the data. We chose the non-nested model with

the lowest Akike’s Information Criteria (AIC), and we selected the most parsimonious nested

model with χ2 likelihood ratio p > 0.05. All analyses were performed using the R package mets:

Analysis of Multivariate Event Times (version 0.2.6) (33).

After testing the assumptions of equal regression, intercept, and residual variance for twin 1 and

twin 2 as well as for MZ and DZ twins, we found that the ADE model had the lowest AIC and

that it could be further reduced to a AE model because of its lower log likelihood ratio; however,

the AE model could not be reduced to an E model (Table S2). Taken together, our results showed

both heritability and non-shared environmental influences on nasal bacterial density, with a

smaller heritability effect (29.8%, 95%CI: 6%-54%) and a larger non-shared environmental effect

(70.1%, 95%CI: 46%-94%).

Of note, the final and intermediate models, as shown in Table S4 included adjustment for sex and

age. While age was not associated with nasal bacterial density and its inclusion had no significant

impact on modeling outcome, sex emerged as a significant factor as reported in the main text.

8. Association between nasal bacterial density and host factors including sex. The median and

quantile of nasal bacterial density by sex, history of atopic disease and psoriasis, and by current

smoking status were calculated using R. We also plotted the nasal bacterial density (log10) as

scattered plots with median (Figure 3A). Difference in nasal bacterial density based on each host

factor was compared using two non-parametric tests: the Wilcoxon-ranked sum and Kolmogorov-

Smirnov test, with a significance level of α = 0.05, with results as shown in Table S5.

To determine if the significant sex difference in nasal bacterial density could be explained by

CST prevalence, the nasal CST prevalence for men versus women is as shown in Table S2, which

we compared by χ2 test. We further assessed if men and women have significantly different nasal

bacterial density, irrespective of nasal CST using quasi-Poisson model comparing the outcome of

nasal bacterial density, stratified by the seven nasal CSTs. Women with CST3 was used as the

reference and the results are shown in Table S6, which showed that men and women had

significantly different nasal bacterial density even after adjusting for nasal CST.

9. Association of nasal bacterial density with microbiota composition: We compared the nasal

bacterial density across CSTs by analysis of variance (ANOVA) in R and reported the median

nasal bacterial density and interquartile range for each nasal CST in Table S2. The significant

difference in sex-adjusted nasal bacterial density across nasal CSTs can also be seen in Table S6.

10. Decision tree analysis: Using decision tree analysis with recursive partitioning and splitting

by information criteria using the rpart package (24), a derivation model was built used a

simulated population of 100 randomly-drawn (without replacement) individuals. Using the

derivation set, two outcomes were determined: S. aureus presence/absence and S. aureus absolute

abundance (log10) in five categories (Category 1-5). Among the nasal bacterial taxa detected,

those significantly associated with S. aureus nasal colonization were incorporated in the

derivation model, which included taxa with conflicting associations in earlier studies (6-10). The

derivation decision tree model incorporated the absolute abundances (log10) of the following nasal

taxa: Anaerococcus, Finegoldia, Peptoniphilus, Dolosigranulum, Corynebacterium, Unclassified

Corynebacteriaceae, Propionibacterium acnes, Propionibacterium granulosum, Simonsiella <

0.80, Staphylococcus epidermidis (including <0.80), and Moraxella.

A model was derived for each outcome of interest and the branches were trimmed down to

include only those with 10 or more individuals in each terminal node (except for

Simonsiella<0.80, which was an early 2nd node). A predicted outcome was assigned to each

terminal node (i.e., as predicting either S. aureus presence or absence or as predicting a S. aureus

absolute abundance category) (Figures 3A and S3A)

Validation test for the predicative thresholds was conducted using 10 additional simulated

populations of 100 randomly-drawn (without replacement) individuals. The validation results

were determined to support the initial model if the predicative thresholds produced results that are

more similar to the predicted outcome than the underlying simulated population (Figures 3B and

S3B-C).

11. Correlation between S. aureus absolute abundance and culture outcome. We calculated S.

aureus absolute abundance among non-CST1 individuals with S. aureus detectable by DNA

sequencing. We divided these individuals into four categories based on ten-fold differences in S.

aureus absolute abundance (<104, 104-<105, 105-<106, 106-107). We plotted the histograms of

each S. aureus absolute abundance category in men and women (Figure 3B), which showed that

women most often fell into to two lowest absolute abundance categories (<104 and 104-<105)

while men were more likely to have the middle two categories (104-<105 and 105-<106). The

correlation between S. aureus nasal culture and S. aureus absolute abundance category was

shown in Figure 3C. The relationship between S. aureus nasal culture (outcome) to other

variables including S. aureus absolute abundance category, sex, history of atopic disease and

psoriasis, and current smoking status was assessed using a multivariate linear regression model,

which showed that sex was not a significant predictor of S. aureus culture outcome (P = 0.79),

after adjusting for S. aureus absolute abundance category (P < 0.001), where the model indicated

that with each ten-fold increase in S. aureus absolute abundance, the probability of having a

positive S. aureus culture increases by 30.4% (F-statistic 7.19 on 68 degrees of freedom, Model P

< 0.001).

SUPPLEMENTARY FIGURE LEGENDS

Figure S1A-C. Correlation of nasal microbiota composition among monozygotic and among

same-sex and opposite-sex dizygotic twin pairs in non-metric multidimensional scaling (nMDS)

ordination plots. In nMDS plots, each data point represents an individual’s microbiota at one time

point. Each twin pair is connected by a solid line, which showed that the nasal microbiota in

monozygotic twin pairs (Fig. S1A) had low CST concordance, as same-sex (Fig. S1B) and

opposite-sex dizygotic twins (Fig. S1C).

Figure S2A-B. Rates of S. aureus nasal colonization by sequencing and by culture and S. aureus

absolute abundance for the seven nasal CSTs. Rate of S. aureus nasal colonization varied across

nasal CSTs as detected based on sequencing and by culture. In general, sequencing detection

revealed higher S. aureus prevalence than culturing, except in CST2, where sequencing had lower

sensitivity, most likely due to insufficient reads in the context of high total bacterial density (Fig.

S2A). As shown by boxplots of S. aureus absolute abundance, CST1 and CST6 had the highest S.

aureus absolute abundance, whereas CST5 had the lowest S. aureus absolute abundance.

Figure S3A-B. Results from decision tree model derivation and validation showing threshold-

dependent relationships between the absolute abundances of nasal commensals and S. aureus.

Absolute abundance of S. aureus can be divided into five categories, ranging from Category 1

(i.e., not detected) to Category 5 (i.e., 106-107 S. aureus 16S rRNA gene absolute abundance).

The S. aureus absolute abundance categories for the derivation group of 100 are as shown (Fig.

S3B), which was used to build a model to predict S. aureus absolute abundance (Fig. S3A). The

model showed that the most informative split was a threshold of 1.2 x 106 Dolosigranulum 16S

rRNA gene copies per swab, which predicted Category 1 (n = 21/25, 84.0%) (Node 1 Left).

Corynebacterium had a similarly negative relationship to S. aureus absolute abundance, where

among individuals who had below-threshold abundance of Dolosigranulum and P. acnes, having

≥ 3.5 x 105 Corynebacterium predicts low S. aureus absolute abundance, i.e., Category 2 (Node 3

Left), whereas having < 3.5 x 105 Corynebacterium predicts high S. aureus absolute abundance,

i.e., Category 5 (Node 3 Right). In contrast, absolute abundance of S. epidermidis and S. aureus

were positively correlated among individuals with low Dolosigranulum and high P. acnes.

Validation testing using 10 randomly drawn groups of 100 supported that threshold-based

relationships between Dolosigranulum, P. acnes, S. epidermidis, Corynebacterium and S. aureus

absolute abundance (Fig. 3C).

SUPPLEMENTARY TABLES

Table S1. The indicator genera for each nasal CST identified based on proportional abundance,

where CI<0.80 denotes bacteria taxa assigned with <80% bootstrap confidence level.

CST Indicator Taxa

Average

Proportional

Abundance

(SD)

Indicator

Value

Unadjusted

p-value

FDR-

adjusted*

p-value

1

Staphylococcus aureus 0.38 (0.13) 0.82 1.00E-04 3.57E-04

Staphylococcus aureus OTU1 0.31 (0.10) 0.80 1.00E-04 3.57E-04

Staphylococcus aureus OTU2 0.06 (0.04) 0.84 1.00E-04 3.57E-04

Staphylococcus lugdunensis CI<0.8 0.01 (0.01) 0.42 3.00E-04 8.82E-04

2

Escherichia unclassified 0.29 (0.44) 0.46 3.00E-04 8.82E-04

Enterobacteriaceae unclassified CI<0.8 0.15 (0.27) 0.37 7.00E-04 1.84E-03

Klebsiella CI<0.8 0.04 (0.08) 0.43 1.80E-03 4.29E-03

Proteus vulgaris CI<0.8 0.03 (0.07) 0.23 2.70E-03 5.87E-03

Proteus vulgaris 0.09 (0.22) 0.19 5.90E-03 1.23E-02

Raoultella CI<0.8 0.04 (0.10) 0.17 2.65E-02 4.88E-02

Erwinia CI<0.8 0.08 (0.16) 0.30 3.55E-02 5.92E-02

Averyella CI<0.8 0.06 (0.16) 0.12 4.24E-02 6.84E-02

3

Staphylococcus epidermidis 0.26 (0.11) 0.50 1.00E-04 3.57E-04

Staphylococcus capitis CI<0.8 0.02 (0.01) 0.43 1.00E-04 3.57E-04

Staphylococcus caprae CI<0.8 0.01 (0.01) 0.41 1.00E-04 3.57E-04

Staphylococcus epidermidis CI<0.8 0.14 (0.06) 0.49 1.00E-04 3.57E-04

Staphylococcus pettenkoferi CI<0.8 0.005 (0.004) 0.41 1.00E-04 3.57E-04

Staphylococcus warneri CI<0.8 0.03 (0.02) 0.44 1.00E-04 3.57E-04

Staphylococcus hominis CI<0.8 0.02 (0.02) 0.44 3.00E-04 8.82E-04

Staphylococcus pasteuri CI<0.8 0.01 (0.01) 0.35 5.00E-04 1.39E-03

Anaerococcus unclassified 0.01 (0.01) 0.25 1.76E-02 3.52E-02

Stenotrophomonas unclassified 0.05 (0.07) 0.24 2.04E-02 3.92E-02

Staphylococcus haemolyticus CI<0.8 0.01 (0.01) 0.28 2.73E-02 4.88E-02

4 Propionibacterium acnes 0.12 (0.15) 0.35 2.70E-03 5.87E-03

Corynebacteriaceae unclassified 0.01 (0.02) 0.25 3.48E-02 5.92E-02

5

Corynebacterium tuberculostearicum 0.01 (0.01) 0.52 1.00E-04 3.57E-04

Corynebacterium unclassified 0.54 (0.12) 0.51 1.00E-04 3.57E-04

Corynebacterium unclassified CI<0.8 0.10 (0.04) 0.48 1.00E-04 3.57E-04

Corynebacterium tuberculostearicum CI<0.8 0.02 (0.01) 0.32 1.30E-03 3.25E-03

6 Moraxella unclassified 0.55 (0.10) 0.81 1.00E-04 3.57E-04

7 Dolosigranulum unclassified 0.41 (0.20) 0.56 1.00E-04 3.57E-04

*FDR-adjusted: adjusted by false-discovery rate

Table S2. Nasal bacterial density (median and interquartile range) of each nasal CST and the

prevalence of each CST by sex.

Nasal bacterial density Sex*

Median Q1-Q3

Female

(n = 102)

Male

(n = 76)

Total

(n = 178)

16S rRNA gene copies per swab Number (%)

CST1 5.33E+06 4.01E+06-9.39E+06 16 (15.7) 6 (7.9) 22

CST2 4.10E+07 2.03E+07-3.88E+08 10 (9.8) 6 (7.9) 16

CST3 2.06E+06 1.49E+06-5.28E+06 25 (24.5) 15 (19.7) 40

CST4 2.22E+06 1.10E+06-1.81E+08 27 (26.4) 24 (31.5) 51

CST5 4.81E+06 2.00E+06-1.46E+07 9 (8.8) 11 (14.5) 20

CST6 1.37E+07 2.46E+06-1.78E+07 4 (3.9) 6 (7.9) 10

CST7 5.15E+06 2.13E+06-1.24E+07 11 (10.8) 8 (10.5) 19

*Comparison of nasal CST distribution in men versus women resulted in χ2 = 7.8, df = 6, p = 0.25

Table S3. Comparison of within-twin nasal microbiota composition correlation between

monozygotic and dizygotic twins based on pairwise ecological distance (Jaccard’s, Bray-Curtis,

and Euclidean distances). A 2.5%-97.5% confidence level (i.e., 95% confidence interval) that

does not intersect zero indicates a significant correlation in nasal microbiota composition in twin

pairs.

MZ versus DZ MZ versus randomly-selected, same

sex non-twin pairs

Female Male Female Male Overall

Jaccard's

Mean -0.037 0.109 -0.064 -0.041 -0.049

2.5% CL -0.15 -0.047 -0.178 -0.131 -0.122

97.5% CL 0.07 0.251 0.047 0.063 0.022

Bray-Curtis

Mean -0.033 0.109 -0.076 -0.061 -0.063

2.5% CL -0.166 -0.082 -0.216 -0.182 -0.154

97.5% CL 0.112 0.279 0.055 0.057 0.026

Euclidean

Mean -130257 94896 -50014 -26934 -32343

2.5% CL -388536 13490 -221823 -169013 -145271

97.5% CL 76979 235086 95268 118081 78943

Table S4. Results from standard biometrical heritability analysis using a polygenic model to

determine the contribution of additive genetic effects (A), genetics effects due to dominance (D),

shared environmental effects (C), and non-shared environmental effects (E) to nasal bacterial

density.

Models Correlation

within MZ

Correlation

within DZ A (%) D (%) C (%) E (%)

log

likelihood

P-

value* AIC

U** 0.42 -0.06

0 0 0 0 -205.7 423.4 (0.12, 0.65) (-0.35, 0.23)

ACE 0.30¤ 0.15 30

0 0

(-,-)

70 -207.2

426.5

(0.05, 0.51) (0.03, 0.27) (6, 54) (46, 94)

ADE 0.38 0.1 0

(-,-)

38 0

61.9 -206.4 424.8

(0.13, 0.59) (0.04, 0.15) (15,62) (38, 85)

AE§ 0.30¤ 0.15 30

0 0 70

-207.2 0.19 424.5 (0.05, 0.51) (0.03, 0.27) (6, 54) (46, 94)

E 0 0 0 1 -209.3 0.04 426.6

*P-value from comparing forcing correlation of MZ to be twice the correlation of DZ in the polygenetic model. The AE model was compared to ADE model and the E model was compared to the AE model.

** U model is a model with equal regression, intercept, and residual variance for twin 1 and twin 2 as well as for MZ and DZ twins §The AE model was selected as the final model.

Table S5. The associations between nasal bacterial density and host factors, including sex,

assessed by Wilcoxon-Ranked Sum and Kolmogorov-Smirnov tests.

Median

Inter-Quartile Range Wilcoxon

p-value

Kolmogorov-

Smirnov

p-value 25th 75th

Overall

4.07E+06 7.08E+06 n/a n/a

Sex

Women (n = 102) 2.97E+06 1.33E+06 9.11E+06

Men (n = 76) 7.94E+06 2.20E+06 4.30E+07

Difference

p < 0.001 p = 0.005

History of atopic diseases

Yes (n = 54) 4.46E+06 1.90E+06 1.50E+07

No (n = 124) 4.39E+06 1.50E+06 1.80E+07

Difference

p = 0.47 p = 0.35

History of psoriasis

Yes (n=15) 6.41E+06 4.80E+06 2.10E+07

No (n=158) 3.69E+06 1.57E+06 1.63E+07

Difference

p = 0.22 p = 0.12

Current smoker

Yes (n=33) 3.23E+06 1.66E+06 9.59E+06

No (n=145) 4.44E+06 1.59E+06 1.96E+07

Difference p = 0.61 p = 0.53

Table S6. Comparison of nasal bacterial density by sex, adjusted for nasal CST in a quasi-Poisson model.

The resultant model explains 19% of total deviance with 170 degrees of freedom.

Men*** Women***

CST1**

(S. aureus) 1.20E+07 4.68E+06

CST2***

(Enterobacteriaceae) 8.72E+07 3.04E+07

CST3 ***

(S. epidermidis) 4.95E+06

2.03E+06

(Reference)

CST4***

(Propionibacterium) 2.26E+07 8.51E+06

CST5

(Corynebacterium) 9.30E+06 3.68E+06

CST6*

(Moraxella) 1.58E+07 6.09E+06

CST7

(Dolosigranulum) 9.87E+06 3.90E+06

*p <0.10 , **p <0.05 , ***p <0.001

Figure S1A-C. Correlation of nasal microbiota composition among monozygotic and among same-sex

and opposite-sex dizygotic twin pairs in non-metric multidimensional scaling (nMDS) ordination plots.

-0.5 0.0 0.5

-1.0

-0.5

0.0

0.5

1.0

nMD

S a

xis

2

Type 1Type 2Type 3Type 4Type 5Type 6Type 7MZSSOS

-0.5 0.0 0.5

-1.0

-0.5

0.0

0.5

1.0

nMDS axis 1

nMD

S a

xis

2


B

C

-0.5 0.0 0.5

-1.0

-0.5

0.0

0.5

1.0

nMD

S a

xis

2


A

Figure S2A-B. Rates of S. aureus nasal colonization by sequencing and by culture and S. aureus absolute

abundance for the seven nasal CSTs.

CST1(n = 22)

CST2(n = 16)

CST3(n = 40)

CST4(n = 51)

CST5(n = 20)

CST6(n = 10)

CST7(n = 19)

24

68

10

Nasal community state types (CSTs)

S.

au

reu

s 1

6S

rR

NA

gen

e

co

pie

s p

er

sw

ab

(lo

g10

)

1

1

2

1

2

1

2

1

2

1

2

1

2

2

1

21

2

1

2

1

2

1

2

1

2

Culture (%) Bl(+) / Wh(-)

Sequencing (%) Bl(+) / Wh(-)

A

B

Figure S3A-B. Results from decision tree model derivation and validation showing threshold-dependent

relationships between the absolute abundances of nasal commensals and S. aureus.

|

Dolosigranulum

Propionibacterium.acnes

S..epidermidis Corynebacterium spp.

S. aureus (-)

(Category 1)

(n = 21/25)

>=1.2 x 10^6 < 1.2 x 10^6

>=8.1 x 10^4 < 8.1 x 10^4

<1.3 x 10^6 >=1.3 x 10^6 >=3.5 x 10^5 <3.5 x 10^5

Node 2 Right Node 2 Left Node 3 Left

(Category 2)

(n = 4/8)

(Category 5)

(n = 14/28)

(Category 3)

(n = 6/9)

(Category Not-5)

(n = 30)

A category 1: Not detected

category 2 : <10^4

category 3: 10^4-<10^5

category 4: 10^5-<10^6

category 5: 10^6-<10^7

B1

2

3 4

5

1

2345

12

345

1

2

4

51

23

4

Node 3 Right

1

24

5

Node 1 Left

S. aureus absolute abundance

1 2 3 4 5 6 7 8 9 10

02

040

60

80

10

0

1 2 3 4 5 6 7 8 9 10

01

020

30

40

50 category 1

category 2

category 3

category 4

category 5

1 2 3 4 5 6 7 8 9 10

01

020

30

40

50

1 2 3 4 5 6 7 8 9 10

Node 2 Left

01

02

03

04

05

0

1 2 3 4 5 6 7 8 9 10

01

02

03

04

05

0

1 2 3 4 5 6 7 8 9 10

01

02

03

04

05

0

Co

un

t

Trial Trial

Trial Trial Trial

Co

un

t

Co

un

tC

ou

nt

Co

un

tC

ou

nt

category 1

category 2

category 3

category 4

category 5

category 1

category 2

category 3

category 4

category 5

category 1

category 2

category 3

category 4

category 5

category 1

category 2

category 3

category 4

category 5

Node 3 Left Node 3 Right

Trial

Overall Node 1 Left Node 2 Right

C

Documents

Supplementary Materials for...the current gold standards for high-throughput classification of bacteria l6S rRNA gene sequences; however, at this time, it does not provide species