Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
www.advances.sciencemag.org/cgi/content/full/1/5/e1400216/DC1
Supplementary Materials for
Staphylococcus aureus and the ecology of the nasal microbiome
Cindy M. Liu, Lance B. Price, Bruce A. Hungate, Alison G. Abraham, Lisbeth A. Larsen,
Kaare Christensen, Marc Stegger, Robert Skov, Paal Skytt Andersen
Published 5 June 2015, Sci. Adv. 1, e1400216 (2015)
DOI: 10.1126/sciadv.1400216
This PDF file includes:
Fig. S1. Correlation of nasal microbiota composition among monozygotic and
among same-sex and opposite-sex dizygotic twin pairs in non-metric
multidimensional scaling ordination plots.
Fig. S2. Rates of S. aureus nasal colonization by sequencing and by culture and S.
aureus absolute abundance for the seven nasal CSTs.
Fig. S3. Results from decision tree model derivation and validation showing
threshold-dependent relationships between the absolute abundances of nasal
commensals and S. aureus.
Table S1. The indicator genera for each nasal CST identified on the basis of
proportional abundance, where CI < 0.80 denotes bacteria taxa assigned with
<80% bootstrap confidence level.
Table S2. Nasal bacterial density (median and IQR) of each nasal CST and the
prevalence of each CST by sex.
Table S3. Comparison of within-twin nasal microbiota composition correlation
between monozygotic and dizygotic twins based on pairwise ecological distance
(Jaccard, Bray-Curtis, and Euclidean distances).
Table S4. Results from standard biometrical heritability analysis using a
polygenic model to determine the contribution of additive genetic effects (A),
genetics effects due to dominance (D), shared environmental effects (C), and
nonshared environmental effects (E) to nasal bacterial density.
Table S5. The associations between nasal bacterial density and host factors,
including sex, assessed by Wilcoxon rank sum and Kolmogorov-Smirnov tests.
Table S6. Comparison of nasal bacterial density by sex, adjusted for nasal CST in
a quasi-Poisson model.
References (25–33)
SUPPLEMENTARY MATERIALS
LABORATORY METHODS
A. DNA isolation and purification. All samples from each subject were processed in the same
batch to control for inter-run variation in lysis and purification. The combined chemical and
mechanical lysis was performed as follows: Swab samples were thawed at 4°C and 100μl of swab
eluent from each sample was transferred to pre-labeled PCT MicroTube (Pressure Bioscience,
Inc., South Easton, MA, USA) containing 50μl of RLT lysis buffer (Qiagen, Valencia, CA, USA)
and capped with a 150μl PCT MicroCap (Pressure Bioscience, Inc.). Each capped MicroTube
was loaded onto the MicroTube holder and undergo mechanical lysis on the Barocycler NEP
3229 instrument (Pressure Bioscience, Inc.) using the following pressure cycling conditions at
25°C: increase pressure to 35,000 pounds per square inch (psi) for 15 seconds, then decrease to
14·696 psi for 15 seconds and repeat for 19 more cycles. The lysate was added to 550μl of RLT
lysis buffer and purified using the AllPrep DNA/RNA Mini Kit following manufacturer’s
instructions. DNA elution was performed with 100μl of Buffer EB. The purified DNA was used
in subsequent qPCR and pyrosequencing analysis.
B. 16S rRNA gene-based broad-coverage qPCR. Each 16S qPCR reaction was performed in 10
µl reaction volumes in PRISMTM 384-well Clear Optical Reaction Plates (Applied Biosystems by
Life Technologies, Grand Island, NY, USA) using methods as described previously (17). An in-
run standard curve spanning 102-108 in serial 10-fold dilutions was included in all runs and all
samples were analyzed in triplicate reactions. Raw experimental data, including the cycle
threshold (Ct) and gene copy number values for each reaction were exported from the Sequence
Detection Systems v2·3 software (Applied Biosystems) using a manual Ct threshold of 0·05 and
automatic baseline. The Ct standard deviation for each sample was further processed, where
samples with Ct standard deviation ≥ 0.25 were examined for outliers, defined as a single
replicate with Ct-value that is ≥ 0.25 away from the remaining two replicates, which were then
removed. The processed data was then used to calculate the finalized Ct-value, as well as the 16S
rRNA gene copy number by plotting the Ct-value against linear regression of the in-run standard
curve.
C. Generation of 16S rRNA gene V3V6 amplicons. Amplification of the V3V6 region of the 16S
rRNA genes in each DNA sample was performed in a 96-well format using 50 μl reactions and
thermocycling conditions as previously described (18). In each optimized 50 μl reaction, 10 μl of
DNA was added to 40 μl of PCR reaction mix with a final concentration of 400 nM of each broad
range fusion forward primer (5’- CCATCTCATCCCTGCGTGTCTCCGA-
CTCAGnnnnnnnnCCTACGGGDGGCWGCA-3’) and fusion reverse primer (5’-
CCTATCCCCTGTGTGCCTTGGCAGTCTCAGCTGACGACRRCCRTGCA-3’), with the
underlined portion denoting FLX Lib-L adapter sequence, italicized portion denoting the sample-
specific 8-nt barcode sequence, and bolded portion denoting 16S rRNA gene primer sequence,
10X PCR buffer without MgCl2 (Invitrogen), 2·5 mM MgCl2, 0.5 mM dNTP mix, 0·067 U/μl
Platinum® Taq DNA Polymerase (Invitrogen), and molecular grade H2O using the following
thermocycling condition: 90 seconds at 95°C for initial denaturation and UNG inactivation, 30
seconds at 95°C for denaturation, 30 seconds at 62°C for annealing, 30 seconds at 72°C for
extension, with the annealing temperature decreasing by 0.3°C for each subsequent cycle for 19
cycles, followed by 10 cycles of amplification consisting of 30 seconds at 95°C for denaturation,
30 seconds at 45°C for annealing, 30 seconds at 72°C for extension, and a final extension for 7
minutes at 72°C and cool down to 15°C. PCR products were frozen immediately at -20°C until
further processing. In each fusion PCR experiment, negative and positive extraction controls were
included, as well as PCR controls including a no-template control, a positive bacterial control (E.
coli genomic DNA at 1 pg/μl), and a human DNA control (human genomic DNA at 10 ng/μl).
The resultant fusion PCR product were analyzed using 1% E-Gel® 96 Agarose (Invitrogen) to
confirm PCR amplification and product band size. The barcoded 16S rRNA gene amplicons from
each sample underwent 4 ten-fold dilutions and were quantified using the 16S rRNA gene-based
broad-coverage qPCR described earlier. The resultant barcoded amplicons were pooled in an
equimolar fashion. The pooled barcoded 16S rRNA gene amplicon library underwent emulsion
PCR, bead enrichment and recovery, and pyrosequencing analysis on the Genome Sequencer
FLX instrument (454 Life Sciences, Branford, CT, USA)
BIOINFORMATICS METHODOLOGICAL DETAILS
A. Chimeric sequence removal. We first converted the standard flowgram format (SFF) files into
fasta sequence and quality files using a combination of in-house Perl-based wrappers and the 454
Sequencing System Software 1. Next, we identified chimeric sequences de novo using U-Search’s
cluster utility (U-Search version 5·0·144) and U-Chime at the 99% threshold (25, 26). Only non-
chimeric sequences were included in subsequent analysis.
B. Sequence barcode removal, binning, and quality filtering. We next assigned each
pyrosequence to its original sample and scanned for primer sequence using a QIIME utility (27).
Pyrosequences without valid barcode or primer were excluded. We filtered the demultiplexed
pyrosequences based on: a) length (150bp - 920bp), b) number of degenerate bases (a maximum
of six), c) mean quality score (a lower threshold of 25), and d) homopolymer length (a maximum
consecutive run of six). Lastly, we trimmed each sequence based on quality using a sliding
window of 50bp and a quality score threshold of 25.
C. Taxonomic Classification. The resultant demultiplexed and quality-checked 16S rRNA gene
sequences were classified at each taxonomic level (i.e., phylum, class, order, family, genus) at
≥80% bootstrap confidence level using a web service for the Naïve Bayesian Classifier (RDP
Release 10, Update 28) (28). Sequences classified at <80% bootstrap confidence level are
reported with the assigned taxon and a “CI<0.80” notation. The taxonomic classifications
assigned to the sequences through the RDP Classifier fall into the modern high-order bacterial
proposed by Garrity et. al (29). A total of 327,716 bacterial 16S rRNA gene sequences were
obtained and classified. Classification results for each sample are enumerated to generate an
abundance-based matrix for data analysis. Bacterial taxa that comprised 0.2% of total sequences
were included in subsequent analysis.
D. Species classifier development and validation. The Naïve-Bayesian RDP Classifier is one of
the current gold standards for high-throughput classification of bacteria l6S rRNA gene
sequences; however, at this time, it does not provide species level classification, which limits our
ability to examine Staphylococcus and of other nasal taxa at the species-level, if sufficient
resolution exists. In order to achieve this, we re-built the RDP Classifier with an external
taxonomy and curated sequences,
with the particular goal of
improving Staphylococcus species
resolution because it is of major
ecological importance in the nasal
cavity.
Thus, we developed a pipeline that
will read an external taxonomy,
create a database to maintain the
taxonomic information, build
training files from the database solution, re-train the RDP Classifier, and generate classifications
for a set of query sequences. The general workflow is depicted below:
D1. Training set curation. Fundamental to the entire process mentioned above is the acquisition of
a training set, which can be used as a model for generating taxonomic classifications.
Staphylococcus sequences that are missing or underrepresented in our training set may be
assigned incorrectly or with low confidence level. Greengenes and RDP Staphylococcus
sequences were used as the core of our training set (28, 30) and our curation is ongoing.
Overview of species-level classifier pipeline
D1a. Greengenes. The majority of the training set is comprised of sequences from the Greengenes
taxonomy. While trying to build raw training files (described in the section titled “Creating Raw
Training Files”), it became apparent that polyphyletic groups exist in the current Greengenes
taxonomy. Re-training the RDP Classifier was not possible until these groups were either
resolved, or removed from the taxonomy. Our approach to overcoming this obstacle was to insert
the polyphyletic taxonomy into a database, and then build training files from the database
solution such that a given sequence’s membership in a polyphyletic group is clearly indicated.
A taxon is considered polyphyletic if it has more than one parent. Consider the two lineage
strings below:
182310
Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Alteromonadales;f__Alteromonadaceae;g__Altero
monas;s__Alteromonasmarina
250345
Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Oceanospirillales;f__Alteromonadaceae;g__Marino
bacter;s__
The family Alteromonadaceae is polyphyletic because it has two parents. Our solution generates training files that
indicate this relationship in the manner demonstrated below:
182310
Root;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__KNOWN_POLYPHYLETIC_GROUP[Alteromona
dales, Oceanospirillales];f__Alteromonadaceae;
D1b. Ribosomal Database Project. A set of training sequences was also acquired from the RDP
project. This set is comprised of solely of type strains assigned to the genus Staphylococcus to
increase the confidence of species-level assignment. The RDP sequence set underwent chimera
check using UCHIME (26).
D2. Database Design. To facilitate efficient maintenance of and access to the taxonomy, a
relational database solution was implemented. In order to optimize efficient querying of the
database, reduce space consumption, and to eliminate redundant entries within the database, a
table for each rank was implemented; the consequence of this is that it is only necessary to insert
a specific taxon once. Synonym tables were added to facilitate querying of the database in a
manner that allowed membership in a polyphyletic group to be reflected in the resulting training
files. The database design is provided in the diagram attached with this document:
To insert sequences into the database, the insert_taxonomy.py is utilized:
python insert_taxonomy.py
-t taxonomy_file.txt
-o parsed_taxonomy_file.txt
-s RDP
-b yes
-a localhost
-m 16S_TAXONOMY_RDP_STAPH
-u root
-p password
-c create_all_tables.sql
-f seqs.fasta
Argument Explanation:
* -t The taxonomy file containing all the lineage strings
* -o The name of the parsed taxonomy file that will be generated
* -s Source of the taxonomy
* -b Flag to indicate whether a new database build is to be used (values of y or
yes will indicate to do so)
* -a Database host
* -m Database name
* -u Username
* -p Password
* -c Optional argument indicating the script for creating the database tables
* -t Optional argument indicating the taxonomy dictionary that will be used as a
schema to build the database from
* -f The fasta file containing the sequences in the taxonomy
D3. Creating Training Files. The build_taxonomy.py script is used to generate training files from
the database. Usage is as follows:
python /PATH/build_taxonomy.py
-t yes
-o /PATH/training_rdp_download_1258seqs.txt
-a localhost
-d 16S_TAXONOMY_RDP_STAPH
-u root
-p password
Argument Explanation:
* -t Optional argument that indicates whether or not a training file is to be
generated. Any value will indicate yes.
* -o Output file
* -s Optional argument that indicates the source of the taxonomy. This is only
used if an original taxonomy file is to be generated.
* -a Hostname
* -d Database name
* -u Username
* -p Password
The RDP Classifier training requires two raw training files as inputs: a taxonomy tree file
containing the hierarchical taxonomy information, and a sequence file with lineage strings
included in the headers. Both of these files are created with the create_raw_training_files.py
script, which performs the following steps: 1. Modify/parse the taxonomy, 2. Modify/parse the
sequence file, 3. Create the raw taxonomy tree file, and 4. Generate the updated sequence file. In
order to run this script, the following command must be executed:
python create_raw_training_files.py
<taxonomy_file>
<sequence_file>
<output_raw_taxonomy_file>
<output_raw_seq_file>
Argument Explanation:
* <taxonomy_file> - The file generated by build_taxonomy.py containing
sequence ids and associated lineage strings
* <sequence_file> - The fasta file generated by build_taxonomy.py containing all
the training sequences
* <output_raw_taxonomy_file> - Output file containing hierarchical taxonomy tree
* <output_raw_seq_file> - Output sequence file with lineage included in the
headers
D4. Modifying the Taxonomy. A hierarchical taxonomy tree file will be generated as the output;
however, for the tree to be valid, certain modifications to the taxonomy must be made. It is a
strict requirement that all sequences in the taxonomy must not only have names for all ranks, but
they must also all be classified down to the same level. Consider the sequence 152262, which has
a lineage of:
k__Bacteria;p__Chlamydiae;c__Chlamydiae;o__Chlamydiales;f__;g__
Our script will parse this lineage string such that the following, valid string is generated:
k__Bacteria;p__Chlamydiae;c__Chlamydiae;o__Chlamydiales;f__Bacteria.Chlamydiae.Chlamydiae.Chlamy
diales.unclassified_family;
g__Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.unclassi
fied_family.unclassified_genus;
s__Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.unclassif
ied_family.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.Bacteria.Chlamydiae.Chlamydiae.Chlamydiales.
unclassified_family.unclassified_genus.unclassified_species
D5. Modifying the Sequence File. All sequences in the representative sequence file are modified
such that they are in the format:
domain;phylum;class;order;family;genus;species
An example sequence and associated lineage is provided below:
>X73443 Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium
nnnnnnngagatttgatcctggctcaggatgaacgctggccggccgtgcttacacatgcagtcgaacgaagcgcttaaactggatttcttcggattgaagttttt
gctgactgagtggcggacgggtgagtaacgcgtgggtaacctgcctcatacagggggataacagttagaaatgactgctaataccnnataagcgcacagtg
ctgcatggcacagtgtaaaaactccggtggtatgagatggacccgcgtctgattagctagttggtggggt
D6. The Taxonomy Tree File
create_raw_training_files.py modifies the taxonomy, and then generates the hierarchical
taxonomy tree file from the revised taxonomy. The format for the taxonomy tree is depicted
below:
taxid*taxon name*parent taxid*depth*rank
taxid, the parent taxid, and depth should be in integer format. depth indicates the depth from the root taxon. An
example tree is given below:
1*Bacteria*0*0*domain
765*Firmicutes*1*1*phylum
766*Clostridia*765*2*class
767*Clostridiales*766*3*order
768*Clostridiaceae*767*4*family
769*Clostridium*768*5*genus
160*Proteobacteria*1*1*phylum
433*Gammaproteobacteria*160*2*class
586*Vibrionales*433*3*order
587*Vibrionaceae*586*4*family
588*Vibrio*587*5*genus
592*Photobacterium*587*5*genus
552*Pseudomonadales*433*3*order
553*Pseudomonadaceae*552*4*family
554*Pseudomonas*553*5*genus
604*Enterobacteriales*433*3*order
605*Enterobacteriaceae*604*4*family
617*Enterobacter*605*5*genus
161*Alphaproteobacteria*160*2*class
260*Rhizobiales*161*3*order
261*Rhizobiaceae*260*4*family
262*Rhizobium*261*5*genus
D7. Re-training the RDP Classifier. To re-train the classifier, it is necessary to create parsed
training files from the raw training data. Assuming the two raw files are created in mydir/mydata:
mytaxon.txt and mytrainseq.fasta, the user will need to run the command to create parsed training
files:
mkdir /PATH/mydata/mydata_trained
java -Xmx1g
-cp /PATH/rdp_classifier-version.jar
edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker
/PATH/mydata/mytaxon.txt
/PATH/mydata/mytrainseq.fasta
1
version1
test
/PATH/mydata/mydata_trained
Argument Explanation:
* <mydata/mytaxon.txt> Contains the hierarchical taxonomy information
* <mydata/mytrainseq.fasta> Contains the raw training sequences
* <1> The trainset_no to mark the training files generated
* <version1> Holds the modification information of the taxonomy
* <mydata_trained> Specifies the output directory
Four parsed training files will be created and saved into directory mydata_trained:
* bergeyTrainingTree.xml
* genus_wordConditionalProbList.txt
* logWordPrior.txt
* wordConditionalProbIndexArr.txt
After this is accomplished, the rRNAClassifier.properties file (found with the RDP source code)
needs to be copied into the directory containing the files mentioned above. Effectively, these files
will serve as the model from which classifications will be generated.
D8. Generating Classifications. To classify sequences, the user can either choose to execute the
RDP Classifier source code by itself, or to use the species_classification_generator.py script. To
execute the RDP Source Code, the user will need to execute the following command:
java -Xmx1g -jar /PATH/rdp_classifier-version.jar
-t /PATH/mydata/rRNAClassifier.properties
-q /PATH/sampledata/testQuerySeq.fasta
-o /PATH/testquery.out
Argument Explanation:
* -jar The RDP jar file to use
* -t The rRNAClassifier.properties file
* -q Query sequence file
* -o Output file
NOTE: If the -t option is not used, the classifier will use the standard training set, and species-level classifications will
not be generated. The species_classification_generator.py script does more than just produce the classifications. It
parses headers of the .fna files produced by our current version of the pyro sequencing pipeline, runs the RDP
Classifier on the sequences contained in the .fna files, parses the output, upload the classification results into a
database, and generates the .xls files. To run this script the user will need to execute the following command:
python species_classification_generator.py
-d seqs/
-r 090712_omoss_test
-s localhost
-m PYRO_SEQ_CLASSIFICATIONS
-u root
-p password
-j /PATH/rdp_classifier-2.4.jar
-c /PATH/ClassificationReporter.jar
-t /PATH/rRNAClassifier.properties
Argument Explanation:
* -d Data directory containing the sequences to be classified
* -r The name of the run that will be used to identify the database tables
containing the results of the classification
* -s MySQL hostname
* -m MySQL database to hold classifications
* -u MySQL username
* -p User password
* -j Path to the RDP Classifier jar file
* -c Path to the ClassificationReporter.jar file
* -t Path to the rRNAClassifier.properties file
D10. Testing the Re-trained Classifier. To assess the accuracy of the re-trained RDP classifier,
multiple controlled tests were performed using 16S rRNA gene sequences from the training set
and from Genbank.
D10a. Initial Testing. All sequences from Staphylococcus epidermidis and Staphylococcus aureus
from the training set were compiled into two groups. Each set of sequences was classified using
the re-trained classifier. Initial statistical analyses indicated that 23% of the S. aureus, and 50% of
the S. epidermidis sequences were assigned incorrectly at the species level using full set.
Species Correctly assigned Sequences Input Sequences
S. aureus 281 (76.8%) 366
S. epidermidis 116 (49.8%) 233
D10b. Optimizing the Staphylococcus training set. To assess whether the misclassifications could
be attributed to erroneous designations in the training taxonomy, the RDP training set was
checked by first clustering the sequences at 97% threshold using UCLUST and the highest quality
sequence from each cluster was checked against the Genbank 16S rRNA sequence database
(Bacteria and Archaea) by BLAST. The top hit was identified and compared to the original RDP
assignment. Sequences with non-matching taxonomic assignments were removed from the
training set and the classifier was re-trained. Further testing revealed that this significantly
improved the classification; now, 96.1% of S aureus and 89.5% of S. epidermidis sequences are
accurately classified to the species-level.
Species Correctly assigned Sequences Input Sequences
S. aureus 274 (96.1%) 285
S. epidermidis 86 (89.5%) 96
E. Additional analysis of Staphylococcus sequences. For sequences that were assigned to
Staphylococcus but had species assignment at <0.80 confidence level, we dereplicated the
sequences at 97% similarity threshold using UCHIME, then manually extracted representative
sequences of each cluster from the dereplication, verified if they were S. aureus using BLAST.
This showed that sequences assigned to S. aureus < 0.80 and S. auricularis < 0.80 were S. aureus,
which we included as S. aureus sequences in subsequent analysis.
NASAL MICROBIOME ANALYSES
DEFINITIONS AND METRICS
Nasal community state type (i.e., nasal CST): The major nasal microbiota profiles, as identified
by hierarchal clustering.
Nasal bacterial density: The amount of nasal bacteria present in an individual’s nasal cavity,
which in this study was estimated based on the total number of 16S rRNA gene copies detected
per swab.
Prevalence: The proportion of study population found to have a variable of interest, such as a
particular nasal CST or nasal bacterial taxon.
Proportional abundance: Proportion of an individual’s nasal microbiota comprised a specific
nasal bacterial taxon. Using the taxonomically-classified sequence data, we calculated the
proportional abundance for each nasal bacterial taxon as: (Number of sequences assigned to the
taxon from the sample)/(Total number of sequences from the sample).
Absolute abundance: The counts of a specific nasal bacterial taxon comprising an individual’s
nasal microbiota. We combined proportional abundance with nasal bacterial density to calculate
taxon absolute abundance as: (Proportional abundance of the taxon from the sample) x (nasal
bacterial density of the sample).
Nasal microbiota composition: An individual’s nasal microbiota characterized by the nasal
bacterial taxa present, reported in either proportional abundance or absolute abundance.
Presence/absence of S. aureus by sequencing: Detection of >= 2 sequences assigned to S. aureus
is categorized as presence of S. aureus by sequencing, whereas singletons or no S. aureus
sequences are categorized as absence. Detection of S. aureus by sequencing is affected by high
total nasal bacterial density. S. aureus sequences may also be assigned incorrectly by our custom
RDP species-level classifier if the S. aureus sequence type is missing or underrepresented in our
training set.
Staphylococcus aureus absolute abundance: Absolute abundance of S. aureus is calculated as the
product of nasal bacterial density and proportional abundance of S. aureus. The assessment of S.
aureus by sequencing is affected by high total nasal bacterial density. S. aureus sequences may
also be assigned incorrectly by our custom RDP species-level classifier if the S. aureus sequence
type is missing or underrepresented in our training set.
ECOLOGICAL ANALYSES
1. Characterization of nasal bacterial density. We reported the range, median, and inter-quartile
range of participants’ nasal bacterial density, calculated using R (version 3.0.1) (19). Boxplots of
nasal bacterial density for each nasal CST was also generated using R.
2. Assignment of nasal community state types. To identify community state types (CSTs), we
used proportional abundance data (Euclidean distance) in hierarchal clustering by Ward linkage
using cutree through an iterative process as previously described (31). Comparisons of the 6-, 7-,
and 8-CST solutions revealed that seven-CST solution to be the most parsimonious and effective.
Heatmap visualization was then generated using nasal microbiota composition (in proportional
abundance) from each participant, grouped by nasal CST assignment (Figure 1B).
3. Identification of indicator taxa for nasal community state types. We identified the nasal
bacterial taxa uniquely associated with each nasal CST using indicator analysis from the labdsv
package (R package version 1.6-1) (22). The indicator species analysis is an objective assessment
of a particular taxon’s representation of an environment or a study group. A taxon’s indicator
value (IV) for a study group is determined based on its proportional abundance and prevalence in
the given study group. The IV ranges from 0 to 1, with 0 as no indication to 1 as perfect
indication. To test the null hypothesis of no difference between our observation and what can be
observed by chance, IV null distributions were built by Monte Carlo procedure using 1,000
resampled datasets with randomized study group assignments. The P-value for each observed IV
was determined based on its location within the null distribution and adjusted for false-discovery.
A significance level of P = 0.10 was used and results are shown in Table S1.
4. Association between host genetics and nasal microbiota composition. We assessed the
correlation between nasal microbiota composition and host genetics based on nasal CST
concordance in twin pairs and difference in pairwise ecological distance between twin types. We
calculated the nasal CST concordance for monozygotic and dizygotic twin pairs, where a twin
pair having identical nasal CST assignments marks concordance. We computed the pairwise
ecological distance among all study participants based on the nasal microbiota composition
(proportional abundance) in three distance metrics: Jaccard’s, Bray-Curtis, and Euclidean. Using
a bootstrap-based approach, we calculated the difference in pairwise distance in three
experiments of 1,000 iterations: a) PairwiseDistMZ (Male or Female)– PairwiseDistDZ (Any Sex), b)
PairwiseDistMZ (Male or Female)– PairwiseDistDZ (Same Sex), and c) PairwiseDistMZ (Male or Female)–
PairwiseDistRandom pair (Same Sex). The correlation in twin pairs would be considered statistically
significant if the bootstrapped 95% confidence interval of the difference in pairwise distance does
not cross zero. Results are shown in Table S3.
5. Visualization of nasal microbiota composition by nasal CST and for each twin type. We also
visualized the overall nasal microbiota composition by nasal CST (Figure 1C) and for each twin
type (Figure S1A-C) using proportional abundance data in Euclidean distance by non-metric
multidimensional scaling (nMDS), which is a non-parametric ordination technique to reduce a
highly multidimensional community composition data into a two-dimensional ordination plot.
The nMDS ordination and visualization were generated using the vegan package (R package
version 2.1-10) (21).
6. Association between host genetics and of nasal bacterial density. We assessed the correlation
between nasal bacterial density and host genetics using intra-class correlation coefficient (ICC) in
R. We determined correlation of nasal bacterial density (log10) for monozygotic twin pairs and
dizygotic twin pairs based on sex- and age-adjusted ICC. The resultant ICC represents the
fraction of total variance that is due to variation between groups, calculated using the pooled
mean and standard deviation; consequently, the larger the ICC, the smaller the within-twin
variation, and vice versa. The correlation in twin pairs was statistically significant if the 95%
confidence interval of the ICC does not cross zero.
7. Heritability of nasal bacterial density. A standard biometrical heritability analysis was
performed for nasal bacterial density in log10 to estimate the relative contribution of genetic and
environmental factors. The twin study leverages the fact that monozygotic (MZ) twins share all
their genes, whereas dizygotic (DZ) twins share approximately 50% of their genes as other types
of siblings. As such, biometrical heritability analysis separates total phenotype variance (V) into
four variance compartments: V = A + D + C + E, where A refers to additive genetic effects, D
refers to genetics effects due to dominance, C refers to shared environmental effects, and E refers
to non-shared environmental effects (32).
We could not simultaneously estimate the effects of D and C because they are confounded.
Therefore, we fitted separate ACE and ADE models. We also fitted sub-models AE, DE, CE, and
E, as the simpler models may sufficiently explain the data. We chose the non-nested model with
the lowest Akike’s Information Criteria (AIC), and we selected the most parsimonious nested
model with χ2 likelihood ratio p > 0.05. All analyses were performed using the R package mets:
Analysis of Multivariate Event Times (version 0.2.6) (33).
After testing the assumptions of equal regression, intercept, and residual variance for twin 1 and
twin 2 as well as for MZ and DZ twins, we found that the ADE model had the lowest AIC and
that it could be further reduced to a AE model because of its lower log likelihood ratio; however,
the AE model could not be reduced to an E model (Table S2). Taken together, our results showed
both heritability and non-shared environmental influences on nasal bacterial density, with a
smaller heritability effect (29.8%, 95%CI: 6%-54%) and a larger non-shared environmental effect
(70.1%, 95%CI: 46%-94%).
Of note, the final and intermediate models, as shown in Table S4 included adjustment for sex and
age. While age was not associated with nasal bacterial density and its inclusion had no significant
impact on modeling outcome, sex emerged as a significant factor as reported in the main text.
8. Association between nasal bacterial density and host factors including sex. The median and
quantile of nasal bacterial density by sex, history of atopic disease and psoriasis, and by current
smoking status were calculated using R. We also plotted the nasal bacterial density (log10) as
scattered plots with median (Figure 3A). Difference in nasal bacterial density based on each host
factor was compared using two non-parametric tests: the Wilcoxon-ranked sum and Kolmogorov-
Smirnov test, with a significance level of α = 0.05, with results as shown in Table S5.
To determine if the significant sex difference in nasal bacterial density could be explained by
CST prevalence, the nasal CST prevalence for men versus women is as shown in Table S2, which
we compared by χ2 test. We further assessed if men and women have significantly different nasal
bacterial density, irrespective of nasal CST using quasi-Poisson model comparing the outcome of
nasal bacterial density, stratified by the seven nasal CSTs. Women with CST3 was used as the
reference and the results are shown in Table S6, which showed that men and women had
significantly different nasal bacterial density even after adjusting for nasal CST.
9. Association of nasal bacterial density with microbiota composition: We compared the nasal
bacterial density across CSTs by analysis of variance (ANOVA) in R and reported the median
nasal bacterial density and interquartile range for each nasal CST in Table S2. The significant
difference in sex-adjusted nasal bacterial density across nasal CSTs can also be seen in Table S6.
10. Decision tree analysis: Using decision tree analysis with recursive partitioning and splitting
by information criteria using the rpart package (24), a derivation model was built used a
simulated population of 100 randomly-drawn (without replacement) individuals. Using the
derivation set, two outcomes were determined: S. aureus presence/absence and S. aureus absolute
abundance (log10) in five categories (Category 1-5). Among the nasal bacterial taxa detected,
those significantly associated with S. aureus nasal colonization were incorporated in the
derivation model, which included taxa with conflicting associations in earlier studies (6-10). The
derivation decision tree model incorporated the absolute abundances (log10) of the following nasal
taxa: Anaerococcus, Finegoldia, Peptoniphilus, Dolosigranulum, Corynebacterium, Unclassified
Corynebacteriaceae, Propionibacterium acnes, Propionibacterium granulosum, Simonsiella <
0.80, Staphylococcus epidermidis (including <0.80), and Moraxella.
A model was derived for each outcome of interest and the branches were trimmed down to
include only those with 10 or more individuals in each terminal node (except for
Simonsiella<0.80, which was an early 2nd node). A predicted outcome was assigned to each
terminal node (i.e., as predicting either S. aureus presence or absence or as predicting a S. aureus
absolute abundance category) (Figures 3A and S3A)
Validation test for the predicative thresholds was conducted using 10 additional simulated
populations of 100 randomly-drawn (without replacement) individuals. The validation results
were determined to support the initial model if the predicative thresholds produced results that are
more similar to the predicted outcome than the underlying simulated population (Figures 3B and
S3B-C).
11. Correlation between S. aureus absolute abundance and culture outcome. We calculated S.
aureus absolute abundance among non-CST1 individuals with S. aureus detectable by DNA
sequencing. We divided these individuals into four categories based on ten-fold differences in S.
aureus absolute abundance (<104, 104-<105, 105-<106, 106-107). We plotted the histograms of
each S. aureus absolute abundance category in men and women (Figure 3B), which showed that
women most often fell into to two lowest absolute abundance categories (<104 and 104-<105)
while men were more likely to have the middle two categories (104-<105 and 105-<106). The
correlation between S. aureus nasal culture and S. aureus absolute abundance category was
shown in Figure 3C. The relationship between S. aureus nasal culture (outcome) to other
variables including S. aureus absolute abundance category, sex, history of atopic disease and
psoriasis, and current smoking status was assessed using a multivariate linear regression model,
which showed that sex was not a significant predictor of S. aureus culture outcome (P = 0.79),
after adjusting for S. aureus absolute abundance category (P < 0.001), where the model indicated
that with each ten-fold increase in S. aureus absolute abundance, the probability of having a
positive S. aureus culture increases by 30.4% (F-statistic 7.19 on 68 degrees of freedom, Model P
< 0.001).
SUPPLEMENTARY FIGURE LEGENDS
Figure S1A-C. Correlation of nasal microbiota composition among monozygotic and among
same-sex and opposite-sex dizygotic twin pairs in non-metric multidimensional scaling (nMDS)
ordination plots. In nMDS plots, each data point represents an individual’s microbiota at one time
point. Each twin pair is connected by a solid line, which showed that the nasal microbiota in
monozygotic twin pairs (Fig. S1A) had low CST concordance, as same-sex (Fig. S1B) and
opposite-sex dizygotic twins (Fig. S1C).
Figure S2A-B. Rates of S. aureus nasal colonization by sequencing and by culture and S. aureus
absolute abundance for the seven nasal CSTs. Rate of S. aureus nasal colonization varied across
nasal CSTs as detected based on sequencing and by culture. In general, sequencing detection
revealed higher S. aureus prevalence than culturing, except in CST2, where sequencing had lower
sensitivity, most likely due to insufficient reads in the context of high total bacterial density (Fig.
S2A). As shown by boxplots of S. aureus absolute abundance, CST1 and CST6 had the highest S.
aureus absolute abundance, whereas CST5 had the lowest S. aureus absolute abundance.
Figure S3A-B. Results from decision tree model derivation and validation showing threshold-
dependent relationships between the absolute abundances of nasal commensals and S. aureus.
Absolute abundance of S. aureus can be divided into five categories, ranging from Category 1
(i.e., not detected) to Category 5 (i.e., 106-107 S. aureus 16S rRNA gene absolute abundance).
The S. aureus absolute abundance categories for the derivation group of 100 are as shown (Fig.
S3B), which was used to build a model to predict S. aureus absolute abundance (Fig. S3A). The
model showed that the most informative split was a threshold of 1.2 x 106 Dolosigranulum 16S
rRNA gene copies per swab, which predicted Category 1 (n = 21/25, 84.0%) (Node 1 Left).
Corynebacterium had a similarly negative relationship to S. aureus absolute abundance, where
among individuals who had below-threshold abundance of Dolosigranulum and P. acnes, having
≥ 3.5 x 105 Corynebacterium predicts low S. aureus absolute abundance, i.e., Category 2 (Node 3
Left), whereas having < 3.5 x 105 Corynebacterium predicts high S. aureus absolute abundance,
i.e., Category 5 (Node 3 Right). In contrast, absolute abundance of S. epidermidis and S. aureus
were positively correlated among individuals with low Dolosigranulum and high P. acnes.
Validation testing using 10 randomly drawn groups of 100 supported that threshold-based
relationships between Dolosigranulum, P. acnes, S. epidermidis, Corynebacterium and S. aureus
absolute abundance (Fig. 3C).
SUPPLEMENTARY TABLES
Table S1. The indicator genera for each nasal CST identified based on proportional abundance,
where CI<0.80 denotes bacteria taxa assigned with <80% bootstrap confidence level.
CST Indicator Taxa
Average
Proportional
Abundance
(SD)
Indicator
Value
Unadjusted
p-value
FDR-
adjusted*
p-value
1
Staphylococcus aureus 0.38 (0.13) 0.82 1.00E-04 3.57E-04
Staphylococcus aureus OTU1 0.31 (0.10) 0.80 1.00E-04 3.57E-04
Staphylococcus aureus OTU2 0.06 (0.04) 0.84 1.00E-04 3.57E-04
Staphylococcus lugdunensis CI<0.8 0.01 (0.01) 0.42 3.00E-04 8.82E-04
2
Escherichia unclassified 0.29 (0.44) 0.46 3.00E-04 8.82E-04
Enterobacteriaceae unclassified CI<0.8 0.15 (0.27) 0.37 7.00E-04 1.84E-03
Klebsiella CI<0.8 0.04 (0.08) 0.43 1.80E-03 4.29E-03
Proteus vulgaris CI<0.8 0.03 (0.07) 0.23 2.70E-03 5.87E-03
Proteus vulgaris 0.09 (0.22) 0.19 5.90E-03 1.23E-02
Raoultella CI<0.8 0.04 (0.10) 0.17 2.65E-02 4.88E-02
Erwinia CI<0.8 0.08 (0.16) 0.30 3.55E-02 5.92E-02
Averyella CI<0.8 0.06 (0.16) 0.12 4.24E-02 6.84E-02
3
Staphylococcus epidermidis 0.26 (0.11) 0.50 1.00E-04 3.57E-04
Staphylococcus capitis CI<0.8 0.02 (0.01) 0.43 1.00E-04 3.57E-04
Staphylococcus caprae CI<0.8 0.01 (0.01) 0.41 1.00E-04 3.57E-04
Staphylococcus epidermidis CI<0.8 0.14 (0.06) 0.49 1.00E-04 3.57E-04
Staphylococcus pettenkoferi CI<0.8 0.005 (0.004) 0.41 1.00E-04 3.57E-04
Staphylococcus warneri CI<0.8 0.03 (0.02) 0.44 1.00E-04 3.57E-04
Staphylococcus hominis CI<0.8 0.02 (0.02) 0.44 3.00E-04 8.82E-04
Staphylococcus pasteuri CI<0.8 0.01 (0.01) 0.35 5.00E-04 1.39E-03
Anaerococcus unclassified 0.01 (0.01) 0.25 1.76E-02 3.52E-02
Stenotrophomonas unclassified 0.05 (0.07) 0.24 2.04E-02 3.92E-02
Staphylococcus haemolyticus CI<0.8 0.01 (0.01) 0.28 2.73E-02 4.88E-02
4 Propionibacterium acnes 0.12 (0.15) 0.35 2.70E-03 5.87E-03
Corynebacteriaceae unclassified 0.01 (0.02) 0.25 3.48E-02 5.92E-02
5
Corynebacterium tuberculostearicum 0.01 (0.01) 0.52 1.00E-04 3.57E-04
Corynebacterium unclassified 0.54 (0.12) 0.51 1.00E-04 3.57E-04
Corynebacterium unclassified CI<0.8 0.10 (0.04) 0.48 1.00E-04 3.57E-04
Corynebacterium tuberculostearicum CI<0.8 0.02 (0.01) 0.32 1.30E-03 3.25E-03
6 Moraxella unclassified 0.55 (0.10) 0.81 1.00E-04 3.57E-04
7 Dolosigranulum unclassified 0.41 (0.20) 0.56 1.00E-04 3.57E-04
*FDR-adjusted: adjusted by false-discovery rate
Table S2. Nasal bacterial density (median and interquartile range) of each nasal CST and the
prevalence of each CST by sex.
Nasal bacterial density Sex*
Median Q1-Q3
Female
(n = 102)
Male
(n = 76)
Total
(n = 178)
16S rRNA gene copies per swab Number (%)
CST1 5.33E+06 4.01E+06-9.39E+06 16 (15.7) 6 (7.9) 22
CST2 4.10E+07 2.03E+07-3.88E+08 10 (9.8) 6 (7.9) 16
CST3 2.06E+06 1.49E+06-5.28E+06 25 (24.5) 15 (19.7) 40
CST4 2.22E+06 1.10E+06-1.81E+08 27 (26.4) 24 (31.5) 51
CST5 4.81E+06 2.00E+06-1.46E+07 9 (8.8) 11 (14.5) 20
CST6 1.37E+07 2.46E+06-1.78E+07 4 (3.9) 6 (7.9) 10
CST7 5.15E+06 2.13E+06-1.24E+07 11 (10.8) 8 (10.5) 19
*Comparison of nasal CST distribution in men versus women resulted in χ2 = 7.8, df = 6, p = 0.25
Table S3. Comparison of within-twin nasal microbiota composition correlation between
monozygotic and dizygotic twins based on pairwise ecological distance (Jaccard’s, Bray-Curtis,
and Euclidean distances). A 2.5%-97.5% confidence level (i.e., 95% confidence interval) that
does not intersect zero indicates a significant correlation in nasal microbiota composition in twin
pairs.
MZ versus DZ MZ versus randomly-selected, same
sex non-twin pairs
Female Male Female Male Overall
Jaccard's
Mean -0.037 0.109 -0.064 -0.041 -0.049
2.5% CL -0.15 -0.047 -0.178 -0.131 -0.122
97.5% CL 0.07 0.251 0.047 0.063 0.022
Bray-Curtis
Mean -0.033 0.109 -0.076 -0.061 -0.063
2.5% CL -0.166 -0.082 -0.216 -0.182 -0.154
97.5% CL 0.112 0.279 0.055 0.057 0.026
Euclidean
Mean -130257 94896 -50014 -26934 -32343
2.5% CL -388536 13490 -221823 -169013 -145271
97.5% CL 76979 235086 95268 118081 78943
Table S4. Results from standard biometrical heritability analysis using a polygenic model to
determine the contribution of additive genetic effects (A), genetics effects due to dominance (D),
shared environmental effects (C), and non-shared environmental effects (E) to nasal bacterial
density.
Models Correlation
within MZ
Correlation
within DZ A (%) D (%) C (%) E (%)
log
likelihood
P-
value* AIC
U** 0.42 -0.06
0 0 0 0 -205.7 423.4 (0.12, 0.65) (-0.35, 0.23)
ACE 0.30¤ 0.15 30
0 0
(-,-)
70 -207.2
426.5
(0.05, 0.51) (0.03, 0.27) (6, 54) (46, 94)
ADE 0.38 0.1 0
(-,-)
38 0
61.9 -206.4 424.8
(0.13, 0.59) (0.04, 0.15) (15,62) (38, 85)
AE§ 0.30¤ 0.15 30
0 0 70
-207.2 0.19 424.5 (0.05, 0.51) (0.03, 0.27) (6, 54) (46, 94)
E 0 0 0 1 -209.3 0.04 426.6
*P-value from comparing forcing correlation of MZ to be twice the correlation of DZ in the polygenetic model. The AE model was compared to ADE model and the E model was compared to the AE model.
** U model is a model with equal regression, intercept, and residual variance for twin 1 and twin 2 as well as for MZ and DZ twins §The AE model was selected as the final model.
Table S5. The associations between nasal bacterial density and host factors, including sex,
assessed by Wilcoxon-Ranked Sum and Kolmogorov-Smirnov tests.
Median
Inter-Quartile Range Wilcoxon
p-value
Kolmogorov-
Smirnov
p-value 25th 75th
Overall
4.07E+06 7.08E+06 n/a n/a
Sex
Women (n = 102) 2.97E+06 1.33E+06 9.11E+06
Men (n = 76) 7.94E+06 2.20E+06 4.30E+07
Difference
p < 0.001 p = 0.005
History of atopic diseases
Yes (n = 54) 4.46E+06 1.90E+06 1.50E+07
No (n = 124) 4.39E+06 1.50E+06 1.80E+07
Difference
p = 0.47 p = 0.35
History of psoriasis
Yes (n=15) 6.41E+06 4.80E+06 2.10E+07
No (n=158) 3.69E+06 1.57E+06 1.63E+07
Difference
p = 0.22 p = 0.12
Current smoker
Yes (n=33) 3.23E+06 1.66E+06 9.59E+06
No (n=145) 4.44E+06 1.59E+06 1.96E+07
Difference p = 0.61 p = 0.53
Table S6. Comparison of nasal bacterial density by sex, adjusted for nasal CST in a quasi-Poisson model.
The resultant model explains 19% of total deviance with 170 degrees of freedom.
Men*** Women***
CST1**
(S. aureus) 1.20E+07 4.68E+06
CST2***
(Enterobacteriaceae) 8.72E+07 3.04E+07
CST3 ***
(S. epidermidis) 4.95E+06
2.03E+06
(Reference)
CST4***
(Propionibacterium) 2.26E+07 8.51E+06
CST5
(Corynebacterium) 9.30E+06 3.68E+06
CST6*
(Moraxella) 1.58E+07 6.09E+06
CST7
(Dolosigranulum) 9.87E+06 3.90E+06
*p <0.10 , **p <0.05 , ***p <0.001
Figure S1A-C. Correlation of nasal microbiota composition among monozygotic and among same-sex
and opposite-sex dizygotic twin pairs in non-metric multidimensional scaling (nMDS) ordination plots.
-0.5 0.0 0.5
-1.0
-0.5
0.0
0.5
1.0
nMD
S a
xis
2
Type 1Type 2Type 3Type 4Type 5Type 6Type 7MZSSOS
-0.5 0.0 0.5
-1.0
-0.5
0.0
0.5
1.0
nMDS axis 1
nMD
S a
xis
2
Type 1Type 2Type 3Type 4Type 5Type 6Type 7MZSSOS
B
C
-0.5 0.0 0.5
-1.0
-0.5
0.0
0.5
1.0
nMD
S a
xis
2
Type 1Type 2Type 3Type 4Type 5Type 6Type 7MZSSOS
A
Figure S2A-B. Rates of S. aureus nasal colonization by sequencing and by culture and S. aureus absolute
abundance for the seven nasal CSTs.
CST1(n = 22)
CST2(n = 16)
CST3(n = 40)
CST4(n = 51)
CST5(n = 20)
CST6(n = 10)
CST7(n = 19)
24
68
10
Nasal community state types (CSTs)
S.
au
reu
s 1
6S
rR
NA
gen
e
co
pie
s p
er
sw
ab
(lo
g10
)
1
1
2
1
2
1
2
1
2
1
2
1
2
2
1
21
2
1
2
1
2
1
2
1
2
Culture (%) Bl(+) / Wh(-)
Sequencing (%) Bl(+) / Wh(-)
A
B
Figure S3A-B. Results from decision tree model derivation and validation showing threshold-dependent
relationships between the absolute abundances of nasal commensals and S. aureus.
|
Dolosigranulum
Propionibacterium.acnes
S..epidermidis Corynebacterium spp.
S. aureus (-)
(Category 1)
(n = 21/25)
>=1.2 x 10^6 < 1.2 x 10^6
>=8.1 x 10^4 < 8.1 x 10^4
<1.3 x 10^6 >=1.3 x 10^6 >=3.5 x 10^5 <3.5 x 10^5
Node 2 Right Node 2 Left Node 3 Left
(Category 2)
(n = 4/8)
(Category 5)
(n = 14/28)
(Category 3)
(n = 6/9)
(Category Not-5)
(n = 30)
A category 1: Not detected
category 2 : <10^4
category 3: 10^4-<10^5
category 4: 10^5-<10^6
category 5: 10^6-<10^7
B1
2
3 4
5
1
2345
12
345
1
2
4
51
23
4
Node 3 Right
1
24
5
Node 1 Left
S. aureus absolute abundance
1 2 3 4 5 6 7 8 9 10
02
040
60
80
10
0
1 2 3 4 5 6 7 8 9 10
01
020
30
40
50 category 1
category 2
category 3
category 4
category 5
1 2 3 4 5 6 7 8 9 10
01
020
30
40
50
1 2 3 4 5 6 7 8 9 10
Node 2 Left
01
02
03
04
05
0
1 2 3 4 5 6 7 8 9 10
01
02
03
04
05
0
1 2 3 4 5 6 7 8 9 10
01
02
03
04
05
0
Co
un
t
Trial Trial
Trial Trial Trial
Co
un
t
Co
un
tC
ou
nt
Co
un
tC
ou
nt
category 1
category 2
category 3
category 4
category 5
category 1
category 2
category 3
category 4
category 5
category 1
category 2
category 3
category 4
category 5
category 1
category 2
category 3
category 4
category 5
Node 3 Left Node 3 Right
Trial
Overall Node 1 Left Node 2 Right
C