Supporting Information
Nasser et al. 10.1073/pnas.1403138111
SI Materials and Methods
Detecting and Correcting Errors in the SF370 and MGAS5005 Genomes. Polymorphisms between the Illumina single-end (SE) sequence reads and the deposited reference genome sequences were identified using the variant ascertainment algorithm (VAAL) (v 46233) (1). To allow for manual inspection of the polymorphisms, the Illumina SE resequencing reads were mapped to the respective reference genomes, using Mosaik (v 1.1). The Tablet (v 1.13.07.1) (2) sequence viewer was used to visually inspect the ACE/BAM alignment files generated with Mosaik and to manually curate the polymorphisms identified with VAAL. The confirmed errors in the reference genome sequences were corrected using the Artemis (v 15.0) (3) annotator.
Identification of Polymorphisms Between Strains SF370 and MGAS5005. Polymorphisms were identified using VAAL by mapping the Illumina SE sequence reads from SF370 to the corrected MGAS5005 genome sequence and vice versa. Polymorphisms were also independently identified by aligning the two corrected genome sequences to each other, using MUMmer (v 3.0) (4). Polymorphisms identified with VAAL and MUMmer were combined. SF370 Illumina SE sequence reads were mapped to the corrected MGAS5005 genome sequence and vice versa, using Mosaik, and ACE/BAM alignment files were generated. Sequence divergence was too extensive in the 2.6-kb slo-to-metB region to permit accurate read mapping; therefore, polymorphisms in the slo-to-metB region were manually identified from sequences aligned using ClustalW (v 2.1) (5). Polymorphisms identified by both VAAL and MUMmer were judged to be correct. Polymorphisms identified by either VAAL or MUMmer, but not both, were checked manually by visual inspection of the ACE/BAM alignment files in the Tablet sequence viewer.
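The triage rule described above (calls made by both tools are accepted; calls made by only one tool go to manual review) can be illustrated with a minimal Python sketch. The function name, the `(position, reference base, alternate base)` tuple layout, and the example calls are hypothetical and are not taken from the study; the sketch assumes the VAAL and MUMmer outputs have already been parsed into a common coordinate system.

```python
def triage_polymorphisms(vaal_calls, mummer_calls):
    """Split polymorphism calls into accepted calls and calls needing review.

    Each call is a hashable tuple such as (position, ref_base, alt_base).
    This tuple layout is an assumption made for illustration.
    """
    vaal = set(vaal_calls)
    mummer = set(mummer_calls)
    accepted = vaal & mummer       # identified by both tools: judged correct
    needs_review = vaal ^ mummer   # identified by exactly one tool: inspect manually
    return sorted(accepted), sorted(needs_review)


# Hypothetical example calls, not data from the study.
vaal_calls = [(1200, "A", "G"), (5310, "C", "T"), (88012, "G", "A")]
mummer_calls = [(1200, "A", "G"), (88012, "G", "A"), (91500, "T", "C")]

accepted, needs_review = triage_polymorphisms(vaal_calls, mummer_calls)
print("accepted:", accepted)
print("needs manual review:", needs_review)
```

Using set intersection and symmetric difference keeps the two categories disjoint by construction, so every combined call lands in exactly one of them.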
Phylogenetic Inference. Genetic relationships were inferred among strains based on concatenated SNPs by the method of neighbor-joining, as implemented in SplitsTree (v 4.13.1) (6). Concatenated SNP multisequence FASTA files were generated using Prephix (v 3.1.1) and Phrecon (v 4.1) (https://github.com/codinghedgehog/). Trees were viewed and modified using Dendroscope (v 3.2.10) (7). A chronogram was generated using the neighbor-joining tree and the historic record temporal metadata for all 3,443 MGAS5005-like strains using Path-O-Gen (v 1.4; tree.bio.ed.ac.uk/software/pathogen). The chronograms were used to assess the clock-likeliness and estimate the substitution rate and time to most recent common ancestor, using Path-O-Gen. Pairwise genetic distances between strains and between groups of strains (i.e., SF370-like vs. MGAS5005-like) were determined using MEGA (v 6.0) (8).
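A common way to estimate a substitution rate and a time to most recent common ancestor from dated isolates is ordinary least-squares regression of root-to-tip distance on isolation date: the slope approximates the rate and the x-intercept approximates the date of the root. The Python sketch below illustrates that general idea only. It is an assumption about the kind of calculation involved, not a description of Path-O-Gen's internals, and the function name and example values are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against isolation date.

    Returns (slope, intercept, x_intercept). Under a strict molecular clock,
    the slope approximates the substitution rate per site per year and the
    x-intercept approximates the date of the most recent common ancestor.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    x_intercept = -intercept / slope  # date at which the fitted distance is zero
    return slope, intercept, x_intercept


# Hypothetical, perfectly clock-like example (not data from the study):
# distance grows by 1e-6 substitutions per site per year from a root in 1980.
dates = [1990.0, 2000.0, 2010.0]
distances = [1e-5, 2e-5, 3e-5]

rate, intercept, tmrca = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.3e} per site per year, root date = {tmrca:.1f}")
```

In real data the scatter around the fitted line is what indicates how clock-like the tree is; this sketch reports only the fit itself.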
SNP Distribution Assessment. Moving window SNP frequency plots were generated using the R statistical package.
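The source does not give the window size, step, or plotting code, so the following Python sketch only illustrates the general moving-window counting step that precedes such a plot. The function name, the 1-based coordinates, and the window and step values are assumptions chosen for illustration; the study itself used R.

```python
import bisect


def sliding_window_snp_counts(positions, genome_length, window=1000, step=500):
    """Count SNPs in overlapping windows along a genome.

    positions: 1-based SNP coordinates.
    Returns a list of (window_start, snp_count); the final windows are
    truncated at the end of the genome.
    """
    pos = sorted(positions)
    counts = []
    start = 1
    while start <= genome_length:
        end = min(start + window - 1, genome_length)
        lo = bisect.bisect_left(pos, start)   # first SNP at or after window start
        hi = bisect.bisect_right(pos, end)    # one past the last SNP at or before window end
        counts.append((start, hi - lo))
        start += step
    return counts


# Hypothetical SNP positions on a 2,000-bp toy genome.
window_counts = sliding_window_snp_counts([100, 450, 900, 1600], genome_length=2000)
print(window_counts)
```

Dividing each count by the window length in kilobases would convert the counts to a per-kilobase frequency suitable for plotting.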
Assessment of Gene and Mobile Genetic Element Content. Illumina SE sequence reads for all 3,443 MGAS5005-like strains were mapped to the corrected MGAS5005 genome, using Mosaik. Similarly, sequence reads from all 172 SF370-like strains were mapped to the corrected SF370 genome. Common gene content (i.e., the core genome) and variably present gene content (i.e., the noncore accessory or dispensable genome) were identified among 21 complete GAS genomes of 13 emm serotypes, using the Pangenome Ortholog Clustering Tool (PanOCT) (v 1.9) (9). Redundancy in the PanOCT-determined dispensable gene content was assessed using BLAST reciprocal best hit, and redundant content was removed. A known GAS gene content pseudopangenome (i.e., the known GAS pangenome) was constructed by appending onto the end of the MGAS5005 genome all unique, dispensable gene content not represented in the MGAS5005 genome. The dispensable gene content was added in order of serotype, meaning serotype M2 dispensable gene content not present in the sequenced M1 genomes was added before content coming from serotype M3, followed by serotype M4, and so on. Sequence reads from all 3,615 strains were mapped to the MGAS5005-centric GAS known-pangenome sequence using Mosaik. Reads per gene were determined using Cufflinks (v 2.1.1) (10) and normalized for the depth of sequencing by the gene fragments per kilobase per million mapped reads (FPKM) method. FPKM values for the reference MGAS5005 and SF370 complete genomes relative to the known GAS pangenome were used to determine read count ranges that were consistent with gene presence or absence. These ranges were then applied to the 3,613 other strains that did not have completely closed genome sequences to make calls regarding gene presence or absence to determine gene content. The gene content determinations were then used to make mobile genetic element content determinations. Mobile genetic elements for which the majority of the gene content was present and associated with a congruent integrase gene were considered present. Strains having high-quality reads not mapping to the GAS known-pangenome were assembled de novo using EDENA (v 3.130110) (11), and the resulting contigs were compared with the National Center for Biotechnology Information nonredundant database to assess the nature of the unmapping reads.
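The normalization and calling steps above can be sketched in Python as follows. The FPKM formula is the standard definition (fragments per kilobase of gene per million mapped fragments); for single-end data each read is treated as one fragment. The presence threshold, the function names, and the example values are hypothetical: the source states that ranges were derived from the two reference genomes but does not report the cutoffs, and the integrase test is reduced here to a single boolean.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene per million mapped fragments."""
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def call_gene_present(fpkm_value, present_min):
    """Call a gene present when its FPKM reaches a chosen cutoff.

    present_min is a hypothetical threshold; the study derived its ranges
    from the MGAS5005 and SF370 reference genomes and does not list them.
    """
    return fpkm_value >= present_min


def call_element_present(gene_calls, integrase_present):
    """Call a mobile genetic element present when more than half of its
    genes are present and its integrase gene is also present."""
    genes_present = sum(1 for call in gene_calls if call)
    return integrase_present and genes_present > len(gene_calls) / 2


# Hypothetical example: 500 reads on a 1,000-bp gene, 1 million mapped reads.
example_fpkm = fpkm(500, 1000, 1_000_000)
gene_call = call_gene_present(example_fpkm, present_min=100.0)
element_call = call_element_present([True, True, False], integrase_present=True)
print(example_fpkm, gene_call, element_call)
```

Keeping the cutoff as an explicit parameter makes it clear that the call depends on a calibration step, which in the study came from the two closed reference genomes.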
1. Nusbaum C, et al. (2009) Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing. Nat Methods 6(1):67–69.
2. Milne I, et al. (2013) Using Tablet for visual exploration of second-generation sequencing data. Brief Bioinform 14(2):193–202.
3. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA (2012) Artemis: An integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28(4):464–469.
4. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes. Genome Biol 5(2):R12.
5. Larkin MA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947–2948.
6. Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23(2):254–267.
7. Huson DH, Scornavacca C (2012) Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks. Syst Biol 61(6):1061–1067.
8. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30(12):2725–2729.
9. Fouts DE, Brinkac L, Beck E, Inman J, Sutton G (2012) PanOCT: Automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res 40(22):e172.
10. Trapnell C, et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578.
11. Hernandez D, François P, Farinelli L, Osterås M, Schrenzel J (2008) De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res 18(5):802–809.
Nasser et al. www.pnas.org/cgi/content/short/1403138111 1 of 4
[Figure S1 image. Panels: Canada, ON (n = 346); Denmark (n = 436); East Germany (n = 155); Finland (n = 509); Finland (n = 597); Iceland (n = 50); Norway (n = 215); Sweden (n = 482); United States, GA (n = 340); United States, MN (n = 474). x axis: Year of Isolation (1969 to 2013); y axis: Number of Isolates.]
Fig. S1. Temporal and geographic distribution of the strain study set. Graphed for each of the nine geographic regions studied is the number of isolates per year. The 3,615 strains sequenced temporally span from 1969 through 2013, a period of 45 y, with the exception of a few older historic isolates. Invasive infection isolates are shown in red and pharyngitis isolates, from Finland only, are in green. Not graphed are 11 isolates that came from other locations (e.g., Australia, Czech Republic, England, etc.), were isolated before 1969, or are of unknown provenance (see Dataset S1 for a complete strain study set listing).
[Figure S2 image. Panels: MGAS5005 vs GAS pseudo-pangenome (upper); SF370 vs GAS pseudo-pangenome (lower). x axis: GAS Known-Pangenome position (0 to 2.8 Mb); y axis: Coverage. Regions marked: core and dispensable genetic content from the 1.84 Mb MGAS5005 genome; dispensable genetic content from 20 additional complete GAS genomes. Prophages labeled: 5005.1, speA2; 5005.2, spd3; 5005.3, sdaD2; 370.1, speC-spd1; 370.2, speH-speI; 370.3, spd3; 370.4, ?.]
Fig. S2. Assessment of gene content by read mapping to the known GAS pangenome. Gene content was assessed for each strain by mapping the sequencing reads to a GAS pseudopangenome constructed of the MGAS5005 genome plus all other unique dispensable gene content identified in the complete genome sequences of 20 additional GAS strains of 13 serotypes using PanOCT. Most of the gene content was derived from mobile genetic elements not found in MGAS5005. The number of reads mapping to each gene was determined using Cufflinks and normalized using the FPKM method. FPKM values were used to make a gene presence or absence determination. Mobile genetic elements for which the preponderance of gene content was present were determined to be present. Graphed for reference M1 strains MGAS5005 and SF370 is the depth of coverage of sequence reads mapped to the GAS known gene content pseudopangenome, using Mosaik. Evident in the upper panel for the MGAS5005 sequence data is a lack of reads mapping to the SF370 prophages (with the exception of 370.3 encoding spd3, which is similar to 5005.2 encoding spd3); similarly, in the lower panel is a lack of SF370 reads mapping to the MGAS5005 prophage.
[Figure S3 image. Panels: Pharyngitis Strains (n = 594); Invasive Strains (n = 504). y axis: Polymorphism per Kb (0 to 50). Reference lines: mean; 1 SD. Genes labeled: sic, ropB, covR, covS, hasA, hasB.]
Fig. S3. Difference in allele frequencies between invasive infection and pharyngitis strain sets. Graphed for the Finland contemporary invasive infection isolates and pharyngitis isolates are the polymorphisms per kilobase (i.e., the frequency of polymorphisms) for every gene of the core genome. Each strain set has a subset of genes that have a significantly greater number of polymorphisms than expected for a random distribution. Between the invasive and pharyngitis strain sets there is some commonality in genes with a higher allele frequency, such as the sic and ropB genes, but also some distinct differences, such as the covRS and hasAB genes.
Other Supporting Information Files
Dataset S1 (PDF)