

Supporting Information

Nasser et al. 10.1073/pnas.1403138111

SI Materials and Methods

Detecting and Correcting Errors in the SF370 and MGAS5005 Genomes. Polymorphisms between the Illumina single-end (SE) sequence reads and the deposited reference genome sequences were identified using variant ascertainment algorithm (VAAL) (v 46233) (1). To allow for manual inspection of the polymorphisms, the Illumina SE resequencing reads were mapped to the respective reference genomes, using Mosaik (v 1.1). The Tablet (v 1.13.07.1) (2) sequence viewer was used to visually inspect ACE/BAM alignment files generated with Mosaik to manually curate the polymorphisms identified with VAAL. The confirmed errors in the reference genome sequences were edited and corrected using the Artemis (v 15.0) (3) annotator.

Identification of Polymorphisms Between Strains SF370 and MGAS5005. Polymorphisms were identified using VAAL by mapping the Illumina SE sequence reads from SF370 to the corrected MGAS5005 genome sequence and vice versa. Polymorphisms were also independently identified by aligning the two corrected genome sequences to each other, using MUMmer (v 3.0) (4). Polymorphisms identified with VAAL and MUMmer were combined. SF370 Illumina SE sequence reads were mapped to the corrected MGAS5005 genome sequence and vice versa, using Mosaik, and ACE/BAM alignment files were generated. Sequence divergence was too extensive in the 2.6-kb slo-to-metB region to permit accurate read mapping; therefore, polymorphisms in the slo-to-metB region were manually identified from sequences aligned using ClustalW (v 2.1) (5). Polymorphisms identified by both VAAL and MUMmer were judged to be correct. The Tablet sequence viewer was used to visually inspect the ACE/BAM alignment files to manually inspect polymorphisms identified by either VAAL or MUMmer, but not both.
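The triage rule described above (calls made by both methods are accepted; calls made by only one method go to manual inspection) can be sketched with set operations. This is a minimal illustration, not the authors' pipeline: the `Call` record, the `triage` function, and all positions and alleles are hypothetical, and neither VAAL nor MUMmer output parsing is shown.

```python
from typing import NamedTuple, Set, Tuple


class Call(NamedTuple):
    """Hypothetical record for one polymorphism call."""
    position: int  # 1-based coordinate on the reference genome
    ref: str       # reference allele
    alt: str       # alternate allele


def triage(vaal_calls: Set[Call], mummer_calls: Set[Call]) -> Tuple[Set[Call], Set[Call]]:
    """Split the combined call set into accepted calls and calls needing review."""
    accepted = vaal_calls & mummer_calls      # identified by both methods
    needs_review = vaal_calls ^ mummer_calls  # identified by exactly one method
    return accepted, needs_review


# Made-up example calls, for illustration only.
vaal = {Call(1204, "A", "G"), Call(5310, "C", "T"), Call(9977, "G", "A")}
mummer = {Call(1204, "A", "G"), Call(9977, "G", "A"), Call(15002, "T", "C")}

accepted, needs_review = triage(vaal, mummer)
print("accepted:", sorted(accepted))
print("manual inspection:", sorted(needs_review))
```

In this sketch two calls count as the same polymorphism only if position and both alleles match exactly; a real comparison would also need to reconcile how each tool represents insertions and deletions.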

Phylogenetic Inference. Genetic relationships were inferred among strains based on concatenated SNPs by the method of neighbor-joining, as implemented in SplitsTree (v 4.13.1) (6). Concatenated SNP multisequence FASTA files were generated using Prephix (v 3.1.1) and Phrecon (v 4.1) (https://github.com/codinghedgehog/). Trees were viewed and modified using Dendroscope (v 3.2.10) (7). A chronogram was generated using the neighbor-joining tree and the historic record temporal metadata for all 3,443 MGAS5005-like strains using Path-O-Gen (v 1.4; tree.bio.ed.ac.uk/software/pathogen). The chronograms were used to assess the clock-likeliness and estimate the substitution rate and time to most recent common ancestor, using Path-O-Gen. Pairwise genetic distances between strains and between groups of strains (i.e., SF370-like vs. MGAS5005-like) were determined using MEGA (v 6.0) (8).
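The text does not describe how Path-O-Gen derives the rate and ancestor date. A common way to make such estimates is a root-to-tip regression: the genetic distance from the root to each tip is regressed on the tip's isolation date, the slope is read as the substitution rate, the x-intercept as the date of the most recent common ancestor, and R² as a measure of clock-like behavior. The sketch below assumes that approach; the function name and all numbers are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against isolation date.

    Returns (rate, tmrca, r_squared), where rate is the slope, tmrca is the
    date at which the fitted distance is zero, and r_squared is the
    coefficient of determination.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx                        # substitutions per site per year
    intercept = mean_y - rate * mean_x
    tmrca = -intercept / rate               # x-intercept of the fitted line
    r_squared = (sxy * sxy) / (sxx * syy)   # 1.0 means perfectly clock-like
    return rate, tmrca, r_squared


# Toy data: perfectly clock-like tips descending from an ancestor in 1980.
dates = [1990.0, 1995.0, 2000.0, 2005.0, 2010.0]
distances = [1.0e-5 * (d - 1980.0) for d in dates]

rate, tmrca, r_squared = root_to_tip_regression(dates, distances)
print(rate, tmrca, r_squared)
```

With real data the points scatter around the line, and a low R² would indicate that a strict clock describes the sample poorly.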

SNP Distribution Assessment. Moving window SNP frequency plots were generated using the R statistical package.
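The window size and step used for these plots are not stated, and the original analysis was done in R. As a rough illustration of the moving-window idea only, the following Python sketch counts SNPs in fixed-width windows that slide along the genome; the function name, window size, step, and SNP positions are all assumptions.

```python
from bisect import bisect_left, bisect_right


def sliding_window_snp_counts(snp_positions, genome_length, window, step):
    """Return (window_start, snp_count) pairs for fixed-width sliding windows.

    Coordinates are 1-based and each window covers [start, start + window - 1].
    """
    positions = sorted(snp_positions)
    counts = []
    for start in range(1, genome_length - window + 2, step):
        end = start + window - 1
        # Number of sorted positions falling inside [start, end].
        n_snps = bisect_right(positions, end) - bisect_left(positions, start)
        counts.append((start, n_snps))
    return counts


# Toy example: a 20-bp "genome" with six SNPs, 10-bp windows, 5-bp step.
toy_counts = sliding_window_snp_counts([2, 3, 11, 12, 13, 19], 20, 10, 5)
print(toy_counts)
```

Dividing each count by the window length in kilobases would turn the counts into SNPs per kb for plotting.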

Assessment of Gene and Mobile Genetic Element Content. Illumina SE sequence reads for all 3,443 MGAS5005-like strains were mapped to the corrected MGAS5005 genome, using Mosaik. Similarly, sequence reads from all 172 SF370-like strains were mapped to the corrected SF370 genome. Common gene content (i.e., the core genome) and variably present gene content (i.e., the noncore accessory or dispensable genome) were identified among 21 complete GAS genomes of 13 emm serotypes, using the Pangenome Ortholog Clustering Tool (PanOCT) (v 1.9) (9). Redundancy in the PanOCT-determined dispensable gene content was assessed using BLAST reciprocal best hit, and redundant content was removed. A known GAS gene content pseudopangenome (i.e., the known GAS pangenome) was constructed by appending onto the end of the MGAS5005 genome all unique, dispensable gene content not represented in the MGAS5005 genome. The dispensable gene content was added in order of serotype, meaning serotype M2 dispensable gene content not present in the sequenced M1 genomes was added before content coming from serotype M3, followed by serotype M4, and so on. Sequence reads from all 3,615 strains were mapped to the MGAS5005-centric GAS known-pangenome sequence using Mosaik. Reads per gene were determined using Cufflinks (v 2.1.1) (10) and normalized for the depth of sequencing by the gene fragments per kilobase per million mapped reads (FPKM) method. FPKM values for the reference MGAS5005 and SF370 complete genomes relative to the known GAS pangenome were used to determine read count ranges that were consistent with gene presence or absence. These ranges were then applied to the 3,613 other strains that did not have completely closed genome sequences to make calls regarding gene presence or absence to determine gene content. The gene content determinations were then used to make mobile genetic element content determinations. Mobile genetic elements for which the majority of the gene content was present and associated with a congruent integrase gene were considered present. Strains having high-quality reads not mapping to the GAS known-pangenome were assembled de novo using EDENA (v 3.130110) (11), and the resulting contigs were compared with the National Center for Biotechnology Information nonredundant database to assess the nature of the unmapping reads.
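The presence and absence calls described above can be sketched as follows. This is a simplified illustration under stated assumptions: `fpkm` uses the plain textbook formula rather than Cufflinks' own estimation, the threshold passed to `gene_present` is a placeholder (the actual read count ranges are not given in the text), and the gene names are hypothetical. The element rule encodes "majority of gene content present plus integrase present" in the simplest possible way.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def gene_present(fpkm_value, presence_threshold):
    """Call a gene present when its FPKM reaches an assumed threshold."""
    return fpkm_value >= presence_threshold


def element_present(gene_calls, integrase_gene):
    """Call a mobile genetic element present.

    gene_calls maps each gene of the element to True (present) or False.
    The element is called present when more than half of its genes are
    present and its integrase gene is among them.
    """
    n_present = sum(1 for is_present in gene_calls.values() if is_present)
    majority = n_present > len(gene_calls) / 2
    return majority and gene_calls.get(integrase_gene, False)


# Toy values: 500 fragments on a 1,000-bp gene, 1 million mapped fragments.
example_fpkm = fpkm(500, 1000, 1_000_000)
print(example_fpkm, gene_present(example_fpkm, presence_threshold=50.0))

# Hypothetical three-gene prophage with integrase gene "int".
print(element_present({"int": True, "geneA": True, "geneB": False}, "int"))
print(element_present({"int": False, "geneA": True, "geneB": True}, "int"))
```

Calibrating the threshold on strains with closed genomes, as the text describes for MGAS5005 and SF370, is what would turn the placeholder value into a usable cutoff.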

1. Nusbaum C, et al. (2009) Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing. Nat Methods 6(1):67–69.

2. Milne I, et al. (2013) Using Tablet for visual exploration of second-generation sequencing data. Brief Bioinform 14(2):193–202.

3. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA (2012) Artemis: An integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28(4):464–469.

4. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes. Genome Biol 5(2):R12.

5. Larkin MA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947–2948.

6. Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23(2):254–267.

7. Huson DH, Scornavacca C (2012) Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks. Syst Biol 61(6):1061–1067.

8. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30(12):2725–2729.

9. Fouts DE, Brinkac L, Beck E, Inman J, Sutton G (2012) PanOCT: Automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res 40(22):e172.

10. Trapnell C, et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578.

11. Hernandez D, François P, Farinelli L, Osterås M, Schrenzel J (2008) De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res 18(5):802–809.

Nasser et al. www.pnas.org/cgi/content/short/1403138111 1 of 4


[Figure S1 image: bar charts of the number of isolates (y axis, "Number of Isolates") by year (x axis, "Year of Isolation", 1969 to 2013), one panel per strain set: Canada, ON (n = 346); Denmark (n = 436); East Germany (n = 155); Finland (n = 509); Finland (n = 597); Iceland (n = 50); Norway (n = 215); Sweden (n = 482); United States, GA (n = 340); United States, MN (n = 474).]

Fig. S1. Temporal and geographic distribution of the strain study set. Graphed for each of the nine geographic regions studied is the number of isolates per year. The 3,615 strains sequenced temporally span from 1969 through 2013, a period of 45 y, with the exception of a few older historic isolates. Invasive infection isolates are shown in red and pharyngitis isolates, from Finland only, are in green. Not graphed are 11 isolates that came from other locations (e.g., Australia, Czech Republic, England, etc.), were isolated before 1969, or are of unknown provenance (see Dataset S1 for a complete strain study set listing).


[Figure S2 image: two depth-of-coverage panels (y axis, "Coverage"), "MGAS5005 vs GAS pseudo-pangenome" and "SF370 vs GAS pseudo-pangenome", plotted along the GAS known-pangenome (x axis, 0 to 2.8 Mb). The x axis is divided into "Core and dispensable genetic content from the 1.84 Mb MGAS5005 genome" and "Dispensable genetic content from 20 additional complete GAS genomes". Labeled prophages: 5005.1, speA2; 5005.2, spd3; 5005.3, sdaD2; 370.1, speC-spd1; 370.2, speH-speI; 370.3, spd3; 370.4, ?.]

Fig. S2. Assessment of gene content by read mapping to the known GAS pangenome. Gene content was assessed for each strain by mapping the sequencing reads to a GAS pseudopangenome constructed of the MGAS5005 genome plus all other unique dispensable gene content identified in the complete genome sequences of 20 additional GAS strains of 13 serotypes using PanOCT. Most of the gene content was derived from mobile genetic elements not found in MGAS5005. The number of reads mapping to each gene was determined using Cufflinks and normalized using the FPKM method. FPKM values were used to make a gene presence or absence determination. Mobile genetic elements for which the preponderance of gene content was present were determined to be present. Graphed for reference M1 strains MGAS5005 and SF370 is the depth of coverage of sequence reads mapped to the GAS known gene content pseudopangenome, using Mosaik. Evident in the upper panel for the MGAS5005 sequence data is a lack of reads mapping to the SF370 prophages (with the exception of 370.3 encoding spd3, which is similar to 5005.2 encoding spd3); similarly, in the lower panel is a lack of SF370 reads mapping to the MGAS5005 prophage.

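[Figure S3 image: polymorphisms per kilobase (y axis, "Polymorphism per Kb", 0 to 50) for each gene of the core genome, in two panels: "Pharyngitis Strains" (n = 594) and "Invasive Strains" (n = 504), each with lines labeled mean and 1 SD. Labeled genes: sic, ropB, covR, covS, hasA, hasB.]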

Fig. S3. Difference in allele frequencies between invasive infection and pharyngitis strain sets. Graphed for the Finland contemporary invasive infection isolates and pharyngitis isolates are the polymorphisms per kilobase (i.e., the frequency of polymorphisms) for every gene of the core genome. Each strain set has a subset of genes with a significantly greater number of polymorphisms than expected for a random distribution. Between the invasive and pharyngitis strain sets there is some commonality in genes with a higher allele frequency, such as for the sic and ropB genes, but also some distinct differences, such as for the covRS and hasAB genes.


Other Supporting Information Files

Dataset S1 (PDF)
